Product Reliability Engineer - Defense
New York, NY (Hybrid)
Summary
Owns end-to-end service reliability at Palantir, troubleshooting outages, improving observability, and enhancing codebases for stability. Requires engineering background, Java experience, and US security clearance eligibility.
About the role
Core Responsibilities
- Continuously invest in documentation, metrics, monitors and other troubleshooting tools
- Participate in on-call rotations during business hours and occasional weekends
- Diagnose, resolve, and prevent issues encountered in the field. Deliver end-to-end improvements to core products based on these issues
- Improve observability by refactoring codepaths and introducing telemetry
- Identify and implement data-driven opportunities for improved service resilience
- Develop strategic opinions on stability investments and inform the vision for long-term product stability
What We Require
- Engineering background in Computer Science, Mathematics, Software Engineering, Physics or similar field
- Ability to work with a high degree of ownership and a strong sense of urgency in a dynamic environment
- Experience producing code in backend languages such as Java, as part of a past role or personal projects
- Familiarity with storage and data processing systems and cloud infrastructure
- Strong written and verbal communication and ability to iterate quickly with teammates and incorporate feedback
- Eligibility and willingness to obtain a US security clearance
What We Value
- Comfort with and curiosity about large-scale production systems and technologies (e.g., load balancing, monitoring, distributed systems, configuration management)
- Confidence in troubleshooting complex issues independently using observability tools and stack traces
- Familiarity with health checks and monitoring tools such as Prometheus
- Experience coding with Java, Go and/or web technologies (e.g., HTML, CSS, JavaScript, Python/Ruby, Django/Flask/Ruby on Rails)
- Track record of identifying bugs in codebases and contributing fixes that lead to long-term service stability
- Demonstrated ability to make data-driven decisions and engage with stakeholders on strategy
Skills
Java, Go, Prometheus, Distributed Systems, Observability, Cloud Infrastructure, Load Balancing, Monitoring, Storage Systems, Data Processing