
Product Reliability Engineer - Defense

New York, NY (Hybrid)
Summary

Owns end-to-end service reliability at Palantir, troubleshooting outages, improving observability, and enhancing codebases for stability. Requires engineering background, Java experience, and US security clearance eligibility.

About the role

Core Responsibilities

  • Continuously invest in documentation, metrics, monitors and other troubleshooting tools
  • Participate in on-call rotations during business hours and occasional weekends
  • Diagnose, resolve, and prevent issues encountered in the field. Deliver end-to-end improvements to core products based on these issues
  • Improve observability by refactoring codepaths and introducing telemetry
  • Identify and implement data-driven opportunities for improved service resilience
  • Develop strategic opinions on stability investments and inform the vision for long-term product stability

What We Require

  • Engineering background in Computer Science, Mathematics, Software Engineering, Physics or similar field
  • Ability to work with a high degree of ownership and a strong sense of urgency in a dynamic environment
  • Experience producing code in backend languages such as Java, as part of a past role or personal projects
  • Familiarity with storage and data processing systems and cloud infrastructure
  • Strong written and verbal communication and ability to iterate quickly with teammates and incorporate feedback
  • Eligibility and willingness to obtain a US security clearance

What We Value

  • Comfort with and curiosity about large-scale production systems and technologies (e.g., load balancing, monitoring, distributed systems, configuration management)
  • Confidence in troubleshooting complex issues independently using observability tools and stack traces
  • Familiarity with health checks and monitoring tools such as Prometheus
  • Experience coding with Java, Go and/or web technologies (e.g., HTML, CSS, JavaScript, Python/Ruby, Django/Flask/Ruby on Rails)
  • Track record of identifying bugs in codebases and contributing fixes that lead to long-term service stability
  • Demonstrated ability to make data-driven decisions and engage with stakeholders on strategy
Skills
Java, Go, Prometheus, Distributed Systems, Observability, Cloud Infrastructure, Load Balancing, Monitoring, Storage Systems, Data Processing