Product Reliability Engineer - Defense

Owns end-to-end service reliability at Palantir: troubleshooting outages, improving observability, and enhancing codebases for stability. Requires an engineering background, experience coding in a backend language such as Java, and eligibility for a US security clearance.

Washington, DC | DevOps / SRE | Onsite

About the role

Core Responsibilities

  • Continuously invest in documentation, metrics, monitors and other troubleshooting tools
  • Participate in on-call rotations during business hours and occasional weekends
  • Diagnose, resolve, and prevent issues encountered in the field. Deliver end-to-end improvements to core products based on these issues
  • Improve observability by refactoring codepaths and introducing telemetry
  • Identify and implement data-driven opportunities for improved service resilience
  • Develop strategic opinions on stability investments and inform the vision for long-term product stability

What We Require

  • Engineering background in Computer Science, Mathematics, Software Engineering, Physics or similar field
  • Ability to work with a high degree of ownership and a strong sense of urgency in a dynamic environment
  • Experience producing code in backend languages such as Java, as part of a past role or personal projects
  • Familiarity with storage and data processing systems and cloud infrastructure
  • Strong written and verbal communication and ability to iterate quickly with teammates and incorporate feedback
  • Eligibility and willingness to obtain a US security clearance

What We Value

  • Comfort with and curiosity about large-scale production systems and technologies (e.g., load balancing, monitoring, distributed systems, configuration management)
  • Confidence in troubleshooting complex issues independently using observability tools and stack traces
  • Familiarity with monitoring tools such as Prometheus and health checks
  • Experience coding with Java, Go, and/or web technologies (e.g., HTML, CSS, JavaScript, Python/Ruby, Django/Flask/Ruby on Rails)
  • Track record of identifying bugs in codebases and contributing fixes that lead to long-term service stability
  • Demonstrated ability to make data-driven decisions and engage with stakeholders on strategy

Skills

Java, Go, Prometheus, Distributed Systems, Observability, Cloud Infrastructure, Load Balancing, Monitoring, Storage Systems, Data Processing
