Site Reliability Engineer - US Government
Washington, DC | Hybrid | 4+ YOE
Summary
The Site Reliability Engineer builds, operates, and maintains scalable infrastructure for air-gapped production environments, focusing on Linux servers, cloud and on-prem systems, automation, and troubleshooting. The role requires 4+ years of Linux administration experience, an active security clearance, and proficiency in programming or scripting.
About the role
Core Responsibilities
- Maintain availability of the cloud & physical Linux servers that power the Palantir platform in air-gapped production environments.
- Design, deploy, and operate infrastructure to support customer & product requirements via modern orchestration & monitoring platforms.
- Collaborate closely with product teams on requirements & SLOs for deploying software into air-gapped environments.
- Identify, troubleshoot, and resolve network & systems issues.
- Write scripts to automate routine operational tasks.
- Provide technical troubleshooting support for production issues, ensuring timely resolution and minimal impact on operations. Participate in a support on-call schedule.
What We Value
- Confidence in troubleshooting complex systems issues independently using stack traces and observability & systems tools
- Comfort managing large-scale production systems, including configuration management, load balancing, monitoring & alerting infrastructure, and container orchestration
- Demonstrated ability to continuously learn and work independently, making decisions with minimal supervision while working in secure facilities
- Experience with containers (Docker/Podman) and orchestration (OpenShift/Kubernetes) at scale is a plus
- Preferred certifications: DoD 8570 IAT Level II or greater (CISSP, Security+), Unix/Linux Computing Environment (e.g., Linux+, RHCE)
What We Require
- Active security clearance
- 4+ years of experience with Linux system administration (RHEL or equivalent preferred)
- Experience with cloud hosting platforms such as AWS, Azure, or GCP, and/or with hardware-based environments
- Familiarity with monitoring systems using tools such as Prometheus, and with writing health checks
- Proficiency with at least one programming or scripting language, such as Java, Go, Python, JavaScript, or Bash
- Strong engineering background, preferably in a field such as Computer Science, Mathematics, Software Engineering, Physics, or Data Science
Skills
Linux, RHEL, Kubernetes, OpenShift, Docker, Podman, AWS, Azure, GCP, Prometheus, Python, Go, Java, Bash, JavaScript