Responsibilities
- Design, implement, and maintain scalable CI/CD pipelines
- Develop and manage infrastructure as code (e.g., Terraform, Pulumi)
- Improve system reliability through monitoring, alerting, logging, and failover strategies
- Work with platform and backend teams to identify and resolve performance bottlenecks
- Contribute to deployment workflows, environment automation, and developer tooling
- Ensure infrastructure security and compliance practices are in place
Qualifications
Education: Degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience
Experience: 5+ years in a DevOps, Infrastructure, or SRE role at a fast-paced tech company or startup
Tooling: Expert-level proficiency with CI/CD systems (GitHub Actions, Argo CD, etc.), Docker, and Kubernetes
Infrastructure: Expert-level experience with cloud providers (AWS/GCP), distributed systems architecture and implementation, IaC tools (Terraform, Pulumi), and secrets management (Vault, SSM, etc.)
Observability: Strong understanding of logging, metrics, and monitoring in large-scale distributed systems (e.g., Grafana, Prometheus, ELK, Datadog)
Collaboration: Effective at partnering with backend and ML teams to deliver stable, high-velocity systems
Security: Experience applying cloud and application-level security best practices when building systems
Bonus Points
- Experience supporting AI or ML workloads in production
- Experience with ephemeral environments and preview deployments
- Contributions to internal platform tools or DevOps open-source projects
- Past ownership of high-uptime systems or regulated environments