Software Engineer, Productivity - Model Performance
Builds and improves developer tools, CI/CD pipelines, and testing workflows to boost productivity for OpenAI's model performance engineering teams. Requires strong Python skills, experience with developer infrastructure, and the ability to work in ambiguous environments.
Responsibilities
- Improve development workflows for engineers working on model performance infrastructure
- Design and improve CI/CD, release, validation, and testing pipelines
- Build and maintain tools that improve reliability, iteration speed, and engineering confidence
- Partner closely with engineers to identify friction in testing, debugging, deployment, and development workflows
- Contribute to infrastructure efforts that support performance-critical training and inference systems
- Help improve developer experience across Python-heavy codebases and performance-oriented infrastructure
- Work in a high-context, ambiguous environment where ownership and good judgment matter
Requirements
- Motivated by enabling other engineers and helping them do their best work
- Strong experience with CI/CD, developer infrastructure, testing systems, tooling, or build/release workflows
- Highly collaborative, empathetic, and comfortable partnering deeply with technical teams
- Strong in Python, with an interest in building reliable, scalable developer tools and infrastructure
- Experience improving large-scale engineering workflows, especially around CI reliability, test infrastructure, and debugging velocity
- Self-directed and comfortable operating with ambiguity
- Excited to learn the model performance domain
Nice-to-haves
- Experience in the PyTorch ecosystem
- Experience with C++ or Rust
Lead Site Reliability Engineer
Lead SRE driving reliability strategy, infrastructure architecture, observability, and incident response for a B2B fintech platform on AWS and Kubernetes. Requires 7+ years building production-grade distributed systems.
Staff Network Engineer, Operations
Staff-level network operations engineer responsible for production reliability, incident response, and operational excellence across Crusoe's global edge, backbone, data center, and GPU cluster networks supporting AI workloads.
Senior Software Engineer, Platform
Leads the architecture and implementation of a multi-cloud Kubernetes platform across AWS, Azure, and GCP. Owns infrastructure provisioning, access management, networking, and lifecycle systems while mentoring engineers and defining org-wide standards.
Senior Software Engineer - Internal Observability
Senior engineer building AI-powered observability systems and large-scale telemetry pipelines for Snowflake's multi-cloud data platform. Requires 7+ years focused on distributed systems and cloud services.