Platform Engineer, Model Shaping
Build and operate backend services and infrastructure for model customization and evaluation at Together AI. Requires 3+ years building production infrastructure, strong Python/Go skills, and deep experience with Kubernetes, Linux, and cloud platforms.
Responsibilities
- Design and build Together’s systems and infrastructure for model customization, including user-facing features and internal improvements
- Contribute to reliability improvements for the platform, participating in an on-call rotation and improving processes for incident response
- Create and improve internal tooling for deployment, continuous integration, and observability
- Build a job orchestration platform spanning multiple datacenters, supporting a highly heterogeneous hardware landscape
- Partner with teams developing internal services, co-designing those services and integrating them into Together’s systems
Requirements
- 3+ years of experience in building infrastructure or backend components of production services
- Extensive experience designing, operating, and troubleshooting production Linux environments and Kubernetes-based platforms
- Strong software engineering background in Python or Go
- Experience with infrastructure automation tools (Terraform, Ansible), monitoring/observability stacks (Prometheus, Grafana), and CI/CD pipelines (GitHub Actions, ArgoCD)
- Experience administering cloud environments (e.g., AWS, GCP, Azure), preferably in a hybrid bare-metal/cloud setup
- Strong communication skills, including a willingness to document systems and processes and to collaborate with peers of varying technical expertise
- Comfortable operating across the stack, from cluster operations and infrastructure automation to backend service development
Nice-to-Haves
- Developing large-scale production systems with high reliability requirements
- Pipeline orchestration frameworks (e.g., Kubeflow, Argo Workflows, Flyte)
- Managing GPU workloads on HPC clusters, ideally with hands-on experience in operating NVIDIA’s networking stack (e.g., NCCL, Mellanox firmware, GPUDirect RDMA)
- Deployment of services for AI training or inference
- Networking fundamentals, including TCP/IP, DNS, routing, load balancing, TLS, and network debugging tools
- Maintaining or contributing to open-source projects
Compensation & Benefits
- Competitive compensation, startup equity, health insurance, and other benefits
- Flexibility in terms of remote work
- US base salary range: $200,000 - $290,000
Software Engineer, Dev Velocity
Build an internal developer platform, tooling, and automation to accelerate engineering velocity. Focus on CI/CD pipelines, test infrastructure, build systems, and metrics to help engineers ship faster and more reliably.
Senior Software Engineer - Developer Platform
Senior engineer building and scaling internal developer platforms with strong focus on AI tooling, reliability, and developer experience. Requires 4+ years in backend/infrastructure and proven project leadership.
Senior Software Engineer, Platform Engineering
Senior Software Engineer building and evolving an internal developer platform including CI/CD, observability, and tooling to improve developer productivity and reliability. Requires 4+ years of production experience in platform/devtools/infrastructure.