Senior Software Platform Engineer

Designs and maintains cloud-native AI/ML infrastructure, including MLOps pipelines on AWS and Databricks. Builds scalable data pipelines, integrates LLMs with RAG, and ensures production reliability. Requires 7+ years of experience with TypeScript/Python, IaC, and ML workflows.

United States · DevOps / SRE · Remote · 7+ YOE

About the role

What You Will Do

  • Design, implement, and maintain a cloud-native platform that supports AI and data workloads, with a focus on platforms such as Databricks and AWS Bedrock.
  • Build and manage scalable data pipelines to ingest, transform, and serve data for ML and analytics.
  • Develop infrastructure-as-code using tools such as CloudFormation and AWS CDK to ensure repeatable and secure deployments.
  • Collaborate with AI engineers, data engineers, and platform teams to improve the performance, reliability, and cost-efficiency of AI models in production.
  • Drive best practices for observability, including monitoring, alerting, and logging for AI platforms.
  • Contribute to the design and evolution of our AI platform to support new ML frameworks, workflows, and data types.
  • Stay current with new tools and technologies to recommend improvements to architecture and operations.
  • Integrate AI models and large language models (LLMs) into production systems, using architectures such as retrieval-augmented generation (RAG).

Requirements

  • 7+ years of professional experience in software engineering and infrastructure engineering.
  • Extensive experience building and maintaining AI/ML infrastructure in production, including model deployment and lifecycle management.
  • Strong knowledge of AWS and infrastructure-as-code frameworks, ideally with CDK.
  • Expert-level coding skills in TypeScript and Python, with experience building robust APIs and backend services.
  • Production-level experience with Databricks MLflow, including model registration, versioning, asset bundles, and model serving workflows.
  • Expert-level understanding of containerization (Docker) and hands-on experience with CI/CD pipelines; experience with orchestration tools (e.g., ECS) is a plus.
  • Proven ability to design reliable, secure, and scalable infrastructure for both real-time and batch ML workloads.
  • Ability to articulate ideas clearly, present findings persuasively, and build rapport with clients and team members.
  • Strong collaboration skills and the ability to partner effectively with cross-functional teams.

Nice to Have

  • Familiarity with emerging LLM frameworks such as DSPy for advanced prompt orchestration and programmatic LLM pipelines.
  • Understanding of LLM cost monitoring, latency optimization, and usage analytics in production environments.
  • Knowledge of vector databases and embedding stores (e.g., OpenSearch) to support semantic search and RAG.

Benefits

  • 100% employer-paid benefits for all eligible employees and immediate family members
  • Unlimited paid time off (PTO)
  • 401(k)
  • Flexible working arrangements (remote work)
  • Company-paid life insurance and long-term/short-term disability coverage (LTD/STD)
  • A culture of continuous improvement where you can grow your career and get coaching

Skills

AWS, Databricks, MLflow, AWS CDK, CloudFormation, TypeScript, Python, Docker, CI/CD, Kubernetes, MLOps, LLMs, RAG, DSPy, OpenSearch
