Staff Software Engineer, Scaling AI Systems

228k – 290k | San Francisco, CA | DevOps / SRE | Hybrid | 10+ YOE | Jun 9

Summary

Staff-level engineer improving performance, stability, and scalability of AI healthcare systems. Focuses on distributed systems, Kubernetes, observability, and cloud-native tooling to support hyperscale growth.

About the role

What You'll Do

Leverage load testing, chaos engineering, and other test practices to identify performance and latency bottlenecks across all systems, and make changes to application code to resolve them.
Drive code-level changes that rehome applications onto new infrastructure (runtimes, event-driven infrastructure, databases, and more) to dramatically improve scalability and enable multi-tenant deployments.
Identify and implement software configuration changes and performance tuning parameters that will dramatically improve performance and scalability.
Build developer tools and software modules that help engineers across the entire engineering organization build code faster and more effectively.
Work with the Platform team to develop, and application teams to adopt, emerging elements of our internal developer platform, such as service templates and self-serve infrastructure.
Work with application teams to establish and adopt SLOs and error budgets, and define better application health metrics that enable automated canary releases, improved health monitoring, and better engineering practices.
Uplevel our ability to respond to incidents by improving observability, runbooks, and incident response muscle across the organization.
Evangelize and document the solutions being built, train the engineering team on them, and uplevel the team on cloud-native design strategies and tools.
Be a public evangelist for Abridge in the global platform engineering community, including conferences, open source, and research, as we pioneer new AI-first, cloud-native-first, security-first implementations at scale.

Who You Are

10+ years of software engineering experience focused on distributed systems or tooling, with an interest in engineering enablement and software scaling.
Experience as a back-end engineer focused on system performance and scalability.
Experience reducing latency in software by multiples through leveraging observability and profiling tools.
Experience building on Kubernetes and scaling compute services on Kubernetes; experience with related cloud-native technologies such as ArgoCD, Argo Rollouts, and Istio.
Comfortable implementing and securing services in Google Cloud Platform with Infrastructure as Code, including GCP projects, VPC networks, Google Kubernetes Engine, and IAM roles, groups, and policies.
Experience building software with backend languages (e.g., Python, Go, Node.js, or Rust).
Experience monitoring distributed systems with Prometheus, OpenTelemetry Collector, and Grafana (or something similar), including metrics collection, visualization, alerting, and using observability data to drive performance optimizations.
Passion for engineering enablement and solving software and distributed systems scaling challenges under pressure.
Must be willing to travel up to 10%.

Skills

Kubernetes, Google Cloud Platform, Python, Go, Node.js, Rust, Prometheus, OpenTelemetry, Grafana, ArgoCD, Istio, Infrastructure as Code

