Senior Software Engineer - Observability Visibility

175k – 240k · New York, NY · Hybrid · 5+ YOE
Summary

Senior engineer building observability and resilience standards, tooling, and automation to make reliability the default across Datadog services. Requires 5+ years of experience, Go and/or Python skills, and experience delivering AI-enabled features.

About the role

What You'll Do

  • Define and evolve observability and resilience baselines, ensuring alignment with measurable risk reduction goals across Datadog services.
  • Measure service compliance against established standards, assess risk and remediation complexity, and drive sustainable solutions to close identified gaps.
  • Design and deliver scalable observability and reliability capabilities across the software development lifecycle, using automation and AI-driven solutions where appropriate so that service owners meet established standards by default. Partner closely with platform, SRE, product, and engineering teams to ensure adoption and sustained coverage.
  • Provide technical leadership and day-to-day coaching to team members, accelerating their growth through design reviews, collaborative problem-solving and operational excellence best practices.

Who You Are

  • 5+ years of experience in software engineering, site reliability engineering, or a related discipline supporting production systems at scale.
  • Hands-on experience with observability and resilience practices, including expertise in identifying, analyzing, and mitigating service and system failure modes.
  • Strong programming skills in Go and/or Python, with the ability to design and build reliable, maintainable systems.
  • Comfortable navigating complex technical challenges and proposing efficient, scalable, and easy-to-adopt solutions.
  • Experience delivering AI-enabled software features end-to-end, including design, evaluation, deployment, and monitoring, with the ability to articulate when AI is the appropriate solution and when it is not.
  • Strong communication, collaboration, and mentorship skills with experience influencing technical direction across multiple engineering teams.
Skills
Go, Python, Observability, Site Reliability Engineering, AI/ML Integration, Automation, System Design, Mentorship