Senior Software Engineer - Observability Visibility

175k – 240k | New York, NY | Hybrid | 5+ YOE | Jun 12

Summary

Senior engineer building observability and resilience standards, tooling, and automation to make reliability the default across Datadog services. Requires 5+ years of experience, Go/Python skills, and experience delivering AI features.

About the role

What You'll Do

Define and evolve observability and resilience baselines, ensuring alignment with measurable risk reduction goals across Datadog services.
Measure service compliance against established standards, assess risk and remediation complexity, and drive sustainable solutions to close identified gaps.
Design and deliver scalable observability and reliability capabilities across the software development lifecycle, using automation and AI-driven solutions where appropriate so that service owners meet established standards by default. Partner closely with platform, SRE, product, and engineering teams to ensure adoption and sustained coverage.
Provide technical leadership and day-to-day coaching to team members, accelerating their growth through design reviews, collaborative problem-solving and operational excellence best practices.

Who You Are

5+ years of experience in software engineering, site reliability engineering, or a related discipline supporting production systems at scale.
Hands-on experience with observability and resilience practices, including expertise in identifying, analyzing, and mitigating service and system failure modes.
Strong programming skills in Go and/or Python, with the ability to design and build reliable, maintainable systems.
Comfortable navigating complex technical challenges and proposing efficient, scalable, and easy-to-adopt solutions.
Experience delivering AI-enabled software features end-to-end, including design, evaluation, deployment, and monitoring, and the ability to articulate when AI is the appropriate solution and when it is not.
Strong communication, collaboration, and mentorship skills with experience influencing technical direction across multiple engineering teams.

Skills

Go, Python, Observability, Site Reliability Engineering, AI/ML Integration, Automation, System Design, Mentorship
