Networking Operating System Firmware Engineer

266k – 445k | San Francisco, CA | Hybrid | May 6

Summary

Develops and maintains custom networking operating system firmware for AI supercomputers, integrating the Linux kernel, switch ASICs, and control-plane services. Requires deep expertise in SONiC, SAI, routing protocols, and platform bring-up across hardware and software boundaries.

About the role

Responsibilities

Design, develop, and maintain custom NOS images for large-scale AI fabrics, using open source components from SONiC, FRR, and related networking stacks.
Integrate, build, and configure Linux kernel components, device drivers, switch ASIC SDKs, and SAI layers.
Bring up new switch platforms, including thermal and fan control, power monitoring, transceiver management, watchdogs, OSFP CMIS, LEDs, CPLDs, and board-specific platform logic.
Extend and customize NOS services for routing, telemetry, control-plane state, and distributed automation.
Implement and debug route, neighbor, next-hop, and ECMP programming flows from control-plane intent through ASIC hardware state.
Build software mechanisms that distinguish control-plane acceptance, SAI/SDK acceptance, and explicit hardware programming acknowledgement.
Work with hardware teams to validate ASIC configurations, link bring-up, SerDes tuning, buffer profiles, and performance baselines.
Evaluate switch silicon SDK releases, track vendor deliverables, and validate platform requirements with vendors and ASIC partners.
Debug complex issues spanning kernel drivers, platform monitoring, NOS services, routing agents, orchestration services, hardware signals, ASIC state, and network topology.
Integrate switches into fleet-wide monitoring, remote diagnostics, telemetry pipelines, and automated lifecycle workflows.
Develop robust CI/CD pipelines for reproducible NOS builds and controlled rollout across the fleet.
Support factory bring-up and qualification all the way through mass deployment.
Collaborate on networking protocols and technologies that improve performance and reliability at AI factory scale.

Requirements

Proven experience working with SONiC or comparable NOS stacks such as FBOSS, Cumulus Linux, Arista EOS, Junos PFE-level integration, or equivalent platform software.
Strong software engineering fundamentals: clear interfaces, data models, state-machine design, error handling, testing, observability, performance debugging, and maintainable C/C++, Python, Go, or Rust code.
Experience with Linux kernel internals, network device drivers, platform drivers, hwmon, I2C/SMBus, CPLDs, or board-level platform software.
Experience integrating or debugging Broadcom, Marvell, NVIDIA, Intel, or comparable switch ASIC SDKs and SAI implementations.
Understanding of L2/L3 forwarding, ECMP, RoCE, BGP, QoS, PFC, buffer tuning, and telemetry.
Experience with platform bring-up and board-level debugging across thermal, fan, power, transceiver, LED, watchdog, CPLD, or OSFP CMIS flows.

Nice-to-Haves

Experience with OpenConfig gNMI interfaces, YANG data models, or structured telemetry.
Familiarity with CI/CD pipelines, distributed config and state management, reproducible builds, and large-scale automation.
Ability to independently drive ambiguous NOS or platform feature development from problem definition through implementation, validation, rollout, and debugging across software, hardware, and vendor boundaries.
Familiarity with Rust or Go.

Skills

SONiC, SAI, Linux kernel, ASIC SDKs, FRR, C/C++, Python, Go, Rust, BGP, ECMP, RoCE, OpenConfig, YANG, CI/CD
