Networking Operating System Firmware Engineer
Develops and maintains custom networking operating system firmware for AI supercomputers, integrating Linux kernel, switch ASICs, and control-plane services. Requires deep expertise in SONiC, SAI, routing protocols, and platform bring-up across hardware and software boundaries.
Responsibilities
- Design, develop, and maintain custom NOS images for large-scale AI fabrics, using open source components from SONiC, FRR, and related networking stacks.
- Integrate, build and configure Linux kernel components, device drivers, switch ASIC SDKs, and SAI layers.
- Bring up new switch platforms, including thermal and fan control, power monitoring, transceiver management, watchdogs, OSFP CMIS, LEDs, CPLDs, and board-specific platform logic.
- Extend and customize NOS services for routing, telemetry, control-plane state, and distributed automation.
- Implement and debug route, neighbor, next-hop, and ECMP programming flows from control-plane intent through ASIC hardware state.
- Build software mechanisms that distinguish control-plane acceptance, SAI/SDK acceptance, and explicit hardware programming acknowledgement.
- Work with hardware teams to validate ASIC configurations, link bring-up, SerDes tuning, buffer profiles, and performance baselines.
- Evaluate switch silicon SDK releases, track vendor deliverables, and validate platform requirements with vendors and ASIC partners.
- Debug complex issues spanning kernel drivers, platform monitoring, NOS services, routing agents, orchestration services, hardware signals, ASIC state, and network topology.
- Integrate switches into fleet-wide monitoring, remote diagnostics, telemetry pipelines, and automated lifecycle workflows.
- Develop robust CI/CD pipelines for reproducible NOS builds and controlled rollout across the fleet.
- Support factory bring-up and qualification all the way through mass deployment.
- Collaborate on networking protocols and technologies that improve performance and reliability at AI factory scale.
Requirements
- Proven experience working with SONiC or comparable NOS stacks such as FBOSS, Cumulus Linux, Arista EOS, Junos PFE-level integration, or equivalent platform software.
- Strong software engineering fundamentals: clear interfaces, data models, state-machine design, error handling, testing, observability, performance debugging, and maintainable C/C++, Python, Go or Rust code.
- Experience with Linux kernel internals, network device drivers, platform drivers, hwmon, I2C/SMBus, CPLDs, or board-level platform software.
- Experience integrating or debugging Broadcom, Marvell, NVIDIA, Intel, or comparable switch ASIC SDKs and SAI implementations.
- Understanding of L2/L3 forwarding, ECMP, RoCE, BGP, QoS, PFC, buffer tuning, and telemetry.
- Experience with platform bring-up and board-level debugging across thermal, fan, power, transceiver, LED, watchdog, CPLD, or OSFP CMIS flows.
Nice-to-Haves
- Experience with OpenConfig gNMI interfaces, YANG data models, or structured telemetry.
- Familiarity with CI/CD pipelines, distributed config and state management, reproducible builds, and large-scale automation.
- Ability to independently drive ambiguous NOS or platform feature development from problem definition through implementation, validation, rollout, and debugging across software, hardware, and vendor boundaries.
- Familiarity with Rust or Go.
Principal Infrastructure Engineer
Principal Infrastructure Engineer building and operating secure cloud-native and edge platforms for military collaboration software. Requires 8+ years production infrastructure experience, deep Kubernetes expertise, and ability to obtain SECRET clearance.
Staff Engineer, Distributed Storage and HPC & AI Infrastructure
Design and operate multi-petabyte distributed storage systems for large-scale AI training and inference, integrating parallel filesystems and building Kubernetes-native storage platforms.
Director of Platform & Reliability Engineering
The Director of Platform & Reliability Engineering will lead an engineering organization responsible for secure, scalable, and highly reliable products. This role involves setting the vision for internal platforms, cloud infrastructure, developer enablement, and production operations.
Staff Site Reliability Engineer
Zoox is seeking a Staff Site Reliability Engineer to lead source control, owning the technical strategy and roadmap for their Git-based monorepo. This role involves migrating from GitHub Enterprise to GitHub Cloud, building developer tooling, and partnering with various teams to enhance source control as a strategic asset.