Software Engineer, Data Infrastructure - Research

250k – 380k | San Francisco, CA | Onsite

Summary

Designs and implements dataset infrastructure for OpenAI's large-scale LLM training stack, including standardized APIs for multimodal data, scaling pipelines across GPU fleets, and performance debugging. Requires strong distributed systems experience and collaboration with researchers.

About the role

Responsibilities

  • Design and maintain standardized dataset APIs, including for multimodal (MM) data that cannot fit in memory.
  • Build proactive testing and scale validation pipelines for dataset loading at GPU scale.
  • Collaborate with teammates to integrate datasets seamlessly into training and inference pipelines, ensuring smooth adoption and a great user experience.
  • Document and maintain dataset interfaces so they are discoverable, consistent, and easy for other teams to adopt.
  • Establish safeguards and validation systems to ensure datasets remain reproducible and unchanged once standardized.
  • Debug and resolve performance bottlenecks in distributed dataset loading (e.g., straggler systems slowing global training).
  • Provide visualization and inspection tools to surface errors, bugs, or bottlenecks in datasets.

Requirements

  • Strong engineering fundamentals with experience in distributed systems, data pipelines, or infrastructure.
  • Experience building APIs, modular code, and scalable abstractions, with the understanding that abstractions ultimately serve their users and that UX is an important part of abstraction design.
  • Comfortable debugging bottlenecks across large fleets of machines.
  • Take pride in building infrastructure that “just works,” and find joy in being the guardian of reliability and scale.
  • Collaborative, humble, and excited to own a foundational (if not glamorous) part of the ML stack.

Nice-to-Haves

  • Background knowledge in data math, probability, or distributed data theory.
  • Experience with GPU-scale distributed systems or dataset scaling for real-time data.

Skills

Distributed Systems, Data Pipelines, APIs, GPU, PyTorch, Kubernetes, Python, Rust, Scalable Abstractions, Dataset Loading