Software Engineer, Distributed Data Systems

180k – 350k | San Francisco, CA | Data Engineering | Onsite | Dec 19

Summary

Architects and builds massive-scale data infrastructure for web crawling, embedding model training, and real-time search, at a scale of hundreds of petabytes. Requires expertise in lakehouse architectures, distributed processing pipelines, and streaming systems such as Kafka and Flink.

About the role

Responsibilities

Architect and build data infrastructure for crawling billions of pages, training embedding models, and serving real-time search.
Design systems that scale to hundreds of petabytes.
Example projects: design a lakehouse architecture for 100+ PB of web crawl data; build streaming pipelines that process billions of documents per day; architect the data layer for embedding training on Ray; scale ClickHouse for petabyte-scale analytical queries.

Requirements

Deep understanding of lakehouse architectures (Delta Lake, Iceberg, Hudi) and when to use them.
Experience building and operating large-scale distributed data processing pipelines.
Hands-on experience with streaming data systems (Kafka, Flink, or similar).
Familiarity with Ray, Spark, or ClickHouse at production scale.
Obsessive focus on reliability and building systems that don't page you at 3am.

Nice-to-Haves

Experience with Lance or other vector-native storage formats.
Background in GPU-accelerated data processing (RAPIDS, cuDF).

Skills

Delta Lake, Iceberg, Hudi, Kafka, Flink, Ray, Spark, ClickHouse, Lance, RAPIDS, cuDF

Similar roles at this salary range

Rula

Jun 24

Sr. Data Engineer

Build and maintain scalable data pipelines and platforms that enable AI applications to securely access trusted data. Partner with analytics, marketing, and product teams to deliver production-grade data systems.

164k – 193k | Los Angeles, CA | Data Engineering | Remote | 4+ YOE | SQL, AWS

Coinbase

Jun 24

Analytics Engineer

Build and maintain production data models and pipelines for Coinbase's Compliance Data Mart, ensuring regulatory-grade data quality and supporting audits and exams.

152k – 179k | United States | Data Engineering | Remote | 2+ YOE | SQL, dbt

Chime

Jun 23

Senior Software Engineer

Senior Software Engineer building and scaling Chime's data platform, ETL pipelines, and distributed data infrastructure. Requires a Master's degree and 3+ years of experience with AWS/GCP, Spark/Trino, Kubernetes, and CI/CD.

210k – 230k | San Francisco, CA | Data Engineering | Hybrid | 3+ YOE | AWS, ETL

Sesame

Jun 23

Data Engineer, Machine Learning

Build and maintain production data pipelines that prepare conversational, voice, and multimodal data for ML model training and evaluation. Partner closely with ML engineers to deliver high-quality, versioned datasets and infrastructure.

170k – 240k | San Francisco, CA | Data Engineering | On-site | 5+ YOE | SQL, ETL

Brex

Jun 23

Senior Software Engineer, Data Enablement Platform

Senior engineer building and operating Brex’s data platform and infrastructure, partnering with product and analytics teams to deliver data-backed products. Requires 5+ years in data infra/platform roles and experience with modern data stack tools.

192k – 240k | Seattle, WA | Data Engineering | Hybrid | 5+ YOE | dbt, CDC
