Software Engineer, Distributed Data Systems
Architects and builds data infrastructure at the scale of hundreds of petabytes for web crawling, embedding model training, and real-time search. Requires expertise in lakehouse architectures, distributed processing pipelines, and streaming systems such as Kafka and Flink.
Responsibilities
- Architect and build data infrastructure for crawling billions of pages, training embedding models, and serving real-time search.
- Design systems that scale to hundreds of petabytes.
- Example projects: design a lakehouse architecture for 100+ PB of web crawl data; build streaming pipelines that process billions of documents per day; architect the data layer for embedding training on Ray; scale ClickHouse for petabyte-scale analytical queries.
Requirements
- Deep understanding of lakehouse architectures (Delta Lake, Iceberg, Hudi) and when to use them.
- Experience building and operating large-scale distributed data processing pipelines.
- Hands-on experience with streaming data systems (Kafka, Flink, or similar).
- Familiarity with Ray, Spark, or ClickHouse at production scale.
- Obsessive focus on reliability and building systems that don't page you at 3am.
Nice-to-Haves
- Experience with Lance or other vector-native storage formats.
- Background in GPU-accelerated data processing (RAPIDS, cuDF).
Senior Software Engineer
Builds and scales Chime's data platform, ETL pipelines, and distributed data infrastructure. Requires a Master's degree and 3+ years of experience with AWS/GCP, Spark/Trino, Kubernetes, and CI/CD.
Data Engineer, Machine Learning
Build and maintain production data pipelines that prepare conversational, voice, and multimodal data for ML model training and evaluation. Partner closely with ML engineers to deliver high-quality, versioned datasets and infrastructure.
Senior Software Engineer, Data Enablement Platform
Builds and operates Brex’s data platform and infrastructure, partnering with product and analytics teams to deliver data-backed products. Requires 5+ years in data infrastructure or platform roles and experience with modern data stack tools.