Software Engineer, Cloud Infrastructure
Build and operate the AWS cloud and LLM infrastructure that powers retrieval-augmented generation, vector search, and ML pipelines for aviation AI systems. Requires deep AWS expertise, experience building Python data pipelines, and production LLM experience.
Key Responsibilities
Cloud Infrastructure Setup and Maintenance
- Design, provision, and maintain AWS infrastructure using IaC tools such as AWS CDK or Terraform
- Build CI/CD and testing for apps, infra, and ML pipelines using GitHub Actions, CodeBuild, and CodePipeline
- Operate secure networking with VPCs, PrivateLink, and VPC endpoints
- Manage IAM, KMS, Secrets Manager, and audit logging
LLM Platform and Runtime
- Stand up and operate model endpoints using Amazon Bedrock and/or SageMaker; evaluate when to use ECS/EKS, Lambda, or Batch for inference jobs
- Build and maintain application services that call LLMs through clean APIs, with streaming, batching, and backoff strategies
- Implement prompt and tool execution flows with LangChain or similar, including agent tools and function calling
RAG Data Systems and Vector Search
- Design chunking and embedding pipelines for documents, time series, and multimedia; orchestrate with Step Functions or Airflow
- Operate vector search using OpenSearch Serverless, Aurora PostgreSQL with pgvector, or Pinecone
- Build and maintain knowledge bases and data syncs from S3, Aurora, DynamoDB, and external sources
Evaluation, Observability, and Cost Governance
- Create offline and online eval harnesses for prompts, retrievers, and chains
- Instrument model and app telemetry with CloudWatch and OpenTelemetry
- Build token usage and cost dashboards with budgets and alerts
- Add guardrails, rate limits, fallbacks, and provider routing for resilience
Safety, Privacy, and Compliance
- Implement PII detection and redaction, access controls, content filters, and human-in-the-loop review
- Use Bedrock Guardrails or policy services to enforce safety standards
- Maintain audit trails for regulated environments
Data Pipeline Construction
- Build ingestion and processing pipelines for structured, unstructured, and multimedia data
- Ensure integrity, lineage, and cataloging with Glue and Lake Formation
- Optimize bulk data movement and storage in S3, Glacier, and tiered storage; use Athena for ad-hoc analysis
IoT Deployment Management
- Manage infrastructure that deploys to and communicates with edge devices
- Support secure messaging, identity, and over-the-air updates
Analytics and Application Support
- Partner with product and application teams to integrate retrieval services, embeddings, and LLM chains into user-facing features
- Provide expert troubleshooting for cloud and ML services with an emphasis on uptime and performance
Performance Optimization
- Tune retrieval quality, context window use, and caching, using tools such as Redis and Bedrock Knowledge Bases
- Optimize inference with model selection, quantization, GPU/CPU instance choices, and autoscaling strategies
What Will Make You Successful
- End-to-End Ownership: Drives work from design through production, including on-call and continuous improvement
- LLM Systems Experience: Shipped or operated LLM-powered applications in production; familiar with RAG design, prompt versioning, and chain orchestration using LangChain or similar
- AWS Depth: Strong with core AWS services such as VPC, IAM, KMS, CloudWatch, S3, ECS/EKS, Lambda, Step Functions, Bedrock, and SageMaker
- Data Engineering Skills: Comfortable building ingestion and transformation pipelines in Python; familiar with Glue, Athena, and event-driven patterns using EventBridge and SQS
- Security Mindset: Applies least privilege, secrets management, network isolation, and compliance practices appropriate to sensitive data
- Evaluation and Metrics: Uses quantitative evals, A/B testing, and live metrics to guide improvements
- Clear Communication: Explains tradeoffs and aligns partners across product, security, and application engineering
Bonus Points
- 4+ years working with serverless or container platforms on AWS
- Experience with vector databases, OpenSearch, or pgvector at scale
- Hands-on with Bedrock Guardrails, Knowledge Bases, or custom policy engines
- Familiarity with GPU workloads, Triton Inference Server, or TensorRT-LLM
- Experience with big data tools for large-scale processing and search
- Background in aviation data or other safety-critical domains
- DevOps or DevSecOps experience automating CI/CD for ML and app services
Perks & Benefits (Full-Time Employees)
- Healthcare: 100% of employee medical premiums covered; 25% for dependents
- Time Off: 3 weeks PTO plus 13+ paid company holidays
- Stipends: Monthly phone and wellness benefits
- 401(k): Offered (no employer match at present; the company is committed to enhancing this benefit)