# Software Engineer, Cloud Infrastructure
**Company:** [Beacon AI](https://hotfix.jobs/companies/beacon-ai)
**Location:** San Carlos, CA
**Salary:** $135K-$260K
**Experience:** 4+ years
**Skills:** AWS, Terraform, AWS CDK, GitHub Actions, AWS CodeBuild, AWS CodePipeline, VPC, IAM, KMS, Secrets Manager, AWS Bedrock, SageMaker, LangChain, OpenSearch
**Posted:** 2026-06-26
> Build and operate AWS cloud and LLM infrastructure powering RAG, inference, and data pipelines for an aviation AI platform. Requires strong AWS depth, Python data pipelines, and production LLM experience.
## Job Description
## Key Responsibilities

### Cloud Infrastructure Setup and Maintenance
- Design, provision, and maintain AWS infrastructure using IaC tools such as AWS CDK or Terraform.
- Build CI/CD and testing for apps, infra, and ML pipelines using GitHub Actions, CodeBuild, and CodePipeline.
- Operate secure networking with VPCs, PrivateLink, and VPC endpoints.
- Manage IAM, KMS, Secrets Manager, and audit logging.

### LLM Platform and Runtime
- Stand up and operate model endpoints using AWS Bedrock and/or SageMaker; evaluate when to use ECS/EKS, Lambda, or Batch for inference jobs.
- Build and maintain application services that call LLMs through clean APIs, with streaming, batching, and backoff strategies.
- Implement prompt and tool execution flows with LangChain or similar, including agent tools and function calling.

### RAG Data Systems and Vector Search
- Design chunking and embedding pipelines for documents, time series, and multimedia. Orchestrate with Step Functions or Airflow.
- Operate vector search using OpenSearch Serverless, Aurora PostgreSQL with pgvector, or Pinecone. Tune recall, latency, and cost.
- Build and maintain knowledge bases and data syncs from S3, Aurora, DynamoDB, and external sources.

### Evaluation, Observability, and Cost Governance
- Create offline and online eval harnesses for prompts, retrievers, and chains. Track quality, latency, and regression risk.
- Instrument model and app telemetry with CloudWatch and OpenTelemetry. Build token usage and cost dashboards with budgets and alerts.
- Add guardrails, rate limits, fallbacks, and provider routing for resilience.

### Safety, Privacy, and Compliance
- Implement PII detection and redaction, access controls, content filters, and human-in-the-loop review where needed.
- Use Bedrock Guardrails or policy services to enforce safety standards. Maintain audit trails for regulated environments.

### Data Pipeline Construction
- Build ingestion and processing pipelines for structured, unstructured, and multimedia data. Ensure integrity, lineage, and cataloging with Glue and Lake Formation.
- Optimize bulk data movement and storage in S3, Glacier, and tiered storage. Use Athena for ad-hoc analysis.

### IoT Deployment Management
- Manage infrastructure that deploys to and communicates with edge devices. Support secure messaging, identity, and over-the-air updates.

### Analytics and Application Support
- Partner with product and application teams to integrate retrieval services, embeddings, and LLM chains into user-facing features.
- Provide expert troubleshooting for cloud and ML services with an emphasis on uptime and performance.

### Performance Optimization
- Tune retrieval quality, context window use, and caching with Redis or Bedrock Knowledge Bases.
- Optimize inference with model selection, quantization where applicable, GPU/CPU instance choices, and autoscaling strategies.

## What Will Make You Successful
- End-to-End Ownership: Drives work from design through production, including on-call and continuous improvement.
- LLM Systems Experience: Shipped or operated LLM-powered applications in production. Familiar with RAG design, prompt versioning, and chain orchestration using LangChain or similar.
- AWS Depth: Strong with core AWS services such as VPC, IAM, KMS, CloudWatch, S3, ECS/EKS, Lambda, Step Functions, Bedrock, and SageMaker.
- Data Engineering Skills: Comfortable building ingestion and transformation pipelines in Python. Familiar with Glue, Athena, and event-driven patterns using EventBridge and SQS.
- Security Mindset: Applies least privilege, secrets management, network isolation, and compliance practices appropriate to sensitive data.
- Evaluation and Metrics: Uses quantitative evals, A/B testing, and live metrics to guide improvements.
- Clear Communication: Explains tradeoffs and aligns partners across product, security, and application engineering.

## Bonus Points
- 4+ years working with serverless or container platforms on AWS.
- Experience with vector databases, OpenSearch, or pgvector at scale.
- Hands-on with Bedrock Guardrails, Knowledge Bases, or custom policy engines.
- Familiarity with GPU workloads, Triton Inference Server, or TensorRT-LLM.
- Experience with big data tools for large-scale processing and search.
- Background in aviation data or other safety-critical domains.
- DevOps or DevSecOps experience automating CI/CD for ML and app services.

## Perks & Benefits (Full-Time Employees)
- Healthcare: 100% of employee medical premiums covered; 25% for dependents.
- Time Off: 3 weeks PTO plus 13+ paid company holidays.
- Stipends: Monthly phone and wellness benefits.
- 401(k): Offered. There is no employer match at present, but the company is committed to enhancing this benefit in the future.

**Apply:** https://hotfix.jobs/jobs/software-engineer-cloud-infrastructure-at-beacon-ai-5b25df7a-c917-4c94-8dc5-4a5150cab46a
**Canonical:** https://hotfix.jobs/jobs/software-engineer-cloud-infrastructure-at-beacon-ai-5b25df7a-c917-4c94-8dc5-4a5150cab46a