Platform Ops Lead
Leads the platform operations team supporting developers on a GitLab- and Kubernetes-based DevOps platform. Resolves deployment issues, manages on-call support, trains team members, and keeps services within their SLOs in a microservices environment. Requires a BS in STEM, Linux skills, and scripting experience.
Duties and Responsibilities
- Identify and resolve operational problems in a microservices environment
- Work with developers to resolve deployment and runtime problems
- Perform analysis and debugging work across multiple technologies
- Prioritize issues to keep applications within error budgets and meeting their SLOs
- Provide technical solutions to a wide range of problems and user requests
- Document processes, procedures and SOPs by soliciting feedback and suggestions from team members
- Compile postmortems and action items to minimize future outages
- Interview candidates for team member roles and decide which to recommend for hire
- Train new team members, and assist them with issues
- Provide on-call support to NCBI's internal developers and other staff
Requirements
- BS degree in STEM or equivalent experience
- Customer-focused, team-oriented disposition
- Good systems debugging skills
- Comfortable with the Linux environment or UNIX CLI
- Experience with some programming or scripting language
- Experience creating processes, procedures and SOP documentation
- General understanding of TCP/IP, HTTP, and related protocols
- Initiative to take ownership of tasks and drive them to completion
- Comfortable dealing with users with varying levels of IT knowledge
- Eager to learn new technologies
- Strong communication and soft skills to interface with customers, peers and management
- Good judgement, sense of integrity, and responsibility
Preferred Experience and Skillsets
- Kubernetes, OpenShift, Cloud or Linux experience
- Experience with:
  - Service Reliability Engineering in any capacity
  - Linux systems administration
  - Automated CI servers, especially TeamCity and/or GitLab
  - Automation programming/scripting in any of: Bash, Ruby, Python, Go, Java, Scala, Rust, C++, Perl
  - Automated configuration management, such as Puppet, Ansible, Chef, bcfg2 or CFEngine (Puppet is preferred)
  - Version control systems, especially Git
  - Service mesh technologies (e.g., Linkerd, Istio)
  - Configuring or using monitoring and alerting technologies (TIGK stack, Grafana, Prometheus, OpsGenie)
  - Confluence, Jira, and Microsoft Office suite
  - GitOps tools, especially ArgoCD
  - Google Anthos
- Understanding of:
  - Linux internals (system calls, file systems, processes, etc.)
  - Linux network configuration
  - Linux application containerization, especially Docker
  - Network-attached storage technologies
  - Cloud computing environments such as AWS, GCP or Azure
  - Automated CI/CD pipelines
  - Distributed systems design principles
Benefits and Salary
- Competitive benefits package that includes medical, dental and vision coverage, a 401(k) plan with employer contribution, paid holidays, vacation, and tuition reimbursement
- Competitive salary commensurate with experience and location. The targeted range for this position is $135,000 - $165,000