Posted 1 month ago
Sr. Site Reliability Engineer
About this role
Role Overview
We are seeking a high-caliber Site Reliability Engineer (SRE) to join our Forward Engineering team. You will be the guardian of our production ecosystems, ensuring that our complex, data-driven AI platforms remain resilient, scalable, and highly performant. This role is a hybrid of software engineering and systems architecture, with a specialized focus on **MLOps**, bridging the gap between model development and production-grade reliability.
Key Responsibilities
1. Reliability & Performance Engineering
- SLA/SLO Management: Define, monitor, and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for critical AI/ML services.
- Error Budgeting: Manage error budgets to balance the velocity of feature releases from the ML team with the stability of the production environment.
- Scalability: Architect and manage auto-scaling strategies for **Kubernetes (GKE)** to handle fluctuating workloads during model training and high-volume inference.
2. MLOps & AI Infrastructure
- Model Serving Reliability: Ensure the high availability of **Vertex AI endpoints** and custom inference services.
- GPU/TPU Optimization: Monitor and optimize compute resource utilization (accelerators) to ensure cost-efficient performance for Large Language Models (LLMs).
- Pipeline Resilience: Support and stabilize ML pipelines (Vertex AI Pipelines/Kubeflow) to ensure seamless data flow from ingestion to model retraining.
3. Automation & Orchestration (Eliminating "Toil")
- Infrastructure as Code (IaC): Use **Terraform** or Pulumi to provision and manage consistent, version-controlled cloud environments.
- CI/CD & GitOps: Design and optimize robust deployment pipelines for both application code and ML models using GitHub Actions, Cloud Build, or ArgoCD.
- Task Automation: Develop custom Python or Go scripts to automate repetitive operational tasks, self-healing mechanisms, and resource cleanup.
4. Monitoring, Alerting & Incident Response
- Observability: Build and manage comprehensive dashboards using **Prometheus, Grafana, or Google Cloud Operations Suite (Stackdriver)**.
- Incident Management: Act as a primary responder in on-call rotations, leading the technical resolution of production outages.
- Blameless Post-Mortems: Conduct deep-dive root cause analysis (RCA) to ensure systemic issues are identified and permanently remediated through code.
Requirements
- Orchestration: Expert-level knowledge of **Kubernetes (K8s)** and Docker.
- MLOps Stack: Familiarity with tools such as **Kubeflow, Vertex AI, MLflow, or DVC**.
- Scripting: Strong proficiency in **Python** (for automation) and Bash; knowledge of Go is a plus.
- Data Systems: Experience managing the reliability of data-heavy services (BigQuery, Pub/Sub, or vector databases such as Pinecone/Milvus).
- Networking: Solid understanding of VPCs, load balancers, DNS, and secure service mesh (Istio/Anthos).
Benefits
Significant career development opportunities exist as the company grows. The position offers a unique opportunity to be part of a small, fast-growing, challenging and entrepreneurial environment, with a high degree of individual responsibility.
Tiger Analytics provides equal employment opportunities to applicants and employees without regard to race, color, religion, age, sex, sexual orientation, gender identity/expression, pregnancy, national origin, ancestry, marital status, protected veteran status, disability status, or any other basis as protected by federal, state, or local law.