Aghanim

Posted 1 month ago

Open

Senior/Principal DevOps

Lisbon, On-site, Full-time

AI Summary

Senior/Principal DevOps owns a cloud-only GCP platform, ensuring reliability, scalability, and cost-efficiency for a high-load SaaS. Leads IaC, Kubernetes operations, observability, and CI/CD optimization with strong incident management.

About this role

We’re looking for a Senior/Principal DevOps engineer to own our cloud-only platform and keep it reliable under high-load, bursty traffic. Our services run entirely on GCP, fronted by Cloudflare, with deep observability in Datadog and CI/CD in GitHub Actions.

This is a hands-on role with real ownership: ensuring we meet our SLA/SLOs, scaling fast (10–50×), and keeping infrastructure efficient and cost-conscious as the company grows and microservices multiply.

Role Responsibilities

  1. Cloud Infrastructure Ownership

  • Own and evolve production infrastructure on GCP and Cloudflare (cloud-only, no on-prem).

  • Maintain high availability and performance for a SaaS platform serving both B2B and B2C use cases.

  2. Scalability & High-Load Resilience

  • Design and operate for unpredictable spikes where load can jump 10–20× within seconds.

  • Build scaling strategies across compute, networking, and data layers (autoscaling, capacity planning, bottleneck removal, safe degradation patterns).

  3. SLA/SLO & Incident Excellence

  • Be accountable for reliability outcomes: availability/latency/error rates tied to SLA/SLO.

  • Lead incident response practices: detection → mitigation → postmortem → permanent fixes (root cause elimination).

  4. IaC & Kubernetes Platform Operations

  • Build and maintain Infrastructure as Code using Terraform (and Terragrunt where applicable).

  • Own Kubernetes operations on GKE: upgrades, scaling, operational hardening.

  • Write and maintain Helm charts and Kubernetes manifests where needed.

  5. Observability (Datadog)

  • Build end-to-end observability using Datadog (metrics/logs/APM): dashboards, monitors, alert strategy.

  • Ensure critical system paths and dependencies are visible and actionable (reduce alert noise, increase signal).

  6. DevSecOps Baseline

  • Configure and operate security tooling and monitoring (e.g., Security Command Center, scanners/analyzers).

  • Triage findings and either fix issues directly or delegate remediation to the right teams.

  7. CI/CD Enablement

  • Collaborate with engineering to streamline and harden GitHub Actions / GitHub CI/CD pipelines.

  • Increase deployment safety and speed through automation and platform guardrails.

  8. Cost Management

  • Own cost visibility and optimization: identify waste, right-size resources, and implement practical FinOps controls.

Required Qualifications

  • Strong production experience in DevOps/SRE (typically 5+ years, but we value impact over years).

  • Proven experience operating infrastructure for SaaS with explicit SLA commitments (B2B + B2C is a plus).

  • Hands-on expertise with GCP, especially GKE, plus relevant managed services (e.g., Cloud SQL, BigQuery, Bigtable, Pub/Sub, Dataflow, Cloud Run, Cloud Deploy, Memorystore).

  • Strong Infrastructure-as-Code with Terraform (bonus: Terragrunt).

  • Strong Kubernetes operations background (GKE at scale, reliability practices, upgrades, scaling).

  • Experience with Cloudflare (WAF/DNS/edge basics; Workers/CDN is a plus).

  • Production observability experience with Datadog (or comparable), ideally including APM/logging.

  • Strong scripting/automation skills and a reliability-first mindset.

Preferred Qualifications

  • Experience in game dev or similarly bursty high-load consumer products.

  • Familiarity with SOC 2 / PCI-DSS audits and security architecture requirements.

  • Service mesh experience (e.g., Cloud Service Mesh) in production.

  • Mature SRE practices: error budgets, on-call maturity, runbooks, proactive incident prevention.

What Success Looks Like

  • Platform consistently meets or exceeds SLA/SLO targets under bursty high load.

  • Incidents are detected early, mitigated quickly, and don’t repeat due to strong postmortem follow-through.

  • Scaling events (10–50×) are routine rather than heroic.

  • Cloud spend is transparent, controlled, and optimized without harming reliability.

  • Engineering teams ship faster with safer, smoother CI/CD and fewer infrastructure bottlenecks.

Why Join Us

  • Cloud-only infrastructure (GCP) with meaningful scale and real reliability ownership.

  • Small team (15–20 engineers) with high autonomy and fast decision-making.

  • Direct impact on platform stability, scaling, and cost efficiency.

  • Opportunity to shape SRE culture, tooling, and operational standards in a fast-growing startup.

Aghanim helps game developers achieve financial and creative independence by providing the solutions they need to launch, run, and grow their businesses.

Skills

  • BigQuery
  • Bigtable
  • CI/CD Automation
  • Cloudflare (WAF/DNS, edge basics; Workers/CDN a plus)
  • Cloud Run
  • Cloud SQL
  • Datadog (metrics, logs, APM, dashboards, monitors)
  • Dataflow
  • FinOps / Cost Optimization
  • GitHub Actions / GitHub CI/CD Pipelines
  • Google Cloud Platform (GCP)
  • Google Kubernetes Engine (GKE)
  • Helm Charts and Kubernetes Manifests
  • Kubernetes Deployments and Upgrades
  • Memorystore
  • Pub/Sub
  • Security Command Center
  • SRE Practices (SLA/SLO, Incident Response, Postmortems)
  • Terraform
  • Terragrunt
