Posted 3 months ago

Senior/Principal DevOps

Lisbon · On-site · Full-time

About this role

We’re looking for a Senior/Principal DevOps engineer to own our cloud-only platform and keep it reliable under high load and bursty traffic. Our services run entirely on GCP, fronted by Cloudflare, with deep observability in Datadog and CI/CD in GitHub Actions.

This is a hands-on role with real ownership: ensuring we meet our SLA/SLOs, scaling fast (10–50×), and keeping infrastructure efficient and cost-conscious as the company grows and microservices multiply.

Role Responsibilities

Cloud Infrastructure Ownership

Own and evolve production infrastructure on GCP and Cloudflare (cloud-only, no on-prem).
Maintain high availability and performance for a SaaS platform serving both B2B and B2C use cases.

Scalability & High-Load Resilience

Design and operate for unpredictable spikes where load can jump 10–20× within seconds.
Build scaling strategies across compute, networking, and data layers (autoscaling, capacity planning, bottleneck removal, safe degradation patterns).

SLA/SLO & Incident Excellence

Be accountable for reliability outcomes: availability/latency/error rates tied to SLA/SLO.
Lead incident response practices: detection → mitigation → postmortem → permanent fixes (root cause elimination).

IaC & Kubernetes Platform Operations

Build and maintain Infrastructure as Code using Terraform (and Terragrunt where applicable).
Own Kubernetes operations on GKE: upgrades, scaling, operational hardening.
Write and maintain Helm charts and Kubernetes manifests where needed.

Observability (Datadog)

Build end-to-end observability using Datadog (metrics/logs/APM): dashboards, monitors, alert strategy.
Ensure critical system paths and dependencies are visible and actionable (reduce alert noise, increase signal).

DevSecOps Baseline

Configure and operate security tooling and monitoring (e.g., Security Command Center, scanners/analyzers).
Triage findings and either fix issues directly or delegate remediation to the right teams.

CI/CD Enablement

Collaborate with engineering to streamline and harden GitHub Actions CI/CD pipelines.
Increase deployment safety and speed through automation and platform guardrails.

Cost Management

Own cost visibility and optimization: identify waste, right-size resources, and implement practical FinOps controls.

Required Qualifications

Strong production experience in DevOps/SRE (typically 5+ years, but we value impact over years).
Proven experience operating infrastructure for SaaS with explicit SLA commitments (B2B + B2C is a plus).
Hands-on expertise with GCP, especially GKE, plus relevant managed services (e.g., Cloud SQL, BigQuery, Bigtable, Pub/Sub, Dataflow, Cloud Run, Cloud Deploy, Memorystore).
Strong Infrastructure-as-Code with Terraform (bonus: Terragrunt).
Strong Kubernetes operations background (GKE at scale, reliability practices, upgrades, scaling).
Experience with Cloudflare (WAF/DNS/edge basics; Workers/CDN is a plus).
Production observability experience with Datadog (or comparable), ideally including APM/logging.
Strong scripting/automation skills and a reliability-first mindset.

Preferred Qualifications

Experience in game dev or similarly bursty high-load consumer products.
Familiarity with SOC 2 / PCI-DSS audits and security architecture requirements.
Service mesh experience (e.g., Cloud Service Mesh) in production.
Mature SRE practices: error budgets, on-call maturity, runbooks, proactive incident prevention.

What Success Looks Like

Platform consistently meets or exceeds SLA/SLO targets under bursty high load.
Incidents are detected early, mitigated quickly, and don’t repeat due to strong postmortem follow-through.
Scaling events (10–50×) are routine rather than heroic.
Cloud spend is transparent, controlled, and optimized without harming reliability.
Engineering teams ship faster with safer, smoother CI/CD and fewer infrastructure bottlenecks.

Why Join Us

Cloud-only infrastructure (GCP) with meaningful scale and real reliability ownership.
Small team (15–20 engineers) with high autonomy and fast decision-making.
Direct impact on platform stability, scaling, and cost efficiency.
Opportunity to shape SRE culture, tooling, and operational standards in a fast-growing startup.

Aghanim helps game developers achieve financial and creative independence by providing the solutions they need to launch, run, and grow their businesses.

Skills

CI/CD, Cloudflare, Datadog, FinOps, GCP, GitHub Actions, GKE, Helm, IaC, Kubernetes, Observability, SRE, Terraform, Terragrunt
