Magnet Forensics

Posted 6 days ago

Open

Senior Site Reliability Engineer

Canada, Remote, Full-time

About this role

Who We Are; What We Do; Where We’re Going
Magnet Forensics is a global leader in the development of digital investigative software that acquires, analyzes, and shares evidence from computers, smartphones, tablets, and IoT-related devices. We are continually innovating so our customers can deploy advanced and effective tools to protect their companies, communities, and countries.
Serving thousands of customers globally, our solutions are playing a crucial role in modernizing digital investigations, helping investigators fight crime, protect assets, and guard national security.
With employees based around the world, Magnet Forensics has been expanding our global presence. As a part of Magnet Forensics, you can expect to make a difference in the world, no matter what role you play. You’ll be supported through learning and development, and by an incredible team with exceptional talent and integrity.
If you think you would be the right person to join our team working towards this goal, we would love to hear from you!

Role Overview

We're seeking a Senior Site Reliability Engineer to join our SaaS-Ops team within Shared Services Engineering. The team owns reliability and operational excellence for our highly available SaaS platform, a production Kubernetes environment serving law enforcement and government customers globally.
This role requires deep AWS expertise, infrastructure-as-code discipline, and CI/CD best practices. You'll work closely with Application, Platform, and Security teams to drive secure-by-design architectures and improve automation and reliability across our cloud environments. You'll ship infrastructure as code, respond to production incidents with discipline, and drive platform modernization through deliberate roadmap execution.
As part of the SaaS-Ops team, you’ll work in a high-performing environment where members take ownership of outcomes and operate with a strong sense of trust and autonomy. You’ll identify challenges, contribute to solutions, raise concerns proactively, support improvements, and navigate situations requiring timely decision-making. If you’re looking for your next challenge where infrastructure quality directly impacts real‑world outcomes, this role could be a great fit!
Note: This role includes participation in an on-call rotation.

What You’ll Do

  • Own and operate production Kubernetes clusters (Amazon EKS) including upgrades, scaling, security hardening, and cluster lifecycle management;
  • Design, implement, and maintain infrastructure-as-code using Terraform; contribute to shared module libraries and enforce IaC standards across the team;

  • Manage and evolve Helm chart definitions and ArgoCD GitOps workflows for multi-region SaaS deployments;

  • Operate and maintain observability infrastructure including Grafana, alerts, dashboards, and log pipelines; eliminate noise and surface signal;

  • Contribute to pipeline reliability: identify flaky stages, reduce build times, and improve developer experience across CI/CD pipelines;

  • Remediate security vulnerabilities (CVEs) in container images and infrastructure components; participate in compliance work including FedRAMP support activities;

  • Develop and maintain runbooks, change management procedures, and operational documentation;

  • Ensure alignment with internal policies and frameworks such as ISO 27001, SOC 2, and NIST;

  • Contribute to AI-assisted tooling and automation (e.g., Claude-based Terraform agents, automated triage tools) as part of the team's operational efficiency roadmap;

  • Participate in on-call incident response rotation; lead or support incident command during active production incidents including root cause analysis and post-incident review.

What We’re Looking For

  • 5+ years of industry experience with a trajectory that demonstrates growing depth in cloud infrastructure and SRE practices;

  • Managed production Kubernetes environments at scale: not just deployed workloads, but owned cluster health, upgrades, and failure modes;

  • Responded to production incidents in high-stakes environments where downtime has real consequences;

  • Written and maintained Terraform at the module level, not just as a consumer, with an understanding of state, dependencies, and the operational burden of drift;

  • Operated in an environment that uses GitOps, with a good understanding of Helm chart organization, ArgoCD app-of-apps patterns, or equivalent;

  • Balanced reactive operational work with proactive roadmap delivery, protecting time for improvements while keeping production stable;

  • Worked with observability as a first-class discipline: built meaningful dashboards, eliminated alert fatigue, and used metrics to make operational decisions;

  • Contributed to security hardening in a regulated or compliance-adjacent environment: FedRAMP, SOC 2, or similar frameworks are a strong asset.
