About the Role
We’re hiring a Site Reliability Engineer (SRE) to ensure the reliability, performance, and scalability of Plenful’s production systems as we continue to grow.

This role is centered on operating real systems at scale: not just building infrastructure, but deeply understanding how it behaves under load, how it fails in production, and how it recovers. You’ll define reliability standards, own production health, and build the feedback loops that make our systems more resilient over time.

You’ll work closely with backend, data, and ML engineers to ensure our platform is highly available, measurable, and continuously improving. This includes everything from incident response and performance debugging to SLO design and system-level optimization.

What You’ll Do

Reliability Engineering & System Ownership

Define and implement SLIs, SLOs, and error budgets across core services
Own production system health, including uptime, latency, and availability targets
Continuously improve system resilience through proactive reliability work
Identify and mitigate single points of failure across distributed systems

Production Operations & Incident Response

Participate in and improve on-call rotations and incident response processes
Lead incident triage, mitigation, and resolution in real time
Conduct blameless postmortems and ensure follow-through on action items
Build tooling and automation to reduce MTTR (Mean Time to Recovery)

Observability & System Insight

Design and evolve observability systems across:
- Metrics, logs, and distributed tracing (OpenTelemetry)
- Tooling including Datadog, CloudWatch, Grafana, Sentry
Improve signal quality to reduce noise and alert fatigue
Develop dashboards and alerts that reflect real system health and user impact
Use observability data to drive performance and reliability improvements

Performance & Scalability

Analyze system performance under load and identify bottlenecks
Optimize latency, throughput, and resource utilization across:
- Serverless systems (AWS Lambda)
- Containerized services (ECS)
- Data systems (Aurora Postgres, ClickHouse)
Partner with engineering teams to improve system efficiency and scaling behavior

Automation & Reliability Tooling

Build automation to eliminate repetitive operational work
Improve deployment safety through reliability checks and safeguards
Contribute to CI/CD pipelines (GitHub Actions) with a focus on system stability
Develop tools for:
- Incident response
- Debugging
- Capacity planning

Security, Compliance & Operational Maturity

Partner with security and compliance to ensure systems meet operational standards
Support audit readiness and reliability-related compliance requirements (Vanta)
Integrate monitoring and alerting into security and SIEM workflows
Help mature operational practices across the engineering team

Environment & Technical Context

You’ll work across a modern distributed stack:

Cloud: AWS (ECS, Lambda, RDS Aurora Postgres, CloudWatch)
Infrastructure: Terraform, Ansible, Linux
CI/CD: GitHub Actions
Observability: Datadog, Grafana, CloudWatch, OpenTelemetry, Sentry, pganalyze
Data Systems: Postgres, ClickHouse
Security & Compliance: Vanta, SIEM tooling
Product & Analytics: Amplitude
ML/Platform Infra: TrueFoundry

What Success Looks Like

Clear, enforced SLOs and error budgets across critical systems
Incidents are rare, well managed, and less frequent over time
Engineers have high-confidence signals about system health
Alerts are actionable, not noisy
Systems scale predictably under load without degradation
Postmortems lead to real, measurable improvements
Reliability is treated as a shared engineering responsibility, not a reactive function

Ideal Background

Must Have

5+ years in Site Reliability Engineering, SRE-adjacent roles, or production infrastructure
Strong experience operating and debugging distributed systems in production
Hands-on experience with:
- Observability tooling (Datadog, Grafana, OpenTelemetry, etc.)
- Incident response and on-call practices
- Performance and reliability debugging
Experience defining and working with SLOs / SLIs / error budgets
Familiarity with:
- AWS environments
- Serverless and container-based architectures
