Site Reliability Engineer
About the role
We’re hiring a Site Reliability Engineer (SRE) to ensure the reliability, performance, and scalability of Plenful’s production systems as we continue to grow.
This role is centered on operating real systems at scale — not just building infrastructure, but deeply understanding how it behaves under load, fails in production, and recovers. You’ll define reliability standards, own production health, and build the feedback loops that make our systems more resilient over time.
You’ll work closely with backend, data, and ML engineers to ensure our platform is highly available, measurable, and continuously improving. This includes everything from incident response and performance debugging to SLO design and system-level optimization.
What You’ll Do
Reliability Engineering & System Ownership
- Define and implement SLIs, SLOs, and error budgets across core services
- Own production system health, including uptime, latency, and availability targets
- Continuously improve system resilience through proactive reliability work
- Identify and mitigate single points of failure across distributed systems
Production Operations & Incident Response
- Participate in and improve on-call rotations and incident response processes
- Lead incident triage, mitigation, and resolution in real time
- Conduct blameless postmortems and ensure follow-through on action items
- Build tooling and automation to reduce MTTR (Mean Time to Recovery)
Observability & System Insight
- Design and evolve observability systems across:
  - Metrics, logs, and distributed tracing (OpenTelemetry)
  - Tooling including Datadog, CloudWatch, Grafana, Sentry
- Improve signal quality to reduce noise and alert fatigue
- Develop dashboards and alerts that reflect real system health and user impact
- Use observability data to drive performance and reliability improvements
Performance & Scalability
- Analyze system performance under load and identify bottlenecks
- Optimize latency, throughput, and resource utilization across:
  - Serverless systems (AWS Lambda)
  - Containerized services (ECS)
  - Data systems (Aurora Postgres, ClickHouse)
- Partner with engineering teams to improve system efficiency and scaling behavior
Automation & Reliability Tooling
- Build automation to eliminate repetitive operational work
- Improve deployment safety through reliability checks and safeguards
- Contribute to CI/CD pipelines (GitHub Actions) with a focus on system stability
- Develop tools for:
  - Incident response
  - Debugging
  - Capacity planning
Security, Compliance & Operational Maturity
- Partner with security and compliance to ensure systems meet operational standards
- Support audit readiness and reliability-related compliance requirements (Vanta)
- Integrate monitoring and alerting into security and SIEM workflows
- Help mature operational practices across the engineering team
Environment & Technical Context
You’ll work across a modern distributed stack:
- Cloud: AWS (ECS, Lambda, RDS Aurora Postgres, CloudWatch)
- Infrastructure: Terraform, Ansible, Linux
- CI/CD: GitHub Actions
- Observability: Datadog, Grafana, CloudWatch, OpenTelemetry, Sentry, pganalyze
- Data Systems: Postgres, ClickHouse
- Security & Compliance: Vanta, SIEM tooling
- Product & Analytics: Amplitude
- ML/Platform Infra: TrueFoundry
What Success Looks Like
- Clear, enforced SLOs and error budgets across critical systems
- Incidents are rare and well-managed, and their frequency decreases over time
- Engineers have high-confidence signals about system health
- Alerts are actionable, not noisy
- Systems scale predictably under load without degradation
- Postmortems lead to real, measurable improvements
- Reliability is treated as a shared engineering responsibility, not a reactive function
Ideal Background
Must Have
- 5+ years in Site Reliability Engineering, SRE-adjacent roles, or production infrastructure
- Strong experience operating and debugging distributed systems in production
- Hands-on experience with:
  - Observability tooling (Datadog, Grafana, OpenTelemetry, etc.)
  - Incident response and on-call practices
  - Performance and reliability debugging
- Experience defining and working with SLOs / SLIs / error budgets
- Familiarity with:
  - AWS environments
  - Serverless and container-based architectures