Our client is an innovative technology company operating large-scale cloud and edge infrastructure supporting AI-driven products and services. As the platform continues to expand, they are looking for a Site Reliability Engineer to help build highly reliable, observable, and secure systems that power mission-critical applications.

This role offers the opportunity to work across cloud infrastructure, Kubernetes, observability, security, automation, and emerging AI operational platforms in a fast-growing environment.

What you will do:

Reliability & Observability

Design and maintain monitoring, alerting, and dashboarding systems across cloud and edge environments.
Build visibility into system health through metrics, logs, traces, and performance analytics.
Define and manage SLIs, SLOs, and other service reliability targets.
Develop proactive monitoring and anomaly detection capabilities to identify issues before they impact users.

Cloud Infrastructure & Platform Operations

Deploy, manage, and optimize containerized workloads running on Kubernetes.
Maintain scalable cloud infrastructure across production environments.
Improve system performance, availability, and operational efficiency.
Support infrastructure provisioning through Infrastructure-as-Code practices.

Security & Access Management

Implement secure access controls and audit mechanisms across infrastructure environments.
Monitor for cybersecurity threats, unauthorized access attempts, and service disruptions.
Develop alerting and response procedures for security-related incidents.
Contribute to operational security best practices and governance initiatives.

Automation & Engineering Excellence

Automate repetitive operational tasks to reduce manual effort and improve reliability.
Build tooling and scripts to streamline infrastructure operations.
Support CI/CD workflows and deployment automation.
Promote documentation, operational standards, and continuous improvement.

Incident Response & Reliability Engineering

Participate in on-call rotations and incident management.
Lead troubleshooting efforts during production incidents.
Conduct root-cause analysis and post-mortem reviews.
Drive long-term improvements that enhance system resilience.

Cross-Functional Collaboration

Work closely with software, AI, machine learning, hardware, and product teams.
Ensure new services are production-ready with appropriate monitoring, security, and reliability measures.
Support the operational needs of both cloud-based and distributed edge computing environments.

What you will need:

3+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, or Production Operations.

Hands-on experience with AWS or other major cloud platforms.

Strong understanding of observability and monitoring tools such as Grafana and Prometheus.

Solid Linux administration and troubleshooting skills.

Experience with Docker, Kubernetes, and containerized workloads.

Experience with Infrastructure as Code tools such as Terraform.

Proficiency in at least one scripting or programming language, such as Python or Bash.

Understanding of networking fundamentals and infrastructure security concepts.

Experience supporting production systems and participating in incident response.

Strong automation mindset and commitment to operational excellence.

Nice-to-haves:

Experience operating large-scale edge computing or IoT deployments.

Familiarity with zero-trust access management platforms.

Experience in security operations, threat detection, or infrastructure security.

Exposure to AI infrastructure, LLM-based applications, or workflow automation platforms.

Knowledge of AIOps, anomaly detection, or intelligent monitoring solutions.

Familiarity with compliance and security frameworks such as ISO 27001.

Site Reliability Engineer (SRE)