Posted 7 days ago

Site Reliability Engineer - Senior

Ho Chi Minh City, Hybrid

ABOUT US:

At Gradion, we are the strategic partner for ambitious businesses, helping them achieve breakthrough growth through Digital Innovation and Deep Tech.

With a global vision and an AI-first approach, we enable clients to reshape strategies, optimize systems, and adopt cutting-edge technologies to create sustainable value.

From AI and data to cybersecurity, robotics, and large-scale enterprise platforms, Gradion designs practical solutions that lay the foundation for the next generation of billion-dollar companies.

OUR FACTS & FIGURES:

- 23+ years of expertise building digital platforms and deep-tech solutions.

- 3 continents: Asia, Europe and Africa.

- 300+ specialists across 7 countries: Vietnam, Singapore, Thailand, Saudi Arabia, Germany, Egypt and Australia.

- 100+ enterprise clients, including several unicorns (e.g., Alaiko, HomeToGo, Roadsurfer).

- Vietnam’s Best IT Company - recognized by ITViec for 8 consecutive years, including 2 consecutive years of ranking #1 (2024 and 2025).

- ISO 27001.

About the Role

Gradion is expanding its SRE team for a client with a long-term managed services contract running through 2028. You will be part of a global, follow-the-sun SRE function, responsible for platform stability, cloud infrastructure, and 24/7 incident response across European and global client time zones.
This role suits engineers who are technically solid, self-directed, and comfortable operating in a fast-moving, internationally distributed environment. You will go through a structured onboarding alongside commercetools' internal SRE team before taking on independent operational responsibility.

What You Will Do

- Own platform availability: monitor, triage, and resolve incidents within defined SLA windows
- Manage cloud infrastructure on AWS and/or GCP - provisioning, scaling, and day-to-day operations
- Maintain and improve CI/CD pipelines and GitOps workflows
- Operate observability systems: monitoring, logging, and alerting at production scale
- Participate in on-call rotation as part of the global follow-the-sun coverage model
- Configure, deploy, and manage AI tooling and MCP servers in production environments
- Contribute to infrastructure automation, scripting, and internal tooling
- Write clear post-incident reviews and contribute to the monthly operational report
- Collaborate closely with engineering teams across multiple time zones

What You Bring

- 4+ years in a DevOps / SRE / Platform Engineering role within an international team
- Solid Kubernetes knowledge - cluster operations, troubleshooting, and configuration
- Hands-on cloud experience with AWS and/or GCP
- Good understanding of networking fundamentals - DNS, load balancing, firewalls, VPC
- Scripting and automation skills (Python, Bash, or similar)
- Experience with CI/CD tools and GitOps-based delivery
- Working knowledge of monitoring and observability systems (Prometheus, ELK, or equivalent)
- Fluent English (C1 minimum) - daily communication with European stakeholders is a core requirement
- Self-directed and proactive - you ask the right questions and drive issues to resolution without waiting to be told

Nice to Have

- Experience configuring and managing MCP servers and AI tooling in production
- Exposure to AI enablement workflows or LLM infrastructure
- Background supporting eCommerce or SaaS platforms
- Familiarity with the Frontastic / commercetools Frontend ecosystem
