Infra Support Engineer
AI Summary
Infra Support Engineer helps diagnose and fix AI infrastructure issues, supports GPU/CPU nodes, networking, storage, and orchestration, and coordinates with SRE to maintain reliability.
About this role
Title: Infra Support Engineer - Fuku
URL Source: https://apply.workable.com/j/03C688AFC9
Infra Support Engineer – GMI Global Infrastructure Team
Preferred Location:
- Taiwan
- Malaysia
Responsibilities:
- Provide first- and second-line technical support to customers for AI infrastructure, including GPU/CPU nodes, networking, storage, orchestration, and platform services. Support is delivered via ticketing systems, email, Slack, or other messaging platforms.
- Support GPU cluster delivery, including system provisioning, image deployment, network validation, BIOS/firmware updates, and GPU driver/runtime installation.
- Monitor system health and service-level indicators using alerts and dashboards; respond to alerts 24x7 as scheduled.
- Triage incidents by gathering context, verifying scope and impact, and following standard operating procedures and runbooks to perform immediate mitigations.
- Escalate incidents to global SRE engineers with clear, concise incident notes and relevant logs/traces.
- Maintain incident logs, update status pages, and communicate timely updates to stakeholders during incidents.
- Perform routine operational tasks such as log checks, health checks, capacity checks, and simple automated fixes.
- Participate in postmortems and contribute actionable follow-ups to reduce recurrence of incidents.
- Help maintain and improve standard operating procedures (SOPs), run periodic runbook validation, and document new procedures.
- Work collaboratively with developers and SRE teams to improve system reliability.
Qualifications:
- Bachelor’s degree in Computer Science or a related field.
- More than 2 years of experience in IT operations, server administration, SRE, DevOps, or technical support.
- Hands-on Linux experience, including shell, kernel, and log management.
- Basic networking knowledge, including TCP/IP, DNS, HTTP, and VLANs.
- Familiarity with monitoring, alerting, and logging tools such as Prometheus, Grafana, and Alertmanager.
- Experience with NVIDIA GPU infrastructure and Kubernetes.
- Comfortable collecting diagnostics, reading logs, and interpreting traces.
- Strong troubleshooting mindset and ability to follow runbooks under pressure.
- Excellent written and verbal communication skills for customer-facing incident handling.
- Willingness to work shifts and participate in on-call rotations.
- Bilingual in English and Chinese.