Posted 2 months ago

Infra Support Engineer

Kuala Lumpur | On-site | Full-time

AI Summary

Infra Support Engineer helps diagnose and fix AI infrastructure issues, supports GPU/CPU nodes, networking, storage, and orchestration, and coordinates with SRE to maintain reliability.

About this role

Title: Infra Support Engineer - Fuku

URL Source: https://apply.workable.com/j/03C688AFC9

Infra Support Engineer – GMI Global Infrastructure Team

Preferred Location:

Taiwan
Malaysia

Responsibilities:

Provide first- and second-line technical support to customers for AI infrastructure, including GPU/CPU nodes, networking, storage, orchestration, and platform services. Support is delivered via ticketing systems, email, Slack, or other messaging platforms.
Support GPU cluster delivery, including system provisioning, image deployment, network validation, BIOS/firmware updates, and GPU driver/runtime installation.
Monitor system health and service-level indicators using alerts and dashboards; respond to alerts as part of a scheduled 24x7 rotation.
Triage incidents by gathering context, verifying scope and impact, and following standard operating procedures and runbooks to perform immediate mitigations.
Escalate incidents to global SRE engineers with clear, concise incident notes and relevant logs/traces.
Maintain incident logs, update status pages, and communicate timely updates to stakeholders during incidents.
Perform routine operational tasks such as log checks, health checks, capacity checks, and simple automated fixes.
Participate in postmortems and contribute actionable follow-ups to reduce recurrence of incidents.
Help maintain and improve standard operating procedures (SOPs), run periodic runbook validation, and document new procedures.
Work collaboratively with developers and SRE teams to improve system reliability.

Qualifications:

Bachelor’s degree in Computer Science or a related field.
Over 2 years of experience in IT operations, server administration, SRE, DevOps, or technical support.
Hands-on Linux experience, including shell, kernel, and log management.
Basic networking knowledge, including TCP/IP, DNS, HTTP, and VLANs.
Familiarity with monitoring, alerting, and logging tools such as Prometheus, Grafana, and Alertmanager.
Experience with NVIDIA GPU infrastructure and Kubernetes.
Comfortable collecting diagnostics, reading logs, and interpreting traces.
Strong troubleshooting mindset and ability to follow runbooks under pressure.
Excellent written and verbal communication skills for customer-facing incident handling.
Willingness to work shifts and participate in on-call rotations.
Bilingual in English and Chinese.

Skills

Alertmanager, Alerts, Automation, BIOS/firmware Updates, Capacity Checks, DevOps, DNS, GPU Driver/runtime Installation, Grafana, HTTP, Image Deployment, Incident Management, Kernel, Kubernetes, Linux, Log Analysis, Log Management, Monitoring, Network Validation, NVIDIA GPU Infrastructure, On-call Rotations, Postmortems, Prometheus, Runbooks, Shell, Slack, SOPs, SRE, System Provisioning, TCP/IP, Ticketing Systems, VLANs
