Fuku

Posted 2 months ago

Open

Infra Support Engineer

Kuala Lumpur · On-site · Full-time

AI Summary

The Infra Support Engineer diagnoses and fixes AI infrastructure issues, supports GPU/CPU nodes, networking, storage, and orchestration, and coordinates with SRE to maintain reliability.

About this role

URL Source: https://apply.workable.com/j/03C688AFC9

Infra Support Engineer – GMI Global Infrastructure Team

Preferred Location:

  • Taiwan

  • Malaysia

Responsibilities:

  • Provide first and second-line technical support to customers for AI Infrastructure, including GPU/CPU nodes, networking, storage, orchestration, and platform services. Support is delivered via ticketing systems, emails, Slack, or other messaging platforms.

  • Support GPU cluster delivery, including system provisioning, image deployment, network validation, BIOS/firmware updates, and GPU driver/runtime installation.

  • Monitor system health and service-level indicators using alerts and dashboards; respond to alerts around the clock (24x7) according to the shift schedule.

  • Triage incidents by gathering context, verifying scope and impact, and following standard operating procedures and runbooks to perform immediate mitigations.

  • Escalate incidents to global SRE engineers with clear, concise incident notes and relevant logs/traces.

  • Maintain incident logs, update status pages, and communicate timely updates to stakeholders during incidents.

  • Perform routine operational tasks such as log checks, health checks, capacity checks, and simple automated fixes.

  • Participate in postmortems and contribute actionable follow-ups to reduce recurrence of incidents.

  • Help maintain and improve standard operating procedures (SOPs), run periodic runbook validation, and document new procedures.

  • Work collaboratively with developers and SRE teams to improve system reliability.

Qualifications:

  • Bachelor’s degree in Computer Science or a related field.

  • More than 2 years of experience in IT operations, server administration, SRE, DevOps, or technical support.

  • Hands-on Linux experience, including shell, kernel, and log management.

  • Basic networking knowledge, including TCP/IP, DNS, HTTP, and VLANs.

  • Familiarity with monitoring, alerting, and logging tools such as Prometheus, Grafana, and Alertmanager.

  • Experience with NVIDIA GPU infrastructure and Kubernetes.

  • Comfortable collecting diagnostics, reading logs, and interpreting traces.

  • Strong troubleshooting mindset and ability to follow runbooks under pressure.

  • Excellent written and verbal communication skills for customer-facing incident handling.

  • Willingness to work shifts and participate in on-call rotations.

  • Bilingual in English and Chinese.

Skills

Alertmanager, Alerts, Automation, BIOS/firmware Updates, Capacity Checks, DevOps, DNS, GPU Driver/runtime Installation, Grafana, HTTP, Image Deployment, Incident Management, Kernel, Kubernetes, Linux, Log Analysis, Log Management, Monitoring, Network Validation, NVIDIA GPU Infrastructure, On-call Rotations, Postmortems, Prometheus, Runbooks, Shell, Slack, SOPs, SRE, System Provisioning, TCP/IP, Ticketing Systems, VLANs
