Datacenter Field Engineer
AI Summary
Hardware Operations & Systems Engineer responsible for the physical health and foundational infrastructure of GPU clusters, including data center coordination, Linux administration, security, and hardware lifecycle management.
About this role
Sciforium is an AI infrastructure company developing next-generation multimodal AI models and a proprietary, high-efficiency serving platform. Backed by multi-million-dollar funding and direct sponsorship from AMD, with hands-on support from AMD engineers, the team is scaling rapidly to build the full stack powering frontier AI models and real-time applications.
Role Overview
We are looking for a dedicated Hardware Operations & Systems Engineer to own the physical health and foundational infrastructure of our GPU clusters. You will be the primary custodian of our compute hardware, responsible for everything from data center vendor coordination up to the base Linux OS layer. You will ensure our research and product teams have a stable, secure, and fully operational physical environment for their demanding compute workloads.
Key Responsibilities
System Health & Hardware Reliability
On-Call Response: Serve as the primary point of contact for physical system outages, hardware failures, and network interruptions to minimize downtime.
Cluster Monitoring: Proactively monitor hardware health, including GPU thermals, power draw, and physical system loads, catching anomalies before they impact active workloads.
Vendor Liaison: Work closely with data center facility staff and third-party hardware vendors to coordinate RMA processes, physical repairs, part replacements, and routine maintenance.
Hardware Deployment: Rack, cable, and lead the physical bring-up of new GPU nodes, ensuring power and network connectivity are fully integrated into the existing cluster.
Linux & Network Administration
OS Management: Install, patch, and maintain Linux operating systems (Ubuntu/CentOS/RHEL) across the cluster's bare-metal servers.
Security & Access: Configure and maintain edge and internal networking, including firewalls, VPNs, and strict SSH access controls to secure our infrastructure.
Identity & Storage Management: Administer LDAP/Active Directory for centralized user authentication and ensure network storage systems (NFS/GPFS/Lustre) are reliably mounted and properly permissioned.
Qualifications
Must-Haves:
3+ years of experience in Linux Systems Administration (deep knowledge of boot processes, systemd, disk management, etc.).
Strong background in server hardware troubleshooting, specifically within high-density environments (power, cooling, PCIe topologies).
Experience managing network security (VPNs, iptables/firewalld, VLANs) and directory services (LDAP/FreeIPA/Active Directory).
Proficiency in Bash scripting for essential system automation.
Nice-to-Haves:
Experience using configuration management tools like Ansible, SaltStack, or Terraform for OS provisioning.
Familiarity with data center operations and cooling requirements for high-TDP accelerators (like NVIDIA H100 or AMD MI300).
Benefits include
Medical, dental, and vision insurance
401k plan
Daily lunch, snacks, and beverages
Flexible time off
Competitive salary and equity
Equal opportunity
Sciforium is an equal opportunity employer. All applicants will be considered for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran status, or disability status.
