Integrant

Posted 1 month ago

Open

Senior Lead SysOps/DevOps Engineer

Cairo · On-site · Full-time

AI Summary

Senior Lead SysOps/DevOps Engineer who combines hands-on HPC/SysOps with solution design, presales, and client-facing architecture across GPU, Kubernetes, and hybrid HPC environments.

About this role

We are seeking an exceptional Senior Lead who combines deep hands-on SysOps/HPC expertise with the strategic vision of a solution architect. This is a rare dual-track role: you operate at the intersection of elite technical execution and client-facing presales, designing and running mission-critical GPU, HPC, and Kubernetes platforms while simultaneously co-creating opportunity with our commercial teams.

This role combines SysOps and HPC depth with DevOps. You are expected to spend at least 60% of your time on implementation and technical execution.

What You Will Do

Presales & Business Development

• Partner with sales and solution teams to identify and qualify new opportunities

• Lead or support technical presales activities: discovery workshops, RFP responses, architecture presentations

• Build and deliver proof-of-concepts (POCs) that demonstrate platform capabilities to prospective clients

• Prepare high-quality technical materials

• Act as a trusted technical advisor during client conversations, proposing solutions aligned to business goals

In-Account Delivery — SysOps & DevOps Execution

• Operate directly within client accounts as a senior SysOps/DevOps engineer

• Run, troubleshoot, and optimize production-grade Kubernetes clusters and GPU/HPC environments hands-on

• Own Linux system administration at a deep level: kernel tuning, storage, networking, performance profiling

• Implement and maintain IaC pipelines, GitOps workflows, and CI/CD systems

• Serve as the senior escalation point for complex operational incidents within accounts

Architecture & Solution Design

• Design end-to-end platform architectures spanning cloud, hybrid, and on-premises HPC environments

• Define workload isolation models, networking architectures, and storage strategies for multi-tenant platforms

• Recommend and validate technology choices aligned to client scale, budget, and team maturity

• Produce architecture decision records (ADRs), solution blueprints, and technical runbooks

Technical Competencies & Requirements

1. Architecture & System Design

• Design production-grade multi-cluster Kubernetes platforms:

◦ RKE2, EKS (AWS), AKS (Azure) at enterprise scale

◦ GPU-aware clusters: NVIDIA H100 / A100 / B200 node pools

◦ Hybrid cloud + on-premises HPC infrastructure

• Define and document:

◦ Workload isolation: namespaces, MIG partitioning, multi-tenancy models

◦ Networking: BGP peering, Ingress controllers, service mesh (Istio / Cilium)

◦ Storage: Longhorn, Ceph, distributed and high-throughput file systems

2. Platform Engineering & GitOps Strategy

• Define and enforce platform standards across the delivery lifecycle

• GitOps tooling: ArgoCD, Fleet — declarative cluster management

• CI/CD pipelines: Azure DevOps, Jenkins — build, test, promote

• Infrastructure as Code: Terraform (modules, remote state, workspaces), Ansible

• Standardize cluster bootstrapping, app deployment lifecycle, environment promotion (Dev → QA → Prod)

3. AI / GPU Infrastructure Architecture (Priority Competency)

• Design and operate GPU compute platforms at scale:

◦ GPU Operator deployment and lifecycle management

◦ MIG (Multi-Instance GPU) partitioning for multi-tenant workloads

◦ Advanced scheduling: Run:AI, Kubernetes-native GPU scheduling (device plugins)

• Understand AI workload classes and their infrastructure implications:

◦ Distributed training workloads (data/model/pipeline parallelism)

◦ Inference pipelines — NVIDIA Triton Inference Server, TensorRT optimization

• Align infrastructure to the full AI stack:

◦ CUDA stack, cuDNN, NCCL collective communication libraries

◦ High-speed networking: InfiniBand (HDR/NDR), RoCE for RDMA

◦ GPUDirect RDMA / GPUDirect Storage for low-latency data paths

4. Observability & Reliability Engineering

• Define and implement full-stack observability:

◦ Metrics: Prometheus, Thanos (long-term retention, multi-cluster)

◦ Logs: Loki, Fluent Bit

◦ GPU telemetry: DCGM Exporter, NVIDIA Nsight Systems

• Build operational frameworks:

◦ SLO / SLA definitions and error budget tracking

◦ Alerting strategy — noise reduction, severity routing

◦ Incident response playbooks and on-call runbooks

5. Security & Multi-Tenancy Architecture

• Design zero-trust security postures for multi-tenant platforms

• Secret management: HashiCorp Vault, External Secrets Operator

• Identity and access: IAM, RBAC, SSO/OIDC integration

• Network isolation: NetworkPolicy, micro-segmentation, mTLS

• Secure GPU sharing: MIG isolation, vGPU licensing, tenant boundary enforcement

6. HPC, Data & Storage Architecture (Priority Competency)

• Understand high-performance storage for AI/HPC workloads:

◦ GPUDirect Storage — bypassing CPU for GPU-native I/O

◦ Distributed file systems: Weka (high-throughput NFS/S3), Ceph (scalable object/block)

◦ Storage tiering, caching strategies, and data lifecycle management

• Size and validate storage architectures against workload I/O profiles

7. Operational Leadership & Linux Systems

• Lead incident response and root cause analysis (RCA) for critical production issues

• Define upgrade strategies, change management procedures, and disaster recovery plans

• Write and maintain runbooks, operational playbooks, and knowledge base content

• Integrate organizational processes, compliance requirements, and security policies into operational frameworks

• Deep Linux expertise:

◦ Kernel tuning (CPU governor, NUMA, IRQ affinity, hugepages)

◦ Storage I/O scheduling, NVMe optimization

◦ Network stack tuning for RDMA / InfiniBand

◦ System performance profiling and bottleneck analysis

Candidate Profile — Who You Are

• You are comfortable running production systems

• You have stronger SysOps and HPC depth than DevOps breadth, and you embrace that identity

• You can shift fluidly between running a live incident, presenting an architecture to a CTO, and reviewing a POC demo environment

• You communicate technical complexity clearly — to engineers and to C-level stakeholders

• You understand why specific tooling choices matter (not just how to configure them) and can articulate trade-offs in presales conversations

• You are comfortable owning outcomes across both commercial (presales) and delivery (operations) dimensions

• You thrive in ambiguity and can scope both short POCs and long-horizon platform programs

Requirements

Required

• 10+ years in platform/infrastructure engineering, with at least 2 years in an architect-level role

• Proven hands-on experience operating Kubernetes at scale in production (multi-cluster, multi-tenant)

• Significant Linux systems administration experience — kernel, networking, storage at a low level

• HPC and/or GPU infrastructure experience — physical GPU servers, NCCL, InfiniBand, or high-speed fabrics

• Demonstrable presales or client-facing experience

• IaC experience: Terraform and/or Ansible in production environments

• Strong understanding of GitOps and CI/CD pipelines in enterprise settings

Strongly Preferred

• Experience with NVIDIA GPU Operator, MIG partitioning, Run:AI, or equivalent GPU scheduling tooling

• Knowledge of distributed AI training infrastructure (PyTorch DDP, Horovod, DeepSpeed) from an infrastructure perspective

• Familiarity with NVIDIA Triton Inference Server or TensorRT deployment pipelines

• Experience with Weka, Ceph, or GPUDirect Storage in HPC/AI environments

• Hands-on experience with Vault, External Secrets, and zero-trust network architectures

• Exposure to bare-metal provisioning and HPC cluster management (Slurm, PBS, or equivalent)

Certifications (Advantageous)

• CKA / CKS (Certified Kubernetes Administrator / Security Specialist)

• RHCE / RHCA (Red Hat Certified Engineer / Architect)

• AWS Solutions Architect / Azure Solutions Architect Expert

• HashiCorp Terraform Associate or Vault Associate

• NVIDIA DLI certifications (GPU computing, AI infrastructure)

Skills

ADRs / Runbooks; Ansible Playbooks; CI/CD (Azure DevOps, Jenkins); CI/CD Pipelines In Enterprise; CUDA; cuDNN; GitOps (ArgoCD, Fleet); GPU/HPC Infrastructure; InfiniBand; Infrastructure-as-code (Terraform, Ansible); K8s Multi-cluster; Kubernetes; Linux Kernel Tuning; Linux Systems Administration; MIG Partitioning; NCCL; Networking (BGP, Ingress, Istio, Cilium); NVIDIA GPU Operator; NVIDIA Triton, TensorRT; Observability (Prometheus, Thanos, Loki, Fluent Bit); PyTorch DDP / Horovod / DeepSpeed; RBAC / IAM; RDMA; Run:AI; Slurm/PBS Familiarity; SRE Concepts (SLO/SLA, Incident Response); Storage (Ceph, Longhorn, GPUDirect Storage); TensorRT; Terraform Modules; Terraform Remote State; Vault | External Secrets; VM / Bare-metal Provisioning; Zero-trust Security
