
Posted 2 months ago
Systems Reliability Engineer
AI Summary
Senior Systems Reliability Engineer focused on designing and operating large-scale, multi-cloud infrastructure with emphasis on observability, incident response, and IaC automation.
About this role
About Us
At Arkenstone Defense, we empower defense tech startups with the tools, infrastructure, and compliance solutions they need to become successful prime contractors. Our mission is to remove barriers and help innovators grow, from day one to becoming a trusted prime for the U.S. Government.
We're early, we're lean, and we're building something that actually matters. The people who do well here aren't waiting to be told what to do; they see a gap and fill it.
What You’ll Do
Design, implement, and own the infrastructure reliability strategy across AWS, Azure, and GCP
Champion observability by developing and maintaining effective logging, monitoring, and alerting systems
Lead efforts in performance tuning, system hardening, capacity planning, and disaster recovery
Own the incident management lifecycle: from detection to postmortem and root cause analysis
Automate deployment, scaling, and recovery workflows to reduce manual toil
Contribute to infrastructure as code (Terraform, ARM templates, CloudFormation, etc.)
Act as a mentor and technical leader to junior engineers and cross-functional partners
Other Duties
Perform any other related duties as required or assigned
Who You Are
You drive a culture of accountability, ownership, and continuous improvement
You thrive on building meaningful relationships and helping others succeed
You understand the unique challenges that defense tech startups face, and can speak their language
Requirements
5+ years of experience in SRE, DevOps, or infrastructure engineering roles
Proven track record of operating large-scale systems in multi-cloud environments
Strong knowledge of cloud-native architecture, container orchestration (e.g., Kubernetes), and CI/CD pipelines
Proficient in scripting (Python, Bash, etc.) and infrastructure automation tools
Experience with monitoring/observability platforms (e.g., Prometheus, Grafana, Datadog, ELK)
Excellent problem-solving skills and a bias toward ownership and action
Comfortable making decisions under pressure and leading through incidents
Working knowledge of FedRAMP or NIST 800-53 controls preferred
Comfortable participating in customer discussions
Clear communicator who can translate technical concepts to mixed audiences
Benefits of Working with Us
We are committed to supporting our employees both professionally and personally. Our robust benefits package is designed to promote your well-being, growth, and work-life balance:
Competitive Salary: Attractive compensation that recognizes your hard work and rewards excellence.
Health and Wellness Programs: Medical, dental, and vision insurance options, along with mental health support and wellness initiatives.
Retirement Planning: Secure your future with our flexible 401(k) plan and matching company contributions.
Paid Time Off & Holidays: Generous PTO, sick leave, and holiday pay to help you recharge and enjoy life outside of work.
Employee Assistance Program: Confidential resources for personal and professional support.
Professional Development: Access to training, certifications, and continuing education to foster your career growth.
Additional Information
We are an Equal Opportunity Employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex (including pregnancy, gender identity, and sexual orientation), national origin, age, disability, genetic information, veteran status, or any other characteristic protected under applicable law.