ValGenesis

Posted 3 months ago

Open

Site Reliability Engineer - SaaSOps

Hyderabad | Hybrid | Full-time

AI Summary

SRE role focusing on building reliability for ValGenesis SaaS platform, defining SLAs/SLIs/SLOs, incident response, RCA postmortems, and observability across Azure and on-prem hybrid deployments.

About this role

About ValGenesis
ValGenesis is a leading digital validation platform provider for life sciences companies. The ValGenesis suite of products is used by 30 of the top 50 global pharmaceutical and biotech companies to achieve digital transformation, total compliance, and manufacturing excellence/intelligence across their product lifecycle.
Learn more about working for ValGenesis, the de facto standard for paperless validation in Life Sciences: https://www.valgenesis.com/about

About the Role:

Responsibilities:

  • Define and embed SRE best practices across the SaaS platform, ensuring reliability is built into the system from the ground up.
  • Establish and maintain meaningful SLAs, SLIs, SLOs, and error budgets to protect customer experience and guide engineering priorities.
  • Design and continuously improve high-availability and disaster recovery strategies.
  • Automate manual processes, manage incident response, and optimize performance (SLI/SLO).
  • Bridge the gap between development and IT operations.
  • Ensure strong tenant isolation and consistent performance within a DB-per-tenant architecture.
  • Strengthen system resiliency across both Azure and on-prem deployments in our hybrid environment.
  • Lead incident response efforts with structured troubleshooting and clear communication.
  • Drive thorough root cause analysis (RCA) and conduct blameless postmortems focused on long-term improvements.
  • Translate incidents into systemic fixes rather than temporary patches.
  • Develop and maintain operational runbooks to standardize responses.
  • Design and maintain a comprehensive observability framework for both cloud and on-prem environments.

Requirements:

  • Must have at least 3 years of hands-on experience in Site Reliability Engineering (SRE), supporting production-grade, cloud-native enterprise software platforms/applications.
  • Prior experience as a DevOps engineer, cloud system administrator or software developer.
  • Strong proficiency in scripting languages such as Python and PowerShell.
  • Deep hands-on experience working with Microsoft Azure in production environments.
  • Possess a solid understanding of Terraform, Ansible, and Kubernetes internals, including networking, scheduling, scaling, and resource management.
  • Have proven experience in PostgreSQL performance tuning and optimization in production systems.
  • Demonstrate hands-on experience with Azure Monitor, Application Insights, and Log Analytics for cloud-based observability.
  • Implement and manage Prometheus and Grafana for Kubernetes and on-prem monitoring.
  • Understand how to turn metrics, logs, and traces into actionable insights that improve reliability and performance.
  • Troubleshoot and improve CI/CD pipelines to ensure stable and predictable releases.
  • Apply GitOps principles to manage deployments and infrastructure changes in a controlled and auditable manner.

Skills

  • Ansible
  • Application Insights
  • Azure Monitor
  • Blameless Postmortems
  • CI/CD
  • DB-per-tenant Architecture
  • Disaster Recovery
  • GitOps
  • Grafana
  • Incident Response
  • Kubernetes Networking, Scheduling, Scaling, Resource Management
  • Log Analytics
  • Observability Framework
  • PostgreSQL Performance Tuning
  • PowerShell
  • Prometheus
  • Python
  • SLA/SLI/SLO Definition
  • Tenant Isolation
  • Terraform
