The Senior Site Reliability Engineer (SRE) ensures the reliability, availability, performance, and operability of production systems across our platforms by applying software engineering practices to operations, with a focus on automation, observability, and incident response.
- Own and improve the reliability, availability, and performance of production services in Google Cloud (GCP).
- Participate in incident management, including detection, triage, mitigation, escalation, and recovery.
- Use and improve incident workflows and tooling (e.g., ServiceNow) to ensure clear ownership and timely communication.
- Design, implement, and operate observability solutions including monitoring, logging, tracing, synthetics, and dashboards (e.g., Splunk Observability, OpenTelemetry).
- Reduce operational toil through automation and engineering-led solutions, proactively introducing and driving SRE best practices.
- Support on-call rotations across multiple time zones, contributing to a sustainable 24/7 support model.
- Define, monitor, and report SLIs, SLOs, and error budgets for critical services.
- Drive and be accountable for best-in-class service availability through SRE principles, automation, and proactive reliability engineering.
- Bachelor’s degree in Computer Science, Information Technology, or a related field.
- Extensive experience with cloud-native concepts and technologies, with a strong preference for Google Cloud Platform (GCP) and Kubernetes (GKE).
- Proven experience with Site Reliability Engineering and production incident management, ideally using platforms such as ServiceNow.
- Experience with monitoring and observability tools, including metrics, logs, traces, and synthetics (e.g., Splunk Observability, OpenTelemetry).
- Exposure to reliability testing, resilience engineering, or cost optimisation initiatives.
- Excellent analytical and problem-solving skills, with the ability to diagnose complex production issues quickly.
- Software development or automation experience using Python, shell scripts, or similar languages.
- Hands-on experience operating production cloud infrastructure at scale.
- Experience managing multi-region, high-availability production systems with a focus on scalability, resilience, and minimising service disruption during failures.
- Proficiency in the Microsoft Office suite.
- Show an ownership mindset in everything you do: be a curious, proactive problem solver who takes action and seeks ways to collaborate and connect with people and teams to drive success.
- Maintain a continuous growth mindset: keep learning through social experiences and relationships with stakeholders, experts, colleagues, and mentors, and broaden your competencies through structured courses and programs.
- Where applicable, fluency in English and languages relevant to the working market.
We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are made by humans. If you would like more information about how your data is processed, please visit https://bit.ly/3LMn4CQ.