Our client is seeking a Senior Manager of Site Reliability Engineering to lead a hybrid team responsible for operating the company platform. This includes the design, implementation, automation, and support of our systems. You will work closely with our business-critical engineering teams and lead our engineers and developers to ensure all systems are well engineered, using practices such as SLOs, error budgets, actionable alerts, retrospectives, and end-to-end ownership.
- Manage a hybrid team of local and remote employees supporting our platforms and systems.
- Draw on a strong track record of implementing and maintaining SRE practices such as SLOs, resilience, scaling, and performance.
- Troubleshoot high-priority incidents and facilitate blameless post-mortems.
- Develop and track KPIs related to our platforms and report them to upper management.
- Build self-healing and resiliency patterns and drive their adoption.
- Align engineering development requirements with the capabilities of the site infrastructure.
- Grow the team by recruiting new talent and coaching existing talent.
Skills & Requirements:
- Bachelor's degree in Computer Science, Information Systems, or equivalent.
- 10+ years of software engineering or DevOps/SRE experience.
- 5+ years managing teams of various sizes across multiple time zones.
- 3-5 years of experience managing cloud systems, cost, and reporting (AWS preferred).
- Strong communication, negotiation, and collaboration skills; ability to convey complex, emergent problems in a way that lets us focus on solving the root cause of a problem rather than its symptoms.
- Proven ability to make difficult decisions about what to prioritize and what to ignore. You identify where you can make the most beneficial impact for our customers and continue working until that benefit has been achieved.
- Excellent programming skills in Java/Python.
- Demonstrated experience building monitoring and reporting on the current status and progress of initiatives.
- Demonstrated experience automating and improving existing deployment and configuration processes.
- Experience with and understanding of CI/CD tools (Jenkins, Ansible, CloudFormation preferred).
- Familiarity with modern observability tools (Grafana, Prometheus, or equivalent).