Site Reliability Manager

  • Location: Denver, Colorado
  • Type: Direct Hire
  • Job #765

As a Site Reliability Manager, your mission will be to ensure our services are fast, highly available, and scalable.

The Site Reliability Manager leads a team of centralized specialists and engineers who are embedded with product software engineering teams. This team will provide specialized knowledge, practices, and tools to support the continuous improvement of reliability engineering.

  • Accomplished leader with 5+ years managing regional and global areas.
  • Experience with AWS, New Relic, observability (black box monitoring), Kafka, Terraform, Kubernetes.

The Skills and Experience

  • The successful candidate will possess an outstanding record of professional experience and will thrive in an environment that demands accountability. They must possess significant technology management and product development experience. They must also have strong planning, organizational, and communication skills, and be a key driver in helping the team understand the big picture.
  • Proven leader of technology solutions in a high-volume transaction environment.
  • Have excellent time management, communication, decision-making, presentation, and organizational skills.
  • Maintain excellent written and verbal communications with clients, employees, and the management chain, including status reports, project plans, presentations, etc.
  • Ability to lead across functions and motivate staff.

Leadership Responsibilities

  • Evangelize SRE practices with development, operations, and product groups to align technology service/solution delivery.
  • Drive quality accountability within the organization with well-defined processes, metrics, and goals for process quality. This includes leading effective postmortems and maintaining transparency into action items.
  • Manage the availability, latency, scalability, and efficiency of company applications by instilling reliability engineering into our development life cycle, with a focus on fault-tolerant approaches.
  • Drive capacity planning, performance analysis, instrumentation, and other non-functional systems requirements.
  • Define and report progress on strategic initiatives and project-level tasks to all stakeholders, including senior executives and clients, using communication approaches suited to each constituency.
  • Implement metrics-driven processes to ensure service quality targets are met.
  • Our ideal candidate has a deep understanding of modern site reliability practices, has a strong customer focus, and is an enthusiastic change agent. You know how to set and evangelize the vision and how to work in an agile SaaS development cycle. Your coworkers recommend you as someone who made a positive difference in their careers. We work with low walls between teams and disciplines because we believe good ideas are everywhere and the best products come from teams who are working to a shared goal of building and operating performant, reliable, and scalable systems.
  • This is a unique opportunity to lead the evolution of modern reliability practices for an established leader in the freight industry. As a thought leader across the organization, you will influence and mentor teams to ensure that our systems are fast and reliable.

What You’ll Do

  • Understanding of how to influence peers and other leaders to build a culture around reliability and transparency
  • Strong management skills, with a servant leadership mindset.
  • Strong knowledge of all aspects of designing, developing, and managing large real-time systems.
  • Project and process management
  • Prior successful experience as a site reliability engineer.
  • Mastery of fault-tolerant approaches in large-scale distributed environments and high-performance systems.
  • Demonstrated experience working in large, complex systems environments.
  • Deep understanding of internet and networking protocols.
  • A passion for performance excellence and robustness, and an engineering mindset.
  • Define how code gets into production (define release engineering processes, review artifacts, own and improve CI/CD pipelines)
  • Improve the tempo of releases by automating manual processes that prevent software delivery from being repeatable
  • Introduce progressive delivery mechanics such as canary deployments
  • Manage reliability (set SLAs, SLOs, and SLIs to improve the performance of an application)
  • Flag behaviors or outputs that violate reliability focus areas

About Us

Catapult Recruiting was founded by a group of seasoned IT professionals who are native to the Portland area and love Oregon.

Contact Us

6107 SW Murray Blvd, Unit 269
Beaverton, OR 97008
(503) 970-3111