Sr Site Reliability Engineer II - REMOTE

  • Location: Portland, Oregon
  • Type: Direct Hire
  • Job #731

We are searching for an experienced software or systems engineer interested in ensuring the stability, reliability, and performance of our SaaS assessment platform. The Site Reliability Engineering (SRE) team is a hybrid development and operations team within our Technical Operations group. The team's primary focus is to maintain the integrity and performance of our continuously integrated platform by understanding the relationships between infrastructure-layer code, functional application pools, network and software-defined load balancers, MongoDB, SQL, and PostgreSQL databases, message queues, and data caches, all running in a mixture of private data centers and content delivery networks. We use a mixture of monitoring tools to identify and mitigate potential client-affecting issues.

As a member of the SRE team, you must have the desire to troubleshoot complex technical problems, and you must work effectively with a wide variety of internal technical teams and vendors, as well as internal non-technical teams. Strong verbal and written communication skills are a must, as you will be collaborating with others to diagnose, resolve, and communicate issues as efficiently as possible. You must be self-driven and able to look at problems in new and different ways to find solutions. You will look for ways to apply scripting and automation to improve existing tasks and procedures.

  • Experience with cloud computing platforms (AWS, GCP, etc.); GCP preferred
  • Minimum of 4 years of progressive experience in a software development environment at high-growth technology companies
  • Experience with web development of large-scale distributed systems and complex application architectures is a big plus
  • Able to perform root cause analysis (RCA), with experience diagnosing and resolving infrastructure issues
  • Experience building CI/CD pipelines using common tools and practices such as Jenkins and GitOps
  • Experience using Terraform for infrastructure as code (IaC)
  • Experience using Ansible or a similar configuration management solution
  • Experienced user of Linux-based operating systems
  • Exposure to programming/scripting (e.g. Python, Bash, Go)
  • Broad experience with cloud, monitoring, and infrastructure-as-code technologies (e.g. Datadog, Terraform) is desirable

Responsibilities

  • Address, resolve and perform root cause analysis on all support escalations
  • Analyze the current state of the application and infrastructure, designing appropriate solutions and working with teams to implement them
  • Coordinate emergency responses, perform root cause analysis, and identify and implement solutions to prevent recurrences
  • Work with the Operations and Software Engineering teams to identify ways to increase MTBF and lower MTTR for the environment
  • Review entire application stack and execute initiatives to reduce failures, defects and issues with overall performance
  • Review code base and make recommendations for improving performance
  • Collaborate with quality assurance engineers to assist in resolution of software defects
  • Collaborate with project architects and project lead developers to prove the validity of new software technologies
  • Engage and help improve our software development methodology
  • Perform other duties as assigned to ensure the success of the team and the entire organization
  • Identify and work with the team to implement more efficient system procedures
  • Maintain environment monitoring systems to provide the best visibility into the state of the production environment
  • Maintain performance analysis tools, identify any negative changes to performance, and work with the teams to resolve them
  • Research industry trends and technologies, and promote adoption of best-in-class tools and technologies
  • Take initiative to advance the quality, performance, or scalability of our applications by influencing the architecture or design of our products
  • Design, develop and execute automated tests to validate solutions and environments
  • Troubleshoot issues across the entire stack – hardware, software, application and network
  • Follow system resource utilization trends and identify capacity planning needs
  • Participate in regular meetings, both within the team and across teams, to discuss previous accomplishments, upcoming goals, and any roadblocks in the way

Skills and Abilities

  • Experience with managing and configuring logging and monitoring systems
  • Experienced in software network design and implementation
  • Experience with container orchestration platforms such as Kubernetes is highly desirable
  • Enjoys solving complex technical problems
  • Minimum Bachelor’s degree in computer science or a related field; equivalent combinations of experience and education will be considered in lieu of a degree
  • Strong written and oral communication skills
  • Minimum of 4 years of experience working as a Systems Engineer or an equivalent position
  • Firm understanding of networking concepts and technologies
  • Firm understanding of SQL and NoSQL database concepts
  • Experience in fault tolerant system design

About Us

Catapult Recruiting was founded by a group of seasoned IT professionals who are native to the Portland area and love Oregon.

Contact Us

6107 SW Murray Blvd, Unit 269
Beaverton, OR 97008
(503) 970-3111
talent@CatapultRecruiting.com