We are seeking a Senior Site Reliability Engineer to join our fast-growing Professional team. We are searching for an experienced software or systems engineer interested in ensuring the stability, reliability, and performance of our SaaS assessment platform. The Site Reliability Engineering (SRE) team is a hybrid development and operations team within our Technical Operations group. The primary focus of the SRE team is to maintain the integrity and performance of our continuously integrated platform by understanding the relationships between infrastructure-layer code, functional application pools, network and software-defined load balancers, Mongo, SQL and PostgreSQL databases, message queues, and data caches, all running in a mixture of private data centers and Content Delivery Networks. We use a mixture of monitoring tools to identify and mitigate potential client-affecting issues.
As a member of the SRE team, you must have the desire to troubleshoot complex technical problems, and you must work effectively with a wide variety of internal technical teams and vendors, as well as internal non-technical teams. Strong verbal and written communication skills are a must, as you will be collaborating with others to diagnose, resolve, and communicate issues as efficiently as possible. You must be self-driven and able to look at problems in new and different ways to find solutions. You will look for ways to implement scripting and automation to improve existing tasks and procedures.
How will you contribute?
- Execute with modern container and cloud-native best practices
- Manage day-to-day operations of our SaaS on-prem platform, ensuring the health and performance of the platform.
- Creatively solve problems in the DevOps space, collaborating with Development, DBA, and QA team members
- Communicate and coordinate effectively with Product, Customer Success and Integration teams on operations tasks and deployments.
- Listen to our internal customers/teams, understand their pain points, and coach/mentor them to work smarter
- Document decisions regarding technology choices, best practices and process flow
- Help create and manage continuous integration systems.
- Mentor and up-level other SRE team members and champion best practices within the team.
- Automate builds and deployments across multi-platform environments
What will you bring?
- Strong experience working on a high-volume SaaS application managed with modern Infrastructure-as-Code methodologies/tooling.
- Experience with container technologies and orchestration platforms (Docker, Kubernetes, Rancher, Cloud Foundry)
- Experience with running/managing systems and services on a cloud platform (AWS, Google Cloud, Azure)
- Experience managing and using CI/CD systems (Flux, CircleCI, Concourse, Jenkins, Travis CI)
- Strong experience working with configuration management tools (Terraform, Ansible, Puppet)
- Experience deploying and/or operating a centralized logging system (ELK stack or Splunk)
- Experience working with monitoring and observability tools (Datadog, New Relic, Grafana)
- Familiarity with relational databases (MSSQL or MySQL)
- Background working in a multi-platform environment (Linux, Windows)
- Familiarity with Agile/Scrum/Kanban methodologies
- Strong knowledge of programming/scripting languages (e.g., Python, Bash, PowerShell, Go). Software Engineers looking to get into SRE/DevOps are encouraged to apply.
Education and Experience
- Ownership and collaboration, with an attitude of helping others succeed.
- Strong interpersonal skills
- A can-do attitude and sense of urgency for a high-growth, fast-paced environment
- Proven track record of owning a complex technical project from inception to completion.
- BS in Computer Science or equivalent experience
- Curious mind, wanting to learn new technologies and share with others.
- The ability to think outside the box to resolve issues and create long-lasting solutions
- Firm understanding of networking concepts and technologies
- Firm understanding of SQL and NoSQL database concepts
- Experience in fault tolerant system design