Site Reliability Engineer

Description: We are looking for a very well-rounded, experienced Site Reliability Engineer to lead a team of SREs dedicated to improving the reliability of our client's end-to-end platform. The Site Reliability Engineer will dive deep into gnarly operational issues, from the programming, systems, automation, and process perspectives. He/she will understand the challenges around rapidly creating, scaling, and managing distributed applications and services, and will be able to collaborate with talented engineers across multiple disciplines to address those challenges.

Responsibilities

Perform deep dives into both systemic and latent reliability issues; partner with software and systems engineers across the organization to produce and roll out fixes
Set direction for, mentor, and manage a team of Site Reliability Engineers
Troubleshoot issues across the entire stack: hardware, software, application and network
Drive standardization efforts across multiple disciplines and services
Mentor SREs across the organization on best practices for everything from monitoring to troubleshooting complex code issues
Identify and drive opportunities to improve automation for the company; scope and create automation for deployment, management and visibility of our services
Participate in code reviews for projects primarily written in Java, built on open source libraries, and running on both physical and virtualized platforms
Represent the SRE/Service Operations organization in design reviews and operational readiness exercises for new and existing services

Requirements

Experience leading and training a team of 3+ SREs or Operations Engineers
Solid understanding of systems and application design, including the operational trade-offs of various designs
Practical knowledge of various aspects of service design, including messaging protocols & behavior, caching strategies and software design practices
Demonstrable knowledge of TCP/IP, HTTP, web application security, and experience supporting multi-tier web application architectures
Must work well with and be able to influence myriad personalities at all levels
Solid knowledge of shell scripting and at least one scripting language
Solid knowledge of automation tools such as Puppet, Chef, Ansible, or Salt in a production environment
Minimum of 7 years managing services in an internet-scale *nix environment
Ability to prioritize tasks and set priorities for the team
Must be adaptable and able to focus on the simplest, most efficient & reliable solutions

Desired Skills and Experience

See application page for details