Description: We are looking for a very well-rounded, experienced Site Reliability Engineer to lead a team of SREs dedicated to improving the reliability of our client's end-to-end platform. The Site Reliability Engineer will dive deep into gnarly operational issues, from the programming, systems, automation, and process perspectives. He/she will understand the challenges around rapidly creating, scaling, and managing distributed applications and services, and will be able to collaborate with talented engineers across multiple disciplines to address those challenges.

Responsibilities

  • Perform deep dives into both systemic and latent reliability issues; partner with software and systems engineers across the organization to produce and roll out fixes
  • Set direction for, mentor, and manage a team of Site Reliability Engineers
  • Troubleshoot issues across the entire stack: hardware, software, application and network
  • Drive standardization efforts across multiple disciplines and services
  • Mentor SREs across the organization on best practices for everything from monitoring to troubleshooting complex code issues
  • Identify and drive opportunities to improve automation for the company; scope and create automation for deployment, management and visibility of our services
  • Participate in code reviews for projects primarily written in Java, built on open source libraries, and running on both physical and virtualized platforms
  • Represent the SRE/Service Operations organization in design reviews and operational readiness exercises for new and existing services

Requirements

  • Experience leading and training a team of 3+ SRE or Operations Engineers
  • Solid understanding of systems and application design, including the operational trade-offs of various designs
  • Practical knowledge of various aspects of service design, including messaging protocols & behavior, caching strategies and software design practices
  • Demonstrable knowledge of TCP/IP, HTTP, web application security, and experience supporting multi-tier web application architectures
  • Must work well with and be able to influence myriad personalities at all levels
  • Solid knowledge of shell scripting and at least one scripting language
  • Solid knowledge of automation tools, such as Puppet, Chef, Ansible, or Salt, in a production environment
  • Minimum of 7 years managing services in an internet-scale *nix environment
  • Ability to prioritize tasks and assign priorities
  • Must be adaptable and able to focus on the simplest, most efficient & reliable solutions

Desired Skills and Experience

See application page for details