The Site Reliability Engineer's main responsibility is to ensure site reliability, resiliency, and performance. Aside from engineering new tools and automation, the Site Reliability Engineer is also responsible for facilitating communication, collaboration, and integration between hosting operations and software development.

Responsibilities:

  • Promote DevOps culture across the organization
  • Be the champion of site resiliency, reliability, and robustness while balancing developer requirements
  • Develop tools that help us build better insight into the performance of our applications
  • Master various application and infrastructure monitoring tools
  • Drive team level initiatives for continuous technical and functional improvement
  • Provide technical input on architecture, design, and implementation changes
  • Conduct performance evaluations of new technology upgrades and critical software/hardware changes
  • Take part in selecting our next-generation real user monitoring tool
  • Write and deploy automation to maintain our deployments across multiple datacenters and environments

Desired Skills and Experience

  • Strong understanding of a programming language (Python, C#, Java, etc.)
  • Bachelor’s degree in Computer Science or a related software engineering discipline preferred, or an equivalent number of years of practical experience
  • 4+ years of experience in software development, hosting operations, or QA
  • The following experience is a plus:
      o Continuous integration tools (such as Jenkins, Travis-CI, Go)
      o Infrastructure provisioning tools (such as Chef, Puppet, Automic, Ansible)
      o VMware virtualization
      o Monitoring tools (such as Extrahop, Application Dynamics, BMC software)
      o Big Data: Splunk, ELK (Elasticsearch, Logstash, Kibana)
      o Experience administering application servers, servlet containers, and web servers (WebSphere, Apache Tomcat, Microsoft IIS, etc.)
      o Windows and Linux operating systems administration and tuning experience
      o Understanding of test automation principles and automation framework design
  • Strong programming skills in Python and demonstrated ability to apply these skills to solve complex problems.
  • Excellent communication and documentation skills
  • Critical thinking and problem solving
  • Comfort with collaboration and open communication across teams
  • Ability to drive change
  • Thrive on “keeping the lights on” for our global production system (Cloud Services) and related deployments, constantly finding ways to improve reliability and uptime
  • Occasional on-call duty (every 6-8 weeks)