Site Reliability Engineers (SREs) at ID.me fill the mission-critical role of ensuring that our complex systems are healthy, monitored, automated, and designed to scale. You will use your background as an operations generalist to work closely with our engineering team, from the early stages of design all the way through identifying and resolving production issues. The ideal candidate is passionate about an operations role that requires deep knowledge of both the application and the product, and believes that automation is a key component of operating large-scale systems.

Responsibilities:

  • Serve as a primary point person responsible for the overall health, performance, and capacity of one or more of our Internet-facing services
  • Administer complex custom applications running on UNIX/Linux, Ruby on Rails, and Postgres
  • Assist in the rollout and deployment of new product features and installations to facilitate our rapid iteration and constant growth
  • Develop tools to improve our ability to rapidly deploy and effectively monitor custom applications in a large-scale UNIX/Linux environment
  • Work closely with development teams to ensure that platforms are designed with “operability” in mind
  • Function well in a fast-paced, rapidly changing environment
  • Participate in a 24x7 rotation for second-tier escalations

Preferred Qualifications:

  • Strong interpersonal communication skills (listening, speaking, and writing) and the ability to work well in a diverse, team-focused environment with other SREs, application developers, and others
  • UNIX/Linux systems knowledge and administration background
  • Troubleshooting skills that span systems, network (TCP/IP), and code
  • Solid experience using configuration management frameworks (e.g., Chef, Puppet)
  • Firm grasp of at least one modern programming language (e.g., Ruby, Perl, Python) beyond basic scripting
  • Knowledge of the following and related topics:
      • Data structures, relational and non-relational databases
      • Linux internals, filesystems
      • Networking
      • Web/application servers such as Apache and Nginx
      • Centralized logging and related topics
  • Experience running an NMS (Nagios, Zabbix, Zenoss, Hyperic, or similar)
  • Experience administering continuous deployment infrastructure

Desired Skills and Experience:

See the application page for details.