Site Reliability Engineer

Site Reliability Engineers (SREs) at ID.me fill the mission-critical role of ensuring that our complex systems are healthy, monitored, automated, and designed to scale. You will use your background as an operations generalist to work closely with our engineering team from the early stages of design all the way through identifying and resolving production issues. The ideal candidate is passionate about an operations role that demands deep knowledge of both the application and the product, and believes that automation is a key component of operating large-scale systems.

Responsibilities:

Serve as a primary point person responsible for the overall health, performance, and capacity of one or more of our Internet-facing services
Administer complex custom applications running on UNIX/Linux, Ruby on Rails, and Postgres
Assist in the roll-out and deployment of new product features and installations to facilitate our rapid iteration and constant growth
Develop tools to improve our ability to rapidly deploy and effectively monitor custom applications in a large-scale UNIX/Linux environment
Work closely with development teams to ensure that platforms are designed with “operability” in mind
Function well in a fast-paced, rapidly changing environment
Participate in a 24x7 rotation for second-tier escalations

Preferred Qualifications:

Strong interpersonal communication skills (including listening, speaking, and writing) and ability to work well in a diverse, team-focused environment with other SREs, application developers, etc.
UNIX/Linux systems knowledge/administration background
Troubleshooting skills that span systems, network (TCP/IP), and code
Solid experience using configuration management frameworks (e.g., Chef, Puppet)
Firm grasp of at least one modern programming language (e.g., Ruby, Perl, Python) beyond basic scripting
Knowledge of the following and related topics:
Data structures, relational and non-relational databases
Linux internals, filesystems
Networking
Web/application servers such as Apache and Nginx
Centralized logging and related topics
Experience running a network monitoring system (Nagios, Zabbix, Zenoss, Hyperic, or similar)
Experience administering continuous deployment infrastructure

Desired Skills and Experience:

See application page for details