Senior Site Reliability Engineer

Job Description

You will fill the mission-critical role of ensuring that our complex, web-scale systems are healthy, monitored, automated and designed to scale. You'll use your background as an operations generalist to work closely with our development teams, from the early stages of design all the way through to identifying and resolving production issues.

Our Site Reliability Engineers are the primary interface between our developers and our production operations. No matter how many times we get searched, scraped, scanned, spammed, pinged, paged or queried, they need to respond calmly and professionally, keep their cool, and keep the site running smoothly. You'll work in both the dev and systems worlds, instrumenting key parts of core architecture and supporting devs as they try to do the same. We're looking for a true hacker: you'll work as much in bash as in Ruby. You'll implement monitoring and alerting systems to support site stability and performance. You'll proactively scale our infrastructure to meet ever-increasing demand. You'll make sure that when something goes bump in the night, someone hears it. And you'll play a key role in keeping the Illumio ASP fast, available and growing.

What You’ll Do

Work closely with developers in supporting new features and services
Monitor site stability and performance
Scale infrastructure to meet demand
Troubleshoot site issues
Develop custom tools as necessary
Document system design and procedures
Serve as a primary point of contact responsible for the overall health, performance, and capacity of our internet-facing systems
Gain deep knowledge of our complex applications
Assist in the roll-out and deployment of new product features and installations to facilitate our rapid iteration and constant growth
Develop tools to improve our ability to rapidly deploy and effectively monitor custom applications in a large-scale UNIX environment
Work closely with development teams to ensure that platforms are designed with 'operability' in mind
Participate in a 24x7 rotation for primary-tier escalations

What We Need

Qualifications

Expertise in Linux or Unix
3+ years of prior experience in an internet-facing technical operations role
Command of your favorite modern programming language: Python, Ruby, Java, C++, etc.
Solid understanding of fundamental technologies such as TCP/IP and HTTP
Knowledge of best practices related to security, performance, and disaster recovery
Experience with web server configuration, monitoring, trending, network design, and high availability
Excellent communication skills
Ability to pick up new software, frameworks and APIs quickly
A sense of humor!

Bonus Points

PostgreSQL experience (high availability, scale-out replication)
Advanced knowledge of network security design within various public cloud providers
Advanced knowledge of Chef and Ansible configuration management tools

Desired Skills and Experience

See application page for details