Senior Site Reliability Engineer
Job Description
You will fill the mission-critical role of ensuring that our complex, web-scale systems are healthy, monitored, automated and designed to scale. You'll use your background as an operations generalist to work closely with our development teams, from the early stages of design all the way through to identifying and resolving production issues.
Our Site Reliability Engineers are the primary interface between our developers and our production operations. No matter how many times we get searched, scraped, scanned, spammed, pinged, paged or queried, they need to respond calmly and professionally, keep their cool, and keep the site running smoothly. You'll work in both the dev and systems worlds, instrumenting key parts of our core architecture and supporting devs as they do the same. We're looking for a true hacker: you'll work as much in bash as in Ruby. You'll implement monitoring and alerting systems to support site stability and performance. You'll proactively scale our infrastructure to meet ever-increasing demand. You'll make sure that when something goes bump in the night, someone hears it. And you'll play a key role in keeping the Illumio ASP fast, available and growing.
What You’ll Do
- Work closely with developers in supporting new features and services
- Monitor site stability and performance
- Scale infrastructure to meet demand
- Troubleshoot site issues
- Develop custom tools as necessary
- Document system design and procedures
- Serve as a primary point of contact responsible for the overall health, performance, and capacity of our internet-facing systems
- Gain deep knowledge of our complex applications
- Assist in the roll-out and deployment of new product features and installations to facilitate our rapid iteration and constant growth
- Develop tools to improve our ability to rapidly deploy and effectively monitor custom applications in a large-scale UNIX environment
- Work closely with development teams to ensure that platforms are designed with 'operability' in mind
- Participate in a 24x7 rotation for primary-tier escalations
What We Need
Qualifications
- Expertise in Linux or Unix
- 3+ years of prior experience in an internet-facing technical operations role
- Command of your favorite modern programming language: Python, Ruby, Java, C++, etc.
- Solid understanding of fundamental technologies like TCP/IP and HTTP
- Knowledge of best practices related to security, performance, and disaster recovery
- Experience with web server configuration, monitoring, trending, network design, and high availability
- Excellent communication skills
- Ability to pick up new software, frameworks and APIs quickly
- A sense of humor!
Bonus Points
- PostgreSQL experience (high availability, scale-out replication)
- Advanced knowledge of network security design within various public cloud providers
- Advanced knowledge of Chef and Ansible configuration management tools
Desired Skills and Experience
See application page for details