Site Reliability Engineer - Operations
Make Tumblr fast, reliable and available for hundreds of millions of users globally. As an SRE-Ops Engineer, you are a software developer and systems maven with a love of highly performant, fault-tolerant, massively distributed systems.
What You’ll Do:
- Manage the availability, scalability and performance of Tumblr platforms
- Create the tools and infrastructure leveraged by the rest of the Tumblr engineering teams
- Diagnose and repair network, application, and hardware bottlenecks
- Test and tune network, hardware, and software configurations to maximize performance
- Deploy and manage monitoring and diagnostic tools
- Guide our product and platform teams to keep new features fast and stable
- Front-line defense on a daily rotation 10am-6pm (approximately 1 day per week)
- Front-line defense on a weekly overnight rotation 12am-10am (1 week at a time, no daytime work during rotation)
What We’re Looking For:
- Hunger to solve the problem. No stone left unturned while searching for the solution!
- Experience in troubleshooting large-scale distributed systems
- Experience scaling high-traffic web sites
- Experience with Unix systems administration, including solid scripting skills
- Experience in data structures and algorithms
- Experience and willingness to perform on-call duties
- Smarts, humility, and equal willingness to learn and teach
- A sense of ownership, initiative, and drive
Tools We Like:
- Nginx, Varnish and HAProxy
- Memcached and Redis
- MySQL (InnoDB)
- Puppet
- git and GitHub
- Ruby, Go, Scala, PHP
- Asynchronous services and queues like Oozie and Gearman
- Hadoop, Pig, ZooKeeper, and other Java/JVM projects
- Nagios, Icinga2, PagerDuty, OpenTSDB
- OpenStack, Docker, Mesos
Desired Skills and Experience
See application page for details