Site Reliability Engineer

Make Tumblr fast, reliable, and available for hundreds of millions of users globally. As an SRE-Ops Engineer, you are a software developer and systems maven with a love of highly performant, fault-tolerant, massively distributed systems.

What You’ll Do:

Manage the availability, scalability and performance of Tumblr platforms
Create the tools and infrastructure leveraged by the rest of the Tumblr engineering teams
Diagnose and repair network, application, and hardware bottlenecks
Test and tune network, hardware, and software configurations to maximize performance
Deploy and manage monitoring and diagnostic tools
Guide our product and platform teams to keep new features fast and stable
Serve as front-line defense on a daily rotation, 10am-6pm (approximately 1 day per week)
Serve as front-line defense on a weekly overnight rotation, 12am-10am (1 week at a time, no daytime work during rotation)

What We’re Looking For:

Hunger to solve the problem. No stone left unturned while searching for the solution!
Experience in troubleshooting large-scale distributed systems
Experience scaling high-traffic web sites
Experience with Unix systems administration, including solid scripting skills
Experience in data structures and algorithms
Experience and willingness to perform on-call duties
Smarts, humility, and equal willingness to learn and teach
A sense of ownership, initiative, and drive

Tools We Like:

Nginx, Varnish and HAProxy
Memcached and Redis
MySQL (InnoDB)
Puppet
Git and GitHub
Ruby, Go, Scala, PHP
Asynchronous services and queues like Oozie and Gearman
Hadoop, Pig, ZooKeeper, and other Java/JVM projects
Nagios, Icinga 2, PagerDuty, OpenTSDB
OpenStack, Docker, Mesos