Site Reliability Engineer

Make Tumblr fast, reliable and available for hundreds of millions of users all over the world. As a site reliability engineer, you are a software developer with a love of highly performant, fault-tolerant, massively distributed systems.

What You’ll Do:

Manage the availability, scalability and performance of Tumblr platforms
Create the tools and infrastructure leveraged by the rest of the Tumblr engineering teams
Diagnose and repair network, application, and hardware bottlenecks
Test and tune network, hardware, and software configurations to maximize performance
Deploy and manage monitoring and diagnostic tools
Guide our product and platform teams to keep new features fast and stable

What We’re Looking For:

Experience scaling high-traffic web sites
Experience with Unix systems administration, including solid scripting skills in Ruby, PHP or Python
Expertise in data structures and algorithms
Expertise in troubleshooting large-scale distributed systems
Smarts, humility, and equal willingness to learn and teach
A sense of ownership, initiative, and drive

Tools We Like:

Nginx, Varnish and HAProxy
Memcached and Redis
MySQL (InnoDB)
Puppet
PHP5 at its furthest extent
git and GitHub
Ruby, Scala and PHP
Asynchronous services and queues
Hadoop, Pig, ZooKeeper, and other Java/JVM projects
Nagios/Icinga, OpenTSDB

Desired Skills and Experience

See application page for details