Manager, Site Reliability Engineering

OpenX, a leading provider of digital advertising technology, seeks a Manager of Site Reliability Engineering to lead an SRE team responsible for the uptime, efficiency, and performance of our systems and services.

Our ideal candidate seeks out the interesting projects and problems that come with high scale and rapid growth. We are proud of our lean approach to systems architecture and seek a candidate with a similar mindset. You should have experience with large-scale management of thousands of physical servers, request rates in the hundreds of thousands per second, and data measured in petabytes.

If You Agree With The Following Statements, You Might Be a Good Fit For This Position

Servers should be managed like cattle, not pets.
Expensive logos such as Dell, HP, or Oracle should be avoided whenever possible.
DevOps is a culture, not a title or department.
Simplicity is the ultimate sophistication.

This position requires a self-starter, able to function and lead with minimal supervision. Hands-on technical ability is required, balanced with experience managing and leading people.

Developing for and supporting our infrastructure presents many interesting technical challenges. We especially desire candidates with a passion for open-source software and an interest in the latest system architecture trends, for example: Docker, Mesos, Kubernetes, or other bare-metal-abstraction solutions.

Responsibilities Include

Lead a team, providing vision, direction, and context, and continuously develop yourself, the team, and individual team members
Be a hands-on technical manager
Design, implement, and support high-performance, highly available services and infrastructure
Improve the efficiency and flexibility of our datacenters
Build and maintain models for growth and capacity planning
Own the day-to-day health, uptime, monitoring, and reliability of services and server infrastructure
Work closely with engineering, project management, and operations peers to develop innovative technical tools and solutions
Organize and manage multiple simultaneous projects
Practice and enforce Agile and Scrum methodologies

Requirements

Demonstrable experience as a Site Reliability Engineer, Systems Administrator, Software Engineer, or in an equivalent role
Demonstrable experience leading teams
Demonstrable experience working in a high-volume, large-deployment, multi-datacenter environment
Ability to program in at least one language other than Bash, ideally Python or Perl, though Ruby, C/C++, Java, or others are acceptable
Strong knowledge of core protocols and technologies such as TCP/IP, HTTP, DNS, load balancers, distributed file systems, and key-value and relational databases
Extensive experience with configuration management tools such as Puppet, Chef, Salt, or Ansible
Experience with specific software such as Hadoop, Kafka, Spark, HBase, Riak, and similar technologies is desirable, but the ability to quickly and enthusiastically learn new technology is most important
Ability to take part in technical deep dives into code, networking, systems, and storage with very bright, experienced engineers
Must be willing to occasionally travel to domestic and international datacenter or office locations and be willing to participate in a 24/7 on-call rotation alongside your team members

Desired Skills and Experience

See application page for details