Manager, Site Reliability Engineering

OpenX, a leading provider of digital advertising technology, seeks a Manager of Site Reliability Engineering to lead an SRE team responsible for the uptime, efficiency, and performance of our systems and services.

Our ideal candidate seeks out the interesting projects and problems that come with high scale and rapid growth. We are proud of our lean approach to systems architecture and seek a candidate with a similar mindset. You should have experience with large-scale management of thousands of physical servers, request rates in the hundreds of thousands per second, and data measured in petabytes.

If You Agree With The Following Statements, You Might Be a Good Fit For This Position

  • Servers should be managed like cattle, not pets.
  • Expensive logos such as Dell, HP, or Oracle should be avoided whenever possible.
  • DevOps is a culture, not a title or department.
  • Simplicity is the ultimate sophistication.

This position requires a self-starter who can function and lead with minimal supervision. Hands-on technical ability is required, balanced with experience managing and leading people.

Developing for and supporting our infrastructure presents many interesting technical challenges. We especially desire candidates with a passion for open-source software and an interest in the latest system architecture trends, such as Docker, Mesos, Kubernetes, or other bare-metal-abstraction solutions.

Responsibilities Include

  • Lead a team, providing vision, direction, and context. Continuously develop yourself, the team, and individual team members
  • Be a hands-on technical manager
  • Design, implement, and support high-performance, highly available services and infrastructure
  • Improve the efficiency and flexibility of our datacenters
  • Build and maintain models for growth and capacity planning
  • Own the day-to-day health, uptime, monitoring, and reliability of services and server infrastructure
  • Work closely with engineering, project management, and operations peers to develop innovative technical tools and solutions
  • Organize and manage multiple simultaneous projects
  • Practice and enforce Agile and Scrum methodologies

Requirements

  • Demonstrable experience as a Site Reliability Engineer, Systems Administrator, Software Engineer, or in an equivalent role
  • Demonstrable experience leading teams
  • Demonstrable experience working in a high-volume, large-deployment, multi-datacenter environment
  • Ability to program in at least one language other than Bash; Python or Perl is ideal, but Ruby, C/C++, Java, or others are acceptable
  • Strong knowledge of core protocols and technologies such as TCP/IP, HTTP, DNS, load balancers, distributed file systems, and key-value and relational databases
  • Extensive experience with configuration management tools such as Puppet, Chef, Salt, or Ansible
  • Experience with specific software such as Hadoop, Kafka, Spark, HBase, Riak, and similar technologies is desirable, but the ability to quickly and enthusiastically learn new technology is most important
  • Ability to conduct technical deep dives into code, networking, systems, and storage with very bright, experienced engineers
  • Willingness to travel occasionally to domestic and international datacenter or office locations and to participate in a 24/7 on-call rotation alongside your team members

Desired Skills and Experience

See application page for details