Site Reliability Engineer - Core Infrastructure Systems

Who We Are:

SREs work on improving the availability, scalability, performance and reliability of Twitter’s production services. Come join us!

What You’ll Do:

As a member of the organization you will be dedicated to improving the reliability of our end-to-end platform. Your work will integrate directly with Twitter’s products. Our core infrastructure receives hundreds of millions of tweets per day and serves tens of billions of API requests. We also serve over 2 billion search queries per day, render hundreds of millions of ad impressions, and process hundreds of terabytes of log and interaction data daily.

You will dive deep into gnarly operational issues from the software, systems, automation, and process perspectives. You will understand the challenges around integrating disparate infrastructures into a new facility, processes and procedures. You will work with open-source technologies and the SRE community. You will actively participate in the vision to move away from tasks with a high operational cost, such as break/fix, cluster migrations, new service buildouts, abuse, etc. You will contribute to services that can shrink and expand based on demand, self-heal, roll out automatically, etc.

The Core Infrastructure Systems Site Reliability Engineer (CISS) team handles Twitter’s internal core infrastructure services (in-house provisioning engineering stack, DNS, Puppet, LDAP, Kerberos, etc.) that enable our large clusters of servers to operate effectively and reliably. This team’s mission is to enhance infrastructure effectiveness and increase efficiency for all core services used by the various platform and infrastructure teams at Twitter. The CISS team is responsible for the systems it owns end to end: architecting, developing, deploying, and running them, and keeping them performing and reliable. As such, this team requires a unique mix of development and systems-oriented skills, and offers many opportunities to grow in different domains of software and systems engineering.

Your responsibilities include but are not limited to:

Understand how the different core infrastructure systems come together to enable provisioning engineering at Twitter, and help keep all of the infrastructure running.
Meet with customers and partners in engineering to gather feedback, iterate on requirements, and align with the team and company mission and objectives.
Conceptualize, architect and develop systems and features for enabling core infrastructure engineering effectiveness: ease of use and ease of maintenance of these systems.
Develop solutions that enable automated workflows in line with Twitter’s core engineering principles, so that all engineers can use the core infrastructure systems to achieve their objectives.
Ensure the reliability of the existing core infrastructure systems, guaranteeing 99.99% uptime while maintaining SLAs that guarantee low latencies across the systems.
Perform deep dives into both systemic and latent reliability issues; partner with software, systems and security engineers across the organization to produce and roll out fixes.
Troubleshoot issues across the entire stack: hardware, software, application and network.
Provide project and technical leadership to the team, keeping in mind the above responsibilities.
Help the team to deliver on projects that align with the mission of CISS.
Consolidate and report team / product metrics for visibility to the rest of the company.
Measure, maintain, and report appropriate SLAs across products to internal and external partners.
Coordinate with customers and partner teams when required to help unblock teams and achieve consensus across teams to deliver on engineering-wide objectives.
Manage and prioritize the pipeline of new projects against paying down technical debt.

Who You Are:

Solid understanding of systems and application design, including the operational trade-offs of various designs.
Practical knowledge of various aspects of service design like messaging protocols & behavior, caching strategies and software design practices.
Practical, solid knowledge of shell scripting and at least one higher-level language (Python or Ruby preferred).
Demonstrable knowledge of TCP/IP, HTTP, web application security, and experience supporting multi-tier web application architectures.
Expert-level understanding of Linux servers, specifically RHEL/CentOS.
Comfortable configuring DNS, DHCP, and LAN/WAN technologies.
At least 3 years of experience handling services in a large-scale environment.
Work well with and be able to influence a myriad of personalities at all levels.
Ability to prioritize tasks and work independently.
Be adaptable and able to focus on the simplest, most efficient & reliable solutions.
Track record of successful practical problem solving, excellent written and social communication, and documentation skills.
You have an understanding of DNS, LDAP, Puppet, Kerberos, Subversion, Git and Jenkins.
You have experience in leading and training a team of 3+ Engineers.

Desired:

Practical experience in Python and Ruby.
Ability to lead technical teams through design and implementation across an organization.
Experience with existing open source projects such as Scribe and Apache Mesos.
B.S. in Computer Science or related field.

We are committed to an inclusive and diverse Twitter. Twitter is an equal opportunity employer. We do not discriminate based on race, ethnicity, color, ancestry, national origin, religion, sex, sexual orientation, gender identity, age, disability, veteran status, genetic information, marital status or any other legally protected status.

San Francisco applicants: Pursuant to the San Francisco Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.
