Site Reliability Engineer at Klaviyo (Boston, MA)

About Klaviyo

Email will no longer suck. Klaviyo is using technology to reinvent email marketing from the ground up. For too long marketing has been pretty selfish. Companies blast everyone with the same email, not because they think that’s what customers want, but because they don’t have a choice.

Klaviyo fundamentally changes the game. Our software ties together everything companies know about their customers, processes massive quantities of data in real time, and uses it to drive a platform that’s part CRM, part analytics engine and 100% an email platform that gets results.

Technology has come a long way in the last 15 years, but email hasn’t really changed. We’re here to fix that.

About the Role

Site Reliability Engineering (SRE) is essentially what you get when you treat system operations as a software problem. The mission of the Site Reliability Engineering team is to ensure uninterrupted service for Klaviyo customers and to act as a force multiplier for Klaviyo product teams, helping them deliver better software faster.

Klaviyo is a high-growth, technology-driven company that is passionate about the user experience of its application and the well-orchestrated operation of its service infrastructure. The SRE team works on its own initiatives to build foundational backend services, and also builds tooling and automation that allow product teams to release and scale their software predictably.

SREs are team players and embed themselves within product teams to advance the architecture and performance of software systems and to train their peers in topics such as debugging distributed systems, building self-healing capabilities or eking out every drop of performance possible.

As a Site Reliability Engineer, you will have ownership of foundational Klaviyo services and a big impact on our product teams. Klaviyo’s infrastructure, event processing, and team have grown 300% year over year, so there are always new skills to learn and technical challenges to solve the right way.

This position is full-time and based in Boston.

Responsibilities

Design, write and deliver software to improve the availability, scalability, latency, and efficiency of Klaviyo’s services.
Perform quantitative analysis to understand high-impact events that break Klaviyo functionality and manage the cross-functional effort to resolve those events.
Solve problems relating to mission-critical services and build automation to prevent problem recurrence, with the goal of automating the response to all non-exceptional service conditions.
Engage in service capacity planning and demand forecasting, software performance analysis and system tuning.
Uncover and advocate for preventative, upstream solutions with internal stakeholders and external vendors and dependencies.
Confidently make informed, data-driven decisions in a fast-paced environment with competing priorities.
Identify and drive opportunities to improve operational workflows.
Conduct periodic on-call duties.
Educate other Klaviyo engineers on the best practices for building and operating highly reliable systems.

Desired Skills and Experience

BA or BS degree in Computer Science, related field, or equivalent experience
Technical, Engineering or Quantitative background
Proven experience with Linux (we run Ubuntu) and all layers of the networking stack. You should be confident administering and debugging production Linux systems
Experience working on team software projects
Experience in one or more of: Python, Ruby, Go.
Familiarity with running and scaling distributed software systems (load balancing, high availability, systems monitoring, etc.)
Expertise in designing, analyzing and troubleshooting high-traffic, large-scale distributed systems.
Understanding of Unix/Linux systems from kernel to shell and beyond, taking in system libraries, file systems, and client-server protocols along the way.
Experience with Amazon Web Services (AWS) or similar cloud compute offerings, and tools to make managing cloud workloads easier (Terraform, Packer, etc.)
Networking: knowledge and understanding of network theory, such as different protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI layers, and load balancing.
Experience with building and scaling highly reliable distributed Python systems (we use Django extensively)
Experience with instrumenting and monitoring production systems (Nagios, StatsD/Graphite, APM, etc.)
Systematic problem solving approach, coupled with a strong sense of ownership and drive