Site Reliability Engineer at Klaviyo (Boston, MA)

Desired Skills and Experience

Design, write and deliver software to improve the availability, scalability, latency, and efficiency of Klaviyo’s services.
Perform quantitative analysis to understand high-impact events that break Klaviyo functionality and manage the cross-functional effort to resolve those events.
Solve problems relating to mission-critical services and build automation to prevent problem recurrence, with the goal of automating response to all non-exceptional service conditions.
Engage in service capacity planning and demand forecasting, software performance analysis and system tuning.
Uncover and advocate for preventative, upstream solutions with internal stakeholders and external vendors and dependencies
Confidently make informed, data-driven decisions in a fast-paced environment with competing priorities
Identify and drive opportunities to improve operational workflows
Conduct periodic on-call duties
Educate other Klaviyo engineers on the best practices for building and operating highly reliable systems
BA or BS degree in Computer Science, related field, or equivalent experience
Technical, engineering, or quantitative background
Proven experience with Linux (we run Ubuntu) and all layers of the networking stack. You should be confident administering and debugging production Linux systems
Experience working on team software projects
Experience in one or more of: Python, Ruby, Go.
Familiarity with running and scaling distributed software systems (load balancing, high availability, systems monitoring, etc.)
Expertise in designing, analyzing and troubleshooting high-traffic, large-scale distributed systems.
Understanding of Unix/Linux systems from kernel to shell and beyond, taking in system libraries, file systems, and client-server protocols along the way.
Experience with Amazon Web Services (AWS) or similar cloud compute offerings, and tools to make managing cloud workloads easier (Terraform, Packer, etc.)
Networking: knowledge and understanding of network theory, such as different protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI layers, and load balancing.
Experience with building and scaling highly-reliable distributed Python systems (we use Django extensively)
Experience with instrumenting and monitoring production systems (Nagios, StatsD/Graphite, APM, etc.)
Systematic problem solving approach, coupled with a strong sense of ownership and drive