Desired Skills and Experience

  • Design, write and deliver software to improve the availability, scalability, latency, and efficiency of Klaviyo’s services. 
  • Perform quantitative analysis to understand high-impact events that break Klaviyo functionality, and manage the cross-functional effort to resolve those events 
  • Solve problems relating to mission-critical services and build automation to prevent problem recurrence, with the goal of automating response to all non-exceptional service conditions. 
  • Engage in service capacity planning and demand forecasting, software performance analysis and system tuning. 
  • Uncover and advocate for preventative, upstream solutions with internal stakeholders and external vendors and dependencies 
  • Confidently make informed, data-driven decisions in a fast-paced environment with competing priorities 
  • Identify and drive opportunities to improve operational workflows 
  • Perform periodic on-call duties 
  • Educate other Klaviyo engineers on the best practices for building and operating highly reliable systems
  • BA or BS Degree in Computer Science, related field, or equivalent experience 
  • Technical, Engineering or Quantitative background 
  • Proven experience with Linux (we run Ubuntu) and all layers of the networking stack. You should be confident administering and debugging production Linux systems 
  • Experience working on team software projects 
  • Experience in one or more of: Python, Ruby, Go. 
  • Familiarity with running and scaling distributed software systems (load balancing, high availability, systems monitoring, etc.)
  • Expertise in designing, analyzing and troubleshooting high-traffic, large-scale distributed systems. 
  • Understanding of Unix/Linux systems from kernel to shell and beyond, taking in system libraries, file systems, and client-server protocols along the way. 
  • Experience with Amazon Web Services (AWS) or similar cloud compute offerings, and tools to make managing cloud workloads easier (Terraform, Packer, etc.)
  • Networking: knowledge and understanding of network theory, such as different protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI layers, and load balancing. 
  • Experience building and scaling highly reliable distributed Python systems (we use Django extensively) 
  • Experience with instrumenting and monitoring production systems (Nagios, Statsd/Graphite, APM, etc.) 
  • Systematic problem-solving approach, coupled with a strong sense of ownership and drive