Desired Skills and Experience
- Design, write and deliver software to improve the availability, scalability, latency, and efficiency of Klaviyo’s services.
- Perform quantitative analysis to understand high-impact events that break Klaviyo functionality, and manage the cross-functional effort to resolve those events.
- Solve problems relating to mission-critical services and build automation to prevent problem recurrence, with the goal of automating the response to all non-exceptional service conditions.
- Engage in service capacity planning and demand forecasting, software performance analysis and system tuning.
- Uncover and advocate for preventative, upstream solutions with internal stakeholders and external vendors and dependencies
- Confidently make informed, data-driven decisions in a fast-paced environment with competing priorities
- Identify and drive opportunities to improve operational workflows
- Conduct periodic on-call duties
- Educate other Klaviyo engineers on the best practices for building and operating highly reliable systems
- BA or BS Degree in Computer Science, related field, or equivalent experience
- Technical, Engineering or Quantitative background
- Proven experience with Linux (we run Ubuntu) and all layers of the networking stack. You should be confident administering and debugging production Linux systems
- Experience working on team software projects
- Experience in one or more of: Python, Ruby, Go.
- Familiarity with running and scaling distributed software systems (load balancing, high availability, systems monitoring, etc.)
- Expertise in designing, analyzing and troubleshooting high-traffic, large-scale distributed systems.
- Understanding of Unix/Linux systems from kernel to shell and beyond, taking in system libraries, file systems, and client-server protocols along the way.
- Experience with Amazon Web Services (AWS) or similar cloud compute offerings, and tools to make managing cloud workloads easier (Terraform, Packer, etc.)
- Networking: knowledge and understanding of network theory, including protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI layers, and load balancing.
- Experience with building and scaling highly-reliable distributed Python systems (we use Django extensively)
- Experience with instrumenting and monitoring production systems (Nagios, Statsd/Graphite, APM, etc.)
- Systematic problem solving approach, coupled with a strong sense of ownership and drive