Senior Site Reliability Engineer

If you are passionate about technologies and design approaches that disrupt incumbent education players, like to leverage open source, tooling, automation, and agile concepts, and want to work at scale, we may have just the job for you. Site Reliability Engineering (SRE) is not a new discipline, but at Knewton we treat it as one of the most important roles within our Technology organization. As an SRE at Knewton, you will be responsible for running our production services and will work closely with developers to ensure the reliability, scalability, and performance of the next generation of systems.

We’re looking for highly motivated and talented engineers for the Performance Engineering team. Come join Knewton, a proven leader in the adaptive learning space, and be part of a winning team that is responsible for running our production services.

Responsibilities of the Senior Site Reliability Engineer:

Develop software to help drive opportunities to improve automation for deployment, management, tooling and visibility within the engineering team
Work cross-functionally amongst a variety of teams and be a core contributor in every significant engineering solution that we deliver to our stakeholders
Develop a deep understanding of the various services and applications that come together to deliver Knewton’s services
Augment existing instrumentation to build a cohesive picture of the characteristics of our systems with special attention to points of failure
Design new tools and smart alerts that help discover failures and issues in a timely fashion, and work with engineers to identify root causes and fix them
Perform code reviews, evaluate implementations, and provide feedback about potential tool improvements
Define and evangelize cloud-related optimizations and best practices to improve reliability and performance

Basic Qualifications:

Demonstrable knowledge of TCP/IP, HTTP, distributed systems, and experience supporting multi-tier web application architectures in a web scale environment
Solid understanding of application design, including the operational trade-offs of various designs
Solid understanding of Unix/Linux internals
Familiarity with cloud infrastructure such as AWS
Experience with container technologies and orchestration layers (Docker, Vagrant, Mesos, Marathon, etc.)
Demonstrated coding skills, preferably in Python
Excellent analytical skills, coupled with a strong sense of ownership, urgency, and drive
Experience with queuing/data-pipelining solutions (Kafka, Storm, RabbitMQ, Amazon Kinesis, ZeroMQ, etc.)
Minimum of 5 years of experience in production service troubleshooting that spans applications, systems, and networks

Desired Skills and Experience

See application page for details