You will join the team responsible for maintaining and growing our AWS-based infrastructure, ensuring that our sites are available and meet our internal and external service levels. We accomplish this by working closely with the development teams before and after deployment. You and the team have the authority to design and implement the network, hardware, and systems we use to provide our services. The SRE team designs and leads our monitoring and alerting systems. You will share on-call responsibilities with the SRE team.
Desired Skills and Experience
Qualifications Required:
Bachelor’s degree in computer science or a related field, or equivalent experience
At least 3 years of experience in site reliability engineering or system administration with Unix (we use Linux and FreeBSD)
3 or more years of experience with a scripting language other than shell

Qualifications Desired:
Develop tools to improve our rapid deployment and monitoring systems
Collaborate with developers and product management to identify and define service level agreements for our sites
Participate in the 24x7 on-call rotation for the SRE teams
Work with Development and SRE teams on software and system performance analysis and tuning, demand forecasts, and capacity planning
Participate in post-mortems so we can avoid repeating our mistakes
Professional software development experience
MySQL or PostgreSQL experience
Experience with a NoSQL system such as Cassandra, Couchbase, or MongoDB
Experience with an AMQP message broker such as RabbitMQ or ActiveMQ
Knowledge of HTTP and caching
Experience with automated deployment systems such as Docker and configuration management systems such as Puppet and Chef
Experience with highly scalable, high-availability microservice architectures
Experience with Cassandra, AWS, RabbitMQ, and Python is a plus