Site Reliability Engineers (SREs) at Dubizzle ensure that all our services are healthy, monitored, automated, and designed to scale. You’ll use your background as an operations generalist to work closely with our development teams, from the early stages of design all the way through identifying and resolving production issues. You will support a wide range of products, focusing on automation, availability, performance, and above all reliability.

Your Responsibilities:

  • Serve as a primary point of contact responsible for the overall health, performance, and capacity of our production environment.
  • Take part in the design and implementation of the systems used to operate Dubizzle, with a focus on automation and maintainability at large scale.
  • Collaborate with the development team on operations-related issues, providing support and acting as a stakeholder.
  • Develop tools to effectively monitor custom applications in a large-scale environment.
  • Troubleshoot issues across the entire stack: hardware, software, application, and network.
  • Migrate applications off legacy environments with minimal downtime.
  • Take part in a shared 24x7 on-call rotation.
  • Document system design and procedures.

Desired Skills and Experience

  • 3+ years of experience working with large-scale web applications.
  • Experience with cloud infrastructure such as Amazon Web Services or Rackspace Cloud.
  • Expertise in Linux server administration.
  • Demonstrated programming skills in one or more of: Python, Perl, Scala, Erlang.
  • Fluency with standard networking protocols, such as DNS, DHCP, LDAP, NIS, PXE, SMTP, ICMP, and NTP.
  • Familiarity with systems management tools (Puppet, Ansible, Chef).
  • Experience with open source projects such as Mesos, Hadoop, Scribe, or ZooKeeper is a plus.
  • Willingness to work with the open source community and apply the latest best practices and lessons learned to our infrastructure.