Senior Site Reliability Engineer

Bitnami’s mission is to bring awesome software to everyone. Every month, more than 1 million developers come to our site to download and launch their favorite language runtimes and applications. The Site Reliability Engineering (SRE) team at Bitnami is responsible for the availability and performance of the infrastructure, as well as for partnering with the other engineering teams to successfully build, deploy and manage Bitnami’s services. The principles that drive how we approach SRE at Bitnami are:

If it’s repeatable, it can be automated; quality of life matters and must not be subsumed by toil
If it’s monitored, it can alert its owners; failures detected by humans before the systems detect them are second-degree failures
If it’s backed up, it can be restored; disasters must be recoverable
If it’s measured, it can be improved; when it fails, it’s a learning opportunity for that improvement

You must bring an understanding of the IT business (typically gained by having built or worked extensively with a private or public cloud), a broad perspective on the cloud industry and where it is headed, and experience building solutions that scale. Working with all of the major cloud providers (as well as those aspiring to become major), container hosting and orchestration services, and infrastructure provides challenges and opportunities rarely found elsewhere.

Responsibilities:

Create and/or provision reliable tools and infrastructure that enable rapid iteration among the product, research and development teams
Automate All The Things by eating, sleeping and breathing Infrastructure as Code
Monitor, measure and troubleshoot infrastructure and services
Participate in the 24x7 follow-the-sun (US/Europe) on-call rotation to ensure service SLAs are met
Optimize business continuity capabilities and drive down incident recovery times
Capacity planning and management

Requirements:

At least 5 years of experience deploying, monitoring and troubleshooting multi-tier SOA applications, Rails, Node.js and distributed systems at scale
Software development in any or all of these programming languages: Ruby, Go, Java, JavaScript and Python
A passion for automated provisioning (Ansible, Puppet, Chef, etc.) and instrumentation for status and trend monitoring (Icinga, Nagios, Graphite, Kibana, etc.)
Highly developed cloud literacy with strong knowledge of AWS, GCE and Azure
Broad experience with the Linux kernel and shell, TCP/IP and HTTP
Designing networks and systems for security, encryption, performance and agility
Backup and restoration automation, business continuity planning and testing

Nice to haves:

Database administration with MySQL replication and high availability
Networking and security best practices with software defined networks
Container orchestration with Kubernetes, Docker Swarm, and/or Mesos
Big data, streaming and search systems like Cassandra, Hadoop, Spark, Kafka and Elasticsearch

Benefits/Perks:

Competitive salary and stock options
100% covered medical, dental and vision benefits
Catered lunches and open snack and beverage policy
Flexible time-off policy; we believe everyone needs to recharge
Awesome commuter perks, generously subsidized Clipper card
Sweet setup: a huge monitor and your choice of operating system and hardware
Semi-annual trips to Spain
Monthly outings and fun events