Senior Site Reliability Engineer
Bitnami’s mission is to bring awesome software to everyone. Every month, 1MM+ developers come to our site to download and launch their favorite language runtimes and applications. The Site Reliability Engineering (SRE) team at Bitnami is responsible for the availability and performance of the infrastructure, as well as partnering with the other engineering teams to successfully build, deploy and manage Bitnami’s services. The principles that drive how we approach SRE at Bitnami are:
- If it’s repeatable, it can be automated; quality of life matters, it must not be subsumed by toil
- If it’s monitored, it can alert its owners; failures detected by humans ahead of systems are second-degree failures
- If it’s backed up, it can be restored; disasters must be recoverable
- If it’s measured, it can be improved; when it fails, it’s a learning opportunity for that improvement
You must bring an understanding of the IT business (typically gained by having built or worked extensively with a private or public cloud); a broad perspective of the cloud industry and where it is headed; and experience in building solutions that scale. Working with all of the major cloud providers, as well as the ones aspiring to be major, and with container hosting and orchestration services and infrastructure, provides challenges and opportunities rarely found elsewhere.
Responsibilities:
- Create and/or provision reliable tools and infrastructure that enable rapid iteration among the product, research and development teams
- Automate All The Things by eating, sleeping and breathing Infrastructure as Code
- Monitor, measure and troubleshoot infrastructure and services
- Participate in the 24x7 follow-the-sun (US/Europe) on-call rotation to ensure service SLAs are met
- Optimize business continuity capabilities and drive down incident recovery times
- Capacity planning and management
Requirements:
- At least 5 years of experience deploying, monitoring and troubleshooting multi-tier SOA applications, Rails, Node.js and distributed systems at scale
- Software development with any or all of these programming languages: Ruby, Go, Java, JavaScript and Python
- A passion for automated provisioning (Ansible, Puppet, Chef, etc.) and instrumentation for status and trend monitoring (Icinga, Nagios, Graphite, Kibana, etc.)
- Highly developed cloud literacy with strong knowledge of AWS, GCE and Azure
- Broad experience with Linux kernel and shell, TCP/IP and HTTP
- Designing networks and systems for security, encryption, performance and agility
- Backup and restoration automation, business continuity planning and testing
Nice to haves:
- Database administration with MySQL replication and high availability
- Networking and security best practices with software defined networks
- Container orchestration with Kubernetes, Docker Swarm, and/or Mesos
- Big data, streaming and search systems like Cassandra, Hadoop, Spark, Kafka and ElasticSearch
Benefits/Perks:
- Competitive salary and stock options
- 100% fully covered Medical, Dental, Vision benefits
- Catered lunches and open snack and beverage policy
- Flexible time-off policy; we believe everyone needs to recharge
- Awesome commuter perks, generously subsidized Clipper card
- Sweet set-up, huge monitor and your choice of operating system and hardware
- Semi-annual trips to Spain
- Monthly outings and fun events
Desired Skills and Experience
See application page for details