Senior Site Reliability Engineer at Cisco Meraki (San Francisco, CA)

Desired Skills and Experience

Collecting metrics, crunching data and improving service monitoring to detect problems before they’re visible to our customers.
Building systems to automate our server lifecycle, from configuration management to server bootstrap and decommission.
Scaling our continuous deployment system to accommodate a rapidly growing team and increasing feature velocity without compromising stability.
Troubleshooting, performing root cause analysis, and resolving production issues from the application and network layers all the way down to the system level. This might include anything from digging into source code (our own or from open source projects) and hunting memory leaks to tracing bottlenecks in upstream networks and optimizing database queries.
Advising other development teams as they build new products so that those products are scalable, maintainable, and performant.
Have 5+ years of experience across a mix of software development and systems administration roles.
Script or code in one or two languages such as Ruby, Scala, Python, or Bash. You are comfortable digging into other people’s source code in search of the root cause of a problem, and you automate all the things.
Care about the customer experience. You have experience supporting an externally-facing production environment.
Have experience on a pager rotation where you responded to escalations quickly to minimize customer downtime. This role requires being part of a one-week-in-eight on-call rotation.
Believe in the Unix way. You build large systems out of small components that each do one job and do it well. We run Debian.
Are familiar with logging and monitoring tools such as Graphite, Grafana, Logstash, Elasticsearch, statsd, and collectd.