About the Opportunity

We have openings for Site Reliability Engineers in both our NY and Chicago offices. Grubhub is changing the way we get work done. We’re assembled into small, entrepreneurial teams that are designed to be self-sufficient. As a Site Reliability Engineer on one of these teams, you’ll provide the supporting infrastructure, infrastructure automation, and systems design knowledge to the team. You’ll be supporting your team’s development, pre-production, and production environments.

Our SREs participate in stand-up and sprint planning meetings, while also spending 20% of their time on common systems-related projects.

Some Challenges You’ll Tackle

  • Create, maintain, own and operate your team’s services that support fundamental capabilities within Grubhub’s products.
  • Tackle some of the most challenging problems you can face developing high availability services in a distributed cloud environment that needs to scale exponentially.
  • Help evaluate and choose emerging technologies: new service protocols and architectures, self-healing capabilities, globally distributed caching, performance and code quality tooling, etc. Determine the right tool for the right task.
  • Manage / Lead a team of 2 to 3 direct reports

Tools We Work With:

  • Java for microservices
  • Cassandra
  • Docker (in production!)
  • Mesos and Marathon for job scheduling
  • Combination of AWS and our own hardware
  • Python and Fabric for automation and our CD pipeline
  • Jenkins for builds and task execution
  • Linux (CentOS and Ubuntu)
  • DataDog for metrics and alerting
  • Puppet

Desired Skills and Experience

  • Experience building complex distributed systems. In this role you are the one gravitating toward operational concerns of the team, focusing on reliability, performance, capacity planning and automation of everything.
  • Proficient in high-level scripting languages such as Python or Ruby (Python preferred)
  • Experience developing solutions leveraging Docker
  • Experience managing Linux (CentOS, Ubuntu) systems
  • Configuration management experience with Puppet, Chef, or Ansible
  • Building/implementing monitoring for network, server and application status
  • Experience with monitoring tools such as Graphite, Nagios, Datadog, Runscope
  • Experience with log aggregation systems using Splunk, Logstash, Loggly, Elasticsearch
  • Continuous integration, testing, and deployment using Git, Jenkins
  • Experience with relational databases (MySQL)
  • Experience with NoSQL databases (Cassandra, Couchbase, Mongo)
  • Experience with Hadoop (Cloudera, DataStax), Mahout and other big data platforms
  • Exceptional communication and troubleshooting skills.