Job Description:

As a Site Reliability Engineer, you will build, evolve, test, and operate the infrastructure automation platform that powers our Cloud services. You will ensure that our lab and production clouds operate and perform optimally, and that software is released and deployed in an efficient, streamlined manner, from development, through extensive testing, all the way to production hand-off and support. This is a hands-on DevOps role with a balanced amount of tool and infrastructure development, including advanced scripting and automation. You will support our internal infrastructure, as well as provide managed services support, product development, and support for the entire stack of a cloud-based service offering.

Success in this role requires very strong system administration skills, an aptitude for distributed systems, and attention to minute details. You need exemplary network, systems, and code-level troubleshooting abilities. You are expected to analyze complex system behaviors and performance problems, and to trace issues across multiple systems. The SRE works as a first responder and is ultimately responsible for ensuring our cloud infrastructure services are up and running.

Responsibilities:

  • Build and run global object storage
  • Operate and deploy cloud services and related projects from development to production
  • Develop automation, processes, and tools designed to make this process simpler and more robust
  • Bridge Engineering and core shared operations services
  • Participate in troubleshooting, capacity planning and analysis, and performance analysis activities
  • Advise management on service onboarding strategies and execution
  • Mentor team members on areas of subject-matter expertise 

Desired Skills and Experience

Requirements:

  • BA/BS in Computer Science preferred, or equivalent experience
  • 5+ years of experience in a highly complex technical operations environment
  • At least 2 years of experience with Linux/Unix systems administration
  • Hands-on operational experience in a high-volume or critical production service environment
  • Experience with distributed systems, capacity planning, and continuous deployment (EMC Atmos preferred)
  • Solid scripting skills; Ruby experience is a big plus (Perl, C, Python helpful)
  • SaltStack preferred
  • ECS experience preferred
  • OpenStack preferred
  • VMware preferred
  • Expertise in IP networking, including familiarity with the functionality, operation, and failure modes of the network (iptables, HAProxy, VPN, TCP/IP, HTTP)
  • Proven technical troubleshooting and performance tuning experience, especially in a virtual (VMware) environment
  • DevOps and software development experience (ability to write code in an operations setting)
  • Ability to handle periodic on-call duty as well as spider-sense awareness of services’ health
  • Ability to work in a team environment

You will work with and learn:

  • Atmos/Object Storage (Swift, etc.)
  • OpenStack
  • vSphere/VCD
  • Hadoop
  • NoSQL
  • Cassandra
  • Postgres
  • SLES/SUSE Linux
  • Load Balancing with Zeus/Riverbed
  • Globally distributed High Availability Software as a Service

Extensive experience with, and/or willingness to learn, any of the following is a plus:

1. Operating Systems

  • Linux (RHEL, SLES/SUSE, CentOS, Ubuntu, Debian)
  • Unix (Solaris, AIX, HP-UX, etc.)
  • Windows
  • Mac OS X

2. Infrastructure as a Service

  • Amazon Web Services
  • Rackspace
  • Cloud Foundry
  • Azure
  • OpenStack

3. Virtualization Platforms

  • VMware
  • KVM
  • Xen
  • VirtualBox
  • Vagrant

4. Containerization Tools

  • LXC
  • Solaris Containers
  • Docker

5. Linux OS Installation

  • Kickstart
  • Cobbler
  • FAI

6. Configuration Management

  • Puppet / MCollective
  • Chef
  • Ansible
  • CFEngine
  • SaltStack
  • RANCID
  • Ubuntu Juju

7. Test and Build Systems

  • Jenkins
  • Maven
  • Ant
  • Gradle

8. Application Deployment

  • Capistrano

9. Application Servers

  • JBoss
  • Tomcat
  • Jetty
  • GlassFish
  • WebSphere
  • WebLogic

10. Web Servers

  • nginx
  • Apache
  • IIS

11. Queues, Caches, etc.

  • ActiveMQ
  • RabbitMQ
  • memcache
  • varnish
  • squid

12. Databases

  • Percona Server
  • MySQL
  • PostgreSQL
  • OpenLDAP
  • MongoDB
  • Cassandra
  • Redis
  • Oracle
  • MS SQL

13. Monitoring, Alerting, and Trending

  • Zabbix
  • Treasure
  • New Relic
  • Nagios
  • Icinga
  • Graphite
  • Ganglia
  • Cacti
  • PagerDuty
  • Sensu

14. Logging

  • PaperTrail
  • Logstash
  • Loggly
  • Splunk
  • SumoLogic

15. Process Supervisors

  • Monit
  • runit
  • Supervisor
  • god
  • Blue Pill
  • Upstart
  • systemd

16. Security

  • Snorby
  • Threat Stack
  • Tripwire
  • Snort

17. Miscellaneous Tools

  • Multihost SSH Wrapper
  • Code Climate
  • iPerf
  • lldpd