SITE RELIABILITY ENGINEER

Job Description: As a Site Reliability Engineer, you will build, evolve, test and operate the infrastructure automation platform that powers our Cloud services. You will ensure that our lab and production clouds operate and perform optimally, and that software is released and deployed efficiently, from development, through extensive testing, all the way to production hand-off and support. This is a hands-on DevOps role with a balanced amount of tool and infrastructure development, including advanced scripting and automation. You will support our internal infrastructure, provide managed services support, contribute to product development, and support the entire stack for a cloud-based service offering. Success in this role requires very strong system administration skills, an aptitude for distributed systems and attention to minute details. You need exemplary network, systems and code-level troubleshooting abilities. You are expected to analyze complex system behaviors and performance problems, and to trace issues across multiple systems. The SRE works as a first responder and is ultimately responsible for ensuring our cloud infrastructure services are up and running.

Responsibilities:

Build and run global object storage
Operate and deploy cloud services and related projects from development to production
Develop automation, processes, and tools designed to make this process simpler and more robust
Bridge Engineering and core shared operations services
Participate in troubleshooting, capacity planning and analysis, and performance analysis activities
Advise management on service onboarding strategies and execution
Mentor team members on areas of subject-matter expertise

Desired Skills and Experience

Requirements:

BA/BS in Computer Science preferred, or equivalent experience
5+ years of experience in a highly-complex technical operations environment
At least 2 years of experience with Linux/Unix systems administration
Hands-on operational experience in a high-volume or critical production service environment
Experience with distributed systems, capacity planning, and continuous deployment (EMC Atmos preferred)
Solid scripting skills; Ruby experience is a big plus (Perl, C, Python helpful)
SaltStack preferred
ECS experience preferred
OpenStack preferred
VMware preferred
Expertise in IP networking, including familiarity with the functionality, operation, and failure modes of the network (iptables, haproxy, vpn, tcp/ip, http)
Proven technical troubleshooting and performance tuning experience, especially in a virtualized (VMware) environment
DevOps and Software Development experience (ability to code in an operations setting)
Ability to handle periodic on-call duty as well as spider-sense awareness of services’ health
Ability to work in a team environment

You will work with and learn:
Atmos/Object Storage (Swift, etc.)
OpenStack
vSphere/VCD
Hadoop
NoSQL
Cassandra
Postgres
SLES/SUSE Linux
Load Balancing with Zeus/Riverbed
Globally distributed High Availability Software as a Service

Extensive experience with, and/or willingness to learn, any of the following is a plus:

1. Operating Systems
Linux (RHEL, SLES/SUSE, CentOS, Ubuntu, Debian)
Unix (Solaris, AIX, HP/UX, etc.)
Windows
Mac OS X

2. Infrastructure as a Service
Amazon Web Services
Rackspace
Cloud Foundry
Azure
OpenStack

3. Virtualization Platforms
VMware
KVM
Xen
VirtualBox
Vagrant

4. Containerization Tools
LXC
Solaris Containers
Docker

5. Linux OS Installation
Kickstart
Cobbler
FAI

6. Configuration Management
Puppet / MCollective
Chef
Ansible
CFEngine
SaltStack
RANCID
Ubuntu Juju

7. Test and Build Systems
Jenkins
Maven
Ant
Gradle

8. Application Deployment
Capistrano

9. Application Servers
JBoss
Tomcat
Jetty
GlassFish
WebSphere
WebLogic

10. Web Servers
nginx
Apache
IIS

11. Queues, Caches, etc.
ActiveMQ
RabbitMQ
memcache
varnish
squid

12. Databases
Percona Server
MySQL
PostgreSQL
OpenLDAP
MongoDB
Cassandra
Redis
Oracle
MS SQL

13. Monitoring, Alerting, and Trending
Zabbix
Treasure
New Relic
Nagios
Icinga
Graphite
Ganglia
Cacti
PagerDuty
Sensu

14. Logging
PaperTrail
Logstash
Loggly
Splunk
SumoLogic

15. Process Supervisors
Monit
runit
Supervisor
god
Blue Pill
Upstart
systemd

16. Security
Snorby
Threat Stack
Tripwire
Snort

17. Miscellaneous Tools
Multihost SSH Wrapper
Code Climate
iPerf
lldpd