Desired Skills and Experience
- Provide strong leadership on the team under the guidance of the Systems Manager/Architect
- Write tools and scripts to provide automation and self service solutions for ourselves and other teams
- Design new systems to support production services
- Install, configure and debug hardware and systems in our data center
- Creatively solve scale challenges regarding a rapidly expanding cloud environment
- Work with real hardware - Cisco UCS B & C series servers, SuperMicro Twin-Pro, storage (NAS and SAN), Mac-in-a-datacenter, custom appliances for mobile devices, load balancers, and beyond
- Help improve monitoring and identify key performance metrics
- Proactive R&D - discovering and implementing new tools, emerging technology, etc.
- Disaster recovery design, implementation, and maintenance
- Create NOC runbooks, procedures, documentation, and diagrams of the environments you manage
- Troubleshooting and resolution of server/network issues
- Help maintain hardware in Sauce’s colocation facilities
- Help build out new data centers around the globe
- Participation in 24x7 on-call rotation
- Optimize hardware and configuration for improving hypervisor performance
- Automate deployment of operating systems to bare metal servers
- Build and optimize an ELK cluster for our development team to monitor and analyze production system usage
- Able to execute on high-level goals independently and with cross-functional teams
- 8+ years of recent experience working as a Linux administrator/engineer at scale (hundreds of systems) and designing/deploying ‘highly available’ solutions
- 2+ years of recent professional experience designing, developing, and operating Configuration Management solutions such as Chef, Puppet, Salt (preferred), or Ansible (preferred) at scale
- Solid experience in Linux tuning, profiling, and monitoring
- Strong skills in at least one language: Python (preferred), Ruby, Bash
- Experience deploying/managing KVM/QEMU and LXC
- Experience with Kubernetes, Docker, and their ecosystems
- Experience managing day-to-day operations with Redis, Memcached
- Solid understanding of cloud/networking/distributed computing environment concepts, including TCP/IP connections, firewalls, VLANs, etc.
- Familiar with ZFS on Linux and storage appliances (iSCSI and NFS)
- Experience with and understanding of contemporary metrics, monitoring, and logging solutions, especially StatsD, Graphite, ELK, Splunk, Nagios, etc.
- Highly organized, able to multi-task, able to work individually, as well as within a team, and across teams
- Excellent communication skills, both verbal and written, across all user levels
- Deployment automation in physical and virtual environments (PXE, MAAS (preferred))
- Experience with InSpec or a similar tool for testing configuration management
- Working knowledge of load balancing technologies (hard/soft)
- Proven experience collaborating in a cross-functional team environment
- Familiarity with software engineering practices, including n-tier architecture, configuration management, development methodologies (e.g. agile, waterfall, spiral, prototyping), etc.
- This role can be remote from SF, within the continental US. Some travel to the South Bay or SF is required