  • Plan, execute, and manage strategy for test cluster operations, and improve cost efficiency as well as reliability.
  • Build new tools and scripts to improve application performance, monitoring, and recovery.
  • Help oversee and automate cloud infrastructure, systems scalability, and systems/network security.

Key Qualifications

  • 5+ years in data center engineering and operations, ideally in a team lead role.
  • Working knowledge and practical experience in increasing technology footprints, ensuring uptime, and working with network, security, and systems engineers to produce quality solutions.
  • Minimum 3 years of experience with scripting (Python, shell, Ruby)
  • Experience with open-source continuous integration and configuration management tools (Jenkins, Chef, Puppet)
  • Experience with Log Management tools such as ELK Stack, Splunk, MongoDB, etc.
  • Exposure to HVAC management, power supply, UPS, cooling/heating.
  • Experience with monitoring tools like Nagios and Ganglia

Description

Apple’s Software Automation Platform team provides cloud-based testing services for all software contributions to iOS, OS X, tvOS, and watchOS. The team operates a large-scale onsite test cluster that is poised to scale to support ever-growing business needs. The Sr. Site Reliability Engineer will be responsible for operating, managing, and scaling a sophisticated test cluster environment. This is a unique opportunity to join an early-stage, high-impact team.

Education

BA/BS degree in computer science or equivalent field with 5+ years of professional experience

Desired Skills and Experience

See application page for details