Site Reliability Engineer

With Apple in San Diego, CA, US

Posted on April 06, 2021

About this job

Job type: Full-time
Role: System Administrator
Industry: Consumer Electronics
Company size: 10k+ people
Company type: Public


web-services, linux

Job description

Software Delivery Services & Infrastructure is focused on ensuring engineers at Apple can do amazing things, and we are looking for a talented application Site Reliability Engineer (SRE) to join us in this mission. You'll play a critical role in the day-to-day operations of services relied upon across Apple. You'll partner with engineering teams to ensure they're successful. You'll look for opportunities to innovate, all while driving for rock-solid operations.

Responsibilities will include:

  • Adopting and applying SRE best practices to the services you support

  • Keeping users, key stakeholders, and leadership informed through regular reporting and communications

  • Identifying manual tasks and toil that can be automated

  • Developing playbooks for actionable alerts

  • Fostering strong relationships with cross-functional teams

  • Participating in on-call rotations

  • Running deployment validation tests for production deployments

  • Continuously validating the customer experience and analyzing performance

  • Performing regular disaster recovery (DR) testing and fail-overs

  • Participating in incident postmortems and implementing preventive findings

  • Ensuring services adhere to published specs and standards

  • Performing predictive analysis or implementing AI for issue avoidance

As part of Software Delivery Services & Infrastructure SRE, you will be responsible for delivering reliable services and driving projects to a successful outcome. This role will focus on operating and supporting a distributed development workflow used by teams in Software Engineering. You will monitor SLOs, respond to incidents, troubleshoot issues, and ensure the service is up to date and secure. You will collaborate with engineering teams to implement best practices and shape technical decisions.

To ensure your success, this job will provide you with:

  • Passionate and talented coworkers around the globe who are ready to collaborate, mentor, and learn from you

  • Ownership to drive meaningful improvements to the operational reliability of the services you manage

  • Opportunities to contribute to the best practices used by SRE teams within Software Delivery

Skills & requirements

  • A positive and respectful attitude

  • A passion for providing reliable services at scale, on bare metal as well as in cloud environments

  • A deep understanding of CI/CD technologies such as Jenkins

  • Strong working knowledge of Git and code-review systems such as Gerrit, Bitbucket, and GitHub

  • A good understanding of Linux service administration

  • Experience using Prometheus, Grafana, and Splunk

  • Superb collaboration skills with excellent written and verbal communication

  • The ability to troubleshoot large-scale systems

  • A deep understanding of web services, how they operate, and what needs monitoring and alerting

  • A good understanding of security principles and design

  • The desire to be proactive at all times in issue prevention

  • The desire to do what is right for the customer and to provide a great customer experience

    • Prior experience as an SRE, software engineer, or system administrator
    • Proven ability to self-manage large projects and meet deadlines
