Director, Site Reliability Engineering

With Salesforce in Burlington MA US


Posted on April 02, 2019

About this job

Job type: Full-time
Experience level: Senior, Lead, Manager
Role: System Administrator
Industry: Cloud Services, Information Technology
Company size: 10k+ people
Company type: Public


linux, continuous-integration

Job description


This role is to reshape, innovate, and refunction our globally distributed Site Reliability teams at Salesforce. You will be responsible for building a resilient platform with customer experience at the center of what we do. You will be empowered to envision incident response by building out best-in-class tools, diagnostics, configurations, processes, and partnerships with a CI/CD mindset.

This role will be a balance of technical, influential leadership, and managerial expertise. You will proactively set technical direction on incident bridges and marshal resources accordingly. You will also ensure that investigations follow appropriate troubleshooting paths and that monitoring, triage, and change execution remain optimal. This position will involve fostering and maintaining strong cross-functional relationships by ensuring the SRE team is a vital stakeholder in any process and procedural enhancements, including M&A. As a managerial leader, you will inspire, coach, and mentor your managers and individual contributors to turn their career aspirations into reality.

Lead and manage teams responsible for incident management, detection, change execution/approvals, and maintenance for all integrated properties, as well as root cause analysis/remediation and other proactive measures to improve the stability of customer performance and minimize risk of impact to customers. The team works collaboratively with internal R&D teams and partners closely with various teams to drive resiliency improvements and reduce our MTTD and MTTR. You will manage a highly skilled team that currently works on shift rotation.

Proactively ensure visibility into diagnostics, detection, configuration, and applications; develop service ownership to fill gaps and provide detection in customer experience

Create capabilities that let the SR team respond to incidents in a timely manner and find root cause

Work successfully with other cross-cloud service owners (developers, DBAs, network engineers, etc.), maintaining positive relationships while exercising influence

Take proactive measures that benefit customers beyond the current NOC SRE team. We want to actually solve the problems and configure visibility

Involved in public cloud tooling in Linux environments

Collaborate on SR dashboards and analytics to give predictive insights on data center environments for customers

Passionate about engineering productivity and service ownership and customer success

Passionate about Continuous Integration and Delivery and driving teams to adopt this delivery model

Excited by building reliable, self-healing services on unreliable hardware

Experience designing, developing, debugging, and operating resilient distributed systems that run across thousands of compute nodes in multiple datacenters

Required skills/Experience:

10+ years of Infrastructure Engineering experience

7+ years managing Site Reliability, NOC, or mixed engineering teams and/or Managers in globally distributed environments

Past experience in incident management and a strong understanding of ITIL service operations and Scrum methodologies

Experience growing high performing, globally distributed engineering teams

Passionate about employee development with experience successfully coaching and managing managers and individuals to achieve goals

Strong communication, organizational, analytical, and problem-solving skills and attention to detail

Experience in a large-scale Linux data center environment, with knowledge of administration and troubleshooting

Process improvement and change management

CI/CD mindset

Has a passion for: Teamwork and collaboration, Adaptability, Communication, Problem Solving, Customer Focus, Results, and Innovation.

Provides vision as a multiplier to deliver

Entrepreneurial-spirited, Results-driven, communicator, aloha spirit

Nerdy and business-like


Experience with software-based infrastructure such as AWS, GCP, Azure, GCE, CoreOS

Windows Systems knowledge as well

Experience with M&A Strategy with Site Reliability

Analytics/BI Background


MS in Computer Science or related field, or

BS in Computer Science plus relevant job-related experience

About Salesforce:
Salesforce, the Customer Success Platform and world's #1 CRM, empowers companies to connect with their customers in a whole new way. We are the fastest growing of the top 10 enterprise software companies, the World's Most Innovative Company according to Forbes, and one of Fortune's 100 Best Companies to Work for nine years running. The growth, innovation, and Aloha spirit of Salesforce are driven by our incredible employees who thrive on delivering success for our customers while also finding time to give back through our 1/1/1 model, which leverages 1% of our time, equity, and product to improve communities around the world.
