Director, Site Reliability Engineering
With Salesforce in Burlington, MA, US
Posted on April 02, 2019
About this job
Job type: Full-time
Experience level: Senior, Lead, Manager
Role: System Administrator
Industry: Cloud Services, Information Technology
Company size: 10k+ people
Company type: Public
This role exists to reshape, innovate, and refunction our globally distributed Site Reliability teams at Salesforce. You’d be responsible for building a resilient platform with customer experience at the center of what we do. You’d be empowered to envision incident response by building out best-in-class tools, diagnostics, configurations, processes, and partnerships with a CI/CD mindset.
This role will be a balance of technical expertise, influential leadership, and managerial expertise. You will proactively set technical direction on incident bridges and marshal resources accordingly. You will also ensure that investigations follow appropriate troubleshooting paths and that monitoring, triage, and change execution remain optimal. This position will involve fostering and maintaining strong cross-functional relationships by ensuring the SRE team is a vital stakeholder in any process and procedural enhancements, including M&A. As a managerial leader, you will inspire, coach, and mentor your managers and individual contributors to turn their career aspirations into reality.
Lead and manage teams responsible for Incident Management, Detection, Change Execution/Approvals, and maintenance for all integrated properties, as well as root cause analysis/remediation and other proactive measures to improve the stability of customer performance and minimize risk of impact to customers. The team works collaboratively with internal R&D teams and partners closely with various teams to drive resiliency improvements and reduce our MTTD and MTTR. You will manage a highly skilled team that currently works on a shift rotation.
Proactively ensure visibility across diagnostics, detection, configuration, and application; develop service ownership to fill gaps and provide detection in customer experience
Create capabilities that let the SR team respond to incidents in a timely manner and find root cause
Work successfully with other cross-cloud service owners (developers, DBAs, Network, etc.), maintaining positive relationships while exercising influence
Take proactive measures that benefit customers beyond the current NOC SRE team - we want to actually solve the problems and configure visibility
Involved in public cloud tooling in Linux environments
Collaborate on SR dashboards and analytics to give predictive insights on data center environments for customers
Passionate about engineering productivity and service ownership and customer success
Passionate about Continuous Integration and Delivery and driving teams to adopt this delivery model
Excited by building reliable, self-healing services on unreliable hardware
Experience designing, developing, debugging, and operating resilient distributed systems that run across thousands of compute nodes in multiple datacenters
10+ years of Infrastructure Engineering experience
7+ years managing Site Reliability, NOC, or mixed engineering teams and/or Managers in globally distributed environments
Past experience in Incident Management and strong understanding of ITIL service operations and Scrum methodologies
Experience growing high-performing, globally distributed engineering teams
Passionate about employee development with experience successfully coaching and managing managers and individuals to achieve goals
Strong communication, organizational, analytical, and problem-solving skills and attention to detail
Experience in a large-scale Linux data center environment with knowledge of administration and troubleshooting
Process improvement and change management
Has a passion for: Teamwork and collaboration, Adaptability, Communication, Problem Solving, Customer Focus, Results, and Innovation.
Provides vision as a multiplier to deliver
Entrepreneurial-spirited, results-driven communicator with aloha spirit
Nerdy and business-like
Experience with software-based infrastructure such as AWS, GCP, Azure, GCE, CoreOS
Windows Systems knowledge as well
Experience with M&A Strategy with Site Reliability
MS in Computer Science or related field, or
BS in Computer Science plus relevant job-related experience
Salesforce, the Customer Success Platform and world's #1 CRM, empowers companies to connect with their customers in a whole new way. We are the fastest growing of the top 10 enterprise software companies, the World's Most Innovative Company according to Forbes, and one of Fortune's 100 Best Companies to Work for nine years running. The growth, innovation, and Aloha spirit of Salesforce are driven by our incredible employees who thrive on delivering success for our customers while also finding time to give back through our 1/1/1 model, which leverages 1% of our time, equity, and product to improve communities around the world.