Principal Site Reliability Engineer
Imagine the activity, the transactions, and the data that flow through one of the largest eCommerce sites in the world. Do you have that in your mind? Now imagine what it might take to keep that site up, performing, and running efficiently. If you have a clear picture of that in your head, you might be the person we need to lead our SRE team!
Reporting to the Sr. Manager of Site Reliability Engineering, the Principal Site Reliability Engineer will work with other SRE and DevOps practitioners to produce mission-critical infrastructure, tools, and processes that will ensure the highest levels of availability and reliability for all our websites. As a senior member of the team, you will be expected to work with management, peers, and customers to define and implement the technical vision of the team.
You’re right for the job if you’re comfortable with deep technical Linux, networking topics, and distributed architectures. You will work cross-functionally with a variety of teams and be a core contributor to every significant engineering service or solution that we deliver to our stakeholders. You’ll excel if you have enthusiasm for digging deep and a flair for sharp technical communication, prioritization, and organization. You will work directly with our Software Engineering teams to build our next-generation “always up” cloud-based e-commerce platform.
Our goal is to build, scale, and guard the systems that delight customers. Do you have what it takes?
Position Description
- Develop a deep understanding of the various services and applications that come together to deliver Walmart e-commerce products
- Augment existing instrumentation to build a cohesive picture of the characteristics of our systems with special attention to points of failure
- Design new tools and smart alerts that help discover failures/issues in a timely fashion and work with engineers to identify root cause and mitigating factors
- Root-cause complex scaling and performance problems that involve multiple stakeholders and span network, hardware, and software
- Participate in on-call rotation schedules as required
- Enable service reliability and availability supported by metrics and measurements
- Enable scaling by providing tools, developing training, and augmenting processes
- Protect the systems from critical issues, be they real, perceived or notional
- Build and drive the automation systems that maintain system health
- Drive improvements in all aspects of service delivery, including change management, continuous delivery, security, monitoring, and reliability
- Inspire the team to continuously update and sharpen their skills to keep a cloud-scale system operational, approaching these problems from both a systems and a software engineering perspective
- Partner with teams across the organization to foster collaboration
- Own end-to-end availability and performance of mission critical services and build automation to prevent problem recurrence; automate response to all non-exceptional service conditions
- Drive the day-to-day health, uptime, monitoring, and reliability of services and server infrastructure
- Design, implement, and support high-performance, highly-available services and infrastructure
- Improve the efficiency and flexibility of our datacenters
- Support and maintain models for growth and capacity planning
- Work closely with engineering, project management, and operations peers to develop innovative technical tools and solutions
- Organize and manage multiple simultaneous projects
- Practice and enforce Agile and Scrum methodologies
- Lead by example, care for your team, and establish credibility with the quality of your and your team’s technical execution
Minimum Qualifications
- BA/BS degree in Computer Science or related technical field, or equivalent practical experience
- Multiple years of experience in delivering systems level software solutions and/or infrastructure management with an emphasis on automation
- Understanding of the entire stack of datacenter technologies, from storage arrays to networking to orchestration software
- Demonstrable experience as a Site Reliability Engineer, Systems Administrator, Software Engineer, or equivalent roles
- Demonstrable experience leading teams
- Demonstrable experience working in a high-volume, large-deployment, multi-datacenter environment
Additional Preferred Qualifications
- Capability to program in at least one language, ideally Python or Perl, but Ruby, C/C++, Java, or others are okay
- Experience with Unix/Linux systems with scripting experience in Shell, Perl or Python
- Strong knowledge of core protocols and tech such as: TCP/IP, HTTP, DNS, load balancers, distributed file systems, key-value and relational databases
- Extensive experience with configuration management tools such as Puppet, Chef, Salt, or Ansible
- Experience with specific software such as Hadoop, Kafka, Spark, CouchBase, Graphite, RUM, and similar technologies is desirable, but the ability to quickly learn new technology is most important
- Capable of technical deep-dives into code, networking, systems, and storage with very bright, experienced engineers
- Expertise in problem solving and analyzing global scale distributed systems
- Logging and monitoring experience, including designing, deploying, and running systems such as Splunk, ELK, New Relic, or other APM solutions
- Work with product delivery teams to identify architectural issues and ensure timely and smooth delivery of features into operations
- Identify gaps in processes, skills, tooling, technology choices and work with upper management to drive improvements within the organization
- Excellent written and verbal communication skills in order to influence architectural and process level change in the organization
- Generally requires 10+ years in a software development role, operations role, or closely related position
Req ID: 685017BR
Desired Skills and Experience
See application page for details