Principal Site Reliability Engineer

Imagine the activity, the transactions, and the data that flows through one of the largest eCommerce sites in the world. Do you have that in your mind? Now imagine what it might take to keep that site up, performing and running efficiently. If you have a clear picture of that in your head, you might be the person we need to lead our SRE team!

Reporting to the Sr. Manager of Site Reliability Engineering, the Principal Site Reliability Engineer will work with other SRE and DevOps practitioners to produce mission-critical infrastructure, tools, and processes that will ensure the highest levels of availability and reliability of all our websites. As a senior member of the team you will be expected to work with management, peers, and customers to define and implement the technical vision of the team.

You’re right for the job if you’re comfortable with deep technical Linux, networking topics, and distributed architectures. You will work cross-functionally amongst a variety of teams and be a core contributor in every significant engineering service or solution that we deliver to our stakeholders. You’ll excel if you have enthusiasm for digging deep, and a flair for sharp technical communication, prioritization and organization. You will work directly with our Software Engineering teams to build our next generation “always up” cloud-based e-commerce platform.

Our goal is to build, scale and guard the systems that delight customers. Do you have what it takes?

Position Description

Develop a deep understanding of the various services and applications that come together to deliver Walmart e-commerce products
Augment existing instrumentation to build a cohesive picture of the characteristics of our systems with special attention to points of failure
Design new tools and smart alerts that help discover failures/issues in a timely fashion and work with engineers to identify root cause and mitigating factors
Root-cause complex scaling and performance problems that involve multiple stakeholders, network, hardware, and software
Participate in on-call rotation schedules as required
Enable service reliability and availability supported by metrics and measurements
Enable scaling by providing tools, developing training, and augmenting processes
Protect the systems from critical issues, be they real, perceived or notional
Build and drive the automation systems that maintain system health
Drive improvements in all aspects of service delivery, including change management, continuous delivery, security, monitoring, and reliability
Inspire the team to continuously update and sharpen their skills to keep a cloud-scale system operational, approaching these problems from both a systems and a software engineering perspective
Partner across the organization to foster collaboration and partnership
Own end-to-end availability and performance of mission critical services and build automation to prevent problem recurrence; automate response to all non-exceptional service conditions
Drive the day-to-day health, uptime, monitoring, and reliability of services and server infrastructure
Design, implement, and support high-performance, highly-available services and infrastructure
Improve the efficiency and flexibility of our datacenters
Support and maintain models for growth and capacity planning
Work closely with engineering, project management, and operations peers to develop innovative technical tools and solutions
Organize and manage multiple simultaneous projects
Practice and enforce Agile and Scrum methodologies
Lead by example, care for your team, and establish credibility with the quality of your and your team’s technical execution

Minimum Qualifications

BA/BS degree in Computer Science or related technical field, or equivalent practical experience
Multiple years of experience in delivering systems level software solutions and/or infrastructure management with an emphasis on automation
Understanding of the entire stack of datacenter technologies, from storage arrays to networking to orchestration software
Demonstrable experience as a Site Reliability Engineer, Systems Administrator, Software Engineer, or equivalent roles
Demonstrable experience leading teams
Demonstrable experience working in a high volume, large deployment, multi-datacenter environment

Additional Preferred Qualifications

Capability to program in at least one language, ideally Python or Perl, but Ruby, C/C++, Java, or others are okay
Experience with Unix/Linux systems with scripting experience in Shell, Perl or Python
Strong knowledge of core protocols and tech such as: TCP/IP, HTTP, DNS, load balancers, distributed file systems, key-value and relational databases
Extensive experience with configuration management tools such as Puppet, Chef, Salt, or Ansible
Experience with specific software such as Hadoop, Kafka, Spark, CouchBase, Graphite, RUM, and similar technologies is desirable, but the ability to quickly learn new technology is most important
Capable of technical deep-dives into code, networking, systems, and storage with very bright, experienced engineers
Expertise in problem solving and analyzing global scale distributed systems
Logging and monitoring experience, including designing, deploying, and running systems like Splunk, ELK, New Relic, or other APM solutions
Work with product delivery teams to identify architectural issues and ensure timely and smooth delivery of features into operations
Identify gaps in processes, skills, tooling, technology choices and work with upper management to drive improvements within the organization
Excellent written and verbal communication skills in order to influence architectural and process level change in the organization
Generally requires 10+ years in a software development role, operations role, or closely related position

Req ID: 685017BR