Site Reliability Engineer jobs

With the renewed focus on global commerce, building amazing commerce experiences for our community of users is at the core of eBay’s mission. One of the critical components enabling this mission is our massive underlying technology infrastructure, platforms, and application services. It is an exciting time to be part of the Site Reliability Engineering (SRE) team, which plays an integral role at the intersection of software engineering and infrastructure/systems engineering. The SRE team strives to make eBay’s services secure, highly available, reliable, and performant for our community of users. The team has a unique opportunity to advocate for and participate in building services that are resilient, effectively monitored, alerted on, and self-healing, by applying software engineering practices to operate the site.

* Excellent troubleshooting skills that span systems, network (TCP/IP), and code
* Hands-on software engineering skills, including Java, Python, Scala, etc.
* Expert knowledge of large-scale web operations, web-based Java/J2EE architectures, and JVM configurations
* Strong interpersonal and communication skills to work in a fast-paced, rapidly changing environment
* Strong skills in data structures, relational and NoSQL databases, networking, web architectures, and UNIX flavors

Desired Skills and Experience

Additional Responsibilities

* Take primary responsibility for ensuring that customer-facing eBay application services are highly available, reliable, and performant through world-class monitoring, alerting, and self-healing capabilities, by applying software engineering practices
* Spend approximately half the time with the core engineering teams, advocating for and contributing to resilient application services, and the remaining half with the core site operations teams
* Serve as the primary subject matter expert for eBay application services (e.g., Buying, Selling, core checkout), both preventing (proactive) and troubleshooting and mitigating (reactive) service availability and performance issues
* Develop tools to improve our ability to rapidly deploy and effectively monitor application services in a large-scale and complex environment
* Multitask and deliver in a fast-paced, rapidly evolving technology landscape, and participate in on-call escalation for incident resolution
* Take responsibility for Lights Out Management of our services, advocating for and contributing to the best operability of the services in production