Desired Skills and Experience

  • Significant experience in both systems engineering and software development
  • Proficiency in at least one of these disciplines:
      • Internals of distributed operating systems (Unix/Linux, Windows, z/OS)
      • Systems programming
      • Network programming
  • Experience with large-scale software development in one of these languages: Java, Python, .NET, C++, etc.
  • Experience with system and software security and entitlements such as SSO, Windows, Kerberos, LDAP, Windows AD
  • Experience with new and emerging technologies such as cloud and virtualization

JPMorgan Chase’s Core Foundation Services (CFS) group, within Global Technology Infrastructure, designs and delivers critical and foundational platform solutions for all technology infrastructure systems across all lines of business. This includes the engineering design and delivery of platforms for Directory Services, Authentication and Privilege Management, Configuration Management, IPAM, Orchestration, Reference data, and more.

The CFS team is seeking a Site Reliability Engineer (SRE) Lead who combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. The candidate will build creative engineering solutions to operations problems, including optimizing existing systems, building infrastructure, and eliminating work through automation. The candidate will work with various cross-functional teams and must be able to work in a global team setting and adapt to dynamic requirements. As SREs are responsible for the big picture of how our systems relate to each other, the candidate will use a breadth of tools and approaches to solve a broad spectrum of problems.

Key responsibilities:

  • Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
  • Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews.
  • Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
  • Practice sustainable incident response and blameless postmortems.
  • Be part of a dynamic team that provides round-the-clock coverage to ensure service uptime and stability.
  • Thrive in a team environment with strong interpersonal skills. Collaborate and build relationships with engineers, development teams, architects, operations partners, and business clients.
  • Establish, and regularly update, a multi-phase delivery roadmap.
