Site Reliability Engineer - Data Technologies at Bloomberg (London, UK)

As an SRE (or “Site Reliability Engineer”) at Bloomberg, you will work on our biggest, most critical services. Your mission will be to ensure Bloomberg is fast, highly available, scalable, and able to withstand unprecedented increases in load. In this role you will be at the heart of managing production, with a scope from the kernel to the application. That means the position requires the flexibility and creativity to take an all-round approach to troubleshooting.

We are building a new infrastructure from the ground up. On top of this, you will design and build automation tools for system health and production acceptance tests to validate production changes, and you will ensure the system is well instrumented and highly fault tolerant. Strong attention to detail is needed, as you will deep-dive into specific issues when required.

This is a new project in which we are reimagining the way systems are currently developed across our entire company. Our aim is to entirely change the way our business operates. You will be part of a small, focused team with the autonomy and flexibility to make bold choices.

We’ll trust you to:

Ensure optimal availability, latency, scalability and efficiency of Bloomberg application development. You will do this by advocating for reliability engineering throughout our development life cycle, with a focus on fault-tolerant approaches
Respond to and resolve unexpected and potential service problems. You will write software to prevent the same problem from happening again
Drive capacity planning, performance analysis, instrumentation and other non-functional systems requirements
Review and influence ongoing design, architecture, standards and methods for improving operating services
Own system releases, write production software acceptance tests, and coordinate all aspects of the release, including coverage and communication plans

Desired Skills and Experience

A background as a Software Engineer, or in the development of customer-facing, high-availability, large-scale distributed applications
In-depth knowledge of Linux/Unix
Extensive exposure to fault-tolerant approaches in large-scale distributed environments and high-performance systems
Proficiency in C, C++ or Python
Understanding of a variety of scripting languages
You know how Docker/rkt containers work at scale
You have exposure to Kubernetes/Swarm/Mesos/Spark
An understanding of how complex systems environments work
A deep understanding of internet and networking protocols
A passion for performance excellence and robustness, and an engineering mindset
You have the ability to analyse and troubleshoot large-scale distributed systems