Site Reliability Engineer - Data Technologies at Bloomberg (London, UK)
As an SRE (“Site Reliability Engineer”) at Bloomberg, you will work on our biggest, most critical services. Your mission will be to ensure Bloomberg is fast, highly available, scalable, and able to withstand unprecedented increases in load. In this role you will be at the heart of managing production, with a scope from the kernel to the application. That means the position requires the flexibility and creativity to take an all-round approach to troubleshooting.
We are building new infrastructure from the ground up. On top of this, you will design and build automation tools for system health and production acceptance tests to validate production changes, and you will ensure the system is well instrumented and highly fault tolerant. Strong attention to detail is essential, as you will dive deep into specific issues when required.
This is a new project where we are reimagining the way systems are developed across our entire company. Our aim is to entirely change the way our business operates. You will be part of a small, focused team with the autonomy and flexibility to make bold choices.
We’ll trust you to:
- Ensure optimal availability, latency, scalability and efficiency of Bloomberg application development. You will do this by advocating for reliability engineering throughout our development life cycle, with a focus on fault-tolerant approaches
- Respond to and resolve unexpected and potential service problems. You will write software to prevent the same problem from happening again
- Drive capacity planning, performance analysis, instrumentation and other non-functional systems requirements
- Review and influence ongoing design, architecture, standards and methods for improving operating services
- Own system releases, write production software acceptance tests and coordinate all aspects of the release, including coverage and communication plans
Desired Skills and Experience
- A background as a Software Engineer, or experience developing customer-facing, high-availability, large-scale distributed applications
- In-depth knowledge of Linux/Unix
- Extensive exposure to working with fault-tolerant approaches in large-scale distributed environments and high-performance systems
- Proficiency in C, C++ or Python
- Understanding of a variety of scripting languages
- You know how Docker/rkt containers work at scale
- You have exposure to Kubernetes/Swarm/Mesos/Spark
- An understanding of how complex systems environments work
- A deep understanding of internet and networking protocols
- A passion for performance excellence and robustness, and an engineering mindset
- You have the ability to analyse and troubleshoot large-scale distributed systems