Systems Reliability Engineer

“Here you go: you can send me anyone you have, but they MUST have a strong SRE background. This is a permanent role in NYC. We are pretty open on the rate for the right person.”

Our Team

Bloomberg systems are fast and reliable, and we’re the team that makes that possible. We build middleware: the software infrastructure designed for creating large-scale, fault-tolerant applications that run on thousands of machines throughout the world. What we build is used by both engineers and clients. Our complex infrastructure uses a variety of programming paradigms such as RPC, publish/subscribe and message queues. Building this infrastructure requires top engineers. We are two dozen C++ programmers with experience in designing network protocols and large-scale software architecture.

What’s In It For You

As a Systems Reliability Engineer (SRE) working on this critical infrastructure, your mission will be to take responsibility for deployments and ensure reliability. You will focus on automating everything from build and deployment to reaction to and remediation of outages. You will be part of a larger SRE organization aimed at supporting the Bloomberg API. You will work on all aspects of this end-to-end system. This will require a wide range of skills, some of which you can learn on the job, such as time series databases, statistical analysis, web-based UIs, RESTful services and new programming languages.

We’ll Trust You To

Take responsibility for deployment after Beta for Bloomberg’s messaging and multicast services
Ensure level 1 support for production issues
Automate everything from reaction to outages to quality checks for new builds
Provide feedback to developers to make this infrastructure increasingly resilient

You Need To Have

3+ years of experience as a software engineer or developer working on high availability, large-scale distributed applications
Excellent programming skills (you don’t need to know C++ or Java, although they are a plus, but you do need to be a great programmer in other programming languages such as Python, Ruby, Perl, Scala or JavaScript)
A strong understanding of the UNIX/Linux command line
A passion for performance excellence and an engineering mindset
Previous experience with data, statistics and latency numbers
A Bachelor’s degree in Computer Science or equivalent experience

We’d Love To See

Strong leadership skills
Prior experience as a systems performance or site/systems reliability engineer
Extensive experience working with fault-tolerant approaches in a large-scale distributed environment with high performance systems
Expertise analyzing and troubleshooting large-scale distributed systems
A deep understanding of Internet and networking protocols, including IP multicast (PGM)
Knowledge of analyzing network, performance and application issues using standard tools (tcpdump or Wireshark)
A strong understanding of the software development lifecycle, as well as tools such as Git, CMake, Jenkins, RPM or DPKG, Chef or Puppet
Experience with virtualization and Infrastructure as a Service models
The ability to handle periodic on-call duties as well as out-of-band requests

Check Out How We Give Back To The Open Source Community

https://github.com/bloomberg/chef-bcpc
https://github.com/bloomberg/nginx-cookbook
https://github.com/bloomberg/zookeeper-cookbook
https://github.com/bloomberg/kafka-cookbook
https://github.com/bloomberg/redis-cookbook
