Site Reliability Engineer at Starling Bank (London, UK)

Background

Our SRE team proactively ensures the stability, resilience and scale of our services through automation, testing and engineering. We build on expertise from systems / operations (OS & DB), cloud infrastructure (AWS), pipeline / release engineering (TeamCity), software development and stress / load testing to make sure our services are available 24 hours a day, seven days a week.

We’re looking for engineers with a passion for infrastructure and delivery to join the team, who are equally happy:

working with developers to ensure a principled approach to delivering change in a safe and secure way
working with third parties to ensure our comms are reliable
working with other SREs to hit our service level objectives and prove our systems and environments

The ideal candidate will strive for continual improvement by contributing and assessing new ideas and innovations to meet short-term and longer-term goals, whilst at the same time accepting responsibility for the day-to-day health of our environments.

Responsibilities

You will work in our SRE team, or embedded in our engineering teams, to deliver our SRE mission:

Change management and delivery pipeline into production
Ensure safety, predictability, repeatability and auditability of all build and deploy processes
Enabling ownership by platform and application engineers of tech-specific build plans
Enabling maximum velocity without violating service level objectives
Monitoring, alerting, SLO tracking
Proactively manage delivery of service level objectives
Detection / early warning / self-heal
On-call management
Facilitate emergency / incident response
Create, maintain and test for recovery (backup & restore, infra automation etc.)
Provisioning / automating deployment infrastructure
Demand forecasting and capacity management
Efficiency and cost management
Performance and scalability of the services
Ownership of some cross-cutting implementations, such as logs / metrics infrastructure
Automation of security checks, break-glass procedures, etc.
Provide a level of audit and control to security personnel

Desired Skills and Experience

Software development experience: ideally Java / JVM, but this is not essential; JavaScript, Python and Bash are all beneficial
AWS expertise; familiarity with core services (S3, EC2, ELB, ASG) and CloudFormation
Good understanding of traditional ops areas of expertise: Linux, Disk I/O, Networking, VPNs
Good familiarity with Docker and the container ecosystem
Continuous delivery - principles and pragmatics of dealing with build pipelines, artefact repositories, zero-downtime deployment and so on
Proving resilience via failure injection (Chaos Monkey) and scalability via load and stress testing
Experience with any of the following: CoreOS, ELK, Prometheus, Elasticsearch, PostgreSQL, PagerDuty, Gatling, JMeter, Kubernetes
Some understanding of iOS or Android also beneficial
Sensitivity to (but also boldness to influence) culture and behaviour across an organisation
We raised US$70m in 2016 to see us through to launch and beyond
The Starling team are a mix of technologists, entrepreneurs, designers, brand and customer experts, biz ops managers, and strategists, all working together to deliver our vision
There are currently between 60 and 70 people working in the office on any given day
There isn’t an IT/Engineering department – talented engineers are just a core part of the team
Doing the right thing for customers trumps all, and as such we take our regulatory, conduct and ethical responsibilities very seriously
Passion for what we do
Belief in how we’re different
A can-do attitude, ready to tackle challenges laterally
The ability to communicate our vision internally and externally
Innovative thinking