Site Reliability Engineer at Faria Education Group (San Francisco, CA) (allows remote)

Site Reliability Engineers are hybrid systems and software engineers who are responsible for and take ownership of reliability, automation, and other issues related to ‘keeping the lights on’ across Faria’s multi-product SaaS systems stack.

SREs are integrated within the Technical Operations team. They report to the Head of Technical Operations and work with the CTO and Principal Developers. We are looking for engineers who want to take part in developing, maintaining, and scaling infrastructure software.

Benefits

50k-60k / year, negotiable
Dedicated AWS account (or bare metal servers, your choice) for infrastructure automation testing, development and general learning.
Retina MacBook Pro or another laptop of your specification, peripherals and displays included.
Health care.
Books, library & conference budget.
Remote-friendly.

LOCATION

This is a 100% remote job. The ideal candidate will be in the GMT-8 time zone (Pacific Coast) or prepared to be available daily during GMT-8 business hours. Because of the nature of SRE work, you should also be prepared for on-call shifts and potential “all hands on deck” situations at any hour of the day or night. Minimizing those situations is part of your job!

Please note: Due to a high volume of applicants, only shortlisted candidates will be contacted.

Desired Skills and Experience

At least 2 years of system administration experience for a high-usage, web-based software service, ideally built using open-source software components
Knowledge of and familiarity with alerting and monitoring tools, and with system management tools for Linux environments (including Nginx, New Relic, Cloudflare, MySQL/PostgreSQL, Apache, iptables, ELK stack, RabbitMQ, etc.)
Knowledge of developing / deploying / troubleshooting / tuning Ruby on Rails applications (Passenger, Capistrano, Sidekiq, Bundler)
Knowledge of and familiarity with configuration management tools including Ansible, Chef or Puppet
Knowledge of AWS services and APIs including EC2, S3, VPC, IAM
Knowledge of type-1 hypervisor virtualization (Xen, vSphere)
Knowledge of containers (LXC, Docker)
Strong communication skills with ability to coordinate incident response with urgency.

Responsibilities

Reliably automate the server provisioning process to reduce the labor of our R&D team
Build scalable infrastructure to manage high-load, concurrent sessions to support ~50 million monthly page views and 500k+ active users
Drive the company through “Disaster Recovery Tests”, where we manually turn down pieces of infrastructure to test Faria’s overall resiliency to failures
Implement the systems and processes that Faria Developers use to deploy their software into production
Build an auto-remediation system to automatically resolve production incidents before escalating them to on-call Developers

Remote Working Practices

Proper remote presence & etiquette (acknowledging requests over Slack in a timely fashion and never leaving a request unacknowledged)
Tagging the appropriate person and reminding them every 24 hours until the issue is fully resolved (not letting things fall through the cracks)
Effective adherence to operating procedures (organising day-to-day work and large-scale tasks calmly, sequenced by priority)