Site Reliability Engineer

At Apixio we are building a data infrastructure and family of products to help transform the healthcare industry. Our data science and engineering teams are creating smart products that use clinical patient data to do this, and they need great Site Reliability Engineers to build, deploy and support the infrastructure that brings their work to our customers. Our Site Reliability Engineering team manages our AWS footprint as infrastructure as code, is highly security minded and knows how to help the engineering team create a highly reliable site. The team also automates everything. Our operations team shares an on-call rotation backed by our engineering teams.

As one of our SREs you will be capable of doing many of the following:

Own and maintain Apixio services and data infrastructure in production
Analyze and improve the efficiency, scalability, and reliability of our backend systems
Capacity planning and buildout
Create scalable and reliable monitoring and alerting that works
When things go bad, perform advanced troubleshooting of our systems
Build scalable, secure and measurable infrastructure with code
Create automation of engineering deployments
Support the engineering team in implementing system reliability
Support disaster recovery design, implementation and testing
Support Apixio’s office infrastructure (Cisco ASA, Meraki switches and APs, etc.)
Build tooling to allow the self-service of AWS offerings by engineering teams

And you will have knowledge of many of the following:

Strong understanding of Unix and system administration
Strong programming skills in languages such as Python, Go, Scala/Java and C++
Strong knowledge of best-in-class security practices and testing methods
Amazon Web Services (AWS) and APIs
Strong knowledge of the configuration and maintenance of common big infrastructure components such as Cassandra, Redis, FluentD, Apache/Django/Flask, Kafka, Elasticsearch and Hadoop
Strong knowledge of internet service architecture (TCP/IP, HTTP, DNS, routing, load balancing)
Experience with monitoring and logging applications like FluentD, Graylog, Datadog
Experience with deployment and config management systems like Salt Stack, Ansible and HashiCorp

A strong candidate will have:

Past experience and success stories managing a significant infrastructure in AWS and maintaining a 24x7 commercial SLA
A passion for killer SLAs, secure infrastructure and automation of everything
A BS or MS in Computer Science / Engineering, or equivalent

Desired Skills and Experience

See application page for details