Site Reliability Engineer
With SparkCognition in Austin, TX, US
Posted on November 07, 2019
About this job
Job type: Full-time
Role: System Administrator
sysadmin, docker, cloud
SparkCognition is an AI leader that offers business-critical solutions for customers in energy, oil and gas, manufacturing, finance, aerospace, defense, and security. A highly awarded company recognized for cutting-edge technology, SparkCognition develops AI-powered, cyber-physical software for the safety, security, reliability, and optimization of IT, OT, and the Industrial IoT.
SparkCognition is looking for a Site Reliability Engineer who can help drive SparkCognition’s production operations initiatives. The ideal candidate has experience in monitoring and maintaining production systems, issue resolution, automation, and continuous improvement. The position offers opportunities to design and build a modern, automated platform in the cloud, spanning multiple regions around the globe. This is a high-visibility role in which the candidate will work across multiple teams to ensure the stability of advanced machine-learning solutions.
- Incorporate monitoring of systems & applications to prevent system disruptions and ensure that required Service Level Agreements (SLAs) are met.
- Respond to tickets within team-defined Service Level Objectives (SLOs).
- Notify appropriate teams of performance issues and trends.
- Suggest improvements to monitoring process and tools as needed.
- Deploy production infrastructure and product releases, and maintain production systems.
- Document system requirements, configurations, procedures, changes, incidents, and problem resolution.
- Work with DevOps to support the development, testing, and pre-production environments.
- Perform Root Cause Analysis (RCA) of outages & performance issues, and provide feedback to appropriate teams to prevent similar recurrences.
- Participate in on-call rotation with the ability to respond to the needs of a 24x7 environment.
- Perform other related duties as assigned.
- Must have experience with monitoring & logging systems, collecting metrics, and using tools like DataDog, Grafana, Graylog, Prometheus, ELK, Zabbix, etc.
- Must have experience with Docker containers, Kubernetes, and working with Google Cloud Platform.
- Must have functional knowledge of networking, DNS, DHCP, VPCs, ACLs, network security, and troubleshooting.
- Experience building out continuous integration/continuous delivery pipelines and overall production operations.
- Proven experience managing multiple projects and competing priorities in a fast-paced work environment.
- Proven ability to work across multiple product teams and deliver solutions on tight deadlines.
- Strong written and verbal communication, organization, and customer service skills.
- Familiarity with some or all of the following technologies:
- Automation/configuration management tools like Terraform, Ansible, Helm.
- Scripting languages like Shell, Bash, Python.
- Containerization experience with Docker or similar technologies.
- Streaming or message-queue systems such as Kafka.
- SQL databases such as Postgres.
- Cloud networking and traffic management (VPCs, load balancers, network segregation).