SRE (Compute)

With KI group GmbH in Lisboa - PT


Posted on March 17, 2020

About this job

Location options: Paid relocation
Job type: Full-time
Experience level: Mid-Level
Role: DevOps, System Administrator
Industry: Information Technology, IT Consulting
Company size: 201–500 people
Company type: Private


linux, kubernetes, python, scripting, ansible

Job description

If you're looking for an established company that lives the startup culture, keep reading!

Here, you will help find problems to solve and build solutions together with an international team, with people coming from the biggest companies in Europe.

About Us

At KI labs we design and build state-of-the-art software and data products and solutions for the major brands of Germany and Europe. We aim to push the status quo of technology in corporations, with a special focus on software, data and culture. We are a team of software developers, designers, product managers and data scientists who are passionate about building the products of the future today. We believe in open source and independent teams, follow Agile practices and the lean startup method, and aim to share this culture with our clients. We are mainly located in Lisbon and Munich.

Your day to day

Help define the process for SRE across the entire organisation;

Analyse existing Service Level Objectives (SLOs), and create and maintain new ones;

Investigate failures, identify root causes and resolve operational challenges that affect the defined SLOs;

Provide on-call (9-5) and incident management support for production systems;

Ensure that the SRE team's toil stays under 50% by focusing on automating incident resolution and platform reliability tasks;

Be part of a postmortem culture that enables learning from failure and shares that knowledge across the entire engineering team;

Align across the entire engineering team to ensure the use of standardised tools for both monitoring and logging;

Enhance, maintain and scale our infrastructure;

Work alongside the Infrastructure engineering team to ensure the implementation of best practices in stability, resilience, security, performance and observability;

Leverage Chaos Engineering approaches to identify opportunities for improvement;

Maintain up-to-date documentation on deployments, processes and standard operating procedures/runbooks.

Technologies we use

Infrastructure (Kubernetes, Cilium, Terraform, Docker, Helm, Ansible, AWX, Ceph)

Monitoring (Prometheus, Thanos, Alertmanager, Grafana, Node Exporter)

Logging (EFK stack: Elasticsearch, Fluentd and Kibana)

Data Storage (TSDB, Ceph Block Storage, S3)

Security (Keycloak, Wazuh, k8s RBAC, mTLS)

Main requirements

2+ years of experience as an SRE, Production or Systems Engineer;

2+ years of experience building or maintaining Kubernetes clusters in production;

2+ years of experience with large-scale data center environments;

Subject-matter expert (SME) in scripting and deployment automation (e.g. Ansible, Puppet or Chef);

SME in Linux in large-scale deployments;

SME in monitoring large-scale infrastructure;

Strong Bash/Linux skills and proficiency in at least one programming language (e.g. Go or Python);

Strong ability to troubleshoot errors and outages across the stack and identify root causes and solutions;

Strong comprehension of continuous integration and continuous deployment methodologies;

Strong communicator, fluent in both written and spoken English.
