SRE (Storage)

With KI group GmbH in Leiria - PT

Posted on March 17, 2020

About this job

Location options: Paid relocation
Job type: Full-time
Experience level: Mid-Level
Role: DevOps, System Administrator
Industry: Information Technology, IT Consulting
Company size: 201–500 people
Company type: Private


ceph, ansible, linux, python

Job description

At xgeeks - a part of the KI Group - we are looking for software engineers with a keen eye for high-quality execution to join our rapidly growing teams in the lovely city of Leiria. Here, you will design and build solutions for start-ups and scale-ups together with an international team whose members come from the biggest companies in Europe.

We are a team of very talented individuals building state-of-the-art solutions for major brands in Europe. Our mission is to bring engineering capacity and expertise to big corporations.

Your day to day

Help define the process for SRE across the entire organisation;

Analyse existing Service Level Objectives (SLOs), and create and maintain new ones;

Investigate failures, identify root causes and resolve operational challenges that affect the defined SLOs;

Provide on-call (9-5) and incident management support for production systems;

Ensure that the SRE team's toil stays under 50% by focusing on automating incident resolution and platform reliability tasks;

Be part of a postmortem culture that enables learning from failure and shares knowledge across the entire engineering team;

Align across the entire engineering team to ensure the use of standardised tools for both monitoring and logging;

Enhance, maintain and scale our infrastructure;

Work alongside the Infrastructure engineering team to ensure the implementation of best practices in stability, resiliency, security, performance and observability;

Leverage Chaos Engineering approaches to identify opportunities;

Maintain up-to-date documentation on deployments, processes and standard operating procedures/runbooks.

Technologies we use

Infrastructure (Kubernetes, Cilium, Terraform, Docker, Helm, Ansible, AWX, Ceph)

Monitoring (Prometheus, Thanos, Alertmanager, Grafana, Node Exporter)

Logging (EFK stack: Elasticsearch, Fluentd and Kibana)

Data Storage (TSDB, Ceph Block Storage, S3)

Security (Keycloak, Wazuh, k8s RBAC, mTLS)

Main requirements

2+ years of experience as an SRE, Production or Systems Engineer;

2+ years of experience building or maintaining Ceph clusters in production;

2+ years of experience with large-scale data center environments;

SME in scripting and deployment automation (e.g. Ansible, Puppet or Chef);

SME with Linux in a large-scale deployment;

SME with monitoring large-scale infrastructure;

Strong Bash/Linux skills and proficiency in at least one programming language (e.g. Golang or Python);

Strong ability to troubleshoot errors and outages across the stack and identify root causes and solutions;

Strong comprehension of continuous integration and continuous deployment methodologies;

Strong communicator, fluent in both written and spoken English.


Impact! You will have the opportunity to be at the frontline of innovation together with our prominent clients, influencing the car you drive in five years, the services you have on your flight, and the way you pay for your morning coffee.

  • The chance to work on various interesting projects with appropriate time frames
  • We have an open-door work culture where ideas and initiatives are encouraged
  • We offer a performance-based competitive salary
  • You'll be among the first engineers to join the Leiria office, contributing to our internal processes from the beginning
  • Yearly training
  • Remote-friendly policy
  • Constant two-way feedback for your professional growth