Desired Skills and Experience
- Performing day-to-day operational/DevOps tasks on Wikimedia’s public-facing infrastructure (deployment, maintenance, configuration, troubleshooting)
- Implementing and utilizing configuration management and deployment tools (Puppet, Kubernetes)
- Assisting in the architectural design of new services and making them operate at scale
- Monitoring of systems, services and service clusters, optimization of performance and resource utilization
- Assisting in or leading incident response, diagnosis and follow-up for system outages or alerts across Wikimedia’s production infrastructure
- Share our values and work in accordance with them
- 3+ years of experience in an SRE/Operations/DevOps role as part of a team
- Experience with managing geographically distributed, highly available, high traffic infrastructure based on Linux
- Comfortable with Open Source configuration management and orchestration tools (Puppet, Ansible, Chef, SaltStack, etc.)
- Experience with the use, maintenance and configuration of monitoring, metrics and logging infrastructure (Icinga/Nagios, Prometheus, Grafana, Graphite, Logstash/Kibana, etc.)
- Comfortable with shell and scripting languages used in an SRE/Operations engineering context (Python, Go, Bash, Ruby, etc.)
- Comfortable with remotely managing both bare-metal servers and virtualized environments
- Experience with software and service deployment and package management, including (Debian) packaging as well as container systems
- Aptitude for automation and streamlining of tasks
- Strong English language skills and the ability to work independently as an effective part of a globally distributed team
- B.S. or M.S. in Computer Science or equivalent work experience
- Track record of open source contributions is a major plus
- Familiarity with modern distributed container cluster management systems (Kubernetes, Docker Swarm, Mesos, …)
- Experience with LAMP stack technologies (PHP/HHVM, memcached/Redis, MySQL) - MediaWiki experience is a definite plus
- Low-level systems troubleshooting and debugging (CPU/memory profiling, C/C++ experience, in-depth Linux knowledge)
- Experience with advanced distributed storage and database systems (Swift, Ceph, Cassandra, etc.)
- Design, implement and maintain backup and underlying storage infrastructure, ensuring all Wikimedia mission-critical data is backed up to on-site and off-site storage in an automated, consistent and reliable manner
- Ensure smooth and reliable operation of the MediaWiki application server platform and its dependencies (Memcached, Redis, etcd, …)
- Perform platform transformations and migrations towards modernized infrastructure (HHVM to Zend PHP7, bare metal deployments to Kubernetes clusters, active/active multi-data center support, etc.)
- Design, implement and maintain our metrics, monitoring and logging infrastructure using modern tooling (Prometheus, Grafana, Logstash/Kibana)
- Implement and improve orchestration and automation tooling that eliminates toil and acts as an enabler for the entire SRE team
- Help keep Wikimedia’s infrastructure secure in an ever-changing, high-velocity environment with staff and volunteers across the world
- Fully paid medical, dental and vision coverage for employees and their eligible families (yes, fully paid premiums!)
- The Wellness Program provides reimbursement for mind, body and soul activities such as fitness memberships, babysitting, continuing education and much more
- The 401(k) retirement plan offers matched contributions at 4% of annual salary
- Flexible and generous time off - vacation, sick and volunteer days, plus 19 paid holidays, including the last week of the year
- Family friendly! 100% paid new parent leave for seven weeks plus an additional five weeks for pregnancy, flexible options to phase back in after leave, and a fully equipped lactation room
- For those emergency moments - long- and short-term disability, life insurance (2x salary) and an employee assistance program
- Pre-tax savings plans for health care, child care, elder care, public transportation and parking expenses
- Telecommuting and flexible work schedules available
- Appropriate fuel for thinking and coding (aka, a pantry full of treats) and monthly massages to help staff relax
- Great colleagues - diverse staff and contractors speaking dozens of languages from around the world, fantastic intellectual discourse, mission-driven and intensely passionate people