Desired Skills and Experience
- Create a resilient, highly operable production environment in AWS with 24x7 availability, high performance, scalability, and zero-downtime releases.
- Manage large MySQL database clusters and NoSQL systems such as Redis, DynamoDB, and Cassandra.
- Manage regional deployments and set up disaster recovery of Kafka data pipelines, systems, and stores in an AWS environment.
- Collaborate with engineers to create a continuous delivery environment and processes.
- Instrument and monitor the health and availability of services, with fault detection, alerting, triage and recovery (automated and manual).
- Work closely with Twilio’s cloud infrastructure, orchestration, and security teams to help implement company-wide security and operability initiatives and to provide tooling requirements.
- Manage performance (by benchmarking and monitoring vital metrics), plan capacity, and resolve performance problems affecting service levels.
- Write scripts and runbooks to automate procedures.
- Enable auto-scaling.
- Your background will be that of a Senior Engineer with considerable experience in a highly complex technical operations environment with cloud-based services.
- At least 5 years of experience building complex distributed systems, with a focus on reliability, high availability, performance, scalability, capacity planning, backup and recovery, business continuity planning, and automation of everything.
- Strong Amazon AWS experience in a production environment.
- Experience managing and automating the configuration of MySQL database clusters.
- Hands-on experience with cloud infrastructure technologies, including continuous integration tools, configuration management, systems monitoring and alerting tools.
- Experience with managing systems in distributed regions in the cloud or on-site.
- Adept at troubleshooting and administering Linux systems, dealing with networking issues, and fine-tuning instrumentation and alerting systems.
- Demonstrated experience with agile processes, continuous integration, test automation, and release management.
- Significant development experience in at least one modern scripting language, preferably Python.
- Preferably, experience operating a high-load data pipeline and exposure to technologies such as Kafka, Kinesis, Spark, S3, and Redshift.