MediaMath’s Media/Intelligence team is currently seeking a Site Reliability Engineer. As an SRE, you will be front and center in the effort to keep our distributed services fast and reliable, 100% of the time. Our systems range from a large, globally distributed RTB bidding platform that services more than 3 million real-time transactions a second, to clusters of user databases that host tens of billions of records, to AWS EMR clusters, and more.

Responsibilities:
* Manage the scalability, performance, and availability of MediaMath’s RTB bidding platform by solving for reliability against existing systems and services spanning the entire stack.
* Develop tools and automation to minimize delivery time and increase developer productivity.
* Participate in the design and development of new and evolving services, architecture, and performance standards.
* Own and participate in capacity planning, service performance analysis, and tuning.
* Design and provide best practices for deployment, monitoring, and alerting.
* Support team members in the development of an SOA strategy and migration path.
* Respond to and resolve emergent issues. Be on call periodically as part of a shared team.
* Provide mentorship and coaching for junior team members.
Desired Skills and Experience
The top qualification for this role, above all else, is a strong desire to be part of something big, where input is encouraged and results are rewarded.

* 4+ years of relevant work experience, including experience with high-volume, production distributed systems environments.
* Extensive working experience with Linux systems (Debian-based).
* Familiarity with cloud infrastructure, such as AWS.
* High-level shell fluency, plus one or more scripting languages (Python, Perl, or similar).
* Experience managing and deploying full-stack, distributed services.
* Experience with container technologies (Docker, Vagrant, LXC, etc.).
* Experience with system automation tools (Ansible, Chef, Puppet, SaltStack, etc.).
* Experience with monitoring, alerting, and pipeline analysis tools (Nagios, Sensu, Graphite, Riemann, Logstash, etc.).
* Excellent analytical skills, coupled with a strong sense of ownership, urgency, and drive.
* Experience with queuing/data-pipelining solutions (Storm, RabbitMQ, Amazon Kinesis, ZeroMQ, Kafka, etc.).
* Experience with SQL/NoSQL systems such as PostgreSQL, MongoDB, Redis, Cassandra, DynamoDB, etc.