Site Reliability Engineer
We are Sumo Logic and we are building the Next Generation Log Management and Analytics solution, delivered as a cloud-based service. With 1000+ enterprise customers just over two years after our public launch and $160.5M in funding from the world’s leading investors (Accel, Greylock, Sequoia, Sutter Hill, and DFJ Growth), Sumo Logic is reshaping the Big Data landscape with its cloud-based machine data analytics platform. The proliferation of machine log data has the potential to give organizations unprecedented real-time visibility into their infrastructure and operations. With this opportunity come tremendous technical challenges around ingesting, managing, and understanding high-volume streams of heterogeneous data.
As a Site Reliability Engineer, you are a hybrid software/systems engineer who ensures that the Sumo Logic service runs smoothly and has capacity for future growth. You will own all of Sumo Logic’s production systems. You will embed with the various developer teams to ensure that reliability is not compromised as the system scales and new features are delivered.
Responsibilities:
- Own and scale Sumo Logic services.
- Analyze and improve the efficiency, scalability, and reliability of our backend systems.
- Create scalable alerting and auto-remediation systems.
- Share an on-call rotation backed by all our engineering teams.
- Perform advanced troubleshooting and monitoring of our systems to ensure SLA and capacity requirements are met.
Requirements:
- BS or MS in Computer Science / Engineering, or equivalent.
- Strong understanding of Unix and TCP/IP fundamentals.
- Object-oriented programming experience, for example in Python, Scala, Java or C++.
- Ability to rapidly learn new software, frameworks, open source tools and development languages.
- Experience configuring and maintaining common infrastructure such as Apache, memcached, MySQL and common NoSQL implementations.
- Strong knowledge of large-scale internet service architecture (load balancing, Hadoop, OpenStack).
- Detail oriented and systematic.
Desirable:
- Experience with performance, scalability, and reliability issues of 24x7 commercial services.
- Strong troubleshooting skills.
- Experience with regular expressions.