The Senior Site Reliability Engineer’s main focus is ensuring our SaaS applications are always up and running. They monitor the software and hardware, and resolve issues when they occur. They also build new tools and scripts to improve application performance, monitoring, and recovery. We are a distributed team spanning Santa Clara and Taipei, so good communication is key. We work very closely with the Platform development team as we have significant backend infrastructure to deal with the large amounts of data we process to fulfill customer requests.
Location
Ideally the candidate would work out of one of the following offices: Santa Clara, CA; Pleasanton, CA; Portland, OR; Chicago, IL; New York, NY. However, telecommuting from a different location is acceptable for the right candidate.
Desired Skills and Experience
Responsibilities:
* Build new tools and scripts to improve application performance, monitoring, and recovery.
* Conduct root cause analysis of production issues, including troubleshooting and debugging through very complex big data backend pipelines.
* Monitor and address Software-as-a-Service (SaaS) web application issues.
* Remotely manage distributed servers in a cloud computing environment.
* Collaborate with operations/engineering groups to implement failover, redundancy, and scalability.
* Help oversee and automate cloud infrastructure, systems scalability, and systems/network security.
* Share routine operations tasks and on-call duties. On-call duties rotate weekly and run only from 6am to 6pm Pacific.
* Make constant and proactive improvements to existing processes.
* Work with Customer Support to help debug customer issues.
Qualifications & Skills:
* 5+ years of Linux system administration (sysadmin) experience or Linux web application operations experience
* Networking experience
* Previous work with search-related technologies is a plus
* Experience working with distributed servers in a cloud computing environment is a plus
* Bachelor’s degree or above
* Computer science degree preferred
* Excellent written and verbal English skills
* Excellent troubleshooting skills
* Expert with Linux (CentOS) and command-line tools
* Experience with one of the following scripting languages: Bash, Python
* Self-motivated, with an interest in Linux and back-end matters
* Enjoys hacking on and learning new technologies, and learns quickly
* Exceptional attention to detail; plans ahead before making changes
* Deep expertise in designing, building, operating, and troubleshooting Linux-based large-scale Internet or SaaS systems
* Demonstrated ability to build resilient, scalable web-based systems that support rapid growth
* Good working knowledge of MySQL
* Experience with monitoring tools like Nagios and Ganglia
* Experience with log management tools such as the ELK Stack, Splunk, MongoDB, etc.
* Experience with big data technologies such as Hadoop and Cassandra is a plus
* Experience with JVM monitoring is a plus
* Experience with ZooKeeper is a plus
* Experience with configuration management tools like Puppet is a plus
* Experience with Amazon EC2, OpenStack, and server virtualization software (XenServer) is a plus
* Experience working with geographically distributed teams is a plus
* Coding experience is a big plus