Sr. Site Reliability Engineer

Description

Palo Alto Networks reinvented the enterprise firewall, growing from a start-up to a multibillion-dollar company. Our next leading-edge innovation is PAN’s Threat Prevention Cloud Services, leveraging the latest developments in big data, machine learning, virtualization, high-density storage, and distributed systems to enable threat analysis at a scale never seen before. Our cloud services, in conjunction with the reach of our firewalls deployed across the globe, create a data platform that no other enterprise security company has ever been able to build. The size and complexity of the data we are dealing with are in line with those of some of the largest Internet companies.

As a Senior Site Reliability Engineer at PAN, you will be responsible for provisioning, maintaining, and scaling our production services and server farms across multiple data centers for a complex and data-intensive cloud service. You will contribute to the architecture to improve scalability, service reliability, capacity, and performance. You will write automation code for provisioning and operating this infrastructure at massive scale. You will work with development and QA to build pipelines and automation for delivering and deploying production applications to this infrastructure.

We are looking for passion, curiosity, attention to detail, pride in one’s work, ownership, and ideas and opinions. If you’re a level-headed team player who cares about the infrastructure, remains calm in a crisis, collaborates cross-functionally, and easily writes code for automation, we want to talk to you.

Share ownership of a customer-facing production service at significant scale
10+ years of Unix/Linux experience, with an understanding of the shell, tools, kernel, and networking. CentOS preferred.
3+ years of experience managing 500+ servers through automation
Very comfortable writing Python code, making REST calls, and parsing JSON and log data. You don’t operate; you write code to operate.
Experience with geographically dispersed systems and architecting distributed systems for fault tolerance and manageability
Strong troubleshooting skills across the entire stack
Experience with configuration management systems (Salt preferred; Chef or Puppet OK). Jenkins pipelines a plus.
Experience with AWS EC2 and S3, and with the AWS command-line tools and APIs
Experience running large production clusters of some of: Hadoop 2.x, Kafka, Spark, HBase, Elasticsearch.
Nginx, HAProxy, and high-throughput message queue processing such as RabbitMQ a plus.
Organized and focused on building, improving, resolving, and delivering. Good communicator within and across teams who takes the lead.

Desired Skills and Experience

See application page for details