Site Reliability Engineer, Generalist at Stack Overflow (allows remote)

Stack Overflow is growing fast, and our infrastructure needs just keep getting bigger. We’re looking for a senior sysadmin to join our team of extraordinary sysadmins and developers working on sites that see 4000-6000 hits per second during peak times.

In addition to having a passion for working with software and keeping a top 50 website on the web, you should love hardware. We love pushing hardware as hard as we possibly can. This means hands-on maintenance from time to time. We require our SREs to be good coders. We are a mixed Windows and Linux shop; you should be highly proficient in one and comfortable with the other.

At Stack Overflow we’re passionate about our technology. We own and operate our own infrastructure, and take the time to do it right. We like to stay on the cutting edge of technology, so you will always be working with or working towards using the latest and greatest there is. We get all the hardware we need for redundancy and performance.

Because we take the time to do things right, our on-call responsibilities are very light. We get paged very infrequently.

Some projects that we’ve recently completed or are working on:

Improving how we monitor service internals
Automating firmware upgrades
Improving HBase reliability
Migrating to a new CDN
Reinventing how DNS is managed
Evaluating new security and VPN technologies
Upgrading the hardware on all our Microsoft SQL Servers with zero downtime
Participating in Microsoft TAP programs (early access)

Technologies you’ll work with:

Windows Server 2012 R2 and 2016
Modern Linux distributions (we’re running CentOS 7)
HAProxy, Redis, Puppet, Elasticsearch
IIS, DFS, Multi-site AD, SQL Server 2016
Fortinet Firewalls, Cisco Routers, Switches, HSRP / Keepalived / BGP
PowerShell, C#, Go, Bash, Python

What you’ll do:

Maintain the services and infrastructure platform used by the Stack Overflow websites
Help us handle traffic of 4000 hits/sec and plan for growth to 10,000 hits/sec
Take on big projects from inception to deployment
Coordinate daily with a top-notch team of sysadmins and developers
Handle alerts on all parts of our infrastructure as part of a 24x7 on-call rotation (approximately 1 week out of 5)
Be awesome and teach others to do the same by blogging about it

Desired Skills and Experience

Expertise in either Windows or Linux environments (and exposure to both)
Knowledge of programming beyond scripting (we use Go, C#, and others)
Experience working hands-on with server-class hardware (we are a Dell shop)
Basic understanding of networking: the HTTP protocol, how load balancers work, IP addressing (we use HAProxy, Fastly/Varnish, Keepalived, IIS)
Experience with a configuration management system (we use Puppet)
A track record of taking on challenges and delivering thorough, stable, and maintainable systems
A track record of getting projects done in a timely manner
The ability to work well in a team
A habit of documenting as you go, not at the end of a project
You live near our Denver or Jersey City datacenters
You have experience with HBase system administration
A security-oriented mindset
Experience in a SOX or PCI environment
Experience with Azure