Site Reliability Engineer, Generalist at Stack Overflow (allows remote)
Stack Overflow is growing fast, and our infrastructure needs just keep getting bigger. We’re looking for a senior sysadmin to join our team of extraordinary sysadmins and developers working on sites that see 4000-6000 hits per second during peak times.
In addition to having a passion for working with software and keeping a top 50 website on the web, you should love hardware. We love pushing hardware as hard as we possibly can. This means hands-on maintenance from time to time. We require our SREs to be good coders. We are a mixed Windows and Linux shop; you should be highly proficient in one and comfortable with the other.
At Stack Overflow we’re passionate about our technology. We own and operate our own infrastructure, and take the time to do it right. We like to stay on the cutting edge of technology, so you will always be working with or working towards using the latest and greatest there is. We get all the hardware we need for redundancy and performance.
Because we take the time to do things right, our on-call responsibilities are very light. We get paged very infrequently.
Some projects that we’ve recently completed or are working on:
- Improving how we monitor service internals
- Automating firmware upgrades
- Improving HBase reliability
- Migrating to a new CDN
- Reinventing how DNS is managed
- Evaluating new security and VPN technologies
- Hardware upgrades for all our Microsoft SQL Servers with zero downtime
Technologies you’ll work with:
- We’re involved in Microsoft TAP programs (early access)
- Windows Server 2012 R2 and 2016
- Modern Linux distributions (we’re running CentOS 7)
- HAProxy, Redis, Puppet, Elasticsearch
- IIS, DFS, Multi-site AD, SQL Server 2016
- Fortinet Firewalls, Cisco Routers, Switches, HSRP / Keepalived / BGP
- PowerShell, C#, Go, Bash, Python
What you’ll do:
- Maintain the services and infrastructure platform used by the Stack Overflow websites
- Help us handle traffic of 4000 hits/sec and plan for growth to 10,000
- Take on big projects from inception to deployment
- Coordinate daily with a top-notch team of sysadmins and developers
- Handle alerts on all parts of our infrastructure as part of a 24x7 on-call rotation (approximately 1 week out of 5)
- Be awesome and teach others to do the same by blogging about it
Desired Skills and Experience
- Expertise in either Windows or Linux environments (and exposure to both)
- Knowledge of programming beyond scripting (we use Golang, C#, and others)
- Experience working hands-on with server-class hardware (we are a Dell shop)
- Basic understanding of networking: the HTTP protocol, how load balancers work, IP addressing (we use HAProxy, Fastly/Varnish, Keepalived, IIS)
- Experience with a configuration management system (we use Puppet)
- A track record of taking on challenges and delivering thorough, stable, and maintainable systems
- A track record of getting projects done in a timely manner
- Works well in a team
- Documents as you go, not at the end of a project
- You live near our Denver or Jersey City datacenters
- You have experience with HBase system administration
- A security-oriented mindset
- Experience in a SOX or PCI environment
- Experience with Azure