Lead Site Reliability Engineer

Lead Site Reliability Engineer, Manufacturing SW/IT

Are you driven by uptime and performance? Do you love end-to-end technology operation challenges? Are you so curious about how and why things work that, if you were a cat, you would have already used 7 of your 9 lives?

Every day, our Windsor, Canada plant alone processes over 50,000 orders, and more than 80,000 on peak days. That’s hundreds of thousands of business cards, brochures, folders, pens, mugs, shirts, and many other mass-customized items. We can’t get behind. We need a Site Reliability Engineer to ensure that we have the right IT architecture in all of our plants so that no one is twiddling their thumbs. We need you to make sure that before the next release or patch of the custom software that makes the plant run, everything is rock solid and as fast as it can be. Deployment needs to go off with as little manual intervention as possible, and that is where you come in. Windsor is just one of our worldwide plants across four continents, soon to be five with our new partnership in South America.

As a Site Reliability Engineer (SRE) for our manufacturing platform, you will be responsible for the area where our manufacturing software development meets Tech Ops. We’re looking for a tech ops/IT generalist who is comfortable both in the details and at the architecture level. You might be in the middle of a conversation about tuning the end-to-end system architecture of our manufacturing software and servers, and then be working with a developer to optimize their C# code for performance. One moment you’re forcing vSphere to do things through the RESTful API that leave your mouse gathering cobwebs, while a few minutes later you may be working to determine what the performance impact will be of switching from AMD to Intel systems. You will be bouncing between the systems running our manufacturing plants in Australia, Canada, the Netherlands, or India to verify that all are optimized and ready for live patching of code without plant downtime.

With a breadth of knowledge spanning all aspects of tech ops, you will go deep into the architecture of our IT solutions in our manufacturing plants across the globe. You will work hand-in-hand with the manufacturing software dev teams and guide their choices as a champion of uptime and performance in production. You must be able to understand and incorporate other people’s views. You must be willing to change your mind when evidence points in a different direction.

We’re seeking the right kind of “lazy” engineer: one who sees something repeatable and automates it. You rely upon Infrastructure-as-a-Service offerings from our various core infrastructure teams (and potentially public clouds) for your automated deployments. You understand the consequences of the technical debt that you take on in the name of a quick fix, and you’ll work to ensure that we avoid it by building on and reusing common, maintained solutions rather than abandoned ones.

The position is in Cimpress’ Technology Operations & Support organization. Our manufacturing sites must be up whenever they are open, or a lot of operators are left twiddling their thumbs. As such, you will be an escalation point for our NOC and Problem Management teams.

Daily Responsibilities

Collaborate on the design and refinement of our software with development team partners
Operate in a DevOps fashion with your aligned development team
Coordinate initiatives across the different Tech Ops teams at all our plants
Decide how best to deploy across a worldwide tech ops infrastructure inside our manufacturing plants for uptime and performance
Research and investigate reported site reliability or performance-related issues
Troubleshoot issues and resolve them across many aspects of tech ops
Document new findings and share information in wiki and documentation stores
Ensure SLAs are met for issues during work and on-call hours by responding to tickets and pages

Required Abilities

A candidate needs 4 to 10 or more years of solid experience working end-to-end with web-based application systems in C#/.NET, application delivery controllers/load balancers, Microsoft Server administration, networks, firewalls, storage area networks, and Microsoft SQL systems.

Required Knowledge

System Administration of a Windows Server environment running on bare metal and VMware ESX
Custom C#/.NET Application Administration & Deployment
Some selection from the desired skills/knowledge/experience below
Protocol knowledge of: HTTP(S) and TCP/IP

Desired Skills/Knowledge/Experience (Any from Below)

Experience with IT systems, networks, and applications in manufacturing plants
Experience with controllers (servers) for computer-integrated manufacturing plant equipment
Experience with worldwide plant IT across multiple continents, working with local IT teams in the plants to deliver the highest levels of uptime and performance
ADC/Load balancers
DNS Technology & Systems
Scripting languages: Python, PowerShell, Perl, bash shell scripting, etc.
Deployment automation: Puppet, Chef, etc.
Web technologies such as: IIS, .NET, Java, Ruby, LAMP Stack, etc.
Cloud technologies such as: Amazon Web Services (EC2/S3/ELB/RDS/DynamoDB/etc.), Rackspace, OpenStack, Eucalyptus, etc.
Database technologies such as: MS-SQL, Cassandra, MongoDB, MySQL, PostgreSQL, etc.
Network devices/firewalls from: Cisco, Juniper, Arista, Palo Alto, Check Point, etc.
Storage technologies: SAN and NAS of various flavors from established and cutting-edge vendors
Advanced understanding of various protocols and ability to troubleshoot and tune them for maximum performance under a variety of conditions

Education/Experience

A Bachelor’s degree in Computer Science or related discipline preferred
