Desired Skills and Experience
- Defines, owns, and delivers the processes and tooling that support reliability, including incident management and post-incident review
- Guides and trains teams in these processes so they get maximum benefit with minimal friction
- Acts as an advocate for these processes, running meetups, “War Games” sessions, and blogging & presenting internally & externally
- Measures the business-relevant outcomes of the processes you own, including incident rate and time-to-recovery
- Experience managing major incidents and post-incident reviews at comparable technology organisations
- A track record of measurably improving reliability results across teams through your own initiatives
- Experience working with other engineering leaders as a trusted reliability subject matter expert
- Sufficient technical nous to understand complex, high-scale information systems
- Examples of your data-driven approach to process measurement and improvement
- Demonstrated abilities in program leadership and accountability generation
- Some ability to write code and implement automation that supports reliability
- A positive and enthusiastic attitude
- Project and program management experience, including goal-setting & measurement, and stakeholder management
- Experience presenting, and a desire to present, your expertise to large groups, e.g. at all-hands meetings and conferences
- Experience developing and delivering training in an organisation comparable to Atlassian
- More extensive current or former experience in software development or release management
- Formal training or qualification in incident management or post-incident reviews
- Experience and skill in data analysis and reporting (e.g. SQL, ETL systems, and data visualisation)
- Program management around incident management (IM) and post-incident reviews (PIRs)
  - Goal-setting and measurement
  - Accountability generation in a matrix organisation
  - Stakeholder management and communication (i.e. “people skills”)
- Ownership of IM & PIR process and tooling
  - Continuous process measurement and improvement
  - Internal advocacy for process excellence across many disparate teams
  - Development and delivery of IM/PIR training across the company
  - Developing supporting tooling and automation
- Analysis and reporting
  - Analysis across groups to draw valid conclusions about the drivers of reliability
  - Regular and ad-hoc reporting to key stakeholders
  - Creating and maintaining domain models, batch jobs, and reports
- Hands-on incident management and PIR
  - Manage major incidents as part of the on-call roster for our global major incident management team
  - Lead incident teams to resolve major incidents quickly and effectively
  - Drive post-incident reviews to turn failures into resilience