What is Incident Response? Procedures and Confidence
Important Questions How do we proceed? • What is wrong? • Who is impacted? • When do we communicate to a wider audience? • Is the issue resolved?
Incident Response Models Tactical Approaches
Types of Incident Response “Ping Ops” Model
What do we do? Ping Ops
Types of Incident Response “Product State” Model
Heroku’s Incident Response State Machine
Success! Built a lot of confidence in our incident response capabilities. If you want to read more, check out the Heroku blog.
Incident Response @ Simple
A Major Incident Processor Conversion
What did we learn? More than fits on this slide • Zero confidence in our new system due to timeline • Our response was almost solely focused on the engineering side of the company • Our communication during the event reached only a small number of employees • We didn’t have clearly defined methods for answering the important questions
Communication and Coordination Across the Company
“Product State” Model Let’s try this
How Complex is the Product? Important for Planning
Not So Simple Product Many features to consider
• We have partners
• We have more than one product:
  • Activity
  • Instant (money transfer)
  • Goals
  • And more…
• Customers interact via web, mobile, and ATMs
• Requirements: Risk Management and Security
Not So Simple Disruptions What happens when things go wrong? • ACH Transfers • Check Review • Card transactions • ATM transactions • Direct Deposits • Mobile, Web • Onboarding
Many Facets Less Definition
Start Over What do we need?
What do we want? • Any system MUST NOT be focused solely on engineering • Any system MUST NOT be a burden on responders (or the company) to implement and utilize • Any system MUST be usable, and in many cases managed, by teams across the company • Any system MUST increase confidence in the response to an incident • Any system MUST have a procedure for a dynamic team built to handle any severity of incident
Incident Complexity How do we determine general impact? • How many teams are impacted? • Is there an immediate impact to internal customers? • Is there an immediate impact to external customers? • Is there an immediate impact to our partners?
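The four impact questions above can be captured as a small record that a first responder fills in. This is only a sketch: the class and field names are hypothetical, and because the slides do not say how the answers map to a complexity level, the code does not attempt that mapping.

```python
from dataclasses import dataclass


@dataclass
class ImpactAssessment:
    """Answers to the four impact questions (field names are hypothetical)."""
    teams_impacted: int
    internal_customers: bool
    external_customers: bool
    partners: bool

    def affected_audiences(self) -> list:
        """List the audiences that are immediately impacted."""
        audiences = []
        if self.internal_customers:
            audiences.append("internal customers")
        if self.external_customers:
            audiences.append("external customers")
        if self.partners:
            audiences.append("partners")
        return audiences

    def summary(self) -> str:
        """One-line description suitable for an incident issue."""
        audiences = ", ".join(self.affected_audiences()) or "no one yet"
        return f"{self.teams_impacted} team(s) impacted; immediate impact to: {audiences}"


assessment = ImpactAssessment(
    teams_impacted=3,
    internal_customers=True,
    external_customers=False,
    partners=True,
)
print(assessment.summary())
```

Keeping the answers as plain data leaves the choice of complexity level to the responder's judgment.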
What is an Incident? Planned & Unplanned
“Any disruption to normal business activity.” - THE DEFINITION OF AN INCIDENT
Types of Incident Response “Complexity” Model
Incident Complexity Framework Our method for determining the base level response for any disruption.
Complexity Levels • Five levels • Each complexity defines the expectations around response and resolution • Clear procedures for communicating to both internal and external customers • Determines assignments in the incident organization • Organization roles enable a feedback loop • As complexity increases, these expectations may include post-incident procedures
Incident Complexity Three Properties • Incident complexity can never decrease • Incident complexities require an owner • Incident complexities are globally recognized
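Two of the three properties above (complexity never decreases, and every complexity has an owner) can be enforced in code. The sketch below is an illustration, not Simple's implementation: the `Incident` class, its fields, and the assumption that levels are numbered 1 to 5 with higher meaning more complex are all hypothetical.

```python
from dataclasses import dataclass, field

# Assumption: five levels numbered 1..5, where a higher number is more complex.
MIN_LEVEL, MAX_LEVEL = 1, 5


@dataclass
class Incident:
    title: str
    owner: str
    level: int
    history: list = field(default_factory=list)

    def __post_init__(self):
        if not self.owner:
            raise ValueError("an incident complexity requires an owner")
        if not MIN_LEVEL <= self.level <= MAX_LEVEL:
            raise ValueError(f"level must be between {MIN_LEVEL} and {MAX_LEVEL}")
        self.history.append(self.level)

    def escalate(self, new_level: int, owner: str = "") -> None:
        """Raise the complexity; lowering it is rejected."""
        if not MIN_LEVEL <= new_level <= MAX_LEVEL:
            raise ValueError(f"level must be between {MIN_LEVEL} and {MAX_LEVEL}")
        if new_level < self.level:
            raise ValueError("incident complexity can never decrease")
        self.level = new_level
        if owner:
            # Ownership may change hands when the incident escalates.
            self.owner = owner
        self.history.append(new_level)


incident = Incident(title="ACH file not processing", owner="cr-tech", level=2)
incident.escalate(4, owner="integration")
print(incident.level, incident.owner, incident.history)
```

Rejecting a decrease outright, rather than silently ignoring it, makes the rule visible to whoever tries to downgrade an incident.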
Incident Command Organization
• Command:
  • Responsible for management of the incident and response team.
  • Lead: Incident Commander
• Communications:
  • Responsible for communications to and from internal and external customers.
  • Branches: Marketing, Customer Relations
  • Lead: Incident Signaller
• Operations:
  • Responsible for the work to resolve the incident
  • Branches: Backend, Frontend, Integration, Infra
  • Lead: Incident Engineer
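The organization above is small enough to write down as data. The following is a minimal sketch with hypothetical names (`Section`, `ORGANIZATION`, `assign_leads`); the section names, responsibilities, branches, and lead roles come from the slide, while the rule that every lead role must be filled is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    name: str
    responsibility: str
    lead_role: str
    branches: tuple = ()


ORGANIZATION = (
    Section("Command",
            "Management of the incident and response team",
            "Incident Commander"),
    Section("Communications",
            "Communications to and from internal and external customers",
            "Incident Signaller",
            ("Marketing", "Customer Relations")),
    Section("Operations",
            "The work to resolve the incident",
            "Incident Engineer",
            ("Backend", "Frontend", "Integration", "Infra")),
)


def assign_leads(people_by_role: dict) -> dict:
    """Return {section name: person}, failing if a lead role is unfilled."""
    missing = [s.lead_role for s in ORGANIZATION if s.lead_role not in people_by_role]
    if missing:
        raise ValueError("unassigned lead roles: " + ", ".join(missing))
    return {s.name: people_by_role[s.lead_role] for s in ORGANIZATION}


leads = assign_leads({
    "Incident Commander": "alice",
    "Incident Signaller": "bob",
    "Incident Engineer": "carol",
})
print(leads)
```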
Does it work? Incredibly Well
Check Incident It might work like this… • Checks team finds a failure • Sent to Customer Relations Technical Team (CR Tech) • CR Tech creates an incident issue • Initial complexity assigned here (best guess) • Notifies Infrastructure Engineering • Roles assigned: IC, IS (standby), IE • Work to resolve issue, and potential escalation
Something More Impactful? A Look at a Larger Incident • ACH Memo file arrived, but isn’t processing • First responder assigns complexity, creates issue • Integration trained as Level 4 Incident Commanders • Integration brings on an engineer (alert or IRC) • Integration assigns an IS • IS begins work on communications • Keeps track of customer contact • Messaging for Status and CR reps • Incident Organization iterates on issue/comms
What needs work? A few items to be worked on
Incidents Are Common 156 identified incidents since December 2014
• Level 2 & Level 4 most common
• Three Level 5s: 2 planned, 1 unplanned
• One week in April we identified 11 incidents:
  • Including 4 Level 3s, 3 Level 4s
  • Participation from 11 teams
  • One Level 4 spanned 8 teams
How do I use this? Where does one start?
Identify Impact Least severe, most severe
• Find your most severe disruption:
  • What procedures do you want completed?
  • How do you want to communicate?
  • What is the impact? Who’s affected?
• Now do the same for least severe
• Find the commonalities
• Fill in the gaps (Level 2 through Level 4)
• Train people to handle coordination and communication
• Find what works, but avoid making response a burden