Monitoring Considerations
Monitorama, 2013
John Allspaw, SVP, Technical Operations
Sunday, August 4, 13
I want to warn you that I will lift references from various sources this morning, and I’ll point to those further readings when I post the slides. Feel free to view those readings as HOMEWORK. Unsurprisingly to anyone who knows me, a large number of them will be in the field of Human Factors and Safety. WHO HERE HAS EVER WRITTEN MONITORING SOFTWARE? (alerts, dashboards, graphs, metrics collection, analysis, display, etc.)
“In the long term, Operations as a science needs to be elevated.” Chris Brown, Velocity London, 2012
We are at an interesting time in our field. We are still naive. We express indignation in terse remarks about our challenges. We also believe that certainty is something we can attain through the use of technology alone. This makes the field of web engineering as a whole ADORABLE.
Dr. Richard Cook, Velocity US 2012 http://www.youtube.com/watch?v=R_PDc0HFdP0
Dr. Cook explains how the research done in Human Factors and Systems Safety is relevant to the operation of web infrastructures. “Anytime you find a world in which you have high consequences, high-tempo operations, time pressure, and lots of complexity...and people are called upon to manage that, you’re going to have these kinds of issues arise.” Aviation, patient safety, military, power generation and distribution, space travel, etc.: these fields are attractive because we see something in them that is familiar. While we have an opportunity to take ADVANTAGE of LESSONS LEARNED in other fields of high-tempo/complexity/consequences, it behooves us to think about how we are DIFFERENT from those fields. We also have an opportunity to SIDESTEP some of the quagmires those fields have found themselves in. This talk is a tiny effort in that direction.
LANGUAGE
In order to support this, I will argue that we need to start paying attention to our language.
1. OTHER DOMAINS ALREADY HAVE A LEXICON, WE CAN BORROW SOME TERMS FROM THEM
2. How we discuss our challenges can play a very large role in how we surmount them.
There are a number of concepts, words, and ideas that need to enter our lexicon, especially when it comes to monitoring and the challenges that come with making sense of where, what, how, and why complex systems behave the way they do.
BETTER QUESTIONS
One of the OTHER things that has become clear to me is that as a field, we need to ask BETTER QUESTIONS instead of quickly jumping to CORRECT ANSWERS or SOLUTIONS. ASKING TERRIBLE QUESTIONS WILL GUARANTEE TERRIBLE SOLUTIONS. I’m increasingly convinced that the road to progress on such a broad and complicated topic as monitoring is paved with BETTER QUESTIONS, not NEWER TOOLS. So you may hear me asking some questions today. They may or may not be good questions, but I’ll take a stab at it anyway.
DOWN and IN
As the years go by and we see the continued decline of storage prices and the explosion of accessible processing power, we have an ever-expanding ability to zoom in deeply on the ways servers and services talk to each other and process information. WE CAN ZOOM IN ON THE RELATIONSHIPS and BEHAVIORS of SEEMINGLY DISPARATE PIECES OF DATA... AND WE CAN DISCOVER AND DETECT DISRUPTIONS IN SOMETIMES SURPRISING PLACES. THIS IS INTERESTING. BUT IT IS ALSO WOEFULLY INCOMPLETE IF WE ARE TO MAKE ANY PROGRESS IN OPERATIONS.
UP and OUT
...it is INCOMPLETE because as we ZOOM OUT, what we find is a much-ignored environment which includes one of the most powerful CONTEXT-SENSITIVE and INCREDIBLY ADAPTIVE anomaly detection and response agents in the world: HUMANS
Do we have ANOMALY DETECTION problems? Certainly. One can argue (I will, if you’d like, later at the bar) that we will ALWAYS have them. BUT: What I’m interested in is NOT how software can be used to detect anomalies automatically. (well, I’m interested, but I don’t doubt that you all will continue to get better at it)
... It is how people navigate this boundary between themselves and the machines they work with. The BOUNDARY between humans and machines, as we observe our use of tools, is a focus IN and OF ITSELF. If we have any hope of making progress in monitoring complex systems, we must take this boundary into account.
BUT ABOUT HUMANS: A couple of observations with respect to tools and monitoring in general.
1. We don’t use a single tool to gain insight into the architectures we build. And we will not.
2. Teams of people are the NORM, which means communication and coordination become as important as (if not more important than) surfacing anomalies themselves.
3. We bring our BIASES, EXPECTATIONS, TRUST, and PERCEPTIONS to the table. No tool or piece of automation will change that.
4. Understanding the breakdowns at these boundaries between people and machines should be a part of how we approach the design of tools and organizational behaviors.
LESS CODE, MORE PSYCHOLOGY
SPECIFICALLY: ALGORITHMS ALONE WILL NOT DELIVER US TO A BETTER AND SAFER PLACE.
OODA Loop: Observe, Orient, Decide, Act
credit: http://blog.b3k.us/ooda.html
WHO IS FAMILIAR WITH Col. Boyd’s OODA Loop? Observation and orientation are places where we can look to make progress. When we get alerted and look at dashboards, graphs, and logs, we’re looking to make sense of the past and project into the future. NOTE: Observe and Orient are not Unix commands, they are HUMAN ACTIVITIES.
We need to understand how people make sense of what is going on
SO: Writing code to TELL COMPUTERS WHAT TO LOOK AT is quite different from making sure that the code’s human supervisors are equipped or aided in what to look at when an alert goes off. Understanding how people make sense of what is going on (in diagnosis? In planning? In response? In control?) is just plain HARD.
We need to understand how normal work is getting done by normal people in normal situations.
If we don’t understand how people consume, adapt to, work around, and make use of tools under “normal” operating conditions, how can we have confidence that our designs will perform under uncertain or escalating scenarios?
Work As Imagined
Work As Done
Our ideas about how we THINK we work guide our design decisions. But there is a gap between how we think we work and how we actually work. How large is this gap? How will we know when it’s too large?
Where is design?
“The system should therefore be designed so that human adaptation is ENHANCED.” Erik Hollnagel, Expertise and Technology: Cognition & Human-Computer Cooperation, 1995
Design thought should be in tools, displays, controls, and processes. What do we have to work with, though? “It is the expertise of the human operator that makes it possible to adapt the performance of the joint system, in real time, to unexpected events and disturbances. Every working day, across the whole spectrum of human enterprise, a large number of near-misses are prevented from turning into accidents only because human operators intervene...”
Whether we know it or not, we are ALL designers now, if we build tools intended to aid monitoring. I’m not just talking about UI and garden-variety HCI work, but those topics should be considered table stakes.
Where is design?
http://www.perceptualedge.com/articles/visual_business_intelligence/time_on_the_horizon.pdf
VISUAL PERCEPTION and UI approaches are integral to our field, so we should try to understand them as deeply as we can. Armed with the knowledge that every element of design can (and will) be misused (like these Horizon Graphs), we are left with a dilemma: How can we understand what can augment human capabilities without getting in the way, and without having to first restart our careers as Human Factors experts? WE FAKE IT UNTIL WE MAKE IT
Principles of Display Design
• Principle of information need
• Principle of legibility
• Principle of display integration/proximity
• Principle of pictorial realism
• Principle of the moving part
• Principle of predictive aiding
• Principle of discriminability: status versus command
Wickens, Lee, Liu, Becker: An Introduction to Human Factors Engineering
Here is another great pointer on display design, from “AN INTRODUCTION TO HUMAN FACTORS ENGINEERING”.
Cognition In The Wild
“It is notoriously difficult to generalize laboratory findings to real-world situations.”
So let’s leave design for a moment and talk about how we can VALIDATE our design choices. We CANNOT hope to understand how people behave in real-world scenarios BY USING OUR IMAGINATION alone. How many of you work at a company where funnel or clickstream analysis is being done? How many of you have done clickstream or funnel analysis on your monitoring dashboards, graphs, and displays? What sort of information might we find when we gather data on how people navigate metric data during varying scenarios?
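One way to start gathering that data, as a minimal Python sketch: count which dashboard people move to next within a session. The event format (session id, timestamp, dashboard name) and all names here are hypothetical; they are assumptions for illustration, not taken from any particular tool.

```python
from collections import Counter


def transition_counts(events):
    """Count dashboard-to-dashboard transitions within each session.

    events: iterable of (session_id, timestamp, dashboard_name) tuples.
    This log format is a hypothetical assumption for illustration.
    """
    by_session = {}
    for session_id, timestamp, dashboard in events:
        by_session.setdefault(session_id, []).append((timestamp, dashboard))

    counts = Counter()
    for views in by_session.values():
        views.sort()  # order each session's views by timestamp
        # Pair every view with the one that followed it.
        for (_, src), (_, dst) in zip(views, views[1:]):
            counts[(src, dst)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        ("a", 1, "overview"), ("a", 2, "db"),
        ("b", 1, "overview"), ("b", 2, "web"), ("b", 3, "db"),
    ]
    for (src, dst), n in sorted(transition_counts(sample).items()):
        print(f"{src} -> {dst}: {n}")
```

Comparing these counts between quiet periods and outages is one rough way to see whether people navigate the displays the way the designer imagined.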
ALERT DESIGN
- Who has ever gotten a page and ignored it? Endsley: At a safety expert conference, in a 300-person hall, only 3 people got up for a fire alarm.
- How many alerts were received in the past week that were not actionable (no human action was required)?
- How many alerts were received in the past week as a result of known work being done, but alerts were not silenced during that period?
- How many alerts were received as a result of a previously silenced alert (because work was being done) that was mistakenly un-silenced?
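These questions can be answered from an alert log if someone annotates each alert after the fact. A minimal Python sketch, assuming each alert is a dict with hypothetical boolean keys `actionable` and `during_known_work` (both names are invented for illustration):

```python
from collections import Counter


def audit_alerts(alerts):
    """Tally a week's alerts by the questions asked above.

    alerts: iterable of dicts with hypothetical boolean keys
    'actionable' and 'during_known_work', filled in by a human reviewer.
    """
    tally = Counter()
    for alert in alerts:
        tally["total"] += 1
        if not alert.get("actionable", False):
            tally["not_actionable"] += 1
        if alert.get("during_known_work", False):
            tally["during_known_work"] += 1
    return tally
```

The counting is trivial; the work is in the annotation, which requires asking the people who received the alerts.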
Jack Garman, flight controller, NASA Mission Control, Apollo Program (Murray and Cox 1990)
“A program alarm could be triggered by trivial problems that could be ignored altogether. Or it could be triggered by problems that called for an immediate abort. How to decide which was which? ‘We wrote ourselves little rules like “If this alarm happens and it only happens once, don't worry about it. If it happens repeatedly, but other indicators are okay, don't worry about it.”’”
Operator at the Three Mile Island nuclear power plant, interviewed by the official inquiry following the TMI accident (Kemeny 1979)
“I would have liked to have thrown away the alarm panel. It wasn't giving us any useful information.”
Physician, explaining how they respond to a nuisance alarm on a device in the operating room (Cook, Potter, Woods and McDonald 1991)
“When the alarm kept going off then we kept shutting it [the device] off [and on] and when the alarm would go off [again], we’d shut it off.” “... so I just reset it [a device control] to a higher temperature. So I kinda fooled it [the alarm]...”
SIGNAL DETECTION THEORY
- Too sensitive, and you’ll get false alarms
- Not sensitive enough, and you’ll get missed alarms
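To make the sensitivity trade-off concrete, here is a minimal Python sketch that scores a single threshold against labeled history. The sample values and the idea of having a reliable "was this a real problem" label are assumptions for illustration; in practice that label is often the hard part.

```python
def alarm_outcomes(samples, threshold):
    """Classify each sample into the four signal-detection outcomes.

    samples: list of (metric_value, was_real_problem) pairs.
    The alarm "fires" when metric_value >= threshold.
    """
    outcomes = {"hits": 0, "misses": 0, "false_alarms": 0, "correct_rejections": 0}
    for value, was_real_problem in samples:
        fired = value >= threshold
        if fired and was_real_problem:
            outcomes["hits"] += 1
        elif fired:
            outcomes["false_alarms"] += 1
        elif was_real_problem:
            outcomes["misses"] += 1
        else:
            outcomes["correct_rejections"] += 1
    return outcomes


if __name__ == "__main__":
    # Hypothetical history: (metric value, was there a real problem?)
    history = [(50, False), (60, False), (70, False), (75, True),
               (80, False), (85, True), (95, True)]
    for threshold in (65, 90):
        print(threshold, alarm_outcomes(history, threshold))
```

With this made-up history, the lower threshold catches every real problem at the cost of false alarms, and the higher threshold removes the false alarms at the cost of misses. No threshold removes both unless the two populations do not overlap.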
ALERT DESIGN
Mica Endsley, Designing for Situation Awareness
What about the context people are in when they experience a FALSE ALERT? Or a MISSED ALERT?
[Figure: the cognitive processing of an alarm signal. Labels: Alarm Signal, Other Situational Information, Expectancies, Past History, Mental Model, Interpretation, Integration, Decision, Response. Source: Designing for Situation Awareness, Mica Endsley]
The cognitive processing of an alarm signal. When we DESIGN ALERTS, we HAVE to think about the various ways that the ALERT could be interpreted or acted on. Oftentimes, we will PUNT on aiding the operator with CONTEXT.
Critical Care & Anesthesiology
• Monitors & alarms designed to “never miss”
• 566 deaths reported related to alarms (2005-2008)
• Most associated with the silencing function
• ECRI’s #1 health technology hazard, 2012 & 2013
And you have complaints about Nagios’ “set downtime” feature?
The Emergency Care Research Institute (ECRI) recently identified alarms as the “number one health technology hazard” for 2012.
ALERT DESIGN: Confirmation
- Because false alarms are a problem, people will spend time not reacting to an alert, but confirming that the alert is legitimate.
- Pilots delay responding to GPWS (Ground Proximity Warning System) 73% of the time, because they’re looking out the window to confirm it’s true, and how true it is.
What are ways we can SUPPORT CONFIRMATION or VALIDATION in our alert design?
ALERT DESIGN: Expectancy
- People’s expectancies can also affect their interpretation of alerts.
- In many cases, people EXPECT the alert to go off, as the result of their own actions.
- In a study in 2001, 6% of operating room alarms were found to be expected or anticipated.
- This can become a nuisance, and further degrade the trust in the alerts.
- Example: disk space alerts that happen during a backup, and then recover.
- Example: someone on the team doing work, and not silencing the alerts temporarily.
BONUS: when the time period for which an alert was silenced passes, and the condition isn’t acceptable yet (downtime expiring).
What are ways that we could SUPPORT EXPECTANCY in our alert design?
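One blunt way to carry expectancy into an alert, sketched in Python: tag each firing alert by whether it falls inside a silence window, just after one expired, or outside any. The `Silence` record, the `grace` period, and the label strings are all hypothetical, invented for illustration and not the behavior of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Silence:
    """A hypothetical record of planned work that silences one check."""
    check: str
    start: float   # seconds since some epoch
    end: float
    reason: str


def classify_alert(check, fired_at, silences, grace=300.0):
    """Return (label, reason) describing how expected an alert is.

    'expected'        : fired inside a silence window for this check
    'silence_expired' : fired within `grace` seconds after a window ended
    'unexpected'      : no matching silence
    """
    for silence in silences:
        if silence.check != check:
            continue
        if silence.start <= fired_at <= silence.end:
            return "expected", silence.reason
        if silence.end < fired_at <= silence.end + grace:
            return "silence_expired", silence.reason
    return "unexpected", None
```

The "silence_expired" label covers the BONUS case above: the recipient sees that the alert follows known work whose downtime just ran out, rather than having to reconstruct that from memory.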
ALERT DESIGN
• Signal:Noise can be difficult
• Easy to err on the side of more false alarms
• Decay in trust
• Origins: Undetectable conditions
- Signal:Noise can be difficult to get right
- General view: err on the side of too many false alarms. This ignores the detrimental effect they have on humans.
- A 1998 study of new ATC systems reported missed alerts at 0.2% and false alarm rates at 65%.
- Underlying false alerts: not the functioning of algorithms themselves, but the CONDITIONS AND FACTORS THAT THE ALARM SYSTEMS CANNOT DETECT OR INTERPRET
Ex: Cincinnati Airport. The rising terrain of a riverbank leading up to a runway causes an alarm, because the system can’t detect that the terrain will plateau at the runway. Pilots familiar with the airport ignore the alarms.
Directed Attention
• Attention focusing
• Attention switching
• Dynamic prioritization
We work in a COGNITIVELY NOISY WORLD, even when there is NOT an outage going on. Alerts are ESSENTIALLY ATTENTION DIRECTORS. The main challenge for DYNAMIC FAULT MANAGEMENT (HF term) in design is to support:
- ATTENTION FOCUSING
- ATTENTION SWITCHING
- DYNAMIC PRIORITIZATION
By getting to know how human attention works (and its relationship to context, perception, etc.), we can hope to design better alerts.
Interrupts AND Underspecification
1. “Here is the data I want you to see”
2. “Here is why I think you would find it interesting”
An alert is essentially an INTERRUPT. TWO STATES:
1 - HERE IS THE DATA I WANT YOU TO SEE
2 - HERE IS WHY I THINK YOU WOULD FIND IT INTERESTING
What can we do to support #2?
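One small step toward supporting #2, as a Python sketch: make the designer's rationale a required field of the alert, so that an alert cannot be defined without saying why a person should care. The `Alert` class and its field names are hypothetical, invented here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    """A hypothetical alert that carries both the data and the reason."""
    check: str
    observed: float
    threshold: float
    why: str                                     # required: the designer's rationale
    recent: list = field(default_factory=list)   # optional recent values for context

    def render(self):
        lines = [
            f"{self.check}: observed {self.observed} (threshold {self.threshold})",
            f"why this matters: {self.why}",
        ]
        if self.recent:
            lines.append("recent values: " + ", ".join(str(v) for v in self.recent))
        return "\n".join(lines)
```

A static rationale is still underspecified, since it cannot know the recipient's current context, but it at least records what the alert's author was thinking.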
Paradox Of Directed Attention
An alert is essentially an interruption to everyday work, and there is a paradox at the heart of DIRECTED ATTENTION.
1. We are always busy!
2. Shifting attention has a very real cost!
3. Not all signals are worth paying attention to; context-sensitivity will always vary
4. So how can you SKILLFULLY IGNORE a SIGNAL that should NOT SHIFT YOUR ATTENTION WITHOUT first processing it....IN WHICH CASE IT HASN’T BEEN IGNORED.
“Given that the supervisory agent is loaded by various other task related demands, how does one interpret information about the potential need to switch attentional focus without interrupting or interfering with the tasks or lines of reasoning already under attentional control. We can state this paradox in another way: how can one skillfully ignore a signal that should not shift attention within the current context, without first processing it -- in which case it hasn't been ignored.” - David Woods
David Woods has suggested some ways to break this paradox; he calls the idea PREATTENTIVE REFERENCE. I’ll let you discover his suggestions on your own.
Directed Attention
Picking up on subtle early indications of a fault
Sorting through an avalanche of data
This idea of an alert DIRECTING OUR ATTENTION can exist in two views: SORTING THROUGH AN AVALANCHE or PICKING UP SUBTLE/EARLY INDICATIONS.... So....which is it? IT CAN BE BOTH! “The critical point is that the challenge of fault management lies in sorting through an avalanche of raw data -- a data overload problem. This is in contrast to the view that the performance bottleneck is the difficulty of picking up subtle early indications of a fault against the background of a quiescent monitored process.”
Context Sensitivity
The background and context in which a SIGNAL arrives can play a huge role in how it can HELP or HINDER us.
- If the background is one of QUIET, contrast is HIGH. <- this is what most designers plan for
- If the background is ONGOING DIAGNOSIS, then the SIGNAL can SUPPORT/CONTRADICT an existing hypothesis
- If the background is EXECUTING A RESPONSE, then the SIGNAL can cue that the RESPONSE is WRONG or INCOMPLETE.
In any case, the ALERT’s MEANING will change as CONTEXT and BACKGROUND change.
Data Overload
This is simply a tough problem. There are approaches to solve it, but none of them to date are effective given the rate at which new pieces of data are being collected and stored. There is significant agreement among those who study data overload phenomena that the critical piece to understand is CONTEXT SENSITIVITY. Some HF researchers have pointed at something that may help reduce the effects of data overload: Depicting RELATIONSHIPS between data in a known FRAME of REFERENCE, as opposed to the raw data. What can we do as designers to aid surfacing those relationships?
How have I taken the OPERATOR into account?
PEOPLE use monitoring tools. Arguably, MACHINES use monitoring tools we build, as well. But only PEOPLE can adapt and improvise with a given tool outside of the original intentions of its designer.
Am I hurting or helping:
• Data overload or underload?
• Salience?
• Directed attention?
• Interruptibility?
When we design alerts and monitoring tools, we should be asking these questions. In addition: HOW WILL WE KNOW WHEN THIS DESIGN WOULD HURT those things?
Joint Cognitive Systems
One final thought: what if, instead of the view that the BOUNDARY is a large barrier to be hurdled only by our writing increasingly complex code...we view that boundary as a place for an actual cooperative RELATIONSHIP?
Joint Cognitive Systems
What if we viewed an alerting system as a PARTNER, instead of a subordinate?
What if we viewed an alerting system as a PARTNER, instead of a subordinate or otherwise dumb messenger delivering news to us? What does the world look like if we design alerts to COOPERATE with us? If TRUST in alerting systems is such a big deal.... WHAT can we learn from how HUMANS learn to trust each other, and let that influence our design decisions? In other words: how can we design alerts that SUPPORT our confirming their legitimacy, or our expectations of when an alert will fire? Is context-sensitivity part of this? We see some blunt versions of these notions:
1 - Time periods for alerts, so that people aren’t woken up for things that can wait until morning (the machine has been given some context about our availability to pay attention to an alert)
2 - Rough dependency relationships, so we don’t send a bazillion alerts when a known SPOF dies
What other examples can we think of, where the COMPUTERS can attempt to understand, predict, or observe US, as we work?
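The two blunt versions above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular monitoring system implements them: the `depends_on` map, the check names, and the quiet-hours range are all hypothetical.

```python
def alerts_to_send(failing, depends_on):
    """Drop alerts for checks that have a failing upstream dependency.

    failing: iterable of check names currently failing.
    depends_on: dict mapping a check to the checks it depends on
    (a hypothetical, hand-maintained dependency map).
    """
    failing = set(failing)

    def has_failing_ancestor(check, seen=()):
        for parent in depends_on.get(check, ()):
            if parent in seen:
                continue  # guard against cycles in the map
            if parent in failing or has_failing_ancestor(parent, seen + (check,)):
                return True
        return False

    return sorted(c for c in failing if not has_failing_ancestor(c))


def should_page_now(can_wait_until_morning, hour, quiet_hours=range(0, 8)):
    """Hold alerts that can wait when the hour (0-23) falls in quiet hours."""
    return not (can_wait_until_morning and hour in quiet_hours)


if __name__ == "__main__":
    deps = {"web1": ["switch"], "web2": ["switch"], "db": ["switch"], "api": ["db"]}
    print(alerts_to_send(["switch", "web1", "web2", "db"], deps))
    print(should_page_now(can_wait_until_morning=True, hour=3))
```

Both functions give the machine a little context about us: what we already know (the SPOF is down) and when we are available. Neither observes or predicts anything about the people involved, which is the harder question the slide asks.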
The End
My hope is that I’ve been able to ask BETTER QUESTIONS, and I can kick off this conference with food for thought. You can tell me how that food tastes at the bar later.