The problem
A large enterprise customer experienced cascading service failures: frontline users locked out by an expired SSO certificate; a third-party carrier outage dropping inbound calls intermittently; tens of thousands of customer interactions impacted; executive leadership demanding updates by the hour.
Root causes were largely outside the platform's direct control — but the risk to trust, adoption, and long-term value was extremely high. The real challenge wasn't only technical reliability. It was how humans across roles made sense of failure, coordinated response, and decided what to do next.
Approach: the incident as research
Instead of waiting for retrospective interviews, I treated the live incident as contextual inquiry — observing decisions, language, and translation work as they happened.
- Longitudinal analysis of cross-functional Microsoft Teams threads
- Behavioral observation during real-time incident bridges
- Thematic synthesis of escalation patterns and decision points
- Mapping observed behavior to personas, responsibilities, and incentives
Key insight
Service breakdowns are rarely just system failures — they are sensemaking failures.
Across every persona, the most acute pain wasn't lack of data. It was lack of clear explanation, ownership clarity, and role-appropriate guidance on what to do next. People weren't asking for more dashboards. They were asking for understanding they could act on.
Design opportunity: AI as a sensemaking layer
A single "AI Copilot" wouldn't work. Different roles needed different help at different moments:
- Frontline leaders → plain-language explanations + next steps
- Customer-facing roles → executive-ready narratives
- Support engineers → faster pattern detection + historical context
- Product / UX → synthesis across incidents, not raw logs
How I worked
Most teams treat an outage as a fire to put out and a retro to write. I treated it as the highest-fidelity research opportunity I was going to get all year, because every persona on the platform was acting at full intensity, in real time, with the stakes fully visible.
Rather than scheduling interviews after the fact — where memory smooths over the messy middle — I sat in the live Teams threads and incident bridges and took field notes the way I'd take them in a contact center. What language did each role reach for? Where did handoffs break? Whose decision got blocked waiting on whose explanation? When the incident closed, I had three days of behavioral data that no retrospective interview could have produced.
Outages are sensemaking failures, not just system failures. The research opportunity is the sensemaking — and it only exists while the incident is live.
Synthesis: from incident log to product strategy
I ran thematic synthesis across the Teams threads, the bridge transcripts, and the customer-side escalations. Codes were behavioral ("asks for an executive-ready summary," "requests next-step instruction," "translates a technical message for a non-technical stakeholder"), not sentimental.
What emerged wasn't a list of bugs. It was a map of where each persona's sensemaking broke and what kind of help would have unblocked them. That map became the persona-tiered Copilot brief — explicitly arguing against a one-size-fits-all assistant before product even considered building one.
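As a rough illustration of how behavioral codes can roll up into a per-persona map, here is a minimal sketch. The persona labels, sample observations, and function names are hypothetical and made up for this example. They are not the study's actual data or tooling, and real synthesis involved judgment that a tally does not capture.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    persona: str  # role of the person observed (hypothetical labels)
    code: str     # behavioral code assigned to the observation


# Hypothetical field notes, invented for illustration only.
OBSERVATIONS = [
    Observation("frontline_leader", "requests next-step instruction"),
    Observation("frontline_leader", "requests next-step instruction"),
    Observation("customer_facing", "asks for an executive-ready summary"),
    Observation(
        "customer_facing",
        "translates a technical message for a non-technical stakeholder",
    ),
    Observation(
        "support_engineer",
        "translates a technical message for a non-technical stakeholder",
    ),
]


def codes_by_persona(observations):
    """Count how often each behavioral code appears for each persona."""
    tally = defaultdict(Counter)
    for obs in observations:
        tally[obs.persona][obs.code] += 1
    return tally


def dominant_code(tally):
    """Return the most frequent code per persona (ties resolve arbitrarily)."""
    return {
        persona: counts.most_common(1)[0][0]
        for persona, counts in tally.items()
    }


if __name__ == "__main__":
    for persona, code in dominant_code(codes_by_persona(OBSERVATIONS)).items():
        print(f"{persona}: {code}")
```

Counting codes per persona, rather than across the whole incident, is what keeps the output pointed at role-specific help instead of a single averaged need.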
What this work earned
The findings reframed an internal conversation that had been about reliability into one about service design and AI strategy. Funding decisions for the next two release trains traced back to this synthesis. And the customer relationship, which had been at real risk, stabilized in part because the response we shaped sounded like understanding rather than apology.
What this actually shipped
- Persona-specific AI Copilot strategy grounded in observed behavior
- Service-design improvements scoped from real incident patterns
- Cross-functional alignment on what reliability work to fund next
- Reframed an operational crisis as a strategic research artifact



