AI / Generative AI MVP — Iteration 1

Azure Build Health's AI-Generated Risk Assessment

The first attempt to integrate AI into an internal Microsoft engineering system — without a chatbot. Designing for ambiguity, trust, and a team that had never worked with a designer.

Role: Lead UX Designer
Team: Azure Build Health, Copilot Engineers, PM
Type: Greenfield / Enterprise Tool
Azure Build Health Risk Assessment in Azure DevOps

Adding AI to an engineering tool — without a chatbot

Azure Build Health is an internal Microsoft tool used by Release Managers to evaluate the safety and quality of code changes before deployment. When the team decided to explore Generative AI, there was no template to follow. No established pattern. No prior art inside Microsoft for embedding AI directly into an engineering workflow like this.

I was the lead designer embedded across two cross-functional teams — the Azure Build Health engineering team and Microsoft Copilot developers. We knew AI could reduce manual effort, but not what form it should take, what users would trust, or what "good" even looked like.

The challenge

How do you introduce AI into a high-stakes engineering workflow where wrong information causes outages — without eroding user trust or replacing the human judgment that release managers rely on?

Identifying the pains

Before any design work, I conducted 8 discovery interviews (45–60 minutes each) with our core users: Release Managers. The goal was to understand their end-to-end release process, how long preparation took, and where the biggest friction points lived.

One Release Manager described how manual and time-consuming the current process is:

"Every time we go to R2D, I go commit by commit, looking at the summary to get a sense of the risk — and with that risk then I try to get a measurement of the overall risk of a given release. As you can imagine, this is a very time consuming, somewhat subjective manual process."

— MDM Release Manager

From the interviews, I identified two major themes and three core user problems:

01 Manual effort
Release Managers manually evaluate the intent and risk of every code change in a release. The process is slow and subjective, with no tooling support.
02 Lack of context
They have minimal context on each change and must rely on the PR owner to evaluate risk. Tracking down all the details takes 4–5 days on average.
03 Information gaps
Without full visibility into code quality before deployment, defective or harmful code can reach customers, leading to service outages and security risks.

Finding a place for AI

One of the first design decisions was where AI output should live. After mapping the existing Payload tab — where Release Managers already spend the majority of their time — it became clear this was the right home. It's where release information is collected, evaluated, and acted on.

Annotated Payload tab — identifying where time was being lost and where AI could add value

From there I worked through the design in layers — each addressing one of the core user problems surfaced in research:

Reducing manual toil
I designed a new section of the Payload tab where Copilot automatically generates a risk level for the release (High, Medium, or Low) based on the content of the payload. This immediately reduced the need for manual commit-by-commit review.
Reducing manual toil — automated risk evaluation screen
Evaluating risk
To give Release Managers the context they were missing on each individual PR, I designed a detailed Copilot risk assessment at the pull request level. This surfaced the "why" behind each change — reducing manual analysis and the 4–5 day wait to track down PR owners.
Evaluating risk — per-PR risk context screen
User overrides for AI output
One of the biggest design challenges was managing user anxiety around AI inaccuracies. I designed controls that let Release Managers change the AI-generated risk level for both individual PRs and the overall release score. These corrections were fed back into Copilot to improve future outputs — giving users agency while improving the model.
User overrides — risk change flow screen 1
User overrides — risk change flow screen 2
Transparency with iconography
Every AI-generated surface included clear attribution and a disclaimer that AI suggestions can be incorrect. This wasn't an afterthought — it was a deliberate design decision to build trust incrementally rather than oversell what the AI could do.
Transparency with iconography — screen 1
Transparency with iconography — screen 2

Raising UX maturity

This project required me to introduce a user-centered design process to an engineering team that had never worked with a designer before. I embedded UX methodologies into their workflow — shifting focus from what to build to why we were building it.

After release, we held a postmortem. My contributions improved the team's UX maturity across four dimensions:

Collaboration
I created consistent opportunities for engineering partners to give input during the design process. This built team-wide alignment and ensured technical constraints were integrated early — not retrofitted at the end.
User-Centricity
I shifted the team's focus from internal assumptions to validated user needs. Engineers attended user interviews directly, and we built a shared document of pain points to ground every design decision.
Scope & Prioritization
Anchoring all feature discussions to documented user pain points empowered the team to have productive trade-off conversations. In iteration 1 we focused only on the highest-impact work.
Iterative Validation
I established a practice of validating designs early and often. This reduced the cost of changes and resulted in a higher-quality product delivered on schedule.

What we measured

After release, early adopters flagged inaccuracies in Copilot-generated data. We planned a follow-up usability study to evaluate both usefulness and usability, and that study became the foundation for Iteration 2.

75% reduction in manual process time through AI-powered quality assessments
60% increase in feature usage among the target developer audience post-launch

Even with the noted inaccuracies, the feature delivered measurable value. We saw a 75% reduction in time spent on manual release review, and feature adoption among the target developer audience grew by 60% post-launch — a strong signal that the core concept resonated, even as the AI output needed refinement.
