A Japanese SaaS company sent me a take-home assignment a few months ago. Their product runs internal approval workflows, the kind where a request moves up a chain of managers for sign-off before anything happens. They handed me 12 user interviews across 4 companies and one instruction: pick a problem worth solving, then spec the feature.
I expected the assignment to test tooling. It tested something else.
Ask most people what you need to design products in 2026 and you get a software list. An embeddings API. Cursor or Claude Code. v0 for prototypes. Figma. A vector database. That list is real, and I use most of it. But none of it was what the assignment was measuring, and none of it is what separates a usable AI feature from one that quietly erodes trust.
The tools that mattered were mental models. Here are the ones the assignment forced me to use.
Knowing which problems are not yours to solve
The 12 interviews surfaced five distinct complaints. The obvious move is to rank them by frequency and pick the loudest. I started from the other end, by ruling problems out, and ended up declining four of the five.
Two of the five looked like clean AI opportunities and were not. One was about unclear approval routing. On the surface that reads like a recommendation problem, the sort of thing a model could predict. But the routing rules already existed in the product. The real friction was implicit political pre-consensus, the unwritten step where you sound out senior people before you submit. You cannot automate that, and you should not try. Another complaint was about vague rejection feedback. Tempting to build an AI reviewer. But rejections in that culture are often political rather than quality-based, so a model trained to critique drafts would confidently give the wrong advice.
The first tool is knowing the difference between a complaint and a problem you can actually move. Half the skill in 2026 is declining to point a model at something it will make worse.
Reading the data that is already there
I picked the repetitive drafting problem. Every submitter writes the same kinds of requests over and over, and they already work around it by searching old approved requests and copy-pasting.
That workaround was the whole signal. Users were telling me, with their behavior rather than their words, that the answer was retrieval. They were doing semantic search by hand. The feature was just removing the manual steps.
There was a second layer. The company had more than 10,000 historical approved requests sitting in the system. Every one of them encoded something the org never wrote down: what had been politically acceptable to ask for, in what language, with what justification. I did not need to model the politics. The approved corpus already had it baked in. A model that learns from what got approved inherits the unwritten rules for free.
The tool here is reading the implicit data. The workaround tells you the primitive. The archive tells you what the model already knows without being taught.
Specifying behavior, not features
A normal functional spec says what the feature does. A spec for an AI feature has to say how the system behaves: what it retrieves, in what order, what it does with the result, and what it puts in front of the user versus what it holds back.
The pipeline I wrote was specific. On title input, debounced by about 800 ms so it does not fire on every keystroke, embed the text and pull the top 10 candidate requests by vector similarity. Re-rank those by similarity score, request type, same-department match, and recency, weighting the last 24 months higher. Take the top 3 and generate a draft from them.
One decision in that pipeline matters more than the rest. Before showing the draft, strip every specific value from the source requests and replace it with a placeholder. A past request’s real dollar amounts, names, and dates never leak into a new user’s draft. The model hands you the structure and the language that got approved, not someone else’s private details.
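To make the stripping step concrete, here is a minimal Python sketch, not the submitted implementation. The regular expressions, the placeholder labels, and the `known_names` parameter are all assumptions for illustration. Names in particular cannot be caught reliably by pattern matching, so a real system would more likely strip by form field or use entity recognition.

```python
import re

# Hypothetical patterns, for illustration only. They cover a few common
# amount and date formats and will miss others.
AMOUNT = re.compile(
    r"[¥$€]\s?\d+(?:,\d{3})*(?:\.\d+)?"               # symbol first: ¥120,000
    r"|\d+(?:,\d{3})*(?:\.\d+)?\s?(?:円|yen|JPY|USD)"  # unit last: 50,000円
)
DATE = re.compile(r"\d{4}[-/]\d{1,2}[-/]\d{1,2}")      # 2024-03-15 or 2024/3/15


def strip_specifics(text: str, known_names: list) -> str:
    """Replace dates, amounts, and known names with placeholders."""
    text = DATE.sub("[DATE]", text)      # dates first, so their digits are gone
    text = AMOUNT.sub("[AMOUNT]", text)
    for name in known_names:             # assumed to come from request metadata
        text = text.replace(name, "[NAME]")
    return text
```

Calling `strip_specifics("Tanaka requests ¥120,000 for travel on 2024-03-15.", ["Tanaka"])` keeps the sentence structure and replaces the three specific values with `[NAME]`, `[AMOUNT]`, and `[DATE]`.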
The tool is understanding that the spec is now a behavior contract. You are not describing a screen. You are describing what a system will do on its own, including what it must never do.
Designing for confidence, not certainty
This is the model that separates people who have shipped AI features from people who have read about them.
A normal feature has one path. An AI feature has a path for every level of the model’s confidence, and you design all of them. For the drafting feature I defined three states based on the similarity score of the closest past request. A strong match generated a full pre-filled draft. A partial match generated a draft and flagged the gaps. A weak match generated nothing and returned a blank structured template instead.
That last decision is the important one. When the system was not confident, it did less, and it said so. It did not paper over a weak match with a plausible-looking draft, because a plausible draft built from the wrong precedent is worse than a blank form. The user trusts the blank form. They do not trust the thing that looks finished and is subtly wrong.
The tool is treating the model’s uncertainty as a design input you build around, not an error you hide.
Designing the empty case first
There was a scenario where no similar past request existed at all. A genuinely new kind of approval, or a new company with no history in the system. The feature had nothing to offer.
Most specs treat this as an afterthought, if they mention it at all. I wrote it into the spec early and on purpose, because the empty case is where trust is won or lost. If the first time a user tries the feature it produces confident garbage, they never come back. If it cleanly says nothing close enough was found and hands over a clean template, they keep using it, and they believe it the next time it does have an answer.
The tool is finding the case where your feature is useless and designing that path with more care than the happy one.
Keeping the human load-bearing
The model never submitted anything. It generated a draft into the user’s private editing space, and the human reviewed, edited, and submitted. That was not a courtesy. It was a hard constraint, because submitting a request up a chain of managers is a social and political move with consequences for the person making it.
It is easy to read human-in-the-loop as a buzzword, a checkbox you add to make a feature sound safe. The actual tool is more specific: identify the exact point where a person’s judgment is carrying real weight, and refuse to let the model cross it. Everywhere else, automate freely. At that line, the human stays in charge.
Instrumenting for the failure mode
The last model is about what you measure. Adoption and time saved are easy. They were in the spec. But they were not the metrics I cared most about.
I added signals for harm. The clearest one: a new employee who does not yet know the unwritten rules, handed a confident draft, trusting it more than they should. Adoption goes up and quality goes down at the same time, and a dashboard that only tracks adoption would call that a win. So you instrument for the specific way the feature can hurt someone, and you watch that signal as closely as the success one.
The tool is measuring the failure you are most afraid of, not just the outcome you are hoping for.
The throughline
None of this required writing production code. It required thinking like the person who would. That is the actual shift in 2026. AI did not turn product managers into engineers in the learn-to-code sense. It exposed the judgment that was always underneath, because now you are designing the behavior of a system that acts on its own, and there is nowhere left to hide a vague decision.
The software in my list will be different in two years. The embeddings API will be faster, the prototyping tool will have a new name, the model will be three generations ahead. The seven models above will be exactly the same. That is why they are the tools that matter.
Appendix: The original assignment, anonymized
Reconstructed from my working drafts. The company, product, and any identifying details are removed. The problem categories and technical choices are mine, and no raw user research from the company is reproduced here. Exact threshold values and final page-trimmed wording live in the submitted document.
The assignment
A Japanese SaaS company building an internal approval workflow product (a “ringi” system, where a request routes up a chain of managers for sign-off) provided 12 user interviews across 4 companies. The task: select one user problem worth solving, justify the choice, and write a full product specification for an AI feature that addresses it.
Step 1: Selected problem
Five problems surfaced in the research. I selected repetitive draft writing. Submitters rewrite the same kinds of approval requests on every submission, and already work around it by searching past approved requests and copy-pasting.
| Criterion | Assessment |
|---|---|
| Impact | Affects every submitter on every submission, the highest-frequency touchpoint in the product. Felt across all levels, from new hire to department head. |
| Feasibility | 10,000+ approved documents already in the system. A capable LLM handles structured Japanese drafting well. The semantic search layer also resolves the weaker “poor search” complaint as a byproduct. |
| Strategic fit | Directly serves the product’s core promise of reducing operational burden on submitters. Builds end-user trust before touching higher-risk, more political features. |
| Risk | The AI works in the submitter’s private drafting space, and its output is never auto-submitted into the approval chain. The draft is a suggestion; the human decides what to submit. |
| Political subtext | Approved documents implicitly encode what was politically acceptable. The model learns from what worked without modeling politics explicitly. The feature accelerates an existing human process rather than replacing a decision. |
Problems I declined to solve
- Unclear approval routing. Routing rules already exist in the product. The real friction is implicit political pre-consensus, which cannot and should not be automated. A cultural onboarding problem, not an AI one.
- Slow approvals. Largely process design. Stalls are sometimes deliberate and political. Pinging senior executives unprompted is worse than no feature.
- Poor search. A valid AI use case but lower frequency than drafting. Solved concurrently as the retrieval layer underneath the drafting feature, not as a standalone primary feature.
- Vague rejection feedback. Rejections are often political rather than quality-based, so an AI reviewer would confidently give the wrong advice. A Phase 2 candidate once the drafting feature produces labeled outcome data.
Note on the research: 12 users across 4 companies is directional, not statistically representative. Frequency and adoption assumptions should be validated at scale before being treated as universal.
Step 2: Product specification
2.1 Overview
Feature name: Draft Assist. When a user starts a new approval request, the AI retrieves semantically similar past approved requests and generates a pre-filled draft for the user to review, edit, and submit. Target persona: any submitter, with semi-frequent and new submitters benefiting most, since they lack a personal archive to draw on.
2.2 Problem statement
The pain concentrates in the middle of the current flow, where the user searches old requests and copy-pastes by hand. The existing workaround validates the need. The proposed flow compresses those manual steps into a single assisted step. The human still authors and submits the final request. The no-match case (no similar past request exists) is handled as a defined edge case below.
2.3 Functional requirements
Input
| Input | Source | Notes |
|---|---|---|
| Request type | User selection (dropdown) | e.g. business trip, equipment purchase, contract approval |
| Title / brief description | User free-text | Minimum 5 characters to trigger retrieval |
| Department and submitter metadata | Session / profile | Weights results toward same-department precedents |
| Approved request corpus | Full-text + vector index | 10,000+ documents, approved submissions only |
Process
| Step | Action |
|---|---|
| 1 | On title input (debounced ~800ms), embed the input and run vector similarity search; retrieve top 10 candidates. |
| 2 | Re-rank by similarity score, request-type match, department match, and recency (last 24 months weighted higher). |
| 3 | Classify into Good / Partial / Low match based on the top result’s score. |
| 4 | If Good or Partial: pass the top 3 documents and user context to the LLM; synthesize a structured draft mapped to form fields; strip all specific values and replace them with placeholders. |
| 5 | If Low or no match: skip generation; return a blank structured template with section headers only. |
| 6 | Return the draft (or template), the top 3 source references, and the match-state label to the frontend. |
Output: editable pre-filled fields, placeholders where specific values were stripped, the match-state label, and references to the source requests.
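Step 2 of the process can be sketched roughly as follows. This is one possible reading, not the scoring from the submitted spec: it treats re-ranking as the similarity score plus fixed bonuses. The `Candidate` fields, the bonus weights, and the 730-day window standing in for "last 24 months" are all assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Candidate:
    doc_id: str
    similarity: float   # score from the vector search, assumed to lie in 0..1
    request_type: str
    department: str
    approved_on: date


# Hypothetical bonus weights, for illustration only.
TYPE_BONUS = 0.15
DEPARTMENT_BONUS = 0.10
RECENCY_BONUS = 0.10
RECENCY_WINDOW_DAYS = 730  # roughly the last 24 months


def rerank(candidates, request_type, department, today, top_k=3):
    """Order candidates by similarity plus bonuses and keep the top_k."""

    def score(c: Candidate) -> float:
        total = c.similarity
        if c.request_type == request_type:
            total += TYPE_BONUS
        if c.department == department:
            total += DEPARTMENT_BONUS
        if (today - c.approved_on).days <= RECENCY_WINDOW_DAYS:
            total += RECENCY_BONUS
        return total

    return sorted(candidates, key=score, reverse=True)[:top_k]
```

With additive bonuses, a recent same-department precedent can overtake an older document with a slightly higher raw similarity, which is the intent of the re-rank. The weights would need tuning against real approved requests.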
2.4 Confidence states
The top result’s similarity score drives behavior. The exact thresholds matter less than the principle: the system does less when it is less sure, and it never hides that.
| State | Condition | Behavior |
|---|---|---|
| Good match | High similarity | Full pre-filled draft generated from top precedents. |
| Partial match | Moderate similarity | Draft generated, with gaps and low-confidence sections flagged for the user. |
| Low / no match | Below threshold | No draft. Return a blank structured template. |
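The table above maps onto a small classifier. The sketch below is illustrative: the two threshold values are placeholders invented for the example, since the exact values live in the submitted document, and the function names are hypothetical.

```python
from enum import Enum
from typing import Optional


class MatchState(Enum):
    GOOD = "good"
    PARTIAL = "partial"
    LOW = "low"


# Placeholder thresholds, NOT the values from the submitted spec.
GOOD_THRESHOLD = 0.80
PARTIAL_THRESHOLD = 0.60


def classify(top_score: Optional[float]) -> MatchState:
    """Map the top result's similarity score to a match state.

    None means retrieval returned nothing, which is treated as a low match.
    """
    if top_score is None or top_score < PARTIAL_THRESHOLD:
        return MatchState.LOW
    if top_score < GOOD_THRESHOLD:
        return MatchState.PARTIAL
    return MatchState.GOOD


def should_generate(state: MatchState) -> bool:
    """Generation is skipped entirely for a low match."""
    return state is not MatchState.LOW
```

Keeping `should_generate` separate from `classify` puts the "do less when less sure" rule in one place, so a later threshold change cannot accidentally let a weak match reach the LLM.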
2.5 Edge cases
| Scenario | Handling |
|---|---|
| No similar request exists | Return a blank structured template. Never fabricate a draft from a weak precedent. |
| Closest precedent is stale | Recency weighting demotes it. If it still surfaces, flag its age so the user does not copy outdated terms. |
| Restricted or cross-department document | Filter the corpus by the submitter’s access before retrieval. A precedent the user cannot see must never appear in a draft. |
| New company with an empty corpus (cold start) | Treat as a standing no-match until enough approved history accrues. Default to the blank template. |
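The restricted-document rule is about ordering: filter by access first, then search. A minimal sketch, assuming a hypothetical access model in which each document carries the set of departments allowed to view it (the real product's permission model is not described here):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    allowed_departments: frozenset  # hypothetical access model


def searchable_corpus(user_department: str, corpus: list) -> list:
    """Return only the documents this user is allowed to see."""
    return [d for d in corpus if user_department in d.allowed_departments]


def retrieve(user_department: str, corpus: list, search):
    """Filter first, then search.

    `search` is a stand-in for the vector search. Because it only ever
    receives visible documents, a restricted precedent cannot reach ranking,
    generation, or the source references.
    """
    return search(searchable_corpus(user_department, corpus))
```

In a real vector store this would more likely be a metadata filter applied as part of the query than a Python list comprehension, but the ordering guarantee is the same.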
2.6 Human in the loop
The AI never submits. It generates into the user’s private drafting space. The human reviews, edits, and submits. This is a hard constraint, not a default, because submitting up the approval chain carries social and political consequences for the submitter.
2.7 Success metrics
Primary: adoption rate among submitters, reduction in time-to-submit, and draft acceptance (how much of the generated draft survives to submission).
Harm detection signals, watched as closely as the success metrics:
- New joiners accepting drafts with little or no editing, a proxy for over-trust before they know the unwritten rules.
- Submission quality or approval rate declining as adoption rises.
- High abandonment after a draft is generated, a sign the drafts are not trusted.
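The first signal, and the draft-acceptance metric, both need some measure of how much of a draft survived. One rough option, offered as an assumption and not as the metric from the submitted spec, is a character-level similarity ratio from Python's standard `difflib`. The 0.95 cutoff and the 90-day definition of a new joiner are invented for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical cutoffs, for illustration only.
OVER_TRUST_RATIO = 0.95   # draft submitted almost unchanged
NEW_JOINER_DAYS = 90      # tenure below which a submitter counts as new


def survival_ratio(generated: str, submitted: str) -> float:
    """Similarity between draft and submission; 1.0 means unchanged."""
    return SequenceMatcher(None, generated, submitted).ratio()


def is_over_trust_signal(generated: str, submitted: str, tenure_days: int) -> bool:
    """Flag a new joiner who submitted a draft with little or no editing."""
    return (
        tenure_days <= NEW_JOINER_DAYS
        and survival_ratio(generated, submitted) >= OVER_TRUST_RATIO
    )
```

This is crude. Filling in placeholders lowers the ratio even when the user changed nothing of substance, so a production version would probably compare the texts after masking the filled-in values.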
2.8 Risks
The two highest-priority risks for each type:
| Type | Risk | Prevention |
|---|---|---|
| Technical | Retrieval or generation latency causes timeouts on input | Debounce input, cap retrieval at the top 10, set a generation timeout that falls back to the blank template. |
| Technical | Cold start for new tenants with no corpus | Default to the template. Gate the feature on a minimum corpus size. |
| User experience | A plausible draft built on a weak precedent is trusted and submitted | Confidence states. Suppress generation below threshold. Flag low-confidence sections. |
| User experience | Over-reliance erodes the user’s own drafting judgment | Position output as a starting point. Keep all fields editable. Never pre-submit. |
| Organizational | New joiners inherit bad or outdated patterns from precedent | Harm-detection metrics, recency weighting, and stale-precedent flagging. |
| Organizational | A restricted document surfaces to an unauthorized user | Access-filter the corpus before retrieval, not after generation. |
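The generation timeout in the first technical risk can be sketched with Python's standard `concurrent.futures`. This is an illustration under assumptions: `generate` stands in for the LLM call, and the template shape and section names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Hypothetical shape of the blank structured template (section headers only).
BLANK_TEMPLATE = {
    "state": "low",
    "fields": {"purpose": "", "justification": "", "cost": ""},
}


def draft_or_template(generate, timeout_s: float) -> dict:
    """Run generate() and fall back to the blank template if it is too slow."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(generate)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return BLANK_TEMPLATE
    finally:
        # Do not block on the slow call. The worker thread finishes in the
        # background and its result is discarded.
        pool.shutdown(wait=False)
```

The fallback reuses the no-match path, so a timeout looks to the user like a low match and never like a half-finished draft. Only timeouts are handled here; whether generation errors should also fall back to the template is a separate decision.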