I spend most of my workday talking to my computer:
- Drafting this post? Speech-to-text drafts.
- Coding via Cursor/Claude Code? Voice instructions.
- Research? Web search? Dictation.
My keyboard is no longer the default; it has become a fallback.
It’s been over three years since the start of the AI revolution, so why do most products still default to screens, buttons, and clicks, or at most a prompt box? I’ve been thinking about this quite a bit and have landed on a framework that helps: the interface abstraction ladder.
Each rung of the ladder moves us further from direct manipulation (clicks, buttons, etc.) toward intent-based interaction.
Let’s talk about the ladder.
The Five Layers of Interaction
| Layer | Name | Primary Interaction | User Role | Information Density | Example |
|---|---|---|---|---|---|
| 0 | Direct Manipulation | Clicks, typing, dropdowns | Operator | Low (UI-constrained) | Excel, Figma |
| 1 | Reactive Assistance | Prompting, chat | Requestor | Medium (text exchange) | ChatGPT, Copilot |
| 2 | Proactive Agents | Suggestions, drafts, autonomous agents, content generators | Approver | High (implied intent) | Gmail Smart Reply, social media agents |
| 3 | Conversational Partners | Audio/video avatars, continuous context | Partner | Very High (ambient) | Apple/OpenAI pins, Ray-Ban smart glasses |
| 4 | Ambient Intelligence | Implied intent | Subject | Extreme (pre-conscious) | N/A |
Now, I am not claiming that all software will move towards Layer 4.
The ladder is a spectrum, and different products and use cases will sit on different rungs. But notice the shift in user role: we go from being an operator (we take action) to a subject (things happen to or for us). As you move up, you do less manipulating and more trusting.
Layer 0: Direct Manipulation
Traditional interfaces. You click buttons, navigate menus, fill forms. Full control but full attention required.
Wins: Complex spatial tasks, creative work, precision operations.
Fails: High friction, screen-limited information density, requires knowing how to do everything.
Example: Bloomberg Terminal, fighter-jet cockpits. For high-trust, high-precision contexts, direct manipulation is irreplaceable.
Layer 1: Reactive Assistance
You ask, it responds. Natural language replaces UI navigation.
Wins: Removes UI complexity, increases communication density (flexible and not limited by pre-built screens).
Fails: Lacks proactiveness and context; requires knowing what to ask.
Example: ChatGPT, or any product + a chatbot. This is where most “AI products” live today.
It’s more likely that your chatbot just programmatically clicks the same buttons a user would click manually and presents the same information a chart or dashboard would. If so, you haven’t built a new interaction model; you’ve just built a higher-latency version of Layer 0. Layer 1 should unlock capabilities that are impossible through UI alone.
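To make that distinction concrete, here’s a rough sketch (every name and function below is hypothetical, not from any real product): the first handler is just a slower path to a button that already exists; the second composes a query and an analysis that no single screen ever offered.

```typescript
// Hypothetical sketch: "higher-latency Layer 0" vs. a genuine Layer 1 capability.

type Filters = { region?: string; quarter?: string };

// Stand-ins for a real data layer and an LLM, stubbed so the sketch runs.
const querySales = async (f: Filters): Promise<number[]> => [120, 95, 340, 80];
const extractFilters = async (prompt: string): Promise<Filters> =>
  prompt.includes("EMEA") ? { region: "EMEA" } : {};
const findAnomalies = (xs: number[]): number[] => xs.filter((x) => x > 300);

// Pattern A: the chatbot replays the same preset action a dashboard button performs.
async function chatbotAsButtonWrapper(prompt: string): Promise<string> {
  if (prompt.toLowerCase().includes("sales")) {
    const data = await querySales({ quarter: "Q3" }); // fixed, preset slice
    return `Q3 sales chart: ${data.join(", ")}`;      // same output, slower path
  }
  return "Sorry, I can't help with that.";
}

// Pattern B: the chatbot composes operations the UI never offered together,
// shaped by the user's actual question.
async function chatbotAsNewCapability(prompt: string): Promise<string> {
  const filters = await extractFilters(prompt); // flexible intent, not a menu
  const sales = await querySales(filters);      // arbitrary slice of the data
  const outliers = findAnomalies(sales);        // analysis no screen exposed
  return `Found ${outliers.length} unusual figures: ${outliers.join(", ")}`;
}

chatbotAsNewCapability("anything odd in EMEA sales?").then(console.log);
```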
Layer 2: Proactive Agents
The system is like an executive assistant. It takes initiative: it suggests, prepares, and acts within boundaries before you ask. It can work silently (Cursor’s autocomplete and Bugbot, Gmail’s Smart Reply, Woice AI’s automatic organization of captured voice notes).
Wins: It dramatically reduces cognitive load and acts on implied intent.
Fails: Trust becomes the core promise of the product. What is the boundary of autonomous execution? When does it ask? What happens when it’s wrong?
Example: If Gmail had context from my docs, sheets, calendar, and notes, and auto-drafted the reply I would have written based on that information, I would consider that a Layer 2 product.
I also believe this is where the interesting product problems are today. Not “can we build it?” but “how do we provide value, trust, and capability in a single package?”
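One way to think about that boundary question, sketched with made-up names and thresholds rather than any real product’s logic: act silently only when the action is private, reversible, and high-confidence; ask first for everything else.

```typescript
// Hypothetical autonomy boundary for a Layer 2 agent.

type AgentAction = {
  description: string;
  reversible: boolean;         // can the user undo it in one step?
  affectsOtherPeople: boolean; // e.g. sends an email vs. edits a private draft
  confidence: number;          // agent's own confidence in the inferred intent, 0..1
};

type Decision = "execute" | "ask_user" | "do_not_act";

function autonomyBoundary(action: AgentAction): Decision {
  if (action.confidence < 0.5) return "do_not_act";                       // not sure enough to even suggest
  if (action.affectsOtherPeople || !action.reversible) return "ask_user"; // approval required
  return action.confidence >= 0.9 ? "execute" : "ask_user";               // silent only when private, reversible, confident
}

console.log(
  autonomyBoundary({
    description: "Send the drafted reply to the client",
    reversible: false,
    affectsOtherPeople: true,
    confidence: 0.95,
  }),
); // -> "ask_user": draft silently, but never send without approval

console.log(
  autonomyBoundary({
    description: "Organize yesterday's voice notes into folders",
    reversible: true,
    affectsOtherPeople: false,
    confidence: 0.95,
  }),
); // -> "execute": private, reversible, high confidence
```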
Layer 3: Conversational Partners
The system is a partner: it thinks with you and stays aware of what is going on through multimodal awareness. Closest real-world example? ChatGPT voice mode. It has context and memory (although limited), listens, responds, and processes within its context and execution boundary.
Although, for me, a true Layer 3 system would look like this: ChatGPT-style voice and camera + Claude Code-style capability and execution (skills, MCP, etc.) + long-term context and memory + a physical form factor + always-on, continuous listening and seeing.
Wins: Lowest input friction, highest bandwidth for natural expression, persistent context across time.
Fails: Privacy concerns, possible context collapse (referencing earlier thoughts), public/private divide, loss of visual scaffolding.
Example: I don’t have one as of today, but I believe the next big winners in consumer AI will be in this space: OpenAI, Apple, and others with their always-on companion hardware devices.
Why does Layer 3 matter strategically? You can’t build systems that understand implied intent (L4) without continuous contextual data first. Some call it AGI, some call it world models; either way, building toward voice/vision partners is what will unlock Layer 4.
Layer 4: Ambient Intelligence
This could take multiple forms (robotics, brain-computer interfaces, etc.): predictive systems that act on implied intent before you’re even aware of it.
Wins: Zero input friction.
Fails: Trust, agency, and privacy problems.
Example: A calendar AGI assistant that books meetings (the way you would) based on priorities and patterns it has learned, without you having to instruct or command it. Broad AGI is hard to comment on; it should basically be able to do everything.
The Patterns to keep in mind
1. Different layers for different contexts/use-cases
The best products may need multiple layers, not just one.
Deep focus coding? Layer 0 (direct control).
Walking and thinking? Layer 3 (voice).
Stuck on a problem? Layer 1 (ask for help).
Notion does this well: Layer 0 core editor, Layer 1 AI commands, Layer 2 auto-formatting.
2. You can’t skip rungs
Companies try jumping from Layer 0 straight to Layer 2 and wonder why users don’t trust it. Trust must be earned progressively. Each layer requires a different trust contract:
| Layer | User Role | Trust Source | Failure Mode |
|---|---|---|---|
| 0 | Operator | Visibility | Friction |
| 1 | Requestor | Accuracy | “Blank page” paralysis |
| 2 | Approver | Predictability | Over-stepping |
| 3 | Partner | Continuity | Context Collapse |
| 4 | Subject | Alignment | Loss of Agency |
Key insight: AI features feel jarring when a product tries to take initiative (L2) before it has proven it can respond accurately (L1). Trust isn’t granted; it has to be earned rung by rung.
3. The “undo button” at each layer
What’s reversibility at each layer?
- Layer 0: Ctrl+Z, visual undo history
- Layer 1: Regenerate response, edit prompt
- Layer 2: Versioning/checkpointing to show exactly what changed and allow rollback (see the sketch after this list)
- Layer 3: Proactive course correction that stays adaptive, contextual, and helpful
- Layer 4: Too early to say
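A rough sketch of what that Layer 2 checkpointing could look like (the types and class below are illustrative, not from any specific product): snapshot state before every autonomous change so “show exactly what changed” and rollback come for free.

```typescript
// Hypothetical Layer 2 reversibility: checkpoint before every autonomous change.

type Checkpoint<T> = {
  id: number;
  label: string; // human-readable description of what the agent changed
  before: T;     // snapshot taken before the agent acted
  timestamp: Date;
};

class ReversibleAgentLog<T> {
  private checkpoints: Checkpoint<T>[] = [];
  private nextId = 1;

  // Record a snapshot of the current state, then hand back the agent's proposed state.
  apply(label: string, current: T, proposed: T): T {
    this.checkpoints.push({
      id: this.nextId++,
      label,
      before: current, // a real implementation would deep-clone here
      timestamp: new Date(),
    });
    return proposed;
  }

  // "Show exactly what changed": the labels, in order.
  history(): string[] {
    return this.checkpoints.map((c) => `#${c.id} ${c.label} (${c.timestamp.toISOString()})`);
  }

  // Roll back to the state captured before a given checkpoint.
  rollback(id: number): T | undefined {
    return this.checkpoints.find((c) => c.id === id)?.before;
  }
}

// Usage: an agent auto-formats a dictated note; the user can inspect and undo it.
const log = new ReversibleAgentLog<string>();
let note = "meeting notes, raw dictation...";
note = log.apply("Auto-formatted dictation into bullet points", note, "- meeting notes\n- raw dictation");
console.log(log.history());
note = log.rollback(1) ?? note; // one-step undo of the agent's change
```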
4. The ladder is usually an expansion, not a replacement
Professional tools like Photoshop, Figma, Bloomberg will always need Layer 0. Direct manipulation is irreplaceable for precision work. A mature product has multiple entry points:
- L0: Core tool for deep work
- L1: Support when stuck
- L2: Automation for patterns
- L3: Ambient thought partner for workplace productivity
PM trap: Building Layer n+1 as a replacement for Layer n instead of a complement. This is why some “AI features” feel like worse versions of the original. For example, an image generation tool that doesn’t allow direct manipulation of the generated output makes for a poor experience. On the other hand, Figma-style generation that can be played with and manipulated directly will always win.
So the better question is: how do I correctly identify user intent (L0 vs. L3) and their context (past information, stage of work, use case)? Answering that lets us proactively present the interface that fits the moment.
Also, build trust infrastructure before adding autonomy.
For Layer 1, you need:
- Accuracy (is the information produced accurate?)
- Fidelity (is it derived from the right information, with no hallucination?)
For Layers 2-3, you need (a rough sketch follows this list):
- Explainability (why did it do that?)
- Reversibility (versioning for L2, course-correction for L3)
- Adjustability (how do I correct behavior?)
- Boundaries (what requires permission?)
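One way to picture those four requirements is as a single contract every agent action must satisfy before it is surfaced to a user. The interface and names below are illustrative, not any real API.

```typescript
// Hypothetical trust contract for a Layer 2-3 agent action.

interface AgentActionRecord {
  summary: string;                    // what the agent did or wants to do
  explanation: string;                // explainability: why it did that, in the user's terms
  undo: (() => Promise<void>) | null; // reversibility: null only if permission was required up front
  adjust: (feedback: string) => void; // adjustability: "never auto-send to this contact"
  requiresPermission: boolean;        // boundaries: does this cross the autonomy line?
}

// A gatekeeper that refuses to surface actions missing part of the contract.
function isTrustworthy(action: AgentActionRecord): boolean {
  const explainable = action.explanation.trim().length > 0;
  const reversibleOrGated = action.undo !== null || action.requiresPermission;
  return explainable && reversibleOrGated;
}

const draftReply: AgentActionRecord = {
  summary: "Drafted a reply to Friday's client email",
  explanation: "Based on the attached proposal doc and your calendar availability next week",
  undo: async () => { /* discard the draft */ },
  adjust: (feedback) => console.log(`Preference recorded: ${feedback}`),
  requiresPermission: true, // sending still needs an explicit click
};

console.log(isTrustworthy(draftReply)); // -> true
```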
I believe we are in the middle of the biggest HCI shift since the GUI. Most companies are stuck at Layers 0-1, adding chat to old workflows. The winners, on the other hand, will understand which layer serves their users, and they will build the trust infrastructure to make higher layers work. Each layer serves a specific purpose, and each layer feeds the one above. You can’t have ambient intelligence without conversational interfaces first.
Where are you on the ladder? And more importantly, where do your users need you to be?