Product Philosophy

How I Build AI Products.

Eight principles I actually follow when building AI systems: evaluation, trust calibration, human-in-the-loop design, and the things most teams get wrong.

These are not best practices borrowed from a framework. They come from building AI systems in production for regulated industries, running evals, watching systems fail in interesting ways, and rebuilding with better constraints the second time.

01
Evaluation First

Ship the eval harness before the feature.

Most teams build the feature, then figure out how to measure it. I do the opposite. Before any model integration goes into production, I define what good looks like: which queries should return what kinds of outputs, which failure modes are acceptable and which aren't, and what the severity threshold is for regulated versus consumer contexts. The evaluation suite is a first-class product artifact, version-controlled and treated as seriously as the code it tests.

In regulated environments, a silent model failure isn't just a product bug; it's a compliance incident. What matters is whether you find out from a user report or from a regression test you wrote three weeks before launch.
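As a rough illustration of what "eval harness before the feature" can look like, here is a minimal Python sketch. Everything in it is hypothetical: the `EvalCase` fields, the two severity levels, the substring check, and the rule that regulated contexts tolerate no failures are assumptions chosen for the example, not a description of any particular production suite.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    query: str
    must_contain: str  # substring a passing answer has to include (assumed check)
    severity: str      # "blocker" or "warning" (hypothetical severity levels)


def run_suite(model: Callable[[str], str], cases: list[EvalCase]) -> dict[str, list[str]]:
    """Run every case and group the failing queries by severity."""
    failures: dict[str, list[str]] = {"blocker": [], "warning": []}
    for case in cases:
        answer = model(case.query)
        if case.must_contain.lower() not in answer.lower():
            failures[case.severity].append(case.query)
    return failures


def release_allowed(failures: dict[str, list[str]], regulated: bool) -> bool:
    """Assumed policy: blockers always stop a release; warnings stop it only when regulated."""
    if failures["blocker"]:
        return False
    return not (regulated and failures["warning"])


def stub_model(query: str) -> str:
    """Stand-in for a real model call, so the harness can exist before the feature does."""
    canned = {
        "What is the filing deadline?": "The filing deadline is not in the provided documents.",
    }
    return canned.get(query, "I don't know.")


CASES = [
    EvalCase("What is the filing deadline?", "not in the provided documents", "blocker"),
    EvalCase("Summarize the fee schedule.", "fee", "warning"),
]

if __name__ == "__main__":
    results = run_suite(stub_model, CASES)
    print(results)
    print("regulated release:", release_allowed(results, regulated=True))
    print("consumer release:", release_allowed(results, regulated=False))
```

The cases live in version control next to the code, so a change in model or prompt shows up as a diff in failures rather than as a user report.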

02
Trust Calibration

Confidence without calibration is the most dangerous thing in AI.

An AI system that says 'I don't know' when it doesn't know is more valuable than one that answers fluently and incorrectly. Calibrating model confidence (knowing when to show uncertainty, when to escalate to a human, and when to refuse) is a product design problem, not an engineering one.

I design explicit escalation paths for every AI system I build. If the model's confidence score drops below a defined threshold on a high-stakes query, the system surfaces that uncertainty visibly rather than smoothing it away. In financial and regulatory contexts, the cost of false confidence is asymmetric and severe. The design should reflect that.
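The escalation rule described above can be sketched in a few lines of Python. The threshold values, the `Action` names, and the assumption that the model exposes a calibrated confidence score in [0, 1] are all illustrative; real thresholds would come from the evaluation data.

```python
from enum import Enum


class Action(Enum):
    ANSWER = "answer"
    ANSWER_WITH_CAVEAT = "answer_with_caveat"
    ESCALATE = "escalate_to_human"


# Hypothetical thresholds: high-stakes queries need more confidence to pass.
HIGH_STAKES_MIN = 0.90
DEFAULT_MIN = 0.60


def decide(confidence: float, high_stakes: bool) -> Action:
    """Map a confidence score to a visible action instead of smoothing it away."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    threshold = HIGH_STAKES_MIN if high_stakes else DEFAULT_MIN
    if confidence >= threshold:
        return Action.ANSWER
    # Below threshold: the asymmetric cost of false confidence decides the path.
    if high_stakes:
        return Action.ESCALATE
    return Action.ANSWER_WITH_CAVEAT
```

The same score of 0.8 yields an answer on a routine query and an escalation on a high-stakes one, which is the asymmetry the design is meant to encode.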

03
Human-in-the-Loop

Automation should expand human judgment, not replace it.

The goal of a human-in-the-loop system isn't to add a human as a rubber stamp on AI outputs. It's to route decisions to the right decision-maker at the right moment, using automation to handle what's routine so humans can focus on what's genuinely ambiguous.

I design override mechanisms and escalation flows before I design the automation layer. That means deciding: which outputs can ship without review, which need a human checkpoint, and which should never be automated at all. The override design tells you more about what the system believes than the happy path does.
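One way to make those three decisions explicit is a small policy table that exists before any automation does. This Python sketch is an assumption-laden example: the output types and their assigned tiers are invented for illustration, and a real policy would be set with compliance and domain owners.

```python
from enum import Enum


class Review(Enum):
    AUTO_SHIP = "auto_ship"
    HUMAN_CHECKPOINT = "human_checkpoint"
    NEVER_AUTOMATE = "never_automate"


# Hypothetical policy table, keyed by output type.
POLICY = {
    "document_tagging": Review.AUTO_SHIP,
    "client_summary": Review.HUMAN_CHECKPOINT,
    "regulatory_filing": Review.NEVER_AUTOMATE,
}


def route(output_type: str) -> Review:
    """Look up the review tier; unknown output types get a human checkpoint, never auto-ship."""
    return POLICY.get(output_type, Review.HUMAN_CHECKPOINT)
```

The default branch is the design statement: anything the policy has not classified goes to a person.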

04
Model vs System

The model is not the product. The system around it is.

Most AI product failures I've seen aren't model failures. They're system failures: bad prompt architecture, no fallback logic, missing telemetry, evaluation criteria defined too late, retrieval layers that degrade silently at scale. The model is a component. The product is the orchestration, constraints, and feedback loops that make that component reliable.

This is why I build evaluation infrastructure, contract-first API boundaries, and modular classification pipelines, not because I want the engineering complexity, but because the alternative is a system where changing one thing breaks something invisible three layers down.
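A minimal Python sketch of a contract-first boundary with fallback logic, assuming a classification step whose model returns a free-text label. The label set, the `Classification` type, and the choice of "other" as the fallback label are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

LABELS = {"invoice", "contract", "other"}  # hypothetical label set (the contract)


@dataclass(frozen=True)
class Classification:
    label: str
    source: str  # "model" or "fallback", so downstream code and telemetry can tell them apart


def classify(text: str, model: Callable[[str], str]) -> Classification:
    """Call the model, validate its output against the contract, fall back on any violation."""
    try:
        raw = model(text).strip().lower()
    except Exception:
        # Model call failed (timeout, provider error, ...): degrade visibly, not silently.
        return Classification("other", "fallback")
    if raw not in LABELS:
        # Output is outside the agreed label set: treat it as a contract violation.
        return Classification("other", "fallback")
    return Classification(raw, "model")
```

Because callers only ever see a `Classification` with a known label, swapping the model or the prompt cannot leak an unexpected value three layers down; it can only raise the fallback rate, which is measurable.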

05
Latency & Tradeoffs

Every architecture decision is a tradeoff, not a best practice.

RAG is not always the right retrieval strategy. A larger model is not always the right model. Real-time inference is not always the right serving pattern. I built a C++ retrieval engine from scratch not to be an infrastructure engineer, but to have concrete numbers in my head when the engineering team says 'this won't scale.'

The PMs who make the best architecture calls are the ones who can reason about latency surfaces, context window tradeoffs, and inference cost curves, not because they'll implement it, but because they'll know when to push back, when to ask the right question, and when to stop trading accuracy for speed.

06
Telemetry & Feedback

If you can't observe it, you can't improve it.

Every AI system I build has three things wired in from day one: structured logging on inputs and outputs, a mechanism to capture negative signals (explicit or implicit), and a regular review cadence on the evaluation suite. Not because these are nice to have, but because they're the only way to know if the system is degrading between deployments.

Model drift is real, prompt sensitivity is real, and distribution shift in production data is real. Telemetry doesn't prevent these things. It makes them visible before they become user-facing incidents.
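The first two pieces, structured logging and negative-signal capture, can be sketched as follows. This is an illustrative Python example only: the event names, record fields, and the in-memory list standing in for a log sink are assumptions, and a real system would write to durable storage.

```python
import json
import time
import uuid


def log_interaction(sink: list, query: str, output: str, model_version: str) -> str:
    """Append one structured record per model call and return its id."""
    record = {
        "event": "interaction",
        "id": uuid.uuid4().hex,
        "ts": time.time(),
        "model_version": model_version,
        "query": query,
        "output": output,
    }
    sink.append(json.dumps(record))
    return record["id"]


def log_negative_signal(sink: list, interaction_id: str, kind: str, explicit: bool) -> None:
    """Record a negative signal, explicit (thumbs down) or implicit (retry, abandon)."""
    record = {
        "event": "negative_signal",
        "interaction_id": interaction_id,
        "kind": kind,
        "explicit": explicit,
        "ts": time.time(),
    }
    sink.append(json.dumps(record))


def negative_rate(sink: list) -> float:
    """Share of logged interactions that received at least one negative signal."""
    events = [json.loads(line) for line in sink]
    interactions = {e["id"] for e in events if e["event"] == "interaction"}
    flagged = {e["interaction_id"] for e in events if e["event"] == "negative_signal"}
    return len(flagged & interactions) / len(interactions) if interactions else 0.0
```

Tracking `negative_rate` per `model_version` between deployments is one simple way to make drift visible before it turns into a user-facing incident.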

07
Regulated Environments

Low-trust environments require different design instincts.

I work primarily in regulated industries: financial services, compliance infrastructure, private capital. These environments share a common constraint: the cost of a wrong answer is not a bad user experience; it's an audit finding, a regulatory action, or a missed investment decision worth millions.

That changes how I think about AI design: hallucination is a liability, not just a UX problem; audit trails are a product requirement, not an ops afterthought; and 'good enough' accuracy from a demo does not transfer to production at regulated-institution tolerances. Building for this context is a specific skill, and most AI product frameworks don't account for it.

08
What I've Learned Building

The hard lessons don't come from the models.

The most expensive mistakes I've seen in AI product development are not technical. They're communication failures: unclear contracts between the AI layer and the product interface, evaluation criteria defined after the feature shipped, human-override flows that nobody tested under pressure, prompt changes that broke downstream assumptions nobody documented.

The best AI PMs I've found treat the system design (the contracts, the constraints, the escalation paths) as seriously as the model selection. That's what I try to do.

Selected Product Thinking

Questions I keep coming back to.

Why most AI copilots fail retention after week two.
Why trust calibration matters more than raw accuracy in regulated contexts.
What PMs misunderstand about agent UX and oversight.
Building AI systems for environments where users don't trust the model by default.
Why evaluation infrastructure is the real competitive moat in AI products.
The difference between 'AI-powered' and 'AI-reliable' as a product property.

These are the essays I'm writing. If any of these are live, they'll be in Writing ↗

Want to go deeper?

See the systems I've actually built, or start a conversation.

View Projects →
Start a Conversation →