Field notes

Blog

Notes on AI-native transformation, and what the teams that pull ahead do differently.

· MCP · OAuth

How MCP Authorization Actually Works (and the Ways Teams Get It Wrong)

A precise walkthrough of the OAuth 2.1 resource-server model MCP uses: audience binding, token passthrough, confused-deputy, the Nov 2025 spec changes, and what to ship.

· computer-use-agents · osworld

Computer-Use Agents: What OSWorld Scores Really Tell You About Production Readiness

What OSWorld actually tests, how to read current scores honestly, and the specific gap between a benchmark number and a computer-use agent you can trust on real work.

· prompt-injection · llm-security

Prompt Injection in Production: Which Defenses Actually Hold and Which Are Theater

A taxonomy of direct and indirect prompt injection, why input filtering fails, and the architectural patterns (CaMeL, dual-LLM, capability design) that actually contain it.

· self-hosting · llm-inference

Open-Weight vs Closed Models in 2026: When Self-Hosting Actually Pays Off

A practitioner's model for deciding between open-weight self-hosting and closed APIs: real break-even math, GPU utilization traps, licensing limits, and the EU AI Act split.

· llm-evals · ci-cd

Evals as a Shipping Contract: How to Actually Gate LLM Changes

How to build an LLM eval suite that earns the right to block a deploy: cases from real failures, regression vs adversarial sets, noise-aware thresholds, CI gates, model-upgrade gating.

· llm-inference · gpu

Self-Hosting an LLM vs. Calling an API: The Real Cost Math

A mechanism-level cost comparison of self-hosting open-weight LLMs versus per-token APIs: GPU amortization, the utilization trap, throughput math, and where each one actually wins.

· llm-ops · finops

Why Your AI Bill Rises as Token Prices Fall: Jevons in LLM Ops

Per-token prices fell ~280x in 18 months yet enterprise inference spend doubled. Here is the mechanism, the real cost drivers, and how to instrument and cap them.

· llm-routing · cost-optimization

LLM Model Routing: How to Cut Cost Without Losing Quality

How LLM routing and cascading actually cut cost, what you really save versus benchmark claims, the confidence-signal trap, and how to instrument a router so quality loss can't hide.

· AI benchmarks · LLM evaluation

GDPval, Explained: What the Benchmark Measures and What It Doesn't

A technical breakdown of how GDPval is built, what "matches experts on ~half of tasks" really means, the 100x cost claim, grading reliability, and the benchmark's real limits.

· spec-driven development · ai agents

Spec-Driven Development: How Agentic Coding Actually Changes Your Repo

A technical walkthrough of spec-driven development with AI agents: the files that land in your repo, the plan-implement-verify loop, how review and CI shift, and where it breaks.

· multi-agent · llm-agents

Parallel Subagent Orchestration: When Fan-Out Helps, and Exactly Where It Breaks

How to decide when to run LLM subagents in parallel versus keep a single agent: the read/write split, token-cost math, the equal-budget critique, and a runnable orchestrator skeleton.

· llm-agents · unit-economics

AI Agent Unit Economics: Cost-per-Correct and the Compounding-Error Tax

How to actually model the unit cost of an LLM agent: cost-per-correct-outcome, the compounding-error tax over multi-step runs, token re-send growth, and where the math breaks.

· ai-transformation · leadership

Stop Training Your Team on AI. Start Restructuring Around It.

AI training programs teach tools within broken workflows. The best AI training is a restructured org where people learn by doing real work with agents.

· engineering · agentic-development

Agentic Development Is Not Vibe Coding. Here's How We Actually Use It.

Vibe coding is prompting and hoping. Agentic development is a structured framework where developers orchestrate AI agents in parallel. Here's the difference and why it matters.

· ai-transformation · strategy

The AI Consultancy Industrial Complex Is Failing You

Big firms charge crores for AI strategy decks that sit in Google Drive. The gap between advice and deployment is where most AI initiatives die.

· org-design · ai-transformation

Departments Are Dead. Welcome to Outcome Pods.

The traditional org chart optimizes for skills, not results. Outcome-based pods — small teams of humans and AI agents driving a single metric — are what actually works.

· private-equity · ai-transformation

AI Due Diligence: What PE Firms Should Ask Before Writing the Check

A framework for PE and VC partners evaluating portfolio company AI readiness. How to score it, what red flags to watch for, and why it matters for exit multiples.

· ai-transformation · india

AI Transformation Company India: What to Look For (And What to Avoid)

A buyer's guide for Indian companies evaluating AI transformation partners. Three categories of vendors, red flags, green flags, and what the mid-market actually needs.

· customer-support · ai-transformation

We Cut Customer Support Costs by 96%. The Team Didn't Shrink -- They Leveled Up.

How we automated 96% of customer interactions at a 4.4M-subscriber platform without laying anyone off. The support team moved from cost center to growth engine.

· engineering · ai-transformation

We Got 20x Engineering Productivity. Here's What That Actually Means.

20x doesn't mean 20x lines of code. It means cycle time collapse, scope expansion per person, and the elimination of everything that isn't building. Here's what happened.

· ai-transformation · culture

When Non-Engineers Start Building: The Real Sign of AI Transformation

The most important signal that AI transformation is working isn't engineering velocity. It's when a promo editor builds their own automation and a marketer ships production code.

· ai-audit · methodology

The AI Audit: What We Actually Look For in 2 Weeks

Our 2-week AI audit is how we figure out where AI will have the biggest impact in your company. Here's exactly what we assess, how we assess it, and what you get at the end.

· ai-transformation · leadership

Why 95% of AI Pilots Fail — And What the 5% Do Differently

Most companies treat AI adoption as a tool problem. The ones that succeed treat it as an org design problem. Here's what separates the 5% from the rest.