model.log()

Thoughts on ML, AI, and building intelligent systems.

April 2026 6 min read

Why Recommender Systems Are Harder Than You Think

Everyone assumes recommendations are just collaborative filtering with extra steps. But when your item catalog has 50M entries and user preferences shift hourly, the real challenge is building systems that fail gracefully under cold-start conditions while still surfacing serendipitous results.

The Cold Start Wall

Every recommender system tutorial starts with a nice, clean user-item matrix. In reality, a massive portion of your interactions involve either a new user or a new item. At convenience retail scale, new products rotate onto shelves weekly, and a significant chunk of customers are first-time or infrequent visitors to any given location. Your model's performance on these cold-start cases is what actually determines business value — not your NDCG on the warm-start evaluation set where everybody has 50+ interactions.

The standard tricks help: content-based fallbacks, popularity baselines, demographic priors. But the real unlock is building architectures where cold-start isn't a special case. We've had success with two-tower models where the user and item towers can operate independently — a new item gets a meaningful embedding from its features alone, and you can start serving recommendations for it within minutes of it entering the catalog.
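To make the "independent towers" point concrete, here's a minimal sketch in numpy. The weights, dimensions, and features are all made up — in a real system the two projections are learned jointly on interaction data — but it shows the property that matters: a brand-new item gets an embedding from its content features alone, with no interaction history required.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for learned tower weights (in practice: trained jointly
# so that user and item embeddings land in a shared space).
W_user = rng.normal(size=(8, 4))   # user features -> 4-d embedding
W_item = rng.normal(size=(6, 4))   # item features -> 4-d embedding

def user_embedding(user_features: np.ndarray) -> np.ndarray:
    v = user_features @ W_user
    return v / np.linalg.norm(v)

def item_embedding(item_features: np.ndarray) -> np.ndarray:
    v = item_features @ W_item
    return v / np.linalg.norm(v)

# A cold-start item: zero interactions, but its content features
# are enough to place it in the embedding space immediately.
new_item = item_embedding(rng.normal(size=6))
user = user_embedding(rng.normal(size=8))

score = float(user @ new_item)  # cosine similarity, servable right away
```

Because the towers never need to see each other at inference time, item embeddings can be precomputed the moment a product enters the catalog and pushed straight into the ANN index.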

The Feedback Loop Trap

Here's the subtler problem: your model's recommendations influence the data that trains your next model. Recommend Product A heavily → customers buy Product A → model learns customers prefer Product A → recommends it even more. This feedback loop creates a self-reinforcing cycle that narrows diversity over time, and your offline metrics will look great because you're evaluating on data that was shaped by the previous model.

Breaking this loop without tanking short-term metrics requires careful exploration strategies. Epsilon-greedy is the minimum bar, but ideally you want something like Thompson sampling that can balance exploration and exploitation more gracefully. We log counterfactual data (what would a random policy have shown?) alongside production traffic to debias our training sets. It's messy but necessary.
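For readers who haven't implemented it, Beta-Bernoulli Thompson sampling is only a few lines. This is an illustrative single-context version (real recommender bandits condition on user and item features); conversions are treated as Bernoulli outcomes:

```python
import random

class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over a fixed item set (illustrative)."""

    def __init__(self, n_items: int):
        # Beta(1, 1) priors: uniform over each item's conversion rate.
        self.alpha = [1.0] * n_items  # successes + 1
        self.beta = [1.0] * n_items   # failures + 1

    def select(self) -> int:
        # Sample a plausible conversion rate per item; recommend the best.
        samples = [random.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, item: int, converted: bool) -> None:
        if converted:
            self.alpha[item] += 1.0
        else:
            self.beta[item] += 1.0
```

Uncertain items get sampled optimistically often enough to keep exploring them, while items with lots of evidence converge to exploitation — the graceful balance epsilon-greedy lacks.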

Latency vs. Relevance

Your model might achieve incredible offline metrics, but if it takes 200ms to generate recommendations, it's useless for real-time surfaces. We've found that a two-stage architecture works well: a fast candidate generation model (often a simple ANN lookup over learned embeddings) feeds into a more sophisticated re-ranking model that can afford to be slower because it's only scoring ~100 candidates instead of millions.

The trick is deciding where to draw the line between stages and how many candidates to pass through. Too few and your re-ranker never sees the genuinely relevant items that the blunt retrieval pass scored poorly. Too many and you've just moved the latency problem downstream. We ended up with a cascade of three stages — retrieval, lightweight scoring, heavy re-ranking — each filtering by roughly 100x.
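The cascade shape is easy to express as a funnel. A sketch with placeholder scoring functions — note that in production the first stage is an ANN lookup, not a full sort over the catalog as this toy version does:

```python
def cascade(candidates, cheap_score, mid_score, heavy_score,
            k1=10_000, k2=100, k=10):
    """Three-stage funnel: each stage keeps ~1% of what it receives.

    cheap_score stands in for retrieval (really an ANN index lookup),
    mid_score for a lightweight ranker, heavy_score for the full re-ranker.
    """
    stage1 = sorted(candidates, key=cheap_score, reverse=True)[:k1]
    stage2 = sorted(stage1, key=mid_score, reverse=True)[:k2]
    return sorted(stage2, key=heavy_score, reverse=True)[:k]
```

The latency budget falls out of the multiplication: the heavy model's per-item cost only has to be paid `k2` times, not once per catalog entry.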

The Serendipity Problem

The hardest challenge isn't relevance — it's surprise. A recommender that always suggests items similar to what a user already bought is technically accurate but boring. The real wins come from surfacing unexpected items that customers didn't know they wanted.

We've experimented with diversity-aware re-ranking using MMR (Maximal Marginal Relevance) and DPP (Determinantal Point Processes), and they help. But quantifying "delightful surprise" versus "irrelevant noise" is more art than science. The best signal we've found is tracking exploration conversions — purchases of items the user had never interacted with in the same category before.
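MMR itself is simple enough to sketch in a dozen lines. This is a greedy numpy version over unit-normalized embeddings; `lam` trades relevance against redundancy with the already-selected set:

```python
import numpy as np

def mmr(query_vec, item_vecs, lam=0.7, k=5):
    """Greedy Maximal Marginal Relevance over unit-norm embeddings."""
    relevance = item_vecs @ query_vec
    selected, remaining = [], list(range(len(item_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            # Penalize similarity to anything we've already picked.
            redundancy = max((float(item_vecs[i] @ item_vecs[j])
                              for j in selected), default=0.0)
            return lam * float(relevance[i]) - (1.0 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a low `lam`, a near-duplicate of the top item loses to a less relevant but novel one — which is exactly the behavior you're tuning when you chase serendipity.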

What Actually Matters

After working on these systems in production, my biggest takeaway is that the model is maybe 30% of the problem. The other 70% is data quality, feature engineering, serving infrastructure, and monitoring. Your fancy transformer-based sequential recommender is worthless if your item embeddings are stale, your feature store has a 10-minute lag, or you can't detect when your model starts recommending out-of-stock items.

The field loves to chase SOTA architectures, but the delta between a well-tuned two-tower model and the latest research paper is usually smaller than the delta between good and bad feature engineering. Build the boring infrastructure first. The clever modeling can wait.

March 2026 7 min read

The Hidden Cost of RAG: When Retrieval Augmented Generation Goes Wrong

RAG pipelines promise grounded LLM outputs, but what happens when your retriever fetches confidently irrelevant context? I walk through failure modes I encountered building production RAG systems — from embedding drift to chunk boundary hallucinations.

Confident Irrelevance

The most dangerous RAG failure isn't when the retriever returns nothing — it's when it returns something that looks relevant but isn't. Embedding similarity is a surprisingly blunt instrument. A query about "model training" might retrieve documents about "employee training models" because the embedding space maps these to nearby regions. The LLM then confidently synthesizes an answer grounded in completely wrong context, and the user has no way to tell.

We mitigate this with a cross-encoder reranking step between retrieval and generation. It helps significantly — cross-encoders see both query and document simultaneously, so they can catch semantic mismatches that bi-encoder similarity misses. But it adds 50-100ms of latency and doesn't eliminate the problem entirely. The fundamental issue is that semantic similarity does not equal relevance for a given query intent.
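The reranking stage itself is a thin shell around the scoring model. A sketch with `score_fn` standing in for a real cross-encoder (e.g. a fine-tuned MiniLM pair classifier loaded elsewhere — the function name and signature here are illustrative):

```python
def rerank(query: str, docs: list, score_fn, top_k: int = 5) -> list:
    """Re-order bi-encoder candidates by a joint query-document score.

    score_fn(query, doc) -> float is assumed to be a cross-encoder that
    sees both texts at once, so it can catch mismatches that raw
    embedding similarity misses.
    """
    scored = [(score_fn(query, d), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:top_k]]
```

Because the cross-encoder only scores the retrieved handful, its cost stays bounded no matter how large the corpus is — that's where the 50-100ms figure comes from.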

A more robust approach we're exploring: training a lightweight relevance classifier on examples of "retrieved but not actually relevant" documents. When the classifier confidence is low, the system falls back to an "I don't have enough context to answer this" response instead of hallucinating. Users vastly prefer "I don't know" over confidently wrong answers.
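The gating logic is the easy part; the hard part is the classifier's training data. A minimal sketch of the control flow, with `relevance_score` and `generate` as hypothetical stand-ins for the classifier and the LLM call:

```python
FALLBACK = "I don't have enough context to answer this."

def answer_with_fallback(query: str, docs: list, relevance_score,
                         generate, threshold: float = 0.5) -> str:
    """Refuse to generate when no retrieved doc clears the relevance bar."""
    scored = sorted(docs, key=lambda d: relevance_score(query, d),
                    reverse=True)
    best = scored[0] if scored else None
    if best is None or relevance_score(query, best) < threshold:
        return FALLBACK  # honest refusal beats confident hallucination
    return generate(query, best)
```

The `threshold` is a product decision as much as a modeling one: raise it and you refuse more often but hallucinate less.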

Chunk Boundary Hallucinations

How you chunk your documents determines what your retriever can find. Split too aggressively and you lose context — a chunk might contain a conclusion without its supporting argument. Split too conservatively and you waste context window tokens on irrelevant paragraphs that just happened to be near the relevant sentence.

The nastiest failure mode: a critical piece of information split across two chunks, where neither chunk alone is retrievable for the relevant query. Chunk A says "The policy was updated in Q3." Chunk B says "The new threshold is 500 units." Neither chunk alone answers "What is the current policy threshold?" The model either misses it entirely or hallucinates a plausible-sounding number from training data.

Recursive chunking with overlap helps, but it's a band-aid. We've had better results with a sliding window approach where chunks overlap by 20-30%, combined with a chunk expansion step at retrieval time — when you retrieve chunk N, you also pull in chunks N-1 and N+1 for the LLM to reason over.
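Both ideas fit in a few lines. A sketch over token lists (the sizes are illustrative; in practice you'd chunk on sentence or section boundaries where possible):

```python
def chunk_with_overlap(tokens: list, size: int = 200,
                       overlap: int = 50) -> list:
    """Sliding-window chunking: adjacent chunks share `overlap` tokens."""
    chunks, step = [], size - overlap
    for start in range(0, max(len(tokens) - overlap, 1), step):
        chunks.append(tokens[start:start + size])
    return chunks

def expand(chunks: list, hit_index: int) -> list:
    """Chunk expansion: hand the LLM the hit plus its neighbors."""
    lo = max(hit_index - 1, 0)
    return chunks[lo:hit_index + 2]
```

In the split-fact example above, the overlap gives the "Q3 policy update" and the "500 units" threshold a chance to land in the same chunk, and expansion covers the cases where they still don't.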

Embedding Drift

Your document embeddings are computed once (or periodically), but the world they describe changes continuously. When the facts in your knowledge base update but the embeddings don't, your retriever operates on a stale representation. In fast-moving domains — policy docs, product catalogs, internal wikis — this means your RAG system confidently returns outdated information.

We run incremental re-embedding pipelines triggered by document change events. But the real cost isn't compute — it's the subtle bugs that emerge when some embeddings are fresh and others aren't. If you update your embedding model (even a minor version bump), old and new embeddings live in slightly different vector spaces.

The pragmatic solution: version your embedding spaces. When you update the model, re-embed everything. Yes, it's expensive. The alternative — gradual drift in retrieval quality that's nearly impossible to debug — is worse.
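Operationally, versioning can be as simple as tagging every stored vector and filtering at query time. A sketch (the version string and storage shape are made up; the point is that vectors from different model versions never get compared):

```python
from dataclasses import dataclass

EMBEDDING_VERSION = "text-embed@v2"  # bump on ANY model change, even minor

@dataclass
class StoredVector:
    doc_id: str
    vector: tuple
    version: str

def search(index: list, query_vec: tuple, top_k: int = 3,
           version: str = EMBEDDING_VERSION) -> list:
    """Only score vectors produced by the current embedding model."""
    live = [v for v in index if v.version == version]
    scored = sorted(
        live,
        key=lambda v: sum(a * b for a, b in zip(v.vector, query_vec)),
        reverse=True,
    )
    return [v.doc_id for v in scored[:top_k]]
```

During a re-embedding rollout, stale-versioned vectors simply go dark rather than silently returning garbage rankings — a much easier failure mode to monitor.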

Evaluation Is a Nightmare

How do you know your RAG system is working? Traditional retrieval metrics (precision, recall, MRR) only measure the retriever in isolation. LLM evaluation metrics (ROUGE, BLEU) don't capture factual accuracy or faithfulness to retrieved context.

We've settled on a three-layer evaluation approach. First, retrieval quality: does the retriever surface the right documents? We maintain a golden set of query-document pairs and track recall@k weekly. Second, faithfulness: does the generated answer actually follow from the retrieved context? We use an LLM-as-judge approach. Third, end-to-end correctness: periodic human evaluation of sampled outputs against ground truth.
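The first layer's metric is worth writing down precisely, since "recall@k" gets defined inconsistently. The version we track against the golden set:

```python
def recall_at_k(retrieved: list, relevant: list, k: int) -> float:
    """Fraction of golden-set relevant docs found in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)
```

Tracking this weekly per query segment is what surfaces embedding drift and chunking regressions before users do.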

The Takeaway

RAG is not the solved problem the ecosystem wants it to be. It's a powerful pattern, but it requires serious engineering around retrieval quality, chunking strategy, freshness, and evaluation. If you're building a RAG system, budget at least as much time for retrieval engineering as you do for prompt engineering. The retriever is the bottleneck — the LLM can only be as good as the context it receives.

February 2026 7 min read

Vision Transformers in the Wild: Lessons From Deploying ViT at Scale

ViTs achieve impressive benchmark numbers, but deploying them in latency-sensitive environments tells a different story. From knowledge distillation tricks to dynamic token pruning, here is what actually worked when we needed sub-50ms inference on edge devices.

The Benchmark-Production Gap

A ViT-Base model runs at ~4ms on an A100 GPU with batch size 1. Sounds fast. But in production, you're not running on an A100 — you're running on whatever your infrastructure team approved, which might be a T4, a CPU instance, or an edge device with an NPU that has its own set of operator support quirks. That 4ms becomes 80ms real fast.

The gap isn't just hardware. Preprocessing (resize, normalize, center crop) that takes negligible time in a training loop adds measurable latency in a serving pipeline. We measured end-to-end latency on our target hardware from day one — not just model forward pass time — and it changed every optimization decision we made.

Knowledge Distillation: What Actually Works

We spent months trying to compress a ViT-Large teacher into something deployable. Intermediate layer matching consistently outperformed output-only distillation. The student model learns better representations when it's forced to mimic the teacher's internal feature maps. We used a projection head to align dimensions and minimized MSE loss on the [CLS] token representation at layers 3, 6, 9, and 12.
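The intermediate-matching loss is conceptually just a projected MSE at each matched layer. A numpy sketch of the loss term alone (the projection `proj` stands in for the learned projection head; in training it would be a trainable module and this would sit alongside the task loss):

```python
import numpy as np

def feature_distill_loss(student_feats: list, teacher_feats: list,
                         proj: np.ndarray) -> float:
    """Mean MSE between projected student [CLS] features and teacher
    [CLS] features at matched layers (one array per matched layer)."""
    total = 0.0
    for s, t in zip(student_feats, teacher_feats):
        total += float(np.mean((s @ proj - t) ** 2))
    return total / len(student_feats)
```

The weighting between this term and the output-level distillation loss is a tunable we found mattered less than which layers you match.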

For our use case, distilling into a MobileNetV3 student gave us the best latency-accuracy tradeoff. The resulting model was 8x smaller and 12x faster, with only a 2.3% accuracy drop. Trying to distill into a smaller ViT (ViT-Tiny) actually performed worse — the transformer architecture doesn't degrade gracefully at small scales the way CNNs do.

Dynamic Token Pruning

The key insight about ViTs is that not all patches matter equally. For most images, the background patches carry minimal information but consume the same compute as foreground patches in every attention layer.

We implemented a learnable token pruning mechanism that progressively drops uninformative tokens at each transformer layer. By layer 6 of a 12-layer model, we're typically processing only 40-60% of the original tokens. This gave us a 1.8x speedup with less than 0.5% accuracy loss.

The trick is making the pruning decision differentiable so you can train it end-to-end. We used a Gumbel-Softmax approach for the keep/drop decision during training. At inference time, it's a hard threshold — no sampling overhead.
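For the two-class keep/drop case, Gumbel-Softmax reduces to the Binary Concrete relaxation: add logistic noise to the keep logit, squash through a tempered sigmoid. A numpy sketch of just the masking decision (the learnable logits would come from a small scoring head per token):

```python
import numpy as np

def keep_mask(logits, tau: float = 1.0, hard: bool = False, rng=None):
    """Per-token keep/drop decisions.

    Training (hard=False): Binary Concrete relaxation — differentiable
    soft keep-probabilities. Inference (hard=True): a plain threshold,
    no sampling overhead.
    """
    logits = np.asarray(logits, dtype=np.float64)
    if hard:
        return (logits > 0).astype(np.float64)
    rng = rng or np.random.default_rng()
    u = rng.uniform(1e-6, 1 - 1e-6, size=logits.shape)
    noise = np.log(u) - np.log(1 - u)  # logistic = difference of Gumbels
    return 1.0 / (1.0 + np.exp(-(logits + noise) / tau))
```

During training you'd typically use a straight-through estimator (hard mask forward, soft gradient backward) so the token count actually drops inside the attention layers.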

Quantization War Stories

INT8 quantization is table stakes — every major framework supports it, and you get a ~2x speedup with minimal accuracy loss. The interesting territory is mixed-precision quantization, where different layers get different bit widths.

We found that attention layers are significantly more sensitive to quantization than FFN layers. Our best configuration: Q/K projections at INT8, V projections and FFN layers at INT4. This required per-channel calibration and careful handling of outlier activations.
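To make "per-channel calibration" concrete, here's a toy symmetric per-channel scheme in numpy — one scale per output row, derived from that row's max absolute value. Real toolchains (and our INT4 path especially) layer outlier handling on top of this, but the core arithmetic looks like:

```python
import numpy as np

def quantize_per_channel(w: np.ndarray, bits: int = 8):
    """Symmetric per-channel quantization: one scale per output row."""
    qmax = 2 ** (bits - 1) - 1          # 127 for INT8, 7 for INT4
    scales = np.abs(w).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero rows
    q = np.clip(np.round(w / scales), -qmax, qmax)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return q * scales
```

Comparing round-trip error at 8 vs. 4 bits on the same weights makes the sensitivity story visible immediately, which is a cheap way to decide which layers can tolerate the lower width.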

ONNX Runtime with the TensorRT execution provider was our deployment target. The export process alone took weeks of debugging. If anyone tells you ONNX export is "just torch.onnx.export()," they haven't tried it with custom attention mechanisms.

What I'd Do Differently

Start with distillation, not pruning or quantization. It's the highest-leverage optimization and gives you a fundamentally simpler model that's easier to serve and debug on commodity hardware.

Also: profile before you optimize. We spent two weeks optimizing the transformer backbone only to discover that 40% of our end-to-end latency was in image preprocessing and postprocessing. A simple switch from PIL to OpenCV for image decoding saved more time than our first round of model optimization did. The boring engineering wins beat the clever ML tricks almost every time.