The Report That Came Back Shuffled

By Scott Armbruster


Every action I take across personal projects, client work, and business operations is automatically logged. Each morning a pipeline reads those session logs and distills them into this post. After 23 years in technology, this is what building with AI actually looks like: real decisions, real friction, and the patterns you only recognize after you’ve been burned by them before.

The Day at a Glance

  • Tracked down a production defect that had nothing to do with AI: a 30-year-old distributed systems failure mode, surfaced inside a modern pipeline.
  • The three-question check I now run on every parallel batch operation before it ships.
  • Why observability is a prerequisite for trusting automation, not an optional layer you add later.
  • Migrated a personal AI assistant’s memory from flat storage to vector search. Forty-five facts, one migration script, no rollback needed.
  • Content pipelines ran automatically across 15+ sites while all of this was happening.

The Ordering Bug That Predates AI

Got a message mid-morning: an automated domain status report was coming back shuffled. Wrong sequence. Looked like a pipeline failure. First instinct was a defect in the report generation logic. That was wrong. The actual problem was older than most of the tools involved.

The pipeline fires a batch of status checks in parallel, which is correct, but it was treating arrival order as sequence order. First response received got position one; last response got the final slot. Network latency doesn’t respect intended sequence, so the report reshuffled itself every run depending on which upstream services answered fastest. This is not an AI problem. It shows up in any system that parallelizes work without preserving sequence metadata at dispatch time.

Serializing the calls would fix it, but that kills the performance benefit of parallelism entirely. The actual fix: assign each request its sequence position before firing it, then sort results by that position on receipt, before any downstream logic touches the output.

The fix made me write down three questions I should have been asking before any parallel batch operation ships:

  • Does the output need ordered results?
  • Is the sequence position assigned at dispatch or inferred from arrival?
  • What does a downstream consumer do if items arrive out of order?

That last question is the one that keeps catching me. Ordering bugs rarely throw exceptions. They produce subtly wrong output that passes every test written for the success path. I wrote about a related class of quiet failures in why async pipelines fail silently. Different surface, same underlying dynamic.
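Here is a minimal sketch of that fix in TypeScript. Everything in it is illustrative: checkDomain stands in for whatever the real pipeline calls upstream, and the random delay just simulates uneven network latency.

```typescript
interface DomainStatus {
  domain: string;
  healthy: boolean;
}

// Stand-in for the real upstream check; the random delay simulates uneven latency.
async function checkDomain(domain: string): Promise<DomainStatus> {
  await new Promise((resolve) => setTimeout(resolve, Math.random() * 500));
  return { domain, healthy: true };
}

async function runStatusBatch(domains: string[]): Promise<DomainStatus[]> {
  // Assign the sequence position at dispatch time, before any request fires.
  const tagged = await Promise.all(
    domains.map(async (domain, position) => ({
      position,
      status: await checkDomain(domain),
    })),
  );

  // Sort by dispatch position on receipt, before downstream logic sees the output.
  return tagged
    .sort((a, b) => a.position - b.position)
    .map((entry) => entry.status);
}

runStatusBatch(['example.com', 'example.org', 'example.net']).then(console.log);
```

In this particular construction Promise.all already returns results in input order, but carrying the position explicitly is what keeps the guarantee intact when results pass through anything, like a queue or a retry layer, that doesn't.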

Observability Is Not Phase Two

The financial services work today was hardening observability for an async workflow engine: specifically, tracing spans that disappear when jobs die mid-execution, and reconciliation logic that catches runs which complete without correctly updating their state. Inngest has solid primitives for workflow orchestration, but the failure-mode test harness is still yours to build.

The pattern I keep seeing: most automation failures aren’t code defects, they’re observability gaps. The system did something. You just can’t tell what. The teams I’ve watched do this well write the “job died halfway through” test before they write the happy-path test. Not as polish. As the definition of done.

Today’s work extracted run management into a clean, testable unit and wrote integration tests that specifically probe ghost spans and run-reconciliation gaps. These aren’t edge cases at volume. They’re the normal failure modes, and if you’re not testing them explicitly, you’re discovering them in production.
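For a flavor of what a reconciliation check like that looks like, here is a small sketch. It is not Inngest's API; it assumes a hypothetical run store where each run carries an id, a start time, and a state.

```typescript
type RunState = 'running' | 'completed' | 'failed';

interface RunRecord {
  id: string;
  startedAt: Date;
  state: RunState;
  completedAt?: Date;
}

// Anything still "running" past this window is treated as a possibly dead job.
const STALE_AFTER_MS = 15 * 60 * 1000;

// Flags the two ghost cases described above: runs stuck in 'running' past the
// threshold, and runs that reached a terminal state without recording when.
function findGhostRuns(runs: RunRecord[], now: Date = new Date()): RunRecord[] {
  return runs.filter((run) => {
    const staleRunning =
      run.state === 'running' &&
      now.getTime() - run.startedAt.getTime() > STALE_AFTER_MS;
    const terminalWithoutTimestamp =
      run.state !== 'running' && run.completedAt === undefined;
    return staleRunning || terminalWithoutTimestamp;
  });
}
```

Against a sketch like this, the "job died halfway through" test is roughly: kill a job mid-step, run the reconciliation pass, and expect the run to show up in that list rather than quietly looking finished.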

Upgrading the Memory Architecture

Separately, migrated a personal AI assistant’s memory from flat JSONL storage to a vector-indexed semantic retrieval system. Forty-five facts, one migration script, no manual corrections. The practical difference: keyword lookup breaks when you use different words to describe the same concept. Semantic retrieval degrades far more gracefully. System-prompt injection now surfaces relevant context based on meaning, not string matching. The improvement is real — context that previously required explicit recall is now pulled automatically.
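For a sense of the shape of that migration, here is a sketch. The names are all hypothetical: the embedding function stands in for whatever call the assistant actually uses, and the "index" here is just an in-memory array searched by cosine similarity rather than a real vector store.

```typescript
import { readFileSync } from 'node:fs';

interface Fact { id: string; text: string }
interface IndexedFact extends Fact { vector: number[] }

// The caller supplies the embedding function; this sketch assumes no particular provider.
type Embed = (text: string) => Promise<number[]>;

function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Migration: one embedding per JSONL fact, written into the index in a single pass.
async function migrate(jsonlPath: string, embed: Embed): Promise<IndexedFact[]> {
  const lines = readFileSync(jsonlPath, 'utf8').split('\n').filter(Boolean);
  const facts: Fact[] = lines.map((line) => JSON.parse(line));
  return Promise.all(facts.map(async (f) => ({ ...f, vector: await embed(f.text) })));
}

// Retrieval by meaning: rank every stored fact against the query embedding.
async function recall(index: IndexedFact[], query: string, embed: Embed, topK = 5): Promise<Fact[]> {
  const queryVector = await embed(query);
  return [...index]
    .sort((a, b) => cosine(b.vector, queryVector) - cosine(a.vector, queryVector))
    .slice(0, topK)
    .map(({ id, text }) => ({ id, text }));
}
```

At forty-five facts, the brute-force scan in this sketch is computationally trivial; a dedicated vector index only starts earning its keep as the memory grows.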

The part I haven’t resolved: where does the right quality threshold for semantic relevance actually live? The retrieval is better, but it occasionally surfaces adjacent context rather than directly relevant context. I don’t know yet whether that’s a threshold calibration problem, an embedding quality problem, or just the inherent fuzziness of meaning-based retrieval. Probably all three. And I suspect the right threshold isn’t uniform. Factual recall probably wants different sensitivity than preference or context retrieval. I’m still sitting with this one.
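One shape the non-uniform threshold idea could take, purely as a sketch with placeholder numbers rather than tuned values:

```typescript
type MemoryCategory = 'fact' | 'preference' | 'context';

// Minimum cosine similarity before a retrieved memory is injected into the
// system prompt. The values are placeholders, not calibrated thresholds.
const MIN_SIMILARITY: Record<MemoryCategory, number> = {
  fact: 0.82,       // factual recall: favor precision, tolerate the occasional miss
  preference: 0.70, // preferences: cheap to surface, so be more permissive
  context: 0.75,
};

function shouldInject(category: MemoryCategory, similarity: number): boolean {
  return similarity >= MIN_SIMILARITY[category];
}
```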