Append-Only Data Cleanup: When a Counter Only Goes Up

The Badge That Climbed to 4,414

I run a DTC fashion brand out of San Diego, handmade goods, and like any inventory operation we have a drift-log table. Its job is to flag the moments when physical stock and digital stock disagree. A box gets miscounted, a return gets logged twice, a unit walks off a shelf. The drift-log catches it and throws up a badge in the corner of the admin dashboard so somebody knows to look.

The badge read 4,414.

It was supposed to be transient. A handful of rows at most, the kind of warning that pops up, gets resolved, and disappears within a day. Instead it had quietly become a permanent fixture. The kind of thing you stop seeing after a few weeks. Nobody clicked it anymore. Nobody believed it.

That is the whole problem with append only data cleanup in one number. A counter that only goes up is not a metric. It's noise wearing a metric's clothes. The moment a signal can only climb, it stops carrying information. 4,414 told me exactly as much as 4,400 did, or 5,000 would have. Which is to say, nothing.

Here's the part that bugs me most. A red badge you've trained yourself to ignore is worse than no badge at all. No badge means you go check things manually because you know you're flying blind. A stale badge gives you the false comfort of instrumentation while quietly lying to you. You think you're being warned. You're being numbed.

So I sat down to fix it. Not to clear the number. To make the number mean something again.

Why It Was Append-Only by Accident

Before I touched anything, I needed to understand how a table designed to be transient became permanent.

The missing line of code

The drift-log was written to by two automated jobs: a reconcile cron that compared physical and digital counts, and an investigator job that dug into the discrepancies. Both of them knew how to write a row that said "drift detected."

The append-only accident: detection without resolution

Neither of them ever wrote the other half of the story.

There was a resolved_at column sitting right there in the schema. It was never set. Not once in the entire history of the table. Rows went in and never came out. That's the textbook definition of an accidental append-only table. Not append-only by design, where you intentionally keep an immutable ledger. Append-only by omission, because somebody (me, at some point) wrote the detection logic and never wrote the resolution logic.

Every table that records a problem needs a matching code path that records the problem being solved. If you only build the first half, you haven't built a monitoring system. You've built a landfill with a counter on top.

One historical over-correction

The missing resolution logic explained the steady climb. But it didn't explain the size. 4,414 is a lot of drift for a brand my size.

Digging in, I found the second contributor. At some point months earlier, a single over-correction event had dumped a large batch of rows into the table all at once. A reconcile job had run against bad input and flagged thousands of products as drifted in one pass.

Combine the two. One big historical dump plus no way for any row to ever leave. The table had exactly one direction it could move. Up.

Why Stale Alerts Are Worse Than No Alerts

Let me make the case for why this matters operationally, because it's easy to dismiss as cosmetic.

Stale alert vs no alert vs trustworthy alert

When the badge always says 4,414, nobody investigates the 4,415th. That's alert fatigue, and it's not a soft problem. It's a direct operational risk. Somewhere in those 4,400 rows there might be three or four that represent real, active, costing-me-money drift. Products where I genuinely have fewer units than the system thinks, or more. Those are buried under 4,400 that mean nothing.

A team that has learned to ignore the badge will keep ignoring it when it finally matters. You've trained your own people to treat your most important signal as wallpaper.

This is the same failure mode I wrote about in alerts that only fire when something is actually wrong. A signal is only as valuable as your willingness to act on it. The instant a badge becomes background noise, it has negative value, because it occupies the slot where a trustworthy signal should live.

There's a deeper version of this too. A system that displays a number it can't back up is a system that lies about its own state. The badge wasn't malfunctioning. It was reporting exactly what it was told to report. The problem was that what it was told to report had stopped corresponding to reality.

Here's the trap I want you to avoid. A builder who only clears the number without fixing the root cause is making it worse. He's resetting the gauge while leaving the leak. Two months later you're back at 4,000 and now you trust the badge even less. The fix has to address the lifecycle, not the symptom.

Layer One: Close the Loop in the Plumbing

The first thing I fixed was the thing that would prevent this from ever happening again. Everything else is just cleaning up the mess that already existed.

I added resolution logic to both the reconcile and investigator crons. Now, when a product's physical and digital counts realign (which happens naturally as stock moves and gets recounted), the corresponding drift rows get marked resolved automatically. The resolved_at column finally gets set.

This stops the bleeding going forward. No new rows pile up permanently, because the system now records both sides of the lifecycle. Drift detected, and drift resolved. The table can finally move in both directions.

This is the resolver pattern that closes the loop, and it's the part I care about most. Cleaning up 4,400 historical rows is satisfying. But if I'd only done that, I'd be back here in six months. The durable fix is the one that means the problem cannot return. Everything I built after this is one-time work. This piece is permanent.

I want to be honest about the order here, because it's the opposite of what feels natural. The instinct is to clear the backlog first, because that's the visible win. But if you clear the backlog before you fix the plumbing, you're just buying yourself a clean slate that immediately starts dirtying again. Fix the source first. Then clean up. That sequence is the difference between solving a problem and rescheduling it.

Layer Two: An Idempotent One-Shot Cleanup With a Stock-Sum Guard

With the plumbing fixed, I still had 4,400+ historical rows that would never resolve on their own, because they referred to drift events long since settled. Those needed a separate, one-time cleanup.

I wrote a one-shot script. The first property I designed in was idempotency. Run it once, run it ten times, you get the same result. That matters because cleanup scripts get interrupted, re-run, run against the wrong environment by mistake. An idempotent cleanup script is one you can run nervously and still trust.

Dry-run, apply, undo

The script had three modes.

Dry-run showed exactly what it would change without changing anything. Every row it intended to touch, printed and counted. I ran this first, read the output line by line, and confirmed the script's understanding matched mine before it modified a single byte.

Apply made the actual change.

Undo reversed it. If apply did something I didn't like, one command put everything back. No restoring from backup, no manual surgery. The reverse path was built in from the start.

Those three modes are non-negotiable for me on anything that touches live data. If a cleanup can't show you what it'll do before it does it, and can't undo what it did after, it's not a tool. It's a gamble.

The safety property that mattered most

Here's the load-bearing claim of this whole piece. None of these scripts ever touched actual inventory counts. They only changed metadata, the resolved_at flag.

The stock-sum safety guard

But saying that isn't enough. I had to prove it.

So the script summed total stock across every product before it ran and after it ran. And it refused to apply if those two numbers didn't match exactly. Not approximately. Exactly.

That's the data migration safety guard that should be standard on anything running against your live data. The script could not damage inventory, because if it had changed even one unit of stock anywhere, the before-sum and after-sum would diverge and the script would abort and roll itself back. The guard turned "trust me, it only changes metadata" into "the math proves it only changed metadata."

That's the difference between a cleanup you hope was safe and one that can demonstrate it was.

Layer Three: A Forensic Path for the Drift That Was Real

Not every row was noise. That's the part a lazy cleanup gets catastrophically wrong.

A small number of those 4,400+ rows represented genuine active drift. Products where physical and digital counts really were out of sync, right now, in a way that mattered. If I'd written a script that just nuked every unresolved row, I'd have erased the four problems I actually needed to see, along with the 4,400 I didn't.

So the cleanup logic was careful. It auto-resolved rows where the underlying drift had demonstrably settled (the counts now agreed). It deliberately did not auto-resolve rows where the counts still disagreed. Those couldn't be cleared by a script, because clearing them would be a lie. They needed a human.

For those, I built a forensic path. A view that surfaces the genuinely active drift so I or someone on my team can investigate each one and correct it properly. Find the unit. Count the shelf. Fix the number. Then let the resolver mark it resolved the right way.

The result: the badge dropped from 4,414 to 4.

Four real problems. Visible, specific, actionable. Instead of 4,414 that meant nothing.

I want to be precise about the goal here, because it's the whole point. The goal was never zero. Zero would have required either having no drift ever (impossible) or lying about the drift I had. The goal was a number I can trust. A badge that reads 4 and is correct beats a badge that reads 0 and lies, every single time. When that badge ticks to 5 next week, I'll actually go look. That's the entire reason the badge exists.

What a Trustworthy Cleanup Actually Looks Like

Step back and the pattern is the same one I'd want from anyone touching a system I depend on.

The three-layer cleanup sequence

Fix the root cause in the plumbing so the problem never recurs. Run the historical cleanup as something idempotent and reversible, with a guard that proves it didn't damage anything it claimed not to touch. And keep a forensic path for the cases that are genuinely real, so the cleanup never erases the signal it was supposed to protect.

The drift-log was one table in one system in one brand. But here's the part that should matter to you. Most of the systems running your business are full of these accidental landmines. Counters that only go up. Badges nobody trusts. Scripts that change more than they claim, with no guard to catch them when they do. Tables that record problems and never record solutions.

Nobody built those landmines on purpose. They're the residue of shipping fast and writing the first half of the logic. They accumulate quietly until one day a number reads 4,414 and you realize you stopped believing your own instruments months ago.

When I build cleanup, I build it so it can prove it was safe. That's not extra polish. That's the bar. A migration that can't demonstrate it didn't break anything hasn't earned the right to run against your data.

If any of this sounds familiar, the move isn't to start over. It's to look at the systems already running your business and find out which of your numbers you can actually trust. That's a conversation worth having before the next badge climbs past the point where anyone's looking. The principle behind all of it is the same one I keep coming back to in systems that lie about their own state: an instrument that can't be trusted is worse than no instrument at all.

Ready to bring AI leadership into your company?

I work with a small number of companies at a time. If you're serious about AI, apply to work together and I'll review your application personally.

Apply to Work Together