Validate Software Against Historical Data: Replay It (Simply Explained)

How do you know the software is actually right?

I built a payroll tax system for one of my companies. Before it could file a single real tax return, it had to match reality, down to the penny.

Think about what that means. The number this system spits out becomes someone's paycheck. It gets sent to the IRS and the state. There is no "we'll fix it next month." You get one shot to be right, and being wrong costs real money and real trust.

So I had to answer a question. The same one every business owner should ask before they rely on custom software: how do I actually know it's correct?

Not "does it look right in a demo." Not "a smart person built it." How do I really know?

Most people answer that question the lazy way. They test the software against what they think should happen. They write down the rules as they understand them, check that the software follows those rules, and call it done.

That proves nothing. You're testing your own guesses against your own guesses. If you misread a tax rule, your test will happily confirm you got your mistake right.

Test against reality, not your assumptions

Here's the better way. Don't test against what you think should happen. Test against what actually happened.

Real money was already withheld from real people over real pay periods. Those exact numbers are sitting in the company's records right now. Thousands of paychecks. Every one of them already has the correct answer attached.

So I didn't try to invent the right answer. I tried to match the right answer that already existed.

I called this a replay. Take every old paycheck, feed the same starting numbers through my new system, and check whether it produces the same result that actually went out the door. Where it matches, great. Where it doesn't, I have an exact list of what to investigate.
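
A minimal sketch of a replay loop, in Python. The record layout, the `toy_engine` function, and the flat 10 percent tax are all made up for illustration; a real payroll engine is far more involved. The shape is what matters: recorded inputs go in, and the output is compared to what actually went out.

```python
from decimal import Decimal


def toy_engine(inputs):
    # Hypothetical stand-in for the new system: a flat 10% tax, rounded to the cent.
    gross = inputs["hours"] * inputs["rate"]
    tax = (gross * Decimal("0.10")).quantize(Decimal("0.01"))
    return {"gross": gross, "tax": tax, "net": gross - tax}


def replay(history, engine):
    """Re-run every historical check and return the ones that disagree."""
    misses = []
    for record in history:
        computed = engine(record["inputs"])
        if computed != record["actual"]:
            misses.append({
                "check_id": record["check_id"],
                "computed": computed,
                "actual": record["actual"],
            })
    return misses


# Two invented historical checks. "actual" is what really went out the door.
history = [
    {
        "check_id": "A-001",
        "inputs": {"hours": Decimal("40"), "rate": Decimal("25.00")},
        "actual": {"gross": Decimal("1000.00"), "tax": Decimal("100.00"),
                   "net": Decimal("900.00")},
    },
    {
        "check_id": "A-002",
        "inputs": {"hours": Decimal("30"), "rate": Decimal("20.00")},
        "actual": {"gross": Decimal("600.00"), "tax": Decimal("61.00"),
                   "net": Decimal("539.00")},
    },
]

misses = replay(history, toy_engine)
for miss in misses:
    print(miss["check_id"], "needs investigation")
```

The sketch uses `Decimal` rather than floats because binary floating point cannot represent most cent values exactly, and a penny-level comparison needs exact arithmetic.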

Think of it like a new chef trying to recreate a famous dish. You don't ask him if it tastes right to him. You compare his version to the original, bite by bite, until they're identical.

Why checking the total is a trap

Here's the mistake that feels careful but isn't: just checking that the final paycheck total matches.

A matching total is not enough. Say my system is five dollars too high on one tax and five dollars too low on a deduction. The final paycheck comes out exactly right. Two mistakes, perfectly hidden, canceling each other out.

You'd ship it and never know. Until the one situation where they don't cancel, and now you're explaining to an employee why their check is wrong.

So I didn't just check the total. I checked 13 separate numbers on every single paycheck. Gross pay. Each individual tax. Each deduction. The final amount. Every piece got its own comparison.

This is the tedious part most people skip. It's also the part that catches the hidden, dangerous errors. The total tells you the system might be right. Checking every line tells you whether it actually is.
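
The canceling-errors trap is easy to show in a few lines. This is a sketch with four invented fields instead of the 13 described above, and the field names are assumptions. The total matches, but the per-field comparison exposes both mistakes.

```python
from decimal import Decimal


def diff_fields(computed, actual):
    """Return {field: computed - actual} for every field that differs.

    Assumes both dictionaries carry the same field names.
    """
    return {
        field: computed[field] - actual[field]
        for field in sorted(actual)
        if computed[field] != actual[field]
    }


# What actually happened on one (invented) paycheck.
actual = {
    "gross": Decimal("1000.00"),
    "federal_tax": Decimal("100.00"),
    "health_deduction": Decimal("50.00"),
    "net": Decimal("850.00"),
}

# What a buggy engine produced: $5 high on the tax, $5 low on the deduction.
computed = {
    "gross": Decimal("1000.00"),
    "federal_tax": Decimal("105.00"),
    "health_deduction": Decimal("45.00"),
    "net": Decimal("850.00"),
}

total_matches = computed["net"] == actual["net"]
diffs = diff_fields(computed, actual)

print("Total matches:", total_matches)
for field, delta in diffs.items():
    print(f"{field}: off by {delta}")
```

A total-only check passes this paycheck. The field-level check reports two lines to investigate.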

Reading the misses tells you what's wrong

When the numbers didn't match, the pattern of the misses told me the story.

Here's the trick I wish someone had told me earlier. Look at where the differences cluster.

If a lot of paychecks are off by exactly one penny, scattered randomly across many different people, that's almost never a bug. That's just a rounding difference. There are two accepted ways to round these numbers, and the IRS allows both. When the old system rounds one way and the new system rounds the other, you get tiny penny gaps all over the place. Worth understanding, not worth panicking over.

But if the misses pile up on one specific person or one specific check, that's a real bug or a real edge case you just found.
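
One way to see where misses cluster is to split them by size and count them per employee. This is a sketch: the record fields (`employee`, `check_id`, `delta`) and the sample data are invented, and the one-cent cutoff is an assumption about what counts as rounding noise.

```python
from collections import Counter
from decimal import Decimal

PENNY = Decimal("0.01")


def summarize_misses(misses):
    """Group misses by size and by employee to show where they cluster."""
    penny = [m for m in misses if abs(m["delta"]) == PENNY]
    larger = [m for m in misses if abs(m["delta"]) > PENNY]
    return {
        # Many penny misses spread over many people suggests rounding.
        "penny_count": len(penny),
        "penny_employees": len({m["employee"] for m in penny}),
        # Larger misses piling up on one person suggest a bug or an edge case.
        "larger_by_employee": Counter(m["employee"] for m in larger),
    }


misses = [
    {"employee": "E01", "check_id": "C-101", "delta": Decimal("0.01")},
    {"employee": "E02", "check_id": "C-214", "delta": Decimal("-0.01")},
    {"employee": "E03", "check_id": "C-330", "delta": Decimal("0.01")},
    {"employee": "E07", "check_id": "C-412", "delta": Decimal("18.40")},
    {"employee": "E07", "check_id": "C-413", "delta": Decimal("18.40")},
]

summary = summarize_misses(misses)
print(summary["penny_count"], "penny misses across",
      summary["penny_employees"], "employees")
print("Larger misses by employee:", summary["larger_by_employee"].most_common())
```

In this invented data, the penny misses are spread across three different people while the larger misses all land on one employee. That is the pattern worth chasing first.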

Here are the actual results. One state tax matched 100 percent. Every check, every person, exact. Federal income tax matched on 97.6 percent of checks.

I did not stop at 97.6 percent and call it a win. That last 2.4 percent is exactly where the truth hides.

Every miss over ten cents got investigated by name. One turned out to be a terminated employee's final check, which follows a different calculation. Another batch was just the rounding difference I mentioned. Each one explainable. Each one accounted for.

That's the discipline. You don't stop at "close enough." You chase down every remaining difference until each one has a name and the name is acceptable. A difference you can't explain isn't something to ignore. It's a bug you haven't found yet.
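
That rule can be enforced mechanically: keep a record of explanations, and list every miss over the threshold that doesn't have one yet. A sketch, assuming a hypothetical `explanations` dictionary keyed by check ID and the ten-cent threshold mentioned above; the sample data is invented.

```python
from decimal import Decimal


def open_items(misses, explanations, threshold=Decimal("0.10")):
    """Return misses above the threshold that have no recorded explanation."""
    return [
        m for m in misses
        if abs(m["delta"]) > threshold and m["check_id"] not in explanations
    ]


misses = [
    {"check_id": "C-101", "delta": Decimal("0.01")},   # below the threshold
    {"check_id": "C-412", "delta": Decimal("18.40")},  # explained below
    {"check_id": "C-520", "delta": Decimal("-3.25")},  # nobody has explained this
]

# Each explanation is written by a person after investigating the check.
explanations = {
    "C-412": "Final check for a terminated employee; different calculation.",
}

remaining = open_items(misses, explanations)
for item in remaining:
    print(item["check_id"], "is still unexplained, off by", item["delta"])
```

The work is done when `remaining` is empty, not when the match rate looks high.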

One missing piece, and how I worked around it

To re-run an old paycheck, I needed all the starting numbers. Most I had. But one was never saved anywhere I could find: each employee's tax election settings, the form they fill out that decides how much federal tax gets held back.

I had the result. I didn't have the setting that produced it.

So I worked backward. When you know the answer and you know the formula, you can usually figure out the missing piece. It's like knowing a recipe and the finished dish, then deducing how much salt went in.

For most employees, the answer came out clean. Only one setting could have produced their exact withholding.

But I won't pretend it was clean for everyone. For some people, two different settings produced nearly identical results, so I couldn't be sure which was real. Those I flagged for a human to review rather than counting them as confirmed.
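
Working backward can be sketched as a search: try every candidate setting, keep the ones that reproduce the recorded withholding, and only call it confirmed when exactly one survives. Everything specific here is an assumption for illustration. `toy_withholding` is a made-up formula, not the real federal method, and the "allowances" setting, the candidate range, and the one-cent tolerance are all hypothetical.

```python
from decimal import Decimal

CENT = Decimal("0.01")


def toy_withholding(gross, allowances):
    # Hypothetical formula, NOT the real IRS calculation:
    # each allowance shields $150 of pay, the rest is taxed at 10%.
    taxable = max(Decimal("0"), gross - Decimal("150") * allowances)
    return (taxable * Decimal("0.10")).quantize(CENT)


def infer_setting(gross, actual_withheld, candidates=range(0, 6), tolerance=CENT):
    """Find which candidate settings reproduce the recorded withholding."""
    matches = [
        a for a in candidates
        if abs(toy_withholding(gross, a) - actual_withheld) <= tolerance
    ]
    if len(matches) == 1:
        return ("confirmed", matches)
    if matches:
        # More than one setting fits: flag for a human, don't guess.
        return ("needs_review", matches)
    return ("no_match", matches)


clean = infer_setting(Decimal("1000.00"), Decimal("70.00"))
ambiguous = infer_setting(Decimal("400.00"), Decimal("0.00"))

print(clean)      # only one setting fits
print(ambiguous)  # several settings all produce zero withholding
```

In the toy formula, a low earner with zero withholding could have any of several settings, so the search returns all of them and marks the result for review instead of picking one.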

You don't get to dress up a guess as a fact. An uncertain input means an uncertain result, and the honest thing is to say so.

What this means if you're buying software

This goes way beyond payroll. It applies to anything where software computes a number you rely on. Pricing. Billing. Commissions. Your accounting.

The dangerous state is software that looks like it works. A clean total is the most seductive lie in software, because it passes the eyeball test. The owner glances at the bottom line, it matches, everyone moves on. Meanwhile two errors are quietly hiding inside it.

The real standard isn't "the numbers are close." Close is worthless. The real standard is this: I can explain, down to the cent, every place the system disagrees with reality, and every explanation is acceptable.

What you should get out of this isn't a technical report buried somewhere. It's a plain-language document you or your accountant can read line by line, with an explanation for every difference. If nobody can produce that, you don't actually know your software is right. You just haven't been caught yet.

I'll be straight about AI's role here. AI helped me build both the engine and the testing system fast. But the decisions were all human judgment: which differences were acceptable, which worked-backward numbers to trust, and whether a penny gap was a bug or just rounding.

Anyone can generate software now. Proving it's correct before you bet the business on it is the part almost everyone skips. That's the discipline I bring.
