The Benchmark Saturation Trap

Part 4 of the ArtifactNet research journey, covering April 2026. Right after hitting SOTA, at the moment you think "one more push and it'll be perfect," we ran into the most expensive lesson of this whole series: chase the benchmark score and you destroy real-world performance. This chapter covers the successful codec-aware retrain and the fair benchmark we built, and it also records, unedited, the regression and the illusion that came between them.

TL;DR

V10, retrained on a flood of extra SONICS data, pushed the SONICS benchmark F1 from 0.9937 to 0.99995 — but its detection rate on the newest Suno v4, the thing that matters most in production, collapsed from 99.8% to 4.3%. A textbook case of benchmark saturation.
To trust any model comparison you must control the whole pipeline. Three hidden bugs (HPSS / MLP verdict / UNet mismatches) were quietly distorting every number we read.
We built a fair-comparison benchmark framework (ArtifactBench), and it immediately exposed a hidden weakness in the model we'd been calling "SOTA," v9.3_hn: codec fragility. Re-encoding the same song to a different codec flipped the verdict (codec-pair Δ mean 0.90).
We fixed this with codec4 UNet (4-way codec-augmented retraining: original plus MP3/AAC/Opus), producing v9.4/v9.5. On MP3-encoded real music, FMA hard-negative FPR dropped 98.7% → 8.0% and MoM real FPR dropped 98.3% → 0.5%.
Under identical conditions we ran ArtifactNet (4.2M) vs CLAM (194M) vs SpecTTTra (18.7M) across the full SONICS set, and found that the bigger the model, the more it mistook real music for AI (CLAM real-music FPR 67%).

1. Starting point: greed right after hitting SOTA

As covered through Part 3, ArtifactNet is a forensic framework for detecting AI-generated music (Suno, Udio, Stable Audio, MusicGen, Riffusion, etc.), and it's a small model: 4.2M parameters total. The core pipeline:

audio → STFT → ArtifactUNet (residual extraction) → HPSS (harmonic/percussive split)
     → 7-channel forensic features → ResidualCNN → per-segment P(AI) → per-song median verdict
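
The last stage of the pipeline, per-segment P(AI) reduced to a per-song median verdict, can be sketched as follows. This is a minimal illustration rather than the production code: the function name `song_verdict` and the 0.5 decision threshold are assumptions, since the post only states that the median of the segment scores is used.

```python
from statistics import median

def song_verdict(segment_probs, threshold=0.5):
    """Reduce per-segment P(AI) scores to one per-song verdict.

    The median tolerates a few outlier segments (for example a noisy
    intro) that would pull a mean upward. The 0.5 threshold is an
    assumption, not a value from the post.
    """
    if not segment_probs:
        raise ValueError("need at least one segment score")
    p_song = median(segment_probs)
    return p_song, ("AI" if p_song >= threshold else "real")

# Two outlier segments do not flip a mostly-real track.
print(song_verdict([0.02, 0.05, 0.01, 0.97, 0.03, 0.99, 0.04]))
```

With a mean instead of a median, the same seven scores would average about 0.30, so the two outliers would move the song score far more than they do here.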

At the start of April, the real-world SOTA was v9.3_hn: SONICS benchmark F1 0.9937 and, the number we were proudest of, 871 of 873 tracks (99.8%) detected from the newest, just-released Suno v4. Catching a "wild, latest-generation" generator that isn't in SONICS was this model's true value.

But one thing on the SONICS leaderboard nagged at us. F1 0.9937 was SOTA, yet there seemed to be room to squeeze out a few more decimal places: a few percent of tracks were still slipping through on the hard algorithms (chirp-v2-xxl, udio-120s). "What if more SONICS training data fills that gap?" That simple greed was the beginning of V10.

2. V10: the benchmark went perfect, and production fell apart

V10 was retrained with a large injection of SONICS data. Judged on the leaderboard alone, it was a triumph.

Metric	v9.3_hn	V10
SONICS F1	0.9937	0.99995
SONICS Fake TPR	98.60%	100%
SONICS Real FPR	1.23%	0.04%

As far as SONICS was concerned, V10 was nearly perfect. Real false positives dropped from 1.23% to 0.04% and fake detection hit 100%. Had we stopped here, we would have shipped V10 to production.

Then we validated outside SONICS, and the picture inverted.

AI source	v9.3_hn	V10
Newest Suno v4 (873 tracks)	99.8%	4.3%
SONICS udio-120s	(in-benchmark)	99.5%
MoM diffrythm / udio / yue	100%	99–100%

Detection of the newest Suno v4 crashed from 99.8% to 4.3%. V10 still caught the generators present in SONICS (chirp, udio), but it missed almost the entire population of the actual latest generators in production that SONICS doesn't contain. It caught only 38 of 873 tracks.

This is benchmark saturation. When a model overfits to a particular benchmark's distribution, that benchmark score saturates against a ceiling while generalization outside the distribution collapses. V10 had memorized SONICS's chirp/udio signatures in fine detail, and in exchange it lost the ability to detect "AI music in general." SONICS F1 0.99995 wasn't skill — it was the score of memorization.

The lesson we wrote in our notes that day is short:

Don't look at benchmark F1 alone. Without real-world validation on fresh data, your skill dies while your score climbs.

V10 was scrapped, and v9.3_hn was reconfirmed as the real-world SOTA. Generalizing broadly from less data beats overfitting one benchmark with a mountain of data — that's the first pillar of this chapter.
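
One way to turn that lesson into a habit is a release gate that refuses any candidate whose detection on fresh, out-of-benchmark sources regresses, no matter what the benchmark F1 says. The sketch below is hypothetical: `release_gate`, the source key, and the 2-point tolerance are assumptions, while the hit counts are the Suno v4 figures quoted above.

```python
def release_gate(baseline, candidate, max_ood_drop=0.02):
    """List held-out sources where the candidate's TPR regressed too far.

    `baseline` and `candidate` map a held-out source name to (hits, total).
    `max_ood_drop` (2 points) is an assumed tolerance, not from the post.
    """
    regressed = []
    for source, (base_hits, base_total) in baseline.items():
        cand_hits, cand_total = candidate[source]
        drop = base_hits / base_total - cand_hits / cand_total
        if drop > max_ood_drop:
            regressed.append(source)
    return regressed

# Fresh Suno v4 hold-out: v9.3_hn caught 871 of 873, V10 caught 38 of 873.
v93 = {"suno_v4_fresh": (871, 873)}
v10 = {"suno_v4_fresh": (38, 873)}

blocked = release_gate(v93, v10)
print("blocked by:", blocked)
```

An empty list means the candidate may proceed to the next check. A gate like this would have stopped V10 before anyone looked at its SONICS score.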

3. A side quest: the "comparison itself" was lying

Digging into the V10 affair surfaced an even more uncomfortable fact: the pipeline we were using for comparison had bugs in it, so no number could be taken at face value. In mid-April we caught three critical mismatches.

HPSS mismatch (critical): a GPU-accelerated HPSS (average-pool approximation) was producing a different feature distribution than the median-filter HPSS used in training. Fed a distribution it had never seen, the CNN's TPR dropped by a full 50 percentage points. We forced median-filter HPSS in every environment.
MLP verdict mismatch: an old verdict layer was misreading new CNN features, flipping songs the CNN scored as real (prob 0) into AI (prob 0.73). FPR spiked to 52%. We switched to using the CNN's median verdict directly.
UNet / normalization mismatch: one UNet checkpoint was corrupted and called every real track AI, and the mel-normalization version differed between Mac and RTX, so on one side every track came out at probability 0.
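
To see why the first mismatch matters, here is a toy 1-D comparison of a median filter and an average-pool filter on one frequency bin. It is not the HPSS code from either environment, only a sketch of the underlying effect; the `smooth` helper, the window width, and the edge handling are assumptions.

```python
from statistics import mean, median

def smooth(row, width, reducer):
    """Slide a window of odd `width` over `row` and apply `reducer`.

    Edges are handled by clamping the window to the row, an assumption
    made for brevity; real HPSS implementations pad differently.
    """
    half = width // 2
    out = []
    for i in range(len(row)):
        window = row[max(0, i - half): i + half + 1]
        out.append(reducer(window))
    return out

# One frequency bin over time: a steady tone with a single percussive hit.
energy = [1.0, 1.0, 1.0, 9.0, 1.0, 1.0, 1.0]

harmonic_median = smooth(energy, 3, median)  # median filter, as in training
harmonic_mean = smooth(energy, 3, mean)      # average-pool approximation

print(harmonic_median)
print(harmonic_mean)
```

The median output stays flat at 1.0, so the hit is rejected from the harmonic estimate. The mean output smears the hit across three frames, which is a different feature distribution from the one a CNN trained on median-filtered input has seen.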

The lesson shares a root with the V10 affair: you can only compare models with the whole pipeline under control. When a single misaligned HPSS wipes out 50 points, debating "which model is better" in that state is meaningless.
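
A cheap guard against this class of bug is to fingerprint every preprocessing choice and refuse to compare runs whose fingerprints differ. The sketch below is an assumption about how such a check could look, not a description of our harness; every key and value in the configs is a hypothetical placeholder.

```python
import hashlib
import json

def pipeline_fingerprint(config):
    """Hash every preprocessing choice that can change the feature distribution."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

def assert_same_pipeline(a, b):
    """Raise if two runs were produced by differently configured pipelines."""
    if pipeline_fingerprint(a) != pipeline_fingerprint(b):
        diff = sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))
        raise RuntimeError(f"pipeline mismatch in: {diff}")

# All keys and values below are hypothetical placeholders.
train_cfg = {"hpss": "median", "verdict": "cnn_median", "mel_norm": "v2", "unet_ckpt": "abc123"}
eval_cfg = dict(train_cfg, hpss="avgpool")  # the kind of silent drift described above

try:
    assert_same_pipeline(train_cfg, eval_cfg)
except RuntimeError as err:
    print(err)
```

Storing the fingerprint next to every reported number makes it obvious when two results were never comparable in the first place.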

This experience drove the next decision: stop comparing by hand each time, and build a controlled, fair benchmark framework.

4. ArtifactBench: building a fair ring

ArtifactBench is a benchmark harness that evaluates multiple detection models on identical sources, identical sanity checks, and identical metrics. Rather than measuring a single F1, it measures all of:

Per-source AI TPR — 22 generator sources (9 AIME, 5 SONICS, 4 MoM, latest CDN suno/udio, etc.)
Per-source real FPR — FMA hard-negatives, MoM real, SONICS real, YouTube hard-negatives, etc.
Codec-pair invariance (Δ) — how much the verdict swings when the same song is re-encoded to a different codec
Sanity FAIL count — flag a source FAIL if it crosses a threshold (e.g. AI TPR < 0.9, real FPR > 0.05)
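
The sanity FAIL rule in the list above can be sketched in a few lines. The two thresholds are the example values quoted in the list; the function name, the `results` layout, and the source keys are assumptions for illustration, and the rates are the v9.4 figures reported later in this post.

```python
AI_TPR_MIN = 0.90    # example threshold from the list above
REAL_FPR_MAX = 0.05  # example threshold from the list above

def sanity_failures(results):
    """Return the sources that cross a sanity threshold.

    `results` maps source name -> (kind, rate), where kind is "ai"
    (rate is TPR) or "real" (rate is FPR). This layout is an assumption.
    """
    failed = []
    for source, (kind, rate) in results.items():
        if kind == "ai" and rate < AI_TPR_MIN:
            failed.append(source)
        elif kind == "real" and rate > REAL_FPR_MAX:
            failed.append(source)
    return failed

# v9.4 rates reported later in this post; the source keys are illustrative.
report = {
    "udio_extra": ("ai", 0.86),
    "fma_hard_negative": ("real", 0.080),
    "mom_real": ("real", 0.005),
}
print(sanity_failures(report))
```

Counting failures per source, rather than averaging everything into one F1, is what keeps a single weak source from hiding behind many strong ones.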

And this framework immediately dragged out the hidden weakness of the model we'd been calling "SOTA," v9.3_hn.

v9.3_hn's codec fragility

Running ArtifactBench on v9.3_hn raised 6 FAILs, and two clusters of them were shocking.

Item	v9.3_hn
FMA hard-negative (MP3 real) FPR	98.7%
MoM real (MP3 real) FPR	98.3%
Codec-pair Δ mean	0.896
Codec-pair Δ max	1.000

Encoding the same real music to MP3 caused almost all of it to be mistaken for AI (98.7%). A codec-pair Δ mean of 0.90 means that feeding the same song as WAV vs MP3 moves the verdict score by 0.90 on average: the codec, not the music, was deciding the verdict.
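
As a rough sketch of how a codec-pair Δ could be computed: score the same song under two encodings and take the absolute difference of the song-level P(AI), then report the mean and max across songs. Defining Δ this way is an assumption, and the scores below are made up for illustration.

```python
from statistics import mean

def codec_pair_deltas(scores):
    """Per-song |P(AI) on encoding A - P(AI) on encoding B|.

    `scores` maps song id -> (p_ai_wav, p_ai_mp3). Using an absolute
    difference of song-level probabilities is an assumption.
    """
    return {song: abs(a - b) for song, (a, b) in scores.items()}

# Made-up scores for three real songs, for illustration only.
scores = {
    "song_a": (0.02, 0.98),
    "song_b": (0.05, 0.95),
    "song_c": (0.10, 0.90),
}
deltas = list(codec_pair_deltas(scores).values())
print(round(mean(deltas), 3), round(max(deltas), 3))
```

A codec-invariant detector would keep both numbers near 0. A mean near 0.9, as in this toy input, means nearly every song changes sides when re-encoded.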

The cause was clear. Because the real music in training was overwhelmingly WAV, the model had learned the spurious correlation "MP3 codec artifact = AI signature." The SONICS benchmark never caught this weakness because SONICS doesn't run a codec-pair test. Only by building a broader, more adversarial benchmark did the weakness SONICS had been hiding finally become visible.

Here's the second pillar: a benchmark must stress the real-world distribution. A SOTA on a benchmark that never touches codecs, fresh generators, or hard-real can crumble on any of the three.

5. codec4 UNet: tackling codec fragility head-on

With the diagnosis clear, so was the fix: retrain the UNet with 4-way codec augmentation — expand the training pairs to original + MP3 + AAC + Opus so the model learns codec artifacts as "normal, non-AI signal." We call this codec-aware UNet the codec4 UNet.
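
A minimal sketch of generating the three lossy variants with ffmpeg is shown below. The encoder names (`libmp3lame`, `aac`, `libopus`) are real ffmpeg encoders, but the bitrates, file naming, and the `codec_variants` helper are assumptions; the post does not state which settings were used. The commands are only built here, not executed.

```python
from pathlib import Path

# Encoder names are real ffmpeg encoders; the bitrates are assumptions.
CODECS = {
    "mp3": ("libmp3lame", "128k", ".mp3"),
    "aac": ("aac", "128k", ".m4a"),
    "opus": ("libopus", "96k", ".opus"),
}

def codec_variants(wav_path, out_dir):
    """Return ffmpeg commands that produce the 3 lossy versions of one track.

    Together with the original file this gives the 4-way set.
    """
    src = Path(wav_path)
    commands = []
    for name, (encoder, bitrate, ext) in CODECS.items():
        dst = Path(out_dir) / f"{src.stem}_{name}{ext}"
        commands.append(
            ["ffmpeg", "-y", "-i", str(src), "-c:a", encoder, "-b:a", bitrate, str(dst)]
        )
    return commands

for cmd in codec_variants("track001.wav", "aug"):
    print(" ".join(cmd))
```

Each command list can be passed to `subprocess.run` as-is. Every variant keeps the label of its source track, so the codec artifact itself carries no information about whether the music is AI-generated.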

A past lesson on file proved decisive here: jointly retraining UNet and CNN together creates a trade-off where MP3 FPR drops but WAV TPR dies along with it. So we retrained only the UNet with codec augmentation, then fine-tuned the CNN separately on top of it — a staged strategy.

The result (v9.4 = R1 fine-tune + codec4 UNet) was dramatic.

Item	v9.3_hn	v9.4 (codec4)
FMA hard-negative FPR	98.7%	8.0%
MoM real FPR	98.3%	0.5%
Codec-pair Δ mean	0.896	0.025
Codec-pair Δ max	1.000	0.698
ArtifactBench FAIL	6	4

The codec-pair Δ mean fell from 0.90 to 0.025 — now the same song yields nearly the same verdict under any codec. MP3 real-music false positives dropped from 98.7% to 8.0%, and MoM real fell to 0.5%.

To be honest, it wasn't a complete fix. The codec-pair max Δ was still 0.698 (one worst-case track still wobbled), and some generators like udio_extra stayed at 86% TPR. But the codec robustness that had been a "coin flip" had clearly risen to a practical level.

We then validated v9.5 (codec4 UNet + cnn_v95, a further CNN reinforcement) on the full SONICS set (~23,000 tracks).

Metric	v9.5
SONICS F1	0.9993
SONICS Real FPR	0.086% (9 / 10,510)
AUC	0.99999
Per-generator TPR	chirp-v3/v3.5/udio-30s 100%, chirp-v2 99.8%, udio-120s 99.7%
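
For readers who want to check the arithmetic, the real-FPR row can be reproduced from its raw counts. The helper names below are hypothetical, and the F1 formula is the standard definition with AI as the positive class. The post does not publish TP and FN counts, so no F1 is recomputed here.

```python
def fpr(false_pos, real_total):
    """False-positive rate: share of real tracks flagged as AI."""
    return false_pos / real_total

def f1_score(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN), with AI as the positive class."""
    return 2 * tp / (2 * tp + fp + fn)

# Real-FPR row of the table: 9 false positives out of 10,510 real tracks.
print(f"{fpr(9, 10_510):.3%}")
```

This prints 0.086%, matching the table.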

The decisive difference from V10: v9.5 scored high on SONICS while also securing codec robustness and latest-generator detection. The score followed from solving a weakness, not from chasing the score.

6. A fair 3-way comparison: bigger models distrust real music more

The real purpose of ArtifactBench wasn't to admire our own model — it was to fight competing models in the same ring. So we ported two baselines into the same environment:

CLAM — Melody-or-Machine lineage, 194M parameters
SpecTTTra (α-120s) — the SONICS paper model, 18.7M parameters

Here are the results of running the full SONICS set (~23,000 tracks; ~12,800 AI / ~10,500 real) under identical conditions.

Model	Params	SONICS F1	Real FPR	AUC
ArtifactNet v9.5	4.2M	0.9993	0.086%	0.99999
SpecTTTra α-120s	18.7M	0.8874	17.97%	0.9303
CLAM	194M	0.7652	67.16%	0.8222

The most shocking number is Real FPR. The 194M-parameter CLAM mistook 67% of SONICS real music for AI. SpecTTTra over-flagged 18% as well. The 4.2M ArtifactNet, by contrast, came in at 0.086% — only 9 wrong out of 10,510.

The interpretation: the giant models are trained to "catch AI well," so they over-suspect real music. In the real world the true cost is the false positive — branding human-made music as AI — and the model paying the highest cost was the largest one. A 4.2M model focused on residual physics (RVQ quantization traces) is stronger in practice than a model 46× its size.
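
To make the cost concrete, the Real FPR column can be converted into expected false accusations on a catalogue of real music. The 100,000-track catalogue size is a hypothetical number chosen for illustration; the rates are the ones in the table above.

```python
REAL_FPR = {  # Real FPR column of the table above
    "ArtifactNet v9.5": 0.00086,
    "SpecTTTra α-120s": 0.1797,
    "CLAM": 0.6716,
}

def expected_false_flags(fpr, n_real_tracks):
    """Expected number of human-made tracks wrongly labeled AI."""
    return round(fpr * n_real_tracks)

# 100,000 real uploads is a hypothetical catalogue size.
for model, rate in REAL_FPR.items():
    print(f"{model}: {expected_false_flags(rate, 100_000):,}")
```

On that hypothetical catalogue the three models would wrongly flag roughly 86, 17,970, and 67,160 human-made tracks respectively.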

(Note: ArtifactNet codec4 was also the most stable on codec-pair invariance. SpecTTTra Δ mean was 0.060 and CLAM 0.126 — both baselines wobbled under codec changes.)

7. Cross-validation: doubting our own measurements with outside research

Winning on your own benchmark is trivial. So we re-measured the same data with an independent external detector — running a Deezer ISMIR-lineage AI-music detection research model across the full SONICS set.

That external model's SONICS results validated our pipeline:

Overall TPR 80.6% and FPR 17.5% at a fixed decision threshold, i.e. SONICS is genuinely hard even for an external SOTA.
The chirp family was detected at nearly 100%.
But udio-30s collapsed to 41.3% TPR — the external model struggled with udio too.

This cross-validation told us two things. First, the udio family isn't our weakness alone — it's a hard class across the detection field (our v9.5 hit udio-30s 100%, far ahead of the external model). Second, the fact that the external model's numbers landed in a reasonable range means our benchmark harness is trustworthy. Sanity-checking your own measurements against an outside implementation is a habit V10 burned into us.

8. Pushing further: v96 and the hard-real wall

Codec was solved, but ArtifactBench was still pointing at one place — hard-real false positives. codec4 had brought FMA hard-negative FPR down to 8%, but not to zero. Real music that "sounds like AI because it's heavily distorted" — lo-fi, amateur uploads, some vintage recordings — remained.

So at the end of April we tried v96, heavily reinforced with hard negatives. But the more hard-real examples we added, the more two known risks resurfaced, exactly as an earlier note in our files had warned: (a) adding hard negatives in bulk to a converged model risks overfitting and oscillation, and (b) suppressing hard-real false positives can drag AI TPR down with it. v96 oscillated during fine-tuning and gave no improvement clear enough to replace codec4 v9.5, so we did not ship it to production.

The hard-real problem did not close in April. And — as readers who've seen the final part of this series know — the true identity of this "weakness" would only get rewritten much later, when we discovered data leakage in our own evaluation set.

9. What April left us

This month is remembered for its lessons more than its wins.

Benchmark saturation is real. V10 pushed SONICS F1 to 0.99995 while newest-Suno detection collapsed to 4.3%. A single benchmark's ceiling is not the ceiling of skill.
A benchmark is not just a tool to read; it is a pipeline you have to control. When the pipeline (HPSS / verdict / normalization) is misaligned, every comparison lies. That's why we built ArtifactBench.
A good benchmark surfaces weaknesses. ArtifactBench's codec checks exposed v9.3_hn's 98.7% false-positive rate on MP3-encoded real music, which led to codec4 UNet.
Let the score follow from solving a weakness. v9.5 didn't chase the score, yet earned SONICS F1 0.9993 and real-music FPR 0.086%. In the same ring, the 194M CLAM had a real-music FPR of 67%.
Validate your own measurements externally. The Deezer ISMIR cross-validation confirmed udio is a field-wide hard class.

In the next part, we go after the unresolved hard-real false positives — and end up realizing that we'd been measuring the problem wrong.

ArtifactNet Research Team · April 2026
