ArtifactNet research · Part 6 · 3/3 · Jun 2026
Negative results, evaluation leakage, and what we measured wrong
A research log from after ArtifactNet v9.5. This is the story of two months in which we set out to tackle the "remaining weaknesses" after hitting SOTA, only to end up questioning our own evaluation methodology. We've kept the failed experiments and the traps we found alongside the wins — because we believe that's the more valuable record in this field.
ArtifactNet is a forensic framework for detecting AI-generated music (Suno, Udio, Stable Audio, Riffusion, MusicGen, and so on). The core pipeline looks like this:
audio → STFT → ArtifactUNet (residual extraction) → HPSS (harmonic/percussive separation)
→ 7-channel forensic features → ResidualCNN → per-segment P(AI) → per-song median verdict
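The last step of the pipeline, turning per-segment P(AI) into a per-song verdict, can be sketched as follows. This is a minimal illustration and not the production code: the function name `song_verdict` and the 0.5 decision threshold are assumptions made for the example.

```python
import numpy as np

def song_verdict(segment_probs, threshold=0.5):
    """Aggregate per-segment P(AI) into one per-song decision via the median.

    `threshold` is a hypothetical operating point, not a value from this post.
    """
    probs = np.asarray(segment_probs, dtype=float)
    if probs.size == 0:
        raise ValueError("need at least one segment")
    score = float(np.median(probs))
    return score, score >= threshold

# One outlier segment (0.97) does not flip the song-level verdict.
score, is_ai = song_verdict([0.02, 0.97, 0.05, 0.10, 0.04])
```

The median is robust to a few outlier segments, which a mean would not be: here a single confident segment leaves the song-level score at the middle value of the five.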
At just 4.2M parameters in total, it outperforms much larger transformer-class baselines on SONICS:
| System | Params | SONICS F1 | FPR |
|---|---|---|---|
| ArtifactNet v9.5 | 4.2M | 0.9993 | 0.086% |
| SpecTTTra (α-120s) | 18.7M | 0.8874 | 17.97% |
| CLAM (MoM) | 194M | 0.7652 | 67.16% |
On top of that, we recorded an ArtifactBench 4-way (four-codec evaluation) F1 of 0.9861 and a MoM 40K full F1 of 0.9832.
The heart of the design is residual physics. AI generators internally pass through neural codecs such as RVQ (Residual Vector Quantization), leaving behind characteristic quantization traces ("RVQ ghosting"). ArtifactUNet isolates only these residual components, and the CNN learns the distributional consistency of the residual. AI music has abnormally high consistency here.
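For readers unfamiliar with RVQ, here is a toy sketch of the idea: each stage quantizes whatever the previous stages left over, and the final leftover is error the decoder never sees. The codebooks are random and the sizes are arbitrary, both assumptions for illustration. Real neural codecs learn their codebooks, and this leftover is only loosely analogous to the residual that ArtifactUNet extracts.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage codes what earlier stages left over."""
    residual = np.asarray(x, dtype=float).copy()
    indices = []
    for codebook in codebooks:                      # codebook shape: (n_codes, dim)
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))                 # nearest codeword to the residual
        indices.append(idx)
        residual = residual - codebook[idx]
    return indices, residual

def rvq_decode(indices, codebooks):
    """Reconstruction is the sum of the chosen codeword from every stage."""
    return sum(cb[i] for i, cb in zip(indices, codebooks))

rng = np.random.default_rng(0)
dim = 8
# Toy random codebooks, progressively finer; real codecs learn these.
codebooks = [rng.normal(scale=0.5 ** s, size=(16, dim)) for s in range(3)]
x = rng.normal(size=dim)

indices, leftover = rvq_encode(x, codebooks)
x_hat = rvq_decode(indices, codebooks)
```

By construction `x` equals `x_hat + leftover`, so the leftover is exactly the quantization error that the reconstruction lacks.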
The two "remaining weaknesses" we perceived at the time were:
The rest of this post is the record of what happened when we set out to attack (1).
To solve the udio weakness, we used an external deep-research run (adversarial cross-validation across 105 agents) to derive three hypotheses, referred to below as P1, P2, and P3 in the order presented, and implemented all of them.
Hypothesis: If we pass genuine recordings through several neural codecs (Encodec, DAC, etc.) to create training pairs, we cover the traces of diverse decoder families and improve udio generalization.
Result: The udio detection rate went from 85% to 85%, no change. We attributed this to essentially zero transfer between decoder families: udio uses an undisclosed proprietary decoder, so no amount of training on Encodec/DAC carried over.
Hypothesis: Borrowing from ISMIR 2025 research showing that the stride structure of transposed convolution leaves deterministic peaks in the spectrum, we use these peaks as an auxiliary feature.
Result: The feature scored 99%+ in-distribution but collapsed to 25–40% on unseen udio. It was weak in exactly the spot where our main model is weak, so it provided no independent, complementary signal.
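The mechanism behind this hypothesis can be shown with a toy example. The sketch below is not the feature we trained and not the cited paper's method: it is a naive 1-D transposed convolution with an arbitrary kernel and stride (both assumptions), showing that uneven kernel overlap imprints a pattern with period `stride`, which appears as spectral peaks at multiples of N/stride.

```python
import numpy as np

def transposed_conv1d(x, kernel, stride):
    """Naive 1-D transposed convolution: each input sample stamps a scaled kernel."""
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    out = np.zeros((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        out[i * stride : i * stride + len(kernel)] += value * kernel
    return out

stride = 4
kernel = [0.2, 0.5, 1.0, 0.5, 0.3]   # length is not a multiple of stride, so overlap is uneven
y = transposed_conv1d(np.ones(64), kernel, stride)   # perfectly flat input

segment = y[64:192]                  # interior only (128 samples), away from edge effects
spectrum = np.abs(np.fft.rfft(segment))
peak_bins = np.flatnonzero(spectrum > 1e-6 * spectrum.max())
# peak_bins -> [0, 32, 64], i.e. DC plus multiples of 128 / stride
```

Even with a constant input, the output carries energy only at DC and at multiples of the stride frequency. That is the kind of deterministic peak the auxiliary feature was meant to pick up.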
Hypothesis: The udio segments the model misses were on average 2.6 dB quieter. In a multiplicative-mask structure, the signal in quiet segments cannot be amplified, so adding a volume-invariant log-ratio channel (mel(residual) − mel(original)) should restore it.
Result: Training validation F1 was 0.9949 and looked all but perfect. Yet on held-out it collapsed to a 67% detection rate, 30% on udio. The channel had memorized the training distribution and failed to generalize.
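Why a log-ratio channel is volume-invariant can be checked numerically. The sketch below makes two assumptions: equal-width frequency bands stand in for a real mel filterbank, and the residual scales linearly with the input gain (which only holds if the extractor is scale-equivariant). It illustrates the invariance argument and is not the channel we trained.

```python
import numpy as np

def band_log_energy(signal, n_bands=8, eps=1e-12):
    """Log energy in equal-width frequency bands (a crude stand-in for log-mel)."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    bands = np.array_split(power, n_bands)
    return np.log(np.array([band.sum() for band in bands]) + eps)

def log_ratio_channel(residual, original):
    """log E(residual) - log E(original): a common gain factor cancels out."""
    return band_log_energy(residual) - band_log_energy(original)

rng = np.random.default_rng(0)
original = rng.normal(size=2048)
residual = 0.1 * rng.normal(size=2048)    # synthetic stand-in for an extracted residual

loud = log_ratio_channel(residual, original)
quiet = log_ratio_channel(0.2 * residual, 0.2 * original)   # same clip, about 14 dB quieter
gain_invariant = bool(np.allclose(loud, quiet, atol=1e-6))
```

A gain g multiplies both band energies by g², so the difference of logs is unchanged. The invariance itself is sound; what failed on held-out data was generalization of the learned channel, as the result above shows.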
All three approaches failed. At this point we stopped and asked: "Is changing the residual representation itself a dead end?"
A negative result is information in its own right. Taken together, the three failures indirectly reveal how ArtifactNet works:
This suspicion carries into the next section.
While evaluating a next-generation CNN candidate, we discovered widespread leakage between the training manifest and the benchmarks.
| Benchmark source | Leakage ratio |
|---|---|
| suno CDN eval set | 193 / 200 |
| udio CDN eval set | 193 / 200 |
| real recordings (YouTube hardneg, etc.) | most of them |
In other words, a large part of that "validation F1 of 0.993" was not generalization but memorization. The songs in the eval set had already appeared in training.
To fix this, we reproduced the training pipeline with the same seed to extract the exact set S of files that actually appeared in training, then subtracted S from the live data to construct a leak-free clean held-out set. In the process, we confirmed a 100% match with the training logs.
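The set subtraction described above amounts to very little code. Here is a minimal sketch, assuming every file has a stable string ID. The function names and the example IDs are hypothetical and not taken from our pipeline.

```python
def build_clean_holdout(live_ids, trained_ids):
    """Return IDs in the live pool that never appeared in training (order preserved)."""
    seen = set(trained_ids)
    return [song_id for song_id in live_ids if song_id not in seen]

def leakage_ratio(eval_ids, trained_ids):
    """Count how many eval items were already seen in training: (leaked, total)."""
    seen = set(trained_ids)
    eval_ids = list(eval_ids)
    leaked = sum(1 for song_id in eval_ids if song_id in seen)
    return leaked, len(eval_ids)

# Hypothetical IDs for illustration only.
trained = {"song_b", "song_d"}
clean = build_clean_holdout(["song_a", "song_b", "song_c", "song_d"], trained)
leaked, total = leakage_ratio(["song_b", "song_c", "song_d"], trained)
```

The hard part is not the subtraction but obtaining `trained_ids` reliably, which is why the seed-exact reproduction of the training run matters.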
Lesson: Leaked evaluation doesn't just inflate strengths. As we'll see later, it also distorts weaknesses.
On the leak-free clean held-out set, we made a fair comparison between the next-generation CNN candidate (fine-tuned on CDN data) and v9.5.
| Model | TPR | FPR | F1 |
|---|---|---|---|
| Next-gen candidate | 99.39% | 1.36% | 0.9931 |
| v9.5 | 99.06% | 9.29% | 0.9695 |
Interestingly, the key gain was not in AI detection rate but in reducing false positives (FPR) on genuine recordings. In particular, accuracy on LP and vintage-style real recordings rose from 78% to 97%. This appears to be the result of strengthening the "genuine recording" representation by adding CDN's MP3/Opus real recordings to training.
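For reference, the three columns in the table relate to confusion counts as sketched below. The counts in the example are made up for illustration and are not our held-out counts. Note that F1 depends on the AI/real class balance, so it cannot be recovered from TPR and FPR alone.

```python
def detection_metrics(tp, fn, fp, tn):
    """TPR, FPR and F1 with 'AI-generated' as the positive class."""
    tpr = tp / (tp + fn)                 # share of AI songs that were caught
    fpr = fp / (fp + tn)                 # share of real songs that were flagged
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)
    return {"tpr": tpr, "fpr": fpr, "f1": f1}

# Illustrative counts only: 100 AI songs and 100 real songs.
m = detection_metrics(tp=90, fn=10, fp=5, tn=95)
```

Because false positives enter F1 through precision, a model can gain F1 purely by flagging fewer genuine recordings, which is the pattern in the comparison above.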
Here we built one more thing — a multi-source hard-real benchmark (9 sources, 3,050 songs: Jamendo, FMA, YouTube, SoundCloud, lo-fi, LP, etc.). It's a stress test that removes single-source bias.
| Model | hard-real FPR |
|---|---|
| Next-gen candidate | 15.57% |
| v9.5 | 32.25% |
Cut in half, but still in the 15% range. SoundCloud amateur uploads, FMA, and lo-fi emerged as the residual weaknesses. To give it away early, this was the real weakness.
On top of this, we introduced a codec-TTA (test-time augmentation) operating point. By converting the input to MP3/AAC/Opus and blending the results, we made the verdict consistent even when the same song differs only in format.
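In outline, codec-TTA looks like the sketch below. Everything here is an assumption for illustration: the blend is a plain mean, the original input is included among the variants, and the transcoders and scorer are stand-in callables so the example runs without real codecs or a real model.

```python
from statistics import fmean

def codec_tta_score(audio, score_fn, transcoders):
    """Score the original plus each transcoded copy, then blend by a simple mean."""
    variants = [audio] + [transcode(audio) for transcode in transcoders.values()]
    return fmean(score_fn(v) for v in variants)

# Stand-ins: each "codec" just rescales the samples, and the "model"
# returns the mean absolute amplitude as a fake P(AI).
fake_transcoders = {
    "mp3": lambda a: [0.8 * x for x in a],
    "aac": lambda a: [1.0 * x for x in a],
    "opus": lambda a: [1.2 * x for x in a],
}

def fake_score(a):
    return sum(abs(x) for x in a) / len(a)

blended = codec_tta_score([0.5, 0.5], fake_score, fake_transcoders)
```

The per-format scores differ (0.4, 0.5, 0.6 against 0.5 for the original), but the blended score is pulled toward a common value. That is the consistency property the operating point is after.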
Shipping the research gains to the production service had traps of its own. The production endpoint had diverged from the main code line: the April image carrying the batch API and the June code carrying the latest detection improvements lived on separate branches, so simply swapping the image tag would have broken the batch API.
The fix was to overlay the June code on top of the April batch handler, preserving the batch API while picking up the latest CNN, codec-TTA, and the loudness-weighted verdict. During deployment verification we also caught and fixed a GPU-only runtime bug (a tensor-handling error that CPU tests had not caught).
As a result, production was updated to the latest detection performance while maintaining batch API compatibility.
Separately from detection, we pursued research that uses the same residual physics in the opposite direction — a tool that removes RVQ ghosting from AI recordings to improve audio quality. (To be clear, the goal is not detection evasion but audio quality improvement. The actual motivation was user feedback that the ghosting in the stem multitracks Suno provides is severe.)
The core model is ComplexArtifactUNet. It reuses the backbone of the detection ArtifactUNet but handles phase as well, with complex (real/imaginary) input-output plus a complex ratio mask.
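What a complex ratio mask buys over a magnitude-only mask can be seen on toy data. The sketch uses the textbook oracle mask (target divided by mixture) on random complex "STFT" values. It says nothing about how ComplexArtifactUNet predicts or bounds its mask, and the function names are hypothetical.

```python
import numpy as np

def apply_complex_mask(mixture_stft, mask):
    """Element-wise complex multiply: rescales magnitude and rotates phase per bin."""
    return mask * mixture_stft

def ideal_complex_mask(target_stft, mixture_stft, eps=1e-8):
    """Oracle mask target / mixture, written with a conjugate for numerical safety."""
    return target_stft * np.conj(mixture_stft) / (np.abs(mixture_stft) ** 2 + eps)

rng = np.random.default_rng(0)
shape = (5, 4)                                    # (freq bins, frames), toy size
clean = rng.normal(size=shape) + 1j * rng.normal(size=shape)
ghost = 0.3 * (rng.normal(size=shape) + 1j * rng.normal(size=shape))
mixture = clean + ghost                           # toy "recording with ghosting"

restored = apply_complex_mask(mixture, ideal_complex_mask(clean, mixture))

# A real-valued magnitude mask fixes levels but leaves the mixture's phase intact.
magnitude_only = (np.abs(clean) / np.abs(mixture)) * mixture
```

With the oracle complex mask, `restored` matches `clean` in both magnitude and phase. `magnitude_only` matches in magnitude alone, which is the limitation that motivates handling phase.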
Two technically interesting points:
Another lesson: audio quality was driven more by DSP post-processing than by the model. The gains from the discriminative neural network plateaued, and the perceptible improvement came once we combined it with proven signal processing such as resonance suppression (above 2.5 kHz) and automatic high-frequency correction (rated "much improved" in listening evaluation).
We return to the suspicion we left hanging in Section 3. If P3's structural fix failed because the diagnosis was wrong, then we had to re-measure the udio "weakness" itself, leak-free.
The problem was the data. The existing udio eval set overlapped entirely with the training pool (leakage). So, using our in-house crawling infrastructure, we freshly collected 454 new udio songs, cross-checked them against the training database by per-song ID, and excluded 113 leaked songs to build a 228-song leak-free held-out set.
| Model | udio fresh held-out detection rate | Misses |
|---|---|---|
| v9.5 | 97.8% | 5 / 228 |
| Next-gen candidate | 95.6% | 10 / 228 |
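The detection rates in the table follow directly from the miss counts. A quick check, with a helper name that is ours for the example:

```python
def detection_rate(misses, total):
    """Percentage of songs detected, given the number of misses."""
    return 100.0 * (total - misses) / total

misses_by_model = {"v9.5": 5, "next-gen": 10}
rates = {name: round(detection_rate(m, 228), 1) for name, m in misses_by_model.items()}
# rates -> {"v9.5": 97.8, "next-gen": 95.6}
```

With only 228 songs, each additional miss moves the rate by roughly 0.44 percentage points, so the gap between the two models is five songs.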
The udio weakness never existed in the first place. On a true leak-free held-out set, v9.5 detects udio well at 97.8%. The previous "85–91%" was a measurement artifact created by a particular leaked sample.
The implication is weighty. The three experiments we wrestled with all through June (P1, P2, P3) were attempts to solve a problem that didn't exist. The alternative hypothesis of "not enough data" was rejected along with it, since just 165 training songs yielded 97.8% on fresh data.
Leaked evaluation distorts both strengths and weaknesses. It inflates strengths (validation F1 of 0.993) and fabricates weaknesses (udio 85%). The most expensive lesson we earned is methodological — before claiming a new model or a weakness, always re-measure on a leak-free held-out set.
Negative results are not to be thrown away. The three failures proved that the model learns the general properties of RVQ rather than decoder fingerprints, and ultimately led us to the more fundamental problem of evaluation leakage.
The next direction is now clear. The real weakness is not udio but false positives on genuine recordings (hard-real FPR of 15.57%). SoundCloud amateur uploads, FMA, lo-fi — cases of human-made music mistaken for AI. The next quarter focuses on intensively collecting this hard-real distribution to reduce false positives.
SOTA wasn't the end; it was the beginning of having the room to look at what we'd been measuring wrong.
ArtifactNet Research Team · June 2026