ArtifactNet research · Part 5 · 2/3 · May 2026
Distillation and runtime — light ≠ fast
Part 5 of the ArtifactNet research journey. This is the record of May 2026, when we tried to move SOTA detection accuracy from 4-second batch inference to streaming real-time processing. We shrank the model and it got slower; we swapped runtimes 28 ways and hit a hardware floor; and we learned the expensive lesson that a "real-time model" is not the same thing as a "detection model." Here is the engineering of forcing an accuracy model into real-time constraints — successes and failures, written plainly.
ArtifactNet's detection pipeline is fundamentally offline.
audio → STFT → ArtifactUNet (residual extraction) → HPSS → 7-channel forensic features
→ ResidualCNN → per-segment P(AI) → per-track median verdict
To judge one track we extract seven 4-second chunks and run them as a single batch. At 0.58s per track that is plenty, and since the user just wants the result, latency was never the issue.
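The last stage of the pipeline (per-segment P(AI) reduced to a per-track median verdict) can be sketched as follows. This is a minimal illustration, not the production code: the function name `track_verdict`, the 0.5 decision threshold, and the seven example scores are all assumptions made for the example.

```python
from statistics import median


def track_verdict(segment_probs, threshold=0.5):
    """Reduce per-segment P(AI) values to one per-track score and verdict.

    The median is used so that a single outlier segment cannot flip the
    track-level decision. The threshold value is an assumption.
    """
    if not segment_probs:
        raise ValueError("need at least one segment probability")
    track_p = median(segment_probs)
    return track_p, track_p >= threshold


# Seven hypothetical per-segment scores for one track (one outlier segment).
probs = [0.91, 0.88, 0.95, 0.12, 0.90, 0.93, 0.89]
track_p, is_ai = track_verdict(probs)
print(f"track P(AI) = {track_p:.2f}, flagged as AI: {is_ai}")
```

With the median, the single low-scoring segment in the example does not change the verdict.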
The problem came from a sibling line of research running in the opposite direction. Using the same residual physics to remove RVQ ghosting from AI audio yields a quality-restoration tool (de-artifact), and we wanted it as an audio plugin (VST3). A plugin must process and return each 256/512-sample block the host hands it on the spot: it cannot see future frames (it must be causal), and it cannot wait 4 seconds.
In other words, the same ArtifactUNet backbone had to live in two worlds:
| | Detection | de-artifact real-time |
|---|---|---|
| Processing unit | 4s chunk batch | tens-of-ms block streaming |
| Future frames | allowed (bidirectional) | forbidden (causal) |
| Latency | irrelevant | within hundreds of ms |
| Output | P(AI) | restored waveform |
May's work was the engineering of bridging that gap. And in the process, the intuition that "a smaller model is faster" was broken three times.
The first design was ArtifactUNetLite: 1.85M params (half of the 3.6M Teacher), a multi-rate architecture handling 44.1/48/96 kHz in a single model, and causal streaming for the plugin.
Several design decisions were nailed down explicitly:
CausalChannelLayerNorm (per-frame LN) only. GroupNorm/BatchNorm leak statistics across the time axis and break causality. A unit test enforced "future influence on past output = 0" (past Δ = 0.00).
residual = m·X, clean = (1−m)·X enforced as a complementary pair.
All 8 unit tests passed: the multi-rate hop stayed consistent at 2.90ms across 44.1/48/96, residual+clean reconstructed the input to 8.94e-08 error, and causality held.
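The two invariants above (no future influence on past output, and residual + clean reconstructing the input) can be checked with a small NumPy sketch. It is an illustration under assumptions, not the project's test suite: `per_frame_layer_norm` is a stand-in for CausalChannelLayerNorm without learnable parameters, and the tensor shapes and mask values are arbitrary.

```python
import numpy as np


def per_frame_layer_norm(x, eps=1e-5):
    """Normalize each time frame using statistics over (C, F) only.

    x has shape (B, C, F, T). No statistic is shared across the T axis,
    so frame t never depends on any other frame.
    """
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
x = rng.standard_normal((1, 4, 16, 10))

# Causality check: perturb only the future frames (t >= 6) ...
x_future = x.copy()
x_future[..., 6:] += rng.standard_normal((1, 4, 16, 4))

# ... and confirm that the outputs for the past frames (t < 6) did not move.
past_delta = np.abs(
    per_frame_layer_norm(x)[..., :6] - per_frame_layer_norm(x_future)[..., :6]
).max()

# Complementary pair: residual = m*X and clean = (1-m)*X must sum back to X.
X = np.abs(rng.standard_normal((1, 2, 16, 10)))
m = rng.uniform(0.0, 1.0, size=X.shape)  # arbitrary toy mask values
residual, clean = m * X, (1.0 - m) * X
recon_error = np.abs(residual + clean - X).max()

print(f"past delta = {past_delta}, reconstruction error = {recon_error:.2e}")
```

A normalization that pooled statistics over the time axis (as GroupNorm or BatchNorm would) would make `past_delta` nonzero in the same test.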
Then the speed measurement was a shock. Lite ran 2.5× slower than the existing detection UNet on CPU — despite halving the parameters.
Digging in, the cause was clear:
First lesson: parameter count is not a proxy for speed. Real-time speed comes from (a) reducing frame count with a large hop, and (b) tapering channels toward the decoder to lower cumulative cost — Lite gave up both.
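A back-of-the-envelope sketch shows why parameter count and compute can move in opposite directions. All numbers here are hypothetical single-layer examples chosen for illustration, not the actual Lite or detection UNet configurations, and multiply-accumulate count is itself only a rough proxy for wall-clock speed.

```python
def conv_params(c_in, c_out, k=3):
    """Weight count of one k x k 2-D convolution (bias ignored)."""
    return c_in * c_out * k * k


def conv_macs_per_second(c_in, c_out, freq_bins, sample_rate, hop, k=3):
    """Multiply-accumulates needed per second of audio for that layer.

    Every output position (freq bin x frame) reuses all the weights once,
    so cost scales with the frame rate sample_rate / hop.
    """
    frames_per_second = sample_rate / hop
    return conv_params(c_in, c_out, k) * freq_bins * frames_per_second


# Hypothetical layer A: wider channels, large hop (few frames per second).
params_a = conv_params(48, 48)
macs_a = conv_macs_per_second(48, 48, freq_bins=1025, sample_rate=44_100, hop=512)

# Hypothetical layer B: narrower channels, small hop (many frames per second).
params_b = conv_params(32, 32)
macs_b = conv_macs_per_second(32, 32, freq_bins=1025, sample_rate=44_100, hop=128)

print(f"params B/A = {params_b / params_a:.2f}")
print(f"MACs   B/A = {macs_b / macs_a:.2f}")
```

In this toy case layer B has under half the weights of layer A but needs more compute per second of audio, because the smaller hop multiplies the number of frames every weight is applied to.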
We did not discard Lite. The single multi-rate model, zero-lookahead causal streaming, and small weight memory (mobile/storage friendly) remained genuine value. But we nailed down that it would never again be proposed as an "RT acceleration" path.
Next was the ArtifactUNetRT series for true real-time removal. Instead of 4-second chunks, it streams short frames:
| Name | T frames | Chunk length | Use |
|---|---|---|---|
| RT8 | 8 | ~93ms | lowest latency |
| RT16 | 16 | ~186ms | recommended low-latency |
| RT24 | 24 | ~278ms | |
| RT48 | 48 | ~557ms | standard mode |
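The chunk lengths in the table follow from frames × hop ÷ sample rate. The sketch below assumes a hop of 512 samples at 44.1 kHz; that pairing is inferred from the table values rather than stated in the text, so treat both defaults as assumptions.

```python
def chunk_ms(frames, hop=512, sample_rate=44_100):
    """Audio duration covered by `frames` STFT frames, in milliseconds.

    hop=512 and sample_rate=44100 are assumed values that happen to
    reproduce the table; they are not confirmed by the article.
    """
    return frames * hop / sample_rate * 1000.0


for name, frames in [("RT8", 8), ("RT16", 16), ("RT24", 24), ("RT48", 48)]:
    print(f"{name}: T={frames:2d} -> {chunk_ms(frames):6.1f} ms")
```

Under these assumptions the four variants come out within a millisecond of the listed ~93, ~186, ~278 and ~557 ms.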
Two key techniques:
CausalCLN is replaced for inference with a standard LayerNorm pattern (numerically identical; detailed in §6).
The RT series uses N_FFT=1024 (117K params), a small spectral input optimized for low latency. It was excellent for VST3 real-time removal.
But here we fell into a trap. "Couldn't we just use this fast RT model in the detection pipeline too?" We extracted its residual and pushed it through the existing detection CNN — and CNN compatibility collapsed to 30%.
The cause was simple and fundamental. The detection CNN was trained on the N_FFT=2048 spectral basis. The RT model uses N_FFT=1024, a different input space entirely. With half the frequency resolution in the residual, the CNN sees a distribution it has never encountered. Two models with different input spaces could not even be compared directly.
Second lesson: a spectral basis changed for real-time is incompatible with downstream models. Reducing N_FFT for speed was a removal-only choice that could not cross over into detection.
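The size of the mismatch is easy to quantify. A one-sided STFT has N_FFT/2 + 1 frequency bins, so the two models do not even share a frequency axis. The sketch below computes bin counts and bin widths and adds a guard that rejects a residual on the wrong basis; `check_compatible` is a hypothetical helper, and the 44.1 kHz sample rate is an assumption.

```python
def stft_bins(n_fft):
    """Number of frequency bins in a one-sided STFT."""
    return n_fft // 2 + 1


def bin_width_hz(n_fft, sample_rate=44_100):
    """Frequency spacing between adjacent STFT bins."""
    return sample_rate / n_fft


def check_compatible(residual_shape, expected_n_fft=2048):
    """Hypothetical guard: refuse residuals built on a different spectral basis.

    residual_shape is (channels, freq_bins, frames).
    """
    expected = stft_bins(expected_n_fft)
    if residual_shape[1] != expected:
        raise ValueError(
            f"residual has {residual_shape[1]} freq bins, "
            f"downstream CNN expects {expected}"
        )


for n_fft in (2048, 1024):
    print(f"N_FFT={n_fft}: {stft_bins(n_fft)} bins, "
          f"{bin_width_hz(n_fft):.2f} Hz per bin")

check_compatible((7, 1025, 345))      # N_FFT=2048 basis: accepted
try:
    check_compatible((7, 513, 345))   # N_FFT=1024 basis: rejected
    rejected = False
except ValueError:
    rejected = True
```

A guard like this turns a silent accuracy collapse into an explicit error at the interface between the two models.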
The third model became the bridge between the two worlds. ArtifactUNetFast is a 534K model, 6.7× smaller than the Teacher (3.6M), that preserves N_FFT=2048. With the same spectral basis it drops straight into the detection pipeline, while also supporting causal streaming.
| Item | Spec |
|---|---|
| Parameters | 534K (6.7× smaller than Teacher) |
| Architecture | CausalDSConv + CausalCLN, 4-level U-Net, base_channels=24 |
| Input | (B, 2, 1025, T) — cat([H_mag, P_mag]), single forward (Teacher does two) |
| Output | (B, 2, 1025, T) — [mask_H, mask_P] ∈ [0, 0.5]² |
| Training | knowledge distillation from frozen Teacher |
| Loss | 1.0 × MSE(mask) + 0.1 × spectral_convergence |
| Data | AIME + Jamendo + lo-fi hiphop (12K files, 60K chunks, SR pool 44.1/48k) |
The core of the distillation is imitating the Teacher's mask while bundling the input into 2 channels so H/P are processed in a single forward (the Teacher runs H and P twice). After converging to best loss 0.06645, we compared against the Teacher on 10 tracks:
| Metric | Value |
|---|---|
| Mask Pearson | H=0.84, P=0.86 (strong correlation) |
| Mask MAE | H=0.080, P=0.073 |
| Mask distribution | Student 0.28 vs Teacher 0.31 |
| Spectral Convergence | 0.38 (residual character preserved) |
The mask is somewhat coarser than the Teacher's but the distribution and character are preserved — residual extraction works correctly.
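The loss from the spec table and the comparison metrics can be sketched in NumPy. This is an illustration under assumptions, not the training code: spectral convergence is taken in its common form ‖target − pred‖ / ‖target‖, it is applied here to the masked magnitudes (the article does not say which tensors it compares), and the random tensors stand in for real spectrograms.

```python
import numpy as np


def spectral_convergence(pred, target, eps=1e-8):
    """Relative Frobenius-style error between two magnitude tensors."""
    return np.linalg.norm(target - pred) / (np.linalg.norm(target) + eps)


def distill_loss(student_mask, teacher_mask, student_spec, teacher_spec):
    """1.0 x MSE(mask) + 0.1 x spectral_convergence, as in the spec table."""
    mse = np.mean((student_mask - teacher_mask) ** 2)
    return 1.0 * mse + 0.1 * spectral_convergence(student_spec, teacher_spec)


def mask_metrics(student_mask, teacher_mask):
    """Pearson correlation, MAE and mean mask value for student vs teacher."""
    s, t = student_mask.ravel(), teacher_mask.ravel()
    return {
        "pearson": float(np.corrcoef(s, t)[0, 1]),
        "mae": float(np.mean(np.abs(s - t))),
        "student_mean": float(s.mean()),
        "teacher_mean": float(t.mean()),
    }


rng = np.random.default_rng(0)
X = np.abs(rng.standard_normal((1, 2, 1025, 16)))      # toy [H_mag, P_mag]
teacher_mask = rng.uniform(0.0, 0.5, size=X.shape)     # masks live in [0, 0.5]
noise = 0.05 * rng.standard_normal(X.shape)            # toy student error
student_mask = np.clip(teacher_mask + noise, 0.0, 0.5)

loss = distill_loss(student_mask, teacher_mask, student_mask * X, teacher_mask * X)
metrics = mask_metrics(student_mask, teacher_mask)
print(f"loss = {loss:.4f}", metrics)
```

The printed numbers come from random data and are not comparable to the values reported in the table; only the formulas are meant to carry over.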
The speed was the headline. On PyTorch it's similar to the Teacher, but moving to ONNX Runtime changes everything:
| Model | ORT CPU 4-thread p50 | RTF | Total latency |
|---|---|---|---|
| Teacher (PyTorch, 2× forward) | ~998ms | 0.25 | ~4s |
| Fast (PyTorch, 1× forward) | ~1066ms | 0.27 | ~4s |
| Fast ORT T=16 | 18.5ms | 0.10 | 204ms ✅ |
| Fast ORT T=24 | 27.8ms | 0.10 | 306ms |
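On ORT, per-call latency drops by more than 50× versus PyTorch (1066ms → 18.5ms), though the PyTorch figure covers a 4s chunk and the ORT figure a 186ms chunk; measured per second of audio, the RTF improves from 0.27 to 0.10. At T16, chunk (186ms) + inference (18.5ms) sums to about 204ms total latency, inside the real-time plugin budget.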
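The RTF and total-latency columns follow from two small formulas: RTF is inference time divided by the audio duration of the chunk, and streaming latency is the time to fill the chunk plus the time to process it. The helper names below are illustrative.

```python
def rtf(inference_ms, chunk_ms):
    """Real-time factor: processing time per unit of audio (< 1 is real-time)."""
    return inference_ms / chunk_ms


def total_latency_ms(inference_ms, chunk_ms):
    """Streaming latency: wait for the chunk to fill, then run inference."""
    return chunk_ms + inference_ms


rows = [
    ("Teacher (PyTorch)", 998.0, 4000.0),
    ("Fast (PyTorch)", 1066.0, 4000.0),
    ("Fast ORT T=16", 18.5, 186.0),
    ("Fast ORT T=24", 27.8, 278.0),
]
for name, inference, chunk in rows:
    print(f"{name:18s} RTF={rtf(inference, chunk):.2f} "
          f"latency={total_latency_ms(inference, chunk):.1f} ms")
```

The chunk term dominates the total, which is why shortening T matters more for latency than shaving further milliseconds off inference.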
We exported three artifact forms: raw ONNX (for ORT), a CausalCLN surgery version, and an onnxsim-optimized version. The surgery version relies on a custom op that ORT does not support, so session creation fails there and it is usable only from Rust/tract. This fork leads into the runtime war of the next section.
With the model fixed, the question became which runtime to run it on. To shave de-artifact real-time-removal latency (RT48 stereo), we exhausted nearly every inference engine and setting.
| Stage | RTF (stereo) |
|---|---|
| tract sequential (start) | 0.416 |
| tract parallel L‖R | 0.218 |
| ORT FP32 4‖4 migration | 0.031 |
| ORT config tuning | 0.029 |
| CausalCLN→LayerNorm ONNX surgery | 0.023 |
| theoretical minimum (custom op needed) | ~0.018 |
Simply moving from tract to ONNX Runtime improved RTF from 0.416 to 0.031, roughly a 13× gain (tract's sequential execution vs ORT's MLAS optimization plus L/R session parallelism). Surgery took it to 0.023, about 18× better than the starting point. The absolute CPU floor (Ryzen 5800X, AVX2 FP32) was stereo 11.9ms, RTF 0.022.
| Setting | Latency (p50) | RTF | vs CPU |
|---|---|---|---|
| CPU stereo 4‖4T | 12.2ms | 0.0223 | 1× |
| CUDA seq L→R | 3.9ms | 0.0072 | 3.1× |
| CUDA Graph FP16 | 2.52ms | 0.0046 | 4.8× |
| TRT FP16 seq | 1.71ms | 0.0031 | 7.1× |
| TRT FP16 batch=2 | 1.44ms | 0.0026 | 8.5× |
The GPU (RTX 3060) final floor was TensorRT FP16 batch=2 at 1.44ms (RTF 0.0026). Bundling stereo L+R into the batch dimension of a single forward was always faster than multi-stream (TRT's batch fusion allocates SM resources optimally). Forcing LayerNorm to FP16 (disabling the FP32 fallback) added another 5%.
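The batching idea itself is runtime-independent and can be sketched in NumPy: stack L and R along the batch axis, run one forward pass, then split the result. `toy_model` is a stand-in for the real network, and the (1, 513, 48) per-channel input layout is an assumption; the only property the sketch relies on is that the model treats batch items independently.

```python
import numpy as np


def toy_model(x):
    """Stand-in for the real network: any map that treats batch items independently."""
    return 0.5 * np.tanh(x)


def run_stereo_batched(model, left, right):
    """Process both channels in a single forward pass with batch size 2."""
    batch = np.stack([left, right], axis=0)   # (2, ...) with L and R as batch items
    out = model(batch)
    return out[0], out[1]


def run_stereo_sequential(model, left, right):
    """Reference path: two separate batch-size-1 forward passes (L then R)."""
    return model(left[None])[0], model(right[None])[0]


rng = np.random.default_rng(0)
left = rng.standard_normal((1, 513, 48))    # assumed (channels, freq bins, frames)
right = rng.standard_normal((1, 513, 48))

out_l, out_r = run_stereo_batched(toy_model, left, right)
ref_l, ref_r = run_stereo_sequential(toy_model, left, right)
print("batched output matches sequential:",
      np.array_equal(out_l, ref_l) and np.array_equal(out_r, ref_r))
```

Any speed advantage comes from the runtime fusing the two items into one launch; the sketch only shows that the result is unchanged.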
Paths that looked fast but failed or regressed:
After exhausting all 28 ORT SessionBuilder options, we concluded that 11.5–12.3ms is the absolute minimum for this hardware. Reducing it further would require model retraining (shrinking) or new hardware. This is not a defeat but a boundary established — knowing what's possible lets you decide what to give up above it.
A note on the CausalCLN→LayerNorm surgery that recurred throughout the runtime ladder. It was the smallest yet cleanest optimization.
The training-time CausalChannelLayerNorm is a custom op that computes (C,F) statistics per frame. It guarantees causality, but it does not map cleanly to standard ONNX ops in the inference graph. So we operated on the graph, swapping it for a numerically equivalent standard LayerNorm + transpose pattern.
The key point is that the output is bit-for-bit identical while only the runtime accelerates, because ORT recognizes the standard LayerNorm pattern and runs it as a fused kernel. Re-validated on ORT 1.24.x, surgery gave a measured 3.69ms improvement. In the profile, LayerNorm took 20.1% of the total and the surgery-created Transpose took 16.8%, yet it was still a net win.
An interesting counterexample: implementing the same LayerNorm directly as a Rust custom op actually regressed mono by 1.3ms. For the small T=48 loop, ORT's MLAS fused implementation beat the loop-transposed Rust. "A hand-written kernel is always faster" was also a false intuition.
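The equivalence behind the surgery can be illustrated in NumPy: per-frame statistics over (C, F) on a (B, C, F, T) tensor give the same result as moving T forward, flattening (C, F) into the last axis, applying an ordinary last-axis LayerNorm, and transposing back. This sketch omits learnable scale and bias, and the exact transpose/reshape layout of the real graph rewrite is an assumption.

```python
import numpy as np


def causal_cln(x, eps=1e-5):
    """Reference: per-frame normalization over (C, F) for x of shape (B, C, F, T)."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def layer_norm_last_axis(x, eps=1e-5):
    """Standard LayerNorm over the last axis (no affine parameters)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def surgery_pattern(x, eps=1e-5):
    """Transpose -> standard LayerNorm -> transpose back."""
    b, c, f, t = x.shape
    y = x.transpose(0, 3, 1, 2).reshape(b, t, c * f)   # (B, T, C*F)
    y = layer_norm_last_axis(y, eps)
    return y.reshape(b, t, c, f).transpose(0, 2, 3, 1)  # back to (B, C, F, T)


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 33, 48))
max_diff = np.abs(causal_cln(x) - surgery_pattern(x)).max()
print(f"max |difference| = {max_diff:.2e}")
```

The two extra transposes are the cost visible in the profile; the win comes from the runtime replacing the middle step with its fused kernel.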
While real-time removal forked into RT/Fast, we also checked whether the detection pipeline itself could move to ORT. The answer was to export the Teacher directly to ONNX.
| Item | Result |
|---|---|
| ORT 4-thread mono | 275.8ms (<500ms target met) |
| CNN verdict agreement | 100% (9/9) |
| Numerical error | max 2.56e-06 |
| Parameters | 3.6M |
| ONNX nodes | 76 (simple UNet structure) |
Preserving N_FFT=2048 gave 100% verdict agreement with the detection CNN (numerical error 2.56e-06). This is the decisive difference from the RT model:
| Model | ORT speed | CNN compat | Use |
|---|---|---|---|
| RT (N_FFT=1024, 117K) | 12.3ms stereo | 30% | VST3 only |
| Teacher (N_FFT=2048, 3.6M) | 275.8ms mono | 100% | detection pipeline |
For detection, the answer is the large Teacher, not the small RT model — the criterion was compatibility, not speed.
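The two acceptance checks used for the export (maximum numerical error and verdict agreement) can be sketched as below. The nine probability values, the 0.5 threshold, and the helper names are hypothetical; in the article the numerical error is presumably measured on model outputs rather than on final probabilities.

```python
def max_abs_error(reference, candidate):
    """Largest element-wise absolute difference between two output lists."""
    return max(abs(a - b) for a, b in zip(reference, candidate))


def verdict_agreement(ref_probs, test_probs, threshold=0.5):
    """Fraction of tracks on which both pipelines reach the same verdict."""
    same = sum(
        (a >= threshold) == (b >= threshold)
        for a, b in zip(ref_probs, test_probs)
    )
    return same / len(ref_probs)


# Hypothetical per-track P(AI) from the PyTorch reference and the ONNX export.
ref = [0.98, 0.02, 0.91, 0.07, 0.99, 0.65, 0.12, 0.88, 0.03]
onnx = [p + 2e-6 for p in ref]   # toy export noise

err = max_abs_error(ref, onnx)
agreement = verdict_agreement(ref, onnx)
print(f"max error = {err:.2e}, agreement = {agreement:.0%}")
```

Checking verdict agreement as well as numerical error matters because a small error can still flip a track that sits near the threshold.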
This is the heart of the article. Throughout May there was a persistent temptation to use the fast models directly in the detection pipeline. It failed three times, each for a different reason.
(1) ArtifactUNetRT — spectral basis mismatch. N_FFT=1024, so CNN compatibility was 30%. (§3)
(2) ArtifactUNetLite — parameter inversion. 1.85M, yet 2.5× slower than the detection UNet on CPU. (§2)
(3) ArtifactUNetFast — a two-layer trap.
Fast preserves N_FFT=2048, so it has none of RT's compatibility problem. That raised the natural question: "What if we retrain the detection CNN on Fast?" A residual comparison revealed two things:
| UNet path | TPR (AI) | FPR (Real) |
|---|---|---|
| codec4 (current production) | 98.0% | 0.0% |
| phase2 (Fast's teacher) | 100.0% | 55.0% |
| Fast (drop-in) | 98.0% | 52.5% |
The correct order was clear — to detect with Fast, you must first re-distill from a codec-aware teacher (codec4) and then stack the CNN on top. And if real-time streaming is not the goal, detection needs no causal constraint, so a bidirectional lightweight student wins on both accuracy and efficiency.
Third and biggest lesson: real-time constraints (causal, short hop, small N_FFT, 2-channel fusion) collide head-on with detection accuracy and speed. The ambition to use one backbone for two worlds extracted a different cost every time. The conclusion was to split roles: removal uses Fast/RT, detection uses Teacher/codec4.
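For reference, TPR and FPR as used in the comparison table are computed per class from track-level verdicts. The evaluation set sizes below (50 AI tracks, 40 real tracks) are hypothetical values chosen only because they reproduce the 98.0% and 52.5% figures; the article does not state the actual counts.

```python
def tpr_fpr(labels, predictions):
    """True-positive and false-positive rates for binary verdicts.

    labels and predictions are sequences of booleans (True = AI).
    """
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    tn = sum(1 for y, p in zip(labels, predictions) if not y and not p)
    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical set: 50 AI tracks (49 flagged) and 40 real tracks (21 flagged).
labels = [True] * 50 + [False] * 40
predictions = [True] * 49 + [False] * 1 + [True] * 21 + [False] * 19

tpr, fpr = tpr_fpr(labels, predictions)
print(f"TPR = {tpr:.1%}, FPR = {fpr:.1%}")
```

The pattern in the table is visible in the two rates taken together: a path can keep TPR near 100% while misclassifying about half of the real tracks.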
Two checks rounded out the month.
Adversarial evasion test. The purpose of de-artifact is audio-quality improvement, not detection evasion. Still, as due diligence, we self-checked whether our suppression tool could be abused for evasion — passing the original and outputs at suppression strengths α=0.5/1/2/4 through the detection pipeline and measuring the change in P(AI). (Detection robustness is covered in a later part.)
Lightweight SOTA comparison. Around the same time, on a small subset (SONICS fake 150 + real 150), we matched up against an external baseline (a MERT-based 2-stage model, 174M params). The 4.2M ArtifactNet led with TPR 100%, FPR 10.7%, F1 0.949, beating the 174M model (F1 0.929), and was 4.5× faster at 0.58s/track vs 2.62s/track. Given the subset scale, the point is not the absolute numbers but the edge despite a 40× parameter gap — a reconfirmation of this series' premise that a small model is the starting point for going real-time.
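The reported F1 can be cross-checked from the stated rates and subset sizes, along with the speed and parameter ratios. The helper name is illustrative; the inputs are the figures quoted above.

```python
def f1_from_rates(tpr, fpr, n_pos, n_neg):
    """F1 score implied by TPR/FPR on a set with n_pos fakes and n_neg reals."""
    tp = tpr * n_pos
    fp = fpr * n_neg
    fn = n_pos - tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


f1 = f1_from_rates(tpr=1.00, fpr=0.107, n_pos=150, n_neg=150)
speed_ratio = 2.62 / 0.58   # seconds per track, baseline vs ArtifactNet
param_ratio = 174 / 4.2     # millions of parameters

print(f"F1 = {f1:.3f}, speed ratio = {speed_ratio:.1f}x, "
      f"param ratio = {param_ratio:.0f}x")
```

The rates, the F1, and the quoted ratios are mutually consistent on a balanced 150 + 150 subset.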
Parameter count is not speed. Two inversions — Lite (2.5×) and Fast (3.44×) — pointed at the same truth: real-time speed comes from hop size, channel taper, activation tensor size, and kernel launch count. Before shrinking a model, profile what is actually slow.
Real-time optimization breaks downstream compatibility. N_FFT reduction (RT) and causal/2-channel fusion (Fast) all collided with the downstream detection CNN. To reuse one backbone for multiple purposes, you must first decide which axes to share and which to fork. We shared the spectral basis (N_FFT=2048) and forked causality and channel fusion.
Runtime exhaustion is not defeat but boundary-setting. Running tract→ORT→TVM→OpenVINO→TensorRT to pin the hardware floor (CPU 11.9ms, GPU 1.44ms) let us reasonably decide, above that floor, what to give up and what to retrain.
In the next part, we take this refined SOTA out to attack its remaining weaknesses — and end up doubting our own evaluation methodology instead.
ArtifactNet Research Team · May 2026