Toward Real-Time · Intrect Blog

Part 5 of the ArtifactNet research journey. This is the record of May 2026, when we tried to move SOTA detection accuracy from 4-second batch inference to streaming real-time processing. We shrank the model and it got slower; we swapped runtimes 28 ways and hit a hardware floor; and we learned the expensive lesson that a "real-time model" is not the same thing as a "detection model." Here is the engineering of forcing an accuracy model into real-time constraints — successes and failures, written plainly.

TL;DR

Detection SOTA (4.2M params, SONICS F1 0.9937) is an offline pipeline that batches seven 4-second chunks per track. To turn the audio-quality-restoration (de-artifact) tool into a real-time plugin, the same residual network had to run in streaming mode.
The first attempt, ArtifactUNetLite (1.85M), halved the parameters but ran 2.5× slower on CPU. First lesson: parameter count is not a proxy for measured speed.
The ArtifactUNetRT series (117K, N_FFT=1024, RT8–RT48, causal) fit VST3 real-time removal, but its compatibility with the detection CNN collapsed to 30% — the spectral basis was different.
ArtifactUNetFast (534K, 6.7× smaller than Teacher, N_FFT=2048 preserved) was the answer. Teacher distillation gave mask Pearson 0.84/0.86, a 50× speedup on ONNX Runtime, and 204ms total latency at T16 — a drop-in for the de-artifact pipeline.
We exhausted runtimes from tract → ORT → TVM → OpenVINO → TensorRT. The CausalCLN→LayerNorm ONNX surgery (numerically identical, faster) took CPU RTF from 0.416 to 0.023 (18×), and GPU TRT FP16 batch=2 reached the 1.44ms (RTF 0.0026) hardware floor.
The biggest lesson: these real-time models are not for detection. Plugged into the detection pipeline as-is they either slow down (parameter inversion) or their FPR collapses. We document exactly why.

1. The Starting Point: Why "Real-Time" Was Needed

ArtifactNet's detection pipeline is fundamentally offline.

audio → STFT → ArtifactUNet (residual extraction) → HPSS → 7-channel forensic features
     → ResidualCNN → per-segment P(AI) → per-track median verdict

To judge one track we extract seven 4-second chunks and run them as a single batch. At 0.58s per track that is plenty, and since the user just wants the result, latency was never the issue.
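
The per-track aggregation at the end of that pipeline can be sketched in a few lines. This is a minimal illustration, not the production code: the evenly spaced chunk placement, the 0.5 decision threshold, and the function names `chunk_starts` and `track_verdict` are all assumptions made for the example.

```python
from statistics import median


def chunk_starts(track_seconds, n_chunks=7, chunk_seconds=4.0):
    """Start times (s) of evenly spaced chunks; the spacing rule is assumed."""
    span = track_seconds - chunk_seconds
    if span <= 0 or n_chunks == 1:
        return [0.0]  # short track: fall back to a single chunk at the start
    return [span * i / (n_chunks - 1) for i in range(n_chunks)]


def track_verdict(segment_probs, threshold=0.5):
    """Median of per-segment P(AI), compared against an assumed threshold."""
    p_track = median(segment_probs)
    return p_track, p_track >= threshold
```

The median is what makes the verdict robust to one or two outlier segments: a couple of low-probability chunks in an otherwise confident track do not flip the result.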

The problem came from a twin line of research running in the opposite direction. Using the same residual physics to remove RVQ ghosting from AI audio yields a quality-restoration tool (de-artifact), and we wanted that as an audio plugin (VST3). A plugin must process each 256/512-sample block the host hands it and return the result immediately: it cannot see future frames (it must be causal), and it cannot wait 4 seconds.

In other words, the same ArtifactUNet backbone had to live in two worlds:

	Detection	de-artifact real-time
Processing unit	4s chunk batch	tens-of-ms block streaming
Future frames	allowed (bidirectional)	forbidden (causal)
Latency	irrelevant	within hundreds of ms
Output	P(AI)	restored waveform

May's work was the engineering of bridging that gap. And in the process, the intuition that "a smaller model is faster" was broken three times.

2. First Attempt — ArtifactUNetLite, Lightweight ≠ Fast

The first design was ArtifactUNetLite: 1.85M params (half of the 3.6M Teacher), a multi-rate architecture handling 44.1/48/96 kHz in a single model, and causal streaming for the plugin.

Several design decisions were nailed down explicitly:

Normalization is CausalChannelLayerNorm (per-frame LN) only. GroupNorm/BatchNorm leak statistics across the time axis and break causality. A unit test enforced "future influence on past output = 0" (past Δ = 0.00).
No time-axis stride-2 downsampling. Receptive field grows via dilation only. Shrinking the time axis misaligns streaming block boundaries.
Frequency-axis only, 4-stage stride-2. A dual head emits clean (suppression) and residual (forensic) simultaneously, with residual = m·X, clean = (1−m)·X enforced as a complementary pair.

All 8 unit tests passed — the multi-rate hop stayed consistent at 2.90ms across 44.1/48/96, residual+clean reconstructed the input to 8.94e-08 error, and causality held.
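
Two of those checks are easy to reproduce in miniature. The sketch below is a simplified stand-in, not the project's test suite: it drops the frequency axis and normalizes over channels only, the names `per_frame_norm`, `global_time_norm`, `past_delta` and `split_by_mask` are hypothetical, and the "leaky" variant simply pools statistics over the whole time axis to show what the causality test is guarding against.

```python
import math


def per_frame_norm(x, eps=1e-5):
    """Normalize each time frame on its own. x[c][t] -> y[c][t]."""
    C, T = len(x), len(x[0])
    y = [[0.0] * T for _ in range(C)]
    for t in range(T):
        col = [x[c][t] for c in range(C)]
        mean = sum(col) / C
        var = sum((v - mean) ** 2 for v in col) / C
        for c in range(C):
            y[c][t] = (x[c][t] - mean) / math.sqrt(var + eps)
    return y


def global_time_norm(x, eps=1e-5):
    """Normalize with statistics pooled over all frames (leaks the future)."""
    flat = [v for row in x for v in row]
    mean = sum(flat) / len(flat)
    var = sum((v - mean) ** 2 for v in flat) / len(flat)
    return [[(v - mean) / math.sqrt(var + eps) for v in row] for row in x]


def past_delta(norm_fn, x, split):
    """Max change in outputs before `split` after perturbing frames >= split."""
    x2 = [row[:] for row in x]
    for row in x2:
        for t in range(split, len(row)):
            row[t] += 10.0
    a, b = norm_fn(x), norm_fn(x2)
    return max(abs(a[c][t] - b[c][t]) for c in range(len(x)) for t in range(split))


def split_by_mask(x, m):
    """Complementary pair: residual = m*X, clean = (1-m)*X."""
    residual = [[mi * xi for mi, xi in zip(mr, xr)] for mr, xr in zip(m, x)]
    clean = [[(1.0 - mi) * xi for mi, xi in zip(mr, xr)] for mr, xr in zip(m, x)]
    return residual, clean
```

Calling `past_delta(per_frame_norm, x, split)` returns exactly zero for any input, while the pooled variant returns a positive value as soon as a future frame changes. That is the property the "past Δ = 0.00" test pins down.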

Then came the speed measurement, and it was a shock. Lite ran 2.5× slower than the existing detection UNet on CPU, despite having half the parameters.

Digging in, the cause was clear:

hop=128 means roughly 4× more frames per chunk (T=200 vs 48). Frame count, not parameters, dominated compute.
Every stage's activation tensor stayed near-constant at C×F×T ≈ 1.7M cells. The existing detection UNet halves the tensor per stage (798k→98k), keeping cumulative cost low; Lite forbade time-axis downsampling, so a large T persisted all the way through.
Parameter reduction saves weight memory, not FLOPs. Even splitting T into 50–100 and calling multiple times yielded only a 24% improvement (1085ms→824ms).

First lesson: parameter count is not a proxy for speed. Real-time speed comes from (a) reducing frame count with a large hop, and (b) tapering channels toward the decoder to lower cumulative cost — Lite gave up both.
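
A back-of-the-envelope sketch makes the mechanism concrete. Everything numeric here is illustrative rather than taken from the real models: the starting channel count, the rule that channels double while frequency halves at each stage, and the hop of 512 used for comparison are assumptions, and `frames_per_chunk` and `stage_cells` are hypothetical helper names.

```python
def frames_per_chunk(chunk_seconds, sample_rate, hop):
    """Approximate STFT frame count for one chunk (edge padding ignored)."""
    return int(chunk_seconds * sample_rate) // hop


def stage_cells(channels, freq_bins, frames, stages, halve_time):
    """Activation cells (C*F*T) at the input of each encoder stage.

    Assumes channels double and frequency halves per stage; both are
    illustrative choices, not the real models' configuration.
    """
    cells = []
    for _ in range(stages):
        cells.append(channels * freq_bins * frames)
        channels, freq_bins = channels * 2, freq_bins // 2
        if halve_time:
            frames = max(1, frames // 2)
    return cells
```

With frequency-only downsampling, the channel doubling cancels the frequency halving and every stage carries the same number of cells. Once the time axis is halved as well, each stage carries half the cells of the previous one, so the cumulative cost is dominated by the first stage alone.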

We did not discard Lite. The single multi-rate model, zero-lookahead causal streaming, and small weight memory (mobile/storage friendly) remained genuine value. But we nailed down that it would never again be proposed as an "RT acceleration" path.

3. ArtifactUNetRT Series — Short-Frame Causal, but a Different Language for Detection

Next was the ArtifactUNetRT series for true real-time removal. Instead of 4-second chunks, it streams short frames:

Name	T frames	Chunk length	Use
RT8	8	~93ms	lowest latency
RT16	16	~186ms	recommended low-latency
RT24	24	~278ms
RT48	48	~557ms	standard mode

Two key techniques:

prepad causal: input is padded only on the past side so the current block is processed seeing only past frames. No lookahead means streaming is possible.
layernorm surgery: the training-time CausalCLN is replaced for inference with a standard LayerNorm pattern (numerically identical; detailed in §6).
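
The prepad idea can be shown with a one-dimensional convolution along time. This is a minimal sketch under simplifying assumptions (a single channel, plain Python lists, hypothetical function names `causal_conv1d` and `stream_blocks`); the real model applies the same padding rule inside 2-D convolutions.

```python
def causal_conv1d(x, kernel, dilation=1):
    """Convolve along time with zero padding on the past side only."""
    K = len(kernel)
    pad = (K - 1) * dilation
    padded = [0.0] * pad + list(x)
    # Output t reads padded[t .. t + pad], i.e. inputs x[t - pad .. t].
    return [
        sum(kernel[k] * padded[t + k * dilation] for k in range(K))
        for t in range(len(x))
    ]


def stream_blocks(blocks, kernel, dilation=1):
    """Process blocks one at a time, carrying only past samples as state."""
    K = len(kernel)
    pad = (K - 1) * dilation
    history = [0.0] * pad
    out = []
    for block in blocks:
        buf = history + list(block)
        out.extend(
            sum(kernel[k] * buf[t + k * dilation] for k in range(K))
            for t in range(len(block))
        )
        history = buf[len(buf) - pad:]  # keep the last `pad` samples
    return out
```

Because each output depends only on the current and earlier samples, feeding the signal block by block gives the same result as processing it in one pass, which is exactly what a plugin host requires.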

The RT series uses N_FFT=1024 (117K params) — a small spectral input optimized for low latency. It was excellent for VST3 real-time removal.

But here we fell into a trap. "Couldn't we just use this fast RT model in the detection pipeline too?" We extracted its residual and pushed it through the existing detection CNN, and CNN compatibility (agreement with the reference pipeline's verdicts) collapsed to 30%.

The cause was simple and fundamental. The detection CNN was trained on the N_FFT=2048 spectral basis. The RT model uses N_FFT=1024, a different input space entirely. With half the frequency resolution in the residual, the CNN sees a distribution it has never encountered. Two models with different input spaces could not even be compared directly.

Second lesson: a spectral basis changed for real-time is incompatible with downstream models. Reducing N_FFT for speed was a removal-only choice that could not cross over into detection.
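
The size of the mismatch follows directly from the STFT arithmetic. The sketch below computes the one-sided bin count and bin width for both settings and adds a guard that refuses a residual on the wrong basis. The guard, its name `check_basis`, and the `(B, C, F, T)` shape convention are assumptions for illustration; the 44.1 kHz rate in the usage note is only an example.

```python
def stft_bins(n_fft):
    """Number of one-sided STFT frequency bins."""
    return n_fft // 2 + 1


def bin_width_hz(sample_rate, n_fft):
    """Spacing between adjacent frequency bins in Hz."""
    return sample_rate / n_fft


def check_basis(residual_shape, cnn_n_fft=2048):
    """Raise if a (B, C, F, T) residual does not match the CNN's basis."""
    expected = stft_bins(cnn_n_fft)
    got = residual_shape[2]
    if got != expected:
        raise ValueError(f"residual has {got} bins, CNN expects {expected}")
    return True
```

N_FFT=2048 gives 1025 bins and N_FFT=1024 gives 513, so at 44.1 kHz each bin of the RT model spans about twice the bandwidth of a bin the CNN was trained on. A check like this at the pipeline boundary turns a silent accuracy collapse into an immediate error.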

4. ArtifactUNetFast — Teacher Distillation, the Drop-In Answer

The third model became the bridge between the two worlds. ArtifactUNetFast is a 534K model, 6.7× smaller than the Teacher (3.6M), that preserves N_FFT=2048. With the same spectral basis it drops straight into the detection pipeline, while also supporting causal streaming.

Item	Spec
Parameters	534K (6.7× smaller than Teacher)
Architecture	CausalDSConv + CausalCLN, 4-level U-Net, base_channels=24
Input	`(B, 2, 1025, T)` — cat([H_mag, P_mag]), single forward (Teacher does two)
Output	`(B, 2, 1025, T)` — [mask_H, mask_P] ∈ [0, 0.5]²
Training	knowledge distillation from frozen Teacher
Loss	`1.0 × MSE(mask) + 0.1 × spectral_convergence`
Data	AIME + Jamendo + lo-fi hiphop (12K files, 60K chunks, SR pool 44.1/48k)
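
The loss row in the table can be sketched as follows. The weights 1.0 and 0.1 come from the table; the rest is assumed for illustration: spectral convergence is written in its common form (norm of the difference divided by the norm of the target), it is applied to the masked magnitudes rather than to the masks, and tensors are flattened to plain lists. The function names are hypothetical.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def spectral_convergence(pred, target, eps=1e-8):
    """||target - pred|| / ||target|| on flattened magnitudes."""
    num = math.sqrt(sum((t - p) ** 2 for p, t in zip(pred, target)))
    den = math.sqrt(sum(t ** 2 for t in target))
    return num / (den + eps)


def distill_loss(student_mask, teacher_mask, mix_mag, w_mse=1.0, w_sc=0.1):
    """1.0 * MSE(mask) + 0.1 * spectral convergence of the masked spectra."""
    student_res = [m * x for m, x in zip(student_mask, mix_mag)]
    teacher_res = [m * x for m, x in zip(teacher_mask, mix_mag)]
    return (w_mse * mse(student_mask, teacher_mask)
            + w_sc * spectral_convergence(student_res, teacher_res))
```

The MSE term pulls the student's mask toward the Teacher's value by value, while the spectral-convergence term penalizes errors in proportion to the energy of the Teacher's residual, so loud regions count for more than near-silent ones.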

The core of the distillation is imitating the Teacher's mask while bundling the input into 2 channels so H/P are processed in a single forward (the Teacher runs H and P twice). After converging to best loss 0.06645, we compared against the Teacher on 10 tracks:

Metric	Value
Mask Pearson	H=0.84, P=0.86 (strong correlation)
Mask MAE	H=0.080, P=0.073
Mask distribution	Student 0.28 vs Teacher 0.31
Spectral Convergence	0.38 (residual character preserved)

The mask is somewhat coarser than the Teacher's but the distribution and character are preserved — residual extraction works correctly.
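
For reference, the two mask metrics in the table are straightforward to compute. This is a generic sketch on flattened mask values, with hypothetical function names; how the original evaluation flattened or averaged across the 10 tracks is not stated, so that part is left out.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_a * var_b)


def mae(a, b):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)
```

The two numbers answer different questions: Pearson says whether the student's mask rises and falls where the Teacher's does, and MAE says how far off the values are in absolute terms.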

The speed was the headline. On PyTorch it's similar to the Teacher, but moving to ONNX Runtime changes everything:

Model	CPU p50 latency (ORT rows: 4-thread)	RTF	Total latency
Teacher (PyTorch, 2× forward)	~998ms	0.25	~4s
Fast (PyTorch, 1× forward)	~1066ms	0.27	~4s
Fast ORT T=16	18.5ms	0.10	204ms ✅
Fast ORT T=24	27.8ms	0.10	306ms

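Per-call latency dropped by more than 50× going from PyTorch to ORT (~1066ms → 18.5ms). Note that the ORT rows process T=16/24 chunks rather than 4-second chunks, so the like-for-like comparison is the RTF column: 0.27 → 0.10. At T16, chunk (186ms) + inference (18.5ms) sums to 204ms total latency, which is inside the real-time plugin budget.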
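
The latency and RTF columns follow from simple arithmetic. The sketch below assumes a hop of 512 samples at 44.1 kHz, which is not stated in the post but reproduces the chunk lengths quoted for RT8 through RT48; the function names are hypothetical.

```python
def chunk_ms(t_frames, hop=512, sample_rate=44100):
    """Duration of a T-frame chunk in ms (hop and rate are assumed values)."""
    return 1000.0 * t_frames * hop / sample_rate


def total_latency_ms(t_frames, infer_ms, hop=512, sample_rate=44100):
    """Time to fill one chunk plus time to run inference on it."""
    return chunk_ms(t_frames, hop, sample_rate) + infer_ms


def rtf(t_frames, infer_ms, hop=512, sample_rate=44100):
    """Real-time factor: inference time divided by audio duration."""
    return infer_ms / chunk_ms(t_frames, hop, sample_rate)
```

An RTF below 1.0 only says the model keeps up with the audio. The user-facing delay is the total latency, and at these settings it is dominated by the time spent waiting for the chunk to fill, not by inference.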

We exported three artifact forms — raw ONNX (for ORT runtime), a CausalCLN surgery version (Rust/tract-only custom op), and an onnxsim-optimized version. Note the surgery version fails session creation on ORT (custom op unsupported), so it is Rust/tract-only. This fork leads into the runtime war of the next section.

5. The Runtime War — From tract to TensorRT, Exhaustively

With the model fixed, the question became which runtime to run it on. To shave de-artifact real-time-removal latency (RT48 stereo), we exhausted nearly every inference engine and setting.

CPU ladder

Stage	RTF (stereo)
tract sequential (start)	0.416
tract parallel L‖R	0.218
ORT FP32 4‖4 migration	0.031
ORT config tuning	0.029
CausalCLN→LayerNorm ONNX surgery	0.023
theoretical minimum (custom op needed)	~0.018

Simply moving from tract to ONNX Runtime improved RTF from 0.416 to 0.031, roughly a 13× gain (tract's sequential execution vs ORT's MLAS optimization plus L/R session parallelism). Surgery took it to 0.023, about 18× over the starting point. The absolute CPU floor (Ryzen 5800X, AVX2 FP32) was stereo 11.9ms, RTF 0.022.
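
The L‖R structure itself is small. The sketch below shows only the dispatch pattern, with plain callables standing in for per-channel inference sessions. `run_stereo_parallel` and `fake_session` are hypothetical names, and the sketch assumes the real runtime's run call releases the interpreter lock so that two threads can actually overlap.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stereo_parallel(pool, run_left, run_right, left_block, right_block):
    """Run the L and R channels concurrently, one callable per channel."""
    fut_left = pool.submit(run_left, left_block)
    fut_right = pool.submit(run_right, right_block)
    return fut_left.result(), fut_right.result()


def fake_session(gain):
    """Stand-in for a per-channel inference session (hypothetical)."""
    return lambda block: [gain * v for v in block]
```

In use, a two-worker `ThreadPoolExecutor` is created once and reused for every block, with one session per channel so the two calls never contend for the same internal thread pool.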

GPU ladder

Setting	Latency (p50)	RTF	vs CPU
CPU stereo 4‖4T	12.2ms	0.0223	1×
CUDA seq L→R	3.9ms	0.0072	3.1×
CUDA Graph FP16	2.52ms	0.0046	4.8×
TRT FP16 seq	1.71ms	0.0031	7.1×
TRT FP16 batch=2	1.44ms	0.0026	8.5×

The GPU (RTX 3060) final floor was TensorRT FP16 batch=2 at 1.44ms (RTF 0.0026). Bundling stereo L+R into the batch dimension of a single forward was always faster than multi-stream (TRT's batch fusion allocates SM resources optimally). Forcing LayerNorm to FP16 (disabling the FP32 fallback) added another 5%.

The graveyard of alternatives (honestly)

Paths that looked fast but failed or regressed:

TVM Relay: 1.27× faster than ORT on mono (10.6ms vs 13.5ms). But on stereo, the global thread pool caused L/R contention and regressed 5× to 62ms. ORT can isolate L/R with per-session thread pools; TVM uses a single shared pool, incompatible with the production stereo structure.
OpenVINO EP: mono 10.9ms, similar to ORT, but stereo ran 51.5ms due to serial execution inside the EP.
INT8 quantization: Zen 3 lacks VNNI, so it regressed 6.6× to RTF 0.172. On GPU too, TRT 10 rejected all of ORT's quantization paths (DynamicQuantizeLinear / QDQ bias), and INT8 landed on the same floor as FP16 — meaning at T=16 the bottleneck is kernel dispatch, not precision.
XNNPACK EP: an ARM-optimized library, so on x86 AVX2 it regressed 2.3× to mono 28.4ms.

After exhausting all 28 ORT SessionBuilder options, we concluded that 11.5–12.3ms is the absolute minimum for this hardware. Reducing it further would require model retraining (shrinking) or new hardware. This is not a defeat but a boundary established — knowing what's possible lets you decide what to give up above it.

6. CausalCLN → LayerNorm Surgery — Same Numbers, Faster

A note on the CausalCLN→LayerNorm surgery that recurred throughout the runtime ladder. It was the smallest yet cleanest optimization.

The training-time CausalChannelLayerNorm is a custom op that computes (C,F) statistics per frame. It guarantees causality, but it does not map cleanly to standard ONNX ops in the inference graph. So we operated on the graph, swapping it for a numerically equivalent standard LayerNorm + transpose pattern.

The key point is that the output is bit-for-bit identical while only the runtime accelerates, because ORT recognizes the standard LayerNorm pattern and runs it as a fused kernel. Re-validated on ORT 1.24.x, surgery gave a measured gain of 3.69ms. In the profile, LayerNorm took 20.1% of the total and the surgery-created Transpose took 16.8%, yet it was still a net win.

An interesting counterexample: implementing the same LayerNorm directly as a Rust custom op actually regressed mono by 1.3ms. For the small T=48 loop, ORT's MLAS fused implementation beat the loop-transposed Rust. "A hand-written kernel is always faster" was also a false intuition.
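
The equivalence behind the surgery can be checked on a toy tensor. This is a sketch of the idea, not the actual graph rewrite: it uses nested Python lists indexed `x[c][f][t]`, omits the learnable scale and bias, and the names `causal_cln`, `layer_norm_last_axis` and `surgery_pattern` are hypothetical.

```python
import math


def causal_cln(x, eps=1e-5):
    """Per-frame normalization over (C, F). x[c][f][t] -> same shape."""
    C, F, T = len(x), len(x[0]), len(x[0][0])
    y = [[[0.0] * T for _ in range(F)] for _ in range(C)]
    for t in range(T):
        vals = [x[c][f][t] for c in range(C) for f in range(F)]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        inv = 1.0 / math.sqrt(var + eps)
        for c in range(C):
            for f in range(F):
                y[c][f][t] = (x[c][f][t] - mean) * inv
    return y


def layer_norm_last_axis(rows, eps=1e-5):
    """Standard LayerNorm over the last axis of a 2-D list."""
    out = []
    for row in rows:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        inv = 1.0 / math.sqrt(var + eps)
        out.append([(v - mean) * inv for v in row])
    return out


def surgery_pattern(x, eps=1e-5):
    """Transpose (C, F, T) -> (T, C*F), LayerNorm, transpose back."""
    C, F, T = len(x), len(x[0]), len(x[0][0])
    rows = [[x[c][f][t] for c in range(C) for f in range(F)] for t in range(T)]
    normed = layer_norm_last_axis(rows, eps)
    return [[[normed[t][c * F + f] for t in range(T)] for f in range(F)]
            for c in range(C)]
```

Both paths compute the same statistics over the same values for each frame; the transposes only move the time axis out of the way so that a stock LayerNorm over the last axis sees one frame per row.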

7. The Detection Pipeline's Answer — Teacher ONNX

While real-time removal forked into RT/Fast, we also checked whether the detection pipeline itself could move to ORT. The answer was to export the Teacher directly to ONNX.

Item	Result
ORT 4-thread mono	275.8ms (<500ms target met)
CNN verdict agreement	100% (9/9)
Numerical error	max 2.56e-06
Parameters	3.6M
ONNX nodes	76 (simple UNet structure)

Preserving N_FFT=2048 gave 100% verdict agreement with the detection CNN (numerical error 2.56e-06). This is the decisive difference from the RT model:

Model	ORT speed	CNN compat	Use
RT (N_FFT=1024, 117K)	12.3ms stereo	30%	VST3 only
Teacher (N_FFT=2048, 3.6M)	275.8ms mono	100%	detection pipeline

For detection, the answer is the large Teacher, not the small RT model — the criterion was compatibility, not speed.
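
The two compatibility numbers used in this section can be computed as below. This is a generic sketch with hypothetical function names; the 0.5 threshold and the choice to compare per-track P(AI) values are assumptions, since the post does not spell out how agreement was scored.

```python
def max_abs_error(ref, new):
    """Largest element-wise absolute difference between two outputs."""
    return max(abs(a - b) for a, b in zip(ref, new))


def verdict_agreement(p_ref, p_new, threshold=0.5):
    """Fraction of tracks where both pipelines land on the same verdict."""
    same = sum((a >= threshold) == (b >= threshold) for a, b in zip(p_ref, p_new))
    return same / len(p_ref)
```

Numerical error and verdict agreement measure different things: an export can drift slightly in value and still agree on every verdict, while a model on a different spectral basis can disagree on most verdicts even though each run is numerically stable.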

8. The Most Expensive Lesson — A "Real-Time Model" Is Not a "Detection Model"

This is the heart of the article. Throughout May there was a persistent temptation to use the fast models directly in the detection pipeline. It failed three times, each for a different reason.

(1) ArtifactUNetRT — spectral basis mismatch. N_FFT=1024, so CNN compatibility was 30%. (§3)

(2) ArtifactUNetLite — parameter inversion. 1.85M, yet 2.5× slower than the detection UNet on CPU. (§2)

(3) ArtifactUNetFast — a two-layer trap.

Fast preserves N_FFT=2048, so it has none of RT's compatibility problem. That raised the natural question: "What if we retrain the detection CNN on Fast?" A residual comparison revealed two things:

Speed inversion (again). Fast (534K, 6.7× smaller) ran 3.44× slower than codec4 (3.6M) in the detection batch (seven 4s chunks, GPU) — 440 vs 128ms. The cause was the causal DSConv's memory-bandwidth bottleneck + CausalCLN's permute/LayerNorm + frequency-only downsampling (keeping a large T) + 2-channel I/O launching many small kernels. The structure optimized for real-time streaming became poison for batch processing.
FPR collapse. Detecting with Fast-extracted residuals broke real-audio FPR to 52–55%. But the culprit was not Fast. Fast's distillation teacher was a codec-naive older model (phase2), and extracting residuals with that same teacher (phase2) also gave 55% FPR. The phase2 lineage had never seen the real-audio hard-negative distribution that the current production UNet (codec4) had learned.

UNet path	TPR (AI)	FPR (Real)
codec4 (current production)	98.0%	0.0%
phase2 (Fast's teacher)	100.0%	55.0%
Fast (drop-in)	98.0%	52.5%

The correct order was clear — to detect with Fast, you must first re-distill from a codec-aware teacher (codec4) and then stack the CNN on top. And if real-time streaming is not the goal, detection needs no causal constraint, so a bidirectional lightweight student wins on both accuracy and efficiency.
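
For readers who want to reproduce this kind of table on their own data, TPR and FPR reduce to two ratios. The sketch uses a hypothetical helper name and assumes per-track binary labels with 1 for AI and 0 for real audio.

```python
def tpr_fpr(labels, predictions):
    """Return (TPR, FPR) for binary labels where 1 = AI and 0 = real."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return tp / n_pos, fp / n_neg
```

Reporting the two rates separately is what exposed the problem here: TPR stayed near 100% on every path, so a single accuracy number would have hidden the fact that half of the real tracks were being flagged.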

Third and biggest lesson: real-time constraints (causal, short hop, small N_FFT, 2-channel fusion) collide head-on with detection accuracy and speed. The ambition to use one backbone for two worlds extracted a different cost every time. The conclusion was to split roles: removal uses Fast/RT, detection uses Teacher/codec4.

9. Side Notes — Adversarial Evasion and Lightweight SOTA Comparison

Two checks rounded out the month.

Adversarial evasion test. The purpose of de-artifact is audio-quality improvement, not detection evasion. Still, as due diligence, we self-checked whether our suppression tool could be abused for evasion — passing the original and outputs at suppression strengths α=0.5/1/2/4 through the detection pipeline and measuring the change in P(AI). (Detection robustness is covered in a later part.)

Lightweight SOTA comparison. Around the same time, on a small subset (SONICS fake 150 + real 150), we matched up against an external baseline (a MERT-based 2-stage model, 174M params). The 4.2M ArtifactNet led with TPR 100%, FPR 10.7%, F1 0.949, beating the 174M model (F1 0.929), and was 4.5× faster at 0.58s/track vs 2.62s/track. Given the subset scale, the point is not the absolute numbers but the edge despite a 40× parameter gap — a reconfirmation of this series' premise that a small model is the starting point for going real-time.
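
The subset figures are internally consistent, which a short calculation confirms. The false-positive count of 16 is an assumption: it is the count that matches the reported 10.7% FPR on 150 real tracks, not a number given in the post. `f1_from_counts` is a hypothetical helper.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score from true-positive, false-positive and false-negative counts."""
    return 2 * tp / (2 * tp + fp + fn)


# 150 fake tracks all detected (TPR 100%), 16 of 150 real tracks flagged (assumed).
subset_f1 = f1_from_counts(tp=150, fp=16, fn=0)
subset_fpr = 16 / 150
speed_ratio = 2.62 / 0.58   # seconds per track, baseline vs ArtifactNet
param_ratio = 174 / 4.2     # millions of parameters, baseline vs ArtifactNet
```

Under that assumption the F1 comes out at the reported 0.949, and the speed and parameter ratios match the quoted 4.5× and roughly 40×.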

10. Lessons and What's Next

Parameter count is not speed. Two inversions — Lite (2.5×) and Fast (3.44×) — pointed at the same truth: real-time speed comes from hop size, channel taper, activation tensor size, and kernel launch count. Before shrinking a model, profile what is actually slow.

Real-time optimization breaks downstream compatibility. N_FFT reduction (RT) and causal/2-channel fusion (Fast) all collided with the downstream detection CNN. To reuse one backbone for multiple purposes, you must first decide which axes to share and which to fork. We shared the spectral basis (N_FFT=2048) and forked causality and channel fusion.

Runtime exhaustion is not defeat but boundary-setting. Running tract→ORT→TVM→OpenVINO→TensorRT to pin the hardware floor (CPU 11.9ms, GPU 1.44ms) let us reasonably decide, above that floor, what to give up and what to retrain.

In the next part, we take this refined SOTA out to attack its remaining weaknesses — and end up doubting our own evaluation methodology instead.

ArtifactNet Research Team · May 2026
