
Why temporal drift is the cheapest signal in video face-recognition defense

Most adversarial ML defenses ask the wrong question. They look at a single frame and ask: does this look manipulated? For video face recognition, that's the wrong unit of analysis. The attacker isn't trying to fool one frame — they're trying to fool a pipeline that runs across dozens of them. And that requirement, temporal consistency, is exactly what betrays them.

This is the core finding from my research into adversarial detection for video-based FR systems. The short version: temporal drift analysis outperforms ensemble disagreement by a wide margin (89.3% vs. 54.7% accuracy), generalizes across attack families the detector was never trained on, and requires no gradient access or assumptions about the attack. It's also conceptually simple enough to explain in a job interview, which matters.

The attack that motivated this

The ReFace framework (Hussain et al., 2022) demonstrated that real-time adversarial attacks on face recognition are possible using Adversarial Transformation Networks — ATNs. Unlike PGD or FGSM, which require multiple backward passes per input, ATNs learn a feed-forward mapping from clean face to adversarial face. One forward pass, imperceptible perturbation, done. They reduced FR accuracy from 82% to 16.4% on commercial APIs like AWS Rekognition and Azure Face.

The threat model is clean: the attacker generates perturbations fast enough to apply them to live video frames without any model access on the defender's side. Traditional image-level defenses — which look for low-level artifacts — don't see this coming because the perturbations are imperceptible by construction. SSIM above 0.94, PSNR above 34 dB. Visually identical to benign frames after 20 epochs of training.
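To make "PSNR above 34 dB" concrete, here's a minimal NumPy sketch of the metric. The function name and the synthetic test frame are mine, for illustration; the figures quoted above come from the actual ATN outputs, not this code.

```python
import numpy as np

def psnr(clean, perturbed, peak=255.0):
    """Peak signal-to-noise ratio (dB) between two frames in [0, 255].

    Higher is closer to identical; ~34+ dB is typically imperceptible.
    """
    mse = np.mean((clean.astype(float) - perturbed.astype(float)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)
```

A uniform perturbation of 4 gray levels on an 8-bit frame, for instance, lands around 36 dB, which is why perturbations at this scale pass visual inspection.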

What temporal drift actually measures

Benign video sequences have a property that's easy to overlook: they're physically continuous. A person's face moves naturally across frames — pose shifts, expression changes, lighting variation — and the embedding vectors produced by a face recognition model track that continuity. The drift between consecutive embeddings is smooth and low-variance.

Adversarial perturbations break this. Even when the ATN is trained to maintain temporal consistency (which ours was, using multi-frame batches), the perturbations introduce irregular spikes in embedding trajectories that don't match natural facial motion. The drift metric captures this directly:

# per-frame drift
d_t = ||e_t - e_{t-1}||_2

# temporal anomaly score
S_temp = α₁·μ_d + α₂·d_max + α₃·σ²_d

# detection rule
y = 1 if S_temp > θ_temp else 0

The weights α are tuned via ROC-based threshold optimization on a validation set. The score combines mean drift, maximum drift, and drift variance — three complementary views of the same signal. A benign sequence sits below threshold with low variance. An adversarial sequence spikes.

Results

  • ATN detection accuracy: 89.3%
  • PGD generalization: 90.0%
  • ROC-AUC: 0.947

The cross-attack generalization numbers are the most interesting part. The detector was trained exclusively on ATN-generated attacks, then tested on PGD and FGSM without retraining:

  • PGD: 90.0% accuracy, 100% recall, AUC 1.000 — PGD creates stronger temporal disruptions than ATN, so the trained detector has an easy time.
  • FGSM: 83.3% accuracy, AUC 0.922 — the hardest case. Single-step perturbations produce lower-magnitude drift, giving the detector less to work with.
  • Mixed injection: at 10% frame injection rate (1 adversarial frame out of 10), accuracy is still 72.3%. At 50% it's back to 89.3%.
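The mixed-injection setup is easy to replicate qualitatively with synthetic embeddings. The sketch below stands in for real FR embeddings and real attacks, so the drift magnitudes are arbitrary, but it shows why detection gets easier as the injection rate rises: more injected frames means more large-drift transitions per sequence.

```python
import numpy as np

def make_benign(T, D, rng):
    """Smoothly drifting synthetic embedding sequence (stand-in for real
    FR embeddings of a naturally moving face)."""
    base = rng.normal(size=D)
    v = rng.normal(size=D)
    v /= np.linalg.norm(v)
    return np.stack([base + 0.02 * t * v for t in range(T)])

def inject(seq, rate, magnitude, rng):
    """Perturb a fraction `rate` of frames, a synthetic stand-in for
    injecting adversarial frames into a benign sequence."""
    out = seq.copy()
    n = max(1, int(round(rate * len(seq))))
    idx = rng.choice(len(seq), size=n, replace=False)
    out[idx] += magnitude * rng.normal(size=(n, seq.shape[1]))
    return out

def mean_drift(seq):
    """Average consecutive-frame embedding drift."""
    return float(np.linalg.norm(np.diff(seq, axis=0), axis=1).mean())
```

Running this, a 50% injection rate produces markedly higher mean drift than 10%, which mirrors the accuracy gap between the two settings.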

The implication is that temporal drift is an attack-agnostic signature — not a feature of any specific attack family, but a consequence of the physical constraint that perturbations have to remain consistent across frames while also shifting embeddings away from their benign values. You can't fully satisfy both requirements simultaneously, and drift is what leaks.

Why ensemble disagreement failed

The ensemble approach — running four FR models (ArcFace, FaceNet, CosFace, SphereFace) and looking for divergence in their embeddings under attack — landed at 54.7% accuracy. Near coin-flip. The reason is straightforward once you look at the correlation structure: these models share training data and similar architectural choices, so adversarial perturbations transfer similarly across all of them. They move coherently rather than diverging. Correlation between embedding shifts was above 0.85.
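The coherence claim is checkable directly: compute each model's embedding shift under attack and compare shift directions pairwise. A minimal sketch, using cosine similarity and synthetic embeddings in place of the four real FR models:

```python
import numpy as np

def shift_correlation(clean_embs, adv_embs):
    """Pairwise cosine similarity between per-model embedding shifts.

    clean_embs / adv_embs: dict mapping model name -> (D,) embedding.
    High similarity across all pairs means the perturbation moves every
    model's embedding in roughly the same direction, so ensemble
    disagreement has nothing to detect.
    """
    names = list(clean_embs)
    shifts = {m: adv_embs[m] - clean_embs[m] for m in names}
    sims = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            u, w = shifts[a], shifts[b]
            sims[(a, b)] = float(u @ w / (np.linalg.norm(u) * np.linalg.norm(w)))
    return sims
```

When the shifts share a dominant common direction, every pairwise similarity sits near 1 regardless of how different the models' clean embeddings are, which is the failure mode described above.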

The lesson: ensemble diversity in the architectural sense doesn't automatically give you diversity in the adversarial sense. If your ensemble members were trained on similar data with similar objectives, a single perturbation that fools one will tend to fool all of them in the same direction.

The classifier fusion (logistic regression over both detectors' outputs) matched the temporal detector's performance almost exactly — 89.3%, AUC 0.947. The ensemble signal added essentially nothing. In a more diverse ensemble, that gap might widen, but it would need architectures trained on genuinely different data and with different objectives.
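The "ensemble adds nothing" result shows up in the fitted weights of a fusion classifier like this. Below is a plain gradient-descent logistic regression, a minimal stand-in for the trained fusion model; the synthetic data makes feature 0 (the temporal score) informative and feature 1 (the ensemble score) noise, so the fit concentrates its weight on the first.

```python
import numpy as np

def fit_logistic(X, y, lr=0.5, steps=2000):
    """Logistic regression over detector scores via batch gradient descent.

    X: (N, 2) matrix of [temporal_score, ensemble_score] per sequence.
    y: (N,) binary labels (1 = adversarial). Returns (weights, bias).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        g = p - y                                # gradient of log-loss
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b
```

When one input carries all the signal, the fused model's accuracy matches that input's alone, which is exactly the 89.3% / 0.947 outcome reported above.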

What I'd do differently

A few open threads worth pursuing:

  • Adaptive attacks: an adversary who knows the detector is watching for drift can try to smooth their perturbations across frames. Our ATN was trained for imperceptibility, not drift suppression. A detector-aware attacker is a harder problem.
  • Short sequences: below 3–4 frames, there isn't enough temporal context to make a reliable call. For real-time surveillance that might only capture a few frames, you need something complementary.
  • Physical attacks: glasses, masks, adversarial patches. These create different drift signatures than digital perturbations, and I don't have data on whether the same thresholds hold.
  • LSTM integration: replacing the simple drift metric with a learned temporal model (LSTM over the embedding sequence) could capture more subtle drift patterns that the hand-crafted score misses.

But the core finding holds: for digital adversarial attacks on video FR, temporal coherence is the cheapest signal you can buy. It's gradient-free, model-agnostic, and it generalizes. That's a combination that's hard to beat as a first-line defense.
