Eighty-three percent of phishing emails are now AI-generated. That number, from a 2024 industry report, is alarming not just for its size but for what it implies: the attacker's biggest bottleneck — writing a convincing email that sounds like your CFO — has effectively been removed. GPT-4o-class models can impersonate writing style well enough to fool people who know the sender.
The defenses haven't kept up. SPF, DKIM, DMARC — header-based authentication confirms that a message was sent from the domain it claims, not that the person it appears to be from actually wrote it. URL reputation filters miss spear-phish that use clean domains. Keyword-based classifiers are blind to stylistically correct text that happens to ask for a wire transfer.
Our research explored a different angle: what if you could flag an email because it doesn't sound like the person it claims to be from? Stylometry — the statistical analysis of writing style — is the underlying tool. The specific implementation is what we called a DriftScore: a per-sender measure of how far an incoming email deviates from that sender's historical writing baseline.
The dataset and setup
We used the Enron Corporate Email corpus — real organizational emails from 1999–2003, which gives us genuine corporate communication patterns with enough per-sender volume to build meaningful baselines. After preprocessing and sender normalization, we retained 11,106 legitimate emails and generated 1,128 synthetic anomalous emails using GPT-4o-mini, prompted to produce phishing-style content while preserving each sender's writing profile.
The synthetic generation pipeline was intentionally "defanged" — real phishing content wasn't produced, just structurally anomalous emails that preserved sender style at the macro level while introducing suspicious cues (urgency, financial action language, unusual requests). The goal was to simulate what a sophisticated AI phishing tool would actually produce: something that mostly sounds like you, but is slightly off.
What we measured
Each email was converted into a stylometric feature vector covering:
- Global writing attributes: word count, sentence length, type-token ratio, average word length, character count
- Behavioral indicators: punctuation density, modal verb counts, urgency cues, financial action cues, imperative verb counts, greeting/sign-off presence
- Sender-relative deltas: difference in word count vs sender average, deviation in sentence length, punctuation density delta, vocabulary overlap with sender's high-frequency words
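A minimal sketch of what extracting the per-email portion of this feature vector might look like. The cue lexicons and the exact feature set here are illustrative assumptions, not the paper's actual lists:

```python
import re
import string

# Hypothetical cue lexicons -- the paper's exact word lists are not published.
URGENCY_CUES = {"urgent", "immediately", "asap", "now", "deadline"}
FINANCIAL_CUES = {"wire", "transfer", "payment", "invoice", "account"}

def stylometric_features(text: str) -> dict:
    """Illustrative subset of a per-email stylometric feature vector."""
    words = re.findall(r"[a-zA-Z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    return {
        "word_count": n_words,
        "char_count": len(text),
        "avg_sentence_len": n_words / max(len(sentences), 1),
        "avg_word_len": sum(len(w) for w in words) / max(n_words, 1),
        "type_token_ratio": len(set(words)) / max(n_words, 1),
        "punct_density": sum(c in string.punctuation for c in text) / max(len(text), 1),
        "urgency_cues": sum(w in URGENCY_CUES for w in words),
        "financial_cues": sum(w in FINANCIAL_CUES for w in words),
    }
```

The sender-relative deltas would then be computed by differencing each email's vector against the per-sender averages of these same features.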
The sender-relative features are the core insight. A short email isn't suspicious in isolation — but a short email from someone who always writes long emails is. The DriftScore operationalizes this:
A higher score means the email is more stylistically unusual relative to that specific sender's history. The threshold is tuned per-sender using a validation set, so the system automatically calibrates to individual writing variability — a high-variance writer gets a more relaxed threshold than someone who writes very consistently.
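One plausible way to realize this — and it is an assumption, since the paper's exact formula isn't reproduced here — is a mean absolute z-score across features, with the per-sender threshold set as a quantile of drift scores on held-out legitimate mail:

```python
import statistics

def drift_score(email_feats: dict, baseline: list[dict]) -> float:
    """Hypothetical DriftScore: mean absolute z-score of each feature
    against the sender's historical mean and stdev. Captures the
    sender-relative idea; the paper's actual formula may differ."""
    zs = []
    for name, value in email_feats.items():
        history = [f[name] for f in baseline]
        mu = statistics.mean(history)
        sigma = statistics.pstdev(history) or 1.0  # guard constant features
        zs.append(abs(value - mu) / sigma)
    return sum(zs) / len(zs)

def sender_threshold(validation_scores: list[float], q: float = 0.95) -> float:
    """Per-sender threshold: a quantile of drift scores on held-out
    legitimate emails, so erratic writers get a looser threshold."""
    s = sorted(validation_scores)
    return s[min(int(q * len(s)), len(s) - 1)]
```

Under this formulation a sender whose legitimate emails already produce spread-out drift scores gets a higher quantile cutoff automatically, which is the calibration behavior described above.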
What experiment 1 showed
The supervised classification experiment — where both legitimate and synthetic anomalous emails are labeled and available at training time — demonstrated that the feature space carries real discriminatory signal, with strong classification metrics across the board.
Those results represent an upper bound, though — the easy case where you know what anomalous emails look like during training. Experiment 2 is the realistic case.
What experiment 2 actually showed
Sender-specific anomaly detection — building a baseline from each sender's legitimate emails only, then scoring incoming mail against it — is genuinely hard. Isolation Forest achieved precision 0.24, recall 0.74, F1 0.36. One-Class SVM was similar. The DriftScore alone as a threshold produced nearly identical numbers.
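The sender-specific setup can be sketched with scikit-learn. The data here is a synthetic Gaussian stand-in for one sender's feature vectors (the real ones would come from the stylometric extractor), so the sketch runs standalone; it is not the paper's actual pipeline or hyperparameters:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Stand-in feature matrix for one sender: rows = emails, cols = features.
legit = rng.normal(loc=0.0, scale=1.0, size=(200, 6))
incoming = np.vstack([
    rng.normal(0.0, 1.0, size=(5, 6)),   # in-style emails
    rng.normal(4.0, 1.0, size=(5, 6)),   # strongly drifted emails
])

# Fit on the sender's legitimate history only -- no anomalous labels
# at training time, which is exactly what makes this problem hard.
model = IsolationForest(random_state=0).fit(legit)
preds = model.predict(incoming)  # +1 = inlier, -1 = flagged as anomalous
```

Note that the clean separation here is an artifact of the toy data: AI-generated phish tuned to the sender's style would sit much closer to the legitimate cluster, which is why the real-world precision lands near 0.24.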
This isn't a failure of the approach — it's an honest characterization of the problem. AI-generated phishing that is specifically designed to match a sender's writing style will, by construction, have a low drift score. The signal exists, but it's not strong enough to be a standalone classifier in the adversarial case.
Where it gets more useful: the two-stage warning system. An email only triggers a warning when it exceeds the sender-specific DriftScore threshold and contains at least one supporting suspicious cue (urgency language, financial action phrasing, unusual imperative verbs, low content-word overlap). This reduces false positives without collapsing recall entirely.
In strict mode — requiring both anomaly model and DriftScore to fire, plus one supporting cue — precision rose to 0.28 with recall at 0.49. That's not high enough to replace existing defenses, but it's a meaningful secondary signal, especially for high-value targets like executives where even a low-precision warning is worth surfacing.
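The two modes of the warning system reduce to a small decision rule. The function shape below is illustrative — the paper describes the logic, not this exact interface:

```python
def warn(drift: float, threshold: float, cue_count: int,
         model_flagged: bool, strict: bool = False) -> bool:
    """Two-stage warning rule: DriftScore past the per-sender threshold
    plus at least one supporting cue; strict mode also requires the
    anomaly model to fire. Interface is a sketch, not the paper's API."""
    drift_fired = drift > threshold
    has_cue = cue_count >= 1
    if strict:
        # Strict mode: anomaly model AND DriftScore AND a supporting cue.
        return model_flagged and drift_fired and has_cue
    # Default mode: DriftScore threshold plus at least one supporting cue.
    return drift_fired and has_cue
```

Requiring the conjunction is what trades recall (0.74 down to 0.49) for precision (0.24 up to 0.28): each extra condition prunes false positives but also drops some true drifted emails.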
The honest takeaway
Stylometry as a standalone phishing detector isn't there yet, at least not against attackers who know they need to match the target sender's style. But as a complementary behavioral layer — one more signal in a defense-in-depth stack — it has real value. The DriftScore is particularly useful because it gets harder to evade as the legitimate sender's corpus grows. The attacker needs to match not just average style, but the full statistical distribution of that sender's writing habits across hundreds of emails. That's a nontrivial constraint.
Future work worth doing: richer stylometric representations (syntactic parse features, discourse structure), larger real-world phishing corpora for validation, and per-sender threshold tuning at scale. The current proof-of-concept is limited by dataset size and by the fact that our synthetic anomalies were generated by a model being asked to slightly deviate — a more adversarial generation approach would be a harder test.
Read the full paper (PDF).