On the first of April, an email from Hiroshi Yamakawa at the University of Tokyo landed in a safety group I follow. He was announcing something called the AI Immune System Detection Challenge. The setup caught my attention right away.

The task was not the usual machine versus human text detection. Both sides were machines. Every message was AI generated. The data consisted of short exchanges between AI agents, and some of those exchanges had harmful intent quietly woven into otherwise normal conversation. My job was to tell the harmful samples from the safe ones.

This framing matters. Detecting that a paragraph was written by an AI is one thing. Detecting that an AI agent is smuggling dangerous intent past a human reader is another thing entirely. The organizers describe the work as an empirical test of what they call Layer 1 Edge Sensors, the outermost defense layer in an AI Immune System architecture. As multi agent societies scale, the speed of agent to agent exchange will outrun any human moderator. The only workable monitoring is peer monitoring, with agents flagging each other in real time.

So this was not a stylometry exercise. The technique I built would feed into something they call BEAD, Behavioral Embedding Anomaly Detection. That is a real safety pipeline. I decided to go in seriously.

As of this writing I am sitting at rank five with a score of 0.96422 F1. Fourth place is 0.96447. The gap is 0.00025, which is about one or two flipped predictions out of two thousand one hundred. Submission window closes on the 31st of May. I am still in it. What follows is what I have learned so far.

The first trap

My first instinct was to reach for the best classifier I had, which was a fine tuned ensemble of six DeBERTa v3 models. I threw it at the problem and got a score of 0.96028. Not bad for a first try. I assumed I just needed to stack a few clever features on top of it and I would climb the board.

So I picked the most recent methods I could find in the machine generated text detection literature. DivEye from September 2025. Lastde from October 2024. SpecDetect from August 2025. Binoculars from 2024. These are serious papers. They use time series analysis of surprisal sequences, spectral decomposition, and cross model perplexity ratios. I spent three days implementing them from scratch on my GPT two and Phi two log probability caches.

They all lost to a three line Python function that counts surprising tokens.

Here is what the function does. For every token in a text, GPT two small gives it a log probability. Some tokens will have log probability below negative five, meaning GPT two thought they were very unlikely. Those are the surprising tokens. If you count how many of those appear in a sample, you get one feature. Harmful samples have fewer of them than non harmful samples, because harmful samples ride on the kind of bland conversational scaffolding that language models assign high probability to. The count alone gives an AUC of 0.94 on my data.
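Here is a sketch of that function, assuming GPT two small through the Hugging Face transformers API. The cutoff of negative five and the names are illustrative, not my exact pipeline; once the per token log probabilities are cached, the actual count really is only a few lines.

```python
# A sketch of the counting feature, assuming GPT-2 small via the Hugging Face
# transformers library. The -5.0 cutoff mirrors the description above; the
# function name and plumbing are illustrative, not my exact pipeline.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def count_surprising_tokens(text: str, cutoff: float = -5.0) -> int:
    """Count tokens whose log probability under GPT-2 small falls below cutoff."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Log probability of each actual token given its preceding context.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
    return int((token_log_probs < cutoff).sum())
```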

For comparison, DivEye’s best feature gave 0.70. Lastde gave 0.82. SpecDetect gave 0.85. Binoculars gave 0.64. Not one of the methods came close to three lines of counting.

I stared at the screen for a while. How could a simple count beat published 2025 research? The answer, once I thought about it, became clear. Every one of those methods was built and tested on long text, typically essays of a hundred words or more. At twenty four tokens you cannot measure rhythm. You cannot compute second order differences of surprisal with any stability, because there are only twenty of them. You cannot do spectral analysis because your signal is too short to have meaningful frequency content. Simple counting survives because it only needs a few events to fire. And in a twenty four token exchange, harm often shows up as a handful of bland filler words that could not possibly be flagged by any rule based system.

I stacked the count feature on my DeBERTa ensemble. The leaderboard moved from 0.96028 to 0.96245. First lesson confirmed. I wrote it down and moved on.

The correlation wall

If counting simple things works, I thought, then why not stack ten such signals? I tried every probe I could think of. I extracted hidden states from GPT two small at all twelve layers and ran a linear classifier on each representation. Best was layer eight, AUC 0.947. I did the same for RoBERTa base. Best layer eight, AUC 0.957. I did it for Phi two, a twenty seven times larger model. Best layer sixteen, AUC 0.963.
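The probing recipe, in sketch form, is below. Mean pooling over tokens, a plain logistic regression, and five folds are assumptions made for illustration rather than the only sensible choices.

```python
# Sketch of a per-layer linear probe on GPT-2 small hidden states.
# Mean pooling and plain logistic regression are illustrative assumptions.
import numpy as np
import torch
from transformers import GPT2Model, GPT2TokenizerFast
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

def layer_features(texts, layer: int) -> np.ndarray:
    """Mean-pooled hidden state at one layer, one row per text.
    Layer 0 is the embedding output; layers 1 through 12 are the blocks."""
    rows = []
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            hidden = model(ids, output_hidden_states=True).hidden_states[layer]
        rows.append(hidden[0].mean(dim=0).numpy())
    return np.vstack(rows)

def probe_auc(texts, labels, layer: int) -> float:
    """Out-of-fold AUC of a linear probe on one layer's representation."""
    X = layer_features(texts, layer)
    oof = cross_val_predict(LogisticRegression(max_iter=1000), X, labels,
                            cv=5, method="predict_proba")[:, 1]
    return roc_auc_score(labels, oof)
```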

These are enormous single feature AUCs. By any normal standard they should all help a stacking model. They did not. Not one of them passed the cross validation test I used to gate submissions.

Here is why. I computed the correlation between each probe’s output and my running best meta model. RoBERTa layer eight, correlation 0.78. GPT two layer eight, correlation 0.78. Phi two layer eight, correlation 0.83.
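The check behind those numbers is nothing fancier than a Pearson correlation between a probe's out of fold predictions and the base model's.

```python
import numpy as np

def probe_correlation(probe_oof: np.ndarray, base_oof: np.ndarray) -> float:
    """Pearson correlation between a candidate probe and the running base model."""
    return float(np.corrcoef(probe_oof, base_oof)[0, 1])
```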

A pattern was forming. Bigger models gave higher AUC but also higher correlation with my base. They were not seeing anything new. They were seeing the same class separating direction my DeBERTa ensemble had already learned during fine tuning, just expressing it more cleanly.

I spent another three days trying to walk around this wall. I probed at every layer of every model. I concatenated multiple layers. I loaded pretrained sparse autoencoders from the Joseph Bloom GPT two SAE collection and probed 24,576 interpretable features individually. I decomposed Phi two’s parallel attention and MLP branches separately, since those run independently inside each decoder layer. Every single experiment hit the same wall. Every competent encoder converges on the same subspace. My ensemble already lived there.

At this point I had to stop chasing encoder outputs.

The detective work

Before giving up on them entirely, I started looking at the encoders through a mathematician's lens. I wanted to know how much new information they actually carry, given what I already have.

This question has a proper answer. It is the conditional mutual information of the encoder given my base. To estimate it on RoBERTa layer eight, I reduced the seven hundred and sixty eight dimensional hidden state to its top ten principal components, then measured the log loss drop when those ten components were added on top of my base predictions.
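In sketch form, the estimate looks like the code below. Feeding the base model in as a single logit column is my assumption about how to condition on it, and for a cleaner number the PCA should really be refit inside each fold; sklearn's log loss uses the natural log, so the difference comes out in nats.

```python
# Sketch of the conditional information estimate: top-10 principal components
# of the layer-8 hidden state, added on top of the base model's out-of-fold
# probabilities, with the gain measured as a drop in log loss (in nats).
import numpy as np
from scipy.special import logit
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import log_loss

def cmi_estimate_nats(hidden: np.ndarray, base_oof: np.ndarray, y: np.ndarray) -> float:
    """Approximate I(encoder ; label | base) as an out-of-fold log loss drop."""
    base_logit = logit(np.clip(base_oof, 1e-6, 1 - 1e-6)).reshape(-1, 1)
    pcs = PCA(n_components=10).fit_transform(hidden)

    def oof_log_loss(X: np.ndarray) -> float:
        p = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                              cv=5, method="predict_proba")[:, 1]
        return log_loss(y, p)  # natural log, so the unit is nats

    return oof_log_loss(base_logit) - oof_log_loss(np.hstack([base_logit, pcs]))
```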

The number came out to 0.0017 nats.

That is a very small number. Converting it to F1 impact through the cross entropy loss caps any possible gain at about plus 0.002 on the leaderboard. Every single stack I had been running, and would ever run on these features, was fighting for a 0.002 point window. And I would still have to find exactly the right predictions to flip.

Around the same time I did a geometric analysis. The direction my classifier was using in RoBERTa layer eight space made an angle of 84.5 degrees with the top principal component. That is almost perpendicular. The class signal lives in low variance directions, nearly orthogonal to what the encoder spends most of its representational energy on. This fits the intuition that harm hides in quiet places. The model's big variance axes are occupied by ordinary conversational structure. The harm signal rides in the subspace the model barely notices.
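That angle is cheap to reproduce. In the sketch below I take the classifier direction to be a logistic regression weight vector fitted on the same hidden states, which is an assumption made for illustration.

```python
# Sketch of the angle check: a linear probe's weight vector against the top
# principal component of the same hidden states. Angles near 90 degrees mean
# the class signal lives in low-variance directions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def signal_angle_degrees(hidden: np.ndarray, y: np.ndarray) -> float:
    w = LogisticRegression(max_iter=1000).fit(hidden, y).coef_[0]
    v1 = PCA(n_components=1).fit(hidden).components_[0]
    cos = abs(w @ v1) / (np.linalg.norm(w) * np.linalg.norm(v1))
    return float(np.degrees(np.arccos(cos)))
```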

One more test. I fitted a Fisher linear discriminant on the full training set and got an in sample AUC of 0.9882, almost perfect. Then I refit it per fold, out of sample. AUC dropped to 0.5928. A gap of nearly forty points. This is textbook high dimensional overfitting. When your feature dimension divided by sample count is 0.16, as mine was, an unregularized Fisher discriminant memorizes training samples. The 0.9882 was never real. The 0.5928 was the truth.
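The whole experiment fits in a few lines with sklearn's unregularized linear discriminant; the difference between the two returned numbers is the gap above.

```python
# In-sample versus out-of-sample AUC for an unregularized Fisher discriminant.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

def fisher_gap(X: np.ndarray, y: np.ndarray) -> tuple[float, float]:
    """Return (in-sample AUC, out-of-fold AUC) for an unregularized LDA."""
    lda = LinearDiscriminantAnalysis(solver="svd")
    in_sample = roc_auc_score(y, lda.fit(X, y).decision_function(X))
    oof = cross_val_predict(lda, X, y, cv=5, method="decision_function")
    return in_sample, roc_auc_score(y, oof)
```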

These three numbers together painted the picture. The signal exists, but it is small, and it lives where the encoder spends little of its capacity, and any naive extraction of it overfits horribly. No amount of clever feature engineering walks past this.

The one flip that changed everything

For two days I thought about something else. Then a small observation hit me.

My best stack, which I called v40, used a decision threshold of 0.1438. This threshold came from optimizing F1 on my out of fold predictions. It was the optimal cutoff on training data.

But the test set is not the training data. The test class balance, the noise structure, and the boundary cases can all shift the optimal threshold by a tiny amount. The probability scores are fine. The cutoff can still be off.
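A small sweep makes that concrete. The sketch below keeps the probabilities fixed, walks the cutoff through a grid around the out of fold optimum, and counts how many test predictions flip at each step; the grid is illustrative.

```python
# Threshold sweep: same probabilities, shifted cutoffs, and a count of how
# many predictions each cutoff flips relative to the out-of-fold optimum.
import numpy as np

def threshold_sweep(test_probs: np.ndarray, base_threshold: float,
                    deltas=np.arange(-0.02, 0.021, 0.002)):
    base_preds = test_probs >= base_threshold
    for delta in deltas:
        t = base_threshold + delta
        flips = int(((test_probs >= t) != base_preds).sum())
        print(f"threshold {t:.4f}: {flips} prediction(s) flipped")
```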

So I made a new submission. Exactly the same probabilities as v40. Different threshold. I used 0.15. That is only 0.006 higher. Only one prediction, out of the entire two thousand one hundred sample test set, flipped from TRUE to FALSE. Its v40 probability was 0.1466, just below the new cutoff.

The leaderboard went from 0.96258 to 0.96422. That one flip, on one sample, was worth about as much as seven full signal family experiments combined. I sat there for a moment and let that sink in.

I tried further threshold values. Going to 0.155 flipped one more sample and the leaderboard dropped to 0.96068. Going to 0.16 dropped further. The sweet spot was narrow. Exactly one specific test sample, at probability 0.1466, was wrong in v40. Nothing else nearby was wrong. If I flipped the neighbor, I lost ground.

That was the difference between rank eight and rank five. One well chosen flip.

The broader picture

Zooming out, what I am actually doing here is building a toy version of what Yamakawa and his collaborators are calling Layer 1 of an AI Immune System. The claim behind the project is that as agent to agent communication scales, humans cannot supervise every exchange in real time. So other agents have to do it. The first line of defense is an edge sensor that flags suspicious exchanges quickly, within about fifteen to thirty seconds, before they propagate.

Whether or not you buy the full AIS framework, the underlying empirical question is real. Can you cheaply detect hidden harm in a very short agent exchange, using only surface level language model signals? Based on my results so far, the answer is a cautious yes, with a hard ceiling.

The count of surprising tokens gets you to an AUC of about 0.94 on its own, which is most of the discriminative signal. A fine tuned classifier ensemble adds a few more points. Beyond that, the conditional mutual information bound says there is very little left to extract from passive observation. If you want to push further, you would need to either collect more labeled data, train a dedicated classifier rather than stack probes on generic encoders, or change the observation channel itself, for example by looking at things other than text.

This is a useful thing to know if you are designing a real monitoring pipeline. Simple features go a long way. New features hit diminishing returns fast. Threshold calibration matters more than most feature engineering. Cross validation with proper held out threshold fitting saves you from shipping garbage.

Where I am now

I am still in this competition. Rank five. Score 0.96422. Submission window closes on the 31st of May. I have about five weeks.

Fourth place is 0.96447. The gap is 0.00025. At this stage that is one or two more correct predictions.

I have strong mathematical reason to believe another encoder probe will not help. Conditional mutual information says no. I have tried twelve signal families and every one has failed under rigorous cross validation. But I would still like to stumble into one that works.

Two paths remain. The first is Ghostbuster style fine tuning, where I would train two separate GPT two models, one on harmful samples only, one on safe samples only, and use their perplexity ratio as a new feature. That signal is genuinely new. It is not a probe of an encoder and it is not derived from my existing cached log probabilities. The second path is synthetic data augmentation, where I fine tune a small model on safe text, generate a few thousand harmful styled short samples, add them to training, and retrain the ensemble. That is the move that won the comparable 2024 Kaggle AI detection challenge. Both take real compute. Both might help. Neither is guaranteed.
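To make the first path concrete, the feature I have in mind looks roughly like the sketch below. The two fine tuned checkpoints are hypothetical, since I have not trained them yet; the point is only that the signal is a likelihood ratio between class conditional language models rather than another probe of a shared encoder.

```python
# Sketch of the planned perplexity-ratio feature. The two checkpoint names
# are hypothetical placeholders for GPT-2 models fine-tuned on harmful-only
# and safe-only samples respectively.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
harmful_lm = GPT2LMHeadModel.from_pretrained("gpt2-harmful-only").eval()  # hypothetical
safe_lm = GPT2LMHeadModel.from_pretrained("gpt2-safe-only").eval()        # hypothetical

def mean_log_likelihood(model: GPT2LMHeadModel, text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids the model returns mean cross entropy over tokens.
        loss = model(ids, labels=ids).loss
    return -float(loss)

def likelihood_ratio_feature(text: str) -> float:
    """Positive when the harmful-tuned model finds the text more likely."""
    return mean_log_likelihood(harmful_lm, text) - mean_log_likelihood(safe_lm, text)
```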

I am near the frontier of what this particular setup can do. If I find one more correctly flippable prediction through a truly orthogonal signal, I move up one rank. If I do not, rank five holds and I still walk away with a clearer picture of where the information ceiling sits for this kind of problem.

What I would tell you

If you are tackling a short text anomaly detection problem, start with the simplest feature you can compute. Count the surprising tokens. In a safety context, harm often hides in bland language that a language model finds unsurprising, so low counts of surprising tokens are a strong prior for harmful content. The smart methods in recent detection papers assume long text. They are not wrong. They are just out of their element.

Never trust your optimal training threshold for the test set. Try a small sweep around it. Sometimes a one percent shift in threshold is worth more than a week of feature engineering.

When you are at the top of a leaderboard, correlation between features becomes the enemy. A high AUC probe that correlates 0.8 with your base is worse than nothing. It duplicates what you already have while consuming capacity in your stacking model and making cross validation lie to you.

Estimate conditional mutual information when you can. It tells you, in principle, the ceiling for any feature you could add. Above that ceiling, further feature engineering is a lottery ticket.

And when you are rank five with a narrow gap to rank four and five weeks left, you can either try the next principled experiment or accept where you are. I am trying one more thing. The story continues.