As AI researchers, we are often seduced by near-perfect internal validation metrics. In fact, today’s deep learning models (including large language models) are notable for the remarkable accuracy they can achieve on specific tasks.
However, getting a lot right does not necessarily mean that the patterns they have learnt are reliable, or that they have actually ‘understood’ the root of the problem. Sometimes this apparent accuracy is a mirage. Data-driven models simply seek out regularities in the data and encode them as patterns; these regularities may have nothing to do with the semantics of the problem to be solved. The result is an AI that shines in the laboratory but fails spectacularly in the real world.
In behavioural science, this phenomenon is known as the Clever Hans effect, named after the famous horse which, in the early 20th century, astonished the public with its supposed ability to solve mathematical problems. For a time, the animal was credited with arithmetic abilities previously thought to be unique to humans. The reality was less magical: the horse could not add, but was an expert at reading the unconscious visual cues its trainer gave through body language when posing the problems.
Similarly, an AI model suffering from the ‘Clever Hans effect’ may perform exceptionally well in situations that resemble those it has already memorised during training. However, when we remove the ‘false clues’ and apply it to real-world contexts, its performance plummets and it becomes ineffective.
The latest paper by Andrés Montoro Montarroso, a founding member of I2SC, written in collaboration with researchers from the University of Jaén and the University of Granada, examines the robustness and generalisation ability of machine learning models. Entitled ‘Enhancing disinformation detection with explainable AI and named entity replacement’, it proposes a methodology that uses explainability techniques to identify spurious features, with the aim of improving how well machine learning models generalise in the task of automatically detecting disinformation.
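To make the idea of named entity replacement concrete, here is a minimal sketch of what such a step might look like. It is not the authors' implementation: the article does not describe their procedure in detail, so the `ENTITIES` lexicon, the `replace_entities` function and the placeholder format are all hypothetical. A real pipeline would presumably rely on a named entity recognition model rather than a hand-written list, and the explainability step that decides which features are spurious is not shown here.

```python
import re

# Hypothetical entity lexicon, invented for illustration.
# In practice a NER model would supply the entity mentions and their types.
ENTITIES = {
    "PERSON": ["Jane Doe", "John Smith"],
    "ORG": ["Acme Corp"],
    "LOC": ["Springfield"],
}


def replace_entities(text, entities=ENTITIES):
    """Replace each known entity mention with a placeholder for its type."""
    # Process longer mentions first so that a short name contained in a
    # longer one does not break the longer match.
    pairs = sorted(
        ((mention, label)
         for label, mentions in entities.items()
         for mention in mentions),
        key=lambda pair: len(pair[0]),
        reverse=True,
    )
    for mention, label in pairs:
        pattern = r"\b" + re.escape(mention) + r"\b"
        text = re.sub(pattern, f"[{label}]", text)
    return text


example = "Jane Doe of Acme Corp denied the rumours in Springfield."
masked = replace_entities(example)
print(masked)  # [PERSON] of [ORG] denied the rumours in [LOC].
```

The intuition is that a classifier trained on the masked text can no longer use a specific name as a shortcut for the label, and has to rely on the rest of the content instead.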
When applying this approach, they observe that internal accuracy (on the dataset used to train the model) falls slightly, for example from 98% to 96%. In exchange, the ability to generalise to unfamiliar environments (datasets other than the training set, but within the same domain) increases dramatically, with an average improvement of 44.14% in generalisation performance on external data.
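The difference between internal and external evaluation can be illustrated with a toy sketch. Everything in it is invented for illustration: the two rule-based ‘models’, the example sentences and the labels are hypothetical and have nothing to do with the paper's data or results. It only shows why a model that leans on a spurious clue can look perfect on data resembling its training set and still fail elsewhere.

```python
def accuracy(predict, dataset):
    """Fraction of (text, label) pairs that the predict function gets right."""
    correct = sum(1 for text, label in dataset if predict(text) == label)
    return correct / len(dataset)


# Toy 'Clever Hans' model: it flags a text as disinformation (1) whenever a
# name that happened to appear in the training fakes is mentioned.
def shortcut_model(text):
    return 1 if "Acme Corp" in text else 0


# Toy model that looks at the claim itself rather than at who is named.
def content_model(text):
    return 1 if "miracle cure" in text.lower() else 0


# Invented data: 1 = disinformation, 0 = legitimate.
internal = [
    ("Acme Corp sells a miracle cure for ageing", 1),
    ("Acme Corp hides a miracle cure from the public", 1),
    ("City council approves new bus routes", 0),
    ("Local school wins science award", 0),
]
external = [
    ("Globex announces a miracle cure for baldness", 1),
    ("Acme Corp publishes quarterly results", 0),
    ("Doctors warn about a miracle cure sold online", 1),
    ("Museum reopens after renovation", 0),
]

for name, model in [("shortcut", shortcut_model), ("content", content_model)]:
    print(name, accuracy(model, internal), accuracy(model, external))
# shortcut 1.0 0.25
# content 1.0 1.0
```

On the internal set both toy models are indistinguishable; only the external set reveals that one of them has learnt the name rather than the claim.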
Optimising a model to win a competition by focusing solely on accuracy is relatively straightforward; building one that does not inherit the biases of its data is the real challenge. Trustworthy AI requires us to prioritise robustness and explainability over accuracy rates in a controlled environment.

