Methodology · December 8, 2025 · 11 min read

Beyond Binary Classification

Structured forensic reasoning with vision-language models

DeepSight Research

vision-language models · forensic reasoning · dimensional scoring · explainability · VLM

Abstract

A 2025 study concluded that "LLMs are not yet ready for deepfake image detection." We argue this finding, while technically correct, reflects the wrong question. By decomposing forensic analysis into eight structured dimensions and using VLMs as feature extractors rather than binary classifiers, we achieve both higher accuracy and inherent explainability. We describe the approach, its theoretical basis, and the style-bias failure modes it mitigates.


A June 2025 study by Xu et al. evaluated leading vision-language models — GPT-4o, Claude, Gemini 2.5 Flash, and Grok 3 — on deepfake detection benchmarks and concluded, in their title, that "LLMs Are Not Yet Ready for Deepfake Image Detection." The finding, while technically accurate on the metric they measured, obscures a more nuanced reality with significant implications for practical detection systems.

The study measured VLMs on binary classification accuracy — is this image real or fake? On this metric, VLMs indeed underperform specialized detectors. But binary accuracy was not the only dimension measured. The researchers also noted that VLMs demonstrated "notable strengths in generating natural-language rationales, often including references to facial symmetry, lighting inconsistencies, or texture anomalies," with Claude and ChatGPT producing "the most coherent and detailed justifications."

This finding aligns precisely with our experience, and it suggests that the research community has been asking the wrong question. The question is not "can VLMs classify images as real or fake?" — a task for which specialized models are better suited. The question is: "what forensic observations can VLMs make that other methods cannot?"

Our approach decomposes forensic analysis into eight structured dimensions: anatomical consistency, texture coherence, lighting physics, text rendering, background plausibility, symmetry patterns, repetitive artifacts, and semantic logic. Rather than asking a VLM for a single yes-or-no determination, we ask it to evaluate each dimension independently on a numerical scale, accompanied by natural-language reasoning.
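The shape of such a per-dimension report can be sketched as a small schema. This is a minimal illustration, not the production format: the eight dimension names come from the text above, but the 0-10 scale, field names, and the `validate_report` helper are assumptions introduced here.

```python
from dataclasses import dataclass

# The eight forensic dimensions named in the text. The numeric scale and
# field names are illustrative assumptions, not a published schema.
DIMENSIONS = [
    "anatomical_consistency",
    "texture_coherence",
    "lighting_physics",
    "text_rendering",
    "background_plausibility",
    "symmetry_patterns",
    "repetitive_artifacts",
    "semantic_logic",
]

@dataclass
class DimensionScore:
    dimension: str
    score: float     # assumed scale: 0.0 = strong synthesis evidence, 10.0 = fully natural
    rationale: str   # the VLM's natural-language reasoning for this dimension

def validate_report(scores: list[DimensionScore]) -> dict[str, float]:
    """Check that a VLM response covers all eight dimensions and return a
    {dimension: score} map suitable for downstream fusion."""
    seen = {s.dimension: s.score for s in scores}
    missing = [d for d in DIMENSIONS if d not in seen]
    if missing:
        raise ValueError(f"incomplete report, missing: {missing}")
    return {d: seen[d] for d in DIMENSIONS}
```

Requesting this structure from the model, rather than a yes-or-no verdict, is what turns a single inference call into eight independent measurements plus their rationales.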

This dimensional decomposition achieves three things simultaneously. First, it reduces the variance of VLM outputs. A model that might waver between "real" and "fake" on a binary question can still reliably assess that an image has implausible hand anatomy while exhibiting natural lighting. The per-dimension scores are more stable than the aggregate classification. Second, it produces explainable results — users see not just a verdict but a structured breakdown of what the system observed and why. This is not merely a UX benefit. It is a trust-calibration mechanism: users who understand why a system reached its conclusion are better equipped to evaluate that conclusion. Third, and most importantly, it transforms the VLM from a classifier into a feature extractor. The dimensional scores become inputs to a broader fusion framework, where they are weighted and combined with non-semantic signals that the VLM cannot access: metadata provenance, spectral features, noise topology.
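A late-fusion step of this kind can be sketched in a few lines. The weights and normalization below are illustrative assumptions; the article does not publish the actual fusion coefficients.

```python
def fuse(dim_scores: dict[str, float],
         non_semantic: dict[str, float],
         w_semantic: float = 0.4,
         w_non_semantic: float = 0.6) -> float:
    """Late fusion sketch: collapse the per-dimension VLM scores (assumed
    0-10 scale) into one semantic authenticity estimate, then blend it with
    non-semantic detectors (metadata, spectral, noise), each assumed to be
    pre-normalized to [0, 1]. Weights here are placeholders."""
    semantic = sum(dim_scores.values()) / (10.0 * len(dim_scores))
    stats = sum(non_semantic.values()) / len(non_semantic)
    return w_semantic * semantic + w_non_semantic * stats
```

In practice the combination need not be a fixed linear blend; a learned combiner can weight dimensions differently per generator family. The point is structural: the VLM contributes features to the fusion, not the verdict itself.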

The Xu et al. study revealed a fascinating failure mode that validates this approach. GPT-4o systematically misclassified vintage-style diffusion images as real, suggesting that certain aesthetic features — sepia tones, film grain, deliberate imperfections — act as implicit authenticity priors in the model's internal representation. This is a style bias: the model has learned that "old-looking" correlates with "real," a heuristic that worked historically but fails catastrophically against generators that can synthesize any aesthetic.

This is precisely the kind of failure that single-model systems cannot overcome. In a multi-signal framework, a VLM that is fooled by stylistic mimicry will be contradicted by statistical signals that are agnostic to aesthetic choices. The vintage-looking diffusion image may fool the semantic layer, but it cannot fool the noise topology analyzer, which detects the unnaturally uniform noise distribution characteristic of diffusion outputs regardless of the visual style applied on top of it.
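One way such a style-agnostic statistical check can work is to measure how uniform the noise variance is across image patches. The sketch below is a simplified stand-in for a real noise-topology analyzer: the first-difference residual, patch size, and the coefficient-of-variation statistic are all illustrative choices, not the system's actual method.

```python
import numpy as np

def noise_uniformity(gray: np.ndarray, patch: int = 32) -> float:
    """Return the coefficient of variation of per-patch noise variance for a
    2-D grayscale image. Natural photos show spatially varying sensor noise
    (higher CV); diffusion outputs tend toward unnaturally uniform noise
    (lower CV), regardless of the aesthetic style applied on top."""
    # Horizontal first difference as a cheap high-pass noise estimate.
    resid = gray[:, 1:] - gray[:, :-1]
    h, w = resid.shape
    patch_vars = []
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            patch_vars.append(resid[i:i + patch, j:j + patch].var())
    v = np.asarray(patch_vars)
    return float(v.std() / (v.mean() + 1e-12))
```

A sepia filter or simulated film grain shifts the image's appearance, but it does not restore the spatially varying noise structure of a physical sensor, which is why this signal survives stylistic mimicry.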

We consider structured forensic reasoning to be among our most important methodological contributions — not because the technique is complex, but because it reframes the role of large language models in the detection pipeline. VLMs are not detectors. They are forensic observers. The distinction is not semantic. It is architectural, and it determines how much value you extract from what is, by necessity, the most expensive inference call in the stack. Treating a $0.003 API call as a binary oracle wastes most of its analytical power. Treating it as a structured sensor that reports eight independent measurements extracts far more signal from the same expenditure.


References

  1. Xu et al. "LLMs Are Not Yet Ready for Deepfake Image Detection." arXiv:2506.10474, 2025.
  2. Fernandez et al. "Methods and Trends in Detecting AI-Generated Images: A Comprehensive Review." arXiv:2502.15176, 2025.
  3. Cozzolino et al. "Raising the Bar of AI-generated Image Detection with CLIP." CVPR Workshop on Media Forensics, 2024.


See the research in action

Our detection engine implements the techniques described in this paper. Upload an image and see multi-signal fusion at work.

Try the detector