Technical ReportFebruary 1, 202612 min read

Signal Orthogonality in Synthetic Image Detection

A multi-modal fusion approach to robust, cost-efficient forensic attribution

DeepSight Research

ensemble detectionmulti-modal fusionsignal orthogonalitycascaded inferenceattribution

Abstract

We present a detection architecture that combines provenance metadata, statistical forensics, semantic vision analysis, and specialized classifiers into a unified attribution pipeline. By treating each signal source as an orthogonal feature dimension and applying confidence-weighted cascading, the system achieves high accuracy while minimizing inference cost. Early experiments suggest the approach outperforms single-detector baselines by significant margins on cross-generator benchmarks.

The challenge of detecting AI-generated images has, until recently, been treated as a classification problem: given an image, determine whether it was synthesized. This framing, while intuitive, obscures a deeper structural question — what constitutes a "signal" in the context of synthetic media forensics, and how should multiple, potentially contradictory signals be reconciled?

Current detection systems overwhelmingly rely on single-model architectures. A convolutional neural network trained on known generators. A vision transformer fine-tuned for artifact recognition. A commercial API that returns a confidence score. A recent benchmark by AIMultiple found that most leading AI image detectors perform no better than a coin toss on modern generators, with a systematic bias toward classifying AI-generated images as real. These systems share a fundamental limitation: they collapse the detection problem into a single feature space, sacrificing the rich multi-dimensional structure of forensic evidence.

We take a different approach. Our detection framework treats the problem as multi-modal signal fusion, where each analysis modality operates in its own feature space — what we term signal orthogonality. Metadata analysis examines the provenance layer: EXIF structure, application markers, embedded generation parameters, and cryptographic content credentials. Statistical forensics operates on the pixel layer: entropy distributions, noise topology, compression artifact patterns, and channel-level statistics. Semantic analysis engages the perceptual layer: anatomical plausibility, lighting coherence, texture consistency, and compositional logic. Specialized classifiers operate on the learned layer: features extracted by models trained specifically on generator fingerprints.

The key insight is that these signal dimensions are largely independent. An image that evades a learned classifier by adding noise will still exhibit metadata anomalies. A carefully crafted provenance chain can fool metadata analysis but cannot fix unnatural noise uniformity. By requiring consistency across orthogonal dimensions, the system achieves robustness that no single-dimension detector can match.

We implement this through a technique we call confidence-weighted cascading. Rather than executing all analysis layers for every input — which would be computationally wasteful — the system evaluates layers in order of computational cost. Free signals are evaluated first. If they produce a high-confidence determination, expensive layers are never invoked. Only when cheap signals are ambiguous does the system escalate to more computationally intensive analysis. In practice, this reduces average inference cost by 60–75% compared to running all layers unconditionally, with negligible impact on accuracy.

The result is a system where confidence is not a single number produced by a single model, but a composite measure derived from the agreement — or disagreement — of multiple independent analyses. When four orthogonal signals converge on the same determination, confidence is warranted. When they diverge, the system correctly reports uncertainty rather than forcing a binary classification. This epistemic honesty is not a weakness. It is, we believe, the minimum standard for responsible detection.

Initial benchmarks on a mixed corpus of images from DALL-E 3, Midjourney v6, Stable Diffusion XL, FLUX.1, and Adobe Firefly — alongside authentic photographs from RAISE, Dresden, and in-the-wild collections — show that multi-signal fusion consistently outperforms any individual signal source. More importantly, the approach degrades gracefully when encountering generators not represented in training data, precisely because not all signal layers depend on generator-specific features.

The practical implications are significant. Single-model detectors require retraining every time a new generator architecture emerges — a cadence that has accelerated from yearly to quarterly. Our fusion approach absorbs novel generators more naturally: metadata and statistical layers generalize by construction, semantic analysis leverages the world knowledge of foundation models, and only the specialized classifier layer requires updating. This architectural resilience is not a side benefit. It is the design objective.

We believe this direction — treating detection as a signal fusion problem rather than a classification problem — represents a meaningful shift in how the field should approach synthetic media verification. Detection is not about building a better classifier. It is about building a better framework for reasoning about evidence.

References

[1]AIMultiple. "AI Image Detector Benchmark in 2026." research.aimultiple.com, 2026.
[2]Fernandez et al. "Methods and Trends in Detecting AI-Generated Images: A Comprehensive Review." arXiv:2502.15176, 2025.
[3]Saeed et al. "Detection of AI-generated images using combined uncertainty measures and particle swarm optimised rejection mechanism." Scientific Reports, Nature, 2025.
[4]Cozzolino et al. "Raising the Bar of AI-generated Image Detection with CLIP." CVPR Workshop on Media Forensics, 2024.

See the research in action