Hey everyone,

My teammate and I just finished our deepfake detection project for university and wanted to share it. The idea started pretty simple: most detectors only look at pixel-level features, but deepfake generators leave traces in the frequency domain too (compression artifacts, spectral inconsistencies...). So we thought, why not use both?

**How it works**

We have two streams running in parallel on each face crop:

* An EfficientNet-B4 that handles the spatial/visual side (pretrained on ImageNet, 1792-dim output)
* A frequency module that runs both an FFT (radial binning, 8 bands, Hann window) and a DCT (8×8 blocks) on the input, each giving a 512-dim vector. Those get fused through a small MLP into a 1024-dim representation

Then we concatenate both streams (1792 + 1024 = 2816-dim total) and pass the result through a classifier MLP. The whole thing is about 25M parameters. Rough sketches of each piece are at the end of the post.

The part we're most proud of is the GradCAM integration: we compute heatmaps on the EfficientNet backbone and remap them back onto the original video frames, so you actually get a video showing which parts of the face triggered the detection. It's surprisingly useful for building intuition about what the model picks up on (spoiler: it's mostly blending boundaries and jawlines, which makes sense).

**Training details**

We used FaceForensics++ (C23), which covers Face2Face, FaceShifter, FaceSwap, and NeuralTextures. After extracting frames at 1 FPS and running YOLOv11n for face detection, we ended up with ~716K face images. Training ran for 7 epochs on an RTX 3090 (rented on vast.ai) and took about 4 hours. Nothing crazy in terms of hyperparameters: AdamW with lr=1e-4, cosine annealing, CrossEntropyLoss.

**What we found interesting**

The frequency stream alone doesn't beat EfficientNet, but the fusion helps noticeably on higher-quality fakes, where pixel-level artifacts are harder to spot. DCT features seem particularly good at catching compression-related artifacts, which is relevant since most real-world deepfake videos end up compressed. The GradCAM outputs confirmed that the model focuses on the right areas, which was reassuring.

**Links**

* GitHub: [https://github.com/VeridisQuo-orga/VeridisQuo](https://github.com/VeridisQuo-orga/VeridisQuo)

This is a university project, so we're definitely open to feedback. If you see obvious things we could improve or test, let us know. We'd love to try cross-dataset evaluation on Celeb-DF or DFDC next if people think that would be interesting.
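**Rough code sketches**

Since some of this is easier to show than describe, here are a few simplified PyTorch sketches of what we outlined above. They are reconstructions for illustration, not the actual repo code; function names, layer sizes we didn't state, and pooling choices are stand-ins, so check the GitHub repo for the real implementation.

First, the FFT half of the frequency stream: apply a Hann window, take the 2D FFT, and pool log-magnitudes into 8 concentric radial bands. Only the window, the FFT, and the 8-band radial binning come from the description above; the per-band mean is an assumption.

```python
import torch


def radial_fft_features(gray: torch.Tensor, n_bands: int = 8) -> torch.Tensor:
    """Mean log-magnitude of the 2D FFT in n_bands concentric rings.

    gray: (B, H, W) single-channel face crops. Returns (B, n_bands).
    """
    B, H, W = gray.shape
    # 2D Hann window to reduce spectral leakage at the crop borders
    win = (torch.hann_window(H, device=gray.device)[:, None]
           * torch.hann_window(W, device=gray.device)[None, :])
    spec = torch.fft.fftshift(torch.fft.fft2(gray * win), dim=(-2, -1))
    logmag = torch.log1p(spec.abs())  # (B, H, W)
    # Normalized distance of every pixel from the spectrum center
    yy, xx = torch.meshgrid(
        torch.arange(H, device=gray.device),
        torch.arange(W, device=gray.device),
        indexing="ij",
    )
    r = torch.sqrt((yy - H / 2) ** 2 + (xx - W / 2) ** 2)
    r = r / r.max()
    # Average the log-magnitude inside each ring
    feats = [
        logmag[:, (r >= b / n_bands) & (r < (b + 1) / n_bands)].mean(dim=1)
        for b in range(n_bands)
    ]
    return torch.stack(feats, dim=1)  # (B, n_bands)
```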
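The DCT half works on non-overlapping 8×8 blocks, like JPEG does. To keep the sketch dependency-free it builds an orthonormal DCT-II matrix by hand instead of calling scipy; summarizing each frequency by its mean absolute coefficient across blocks is our choice here, the post only fixes the 8×8 blocking.

```python
import math

import torch


def dct_matrix(n: int = 8) -> torch.Tensor:
    """Orthonormal DCT-II basis, so coeffs = D @ block @ D.T."""
    k = torch.arange(n, dtype=torch.float32)
    basis = torch.cos(math.pi / n * k[:, None] * (k[None, :] + 0.5))
    basis[0] /= math.sqrt(2.0)
    return basis * math.sqrt(2.0 / n)


def blockwise_dct_features(gray: torch.Tensor, block: int = 8) -> torch.Tensor:
    """Mean absolute DCT coefficient per frequency over all 8x8 blocks.

    gray: (B, H, W) with H, W divisible by `block`. Returns (B, block*block).
    """
    D = dct_matrix(block).to(gray.device)
    # Tile into non-overlapping blocks: (B, H//8, W//8, 8, 8)
    blocks = gray.unfold(1, block, block).unfold(2, block, block)
    coeffs = D @ blocks @ D.T  # 2D DCT-II of every block via broadcasting
    return coeffs.abs().mean(dim=(1, 2)).flatten(1)  # (B, 64)
```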
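Putting the streams together: each frequency descriptor is projected to 512 dims, fused to 1024, concatenated with the 1792-dim EfficientNet-B4 features, and classified. This reuses the two helpers above; the hidden sizes of the projection and classifier MLPs are guesses (only the 512/1024/2816 widths are stated above), and we assume timm supplies the backbone.

```python
import timm  # assumption: timm provides the ImageNet-pretrained EfficientNet-B4
import torch
import torch.nn as nn


class FrequencyStream(nn.Module):
    """FFT + DCT descriptors -> two 512-dim vectors -> fused 1024-dim."""

    def __init__(self, out_dim: int = 1024):
        super().__init__()
        self.fft_proj = nn.Sequential(nn.Linear(8, 512), nn.ReLU(inplace=True))
        self.dct_proj = nn.Sequential(nn.Linear(64, 512), nn.ReLU(inplace=True))
        self.fuse = nn.Sequential(nn.Linear(1024, out_dim), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gray = x.mean(dim=1)  # (B, H, W): average channels before the transforms
        f = self.fft_proj(radial_fft_features(gray))     # (B, 512), helper above
        d = self.dct_proj(blockwise_dct_features(gray))  # (B, 512), helper above
        return self.fuse(torch.cat([f, d], dim=1))       # (B, 1024)


class DualStreamDetector(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # num_classes=0 makes timm return pooled 1792-dim features for B4
        self.spatial = timm.create_model(
            "efficientnet_b4", pretrained=True, num_classes=0
        )
        self.freq = FrequencyStream(out_dim=1024)
        self.classifier = nn.Sequential(  # 1792 + 1024 = 2816-dim input
            nn.Linear(2816, 512), nn.ReLU(inplace=True), nn.Dropout(0.3),
            nn.Linear(512, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.spatial(x), self.freq(x)], dim=1)
        return self.classifier(fused)
```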
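For the GradCAM videos, the idea boils down to: hook the activations and gradients of a late conv layer in the backbone, weight the activation maps by their spatially averaged gradients, then resize the heatmap into the face's bounding box on the full frame. This is a generic hook-based GradCAM, not necessarily how the repo implements it.

```python
import cv2
import numpy as np
import torch


def gradcam_heatmap(model, target_layer, face: torch.Tensor, class_idx: int = 1):
    """Grad-CAM for one face crop. face: (1, 3, H, W). Returns (h, w) in [0, 1]."""
    acts, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.append(go[0]))
    logits = model(face)             # forward pass records the activations
    model.zero_grad()
    logits[0, class_idx].backward()  # backward pass records the gradients
    h1.remove()
    h2.remove()
    a, g = acts[0], grads[0]                     # both (1, C, h, w)
    weights = g.mean(dim=(2, 3), keepdim=True)   # per-channel gradient average
    cam = torch.relu((weights * a).sum(dim=1))[0]
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam.detach().cpu().numpy()


def overlay_on_frame(frame: np.ndarray, cam: np.ndarray, bbox) -> np.ndarray:
    """Resize the heatmap into the face bbox and blend it onto the full frame."""
    x1, y1, x2, y2 = bbox
    heat = cv2.resize(cam, (x2 - x1, y2 - y1))  # cv2 wants (width, height)
    heat = cv2.applyColorMap(np.uint8(255 * heat), cv2.COLORMAP_JET)
    frame[y1:y2, x1:x2] = cv2.addWeighted(frame[y1:y2, x1:x2], 0.6, heat, 0.4, 0)
    return frame


# e.g. target_layer = detector.spatial.conv_head for a timm EfficientNet-B4
```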
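The preprocessing pipeline is: sample one frame per second with OpenCV, run the face detector, save the crops. Note that the stock `yolo11n.pt` weights from ultralytics detect COCO objects, not faces; `yolov11n-face.pt` below is a placeholder name for a face-trained checkpoint.

```python
from pathlib import Path

import cv2
from ultralytics import YOLO

detector = YOLO("yolov11n-face.pt")  # placeholder: face-trained YOLOv11n weights


def extract_faces(video_path: str, out_dir: str) -> int:
    """Save face crops from one video, sampled at ~1 FPS. Returns crop count."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25
    step = int(round(fps))  # keep one frame per second of video
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            for box in detector(frame, verbose=False)[0].boxes.xyxy:
                x1, y1, x2, y2 = map(int, box.tolist())
                crop = frame[y1:y2, x1:x2]
                if crop.size:
                    cv2.imwrite(f"{out_dir}/{saved:06d}.jpg", crop)
                    saved += 1
        idx += 1
    cap.release()
    return saved
```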
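Finally, the training loop is nothing special and just matches the hyperparameters above: AdamW at lr=1e-4, cosine annealing over the 7 epochs, cross-entropy loss. The dataloader is assumed to yield (face crop, label) batches.

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR


def train(model: nn.Module, train_loader, epochs: int = 7, device: str = "cuda"):
    """Plain supervised training: AdamW + cosine annealing + cross-entropy."""
    model.to(device).train()
    optimizer = AdamW(model.parameters(), lr=1e-4)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in train_loader:  # face crops, 0 = real / 1 = fake
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()  # anneal the learning rate once per epoch
    return model
```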