VoxParadox
Adversarial benchmark that evaluates language dominance in audio LLMs' paralinguistic understanding using controlled linguistic-acoustic contradictions (2k clips, 10 tasks). Proposed PCLM, a novel layer-fusion strategy that, coupled with DPO, improves Audio Flamingo 3 by 48% and Qwen2-Audio by 57%.
Benchmark
Speech LLMs
ICML 2026
PoisonedGraphRAG
Knowledge-corruption attacks against GraphRAG, achieving up to a 70% attack success rate with GPT-4o-mini. Developed a benchmark with highly interconnected named-entity graphs and an extensive per-question indexing pipeline.
RAG
Adversarial NLP
Security
MSECap
Fusion-based audio-textual speech emotion captioning with early, late, and X-Norm fusion strategies. Using prefix-conditioned decoders with a pretrained LLM, X-Norm achieved a 20% higher GPT-evaluated match rate than the early/late baselines.
Multimodal
Emotion
Captioning
Gold Labels to Trap Labels
A new method for detecting benchmark leakage by exposing model overfitting to mislabeled or low-confidence examples in SNLI. Identifies dataset outliers via unanimous aggregation of four BERT-based models, human annotation, and clustered Borda ranking with confidence-based safeguards.
NLI
Dataset Reliability
Benchmarking