Publications

Hanna, M., Ameisen, E. (2026). Latent Planning Emerges with Scale. To appear in ICLR 2026

Nakajima, K., Zuiderveld, J., Pezzelle, S. (2026). Beyond Divergent Creativity: A Human-Based Evaluation of Creativity in Large Language Models. To appear in EACL 2026

Bavaresco, A., de Heer Kloots, M., Pezzelle, S., Fernández, R. (2026). Modelling Multimodal Integration in Human Concept Processing with Vision-and-Language Models. To appear in EACL 2026

Surikuchi, A., Fernández, R., Pezzelle, S. (2026). Where is the multimodal goal post? On the Ability of Foundation Models to Recognize Contextually Important Moments. arXiv preprint

Huang, Y., Barlacchi, G., Pezzelle, S. (2026). Who is the richest club in the championship? Detecting and Rewriting Underspecified Questions Improve QA Performance. arXiv preprint

Prins, Z., Wildenburg, F., Cinà, G., Pezzelle, S. (2026). Is my model perplexed for the right reason? Contrasting LLMs’ Benchmark Behavior with Behavior Specified via Token-Level Perplexity. arXiv preprint

Hanna, M., Belinkov, Y., Pezzelle, S. (2025). Are formal and functional linguistic mechanisms dissociated in language models? Computational Linguistics

Rakotonirina, N.C., Hamdy, M., Campos, J.A., Weber, L., Testoni, A., Fadaee, M., Pezzelle, S., Del Tredici, M. (2025). From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions. ACL 2025

Bavaresco, A., Bernardi, R., Bertolazzi, L., Elliott, D., Fernández, R., Gatt, A., Ghaleb, E., Giulianelli, M., Hanna, M., Koller, A., Martins, A., Mondorf, P., Neplenbroek, V., Pezzelle, S., Plank, B., Schlangen, D., Suglia, A., Surikuchi, A., Takmaz, E., Testoni, A. (2025). LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks. ACL 2025

Paci, W., Panunzi, A., Pezzelle, S. (2025). They want to pretend not to understand: The Limits of Current LLMs in Interpreting Implicit Content of Political Discourse. ACL 2025 Findings

Bertolazzi, L., Pezzelle, S., Bernardi, R. (2025). How Language Models Conflate Logical Validity with Plausibility: A Representational Analysis of Content Effects. arXiv preprint

Zaranis, E., Farinhas, A., Santos, S., Canaverde, B., Moura Ramos, M., Surikuchi, A. K., Viveiros, A., …, Elliott, E., Dimiccoli, M., Bansal, M., Lanz, O., Bernardi, R., Fernández, R., Pezzelle, S., Niculae, V., Martins, A. F. T. (2025). Movie Facts and Fibs (MF2): A Benchmark for Long Movie Understanding. arXiv preprint

Surikuchi, A., Fernández, R., Pezzelle, S. (2025). Natural Language Generation from Visual Events: Challenges and Future Directions. arXiv preprint

Hanna, M., Pezzelle, S., Belinkov, Y. (2024). Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms. CoLM 2024

Surikuchi, A., Fernández, R., Pezzelle, S. (2024). Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition. EMNLP 2024 Findings

Wildenburg, F., Hanna, M., Pezzelle, S. (2024). Do Pre-Trained Language Models Detect and Understand Semantic Underspecification? Ask the DUST! ACL 2024 Findings

Cinà, G., Fernandez-Llaneza, D., Deponte, L., Mishra, N., Röber, T. E., Pezzelle, S., Calixto, I., Goedhart, R., Birbil, Ş. İ. (2024). Fixing confirmation bias in feature attribution methods via semantic match. arXiv preprint

Pezzelle, S. (2023). Dealing with Semantic Underspecification in Multimodal NLP. ACL 2023