Hanna, M., Ameisen, E. (2026). Latent Planning Emerges with Scale. To appear in ICLR 2026
Nakajima, K., Zuiderveld, J., Pezzelle, S. (2026). Beyond Divergent Creativity: A Human-Based Evaluation of Creativity in Large Language Models. To appear in EACL 2026
Bavaresco, A., de Heer Kloots, M., Pezzelle, S., Fernández, R. (2026). Modelling Multimodal Integration in Human Concept Processing with Vision-and-Language Models. To appear in EACL 2026
Surikuchi, A., Fernández, R., Pezzelle, S. (2026). Where is the multimodal goal post? On the Ability of Foundation Models to Recognize Contextually Important Moments. arXiv preprint
Huang, Y., Barlacchi, G., Pezzelle, S. (2026). Who is the richest club in the championship? Detecting and Rewriting Underspecified Questions Improve QA Performance. arXiv preprint
Prins, Z., Wildenburg, F., Cinà, G., Pezzelle, S. (2026). Is my model perplexed for the right reason? Contrasting LLMs’ Benchmark Behavior with Behavior Specified via Token-Level Perplexity. arXiv preprint
Hanna, M., Belinkov, Y., Pezzelle, S. (2025). Are formal and functional linguistic mechanisms dissociated in language models? Computational Linguistics
Rakotonirina, N.C., Hamdy, M., Campos, J.A., Weber, L., Testoni, A., Fadaee, M., Pezzelle, S., Del Tredici, M. (2025). From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions. ACL 2025
Bavaresco, A., Bernardi, R., Bertolazzi, L., Elliott, D., Fernández, R., Gatt, A., Ghaleb, E., Giulianelli, M., Hanna, M., Koller, A., Martins, A., Mondorf, P., Neplenbroek, V., Pezzelle, S., Plank, B., Schlangen, D., Suglia, A., Surikuchi, A., Takmaz, E., Testoni, A. (2025). LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks. ACL 2025
Paci, W., Panunzi, A., Pezzelle, S. (2025). They want to pretend not to understand: The Limits of Current LLMs in Interpreting Implicit Content of Political Discourse. ACL 2025 Findings
Bertolazzi, L., Pezzelle, S., Bernardi, R. (2025). How Language Models Conflate Logical Validity with Plausibility: A Representational Analysis of Content Effects. arXiv preprint
Zaranis, E., Farinhas, A., Santos, S., Canaverde, B., Moura Ramos, M., Surikuchi, A. K., Viveiros, A., …, Elliott, D., Dimiccoli, M., Bansal, M., Lanz, O., Bernardi, R., Fernández, R., Pezzelle, S., Niculae, V., Martins, A. F. T. (2025). Movie Facts and Fibs (MF2): A Benchmark for Long Movie Understanding. arXiv preprint
Surikuchi, A., Fernández, R., Pezzelle, S. (2025). Natural Language Generation from Visual Events: Challenges and Future Directions. arXiv preprint
Hanna, M., Pezzelle, S., Belinkov, Y. (2024). Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms. CoLM 2024
Surikuchi, A., Fernández, R., Pezzelle, S. (2024). Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition. EMNLP 2024 Findings
Wildenburg, F., Hanna, M., Pezzelle, S. (2024). Do Pre-Trained Language Models Detect and Understand Semantic Underspecification? Ask the DUST! ACL 2024 Findings
Cinà, G., Fernandez-Llaneza, D., Deponte, L., Mishra, N., Röber, T. E., Pezzelle, S., Calixto, I., Goedhart, R., Birbil, Ş. İ. (2024). Fixing confirmation bias in feature attribution methods via semantic match. arXiv preprint
Pezzelle, S. (2023). Dealing with Semantic Underspecification in Multimodal NLP. ACL 2023
