Computational Linguistics Seminar Series at ILLC

Computational Linguistics Seminar

The CLS is the Computational Linguistics Seminar of the University of Amsterdam. Seminars are open to all interested researchers and students of all levels from UvA and elsewhere. The CLS is organized and sponsored by the Institute for Logic, Language and Computation's NLP and Digital Humanities research unit, with additional sponsorship from the ELLIS Unit Amsterdam.

Contact

The seminar is organised by Martha Lewis, Charlotte Pouw, Fengxiang Cheng, Seth Aycock and Vera Neplenbroek.

To receive notifications about upcoming talks and the Zoom details, please join the CLS mailing list.

Subscribe to the announcement mailing list.

Calendar

To make sure you do not miss any talk, you can add the CLS agenda to your calendar.

Link to the CLS calendar.

Location

The CLS usually takes place on Tuesdays at 15:30 in room L3.36 at LAB42 in Amsterdam Science Park or via Zoom. Other days and locations are occasionally possible. See the details for each talk. To receive the details please subscribe to the CLS mailing list. The links to participate on Zoom will be distributed via the mailing list on the day of the seminars.

To reach L3.36 or other rooms on the third floor of LAB42 as a student or guest, please use the elevators, which are behind the VIA study association office, or the stairs located behind the elevators. Other stairs may have gates or doors that can only be passed by UvA employees.

Upcoming Talks

Workshop + Defence (Xiaoyu Tong)

Workshop "Beyond Text: Conceptual, Social, and Cultural Knowledge of Large Language Models"

Thursday July 2nd, 14:00-17:00. Room L1.01 at LAB42, Amsterdam Science Park. This event is in-person only.

On Friday July 3rd Xiaoyu Tong will defend her PhD thesis titled "Dissecting Incongruity: Metaphor and Humour Understanding of Large Language Models". This will be marked the day before by a mini-workshop on the theme: "Beyond Text: Conceptual, Social, and Cultural Knowledge of Large Language Models".

Programme:

14:00-14:20 Arrival
14:20-14:30 Welcome and introduction
14:30-15:00 Xiaoyu Tong, ILLC UvA: Nuances of multimodal humor for large language models
15:00-16:00 Pia Sommerauer, VU: Stereotype communication in humans and LLM personas
16:00-17:00 Desmond Elliott, U Copenhagen: Towards Understanding and Explaining Vision-Language Models

Abstracts:

Xiaoyu Tong, ILLC UvA: Nuances of multimodal humor for large language models

Humor plays important roles in communication and, as a major manifestation of human creativity, exhibits both universality and cultural variance. Understanding and using humor properly is a task that large language models (LLMs) must strive to accomplish. In this talk, I will present two studies on LLMs' processing of multimodal humor. The first study is concerned with one of the most common humorous mechanisms: metaphor use. Taking inspiration from well-established metaphor and humor theories, we developed a novel annotation scheme for humorous multimodal metaphor use in image-caption pairs, and created the Hummus (Humorous Multimodal Metaphor Use) dataset, providing expert annotation for 1k New Yorker cartoons with humorous captions. We used this dataset to test state-of-the-art LLMs on their ability to detect and understand metaphor use for humorous purposes. The second study deals with cultural variation in humor. We collected funniness and emotion annotations for the New Yorker cartoons in the Hummus dataset from four diverse cultures: U.S., Mexican, Polish, and Chinese. Through data analysis, we revealed a significant effect of culture and metaphor use on humor appreciation. We also found subtle cultural differences with regard to the emotion categories associated with high/low funniness ratings. These findings serve as a solid basis for evaluating LLMs' cultural alignment in humor processing.

Pia Sommerauer, VU: Stereotype communication in humans and LLM personas

Stereotypes are mental images we have of social categories. Stereotypes can be harmful when they erase the individuality of people and lead to prejudice and discrimination. But how do such stereotypes emerge and grow? Communication scientists argue that they are primarily shared and reinforced via communication and, more specifically, language and linguistic biases. In this talk, I am going to present interdisciplinary work on an ongoing project and an initial study on stereotypes in LLM personas. The goal of the project is to understand how different linguistic biases emerge and interact. Communication scientists have found that there is systematic variation in whether people describe a situation that is in line with their stereotypic expectations or a situation that goes against stereotypic expectations. For example, people are more likely to describe expected situations in more generic and abstract terms (women are emotional) and unexpected situations in more specific and concrete terms (the man burst into tears). Many of these biases have been examined in highly controlled experiments where participants could choose between pre-defined descriptions that test particular linguistic patterns. It is not yet well understood how these patterns occur and interact in freely generated language. In this project, we aim to understand such patterns and interactions, so that they can be used to ‘measure’ the risk of communicating stereotypes in texts. In a complementary study, we tested whether persona-prompted LLMs exhibit the same systematic variation observed in humans. We extract self-described stereotypes from Reddit and prompt LLMs to describe stereotype-expected and stereotype-unexpected situations derived from this corpus. We measure abstraction, genericity, and negation and find that LLMs remain generic and contain stereotypical tropes regardless of the situation.

Desmond Elliott, U Copenhagen: Towards Understanding and Explaining Vision-Language Models

Large Language Models can be transformed into Vision-Language Models by learning a mapping that projects visual tokens from a vision encoder into the embedding space of the language model. This mapping can be as simple as a learned linear layer, which raises questions about how language models can be so effective as multimodal models. In this talk, I will present two recent studies on understanding and explaining how Vision-Language Models work. In the first study, I will discuss how we can uncover the knowledge encoded in vision and language models, respectively, by probing them for conceptual knowledge representation. We find that vision models encode a surprising amount of encyclopaedic and functional knowledge, in contrast to long-standing beliefs about what can be learned from images alone. In the second part of my talk, I will present LatentLens, a new method for interpreting the representation of visual tokens in language models using contextualized representations of textual nearest-neighbours. LatentLens offers improved interpretability compared to existing techniques throughout the layers of Vision-Language Models.

Past Talks

Andrea de Varda, MIT

Large Language Models as models of human language(s) and higher-level cognition

Friday 19th June 2026, 14:00. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom. Video

Large language models (LLMs) have recently emerged as powerful candidates for modeling several domains of human cognition. Because they operate over natural language, they provide flexible representations that can be evaluated against human behavior and brain activity. In this talk, I will present a set of studies that use LLMs to test how far this modeling approach can go—first in the domain of language, and then in higher-level reasoning. In the first part, I ask whether multilingual language models can explain how the human brain processes the extraordinary diversity of the world's languages. Using fMRI data from native speakers of 21 languages spanning 7 language families, we show that model embeddings reliably predict brain responses within languages and, crucially, transfer zero-shot across languages and families. These results point to a shared representational component in the human language network, largely driven by semantic content, that aligns with the representations learned by multilingual models. In the second part, I move beyond language to ask whether LLMs can serve as models of human reasoning, from two angles. First, the brain shows striking functional specialization, with distinct networks for language, formal reasoning, social reasoning, and physical reasoning. Is this modular organization a general principle of intelligent systems, or an accident of biological evolution? Using circuit analyses across 46 tasks in these four domains, we show that LLMs develop a modular architecture mirroring the brain. This convergence suggests modularity is a general principle of intelligence. Second, analyzing large reasoning models, we show that the number of reasoning steps they take predicts human reaction times across seven diverse tasks. This holds both within tasks, reflecting item difficulty, and across tasks, capturing broad differences in cognitive demand.

David Graus, University of Amsterdam

From Legal Text to Executable Decision Models: Evaluating Structured Representations for Legal Decision Model Generation

Tuesday 16th June 2026, 15:30. Room L2.06 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

This talk presents recent work on turning legal text into executable decision logic, and outlines how I’m planning to extend this line of research into compliance and regulatory change. I first summarize my ICAIL 2026 paper, From Legal Text to Executable Decision Models, where I use a real-world dataset of 95 production decision models from the Dutch Environment and Planning Act to evaluate how intermediate structured representations (semantic roles, input–output constraints) affect LLM-based generation of executable decision graphs. Enriching legal text with I/O constraints substantially improves structural similarity to gold models and achieves around 50% functional equivalence, while producing more compact models that remove redundant pass-through logic. I then sketch a few follow-up directions I am working on/thinking about: (1) using LLMs to extract structured compliance controls from overlapping regulations (NIS2, BIO2, ISO 27001, NIST) against a manually curated control framework; (2) detecting and resolving conflicting rules across frameworks, inspired by work on defeasible logic and normative conflicts; and (3) adapting executable rules to evolving law via LLM-generated regulatory “delta reports” and semantic drift analysis, with changes propagated into a living compliance register.

Esther Ploeger, Utrecht University

Two Perspectives on Diversity in NLP Evaluation

Tuesday 9th June 2026, 15:30. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

Evaluating NLP models requires more than checking whether they perform well on a handful of convenient examples. To meaningfully assess general model capabilities, we need benchmarks that capture a diverse range of inputs. How can we systematically design benchmarks that reflect such diversity? The first part of this talk examines language diversity in multilingual NLP evaluation. I outline methods for identifying when benchmarks overrepresent closely related languages and present strategies for ensuring broader linguistic coverage, inspired by linguistic typology. The second part turns to diversity within a single language. Using machine translation evaluation as a case study, I demonstrate how low benchmark diversity can distort evaluation outcomes and limit what we can reliably infer about model performance. Taken together, these two perspectives emphasize the importance of diversity in future benchmark design.

Sarath Sivaprasad, CISPA Helmholtz Center for Information Security

Heuristics in LLM Decision Making

Tuesday 19th May 2026, 15:30. Room TBA at LAB42, Amsterdam Science Park, plus live streaming on Zoom. Video

When large language models are deployed in the real world with vast possible action spaces, what guides their choice of a single next action? In this talk we delve into the heuristics underlying LLM response sampling. Similar to human cognition, LLMs rely on two interacting components: a descriptive component that reflects the statistical distribution of possibilities, and a prescriptive component that reflects an implicit value-weighted ideal. This dual structure also appears in how models represent prototypes, mirroring human prototype theory and fast, System-1-like judgments. As a result, LLMs act as value optimizers, consistently shifting their samples toward high-value or idealized options. This can potentially explain aspects of their real-world behavior, such as being greedy explorers and showing value bias in how they pick options. We will discuss empirical evidence across concepts and model families, the mechanisms driving these biases, and the implications for reasoning, exploration, alignment, and safe deployment of value-guided generative systems.

Workshop + Defence (Oskar van der Wal)

Workshop "Measuring and mitigating bias in AI"

Wednesday April 29th, 11am-3pm. Chirurgisch Theater, Universiteitsbibliotheek, UvA. This event is in-person only.

On Wednesday April 29 at 4pm, Oskar van der Wal will defend his PhD thesis titled "Taking a Step Back: Measuring and Mitigating Bias in Language Models" in the Agnietenkapel. Before that, there will be a workshop on the theme of "Measuring and mitigating bias in AI". Register here for the workshop to qualify for free lunch: https://amsterdamnlp.github.io/workshop/

Jos van Campen (MD Geriatrician), Miriam Goudsmit (Clinical psychologist), OLVG

Dementia in Older Adults with a Migration Background

Tuesday 21st April 2026, 15:30. Room L1.02 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

Amsterdam’s elderly population is growing, with many residents born outside the Netherlands. Aging often brings multiple chronic conditions, including dementia, a progressive cognitive decline affecting daily living. By 2040, nearly 500,000 people in the Netherlands are expected to have dementia, with rates rising faster among migrants, especially Turkish and Moroccan Dutch. To address this, OLVG Amsterdam established a specialized outpatient clinic for older adults with a migration background. Diagnosing dementia in this group is challenging due to language barriers, low literacy, limited health literacy, and cultural taboos. Standard cognitive tests are often unsuitable, prompting the development of the Cross-Cultural Dementia Screening (CCD), a culture-fair tool for low-educated individuals. CCD enabled research such as the SYMBOL study, which found dementia prevalence 2–4 times higher among first-generation migrants, linked to risk factors like education, hypertension, diabetes, and obesity. Memory clinic assessments include interpreters, interviews with patients and relatives, physical exams, lab tests, imaging, cognitive tests (CCD, RUDAS), and informant questionnaires (IQCODE). Existing datasets support further research. We will address the challenges we encounter in practice where we think AI research could help us.

Jaap Jumelet, University of Groningen

MultiBLiMP: Massively Multilingual Linguistic Evaluation

Tuesday 14th April 2026, 15:30. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

Large language models are becoming increasingly multilingual, but evaluation of their linguistic ability remains limited to a small set of high-resource languages. Resolving this lack of wide linguistic evaluation will have benefits for two research directions: multilingual NLP developers would be able to test in more rigorous detail the fluency of their model on low-resource languages, and computational linguists would obtain resources to investigate questions on typology and language acquisition at a multilingual scale with greater granularity. These two directions have been our driving force in developing a massively multilingual benchmark, called MultiBLiMP. In my talk I will describe our approach, giving an overview of prior findings from MultiBLiMP v1, which focused on agreement violations, and our current approach to creating MultiBLiMP v2, focused on word order.

Kanishka Misra, UT Austin

Hypothesis generation as a bridge between Human and Machine CogSci

Tuesday 2nd April 2026, 15:30. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom (Recording).

The success of LMs in demonstrating non-trivially interesting linguistic behavior has raised a lot of excitement about the transfer of insights to (human) language science. But in order for this excitement to materialize, appropriate bridges need to be built between studies on computational models (Machine Cognitive Science) and those on humans (Human Cognitive Science). In this talk, I will discuss one possible bridge between the two systems: the development of hypothesis generation methods from which we can obtain novel predictions about human language that are sufficiently specific to be tested in the lab. I will present a case study where we used LMs trained on child-directed speech to generate novel hypotheses about the generalization of a novel dative verb to the alternate construction, given exposure to only one dative construction. That is, if pilked is observed in the double-object (She pilked me the ball), then under what conditions is it also found acceptable in the prepositional-object (He pilked a book to her)? Our method yields two new experiments that follow from our generated hypotheses, which we propose to test with human learners in the lab.

Claire Stevenson, University of Amsterdam

Learning to solve analogies: why do children excel where AI models fail?

Tuesday 31st March 2026, 15:30. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

Recent work with large language models (LLMs) concludes that analogical reasoning, using what you know about one thing to infer knowledge about a new, somehow related instance, has emerged in these systems. My lab has conducted a series of behavioural and mechanistic‑interpretability studies to investigate whether analogical reasoning has indeed emerged, and, if so, whether the developmental phenomena resemble those of humans or follow a different trajectory. We provide evidence of similarities in children’s and LLMs’ development in learning to solve analogies, while also highlighting key differences. I will propose a theory of how LLMs’ reasoning abilities are developing and conclude with a discussion of developmental insights that could help AI models achieve human‑like analogical reasoning.

Mayank Jobanputra, Saarland University & University of Edinburgh

Born a Transformer -- Always a Transformer? On the Effect of Pretraining on Architectural Abilities

Tuesday 17th March 2026, 15:30. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

Transformers have theoretical limitations in modeling certain sequence-to-sequence tasks, yet it remains largely unclear if these limitations play a role in large-scale pretrained LLMs, or whether LLMs might effectively overcome these constraints in practice due to the scale of both the models themselves and their pretraining data. We explore how these architectural constraints manifest after pretraining, by studying a family of tasks inspired by Liu et al. [2024a]. We use a recently proposed framework for studying length generalization [Huang et al., 2025] to provide guarantees for each of our settings. Empirically, we observe an asymmetry, where pretrained models are better at retrieving tokens to the right (induction) rather than the left (anti-induction) of a query token. This asymmetry disappears upon targeted fine-tuning if length-generalization is guaranteed by theory. Mechanistic analysis reveals that this asymmetry is connected to the differences in the strength of induction versus anti-induction circuits within pretrained transformers. We validate our findings through practical experiments on real-world tasks demonstrating reliability risks. Our results highlight that pretraining selectively enhances certain transformer capabilities, but does not overcome fundamental length-generalization limits.

Elisa Bassignana, IT University of Copenhagen, Pioneer Center for Artificial Intelligence

The AI Gap: How Socioeconomic Status Affects Language Technology Interactions

Tuesday 10th March 2026, 15:30. Room L2.06 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

Socioeconomic status (SES) fundamentally influences how people interact with each other and, more recently, with digital technologies like large language models (LLMs). While previous research has highlighted the interaction between SES and language technology, it was limited by reliance on proxy metrics and synthetic data. We survey 1,000 individuals from ‘diverse socioeconomic backgrounds’ about their use of language technologies and generative AI, and collect 6,482 prompts from their previous interactions with LLMs. We find systematic differences across SES groups in language technology usage (i.e., frequency, performed tasks), interaction styles, and topics. Higher SES entails a higher level of abstraction, conveying requests more concisely, and topics like ‘inclusivity’ and ‘travel’. Lower SES correlates with higher anthropomorphization of LLMs (using “hello” and “thank you”) and more concrete language. Our findings suggest that while generative language technologies are becoming more accessible to everyone, socioeconomic linguistic differences still stratify their use to create a digital divide. These differences underscore the importance of considering SES in developing language technologies to accommodate varying linguistic needs rooted in socioeconomic factors and limit the AI Gap across SES groups.

Filip Ilievski, Vrije Universiteit Amsterdam

Analogy: The Hidden Architecture of Storytelling

Tuesday 10th February 2026, 15:00. Room L2.07 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

Analogy drives human generalization, yet NLP research has largely overlooked structural analogies in favor of simpler proportional ones. This talk bridges cognitive psychology and AI to address this gap. I will introduce ARN—a cognitively aligned benchmark for reasoning in narratives—and discuss the findings of our systematic study of LLM analogical abilities. Subsequently, I will discuss methodologies for enhancing LLM structural mapping through abstraction, causal modeling, and optimization, and their effects on narrative analogy identification. Finally, the talk will explore the future of analogical reasoning in multimodal domains, extending these insights to video, internet memes, and abstract puzzles.

Federico Adolfi, Ernst-Strüngmann Institute for Neuroscience (Max-Planck Society)

A computational perspective on the challenge of inner interpretability

Tuesday 27th January 2026, 16:00. Room L2.06 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

Post-hoc inner interpretability plays an increasingly central role at the intersection of modern AI and the cognitive and brain sciences. As we grapple with interpretability illusions while scaling up interpretability heuristics, we often lack clarity about the problems these heuristics are deployed to solve and what their solutions should look like. These considerations are relevant to the interpretation of artificial and natural systems, and impact our assessment of their alignment. To sharpen, conceptually and formally, the problems that interpretability is tasked with solving and what it takes to get there, I will make the case for an integrative approach combining complexity-theoretic and experimental work. I will spell out one such strategy to guide the discovery of (i) interpretability methods with adequate performance and (ii) knowledge of the conditions that make it feasible, to complement existing approaches. As a case study, I will showcase work on circuit discovery for inner interpretability. I will survey its goals and obstacles, and present results that help explain the promising and sometimes puzzling performance of current heuristics. While previewing frameworks for a two-way exchange of ideas and tools between mechanistic interpretability and cognitive neuroscience, I will move towards a computational meta-theory of complex-systems interpretability.

Tomáš Musil, Charles University

Independent Component Analysis of Language Model Semantics

Tuesday 16th December 2025, 16:00. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

We investigate how Independent Component Analysis (ICA) can reveal semantic structure in transformer language models. Applied to Llama models across multiple layers, ICA extracts components that group semantically related words more effectively than Principal Component Analysis (PCA). We evaluate both methods using automated tests: a word intruder detection task and a category reconstruction test that checks whether identified patterns can be recognized and extended. ICA shows substantially better performance, with middle layers demonstrating the clearest semantic organization. The evaluation framework we develop enables systematic analysis of model representations without extensive human annotation. Our results suggest that semantic features in these models emerge as statistically independent components, enabling analysis of semantic structure without prior commitments about the nature of meaning.

Duarte Alves, Instituto Superior Técnico Lisboa

Towards Adaptable Multilingual Language Models

Tuesday 2nd December 2025, 16:00. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

With rapid advances in language technologies, there is a growing demand for systems that operate effectively in multilingual settings. Yet current models remain predominantly English-centric and difficult to adapt, limiting their usefulness in multilingual scenarios not considered during their initial design. This work addresses this gap by exploring adaptation methods for multilingual tasks and by developing models that better support multilingual applications. We first measure the impact of adaptation techniques in machine translation, showing that while fine-tuning improves translation quality, it harms in-context learning; to overcome this, we introduce a fine-tuning strategy that mixes in-context examples to recover both abilities. We then present a recipe for building translation-oriented models by extending pre-training with monolingual and parallel data and fine-tuning on diverse translation instructions, yielding Tower, which outperforms open-source alternatives and competes with proprietary models. Building on this recipe, we develop EuroLLM, an open language model specialized for European languages. Finally, we revisit multilingual encoders in light of recent decoder-only advances and systematically study dataset and training choices, resulting in the EuroBERT family, which matches or surpasses competing models across multilingual, mathematical, and code tasks.

Mert Yazan, Hogeschool van Amsterdam

The New Paradigm of Information Access: Conversations, Perceptions, and Personalization

Tuesday 25th November 2025, 16:00. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

The way we access information has undergone a rapid change with the influx of LLM-backed chatbots. In a short period of time, chatbots shifted user expectations drastically by providing an experience where the information is presented in a conversational and easily consumable manner. While chatbots did not increase the accuracy of information access, they offer convenience. Chatbots are present in many domains, from education to medicine to customer service. The pervasive nature of LLMs is even visible in academia: almost all new research in any NLP subfield, one way or another, incorporates LLMs. Most of the attention regarding user experience is based on improving how LLMs function: agentic systems, tool usage, and reducing hallucinations, to name a few. However, how users perceive this new paradigm of information access has not been investigated thoroughly. The conversational nature of the interactions brings the important question of how users express their needs and how LLMs understand them. Given the friendly style of chatbots and personalization, users form a social tie with the chatbot and process information through that tie. Therefore, to better align chatbots, we have to take personal backgrounds (digital literacy, education, previous experience, age, etc.) into account and study how the conversational style of a chatbot influences user perceptions.

Dennis Ulmer, ILLC, University of Amsterdam

From Verbalized To Anthropomimetic Uncertainty in LLMs

Tuesday 18th November 2025, 16:00. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

Human users increasingly communicate with large language models (LLMs). However, the trustworthiness and perceived legitimacy of LLMs are undermined by the frequent overconfidence in their output, especially when its reliability is questionable. Verbalized uncertainty is the expression of confidence with linguistic means, an approach that integrates perfectly into language-based interfaces, but currently falls short of its potential. In this talk, I will give a brief overview of uncertainty communication in humans, and how the current state of research in NLP overlooks its nuances. I further analyze data and training biases that shape uncertainty expressions in LLMs in unintuitive ways, and discuss future research directions towards anthropomimetic uncertainty: uncertainty communication in LLMs that more closely imitates that of humans in order to avoid unexpected behaviors and to increase reliability and trust. To this end, I present preliminary results of a novel method to finetune LLMs towards fluent and calibrated uncertainty expressions.

Gaurav Kamath, McGill University

Measuring Word Meaning Change Across Time and Speaker Age

Tuesday 21st October 2025, 16:00. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

A central question in the study of language change is whether or not it is generational. Under this picture, language change is an iterative, generation-by-generation process: new generations of speakers introduce innovations, while older generations maintain their prior linguistic patterns, and the language evolves as the newer generations replace older ones. Conversely, language change could be a zeitgeist phenomenon, in which changes are universally adopted by speakers simultaneously, across ages and generational groups. In this talk, I present work recently published in PNAS, which asks this question in the context of word meaning change. We analyze meaning change in over 100 words across more than 7.9 million U.S. congressional speeches, to observe whether, when a word sense rises or falls in prominence, adult speakers from different generations uniformly adopt it, or those from older generations conserve their prior usage. We use masked language models to identify different senses of each word, and then model the prevalence of each of these word senses as a function of time and speaker age. We find that most words show only a small effect of speaker age; across almost 140 years of Congress, older speakers typically take longer than younger speakers to follow changes in word usage, but nevertheless do so within a few years. Our findings suggest that despite minor age-based differences, word meaning change among adults is broadly a zeitgeist process, and that older adult speakers are readily able to adopt new word usage patterns.

Davide Ceolin, Centrum voor Wiskunde en Informatica (CWI)

Navigating the Political Compass: Evaluating Multilingual LLMs across Languages and Nationalities

Tuesday 14th October 2025, 16:00. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

Large Language Models (LLMs) have become ubiquitous in today’s technological landscape, boasting a plethora of applications and even endangering human jobs in complex and creative fields. One such field is journalism: LLMs are being used for summarization, generation, and even fact-checking. However, in today’s political landscape, LLMs could accentuate tensions if they exhibit political bias. In this work, we evaluate the political bias of the 15 most used multilingual LLMs via the Political Compass Test. We test different scenarios, where we vary the language of the prompt while also assigning a nationality to the model. We evaluate models on the 50 most populous countries and their official languages. Our results indicate that language has a strong influence on the political ideology displayed by a model. In addition, smaller models tend to display a more stable political ideology, i.e. ideology that is less affected by variations in the prompt. This is joint work with Chadi Helwe (KAUST) and Oana Balalau (INRIA).

Alexandre Kabbach, Chubu Gakuin University

Normality and Intelligence: revisiting the Turing test

Thursday 2nd October 2025, 16:30. Room L2.06 at LAB42, Amsterdam Science Park, plus live streaming on Zoom. Note the unusual day, time and room.

Joint LiRa-CLS talk.

In this presentation, I propose to revisit the Turing test through the concept of *normality*. My core argument is that the statistical interpretation of the normal (understood as the *average* both in the normative and mathematical sense of the term) proves useful for understanding the Turing test in at least two ways. First, in the sense that the Turing test targets normal/average rather than exceptional human intelligence, so that successfully passing the test requires building machines that "make mistakes" and display imperfect behavior just like normal/average humans. Second, in the sense that the Turing test is a statistical test where judgments of intelligence are never carried out by a single "average" judge (understood as non-expert) but always by a full jury. As such, the notion of "average human interrogator" that Turing talks about in his original paper should be understood primarily as referring to a mathematical abstraction made of the normalized aggregate of individual judgments of multiple judges. In short, this presentation argues that the Turing test is a test of *normal intelligence* as assessed by a *normal judge* characterizing the average judgment of a pool of human interrogators. Its conclusions are twofold. First, it argues that large language models such as ChatGPT are unlikely to pass the Turing test as those models precisely target exceptional rather than normal/average human intelligence. As such, they constitute models of what it proposes to call *artificial smartness* rather than artificial intelligence *per se*, insofar as they deviate from the original goal of Turing for the modeling of artificial minds. Second, it argues that the core question of whether the Turing test can contribute anything to the understanding of human cognition is that of whether the human mind is really reducible to the normal/average mind, a question which largely extends beyond the Turing test itself and questions the conceptual underpinnings of the normalist paradigm it belongs to.

For more information, see the pages of the ILLC's LIRa (Logic and Interactive Rationality) Seminar: https://projects.illc.uva.nl/lgc/seminar/2025/06/joint-cls-and-lira-session-alexandre-kabbach

Melanie Mitchell, Santa Fe Institute

AI's Challenge of Understanding the World

Friday 16th May 2025, 15:00. Room L1.02 at LAB42, Amsterdam Science Park. Note the unusual day, time and room.

I will survey a debate in the artificial intelligence (AI) research community on the extent to which current AI systems can be said to "understand" language and the physical and social situations language encodes. I will describe arguments that have been made for and against such understanding, hypothesize about what humanlike understanding entails, and discuss what methods can be used to fairly evaluate understanding and intelligence in AI systems.

For more information, see the page of Amsterdam Lectures in AI and Society 2025: https://clclab.netlify.app/2025/05/07/alias2025

Jirui Qi, University of Groningen

Are LLMs consistent across languages? An empirical and model-internal analysis of retrieval augmented generation (RAG) in multilingual contexts.

Tuesday 22nd April 2025, 16:00. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

Retrieval-augmented generation (RAG) with large language models (LLMs) has demonstrated strong performance in multilingual question-answering (QA) tasks by leveraging relevant passages retrieved from corpora. In multilingual RAG (mRAG), the retrieved passages can be written in languages other than that of the query entered by the user, making it challenging for LLMs to effectively utilize the provided information. Recent research suggests that retrieving passages from multilingual corpora can improve RAG performance, particularly for low-resource languages. However, the extent to which LLMs can leverage different kinds of multilingual contexts to generate accurate answers, independently from retrieval quality, remains understudied. In this paper, we conduct an extensive assessment of LLMs' ability to (i) make consistent use of a relevant passage regardless of its language, (ii) respond in the expected language, and (iii) focus on the relevant passage even when multiple "distracting" passages in different languages are provided in the context. Our experiments with four LLMs across three QA datasets covering a total of 48 languages reveal a surprising ability of LLMs to extract the relevant information from out-language passages, but a much weaker ability to formulate a full answer in the correct language. Our analysis, based on both accuracy and feature attribution techniques, further shows that distracting passages negatively impact answer quality regardless of their language. However, distractors in the query language exert a slightly stronger influence. Taken together, our findings deepen the understanding of how LLMs utilize context in mRAG systems, providing directions for future improvements.

Baohao Liao, Language Technology Lab, University of Amsterdam

2D Efficient Serving of Reasoning Models

Tuesday 18th March 2025, 16:00. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

Deploying large, long chain-of-thought reasoning models, like DeepSeek-R1 and OpenAI o1, efficiently presents new challenges for companies and all users. In this talk, I will introduce two of my recent works that focus on optimizing memory usage and accelerating inference speed for these models:

1. Model Compression for Reduced GPU Memory – In ClusComp: A Simple Paradigm for Model Compression and Efficient Finetuning, I propose a simple yet highly effective method that achieves state-of-the-art compression quality while enabling full finetuning of a 70B model on a single A6000-48GB GPU.

2. Faster Inference for Reasoning Models – Reward-Guided Speculative Decoding for Efficient LLM Reasoning explores how collaboration between large and small reasoning models can significantly speed up inference (up to 4.4x fewer FLOPs).
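
As a rough illustration of how a large and a small reasoning model might collaborate, the sketch below lets a cheap draft model propose each reasoning step and only falls back to the expensive target model when a reward function scores the draft below a threshold. This is a minimal sketch under stated assumptions, not the method from the paper: the function `reward_guided_decode`, the threshold rule, and the toy stand-ins `toy_draft`, `toy_target` and `toy_reward` are all hypothetical names introduced here.

```python
from typing import Callable, List, Tuple

def reward_guided_decode(
    prompt: str,
    draft_step: Callable[[str], str],      # cheap model: context -> next step
    target_step: Callable[[str], str],     # expensive model: context -> next step
    reward: Callable[[str, str], float],   # scores a candidate step in context
    threshold: float = 0.5,                # assumed acceptance rule
    max_steps: int = 4,
) -> Tuple[List[str], int]:
    """Generate steps one by one, calling the expensive model only on rejected drafts."""
    steps: List[str] = []
    target_calls = 0
    for _ in range(max_steps):
        context = prompt + "".join(steps)
        candidate = draft_step(context)
        if reward(context, candidate) < threshold:
            # Draft rejected: regenerate this step with the large model.
            candidate = target_step(context)
            target_calls += 1
        steps.append(candidate)
    return steps, target_calls

# Toy stand-ins (hypothetical): steps are tagged by position and by who wrote them.
def toy_draft(context: str) -> str:
    return f"d{context.count('|')}|"

def toy_target(context: str) -> str:
    return f"T{context.count('|')}|"

def toy_reward(context: str, candidate: str) -> float:
    # Pretend the draft model is only reliable on even-numbered steps.
    return 0.9 if context.count("|") % 2 == 0 else 0.2

steps, target_calls = reward_guided_decode("Q: ", toy_draft, toy_target, toy_reward)
print(steps, target_calls)
```

In this toy run the large model is called for only two of the four steps; any saving in a real system would depend on how often drafts are accepted and on the cost of the reward model.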

Julius Cheng, University of Cambridge

Similarity-Augmented Prediction Methods for Neural Machine Translation

Tuesday 11th March 2025, 16:00. Room L1.01 at LAB42, Amsterdam Science Park (note the unusual room), plus live streaming on Zoom.

In natural language, there are usually many ways to say the same thing: the answer to a question can be said multiple ways, and there are many good translations of the same sentence. Training language models (LMs) with maximum likelihood estimation on large and diverse corpora leads to issues such as high entropy and poorly calibrated probabilities. There is a growing body of work that addresses this by analyzing distributions in terms of semantic space rather than token space by measuring similarities between possible outputs. In this talk, I make the case for and present current progress on these "similarity-augmented methods", including my own work on 1) minimum Bayes risk for prediction, 2) similarity-sensitive entropy for uncertainty quantification, and 3) Bayesian optimization + Gaussian process regression for reranking.
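
To make the idea of minimum Bayes risk prediction concrete, the sketch below picks, from a set of sampled outputs, the one that is on average most similar to all the others. It is a minimal sketch, assuming a toy token-overlap F1 as the similarity (utility) function; the names `token_f1` and `mbr_select` are illustrative and not taken from the speaker's work, where a learned translation metric would typically play this role.

```python
from typing import Callable, Sequence

def token_f1(hyp: str, ref: str) -> float:
    """Toy similarity: F1 overlap between the token sets of two strings."""
    h, r = set(hyp.split()), set(ref.split())
    overlap = len(h & r)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(h), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

def mbr_select(
    candidates: Sequence[str],
    utility: Callable[[str, str], float] = token_f1,
) -> str:
    """Return the candidate with the highest average utility against the others."""
    def expected_utility(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / max(len(others), 1)

    best_index = max(range(len(candidates)), key=expected_utility)
    return candidates[best_index]

# Samples stand in for model outputs; the last one is an outlier.
samples = [
    "the cat sat on the mat",
    "the cat is on the mat",
    "a cat sat on the mat",
    "dogs bark loudly",
]
best = mbr_select(samples)
print(best)
```

The selected output is the "consensus" sample rather than the single most probable one, which is why the choice of similarity function matters so much for this family of methods.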

Laura Ruis, University College London

From Tokens to Thought: How do LLMs learn to reason?

Tuesday 25th February 2025, 16:00. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

The past years of progress are driven by the increasing scale of datasets, computational power, and model size. While large language models (LLMs) are rapidly saturating benchmarks, this talk takes a step back to understand how they learn. Now that the number one rule of machine learning does not apply anymore --- we cannot separate train from test in the way we used to --- our understanding of generalisation must evolve. To shed light on how LLMs might be learning to reason from data, I discuss my work which shows that the way reasoning skills are acquired from data is fundamentally different from the acquisition of factual information. We show models can learn from procedural knowledge in pretraining, indicating they acquire generalisable strategies from next-token prediction.

Martha Lewis, ILLC, University of Amsterdam

Compositional Approaches to Modelling Language and Concepts

Wednesday 12th February 2025, 16:00. Room L3.33 at LAB42, Amsterdam Science Park, plus live streaming on Zoom. Note the unusual day and unusual room.

Recent neural approaches to modelling language and concepts have proven quite effective, with a proliferation of large models trained on correspondingly massive datasets. However, these models still fail on some tasks that humans, and symbolic approaches, can easily solve. Large neural models are also, to a certain extent, black boxes - particularly those that are proprietary. There is therefore a need to integrate compositional and neural approaches, firstly to potentially improve the performance of large neural models, and secondly to analyze and explain the representations that these systems are using. In this talk I will present results showing that large neural models can fail at tasks that humans are able to do, and discuss alternative, theory-based approaches that have the potential to perform more strongly. I will give applications in language, reasoning, and vision. Finally, I will present some future directions in understanding the types of reasoning or symbol manipulation that large neural models may be performing.

Raffaella Bernardi, University of Trento

The Interplay between Language and Reasoning

Thursday 6th February 2025, 16:30. Room L0.06 at LAB42, Amsterdam Science Park, plus live streaming on Zoom. Note the unusual day, unusual time and unusual room.

Large Language Models, and ChatGPT in particular, have recently grabbed the attention of the community and the media. Now that they have reached high language proficiency, attention has been shifting toward their reasoning capabilities. It has been shown that ChatGPT can carry out some simple deductive reasoning steps when provided with a series of facts out of which it is tasked to draw some inferences. In this talk, I will argue for the need for models whose language generation is driven by an implicit reasoning process and a communication goal. To support my claim, I will present two papers recently produced within my group: one evaluates LLMs' formal reasoning skills (Bertolazzi et al., 2024) and the other focuses on LLMs' information-seeking strategies (Mazzaccara et al., 2024); to this end, we take syllogisms and the 20-Questions game as test beds. These tasks have been used extensively in cognitive sciences to study human reasoning skills, hence they provide us with a variety of experiments to inspect the language and reasoning interplay in LLMs.

Mini-workshop for Rochelle Choenni's PhD defense

Wednesday 22nd January 2025, 16:00-18:00. Room OMHP C0.23 at Oude Manhuispoort 4-6, Center campus (note the unusual location and time!).

16h00-16h50: Arianna Bisazza (University of Groningen): Studying Language Evolution and Acquisition with Modern Neural Networks
17h00-17h50: Goran Glavaš (University of Würzburg): How Many Words is A Picture Really Worth? On Training and Evaluating Multilingual Vision-Language Models

Studying Language Evolution and Acquisition with Modern Neural Networks

Arianna Bisazza (University of Groningen)

Why do human languages look the way they do? And what makes us so good at learning language as we grow up? In this talk, I'll propose that modern NNs are a valuable tool to simulate and study processes of human language evolution and acquisition, provided they are used in the right way. That means: under controlled setups where training data, model architecture, and learning setup are known and can be changed across experiments. I will then present two lines of research following this approach, namely: (1) simulating processes of language change using small NN-agents that learn to communicate with pre-defined artificial languages, and (2) simulating the acquisition of syntax by training LMs on more realistic input data, such as child-directed language. After presenting some of my work in these directions, I'll end with a discussion of the value of interdisciplinarity and the importance of experimenting in small controlled setups, rather than focusing all our efforts on the evaluation of Large Pre-trained Language Models.

How Many Words is A Picture Really Worth? On Training and Evaluating Multilingual Vision-Language Models

Goran Glavaš (University of Würzburg)

Large Vision-Language Models (LVLMs), commonly obtained by aligning a pretrained visual encoder (e.g., a Vision Transformer, ViT) to a pre-trained large language model (LLM), have recently led to impressive results not only in image captioning, but also on a wide range of visual understanding and reasoning tasks (e.g., visual question answering). Nonetheless, there are a number of factors involved, ranging from the architecture of the alignment module to the exact "training mix" (i.e., training tasks and data) that strongly determine the effectiveness of the resulting LVLM. Moreover, LVLMs (much like their text-only counterparts), are not inherently multilingual and suffer from hallucination. In this talk, I'll explore training and evaluation protocols for LVLMs, focusing in particular on (i) efficiently training competitive massively multilingual LVLMs, (ii) training with grounding objectives, reported to reduce hallucinative tendencies of LVLMs, and (iii) pitfalls of existing LVLM evaluation and possible remedies.

Ana Lucic, ILLC, University of Amsterdam

Counterfactual explanations for Structured Data

Tuesday 14th January 2025, 16:00. Room L3.36 at LAB42, Amsterdam Science Park.

Model explainability has become an important problem in artificial intelligence (AI) due to the increased effect that algorithmic predictions have on humans. Explanations can help users understand not only why AI models make certain predictions, but also how these predictions can be changed via counterfactual explanations. Given a data point and a trained model, we want to find the minimal perturbation to the input such that the prediction changes. We frame the problem of finding counterfactual explanations as a gradient-based optimization task and first focus on tree ensembles. We then extend our method to accommodate graph neural networks (GNNs), given the increasing promise of GNNs in real-world applications such as fake news detection and molecular simulation.
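
The optimization view of counterfactual explanations can be sketched as follows: starting from the input, take gradient steps that push the prediction towards the other class while penalising distance from the original point, and stop as soon as the prediction flips. This is a minimal sketch, assuming a simple logistic regression classifier as a stand-in for the tree ensembles and GNNs discussed in the talk; the function names, the weights and the hyperparameters `lam` and `lr` are illustrative assumptions.

```python
import numpy as np

def predict_proba(x: np.ndarray, w: np.ndarray, b: float) -> float:
    """Probability of class 1 under a logistic regression model (assumed stand-in)."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def find_counterfactual(x, w, b, lam=0.1, lr=0.1, max_steps=500):
    """Gradient descent on cross-entropy to the opposite class + lam * ||x' - x||^2."""
    x = np.asarray(x, dtype=float)
    original = predict_proba(x, w, b) >= 0.5
    target = 0.0 if original else 1.0
    x_cf = x.copy()
    for _ in range(max_steps):
        p = predict_proba(x_cf, w, b)
        if (p >= 0.5) != original:
            break  # prediction has flipped: stop to keep the perturbation small
        # Gradient of the loss with respect to the perturbed input.
        grad = (p - target) * w + 2.0 * lam * (x_cf - x)
        x_cf = x_cf - lr * grad
    return x_cf

w = np.array([1.0, -2.0])   # illustrative model weights
b = 0.0
x = np.array([0.0, 1.0])    # originally predicted as class 0

x_cf = find_counterfactual(x, w, b)
print("original prediction:", predict_proba(x, w, b) >= 0.5)
print("counterfactual:", x_cf, "prediction:", predict_proba(x_cf, w, b) >= 0.5)
print("perturbation size:", np.linalg.norm(x_cf - x))
```

Tree ensembles are not differentiable, so applying this recipe to them requires some differentiable approximation of the model; that step is not shown here.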

André Martins, Instituto Superior Técnico and Unbabel

Dynamic Sparsity and Reranking Laws for Language Generation

Thursday 24th October 2024, 11:00. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom. Note the unusual day and unusual time.

In the first part of the talk, I describe how sparse modeling techniques can be extended and adapted for facilitating dynamic sparsity in neural models, where different neural pathways are activated depending on the input. The building block is a family of sparse transformations induced by Tsallis entropies called alpha-entmax, a drop-in replacement for softmax, which contains sparsemax as a particular case. Entmax transformations are differentiable and (unlike softmax) they can return sparse probability distributions, useful for routing and interpretability. They can also be used to design new Fenchel-Young loss functions, replacing the cross-entropy loss. Variants of these sparse transformations and losses have been applied with success to machine translation, natural language inference, visual question answering, Hopfield networks, reinforcement learning, and other tasks.
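
To illustrate how such a transformation can return exact zeros, here is a minimal NumPy sketch of sparsemax, the special case of alpha-entmax with alpha = 2, next to softmax for comparison. It handles a single score vector only and is meant as an illustration rather than a reference implementation; the function names are chosen here for the example.

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of a 1-D score vector onto the probability simplex."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]            # scores in decreasing order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cumsum    # which sorted entries stay non-zero
    k_max = k[support][-1]                 # size of the support
    tau = (cumsum[k_max - 1] - 1) / k_max  # threshold subtracted from every score
    return np.maximum(z - tau, 0.0)

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

scores = [1.0, 0.8, -1.0]
p_sparse = sparsemax(scores)
p_soft = softmax(scores)
print("sparsemax:", p_sparse)
print("softmax:  ", p_soft)
```

On these scores sparsemax assigns zero probability to the lowest-scoring entry, whereas softmax always gives every entry some positive mass.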

In the second part of the talk, I provide a communication-theoretic perspective of generator-reranker systems. Reranking is a commonly used strategy for making large language models (LLMs) more accurate and for reducing hallucination rates, but to what extent is it able to do so? We draw a parallel between this strategy and the use of redundancy to decrease the error rate in noisy communication channels. We conceptualize the generator as a sender transmitting multiple descriptions of a message through parallel noisy channels. The receiver decodes the message by ranking the (potentially corrupted) descriptions and selecting the one found to be most reliable. We provide conditions under which this protocol is asymptotically error-free even in scenarios where the reranker is imperfect (governed by Mallows or Zipf-Mandelbrot models) and the channel distributions are statistically dependent. We use this framework to obtain reranking laws validated empirically on two real-world tasks using LLMs: text-to-code generation with DeepSeek-Coder 7B and machine translation of medical data with TowerInstruct 13B.

Verna Dankers, University of Edinburgh

Analysing memorisation in classification and translation through localisation and cartography

Wednesday 16th October 2024, 16:00. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom. Note the unusual day.

Memorisation is a natural part of learning from real-world data: neural models pick up on atypical input-output combinations and store those training examples in their parameter space. That this happens is well-known, but which examples require memorisation and where in the millions (or billions) of parameters memorisation occurs are questions that remain largely unanswered. In this talk, I first elaborate on the localisation question by examining memorisation in the context of classification in fine-tuned PLMs, using 12 tasks. Our findings give nuance to the generalisation-first memorisation-second hypothesis dominant in the literature and find memorisation to be a gradual process rather than a localised one. Secondly, I discuss memorisation from the viewpoint of the data using neural machine translation (NMT) models by putting individual data points on a memorisation-generalisation map. I illustrate how the data points' characteristics are predictive of memorisation in NMT and describe the influence that subsets of that map have on NMT systems' performance.

Cassandra Jacobs, University at Buffalo

Understanding constraint satisfaction and prediction in large language models with information flow

Monday 24th June 2024, 16:00. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom. Note the unusual day.

Constraint satisfaction (CS) theories of language processing (Seidenberg & MacDonald, 2006) proposed that interpreting ambiguous sentences like "Kelly saw the boxer at the pet store" requires probabilistically balancing a wide variety of linguistic constraints, especially interactions between lexical information. Today, the scale of large language models (LLMs) allows us to re-evaluate constraint satisfaction theories. In this talk, I present an approach that quantifies the amount of work taking place in the attention matrices to characterize how readily lexical ambiguity is resolved. I first show that changing the ambiguity of a single word in a sentence reshapes the attention matrices largely in accordance with CS theories. Then, I show that lexical ambiguity does not have a uniform effect on the sentence representation depending on the "directionality" of the LLM (e.g., GPT-2 vs. RoBERTa), and that ambiguity manifests differently at different layers. Building on this, I turn to the probabilistic dimension of CS theories and present work that shows that token embeddings can capture human linguistic uncertainty. Further layerwise analyses show that higher layers in LLMs generally provide the best fit to human behavior, but models vary in the information flow trajectories that give rise to probabilistic predictions, and systematically under-estimate the probabilities of human decision-making. I conclude with discussion of in-progress work to better align LLMs with human linguistic decisions.

Tanise Ceron, University of Stuttgart

Evaluating political biases in LLMs: framework, challenges, and societal implications

Tuesday 4th June 2024, 16:00. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

Due to the widespread use of large language models (LLMs) in ubiquitous systems, we need to understand whether they embed a specific worldview and what these views reflect. Recent studies report that, prompted with political questionnaires, LLMs show left-liberal leanings. However, it is as yet unclear whether these leanings are reliable (robust to prompt variations) and whether the leaning is consistent across policies and political leaning. In this talk, I will present the results of our study where we propose a series of tests which assess the reliability and consistency of LLMs' stances on political statements based on a dataset of voting-advice questionnaires collected from seven EU countries and annotated for policy domains. We then evaluate LLMs ranging in size from 7B to 70B parameters and observe to what extent they are consistent in terms of political worldview and political orientation. Finally, I’ll discuss the importance of taking these biases into account, and how they raise relevant design questions in use case applications.

Pushkar Mishra, Meta (Facebook) AI Research

Making Large Language Models Safe: A case study of Llama2

Friday 26th April 2024, 13:00. Room L1.01 at LAB42, Amsterdam Science Park, plus live streaming on Zoom. Note the unusual day and unusual room.

Large Language Models (LLMs) have seen a lot of interest from all over the world, especially since ChatGPT became the fastest growing consumer internet app in history. As we enter a new era of possibilities with AI, new challenges also present themselves. In July of 2023, Meta open-sourced the largest language models to date, making it one of the most important moments in the development of AI. Llama2 was the first LLM of its size and capabilities to be open-sourced; both the base LLM as well as a version fine-tuned for chat were released publicly for everyone from researchers to industry practitioners to leverage. In this talk, I will recap the journey of making Llama2 models safe and robust against misuse in hate speech, misinformation, etc. The talk will cover the technical details of how we defined what safety is for an LLM, the strategies we leveraged to train and fine-tune the models towards being safe, and the evaluations we conducted to verify that we had the level of safety we desired. I will also discuss the challenges that remain, and what the possible directions to address those are.

Paul Röttger, Università Bocconi

Evaluating Values and Opinions in Large Language Models

Thursday 25th April 2024, 16:00. Room L3.33 at LAB42, Amsterdam Science Park, plus live streaming on Zoom. Note the unusual day and unusual room.

Much recent work seeks to evaluate values and opinions in large language models (LLMs), motivated by concerns around real-world LLM applications. For example, politically-biased LLMs may subtly influence society when they are used by millions of people. Such real-world concerns, however, stand in stark contrast to the artificiality of current evaluations using multiple-choice surveys and questionnaires: real users do not ask LLMs survey questions. In my talk, I will present recent work in which we challenge the prevailing constrained evaluation paradigm for values and opinions in LLMs. I will also outline the steps we are now taking to build more realistic unconstrained evaluations for political values and opinions in LLMs.

Anouck Braggaar, Tilburg University

Evaluating Task-oriented Dialogue Systems: A Systematic Review of Measures, Constructs and their Operationalisations

Tuesday 16th April 2024, 16:00. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

With this literature review we aim to give an extensive overview of evaluation methods for task-oriented dialogue systems, paying special attention to practical applications of dialogue systems, for example for customer service. The review (1) provides an overview of the constructs and metrics used in previous work, (2) discusses challenges in the context of dialogue system evaluation and (3) develops a research agenda for the future of dialogue system evaluation. We conducted a systematic review of four databases (ACL, ACM, IEEE and Web of Science), which after screening resulted in 122 studies. Those studies were carefully analysed for the constructs and methods they proposed for evaluation. We found a wide variety in both constructs and methods. The operationalisation in particular is not always clearly reported. Newer developments concerning large language models are discussed in two contexts: to power dialogue systems and to use in the evaluation process. We hope that future work will take a more critical approach to the operationalisation and specification of the constructs used. To work towards this aim, this review ends with recommendations for evaluation and suggestions for outstanding questions.

Matthias Lindemann, University of Edinburgh

Structural Inductive Biases for Better Systematic Generalization with Sequence-to-Sequence Models

Tuesday 2nd April 2024, 16:00. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

Sequence-to-sequence models have been hugely popular in NLP and have been applied to tasks ranging from grapheme-to-phoneme conversion to semantic parsing. However, even with pre-training, standard sequence-to-sequence models often lack structural inductive biases to generalize systematically outside of the distribution they are trained/fine-tuned on. For example, they have been shown to struggle with longer inputs or unseen combinations of seen tokens/phrases. In this talk, I'm going to present two approaches to introducing stronger structural inductive biases that help with systematic generalization.

In the first half of the talk, I focus on semantic parsing and propose a neural architecture that decomposes the seq2seq problem into predicting small fragments of the output and then permuting them into the correct order. One of the technical challenges I address is training the model without supervision for what the fragments are. Despite not having an explicit notion of (syntax) trees, this approach performs well on generalization to deeper recursion than seen during training.

In the second half, I focus on the inductive bias of Finite State Transducers (FSTs), which have been used traditionally in areas such as phonology and morphology. Given a representation of an automatically generated FST and an input string, I propose to pre-train a Transformer to predict the output of the FST on the given input. The experiments show that this leads to a better inductive bias for downstream FST-like tasks. Empirically, the pre-training also makes the model simulate transitions between FST states in its hidden representations - without the model being explicitly trained to do so.
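
For readers unfamiliar with FSTs, the sketch below runs a small deterministic transducer over an input string and records both the output and the sequence of states visited. It is a minimal sketch of the kind of input-output behaviour a model could be pre-trained to predict; the dictionary encoding, the function `run_fst` and the toy rewrite rule (a "b" directly after an "a" becomes "c") are assumptions made for illustration and are not taken from the talk.

```python
from typing import Dict, List, Tuple

# (state, input symbol) -> (next state, output string)
Transitions = Dict[Tuple[int, str], Tuple[int, str]]

def run_fst(transitions: Transitions, text: str, start: int = 0) -> Tuple[str, List[int]]:
    """Apply a deterministic FST to text; return the output and the visited states."""
    state = start
    states = [state]
    output = []
    for symbol in text:
        state, emitted = transitions[(state, symbol)]
        states.append(state)
        output.append(emitted)
    return "".join(output), states

# Toy transducer over the alphabet {a, b}: state 1 means "the previous symbol was a".
toy_fst: Transitions = {
    (0, "a"): (1, "a"),
    (0, "b"): (0, "b"),
    (1, "a"): (1, "a"),
    (1, "b"): (0, "c"),  # rewrite b as c when it directly follows a
}

output, states = run_fst(toy_fst, "abba")
print(output, states)
```

The state sequence returned alongside the output is the kind of latent trajectory that, according to the abstract, the pre-trained model ends up simulating in its hidden representations.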

Christopher Summerfield, University of Oxford

Using language models to help people find common ground

Thursday 14th March 2024, 16:00. Room L3.33 at LAB42, Amsterdam Science Park, plus live streaming on Zoom. Note the unusual day and unusual room.

Technology and democracy have a chequered history. I will talk about an opportunity to use AI to help people find common ground. We trained a system of large language models (LLMs) that received diverse written opinions on a potentially controversial issue to generate a written consensus statement that maximised endorsement from group members. We found that statements written by the LLM were endorsed more readily than those written by humans. When we measured group members' stance before and after the process, we found that afterwards they consistently tended to converge on a common side of the argument. We used tools from NLP to study the properties of the consensus statement that made this possible. We then used the tool to run a virtual citizens' assembly.

Mini-workshop: Evaluation of Dutch Language Models

Tuesday 27th February 2024, 16:00. Room L3.36 at LAB42, Amsterdam Science Park.

16h00-16h45: Wietse de Vries (GroningenNLP): DUMB: A Benchmark for Smart Evaluation of Dutch Models (joint work with Martijn Wieling, Malvina Nissim)
16h45-17h15: Zoë Prins (ILLC, UvA), Blimp-NL: Building a large Dutch corpus to measure knowledge of grammar and grammaticality judgments in language models and humans (joint work with Michelle Suijkerbuijk, Marianne de Heer Kloots, Jelle Zuidema & Stefan Frank -- CLS Radboud & ILLC UvA)

Pia Sommerauer, Vrije Universiteit Amsterdam

Analyzing linguistic subtleties in language models: detecting shifts in connotation and removing biases

Tuesday 20th February 2024, 16:00. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

(Large) language models achieve impressive results on various tasks based on purely distributional data; they still rely on learning (possibly complex and sophisticated) associations between words and their contexts. It is difficult to tell to what degree language models can reflect a human-like understanding of semantics and how specific information is encoded. Can models pick up subtle differences in connotation? To what degree can contextualized language models actually reflect contextual information? And, assuming specific information is captured by the model, can we remove it without changing other information? In this talk, I present two studies that aim to examine what information language models can infer based on distributional data and how we can manipulate this information. Firstly, I will present insights from a study about detecting semantic shifts in connotation between political communities using static and contextualized embeddings. We present a small, expert-informed dataset for synchronic shift detection between political communities in Dutch and English. Our experiments show that static and contextualized models can, if applied in the right way, detect subtle shifts in politically loaded terms. Secondly, I will discuss findings from a study about model debiasing methods that rely on removing specific information from models. We propose a new method for bias removal that interferes less with the embedding space than previously proposed methods. As such, it has potential for causal probing approaches. Both studies raise questions about the representations and behavior of language models.

Jerry Spanakis, Maastricht University

Find and free the law: How NLP can help access to legal resources

Tuesday 28th November 2023, 16:00. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

Law is difficult to grasp, sometimes even for lawyers, let alone citizens. We all face issues related to marriage, debt, and employment, yet we have little to no knowledge about our rights and fundamental legal processes. Expert legal assistance comes at a cost, so there is a need to facilitate justice for all by guaranteeing the right to equal access to legal resources. In this talk, I will present recent work on building the right datasets and on using the right NLP techniques to build useful applications for citizens’ legal issues and questions. More specifically, two enhancements will be presented: (a) a graph-augmented dense statute retriever that incorporates the structure of legislation via a graph neural network to improve retrieval performance, and (b) a retrieval-augmented LLM that provides an end-to-end question answering system.

Miloš Stanojević, DeepMind

The role of syntax in the world of large language models

Tuesday 21st November 2023, 16:00. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

Large Language Models (LLMs) have recently shown impressive results, to the extent that some cognitive scientists are claiming that syntactic theories should be abandoned as an explanation of human language in favour of LLMs. I will provide evidence that syntax is still beneficial in both scientific and engineering pursuits with human language. First, LLMs provide neither a prediction nor an explanation of the universal properties of all human languages, unlike the syntactic theory considered here. Second, neural activity of some brain regions during language processing can be accounted for better by an incremental syntactic parser than by LLM surprisal. Finally, LLMs can work even better if augmented with a syntactic compositional structure. If that is so, you might ask, why is syntax not more popular in NLP? I believe it is because modern hardware accelerators (GPUs and TPUs) are not optimal for tree-like computation, which makes it difficult to scale syntactic models, i.e. syntax is losing a hardware lottery. To address this problem we have created a JAX library, called SynJAX, that makes training of large-scale syntactic models on GPUs and TPUs possible.

Clara Meister, ETH Zürich

On the Use of Language Model Embeddings for Evaluation in Natural Language Generation Tasks

Tuesday 10th October 2023, 16:00. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

A good automatic evaluation metric for language generation ideally aligns strongly with human judgements of text quality. Yet, there is currently a dearth of such metrics, which inhibits the rapid and efficient progress of language generation systems. One exception to this is Mauve. In theory, Mauve measures an information-theoretic divergence between two probability distributions over strings: one representing the language generator under evaluation and the other representing the true natural language distribution. Mauve's authors argue that its success comes from the qualitative properties of their proposed divergence. Yet we show that both in theory and in practice, their proposed estimator of this divergence is quite a poor approximation. This begs the question: why does Mauve work so well? In this talk, I'll discuss our investigation of the empirical design choices behind Mauve that lead to its high correlation with human quality assessments. We find that its use of language model embeddings is critical for its success, and that while it is sensitive to syntactic- and coherence-level features of text, Mauve often ignores surface-level features. I'll discuss the implications of these findings for the trustworthiness of Mauve and for future directions of language generator evaluation metrics.

Roberto Dessì, Facebook AI Research Paris and Universitat Pompeu Fabra

Cross-domain image captioning with discriminative finetuning

Tuesday 27th June 2023, 16:00. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

Neural captioners are typically trained to mimic human image descriptions, without any awareness of the function that such descriptions might have, leading to biased and vague captions. In this talk, I’ll empirically show that adding a self-supervised objective to a neural captioner helps to recover a plain, visually descriptive language that is more informative about image contents. In particular, I’ll describe experiments where we take an out-of-the-box neural captioner, and we finetune it with a discriminative objective. Given a target image, the system must learn to produce a description that enables an out-of-the-box text-conditioned image retriever to identify the target image among a set of candidates.

In terms of similarity to ground-truth human descriptions, the captions emerging from discriminative finetuning lag slightly behind those generated by the non-finetuned model, when the latter is trained and tested on the same caption dataset. However, when the model is used without further tuning to generate captions for out-of-domain datasets, the discriminatively-finetuned captioner generates descriptions that resemble human references more than those produced by the same captioner trained only with supervised learning and without finetuning. I’ll further show that, on the Conceptual Captions dataset, discriminatively finetuned captions are more helpful than either vanilla captions or ground-truth captions for human subjects tasked with an image discrimination task. If time allows, I’ll conclude the talk by drawing a connection between our work and reinforcement learning from human feedback (RLHF), a recently introduced method powering models like ChatGPT and InstructGPT.

Lukas Galke, Max Planck Institute for Psycholinguistics

What makes a language easy to deep-learn?

Tuesday 16th May 2023, 16:00. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

A fundamental property of natural languages is their compositional structure, allowing us to describe new meanings systematically. However, neural networks notoriously struggle with systematic generalization and do not necessarily benefit from compositional structure in emergent communication simulations. In this talk, I will present results of comparing neural networks with humans in learning and generalizing a new language. The experiments closely replicate an artificial language learning study (conducted originally with human participants) with deep neural networks, while evaluating memorization and generalization capabilities with respect to the degree of structure in the input language. The results show striking similarities between humans and deep neural networks: More structured linguistic input leads to more systematic generalization, better convergence between humans and neural network agents, and between different neural agents. After confirming this structure bias with a custom recurrent neural network architecture, we repeat the experiment with a Transformer-based large language model (GPT-3), which shows a similar benefit of structured linguistic input for systematic generalization. These findings show that the underlying structure of languages is crucial for systematic generalization. Due to the correlation between community size and linguistic structure in natural languages, our findings underscore the challenge of automated processing of low-resource languages. Nevertheless, the similarity between humans and machines opens new avenues for language evolution research.

Sarenne Wallbridge, University of Edinburgh

Speech as a multi-channel system: Quantifying perceptual channel value

Tuesday 28th March 2023, 16:00. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

Speech is one of the most complex and intrinsic modalities of human communication. When we speak, we convey information through both the lexical channel of which words are said, and the non-lexical channel of how those words are spoken. The problem of representing lexical information has dominated the field of speech processing and spoken language understanding. The problem of representing non-lexical information, however, has received less focus. The non-lexical channel contains a host of information, some of which serves important communicative functions such as indicating communicative intent or marking novel information, while other features pertaining to a speaker’s environment or identity may be less relevant during communication. Understanding which components of the lexical and non-lexical channels are perceptually salient is crucial for modelling human comprehension of spoken language and could lead to more efficient methods of representing speech.

In my work, I aim to quantify the perceptual value of the lexical and non-lexical components of speech for comprehension, specifically how much they constrain expectations of upcoming communication. In this talk, I will present our investigations into quantifying the value of the lexical and non-lexical channels in spoken dialogue. I will discuss when current language models align with this aspect of perception and when they diverge, as well as how we can use them to study perception. Finally, I will conclude by discussing potential approaches for quantifying the value of lexical and non-lexical information in terms of compression and entropy reduction.

Giada Pistilli, Sorbonne Université and HuggingFace

Ethics of Large Language Models: considerations and case studies

Tuesday 17th January 2023, 16:00. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

In the field of NLP, Large Language Models (LLMs) hold an increasingly important space in industrial and academic research. With their development, researchers and users are increasingly aware of their potential negative effects on society. The trend toward ever-larger models and datasets seems to prevail within this environment, but ethical reflections and considerations do not always accompany it. This talk aims to shed light on the potential risks when using LLMs without the proper precautions. Those same risks exist in a general narrative where AI is still perceived as a "black box" behind which its developers disguise their responsibility to their users. After a brief introduction to ethics as a philosophical discipline, I will illustrate how its conceptual tools help prevent risks related to LLMs and lay the foundation for responsible development in their application cases. Subsequently, I will show how to operationalize ethics when applied to LLM development projects, leaning on the BigScience and BigCode open science case studies.

Tejaswini Deoskar, Utrecht University

Three generalisation problems in NLP

Tuesday 13th December 2022, 16:00. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

In this talk, I will discuss ongoing research on several topics, on the general theme of generalisation in natural language models. First, I will talk about the generalisation problem in analytically complex syntactic parsers, where it is necessary to go beyond supervised models, for instance for parsing out-of-domain data or low-resource languages; specifically, I will present recent results on constructing, on the fly, complex category types (in CCG or other categorial grammars) that are unseen in the training data. Second, I will discuss a use-case for a syntactic parser applied to a new domain: detecting syntactic markers of “agency” in language use. Loss of agency is correlated with psychological conditions like depression or fatigue syndrome, and is often expressed in the language produced by patients (e.g. in excessive use of passives). Automatic detection of such markers can help medical professionals intervene and predict recovery in online treatments. Third, I will discuss recent research on incorporating image-external knowledge for “contextualised” image-captioning: here we develop a generalisable system that can identify broad-coverage external knowledge relevant to an image. The system can generate informative as well as factually-correct captions, and can be applied to various image-language scenarios.

Raquel G. Alhama, Tilburg University

Linguistic Productivity: the Case of Determiners

Tuesday 22nd November 2022, 16:00. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

Having heard "a pimwit", English-speakers know immediately that “the pimwit” is possible, even if they haven’t heard the phrase before. Researchers from diverse theoretical perspectives agree that this type of productivity can be explained with syntactic categories (namely, determiner and noun), but have long debated whether it is necessary to assume that such categories are present from birth, or whether they can instead be learned from the input. In our work, we track determiner production and onset of productivity in a large sample of children. Our approach differs from previous work in at least four interrelated ways. First, we model determiner productivity with a data-driven model that is not pre-equipped with any notion of syntactic categories, and investigate to what extent this model, when trained solely on caregiver child-directed utterances, can reproduce behavioral patterns of children. Second, rather than quantifying the strength of the evidence for abstract grammatical categories in children’s early speech, we propose a new metric that quantifies the onset of grammatical productivity for individual children. Third, to be able to observe the onset, we base our studies on a large longitudinal dataset that allows us to track determiner productivity at early learning stages. Finally, we use our model to identify instances of true generalization, i.e. determiner+noun productions that have not been seen in the input data (and hence cannot be a result of imitation). Results show gradual emergence of determiner productivity in child language, suggesting that the syntactic category is learned from the input in a bottom-up fashion.

Andrey Kutuzov, University of Oslo

What can go wrong with pre-trained language models for semantic change detection

Tuesday 8th November 2022, 16:00. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

Large-scale contextualized language models, both LSTMs and Transformers, are currently often used for semantic change detection. The results are impressive, but are LM-based systems always correct? In this talk, I will qualitatively analyze questionable outputs of such systems, using the example of the degrees of semantic change predicted for English words across 5 decades.

It seems that LM-based methods can often predict high change scores for words which are not undergoing any real diachronic semantic shift in the lexicographic sense of the term. Pre-trained language models are prone to confound changes in lexicographic senses and changes in contextual variance. Notably, this is different from the types of issues observed in methods based on static word2vec-like embeddings. Additionally, contextualized language models often merge together syntactic and semantic aspects of lexical entities. I will discuss such cases in detail, complete with examples, an attempt at their linguistic categorization, and a range of possible future solutions.

Albert Gatt, Utrecht University

Language modelling in low-resource scenarios: Two case studies

Tuesday 11th October 2022, 16:00. Room L3.36 at LAB42, Amsterdam Science Park, plus live streaming on Zoom.

The success of large-scale neural language models has brought about new challenges for low-resource languages. For such languages, training data is not as easily available as it is for languages such as English. To take an example, widely-used multilingual models such as mBERT exclude languages with a small Wikipedia footprint. By the same token, in massively multilingual resources harvested from the web, the data for these languages also tends to be of very low quality.

In this seminar, I will discuss work in progress which addresses low-resource scenarios for multilingual NLP.

First, I describe some efforts towards making existing multilingual models transferrable to new languages, using adversarial techniques. It turns out that the effectiveness of such techniques is strongly influenced by the fine-tuning we perform to adapt models to downstream tasks, as well as by the nature of the tasks themselves.

I will then consider some more recent work on training Transformer models from scratch in a low-resource setting. Here, our research shows that in the absence of very large pretraining datasets, excellent results can be achieved if we trade off limited size in favour of quality and diversity.

Alberto Testoni, University of Trento

Generating Natural Language and Strategic Questions in Multimodal Referential Games

Tuesday 30th August 2022, 16:00. Room B0.204 at Science Park 904, plus live streaming on Zoom.

Recent years have witnessed an explosion of NLP models for many different tasks, both in text-only and multimodal (vision & language) settings. Impressive results have been obtained on multimodal encoders, whereas decoders have received less attention. In my work, I focus on the latter, aiming to study the problem-solving reasoning behind natural language generation. To this end, I take referential grounded dialogue games as a testbed. I will discuss the main issues affecting generative systems and explore how the weaknesses of the encoder affect the choice of the decoder by focusing on the interpretation of negatively answered questions. I will then present a cognitively-inspired re-ranking decoding strategy for promoting the generation of strategic questions. I will compare this strategy to a wide variety of different decoding algorithms proposed in the literature, together with an in-depth analysis of their hyper-parameter configurations. Finally, I will briefly mention some ongoing work exploring how modeling human uncertainty can lead to better natural language generation systems, and an investigation of pragmatic phenomena that allow humans to efficiently solve referential games.

Hila Chefer, Tel Aviv University

Transformer Explainability: obtaining reliable relevancy scores and using them for promoting robustness

Tuesday 28th June 2022, 16:00. Room F1.15 at Science Park 107, plus live streaming on Zoom.

Transformers have revolutionized deep learning research across many disciplines, starting from NLP and expanding to vision, speech, and more. In my talk, I will explore several milestones toward interpreting all families of Transformers, including unimodal, bi-modal, and encoder-decoder Transformers. I will present working examples and results that cover some of the most prominent models, including CLIP, BERT, LXMERT, and ViT. I will then present our recent explainability-driven fine-tuning technique that significantly improves the robustness of Vision Transformers (ViTs). The loss we employ ensures that the model bases its prediction on the relevant parts of the input, rather than supportive cues (e.g., background). This can be done with very little added supervision in the form of foreground masks, or without any such supervision.

Lisa Bylinina, Bookarang

Polarity-sensitivity in large language models

Tuesday 14th June 2022, 16:00. Room F3.20 at Science Park 107, plus live streaming on Zoom.

I will discuss some of my recent (ACL 2022; CogSci 2022) experiments with large language models. We take polarity-sensitivity as a case study and take a closer look at linguistic representations of monolingual (BERT, GPT-2) and multilingual (multilingual BERT, XLM-RoBERTa) pre-trained language models. The overarching question is: what do these models learn about NPI licensing? We test this with simple 'observational' methods and with somewhat more baroque interventional ones and compare (some of) the results with human behavioural data. I hope these experiments lead to a more general discussion about the relation between LM data, psycholinguistic data and linguistic theory. This is joint work with Alexey Tikhonov.

Abdellah Fourtassi, Aix-Marseille University

Towards a quantitative theory of children's communicative development in the wild

Tuesday 31st May 2022, 16:00. Room F0.22 at Science Park 107, plus live streaming on Zoom.

To learn how to communicate with people around them, children have to master the linguistic content (e.g., what words mean) and understand how to encode and decode communicative intents (e.g., how words are used in dialog). While both aspects have been studied extensively over the last few decades, we do not have a complete theory of how they develop and interact. This slow progress is due, in part, to the fact that traditional research methods are often restrictive/de-contextualized and do not reflect children's real learning environment which is largely multimodal, socially embedded, and culturally variable. In this talk, I will argue that opportunities for collaborative data collection/pooling about children's learning in more ecologically valid settings as well as advances in data processing at scale provide new tools that can be utilized to help answer lingering scientific questions, making much more plausible the prospect of a quantitative theory of child communicative development in the wild.

Harm de Vries, ServiceNow Research

Explorations in Task-Oriented Dialogue, Text2Code Models and Continuous Prompt Tuning

Tuesday 26th April 2022, 16:00. Room F1.15 at Science Park 107, plus live streaming on Zoom.

In this talk, I will cover a (diverse) set of ongoing research threads:

How can we pursue open dialogue research with a real group of end users? I'll discuss why this is challenging in the current research landscape, and explain our efforts on releasing a large conversational dataset of Statistics Canada.
I’ll present some surprising findings on continuous prompt-tuning for semantic parsing: our experiments suggest that prompt-tuned language models are capable of outputting formal meaning representations that are very far from the pre-training distribution.
Together with HuggingFace and CMU, we will soon launch an open scientific collaboration around large Text2Code models. I'll briefly explain the current state of Text2Code models and talk about our plans for this summer.

Douwe Kiela, Hugging Face

Improving Multimodal Evaluation and Exploring Foundational Language and Vision Alignment

Tuesday 5th April 2022, 16:00. Room F3.20 at Science Park 107, plus live streaming on Zoom.

In this talk I will cover some recent work that tries to improve how we do model evaluation in multimodal settings, focusing on the new Adversarial VQA and Winoground evaluation datasets. After that, I will talk about our latest vision and language "foundation model", called FLAVA: a single holistic universal transformer that targets all modalities at once and that shows impressive performance on a wide range of tasks.

Maartje ter Hoeve, University of Amsterdam

Towards Interactive Language Modeling

Tuesday 8th March 2022, 16:00. Room F3.20 at Science Park 107, plus live streaming on Zoom.

Interaction between caregivers and children plays a critical role in human language acquisition and development. Given this observation, it is remarkable that explicit interaction plays little to no role in artificial language modeling -- which also targets the acquisition of human language, yet by artificial models. Moreover, an interactive approach to language modeling has the potential to make language models substantially more versatile and to considerably impact downstream applications. Motivated by these considerations, we pioneer the space of interactive language modeling. First we present a road map in which we detail the steps that need to be taken towards interactive language modeling. We then lead by example and take the first steps on this road map, showing the initial feasibility of our approach. As such, this work aims to be the start of a larger research agenda on interactive language modeling.

Grzegorz Chrupała, Tilburg University

Learning language from Peppa Pig

Tuesday 8th February 2022, 16:00. Live streaming on Zoom.

Attempts to computationally model or simulate the acquisition of spoken language via grounding in the visual modality have a long tradition but have gained momentum since around 2015 with the revival of neural networks. Current neural approaches are able to spot associations between the spoken and visual modality, and use these to represent speech and image/video data in a joint vector space. A major limitation of these works is the datasets used to train them. Most consist of static images or videos paired with spoken descriptions of what is depicted, and thus guarantee a strong correlation between speech and the visual world by construction. A child learning a language faces a very different and harder task: in the real world the coupling between the linguistic and the visual is much looser, and often contains confounds in the form of correlations with non-semantic aspects of the speech signal, such as voices of specific people and environmental sounds. The current study is a first step towards simulating such a naturalistic grounding scenario by using a dataset based on the children's cartoon Peppa Pig. We train a simple bi-modal architecture on the portion of the data consisting of naturalistic dialog between characters, and evaluate on segments containing descriptive narrations. Evaluation and analysis results indicate that despite the weak and confounded signal in this training data our model succeeds at learning aspects of the visual semantics of spoken language.

Arthur Bražinskas, University of Edinburgh

Abstractive opinion summarization

Tuesday 11th January 2022, 16:00. Live streaming on Zoom.

Opinion summarization is the automatic creation of text reflecting subjective information expressed in multiple documents, such as user reviews of a product. These short summaries can help users make better purchasing decisions by condensing useful information in hundreds or even thousands of reviews. However, due to the high cost of summary production, datasets large enough for supervised learning were absent until recently. This led to a variety of extractive methods that construct summaries from review sentences. However, these methods often produce incoherent summaries with unimportant details. This presentation will focus on abstractive approaches that generate summaries using a free vocabulary and thus can yield more coherent texts. We will discuss summarizers trained in unsupervised, few-shot, and supervised regimes. These models combine principles of latent probabilistic models, variational inference, and reinforcement learning. In our unsupervised model (Copycat), we treat the product and review representations as latent continuous variables. At test time, we induce summarizing representations and map them to summarizing texts. In the supervised model (SelSum), we decompose the system into a selector (posterior) and summarizer. The selector treats reviews as latent categorical variables and selects a summary-relevant subset in training. Only the small subset is passed to the summarizer, which results in computational and memory savings. The system is trained end-to-end using variational inference and reinforcement learning. Finally, we fit another selector (prior) that selects subsets of informative reviews to summarize at test time.

Malihe Alikhani, University of Pittsburgh

Learning to Connect Images and Text for Natural Communication

Tuesday 21st December 2021, 16:00. Live streaming on Zoom.

From the gestures that accompany speech to images in social media posts, humans effortlessly combine words with visual presentations. However, machines are not equipped to understand and generate such presentations due to people’s pervasive reliance on commonsense and world knowledge in relating words and images. I present a novel framework for modeling and learning a deeper combined understanding of text and images by classifying inferential relations to predict temporal, causal, and logical entailments in context. This enables systems to make inferences with high accuracy while revealing author expectations and social-context preferences. I proceed to design methods for generating text based on visual input that use these inferences to provide users with key requested information. The results show a dramatic improvement in the consistency and quality of the generated text by decreasing spurious information by half. Finally, I sketch my other projects on human-robot collaboration and conversational systems and describe my research vision: to build human-level communicative systems and grounded artificial intelligence by leveraging the cognitive science of language use.

Denis Paperno, Utrecht Institute of Linguistics

On Compositional Generalization of Transformer Models for Toy Tasks

Tuesday 30th November 2021, 16:00. Room A1.14 at Science Park 904.

Toy tasks such as interpreting the arithmetic language (Hupkes et al. 2018) or SCAN (Lake and Baroni 2018) are designed to help us detect and analyze compositional semantic behavior of machine learning models. Most results using these toy tasks have been achieved in analyzing and interpreting recurrent neural network models. However, the state of the art in NLP is now defined by the Transformer model (Vaswani et al. 2017) which, while in principle having the same theoretical expressive capacity as recurrent models, has a different structure and achieves different learning outcomes. The talk will share some observations on Transformers' compositional generalization behavior on toy tasks.

Vlad Niculae, Informatics Institute (IvI), University of Amsterdam

Sparse Latent Structure with Overlapping Constraints

Tuesday 16th November 2021, 16:00. Live streaming on Zoom.

Structured representations are a powerful tool in machine learning, in particular for natural language: The discrete, compositional nature of words and sentences leads to natural combinatorial representations such as trees, sequences, segments, or alignments, among others. Such representations are at odds with deep neural networks, which conventionally perform smooth, soft computations, learning dense, inscrutable hidden representations. We present SparseMAP, a strategy for inferring differentiable combinatorial latent structures, alleviating the tension between discrete and continuous representations through sparsity. SparseMAP computes a globally-optimal combination of a very small number of structures, and can be extended to arbitrary factor graphs (LP-SparseMAP), only requiring access to local maximization oracles. Our strategy is fully deterministic and compatible with familiar gradient-based methods for training neural networks. We demonstrate sparse and structured neural hidden layers, with successful empirical results and visualization properties.

Multiple presenters, University of Amsterdam

The Beyond the Imitation Game Benchmark (BIG-bench) challenge

Tuesday 5th October 2021, 16:00. Live streaming on Zoom.

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and provide concrete evidence of their capabilities and limitations. In the community building spirit of ELLIS-Amsterdam, we have formed three teams mixing Bachelor's, Master's, and PhD students and have contributed three tasks to the benchmark. In the seminar, we will briefly introduce the BIG-bench challenge and then the three teams will present their benchmarking tasks. The Metaphor Understanding task tests the capability of language models to understand English metaphors. It consists of two subtasks: in the first one, a language model is asked to map a metaphorical expression to its correct literal paraphrases; in the second one, the model needs to map a literal paraphrase to the corresponding metaphorical expression. The two subtasks form a new dataset that takes into account the lessons learned from existing models and benchmarks. The Implicit Relations task evaluates a model's ability to infer relations between characters from short passages of English narratives, where the relations are left implicit. In each example, a passage and a question of the form "What is X to Y?" are presented, and the model must select the correct relation. Our new dataset makes use of 25 labels ranging from familial relations to professional relations. Finally, the Fantasy Reasoning task assesses a language model's ability to reason within situations that go against common sense or in some way violate the rules of the real world; humans do this easily, e.g., when reading a science fiction book. We collect a corpus of contexts that language models are extremely unlikely to be familiar with, paired with yes-no questions.

References to the three projects with the corresponding list of authors can be found on GitHub: Metaphor Understanding, Implicit Relations, Fantasy Reasoning.

Robert Hawkins, Princeton

Coordinating on meaning in communication

Tuesday 18 May 2021, 17:00 [note unusual time]. Live streaming on Zoom.

Languages are powerful solutions to the complex coordination problems that arise between social agents. They provide stable, shared expectations about how the words we say correspond to the beliefs and intentions in our heads. However, to handle an ever-changing environment with new things to talk about and new partners to talk with, linguistic knowledge must be flexible: we give old words new meaning on the fly. In this talk, I will present work investigating the cognitive mechanisms that support this balance between stability and flexibility. First, I'll present a large corpus of natural-language communication in the classic "tangrams" task that allows us to quantitatively characterize the dynamics of ad hoc convention formation with a single partner. Second, I'll ask how these ad hoc conventions may be generalized to broader communities. I'll introduce a theoretical framework re-casting communication not as a transmission problem but as a meta-learning problem which may be formalized via hierarchical probabilistic inference: dynamics within an interaction are driven by ad hoc partner-specific adaptation while community-level conventions are gradually abstracted away from many interactions and provide a stable prior for new partners. Finally, I'll explore several proposals about how this computational framework can be implemented at scale to allow artificial agents to form natural-language conventions, adapting to human partners in real time. Taken together, this line of work aims to build a computational foundation for a more dynamic view of meaning and common ground in communication.

Clara Meister, ETH Zürich

If Beam Search is the Answer, what was the Question?

Tuesday 23 March 2021, 16:00. Live streaming on Zoom.

Quite surprisingly, exact maximum a posteriori (MAP) decoding of neural language generators frequently leads to low-quality results. Rather, most state-of-the-art results on language generation tasks are attained using beam search despite its overwhelmingly high search error rate. This implies that the MAP objective alone does not express the properties we desire in text, which raises the question: if beam search is the answer, what was the question? We frame beam search as the exact solution to a different decoding objective in order to gain insights into why high probability under a model alone may not indicate adequacy. We find that beam search enforces uniform information density in text, a property motivated by cognitive science. We suggest a set of decoding objectives that explicitly enforce this property and find that exact decoding with these objectives alleviates the problems encountered when decoding poorly calibrated language generation models. Additionally, we analyze the text produced using various decoding strategies and see that, in our neural machine translation experiments, the extent to which this property is adhered to strongly correlates with BLEU.
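
To make the notions of search error and information density in the abstract above concrete, here is a minimal sketch of beam search over a hypothetical toy next-token model. The `TOY_LM` table, the function names, and the use of surprisal variance as a uniformity measure are all assumptions made for illustration; the abstract does not specify the decoding objectives the authors propose.

```python
import math

# Hypothetical toy model: next-token probabilities given the previous token.
TOY_LM = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.1, "b": 0.5, "</s>": 0.4},
    "b": {"a": 0.4, "b": 0.1, "</s>": 0.5},
}


def beam_search(lm, beam_size, max_len=5):
    """Return (tokens, log_prob) of the best hypothesis found."""
    beams = [(["<s>"], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, p in lm[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:  # prune to the beam
            if tokens[-1] == "</s>":
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    # Fall back to unfinished hypotheses if nothing ended within max_len.
    return max(finished or beams, key=lambda c: c[1])


def surprisals(lm, tokens):
    """Per-token surprisal in bits, -log2 p(token | previous token)."""
    return [-math.log2(lm[prev][tok]) for prev, tok in zip(tokens, tokens[1:])]


def surprisal_variance(lm, tokens):
    """Population variance of surprisal; lower means more uniform information."""
    s = surprisals(lm, tokens)
    mean = sum(s) / len(s)
    return sum((x - mean) ** 2 for x in s) / len(s)


if __name__ == "__main__":
    for k in (1, 3):
        tokens, score = beam_search(TOY_LM, beam_size=k)
        print(k, tokens, round(math.exp(score), 3),
              round(surprisal_variance(TOY_LM, tokens), 3))
```

On this toy model a beam of size 1 (greedy search) commits to a locally likely prefix and ends with a lower-probability sequence than a beam of size 3, which is a small instance of search error.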

Svitlana Vakulenko, University of Amsterdam

Conversational Question Answering at Scale

Tuesday 23 February 2021, 16:00. Live streaming on Zoom.

Conversational question answering (QA) requires the ability to correctly interpret a question in the context of previous conversation turns. This talk presents the current advancements in this field, specifically focusing on question rewriting approaches. The advantages of using question reformulation in conversational settings are manifold: (1) reuse of existing models, datasets and approaches for information retrieval; (2) more transparency in the prediction results; (3) ability to deploy the models in a distributed environment, where the individual components do not share a common representation. Our experiments demonstrate that question rewriting is not only effective at setting the state-of-the-art performance on conversational QA but also allows us to evaluate the robustness of question answering approaches.

Douwe Kiela, Facebook AI Research

Rethinking Benchmarking in AI

Tuesday 19 January 2021, 16:00. Live streaming on Zoom.

The current benchmarking paradigm in AI has many issues: benchmarks saturate quickly, are susceptible to overfitting, contain exploitable annotator artifacts, have unclear or imperfect evaluation metrics, and do not measure what we really care about. I will talk about my work in trying to rethink the way we do benchmarking in AI, specifically in natural language processing, focusing mostly on the recently launched Dynabench platform.

Bryan Eikema, ILLC, University of Amsterdam

The Inadequacy of the Mode in Neural Machine Translation

Tuesday 1 December 2020, 16:00. Live streaming on Zoom.

Neural sequence generation systems oftentimes generate sequences by searching for the most likely sequence under the learnt probability distribution. This assumes that the most likely sequence, i.e. the mode, under such a model must also be the best sequence it has to offer (often in a given context, e.g. conditioned on a source sentence in translation). Recent findings in neural machine translation (NMT) show that the true most likely sequence oftentimes is empty under many state-of-the-art NMT models. This follows a large list of other pathologies and biases observed in NMT and other sequence generation models: a length bias, larger beams degrading performance, exposure bias, and many more. Many of these works blame the probabilistic formulation of NMT or maximum likelihood estimation. We provide a different view on this: it is mode-seeking search, e.g. beam search, that introduces many of these pathologies and biases, and such a decision rule is not suitable for the type of distributions learnt by NMT systems. We show that NMT models spread probability mass over many translations, and that the most likely translation oftentimes is a rare event. We further show that translation distributions do capture important aspects of translation well in expectation. Therefore, we advocate for decision rules that take into account the entire probability distribution and not just its mode. We provide one example of such a decision rule, and show that this is a fruitful research direction.
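
The abstract above does not name the decision rule it proposes. As one well-known rule that uses the whole distribution rather than its mode, the sketch below shows sampling-based minimum Bayes risk (MBR) decoding; this choice, the unigram-F1 utility, and the toy samples are assumptions for illustration only, not the authors' setup.

```python
from collections import Counter


def unigram_f1(hyp, ref):
    """Toy utility: F1 over unigram overlap between two strings."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    if not h and not r:
        return 1.0  # two empty strings count as a perfect match
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mode(samples):
    """Most frequent sample, a stand-in for the mode of the distribution."""
    return Counter(samples).most_common(1)[0][0]


def mbr_decode(samples, utility=unigram_f1):
    """Pick the candidate with the highest average utility against all samples."""
    candidates = list(dict.fromkeys(samples))  # unique, order-preserving
    return max(
        candidates,
        key=lambda c: sum(utility(c, s) for s in samples) / len(samples),
    )


if __name__ == "__main__":
    # Hypothetical samples from a translation model; the empty string is the
    # single most frequent outcome, yet most mass lies on similar sentences.
    samples = ["", "", "the cat sat", "the cat sits", "a cat sat",
               "the cat sat down"]
    print("mode:", repr(mode(samples)))
    print("MBR :", repr(mbr_decode(samples)))
```

In this toy example the mode is the empty string, while the MBR choice is a sentence close to the bulk of the samples, mirroring the abstract's point that probability mass is spread over many translations.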

References:

[pre-print]

Michael Franke, University of Osnabrück

Title: Theory-driven probabilistic modeling of language use: a case study on quantifiers, logic and typicality

[Joint work with Bob van Tiel (Nijmegen) and Uli Sauerland (Berlin)]

Tuesday 10 November 2020, 16:00. Live streaming on Zoom.

Theoretical linguistics postulates abstract structures that successfully explain key aspects of language. However, the precise relation between abstract theoretical ideas and empirical data from language use is not always apparent. Here, we propose to empirically test abstract semantic theories through the lens of probabilistic pragmatic modelling. We consider the historically important case of quantity words (e.g., 'some', 'all'). Data from a large-scale production study seem to suggest that quantity words are understood via prototypes. But based on statistical and empirical model comparison, we show that a probabilistic pragmatic model that embeds a strict truth-conditional notion of meaning explains the data just as well as a model that encodes prototypes into the meaning of quantity words.

Mini-symposium on Meaning Variation in Social Contexts

(Special event on the occasion of Marco Del Tredici's PhD defense)

Thursday 5 November 2020, 14:30 - 17:30. Live streaming on Zoom. Talks will be around 20-30 minutes with time for questions and discussion. Program below.

14:30-14:40 | Welcome

14:40-15:20 | Dirk Hovy, Bocconi University, Milan

Title: Reach across the aisle: Why NLP is perfect for computational social science

Language is the ultimate social medium: we communicate not just to convey information, but to entertain, gossip, console, convince, and so much more. Social sciences have long explored this connection between behavior and language to learn more about the people and societies who use it. As the amount of language data grows exponentially, traditional methods are no longer sufficient, but NLP can help address this issue. In addition to what language can tell us about society, we now find out what NLP can tell us about language. This combination opens a wide range of exciting new applications, and allows us to answer questions that were out of reach for years. However, it introduces NLP into areas that were previously the sole domain of social sciences, which also presents the challenge of finding a balance between methodology and theoretical motivation. In this talk, I will show how using NLP in social sciences can do more than what we thought before. I will illustrate NLP's role in social sciences with some ongoing research, and discuss a number of open questions and challenges for the field(s).

15:20-16:00 | Ekaterina Shutova, ILLC, University of Amsterdam

Title: Meta-learning for few-shot word sense disambiguation

16:00-16:10 | Break

16:10-16:50 | Katrin Erk, University of Texas at Austin

Title: How to marry a star: Probabilistic constraints for meaning in context

Context has a large influence on word meaning; not only local context, like in the combination of a predicate and its argument, but also global topical context. In computational models, this is routinely factored in, but the question of how to integrate different context influences is still open for theoretical accounts of sentence meaning. We start from Fillmore's "semantics of understanding", where he argues that listeners imagine the situation behind a given utterance using all their knowledge about words and the world. We formalize this idea as a "situation description system". This is a generative model of utterance understanding, which characterizes understanding as probabilistically describing the situation underlying the utterance.

16:50-17:30 | Mario Giulianelli, ILLC, University of Amsterdam

Title: A usage-based approach to lexical semantic change

Traditional semantic change detection algorithms rely on the assumption that a single word type representation is sufficient to model the different usages of a word. In this talk, I'll present a usage-based approach for the detection and analysis of lexical semantic change that relies on contextualised word representations obtained from a neural language model—one for every occurrence of a word of interest. After introducing this method, I'll discuss the validity of contextualised embeddings as word usage representations and show that they capture a variety of synchronic and diachronic linguistic phenomena. We'll see how this is reflected in the accuracy of the proposed approach tested on historical corpora in four languages, and compare ways to make the method more robust. Finally, I'll give an overview of the types of change detected by our usage-based approach and propose ideas to automate this finer-grained analysis.

Felix Hill, DeepMind

Title: An approach to language understanding in machines based on prediction, perception and action

Tuesday 13 October 2020, 16:00. Live streaming on Zoom.

Massive language models like GPT-3 can do amazing things with language, and this raises the interesting question of whether such text-based models could ever really "understand" it. One clear difference between GPT-understanding and human understanding is that GPT-3 doesn't learn to connect language to its actions or its perception of the world it inhabits. In this talk, I'll discuss an approach to language understanding in which a neural-network-based agent is trained to associate words and phrases with things that it learns to see and do. First, I'll provide some evidence for the promise of this approach by showing that the interactive, first-person perspective of an agent affords it with a particular inductive bias that helps it to extend its training experience to generalize to out-of-distribution settings in ways that seem natural or 'systematic'. Second, I'll show that the amount of 'propositional' (i.e. linguistic) knowledge that emerges in the internal states of the agent as it interacts with the world can be increased significantly by it learning to make predictions about observations multiple timesteps into the future. Third, I'll show how meta-learning and an explicit multi-modal external memory can afford agents the ability to learn new words in a single experience with an object (i.e. fast-mapping) and to combine this fast knowledge with longstanding semantic knowledge to interpret novel instructions. Finally, I'll connect GPT and agent-based learning in a more literal way, by showing how an agent endowed with representations from a massive language model can achieve substantial (zero-shot) transfer from template-based language to noisy natural instructions given by humans with access to the agent's world.

References:

[1] [2] [3] [4] [5]

Raquel Alhama, Max Planck Institute for Psycholinguistics [cancelled]

Title: Recognizing Words from Fixations: a Perceptually-constrained Connectionist Account

Tuesday 17 March 2020, 16:00. Room F2.19 at SP 107.

Reading requires rapid recognition of words in printed text. Existing models of visual word recognition account for this mechanism by mapping the perceived letter strings into lexical units. In our work, we explore whether this process is mediated by the statistical properties of the input writing systems. Adopting an information-theoretic perspective, we analyze two languages from different families (English and Hebrew), and we find key differences in the available information contained in the letters in different parts of the word (beginning vs. ending) for converging on a lexical candidate. We test the implications of these cross-linguistic differences in a novel perceptually-constrained connectionist model of visual word recognition. The simulations account for a number of behavioral phenomena. First, our model predicts a tendency to fixate slightly closer to the beginning of the word. Second, we demonstrate cross-linguistic differences in the likelihood of fixating at other locations due to availability of information-content. Our model makes the novel prediction, which we confirmed by behavioral data, that words with an atypical distribution of information-content across letters are better recognized when fixating at an unusual location in a word. Overall, our research shows how the mechanism of visual word identification is tuned to the perceptually-constrained regularities of the writing systems, thereby driving proficient reading.

Felix Hill, DeepMind [cancelled]

Title: An approach to language understanding in machines based on prediction, perception and action

Tuesday 3 March 2020, 16:00. Room F1.15 at SP 107.

Models like BERT or GPT-2 can do amazing things with language, and this raises the interesting question of whether such text-based models could ever really "understand" it. One clear difference between BERT-understanding and human understanding is that BERT doesn't learn to connect language to its actions or its perception of the world it inhabits. In this talk, I'll discuss an alternative approach to language understanding in which a neural-network-based agent is trained to associate words and phrases with things that it learns to see and do. First, I'll provide some evidence for the promise of this approach by showing that the interactive, first-person perspective of an agent affords it with a particular inductive bias that helps it to extend its training experience to generalize to out-of-distribution settings in ways that seem natural or 'systematic'. Second, I'll show that the amount of 'propositional' (i.e. linguistic) knowledge that emerges in the internal states of the agent as it interacts with the world can be increased significantly by it learning to make predictions about observations multiple timesteps into the future. This underlines some important common ground between the agent-based and BERT-style approaches: both attest to the power of prediction and the importance of context in acquiring semantic representations. Finally, I'll connect BERT and agent-based learning in a more literal way, by showing how an agent endowed with BERT representations can achieve substantial (zero-shot) transfer from template-based language to noisy natural instructions given by humans with access to the agent's world.

Jonas Groschwitz, Saarland University

Title: Making neural compositional semantic parsing work

Wednesday 19 February 2020, 11:00. Room F1.15 at SP 107.

In this talk, I will discuss our parser for semantic graphs such as Abstract Meaning Representation (AMR). Our approach combines neural models with mechanisms from compositional semantic construction. Key to this approach is the Apply-Modify (AM) algebra, which we developed to both reflect linguistic principles and yield a simple parsing model. In particular, the AM algebra allows us to find consistent latent compositional structures for our training data, which is crucial when training a compositional parser. The parser then employs neural supertagging and dependency models to predict interpretable, meaningful operations that construct the semantic graph. The result is a semantic parser with strong performance across diverse graphbanks that also provides insights into the compositional patterns of the graphs.

Duygu Ataman, University of Zürich

Title: A Latent Morphology Model for Open-Vocabulary Neural Machine Translation

Friday 24 January 2020, 11:00. Room F1.15 at SP 107.

Translation into morphologically-rich languages challenges neural machine translation (NMT) models with extremely sparse vocabularies where atomic treatment of surface forms is unrealistic. This problem is typically addressed by either pre-processing words into subword units or performing translation directly at the level of characters. The former is based on word segmentation algorithms optimized using corpus-level statistics with no regard to the translation task. The latter learns directly from translation data but requires rather deep architectures. In this paper, we propose to translate words by modeling word formation through a hierarchical latent variable model which mimics the process of morphological inflection. Our model generates words one character at a time by composing two latent representations: a continuous one, aimed at capturing the lexical semantics, and a set of (approximately) discrete features, aimed at capturing the morphosyntactic function, which are shared among different surface forms. Our model achieves better accuracy in translation into three morphologically-rich languages than conventional open-vocabulary NMT methods, while also demonstrating a better generalization capacity under low to mid-resource settings.

Aida Nematzadeh, DeepMind

Title: Learning language by observing the world and learning about the world from language

Tuesday 17 December 2019, 16:00. Room F2.19 at SP 107.

Children learn about the visual world from implicit supervision that language provides. Most children learn their language, at least to some extent, by observing the world. Recently released datasets of instructional videos are interesting as they can be considered a rough approximation of a child’s visual and linguistic experience -- in these videos, the narrator performs a high-level task (e.g., cooking pasta) while describing the steps involved in that task (e.g., boiling water). Moreover, these datasets pose challenges similar to those children need to address; for example, identifying activities relevant to the task (e.g., boiling water) and ignoring the rest (e.g., shaking head). I will present two recent projects where we study the interaction of visual and linguistic signals in these videos: (1) We show that using language and the structure of tasks is important in discovering action boundaries. (2) I will discuss how the visual signal improves the quality of unsupervised word translation, especially for dissimilar languages and where we do not have access to large corpora.

Arabella J. Sinclair, University of Amsterdam

Title: Modelling Speaker Adaptation in Second Language Learner Dialogue

Tuesday 29 October 2019, 16:00. Room F3.20 at SP 107.

Understanding how tutors and students adapt to one another within Second Language (L2) learning is an important step in the development of better automated tutoring tools for L2 conversational practice. Such an understanding can not only inform conversational agent design, but can be useful for other pedagogic applications such as formative assessment, self reflection on tutoring practice, learning analytics, and conversation modelling for personalisation and adaptation. We compare L2 dialogue at different levels of student ability to fluent conversational dialogues in order to identify how adaptation takes place in terms of the linguistic complexity, lexical alignment and the dialogue act usage demonstrated by the speakers within the dialogue. Finally, with the end goal of an automated tutor in mind, student alignment levels are used to compare dialogues between student and human tutor with those where the tutor is an agent. We find that the adaptation measured by speakers in L2 dialogue differs from fluent dialogue, and changes depending on learner proficiency. We also find different types of learner behaviours within automated L2 tutoring dialogues to those present in human ones, using alignment to measure this. We frame these findings as useful in identifying users who interact with tutoring agents as intended within future large online dialogue learning tools, with an emphasis on how these can be used to improve tutoring dialogue agents.

CLS Mini-Workshop with Aurelie Herbelot, Stella Frank, and Desmond Elliott

Tuesday 15 October 2019, 15:00 - 17:30. Room F1.15 at SP 107.

Aurelie Herbelot, University of Trento

Title: Speaker-dependence in distributional semantics

15:00-15:45

One long-standing puzzle in semantics is the ability of speakers to refer successfully in spite of holding different models of the world. This puzzle is famously illustrated by the cup/mug example: if two speakers disagree on whether a specific entity is a cup or a mug (i.e. if their interpretation functions differ), how can they align so that the entity can still be talked about? Another puzzle, coming to us through lexical and distributional semantics, is that word meaning seems to be infinitely flexible across utterances, indeed much more so than the traditional notion of sense would have it. This makes the alignment process between speakers even more unpredictable. In this talk, I will report on a series of experiments aiming at investigating differences in language use through distributional semantics techniques. I will sketch what such differences can tell us about the ability of speakers to align at a model-theoretic level.

Stella Frank, University of Edinburgh

Title: A model of rational accommodation to inexperienced speakers

15:45-16:30 (followed by a 15-minute coffee break)

Communication is made easier when speakers use language in similar ways. When speakers come to an interaction with slightly different languages they often adjust their languages to be more similar, in a process of alignment or accommodation. In this talk I consider interactions in which one speaker is a more experienced speaker than the other, such as interactions between a native and non-native speaker: in this case the native speaker could improve communication by accommodating to the non-native speaker. Accurate accommodation requires making inferences about the other's language, which we can model in a Bayesian framework. In a dialogue between two rational agents, a native speaker agent who accommodates and a non-native learner agent, the learner ends up with a simplified language, due to a reinforcing effect between an initially underinformed learner and an accommodating native speaker. This result gives a possible mechanism for the negative correlation between the proportion of non-native speakers of a language and language complexity.

Desmond Elliott, University of Copenhagen

Title: Compositional Generalization in Image Captioning

16:45-17:30

Image captioning models are usually evaluated on their ability to describe a held-out set of images, not on their ability to generalize to unseen concepts. We study the problem of compositional generalization, which measures how well a model composes unseen combinations of concepts when describing images. State-of-the-art image captioning models show poor generalization performance on this task. To address this, we propose a multi-task model that combines caption generation and image-sentence ranking, and uses a decoding mechanism that re-ranks the captions according to their similarity to the image. This model is substantially better at generalizing to unseen combinations of concepts compared to state-of-the-art captioning models.

Vinodkumar Prabhakaran, Google

Title: NLP and Society: Towards Socially Responsible NLP

Friday 27 September 2019, 16:00. Room F1.15 at SP 107.

As natural language processing (NLP) techniques are increasingly being used in various day-to-day applications, there is growing awareness that the decisions we as researchers and developers make about our data, methods, and algorithms have immense impact in shaping our social lives. In this talk, I will outline the growing body of research on ethical implications of machine learning and NLP technologies, especially around questions about fairness and accountability of the models we build and deploy into the world. I will discuss ways in which machine learned NLP models may reflect, propagate, and sometimes amplify social stereotypes about people, potentially harming already marginalized groups. I will also briefly discuss various ways to address these issues, both through mitigation strategies and through increased transparency.

Zeynep Akata, University of Amsterdam

Title: Representing and Explaining Novel Concepts with Minimal Supervision

Tuesday 10 September 2019, 16:00. Room F1.15 at SP 107.

Clearly explaining a rationale for a classification decision to an end-user can be as important as the decision itself. Existing approaches for deep visual recognition are generally opaque and do not output any justification text; contemporary vision-language models can describe image content but fail to take into account class-discriminative image properties which justify visual predictions. In this talk, I will present my past and current work on Zero-Shot Learning, Vision and Language for Generative Modeling and Explainable Artificial Intelligence where we show (1) how to generalize image classification models to cases when no visual training data is available, (2) how to generate images and image features using detailed visual descriptions, and (3) how our models focus on discriminating properties of the visible object, jointly predict a class label, and explain why the predicted label is or is not chosen for the image.

Daniel Beck, University of Melbourne

Title: Natural Language Generation in the Wild

15 July 2019, 2pm, F1.15, SP 107

Traditional research in NLG focuses on building better models and assessing their performance using clean, preprocessed and curated datasets, as well as standard automatic evaluation metrics. From a scientific point-of-view, this provides a controlled environment where different models can be compared and robust conclusions can be made. However, these controlled settings can drastically deviate from scenarios that happen when deploying systems in the real world. In this talk, I will focus on what happens *before* data is fed into NLG systems and what happens *after* we generate outputs. For the first part, I will focus on addressing heterogeneous data sources using tools from graph theory and deep learning. In the second part, I will talk about how to improve decision making from generated texts through Bayesian techniques, using Machine Translation post-editing as a test case.

Caio Corro, University of Amsterdam

Title: Learning a neural parser in a low-resource scenario with a structured latent variable model

23rd May 2019, 2pm, F1.15, SP 107

Discrete structures such as dependency trees are often used to inject prior linguistic knowledge into statistical models. Many systems are built on top of a pipeline that starts with predicting a linguistic structure (e.g., syntactic or semantic representations) using a parser and then makes a task-specific prediction relying on this predicted structure (e.g., choose a polarity label in sentiment analysis). Unfortunately, most parsers rely on large amounts of manually-annotated data for training, which is available only for a small fraction of languages and domains. Therefore, it is appealing to rely on other forms of supervision to learn the parameters of the parser. On the one hand, raw text data is available in many languages. It can be used for semi-supervised learning to complement a small set of available annotated data. On the other hand, even when annotated data is not available, assuming structured representations of sentences can be beneficial, as it provides inductive biases about the structure of the language. In this case, we want to induce task-specific structured representations of language in such a way as to benefit a given downstream task. In other words, an inductive bias is injected in the model, i.e. structures are good for natural languages, but no assumption is made about the appropriate content: the parser is trained end-to-end while optimizing performance on the downstream task. In practice, structures induced in this way tend not to resemble any accepted syntactic or semantic formalism, since the model is free to induce whichever structure is best suited to the particular downstream task.

In this talk, I will explain how both problems can be cast as learning the parameters of a statistical model with structured latent variables. During training, exact inference in these models requires marginalizing over latent variables which is intractable (e.g. summing over all dependency trees for a given sentence). Recently, differentiable Monte-Carlo estimation (i.e. the reparametrization trick) has been explored for training statistical models parametrized with neural networks. We follow this line of work and introduce a differentiable relaxation which we use to approximate samples and compute gradients with respect to the parser parameters. Our method (Differentiable Perturb-and-Parse) relies on differentiable dynamic programming over stochastically perturbed arc weights. We show the effectiveness of our approach on several tasks and datasets.
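
The perturbation idea mentioned above can be illustrated in its simplest, unstructured form: adding independent Gumbel noise to a set of scores and taking the argmax (the Gumbel-max trick) yields a sample from the softmax distribution over those scores. The sketch below is only this unstructured analogue, with hypothetical names and scores; it does not implement the differentiable dynamic programming over arc weights described in the abstract.

```python
import math
import random


def softmax(scores):
    """Target distribution that perturb-and-argmax samples from."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_noise(rng):
    """Draw one standard Gumbel sample via inverse transform sampling."""
    u = max(rng.random(), 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))


def perturb_and_argmax(scores, rng):
    """Add independent Gumbel noise to each score and return the argmax index."""
    perturbed = [s + gumbel_noise(rng) for s in scores]
    return max(range(len(scores)), key=perturbed.__getitem__)


def empirical_distribution(scores, n_samples, seed=0):
    """Estimate how often each index wins under random perturbations."""
    rng = random.Random(seed)
    counts = [0] * len(scores)
    for _ in range(n_samples):
        counts[perturb_and_argmax(scores, rng)] += 1
    return [c / n_samples for c in counts]


if __name__ == "__main__":
    scores = [1.0, 0.0, -1.0]  # hypothetical scores
    print("softmax  :", [round(p, 3) for p in softmax(scores)])
    print("empirical:", [round(p, 3) for p in empirical_distribution(scores, 20000)])
```

The hard argmax here is not differentiable; the method in the abstract replaces it with a differentiable relaxation so that gradients can reach the parser parameters.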

Iacer Calixto, University of Amsterdam

Title: Integrating vision, language, and world knowledge in the context of (natural) language generation

28th May 2019, 4pm, F1.15, SP 107

In this talk I would like to discuss how to integrate vision, language, and world knowledge in the context of (natural) language generation. I will start by discussing our most recent paper just accepted for publication at ACL 2019, and I will wrap up by contextualising my research interests for the next 3 years under my Marie-Curie project "IMAGINE: Improving language generation with world knowledge". In our ACL paper we propose to model the interaction between visual and textual features for multi-modal neural machine translation (MMT) through a latent variable model. This latent variable can be seen as a multi-modal stochastic embedding of an image and its description in a foreign language, and is used in a target-language decoder and also to predict image features. Importantly, our model formulation utilises visual and textual inputs during training but does not require that images be available at test time. I will show that our latent variable MMT formulation improves considerably over strong baselines, including a multi-task learning approach (Elliott and Kádár, 2017) and a conditional variational auto-encoder approach (Toyama et al., 2016). Regarding my research agenda for the next 3 years, I will discuss how to represent world knowledge by learning general-purpose multi-modal knowledge base representations, as well as how to incorporate these representations into (and improve) natural language generation.

David Poeppel, New York University and Max-Planck-Institute

Title: Brain rhythms and the encoding of structure

Jointly with the ABC lecture: 14th May 2019, 4pm, REC M 1.02

How language, music, and other complex sequences are represented and computed in the human brain is a fundamental area of brain research that continues to stimulate as much research as it does vigorous debate. Some classical questions (and persistent puzzles) - highlighting the tension between neuroscience and cognitive science research - concern the role of structure and abstraction. Recent findings from human neuroscience, across various techniques (e.g. fMRI, MEG, ECoG), suggest that the brain supports hierarchically structured abstract representations.
New data on the role of brain rhythms show that such neural activity appears to underpin the tracking of structure-building operations. If the new approaches are on the right track, they invite closer relations between fields and better linking hypotheses between the foundational questions that animate both the neurosciences and the cognitive sciences.

Sandro Pezzelle, University of Amsterdam

Title: Grounding Vague Expressions in Vision

2nd April 2019, 4pm, F1.15, SP 107

Expressions like "most" or "big" are known to be vague, that is, their interpretation can be borderline and not generally agreed upon. Moreover, their use is context-dependent, in that an entity can be "big" in one context but not in another. Interestingly, the meaning of these expressions is shown to be mostly quantitative when they are used to refer to entities (or sets of entities) in real-world contexts; for example, "few" is used by speakers only to refer to a given range of (low) proportions. By exploiting state-of-the-art, cognitively-inspired computational techniques, I tackle the issue of modelling the meaning of vague expressions from their use in grounded contexts, specifically Vision. In the first, longer part of the talk, I will provide an overview of my recent investigations on vague quantifiers ("few", "many", "all", etc.), both at the behavioural and computational level. In the second, shorter part, I will present ongoing research on gradable adjectives ("big", "small", etc.). Any feedback and comments are more than welcome!

Victoria Yaneva, University of Wolverhampton

Title: Applying Behavioural Data to NLP Models for Solving Ambiguity and Non-compositionality

12th March, 2019, 4pm, F 1.15, SP 107

When processing a text, both humans and machines must cope with ambiguity and non-compositionality. These phenomena represent a considerable challenge for NLP systems, while at the same time there is limited evidence from online measures on how humans solve them during natural reading. We approach these two problems as one and hypothesize that obtaining information on how humans process ambiguous and non-compositional phrases can improve the computational treatment of such instances. I will present experiments on using eye-tracking data to improve NLP models for two tasks: classifying the different roles of the pronoun It (nominal anaphoric, clause anaphoric and non-referential), as well as the identification of multi-word expressions. The experiments test whether gaze-based features improve the performance of state-of-the-art NLP models and the extent to which gaze features can be used to partially or entirely substitute the crafting of linguistic ones. The best-performing models are then analysed to better understand the cognitive processing of these linguistic phenomena and findings are discussed with respect to the E-Z model of reading and the processing stages during which disambiguation occurs.

Afra Alishahi, Tilburg Center for Cognition and Communication

Title: Getting closer to reality: Grounding and interaction in models of human language acquisition

5th March, 2019, 2pm, F2.19, SP107

Humans learn to understand speech from weak and noisy supervision: they manage to extract structure and meaning from speech by simply being exposed to utterances situated and grounded in their daily sensory experience. Emulating this remarkable skill has been the goal of numerous studies; however researchers have often used severely simplified settings where either the language input or the extralinguistic sensory input, or both, are small-scale and symbolically represented. I present a series of studies on modelling visually grounded language understanding. Using variations of recurrent neural networks to model the temporal nature of spoken language, we examine how form and meaning-based linguistic knowledge emerges from the input signal.

Angeliki Lazaridou, Google DeepMind

Title: Emergence of (linguistic) communication through multi-agent interactions

12th February, 2019, 4pm, F3.20, SP 107

Contact person: Raquel Fernandez

Distributional models and other supervised models of language focus on the structure of language and are an excellent way to learn general statistical associations between sequences of symbols. However, they do not capture the functional aspects of communication, i.e., that humans have intentions and use words to coordinate with others and make things happen in the real world. In this talk, I will present two studies on multi-agent emergent communication, where agents exist in some grounded environment and have to communicate about objects and their properties. This process requires the negotiation of linguistic meaning in this pragmatic context of achieving their goal. In the first study, I will present experiments in which agents learn to form a common ground that allows them to communicate about disentangled (i.e., feature norm) and entangled (i.e., raw pixels) input. In the second study, I will talk about properties of linguistic communication as arising in the context of self-interested agents.

Reshmi Gopalakrishna Pillai, University of Wolverhampton

Roads (and flights) to stress - identifying travel troubles from social media

28th January, 2019, 2pm

Psychological stress is a crucial underlying reason for several physical and mental illnesses. The plethora of social media content provides an effective source to monitor stress, both long-term and short-term in nature. Depending on the context, analysis of stress in social media content could help assess customer feedback in businesses, bottlenecks in transportation systems or the psychological state of target populations. Situational stress in daily life scenarios such as traffic deserves research attention because: 1) it could potentially be an indicator of a persistent issue in the scenario, requiring corrective measures, and 2) such short-term stress can add up in the long term, negatively impacting individual well-being. This talk focuses on stress expressions in Tweets belonging to two domains: Airlines and London traffic. Using topic modeling and word vector representations, I will present an analysis of reasons for stress in these two domains. I will also discuss the features of the language used in high-stress travel Tweets, examining the presence of offensive words, sarcasm and negative emotions in detail, comparing and contrasting the findings with the features of Tweets belonging to other domains.

Lisa Beinborn, Universiteit van Amsterdam

Title: Bridging computational models with cognitive data for language processing

11th December, 2018, F1.15, SP107

Neural language models can be evaluated by comparing their performance on task-based evaluations. We discuss several methods for analyzing the cognitive plausibility of computational language representations by comparing them to human brain data. We examined the performance of several evaluation metrics across four fMRI datasets. In this talk, I will present the results of this experiment and compare the performance to a random model. In addition, we discuss the effect of selecting voxels (i.e. relevant regions of the brain to examine) in a model-driven way.

Andreas Vlachos, University of Cambridge

Title: Imitation learning, zero-shot learning and automated fact checking

30th October, 2018, F1.15, SP107

In this talk I will give an overview of my research in machine learning for natural language processing. I will begin by introducing my work on imitation learning, a machine learning paradigm I have used to develop novel algorithms for structure prediction that have been applied successfully to a number of tasks such as semantic parsing, natural language generation and information extraction. Key advantages are the ability to handle large output search spaces and to learn with non-decomposable loss functions. Following this, I will discuss my work on zero-shot learning using neural networks, which enabled us to learn models that can predict labels for which no data was observed during training. I will conclude with my work on automated fact-checking, a challenge we proposed in order to stimulate progress in machine learning, natural language processing and, more broadly, artificial intelligence.

Vlad Niculae, Instituto de Telecomunicações, Lisbon, Portugal

Title: Learning with Sparse Latent Structure

29th October, 2018, TBD

Structured representations are a powerful tool in machine learning, and in particular in natural language processing: The discrete, compositional nature of words and sentences leads to natural combinatorial representations such as trees, sequences, segments, or alignments, among others. At the same time, deep, hierarchical neural networks with latent representations are increasingly widely and successfully applied to language tasks. Deep networks conventionally perform smooth, soft computations resulting in dense hidden representations.
We study deep models with structured and sparse latent representations, without sacrificing differentiability, and thus enabling end-to-end gradient-based training. We demonstrate sparse and structured attention mechanisms, as well as latent computation graph structure learning, with successful empirical results on large scale problems including sentiment analysis, natural language inference, and neural machine translation.
Joint work with Claire Cardie, Mathieu Blondel, and André Martins.

Relevant publications
Vlad Niculae, André F. T. Martins, Mathieu Blondel, Claire Cardie. SparseMAP: Differentiable sparse structured inference. In: Proc. of ICML 2018.
Vlad Niculae, André F. T. Martins, Claire Cardie. Towards dynamic computation graphs via sparse latent structure. In: Proc. of EMNLP 2018.
Vlad Niculae and Mathieu Blondel. A regularized framework for sparse and structured neural attention. In: Proc. of NIPS 2017.

Mehrnoosh Sadrzadeh, Queen Mary University of London

Title: Exploring Semantic Incrementality with Dynamic Syntax and Vector Space Semantics

16th October, 2018, F3.20, SP107

Vector space semantics uses contexts of words to reason about their meanings; it is motivated by ideas of Firth and Harris, uses co-occurrence matrices to assign vectors to words, and has applications in diverse NLP tasks, from named entity recognition, to parsing, to disambiguation. Distributional semantics has been extended from words to sentences, using different grammatical formalisms, such as Lambek’s pregroups, the Lambek Calculus, and Combinatory Categorial Grammar. It has, however, not been considered for incremental and dialogue phenomena. These phenomena cover individual language processing, where hearers incrementally disambiguate word senses before sentences are even complete, and dialogue utterances, where more than one agent contributes to the unfolding of a sequence. In recent joint work with Purver, Hough, and Kempson (SemDial 2018), we defined an incremental vector space semantic model using the formalism of Dynamic Syntax and showed how it can incrementally assign a semantic plausibility measure as it performs word-by-word parses of utterances.

Steven Schockaert, Cardiff University

Title: Distributional Relation Vectors

2nd October, 2018, F1.15, SP107

Word embeddings implicitly encode a rich amount of semantic knowledge. The extent to which they can capture relational information, however, is inherently limited. To address this limitation, we propose to learn relation vectors, describing how two words are related based on the distribution of words in sentences where these two words co-occur. In this way, we can capture aspects of word meaning that are complementary to what is captured by word embeddings. For example, by examining clusters of relation vectors, we observe that relational similarities can be identified at a more abstract level than with traditional word vector differences. These relation vectors can be used, among other things, to enrich the input to neural text classification models. From a network of relation vectors, we can also learn relational word vectors. These are vector representations of word meaning which, unlike standard word vectors, capture relational properties rather than similarity. On a range of different tasks, we find that combining these relational word vectors with standard word vectors leads to improved results.

Duygu Ataman, ILLC - Fondazione Bruno Kessler

Title: Compositional Source Word Representations for Neural Machine Translation

8th September, 2018, 1100, F1.15, SP107

The requirement for neural machine translation (NMT) models to use fixed-size input and output vocabularies plays an important role in their accuracy and generalization capability. The conventional approach to cope with this limitation is performing translation based on a vocabulary of sub-word units that are predicted using statistical word segmentation methods. However, these methods have recently been shown to be prone to morphological errors, which lead to inaccurate translations. In this paper, we extend the source-language embedding layer of the NMT model with a bi-directional recurrent neural network that generates compositional representations of the source words from embeddings of character n-grams. Our model consistently outperforms conventional NMT with sub-word units on four translation directions with varying degrees of morphological complexity and data sparseness on the source side.

Ekaterina Shutova, ILLC - University of Amsterdam

Title: Grasping the finer point: Metaphor identification in text and brain imaging data

25th September, 2018, 1600, F1.15, SP107

Besides making our thoughts more vivid and filling our communication with richer imagery, metaphor plays a fundamental structural role in our cognition, helping us to organise and project knowledge. For example, when we say “a well-oiled political machine”, we view the concept of political system in terms of a mechanism and transfer inferences from the domain of mechanisms onto our reasoning about political processes. Highly frequent in text, metaphorical language represents a significant challenge for natural language processing (NLP) systems. In this talk, I will first present a neural network architecture designed to capture the patterns of metaphorical use and its application to metaphor identification in text. I will then discuss how general-purpose lexical and compositional semantic models can be used to better understand metaphor processing in the human brain.

Grzegorz Chrupala, Tilburg University

Title: Neural representations of form and meaning in spoken language

17th July, 2018, 1600, F1.15, SP107

The task of learning language in a multisensory setting, with weak and noisy supervision, is of interest to scientists trying to understand the human mind as well as to engineers trying to build smart conversational agents or robots. In this talk I will present work on learning language from visually grounded speech using deep recurrent neural networks, and show that these models are able to extract linguistic knowledge at different levels of abstraction from the input signal. I will then describe analytical methods which allow us to better understand the nature and localization of representations emerging in such recurrent neural networks. I will also discuss the challenges inherent in fully unsupervised modeling of spoken language and present recent results on this problem.

Dong Nguyen, Turing Institute

Title: NLP for the social sciences: opportunities and challenges

19th June, 2018, 1600, F1.15, SP107

Massive digital datasets, such as social media data, are a promising source to study social and cultural phenomena. They provide the opportunity to study language use and behaviour in a variety of social situations on a large scale. However, to fully leverage their potential for research in the social sciences, new computational approaches are needed. In this talk I will start with a general introduction to this research area. I will then focus on two case studies. First, I will discuss how natural language processing can help to scale up social science research by applying social science theories to large-scale naturalistic text data. I will discuss how we investigated the impact of participants’ motivations in the public health campaign Movember on the amount of campaign donations raised, based on the Social Identity Model of Collective Action. Second, I will discuss how advances in machine learning can be used to develop better tools for sociolinguists. Existing approaches to identify variables that exhibit geographical variation (e.g., pop vs. soda vs. coke in the US) have several important drawbacks. I will discuss a method to measure geographical language variation based on Reproducing Kernel Hilbert space (RKHS) representations. I will then conclude by discussing my perspective on a few big challenges in this area.

Arianna Bisazza, Leiden University

Hints of linguistic structure in neural models of language and translation

15th May, 2018, 1100, F1.15, SP107

The advent of efficiently trainable neural networks has led to striking improvements in the accuracy of next word prediction, machine translation and many other NLP tasks. It has also produced models that are much less interpretable. In particular, the role played by linguistic structure in sequence prediction and sequence-to-sequence models remains hard to gauge. What makes recurrent neural networks work so well for next word prediction? Do neural translation models learn to extract linguistic features from raw data and exploit them in any explicable way? In this talk I will give an overview of recent work, including my own, that aims at answering these questions. I will also present recent experiments on the importance of recurrency for capturing hierarchical structure with sequential models. Answering these questions is important to establish whether injecting linguistic knowledge into neural models is a promising research direction, and to understand how close we are to building intelligent systems that can truly understand and process human language.

Ivan Titov, Universities of Edinburgh (ILCC) and Amsterdam (ILLC)

Graph Convolutional Networks for Natural Language Processing and Relational Modeling

13th February, 2018, 1600, F1.15, SP107

Graph Convolutional Networks (GCNs) are an effective tool for modeling graph-structured data. We investigate their applicability in the context of natural language processing (machine translation and semantic role labeling) and modeling relational data (link prediction). For natural language processing, we introduce a version of GCNs suited to modeling syntactic and/or semantic dependency graphs and use them to construct linguistically-informed sentence encoders. We demonstrate that using them results in a substantial boost in machine translation performance and state-of-the-art results on semantic role labeling of English and Chinese. For link prediction, we propose Relational GCNs (RGCNs), GCNs developed specifically to deal with highly multi-relational data, characteristic of realistic knowledge bases. By explicitly modeling neighbourhoods of entities, RGCNs accumulate evidence over multiple inference steps in relational graphs and yield competitive results on standard link prediction benchmarks.

Joint work with Diego Marcheggiani, Michael Schlichtkrull, Joost Bastings, Thomas Kipf, Khalil Sima’an, Max Welling, Rianne van den Berg and Peter Bloem.

Gerhard Jäger, University of Tübingen

A Bayesian test of the lineage-specificity of word-order correlations

25th January, 2018, 1600, F3.20, SP107

One of the major early achievements of linguistic typology was Greenberg’s (1963) discovery of implicational word order universals. While his work was based on a comparatively small sample of languages, later work, such as (Hawkins, 1983; Dryer, 1992), confirmed the existence of implicational word order universals on the basis of broader data collections. In a landmark study using modern quantitative, Bayesian comparative methods and data from four language families (Austronesian, Bantu, Indo-European and Uto-Aztecan), Dunn, Greenhill, Levinson, and Gray (2011) obtained results in stark contrast to the established view. While the authors did find evidence for word order correlations in many cases, the emerging pictures differed fundamentally between the four families. From this they concluded that word order tendencies are lineage specific rather than universal. The authors did not explicitly compare their lineage-specific model with a universal model though; they only qualitatively assessed the assumption of universal word-order correlations as not plausible given their findings. In the talk I will present a study addressing this issue by performing a Bayesian model comparison between a universal and a lineage-specific model. It turns out that there is solid support for universal word-order correlations between features that Dryer (1992) classified as "verb patterners", while other correlations are clearly lineage specific. The broader methodological point to be made is that linguistic typology can immensely benefit from the tools of modern Bayesian statistics and the phylogenetic comparative method.

Marco Baroni, Facebook AI Research

Systematic compositionality with recurrent neural networks

16th January, 2018, 1600, F3.20, SP107

Recurrent neural networks (RNNs) are remarkably general learning systems that, given appropriate training examples, can handle complex sequential processing tasks, such as those frequently encountered in language and reasoning. However, RNNs are remarkably sample-heavy, typically requiring hundreds of thousands of examples to master tasks that humans can solve after just a few exposures. The first set of experiments I will present shows that modern RNNs, just like their ancestors from the nineties, have problems with systematic compositionality, that is, the ability to extract general rules from the training data, and apply them to new examples. As systematic compositionality allows very fast generalization to unseen cases, lack of compositional learning might be one of the roots of RNNs' training data thirst. I will next present an ongoing study where RNNs must solve an apparently simple task where correct generalization relies on function composition. Current results suggest that a large random search in RNN space finds a small portion of models that converged on a (limited) compositional solution. However, it's not clear, for the time being, what is special about such models. The quest for compositional RNNs is still on.

Joint work with: Brenden Lake, Adam Liska, Germán Kruszewski

Padraic Monaghan, Lancaster University

Cognitive processes driving language evolution: Diachronic and experimental studies of English vocabulary change

28th November, 2017, 1600, F1.21 SP 107

There are multiple contributors to language change that are external to the speaker, such as social or economic drivers, or even accidents of linguistic contact. However, there are also internal constraints that are key to shaping language evolution. In particular, psycholinguistic properties of language can predict which representations are acquired and stored with greatest fidelity by the speaker. For instance, we know that frequency, length, and the age at which a language structure is acquired all contribute to more stable storage and accurate reproduction of that structure. In this talk, I present a series of studies of the English vocabulary to demonstrate how internal cognitive processing has shaped the language, with analyses from corpora of diachronic vocabulary change and morphological change of the past tense forms of verbs, accompanied by laboratory studies of artificial language learning and change that show similar patterns to the diachronic data. These studies provide suggestions for how psycholinguistic properties of the language affect learning and cultural transmission across generations of speakers.

Luciano Serafini, FBK - Trento

Learning and Reasoning with Logic Tensor Networks: the framework and an application

10th October, 2017, 1600, F1.21 SP 107

Logic Tensor Networks (LTN) is a theoretical framework and an experimental platform that integrates learning based on tensor neural networks with reasoning using first-order many-valued/fuzzy logic. LTN supports a wide range of reasoning and learning tasks with logical knowledge and data, combining rich symbolic knowledge representation in first-order logic (FOL) with efficient data-driven machine learning based on the manipulation of real-valued vectors. In practice, FOL reasoning including function symbols is approximated through the usual iterative deepening of clause depth. Given data available in the form of real-valued vectors, logical soft and hard constraints and relations which apply to certain subsets of the vectors can be specified compactly in FOL. All the different tasks can be represented in LTN as a form of approximated satisfiability; reasoning can help improve learning, and learning from new data may revise the constraints, thus modifying reasoning. We apply LTNs to Semantic Image Interpretation (SII) in order to solve the following tasks: (i) the classification of an image's bounding boxes and (ii) the detection of the relevant part-of relations between objects. The results show that the usage of background knowledge improves the performance of purely data-driven machine learning methods.

David Schlangen, Bielefeld University

Learning and Maintaining a Lexicon for Situated Interaction

26th September, 2017, 1600, F1.15 SP 107

If, when asked to "point at the mug", a physically unimpaired person seems unable to identify a potential referent that is standing in front of them, we might hesitate to ascribe knowledge of the meaning of the word "mug" to them, whatever else they may be able to tell us about mugs (e.g., "wooden mugs were produced probably from the oldest time, but most of them have not survived intact.", or "mugs are similar to cups"). And yet computational models of word meaning are good at the latter (e.g., by simply linking to knowledge repositories like Wikipedia, where the previous sentence about wooden mugs was taken from), and fail at the former. In this talk, I will present our recent work on learning a lexicon for referential interaction, where the referential aspects of word meaning are modelled through perceptual classifiers taking real images as input. I show that this representation complements other computational meaning representations such as those derived from distributional patterns, as well as decompositional or attribute-based representations. The lexicon is learned through (observation of) interaction, and is maintained and defended in interaction.

Antske Fokkens, Vrije Universiteit Amsterdam

Deep Analysis: Reflections on experimental design in NLP

26th May, 2017, 1600 F1.21, SP107

Shared tasks and (shared) corpora have proven themselves highly valuable for NLP. They have allowed us to evaluate our methods and compare them to others, helping us, our readers and reviewers to assess the quality of our methods. A downside of the widespread approach of comparing results on a gold dataset is that it is relatively common practice to draw conclusions based on the highest numbers without looking into what is behind them. However, what goes wrong and why can be highly relevant for end-applications and, especially given the well-known difficulties with reproducing results, looking into the details of how and why results improve (or not) is highly relevant. In this talk, I will present two studies taking intrinsic evaluation one step further: 1) investigating error propagation in parsing and 2) diving into the evaluation of distributional semantic methods. Finally, I will outline the importance of deeper evaluation when NLP is used within digital humanities and digital social science.

Julia Kreutzer, Universität Heidelberg

Bandit Structured Prediction for Machine Translation Domain Adaptation with Weak Feedback

27th March, 2017, 1600 F3.20, SP107

Bandit structured prediction describes a stochastic optimization framework where learning is performed from partial feedback in the form of a task loss evaluation of a predicted output structure, without having access to gold standard structures. This framework has successfully been applied to various structured prediction tasks in NLP. In this talk I will focus on the application of bandit structured prediction to linear and non-linear machine translation models where models are adapted to a new domain without seeing reference translations of the new domain. In simulation experiments we showed that partial information in the form of translation quality judgements on predicted translations is sufficient for model adaptation, even for feedback as weak as pairwise preference judgements.

Raffaella Bernardi, University of Trento

Learning quantities from vision and language

23 March, 2017, 1700 F3.20 SP107

Linguistic quantifiers have been the realm of Formal Semantics. A lot is known about their formal properties and how those properties affect logical entailment, the licensing of polarity items, or scope ambiguities. Less is known about how quantifiers are acquired by children and even less about how computational models can learn to quantify objects in images. In this talk, we will report on our findings in this direction. First of all, we will explain why the task is interesting and challenging for a Language and Vision model. Secondly, we will report our evaluation of state-of-the-art neural network models against this task. Thirdly, we will compare the acquisition of quantifiers with the acquisition of cardinals. We will show that a model capitalizing on a 'fuzzy' measure of similarity is effective for learning quantifiers, whereas the learning of exact cardinals is better accomplished when information about number is provided.

Malvina Nissim, University of Groningen

(To what extent) Can we de-supervise affective computing?

14 March, 2017, 1600, F3.20 SP107

While our ultimate aim in language processing might be making fully unsupervised models that optimally resemble the human way of learning, in many areas of NLP we are still heavily working with high degrees of supervision. Aiming at sparing annotation effort, distant supervision has been explored in the past 10 years as an alternative way to obtain (noisy) training data. This obviously doesn't take us directly to unsupervised models, but in addition to being a cheaper method of labelling instances, it also keeps us closer to the original data and it might give us an indication of the extent to which we can make do with rather spontaneous signals in the data. In the talk, I will present two experiments in the area of affective computing exploiting distant supervision: one on emotion detection, and one on stance detection. In both cases, we acquire silver labels for training leveraging user generated social media data, and play with different degrees of supervision in building our models. These are eventually tested on standard benchmarks and compared to state-of-the-art approaches. Our (mixed) results are discussed also in the light of whether supervision is truly necessary or not, and the value of silver versus gold data.

Alex Fraser, LMU Munich

Challenges in Machine Translation related to Morphologically Rich Languages

7 March, 2017, 1600, F1.21 SP 107

There are a number of interesting challenges in translation to morphologically rich languages (such as German or Czech) from a language like English. I will first present a linguistically rich English to German translation system generalizing over compounds, phenomena of inflectional morphology and syntactic issues, relying on preprocessing and postprocessing techniques. Following this, I'll present approaches addressing similar issues which have been tightly integrated into the Moses SMT decoder, and work well for multiple language pairs. Finally, time allowing, I'll present some thoughts on addressing these and further challenges within the framework of neural machine translation.

Gemma Boleda, Universitat Pompeu Fabra

The interplay between sense and reference

19 January, 1600, F1.15 SP 107

Over a century ago, Frege famously introduced the distinction between sense and reference that is one of the theoretical foundations of formal semantics. However, in practice formal semanticists took reference and ran away with it, either eschewing sense-related issues altogether or giving a referential treatment to them (with notable exceptions). In this talk, I argue that we need to go back to Fregean sense, and propose that data-induced, continuous representations provided by distributional semantics and deep learning methods provide a good methodological handle for sense-related aspects of meaning. I support these claims with results from both computational modeling and theoretical studies. I then revisit reference and present ongoing work on the challenging enterprise of tackling it with continuous methods, too.

2016

Laura Rimell, University of Cambridge

Compositional Distributional Semantics for Relative Clauses

29 November, 1600, F3.20 SP 107

In this talk I will describe the creation of RELPRON, a dataset of subject and object relative clauses for the evaluation of compositional distributional semantic models. The RELPRON task involves matching terms, such as 'wisdom', with representative properties in relative clause form, such as 'quality that experience teaches'. Relative clauses are an interesting test case for compositional distributional semantic models because they contain a closed class function word and a long-distance dependency. I will present results on RELPRON obtained within a type-based composition framework, using a variety of approaches to simplify the learning of higher-order tensors, as well as results obtained using neural networks for composition. In line with many existing datasets, vector addition provides a challenging baseline for RELPRON, but it is possible to match or improve on the baseline by finding appropriate training data and models for the semantics of the relative pronoun.

Siva Reddy, University of Edinburgh

Freebase Semantic Parsing with and without Question-Answer Pairs

22 November, 1600, F3.20 SP 107

I will present three semantic parsing approaches for querying Freebase in natural language: 1) training only on a raw web corpus, 2) training on question-answer (QA) pairs and 3) training on both QA pairs and a web corpus. For 1 and 2, we conceptualise semantic parsing as a graph matching problem, where natural language graphs built using CCG/dependency logical forms are transduced to Freebase graphs. For 3, I will present a natural-logic approach combined with Convolutional Neural Network based relation extraction. Our methods achieve state-of-the-art results on the WebQuestions and Free917 QA datasets.

Alessandro Lenci, University of Pisa

The Cost of Compositionality: Towards a Distributional Model of Semantic Complexity

17 November, 1400, F1.15 SP 107

In this talk, I will introduce a distributional model for computing the complexity of semantic composition, inspired by recent psycholinguistic research on sentence comprehension. I argue that the comprehension of a sentence is an incremental process driven by the goal of constructing a coherent representation of the event the speaker intends to communicate with the sentence. Semantic complexity is determined by a composition cost depending on the internal coherence of the event model being constructed and on the activation degree of such event by linguistic constructions. The model is tested on some psycholinguistic datasets for the study of sentence comprehension.

Caroline Sporleder, University of Göttingen

Distinguishing Figurative and Literal Usages in Discourse

1 November, 1615-1700, F3.20 SP 107

Figurative expressions such as "break the ice" occur frequently in natural language, even in apparently matter-of-fact texts such as news wire. Many of these expressions are also ambiguous between a figurative and a literal interpretation when taken out of context, e.g. "break the ice (...on the duck pond)" vs. "break the ice (...with wary adolescents)". Being able to automatically detect figurative usages in a given context is potentially useful for a number of tasks, ranging from corpus-based studies of phraseology to applications in automatic natural language processing. In this talk, I will present a method for automatically distinguishing figurative and literal usages of a target expression in a given context. The method exploits the fact that well-formed texts exhibit lexical cohesion, i.e. words are semantically related to other words in the vicinity.

Barend Beekhuizen, University of Toronto

Parallel corpora and semantic typology

1 November, 1700-1745, F3.20 SP 107

Languages vary in the ways they carve up the world: where English uses the preposition on to describe support relations between objects, Dutch employs two prepositions, op and aan. Underlying such crosslinguistic variation, we also find tendencies in the way unique situations (objects, events) are grouped into linguistic semantic categories. For studying the variation and biases in the word meaning inventories of the world's languages, semantic typology has typically taken recourse to in-person elicitation. This process, however, is tedious, hard to apply to more abstract domains (the meaning of connectives, abstract verbs like think), and displays a researcher bias in the selection of stimuli. Instead, we propose to use parallel corpora to obtain judgments similar to in-person elicitations, but avoiding these pitfalls. In my talk, I will describe our pipeline for approaching this issue, discuss the properties of the representational space it yields, and present preliminary results on a typologically diverse corpus of translated subtitles. (joint work with Suzanne Stevenson)

Douwe Kiela, University of Cambridge

Grounding Semantics in Perceptual Modalities

27 September, 1600, F3.20 SP 107

Although distributed semantics has been very successful in various NLP tasks in recent years, the fact that word meanings are represented as a distribution over other words exposes them to the so-called grounding problem. Multi-modal semantics attempts to address this by enhancing textual representations with extra-linguistic perceptual input. Such multi-modal models outperform language-only models on a range of tasks. In this talk I will discuss my PhD work, which has been concerned with advancing this idea, by (1) improving how we mix information through multi-modal fusion, (2) finding better ways to obtain perceptual information through deep learning and (3) obtaining representations for previously untried modalities such as auditory and even olfactory perception. I'll also briefly talk about a new multi-modal features toolkit that NLP researchers can use to experiment with visual and auditory representations.

Barbara Plank, University of Groningen

What to do about non-canonical data in Natural Language Processing - fortuitous data to the rescue

13 September, 1600, F3.20 SP 107

Real world data differs radically from the benchmark corpora we use in natural language processing (NLP). As soon as we apply our technology to the real world, performance drops. The reason for this problem is obvious: NLP models are trained on samples from a limited set of canonical varieties that are considered standard, most prominently English newswire. However, there are many dimensions, e.g., socio-demographics, language, genre, sentence type, etc. on which texts can differ from the standard. The solution is not obvious: we cannot control for all factors, and it is not clear how to best go beyond the current practice of training on homogeneous data from a single domain and language.

In this talk, I review the notion of canonicity, and how it shapes our community's approach to language. I argue for the use of fortuitous data. Fortuitous data is data out there that just waits to be harvested. It might be in plain sight, but is neglected (available but not used), or it is in raw form and first needs to be refined (almost ready). It is the unintended yield of a process, or side benefit. Examples include hyperlinks to improve sequence taggers, or annotator disagreement that contains actual signal informative for a variety of NLP tasks. More distant sources include the side benefit of behavior. For example, keystroke dynamics have been extensively used in psycholinguistics and writing research. But do keystroke logs contain actual signal that can be used to learn better NLP models? In this talk I will present recent (on-going) work on keystroke dynamics to improve shallow syntactic parsing. I will also present recent work on using bi-LSTMs for POS tagging, which combines the POS tagging loss function with an auxiliary loss function that accounts for rare words and achieves state-of-the-art performance across 22 languages.

Dekai Wu, HKUST

9 June

Alexis Palmer, Universität Heidelberg

Discourse Modes and Situation Entities

24 May

Approaches to the computational analysis of discourse are sensitive to different aspects of textual structure. Some consider topical structure, others focus on rhetorical relations, and still others concern themselves with the functional structure of texts. In this talk I present a new way of approaching the task, following Smith's (2003) work on Discourse Modes. The central idea is that texts are made up of passages - usually several sentences or more - with different modes: Smith's typology includes Narrative, Description, Report, Information, and Argument/Commentary. Smith further identifies specific linguistic correlates of these modes, one of which pertains to the contributions made to the discourse by individual clauses of text. As a first step toward automatic Discourse Mode classification, we address the problem of classifying clauses of written English text according to the type of situation expressed by the clause. The situation entity (SE) classification task as construed here uses a scheme that includes, among others, events, states, abstract entities, and generic sentences. We find that a feature-driven approach to annotating SEs both improves annotation consistency and enriches the annotated data with useful semantic information, such as lexical aspect of verbs, genericity of main referents, and habituality of clauses. This data has been used to develop automatic classifiers for SE types as well as for other semantic phenomena.

Shay Cohen, University of Edinburgh

Latent-Variable Grammars and Natural Language Semantics

9 May

Probabilistic grammars are an important model family in natural language processing. They are used in the modeling of many problems, most prominently in syntax and semantics. Latent-variable grammars are an extension of vanilla probabilistic grammars, introducing latent variables that inject additional information into the grammar by using learning algorithms in the incomplete data setting. In this talk, I will discuss work aimed at the development of (four) theoretically-motivated algorithms for the estimation of latent-variable grammars. I will discuss how we applied them to syntactic parsing, and more semantically-oriented problems such as machine translation, conversation modeling in online forums and question answering.

Iacer Calixto, Dublin City University

Incorporating Translational Evidence in Multilingual and Multimodal Embeddings

5 April

We introduce models for training embeddings that effectively integrate computer vision and natural language processing. The main novelty in our proposal is the utilisation of data that is not only multimodal, but both multimodal and multilingual. The intuition behind our models is that multiple sources of textual information might convey more "facts" about an image than a textual description in only one language. We discuss how incorporating translational evidence might be used in improving the quality of trained embeddings. We use the recently released multimodal Flickr30k dataset and evaluate our models on the tasks of sentence-to-image and image-to-sentence ranking. Our results demonstrate that including multilingual data leads to substantial improvement over the (monolingual) state-of-the-art.

Bob Coecke, University of Oxford

In pictures: from quantum foundations to natural language meaning

18 Feb

Earlier work on an entirely diagrammatic formulation of quantum theory, which is soon to appear in the form of a textbook, has somewhat surprisingly guided us towards an answer for the following question: how do we produce the meaning of a sentence given that we understand the meaning of its words? This work has practical applications in the area of natural language processing, and the resulting tools have meanwhile outperformed existing methods.

2015

Grzegorz Chrupała, Tilburg University

Learning visually grounded linguistic representations

3 December

Most research into learning linguistic representations focuses on the distributional hypothesis and exploits linguistic context to embed words in a semantic vector space. In this talk I address two important but often neglected aspects of language learning: compositionality and grounding. Words are important building blocks of language, but what makes language unique is putting them together: how can we build meaning representations of phrases and whole sentences out of representations of words? And how can we make sure that these representations connect to the extralinguistic world that we perceive and interact with? I will present a multi-task gated recurrent neural network model which sequentially processes words in a sentence and builds a representation of its meaning while making concurrent predictions about (a) which words are to follow and (b) what are the features of the corresponding visual scene. Learning is driven by feedback on this multi-task objective. I evaluate the induced representations on tasks such as image search, paraphrasing and textual inference, and present quantitative and qualitative analyses of how they encode certain aspects of language structure.

Wolfgang Maier, Universität Düsseldorf

Discontinuous incremental shift-reduce parsing

9 June

Incremental shift-reduce parsing with structured perceptron training is an established technique for continuous constituency parsing. The corresponding parsers are very fast and yield results that are close to the state of the art. In this talk, I present a shift-reduce parser which can produce discontinuous constituents by processing the input words out-of-order, a strategy known from dependency parsing. The system yields accurate results. Unlike previous grammar-based parsers for discontinuous constituents, it also achieves very high parsing speeds.

SMOG, King's College London

Special CLS workshop on Statistical Models of Grammaticality

19 May

The CLS is happy to announce three talks about Statistical Models of Grammaticality studied within the SMOG project at King’s College London:

Alex Clark: On his work on theoretical results for grammar induction
Shalom Lappin: Experimental work on identifying gradience in speakers' representation of syntactic knowledge
Jey Han Lau: Experiments with unsupervised language models to predict speakers' syntactic acceptability judgements

SMOG is exploring the construction of an enriched stochastic model that represents the syntactic knowledge that native speakers of English have of their language. We are experimenting with different sorts of language models that contain a variety of parameters encoding properties of sentences and probability distributions over corpora.

Tejaswini Deoskar, University of Edinburgh

Generalising Strongly-Lexicalised Parsers

10 March

I will present two ideas aiming towards 'parser-generalization', the problem of enhancing a supervised grammar and parsing model to accurately cover a wider variety of linguistic data than has been seen in the labeled data, using additional unlabeled data. The first idea concerns the use of the Expectation Maximisation (EM) algorithm for semi-supervised learning of parsing models. While it has long been thought that EM is unsuitable for semi-supervised learning of structured models such as part-of-speech taggers and parsing models (Merialdo 1994, Elworthy 1994), I will present experiments under two grammar formalisms (PCFG and CCG) where we have successfully used EM for semi-supervised learning of generative parsers. These two grammars share the property of being 'strongly lexicalised', in that they have complex lexical categories, and a few simple grammar rules that combine them. This strong lexicalisation makes these grammars more suitable for learning from unlabeled data than grammars which are not lexicalised in this way. In this work, I make the assumption that all lexical category types in the language are *known* from the supervised part of the data, a reasonable assumption to make if the supervised data is large enough. In the second part of the talk, I will discuss ongoing work where we generate *new* category types, based on those types seen in the labeled data. We use a latent-variable PCFG model for generating new CCG types, under the assumption that there is a hidden structure in CCG lexical categories which can be uncovered using such a model.

2014

Desmond Elliott, Centrum Wiskunde & Informatica

Representing the Structure of Images for Language Generation and Image Search

9 December

One approach to representing images is as a bag-of-regions vector, but this representation discards potentially useful information about the spatial and semantic relationships between the parts of the image. The central argument of the research is that capturing and encoding the relationships between parts of an image will improve the performance of downstream tasks. A simplifying assumption throughout the talk is that we have access to gold-standard object annotations. The first part of this talk will focus on the Visual Dependency Representation: a novel structured representation that captures region-region relationships in an image. The key idea is that images depicting the same events are likely to have similar spatial relationships between the regions contributing to the event. We explain how to automatically predict Visual Dependency Representations using a modified graph-based statistical dependency parser. Our approach can exploit features from the region annotations and the description to predict the relationships between objects in an image. The second part of the talk will show that adopting Visual Dependency Representations of images leads to significant improvements on two downstream tasks. In an image description task, we find improvements compared to state-of-the-art models that use either external text corpora or region proximity to guide the generation process. Finally, in a query-by-example image retrieval task, we show improvements in Mean Average Precision and the precision of the top 10 images compared to a bag-of-terms approach.

Qun Liu, Dublin City University

Dependency-based Statistical Machine Translation

11 November

It has been shown that syntax-based statistical machine translation can produce better translations than phrase-based translation, especially for language pairs with large structural differences. However, constituent-based models are complex and inefficient to implement. Dependency is regarded as a more compact and efficient syntactic formalism and a natural bridge from syntax to semantics, but early dependency-based SMT performed worse than the mainstream approaches. We proposed the first dependency-based SMT model whose performance is comparable with the state-of-the-art models in 2011, and we have since developed several improvements based on this model. Recently we tried a new dependency-based transfer-and-generation approach, which we think is promising and which has given positive results at this preliminary stage.

Wilker Aziz, University of Wolverhampton

Exact Sampling and Optimisation in Statistical Machine Translation

2 April

In Statistical Machine Translation (SMT), inference is performed over a high-complexity discrete distribution defined by the intersection between a translation hypergraph and a target language model. This distribution is too complex to be represented exactly and one typically resorts to approximation techniques either to perform optimisation - the task of searching for the optimum translation - or sampling - the task of finding a subset of translations that is statistically representative of the goal distribution. Beam-search is an example of an approximate optimisation technique, where maximisation is performed over a heuristically pruned representation of the goal distribution. In this presentation, I will talk about exact optimisation (decoding) and sampling for SMT based on a form of rejection sampling. In this view, the intractable goal distribution is upper-bounded by a simpler (thus tractable) proxy distribution, which is then incrementally refined to be closer to the goal until the maximum is found, or until the sampling performance exceeds a certain level.

Michael Franke, University of Tübingen

Bayesian models for reasoning about referential expressions

12 Feb

Establishing reference to objects in a shared environment is pivotal to successful communication. By using artificial scenarios where subjects need to choose referential expressions or guess the speaker's intended referent we can study the extent to which speakers and listeners reason pragmatically about each other's perspective. I will present a number of related empirical studies in this paradigm and discuss how different flavors of Bayesian cognitive modeling can be used to analyze the data.

2013

Andrea Ravignani, University of Vienna

Brains hate randomness: Patterning skills in humans and other animals

Oct 16

Human beings are excellent at making sense of, and producing, structured sensory input. In particular, cognitive abilities for patterning seem crucial in allowing humans to perceive and produce language and music. The comparative approach, testing a range of animal species, can help unveil the evolutionary history of such patterning abilities. Here, I present experimental data and ongoing work in humans, chimpanzees, squirrel monkeys, pigeons and kea. I compare monkeys' and humans' skills in processing sensory dependencies in auditory stimuli, a crucial feature of human cognition. In order to infer individual and species-specific learning strategies and behavioral heuristics, I analyze data from visual touch-screen experiments in birds. Finally, as pattern production and perception abilities have been shown to differ in humans, the same divide could exist in other species. I present ongoing work using "electronic drums" I developed specifically for apes, which will allow chimpanzees to spontaneously produce non-vocal acoustic patterns.

2011

Fermin Moscoso del Prado Martin, CNRS/Rhône-Alpin

Information theoretical approaches to language structure and complexity

22 Sep

Shalom Lappin, King's College London

Probabilistic Semantics for Natural Language

14 Sep

Probabilistic and stochastic methods have been fruitfully applied to a wide variety of problems in grammar induction, natural language processing, and cognitive modeling. In this talk I will explore the possibility of developing a class of combinatorial semantic representations for natural languages that compute the semantic value of a (declarative) sentence as a probability value which expresses the likelihood of speakers of the language accepting the sentence as true in a given model. Such an approach to semantic representation treats the pervasive gradience of semantic properties as intrinsic to speakers' linguistic knowledge, rather than the result of the interference of performance factors in processing and interpretation. In order for this research program to succeed, it must solve three central problems. First, it needs to formulate a type system that computes the probability value of a sentence from the semantic values of its syntactic constituents. Second, it must incorporate a viable probabilistic logic into the representation of semantic knowledge in order to model meaning entailment. Finally, it must show how the specified class of semantic representations can be efficiently learned from the primary linguistic data available for language acquisition. This research has developed out of recent work with Alex Clark (Royal Holloway, London) on the application of computational learning theory to grammar induction.

Suzanne Stevenson, University of Toronto

Computational Models of Child Verb Learning: Mechanisms for abstraction and generalization

1 Sep

Early verb learning in children seems an almost miraculous feat. In learning a verb, children must learn both the basic meaning of the event ("falling" or "eating"), as well as the allowable structures in their language for correctly communicating the participants in that event ("The glass fell", but not "She fell the glass"). Moreover, given the sparsity of evidence, children must be able to abstract away from specific usages they observe in order to use their knowledge of verbs productively. Finally, children must accomplish all this in the face of a high degree of variability among verbs, along with much noise and uncertainty in the input data, and with no explicit teaching. Do children require innate knowledge of language to accomplish this, or are general cognitive learning mechanisms sufficient to the task? We have developed various computational models of verb learning using unsupervised clustering over simple statistical properties of verb usages. Our findings support the claim that general learning mechanisms are able to acquire abstract knowledge of verbs and to generalize that knowledge to novel verbs and situations.