The CLS is the Computational Linguistics Seminar of the University of Amsterdam. Seminars are open to all interested researchers and students of all levels from UvA and elsewhere.
To receive notifications about upcoming talks and the Zoom details, please join the CLS mailing list.
To make sure you do not miss any talk, you can add the CLS agenda to your calendar.
The CLS usually takes place on Tuesdays at 16:00 at the ILLC in LAB42 at Amsterdam Science Park or via Zoom. Other days and locations are occasionally possible; see the details for each talk. Zoom links are distributed via the CLS mailing list on the day of each seminar.
Recent years have witnessed an explosion of NLP models for many different tasks, both in text-only and multimodal (vision & language) settings. Impressive results have been obtained on multimodal encoders, whereas decoders have received less attention. In my work, I focus on the latter, aiming to study the problem-solving reasoning behind natural language generation. To this end, I take referential grounded dialogue games as a testbed. I will discuss the main issues affecting generative systems and explore how the weaknesses of the encoder affect the choice of the decoder by focusing on the interpretation of negatively answered questions. I will then present a cognitively-inspired re-ranking decoding strategy for promoting the generation of strategic questions. I will compare this strategy to a wide variety of decoding algorithms proposed in the literature, together with an in-depth analysis of their hyper-parameter configurations. Finally, I will briefly mention some ongoing work exploring how modeling human uncertainty can lead to better natural language generation systems, and an investigation of pragmatic phenomena that allow humans to efficiently solve referential games.
Transformers have revolutionized deep learning research across many disciplines, starting from NLP and expanding to vision, speech, and more. In my talk, I will explore several milestones toward interpreting all families of Transformers, including unimodal, bi-modal, and encoder-decoder Transformers. I will present working examples and results that cover some of the most prominent models, including CLIP, BERT, LXMERT, and ViT. I will then present our recent explainability-driven fine-tuning technique that significantly improves the robustness of Vision Transformers (ViTs). The loss we employ ensures that the model bases its prediction on the relevant parts of the input, rather than supportive cues (e.g., background). This can be done with very little added supervision in the form of foreground masks, or without any such supervision.
I will discuss some of my recent (ACL 2022; CogSci 2022) experiments with large language models. We take polarity-sensitivity as a case study and take a closer look at linguistic representations of monolingual (BERT, GPT-2) and multilingual (multilingual BERT, XLM-RoBERTa) pre-trained language models. The overarching question is: what do these models learn about NPI licensing? We test this with simple 'observational' methods and with somewhat more baroque interventional ones and compare (some of) the results with human behavioural data. I hope these experiments lead to a more general discussion about the relation between LM data, psycholinguistic data and linguistic theory. This is joint work with Alexey Tikhonov.
To learn how to communicate with people around them, children have to master the linguistic content (e.g., what words mean) and understand how to encode and decode communicative intents (e.g., how words are used in dialog). While both aspects have been studied extensively over the last few decades, we do not have a complete theory of how they develop and interact. This slow progress is due, in part, to the fact that traditional research methods are often restrictive/de-contextualized and do not reflect children's real learning environment which is largely multimodal, socially embedded, and culturally variable. In this talk, I will argue that opportunities for collaborative data collection/pooling about children's learning in more ecologically valid settings as well as advances in data processing at scale provide new tools that can be utilized to help answer lingering scientific questions, making much more plausible the prospect of a quantitative theory of child communicative development in the wild.
In this talk, I will cover a (diverse) set of ongoing research threads:
In this talk I will cover some recent work that tries to improve how we do model evaluation in multimodal settings, focusing on the new Adversarial VQA and Winoground evaluation datasets. After that, I will talk about our latest vision and language "foundation model", called FLAVA: a single holistic universal transformer that targets all modalities at once and that shows impressive performance on a wide range of tasks.
Interaction between caregivers and children plays a critical role in human language acquisition and development. Given this observation, it is remarkable that explicit interaction plays little to no role in artificial language modeling -- which also targets the acquisition of human language, yet by artificial models. Moreover, an interactive approach to language modeling has the potential to make language models substantially more versatile and to considerably impact downstream applications. Motivated by these considerations, we pioneer the space of interactive language modeling. First we present a road map in which we detail the steps that need to be taken towards interactive language modeling. We then lead by example and take the first steps on this road map, showing the initial feasibility of our approach. As such, this work aims to be the start of a larger research agenda on interactive language modeling.
Attempts to computationally model or simulate the acquisition of spoken language via grounding in the visual modality have a long tradition but have gained momentum since around 2015 with the revival of neural networks. Current neural approaches are able to spot associations between the spoken and visual modality, and use these to represent speech and image/video data in a joint vector space. A major limitation of these works is the datasets used to train them. Most consist of static images or videos paired with spoken descriptions of what is depicted, and thus guarantee a strong correlation between speech and the visual world by construction. A child learning a language faces a very different and harder task: in the real world the coupling between the linguistic and the visual is much looser, and often contains confounds in the form of correlations with non-semantic aspects of the speech signal, such as voices of specific people and environmental sounds. The current study is a first step towards simulating such a naturalistic grounding scenario by using a dataset based on the children's cartoon Peppa Pig. We train a simple bi-modal architecture on the portion of the data consisting of naturalistic dialog between characters, and evaluate on segments containing descriptive narrations. Evaluation and analysis results indicate that despite the weak and confounded signal in this training data our model succeeds at learning aspects of the visual semantics of spoken language.
Opinion summarization is the automatic creation of text reflecting subjective information expressed in multiple documents, such as user reviews of a product. These short summaries can help users make better purchasing decisions by condensing useful information in hundreds or even thousands of reviews. However, due to the high cost of summary production, datasets large enough for supervised learning were absent until recently. This led to a variety of extractive methods that construct summaries from review sentences. However, these methods often produce incoherent summaries with unimportant details. This presentation will focus on abstractive approaches that generate summaries using a free vocabulary and thus can yield more coherent texts. We will discuss summarizers trained in unsupervised, few-shot, and supervised regimes. These models combine principles of latent probabilistic models, variational inference, and reinforcement learning. In our unsupervised model (Copycat), we treat the product and review representations as latent continuous variables. At test time, we induce summarizing representations and map them to summarizing texts. In the supervised model (SelSum), we decompose the system into a selector (posterior) and summarizer. The selector treats reviews as latent categorical variables and selects a summary-relevant subset during training. Only the small subset is passed to the summarizer, which results in computational and memory savings. The system is trained end-to-end using variational inference and reinforcement learning. Finally, we fit another selector (prior) that selects subsets of informative reviews to summarize at test time.
From the gestures that accompany speech to images in social media posts, humans effortlessly combine words with visual presentations. However, machines are not equipped to understand and generate such presentations due to people’s pervasive reliance on commonsense and world knowledge in relating words and images. I present a novel framework for modeling and learning a deeper combined understanding of text and images by classifying inferential relations to predict temporal, causal, and logical entailments in context. This enables systems to make inferences with high accuracy while revealing author expectations and social-context preferences. I proceed to design methods for generating text based on visual input that use these inferences to provide users with key requested information. The results show a dramatic improvement in the consistency and quality of the generated text by decreasing spurious information by half. Finally, I sketch my other projects on human-robot collaboration and conversational systems and describe my research vision: to build human-level communicative systems and grounded artificial intelligence by leveraging the cognitive science of language use.
Toy tasks such as interpreting the arithmetic language (Hupkes et al. 2018) or SCAN (Lake and Baroni 2018) are designed to help us detect and analyze compositional semantic behavior of machine learning models. Most results using these toy tasks have been achieved in analyzing and interpreting recurrent neural network models. However, the state of the art in NLP is now defined by the Transformer model (Vaswani et al. 2017) which, while in principle having the same theoretical expressive capacity as recurrent models, has a different structure and achieves different learning outcomes. The talk will share some observations on Transformers' compositional generalization behavior on toy tasks.
Structured representations are a powerful tool in machine learning, in particular for natural language: The discrete, compositional nature of words and sentences leads to natural combinatorial representations such as trees, sequences, segments, or alignments, among others. Such representations are at odds with deep neural networks, which conventionally perform smooth, soft computations, learning dense, inscrutable hidden representations. We present SparseMAP, a strategy for inferring differentiable combinatorial latent structures, alleviating the tension between discrete and continuous representations through sparsity. SparseMAP computes a globally-optimal combination of a very small number of structures, and can be extended to arbitrary factor graphs (LP-SparseMAP), only requiring access to local maximization oracles. Our strategy is fully deterministic and compatible with familiar gradient-based methods for training neural networks. We demonstrate sparse and structured neural hidden layers, with successful empirical results and visualization properties.
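As a rough illustration of how sparsity can bridge discrete and continuous outputs, the sketch below implements sparsemax, the unstructured transformation that SparseMAP is usually described as generalizing to structured outputs. It is not the SparseMAP algorithm from the talk: the function name and the example scores are illustrative assumptions, and the candidates here are plain categories rather than trees or alignments.

```python
def sparsemax(scores):
    """Euclidean projection of a score vector onto the probability simplex.

    Unlike softmax, the result can contain exact zeros, so only a few
    candidates keep non-zero weight. (Illustrative sketch, not SparseMAP.)
    """
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        # The k-th largest score stays in the support while this holds;
        # the last k for which it holds fixes the threshold tau.
        if 1 + k * zk > cumsum:
            tau = (cumsum - 1) / k
    return [max(s - tau, 0.0) for s in scores]


# Hypothetical scores for three candidates.
weights = sparsemax([1.0, 0.8, 0.1])
print(weights)  # the lowest-scoring candidate gets exactly zero weight
```

The output is still a valid probability distribution, but the weakest candidate is dropped entirely instead of receiving a small positive weight as it would under softmax.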
The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and provide concrete evidence of their capabilities and limitations. In the community building spirit of ELLIS-Amsterdam, we have formed three teams mixing Bachelor's, Master's, and PhD students and have contributed three tasks to the benchmark. In the seminar, we will briefly introduce the BIG-bench challenge and then the three teams will present their benchmarking tasks. The Metaphor Understanding task tests the capability of language models to understand English metaphors. It consists of two subtasks: in the first one, a language model is asked to map a metaphorical expression to its correct literal paraphrases; in the second one, the model needs to map a literal paraphrase to the corresponding metaphorical expression. The two subtasks form a new dataset that takes into account the lessons learned from existing models and benchmarks. The Implicit Relations task evaluates a model's ability to infer relations between characters from short passages of English narratives, where the relations are left implicit. In each example, a passage and a question of the form "What is X to Y?" are presented, and the model must select the correct relation. Our new dataset makes use of 25 labels ranging from familial relations to professional relations. Finally, the Fantasy Reasoning task assesses a language model's ability to reason within situations that go against common sense or in some way violate the rules of the real world; humans do this easily, e.g., when reading a science fiction book. We collect a corpus of contexts that language models are extremely unlikely to be familiar with, paired with yes-no questions.
Languages are powerful solutions to the complex coordination problems that arise between social agents. They provide stable, shared expectations about how the words we say correspond to the beliefs and intentions in our heads. However, to handle an ever-changing environment with new things to talk about and new partners to talk with, linguistic knowledge must be flexible: we give old words new meaning on the fly. In this talk, I will present work investigating the cognitive mechanisms that support this balance between stability and flexibility. First, I'll present a large corpus of natural-language communication in the classic "tangrams" task that allows us to quantitatively characterize the dynamics of ad hoc convention formation with a single partner. Second, I'll ask how these ad hoc conventions may be generalized to broader communities. I'll introduce a theoretical framework re-casting communication not as a transmission problem but as a meta-learning problem which may be formalized via hierarchical probabilistic inference: dynamics within an interaction are driven by ad hoc partner-specific adaptation while community-level conventions are gradually abstracted away from many interactions and provide a stable prior for new partners. Finally, I'll explore several proposals about how this computational framework can be implemented at scale to allow artificial agents to form natural-language conventions, adapting to human partners in real time. Taken together, this line of work aims to build a computational foundation for a more dynamic view of meaning and common ground in communication.
Quite surprisingly, exact maximum a posteriori (MAP) decoding of neural language generators frequently leads to low-quality results. Rather, most state-of-the-art results on language generation tasks are attained using beam search despite its overwhelmingly high search error rate. This implies that the MAP objective alone does not express the properties we desire in text, which merits the question: if beam search is the answer, what was the question? We frame beam search as the exact solution to a different decoding objective in order to gain insights into why high probability under a model alone may not indicate adequacy. We find that beam search enforces uniform information density in text, a property motivated by cognitive science. We suggest a set of decoding objectives that explicitly enforce this property and find that exact decoding with these objectives alleviates the problems encountered when decoding poorly calibrated language generation models. Additionally, we analyze the text produced using various decoding strategies and see that, in our neural machine translation experiments, the extent to which this property is adhered to strongly correlates with BLEU.
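To make the idea of uniform information density concrete, here is a minimal sketch that scores a candidate sequence by its log-probability minus a penalty on the variance of per-token surprisal. This is only one possible way to express such a preference, not necessarily the objectives presented in the talk; the function names, the weight `lam`, and the token probabilities are assumptions made for illustration.

```python
import math


def surprisals(token_probs):
    """Per-token surprisal in nats: -log p(token | context)."""
    return [-math.log(p) for p in token_probs]


def uid_variance(token_probs):
    """Variance of per-token surprisal; lower means more uniform information."""
    s = surprisals(token_probs)
    mean = sum(s) / len(s)
    return sum((x - mean) ** 2 for x in s) / len(s)


def regularized_score(token_probs, lam=1.0):
    """Sequence log-probability minus lam times the surprisal variance.

    lam is a hypothetical trade-off weight chosen for this example.
    """
    log_prob = sum(math.log(p) for p in token_probs)
    return log_prob - lam * uid_variance(token_probs)


# Two made-up candidates with the same total probability (0.0625).
even = [0.5, 0.5, 0.5, 0.5]      # information spread evenly over tokens
uneven = [1.0, 1.0, 0.25, 0.25]  # information concentrated in two tokens

print(regularized_score(even), regularized_score(uneven))
```

Both candidates are equally probable under the model, so the plain MAP objective cannot separate them; the regularized score prefers the one whose information is spread evenly.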
Conversational question answering (QA) requires the ability to correctly interpret a question in the context of previous conversation turns. This talk presents the current advancements in this field, specifically focusing on question rewriting approaches. The advantages of using question reformulation in conversational settings are manifold: (1) reuse of existing models, datasets and approaches for information retrieval; (2) more transparency in the prediction results; (3) ability to deploy the models in a distributed environment, where the individual components do not share a common representation. Our experiments demonstrate that question rewriting is not only effective at setting the state-of-the-art performance on conversational QA but also allows us to evaluate the robustness of question answering approaches.
The current benchmarking paradigm in AI has many issues: benchmarks saturate quickly, are susceptible to overfitting, contain exploitable annotator artifacts, have unclear or imperfect evaluation metrics, and do not measure what we really care about. I will talk about my work in trying to rethink the way we do benchmarking in AI, specifically in natural language processing, focusing mostly on the recently launched Dynabench platform.
Neural sequence generation systems oftentimes generate sequences by searching for the most likely sequence under the learnt probability distribution. This assumes that the most likely sequence, i.e. the mode, under such a model must also be the best sequence it has to offer (often in a given context, e.g. conditioned on a source sentence in translation). Recent findings in neural machine translation (NMT) show that the true most likely sequence oftentimes is empty under many state-of-the-art NMT models. This follows a large list of other pathologies and biases observed in NMT and other sequence generation models: a length bias, larger beams degrading performance, exposure bias, and many more. Many of these works blame the probabilistic formulation of NMT or maximum likelihood estimation. We provide a different view on this: it is mode-seeking search, e.g. beam search, that introduces many of these pathologies and biases, and such a decision rule is not suitable for the type of distributions learnt by NMT systems. We show that NMT models spread probability mass over many translations, and that the most likely translation oftentimes is a rare event. We further show that translation distributions do capture important aspects of translation well in expectation. Therefore, we advocate for decision rules that take into account the entire probability distribution and not just its mode. We provide one example of such a decision rule, and show that this is a fruitful research direction.
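One well-known decision rule that uses the whole distribution rather than its mode is minimum Bayes risk (MBR) decoding; the abstract does not say which rule the talk presents, so the sketch below is only an illustration of the general idea. The sampled translations, the unigram-F1 utility, and the function names are all assumptions made for this example.

```python
from collections import Counter


def unigram_f1(hyp, ref):
    """Toy utility: F1 over unigram overlap between two strings."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_decode(samples, utility=unigram_f1):
    """Pick the sample with the highest average utility against all samples.

    The samples stand in for draws from the model, so the average is a
    Monte Carlo estimate of expected utility under the model distribution.
    """
    def expected_utility(candidate):
        return sum(utility(candidate, other) for other in samples) / len(samples)

    return max(samples, key=expected_utility)


# Hypothetical samples: the empty string is the single most frequent
# outcome, but most probability mass sits on similar non-empty outputs.
samples = ["", "", "the cat sat", "the cat sat down", "a cat sat", "the cat sits"]

mode = Counter(samples).most_common(1)[0][0]
print(repr(mode))                  # the most frequent sample
print(repr(mbr_decode(samples)))   # the sample with highest expected utility
```

Here the mode is the empty string, while the MBR choice is a translation that agrees with most of the other samples, which mirrors the contrast the abstract draws between mode-seeking search and rules that consider the full distribution.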
Theoretical linguistics postulates abstract structures that successfully explain key aspects of language. However, the precise relation between abstract theoretical ideas and empirical data from language use is not always apparent. Here, we propose to empirically test abstract semantic theories through the lens of probabilistic pragmatic modelling. We consider the historically important case of quantity words (e.g., 'some', 'all'). Data from a large-scale production study seem to suggest that quantity words are understood via prototypes. But based on statistical and empirical model comparison, we show that a probabilistic pragmatic model that embeds a strict truth-conditional notion of meaning explains the data just as well as a model that encodes prototypes into the meaning of quantity words.
Language is the ultimate social medium: we communicate not just to convey information, but to entertain, gossip, console, convince, and so much more. Social sciences have long explored this connection between behavior and language to learn more about the people and societies who use it. As the amount of language data grows exponentially, traditional methods are no longer sufficient, but NLP can help address this issue. In addition to what language can tell us about society, we now find out what NLP can tell us about language. This combination opens up a wide range of exciting new applications and makes it possible to answer questions that were out of reach for years. However, it introduces NLP into areas that were previously the sole domain of social sciences, which also presents the challenge of finding a balance between methodology and theoretical motivation. In this talk, I will show how using NLP in social sciences can do more than we previously thought. I will illustrate NLP's role in social sciences with some ongoing research, and discuss a number of open questions and challenges for the field(s).
Context has a large influence on word meaning; not only local context, like in the combination of a predicate and its argument, but also global topical context. In computational models, this is routinely factored in, but the question of how to integrate different context influences is still open for theoretical accounts of sentence meaning. We start from Fillmore's "semantics of understanding", where he argues that listeners imagine the situation behind a given utterance using all their knowledge about words and the world. We formalize this idea as a "situation description system". This is a generative model of utterance understanding, which characterizes understanding as probabilistically describing the situation underlying the utterance.
Traditional semantic change detection algorithms rely on the assumption that a single word type representation is sufficient to model the different usages of a word. In this talk, I'll present a usage-based approach for the detection and analysis of lexical semantic change that relies on contextualised word representations obtained from a neural language model—one for every occurrence of a word of interest. After introducing this method, I'll discuss the validity of contextualised embeddings as word usage representations and show that they capture a variety of synchronic and diachronic linguistic phenomena. We'll see how this is reflected in the accuracy of the proposed approach tested on historical corpora in four languages, and compare ways to make the method more robust. Finally, I'll give an overview of the types of change detected by our usage-based approach and propose ideas to automate this finer-grained analysis.
Massive language models like GPT-3 can do amazing things with language, and this raises the interesting question of whether such text-based models could ever really "understand" it. One clear difference between GPT-understanding and human understanding is that GPT-3 doesn't learn to connect language to its actions or its perception of the world it inhabits. In this talk, I'll discuss an approach to language understanding in which a neural-network-based agent is trained to associate words and phrases with things that it learns to see and do. First, I'll provide some evidence for the promise of this approach by showing that the interactive, first-person perspective of an agent affords it a particular inductive bias that helps it to extend its training experience to generalize to out-of-distribution settings in ways that seem natural or 'systematic'. Second, I'll show that the amount of 'propositional' (i.e. linguistic) knowledge that emerges in the internal states of the agent as it interacts with the world can be increased significantly by it learning to make predictions about observations multiple timesteps into the future. Third, I'll show how meta-learning and an explicit multi-modal external memory can afford agents the ability to learn new words in a single experience with an object (i.e. fast-mapping) and to combine this fast knowledge with longstanding semantic knowledge to interpret novel instructions. Finally, I'll connect GPT and agent-based learning in a more literal way, by showing how an agent endowed with representations from a massive language model can achieve substantial (zero-shot) transfer from template-based language to noisy natural instructions given by humans with access to the agent's world.
Reading requires rapid recognition of words in printed text. Existing models of visual word recognition account for this mechanism by mapping the perceived letter strings into lexical units. In our work, we explore whether this process is mediated by the statistical properties of the input writing systems. Adopting an information-theoretic perspective, we analyze two languages from different families (English and Hebrew), and we find key differences in the available information contained in the letters in different parts of the word (beginning vs. ending) for converging on a lexical candidate. We test the implications of these cross-linguistic differences in a novel perceptually-constrained connectionist model of visual word recognition. The simulations account for a number of behavioral phenomena. First, our model predicts a tendency to fixate slightly closer to the beginning of the word. Second, we demonstrate cross-linguistic differences in the likelihood of fixating at other locations due to availability of information-content. Our model makes the novel prediction, which we confirmed by behavioral data, that words with an atypical distribution of information-content across letters are better recognized when fixating at an unusual location in a word. Overall, our research shows how the mechanism of visual word identification is tuned to the perceptually-constrained regularities of the writing systems, thereby driving proficient reading.
Models like BERT or GPT-2 can do amazing things with language, and this raises the interesting question of whether such text-based models could ever really "understand" it. One clear difference between BERT-understanding and human understanding is that BERT doesn't learn to connect language to its actions or its perception of the world it inhabits. In this talk, I'll discuss an alternative approach to language understanding in which a neural-network-based agent is trained to associate words and phrases with things that it learns to see and do. First, I'll provide some evidence for the promise of this approach by showing that the interactive, first-person perspective of an agent affords it a particular inductive bias that helps it to extend its training experience to generalize to out-of-distribution settings in ways that seem natural or 'systematic'. Second, I'll show that the amount of 'propositional' (i.e. linguistic) knowledge that emerges in the internal states of the agent as it interacts with the world can be increased significantly by it learning to make predictions about observations multiple timesteps into the future. This underlines some important common ground between the agent-based and BERT-style approaches: both attest to the power of prediction and the importance of context in acquiring semantic representations. Finally, I'll connect BERT and agent-based learning in a more literal way, by showing how an agent endowed with BERT representations can achieve substantial (zero-shot) transfer from template-based language to noisy natural instructions given by humans with access to the agent's world.
In this talk, I will discuss our parser for semantic graphs such as Abstract Meaning Representation (AMR). Our approach combines neural models with mechanisms from compositional semantic construction. Key to this approach is the Apply-Modify (AM) algebra, which we developed to both reflect linguistic principles and yield a simple parsing model. In particular, the AM algebra allows us to find consistent latent compositional structures for our training data, which is crucial when training a compositional parser. The parser then employs neural supertagging and dependency models to predict interpretable, meaningful operations that construct the semantic graph. The result is a semantic parser with strong performance across diverse graphbanks that also provides insights into the compositional patterns of the graphs.
Translation into morphologically-rich languages challenges neural machine translation (NMT) models with extremely sparse vocabularies where atomic treatment of surface forms is unrealistic. This problem is typically addressed by either pre-processing words into subword units or performing translation directly at the level of characters. The former is based on word segmentation algorithms optimized using corpus-level statistics with no regard to the translation task. The latter learns directly from translation data but requires rather deep architectures. In this paper, we propose to translate words by modeling word formation through a hierarchical latent variable model which mimics the process of morphological inflection. Our model generates words one character at a time by composing two latent representations: a continuous one, aimed at capturing the lexical semantics, and a set of (approximately) discrete features, aimed at capturing the morphosyntactic function, which are shared among different surface forms. Our model achieves better accuracy in translation into three morphologically-rich languages than conventional open-vocabulary NMT methods, while also demonstrating a better generalization capacity under low to mid-resource settings.
Children learn about the visual world from implicit supervision that language provides. Most children learn their language, at least to some extent, by observing the world. Recently released datasets of instructional videos are interesting as they can be considered a rough approximation of a child’s visual and linguistic experience -- in these videos, the narrator performs a high-level task (e.g., cooking pasta) while describing the steps involved in that task (e.g., boiling water). Moreover, these datasets pose challenges similar to those children need to address; for example, identifying activities relevant to the task (e.g., boiling water) and ignoring the rest (e.g., shaking head). I will present two recent projects where we study the interaction of visual and linguistic signals in these videos: (1) We show that using language and the structure of tasks is important in discovering action boundaries. (2) I will discuss how the visual signal improves the quality of unsupervised word translation, especially for dissimilar languages and where we do not have access to large corpora.
Understanding how tutors and students adapt to one another within Second Language (L2) learning is an important step in the development of better automated tutoring tools for L2 conversational practice. Such an understanding can not only inform conversational agent design, but can be useful for other pedagogic applications such as formative assessment, self reflection on tutoring practice, learning analytics, and conversation modelling for personalisation and adaptation. We compare L2 dialogue at different levels of student ability to fluent conversational dialogues in order to identify how adaptation takes place in terms of the linguistic complexity, lexical alignment and the dialogue act usage demonstrated by the speakers within the dialogue. Finally, with the end goal of an automated tutor in mind, student alignment levels are used to compare dialogues between student and human tutor with those where the tutor is an agent. We find that the adaptation exhibited by speakers in L2 dialogue differs from that in fluent dialogue, and changes depending on learner proficiency. Using alignment as a measure, we also find types of learner behaviour within automated L2 tutoring dialogues that differ from those present in human ones. We frame these findings as useful in identifying users who interact with tutoring agents as intended within future large online dialogue learning tools, with an emphasis on how these can be used to improve tutoring dialogue agents.
One long-standing puzzle in semantics is the ability of speakers to refer successfully in spite of holding different models of the world. This puzzle is famously illustrated by the cup/mug example: if two speakers disagree on whether a specific entity is a cup or a mug (i.e. if their interpretation functions differ), how can they align so that the entity can still be talked about? Another puzzle, coming to us through lexical and distributional semantics, is that word meaning seems to be infinitely flexible across utterances, indeed much more so than the traditional notion of sense would have it. This makes the alignment process between speakers even more unpredictable. In this talk, I will report on a series of experiments aiming at investigating differences in language use through distributional semantics techniques. I will sketch what such differences can tell us about the ability of speakers to align at a model-theoretic level.
Communication is made easier when speakers use language in similar ways. When speakers come to an interaction with slightly different languages they often adjust their languages to be more similar, in a process of alignment or accommodation. In this talk I consider interactions in which one speaker is a more experienced speaker than the other, such as interactions between a native and non-native speaker: in this case the native speaker could improve communication by accommodating to the non-native speaker. Accurate accommodation requires making inferences about the other's language, which we can model in a Bayesian framework. In a dialogue between two rational agents, a native speaker agent who accommodates and a non-native learner agent, the learner ends up with a simplified language, due to a reinforcing effect between an initially underinformed learner and an accommodating native speaker. This result gives a possible mechanism for the negative correlation between the proportion of non-native speakers of a language and language complexity.
Image captioning models are usually evaluated on their ability to describe a held-out set of images, not on their ability to generalize to unseen concepts. We study the problem of compositional generalization, which measures how well a model composes unseen combinations of concepts when describing images. State-of-the-art image captioning models show poor generalization performance on this task. To address this, we propose a multi-task model that combines caption generation and image--sentence ranking, and uses a decoding mechanism that re-ranks the captions according to their similarity to the image. This model is substantially better at generalizing to unseen combinations of concepts compared to state-of-the-art captioning models.
As natural language processing (NLP) techniques are increasingly being used in various day-to-day applications, there is growing awareness that the decisions we as researchers and developers make about our data, methods, and algorithms have immense impact in shaping our social lives. In this talk, I will outline the growing body of research on ethical implications of machine learning and NLP technologies, especially around questions about fairness and accountability of the models we build and deploy into the world. I will discuss ways in which machine learned NLP models may reflect, propagate, and sometimes amplify social stereotypes about people, potentially harming already marginalized groups. I will also briefly discuss various ways to address these issues, both through mitigation strategies and through increased transparency.
Clearly explaining a rationale for a classification decision to an end-user can be as important as the decision itself. Existing approaches for deep visual recognition are generally opaque and do not output any justification text; contemporary vision-language models can describe image content but fail to take into account class-discriminative image properties which justify visual predictions. In this talk, I will present my past and current work on Zero-Shot Learning, Vision and Language for Generative Modeling and Explainable Artificial Intelligence where we show (1) how to generalize image classification models to cases when no visual training data is available, (2) how to generate images and image features using detailed visual descriptions, and (3) how our models focus on discriminating properties of the visible object, jointly predict a class label, and explain why the predicted label is or is not chosen for the image.
Traditional research in NLG focuses on building better models and assessing their performance using clean, preprocessed and curated datasets, as well as standard automatic evaluation metrics. From a scientific point-of-view, this provides a controlled environment where different models can be compared and robust conclusions can be made. However, these controlled settings can drastically deviate from scenarios that happen when deploying systems in the real world. In this talk, I will focus on what happens *before* data is fed into NLG systems and what happens *after* we generate outputs. For the first part, I will focus on addressing heterogeneous data sources using tools from graph theory and deep learning. In the second part, I will talk about how to improve decision making from generated texts through Bayesian techniques, using Machine Translation post-editing as a test case.
Discrete structures such as dependency trees are often used to inject prior linguistic knowledge into statistical models. Many systems are built on top of a pipeline that starts with predicting a linguistic structure (e.g., syntactic or semantic representations) using a parser and then makes a task-specific prediction relying on this predicted structure (e.g., choosing a polarity label in sentiment analysis). Unfortunately, most parsers rely on large amounts of manually-annotated data for training, which is available only for a small fraction of languages and domains. Therefore, it is appealing to rely on other forms of supervision to learn the parameters of the parser. On the one hand, raw text data is available in many languages. It can be used for semi-supervised learning to complement a small set of available annotated data. On the other hand, even when annotated data is not available, assuming structured representations of sentences can be beneficial, as it provides inductive biases about the structure of the language. In this case, we want to induce task-specific structured representations of language in such a way as to benefit a given downstream task. In other words, an inductive bias is injected in the model, i.e. structures are good for natural languages, but no assumption is made about the appropriate content: the parser is trained end-to-end while optimizing performance on the downstream task. In practice, structures induced in this way tend not to resemble any accepted syntactic or semantic formalism, as the model is free to induce whichever structure is best suited to the particular downstream task.
In this talk, I will explain how both problems can be cast as learning the parameters of a statistical model with structured latent variables. During training, exact inference in these models requires marginalizing over latent variables which is intractable (e.g. summing over all dependency trees for a given sentence). Recently, differentiable Monte-Carlo estimation (i.e. the reparametrization trick) has been explored for training statistical models parametrized with neural networks. We follow this line of work and introduce a differentiable relaxation which we use to approximate samples and compute gradients with respect to the parser parameters. Our method (Differentiable Perturb-and-Parse) relies on differentiable dynamic programming over stochastically perturbed arc weights. We show the effectiveness of our approach on several tasks and datasets.
In this talk I would like to discuss how to integrate vision, language, and world knowledge in the context of (natural) language generation. I will start by discussing our most recent paper just accepted for publication at ACL 2019, and I will wrap up by contextualising my research interests for the next 3 years under my Marie-Curie project "IMAGINE: Improving language generation with world knowledge". In our ACL paper we propose to model the interaction between visual and textual features for multi-modal neural machine translation (MMT) through a latent variable model. This latent variable can be seen as a multi-modal stochastic embedding of an image and its description in a foreign language, and is used in a target-language decoder and also to predict image features. Importantly, our model formulation utilises visual and textual inputs during training but does not require that images be available at test time. I will show that our latent variable MMT formulation improves considerably over strong baselines, including a multi-task learning approach (Elliott and Kádár, 2017) and a conditional variational auto-encoder approach (Toyama et al., 2016). Regarding my research agenda for the next 3 years, I will discuss how to represent world knowledge by learning general-purpose multi-modal knowledge base representations, as well as how to incorporate these representations into (and improve) natural language generation.
How language, music, and other complex sequences are represented and computed in the human brain is a fundamental area of brain research that continues to stimulate as much research as it does vigorous debate. Some classical questions (and persistent puzzles) - highlighting the tension between neuroscience and cognitive science research - concern the role of structure and abstraction. Recent findings from human neuroscience, across various techniques (e.g. fMRI, MEG, ECoG), suggest that the brain supports hierarchically structured abstract representations.
New data on the role of brain rhythms show that such neural activity appears to underpin the tracking of structure-building operations. If the new approaches are on the right track, they invite closer relations between fields and better linking hypotheses between the foundational questions that animate both the neurosciences and the cognitive sciences.
Expressions like "most" or "big" are known to be vague, that is, their interpretation can be borderline and not generally agreed upon. Moreover, their use is context-dependent, in that an entity can be "big" in one context, but not in another. Interestingly, the meaning of these expressions is shown to be mostly quantitative when they are used to refer to entities (or sets of entities) in real-world contexts; for example, "few" is used by speakers only to refer to a given range of (low) proportions. By exploiting state-of-the-art, cognitively-inspired computational techniques, I tackle the issue of modelling the meaning of vague expressions from their use in grounded contexts, specifically Vision. In the first, longer part of the talk, I will provide an overview of my recent investigations on vague quantifiers ("few", "many", "all", etc.), both at the behavioural and computational level. In the second, shorter part, I will present ongoing research on gradable adjectives ("big", "small", etc.). Any feedback and comments are more than welcome!
When processing a text, both humans and machines must cope with ambiguity and non-compositionality. These phenomena represent a considerable challenge for NLP systems, while at the same time there is limited evidence from online measures on how humans solve them during natural reading. We approach these two problems as one and hypothesize that obtaining information on how humans process ambiguous and non-compositional phrases can improve the computational treatment of such instances. I will present experiments on using eye-tracking data to improve NLP models for two tasks: classifying the different roles of the pronoun It (nominal anaphoric, clause anaphoric and non-referential), as well as the identification of multi-word expressions. The experiments test whether gaze-based features improve the performance of state-of-the-art NLP models and the extent to which gaze features can be used to partially or entirely substitute the crafting of linguistic ones. The best-performing models are then analysed to better understand the cognitive processing of these linguistic phenomena and findings are discussed with respect to the E-Z model of reading and the processing stages during which disambiguation occurs.
Humans learn to understand speech from weak and noisy supervision: they manage to extract structure and meaning from speech by simply being exposed to utterances situated and grounded in their daily sensory experience. Emulating this remarkable skill has been the goal of numerous studies; however, researchers have often used severely simplified settings where either the language input or the extralinguistic sensory input, or both, are small-scale and symbolically represented. I present a series of studies on modelling visually grounded language understanding. Using variations of recurrent neural networks to model the temporal nature of spoken language, we examine how form- and meaning-based linguistic knowledge emerges from the input signal.
Distributional models and other supervised models of language focus on the structure of language and are an excellent way to learn general statistical associations between sequences of symbols. However, they do not capture the functional aspects of communication, i.e., that humans have intentions and use words to coordinate with others and make things happen in the real world. In this talk, I will present two studies on multi-agent emergent communication, where agents exist in some grounded environment and have to communicate about objects and their properties. This process requires the negotiation of linguistic meaning in this pragmatic context of achieving their goal. In the first study, I will present experiments in which agents learn to form a common ground that allows them to communicate about disentangled (i.e., feature norm) and entangled (i.e., raw pixels) input. In the second study, I will talk about properties of linguistic communication as arising in the context of self-interested agents.
Psychological stress is a crucial underlying reason for several physical and mental illnesses. The plethora of social media content provides an effective source to monitor stress, both long-term and short-term in nature. Depending on the context, analysis of stress in social media content could help assess customer feedback in businesses, bottlenecks in transportation systems or the psychological state of target populations. Situational stress in daily life scenarios such as traffic deserves research attention because: 1) it could potentially be an indicator of a persistent issue in the scenario, requiring corrective measures, and 2) such short-term stress can add up in the long term, negatively impacting individual well-being. This talk focuses on stress expressions in Tweets belonging to two domains: Airlines and London traffic. Using topic modeling and word vector representations, I will present an analysis of reasons for stress in these two domains. I will also discuss the features of the language used in high-stress travel Tweets, examining the presence of offensive words, sarcasm and negative emotions in detail, comparing and contrasting the findings with the features of Tweets belonging to other domains.
Neural language models can be evaluated by comparing their performance on task-based evaluations. We discuss several methods for analyzing the cognitive plausibility of computational language representations by comparing them to human brain data. We examined the performance of several evaluation metrics across four fMRI datasets. In this talk, I will present the results of this experiment and compare the performance to a random model. In addition, we discuss the effect of selecting voxels (i.e. relevant regions of the brain to examine) in a model-driven way.
In this talk I will give an overview of my research in machine learning for natural language processing. I will begin by introducing my work on imitation learning, a machine learning paradigm I have used to develop novel algorithms for structure prediction that have been applied successfully to a number of tasks such as semantic parsing, natural language generation and information extraction. Key advantages are the ability to handle large output search spaces and to learn with non-decomposable loss functions. Following this, I will discuss my work on zero-shot learning using neural networks, which enabled us to learn models that can predict labels for which no data was observed during training. I will conclude with my work on automated fact-checking, a challenge we proposed in order to stimulate progress in machine learning, natural language processing and, more broadly, artificial intelligence.
Structured representations are a powerful tool in machine learning, and in particular in natural language processing: The discrete, compositional nature of words and sentences leads to natural combinatorial representations such as trees, sequences, segments, or alignments, among others. At the same time, deep, hierarchical neural networks with latent representations are increasingly widely and successfully applied to language tasks. Deep networks conventionally perform smooth, soft computations resulting in dense hidden representations.
We study deep models with structured and sparse latent representations, without sacrificing differentiability, and thus enabling end-to-end gradient-based training. We demonstrate sparse and structured attention mechanisms, as well as latent computation graph structure learning, with successful empirical results on large scale problems including sentiment analysis, natural language inference, and neural machine translation.
Joint work with Claire Cardie, Mathieu Blondel, and André Martins.
Vlad Niculae, André F. T. Martins, Mathieu Blondel, Claire Cardie. SparseMAP: Differentiable sparse structured inference. In: Proc. of ICML 2018.
Vlad Niculae, André F. T. Martins, Claire Cardie. Towards dynamic computation graphs via sparse latent structure. In: Proc. of EMNLP 2018.
Vlad Niculae and Mathieu Blondel. A regularized framework for sparse and structured neural attention. In: Proc. of NIPS 2017.
Vector space semantics uses contexts of words to reason about their meanings; it is motivated by ideas of Firth and Harris, uses co-occurrence matrices to assign vectors to words, and has applications in diverse NLP tasks, from named entity recognition, to parsing, to disambiguation. Distributional semantics has been extended from words to sentences, using different grammatical formalisms, such as Lambek’s pregroups, the Lambek Calculus, and Combinatory Categorial Grammar. It has, however, not been considered for incremental and dialogue phenomena. These phenomena cover individual language processing, where hearers incrementally disambiguate word senses before sentences are even complete, and dialogue utterances, where more than one agent contributes to the unfolding of a sequence. In recent joint work with Purver, Hough, and Kempson (SemDial 2018), we defined an incremental vector space semantic model using the formalism of Dynamic Syntax and showed how it can incrementally assign a semantic plausibility measure as it performs word-by-word parses of utterances.
Word embeddings implicitly encode a rich amount of semantic knowledge. The extent to which they can capture relational information, however, is inherently limited. To address this limitation, we propose to learn relation vectors, describing how two words are related based on the distribution of words in sentences where these two words co-occur. In this way, we can capture aspects of word meaning that are complementary to what is captured by word embeddings. For example, by examining clusters of relation vectors, we observe that relational similarities can be identified at a more abstract level than with traditional word vector differences. These relation vectors can be used, among others, to enrich the input to neural text classification models. From a network of relation vectors, we can also learn relational word vectors. These are vector representations of word meaning which, unlike standard word vectors, capture relational properties rather than similarity. On a range of different tasks, we find that combining these relational word vectors with standard word vectors leads to improved results.
The requirement for neural machine translation (NMT) models to use fixed-size input and output vocabularies plays an important role in their accuracy and generalization capability. The conventional approach to cope with this limitation is performing translation based on a vocabulary of sub-word units that are predicted using statistical word segmentation methods. However, these methods have recently been shown to be prone to morphological errors, which lead to inaccurate translations. In this paper, we extend the source-language embedding layer of the NMT model with a bi-directional recurrent neural network that generates compositional representations of the source words from embeddings of character n-grams. Our model consistently outperforms conventional NMT with sub-word units on four translation directions with varying degrees of morphological complexity and data sparseness on the source side.
Besides making our thoughts more vivid and filling our communication with richer imagery, metaphor plays a fundamental structural role in our cognition, helping us to organise and project knowledge. For example, when we say “a well-oiled political machine”, we view the concept of political system in terms of a mechanism and transfer inferences from the domain of mechanisms onto our reasoning about political processes. Highly frequent in text, metaphorical language represents a significant challenge for natural language processing (NLP) systems. In this talk, I will first present a neural network architecture designed to capture the patterns of metaphorical use and its application to metaphor identification in text. I will then discuss how general-purpose lexical and compositional semantic models can be used to better understand metaphor processing in the human brain.
The task of learning language in a multisensory setting, with weak and noisy supervision, is of interest to scientists trying to understand the human mind as well as to engineers trying to build smart conversational agents or robots. In this talk I will present work on learning language from visually grounded speech using deep recurrent neural networks, and show that these models are able to extract linguistic knowledge at different levels of abstraction from the input signal. I will then describe analytical methods which allow us to better understand the nature and localization of representations emerging in such recurrent neural networks. I will also discuss the challenges inherent in fully unsupervised modeling of spoken language and present recent results on this problem.
Massive digital datasets, such as social media data, are a promising source to study social and cultural phenomena. They provide the opportunity to study language use and behaviour in a variety of social situations on a large scale. However, to fully leverage their potential for research in the social sciences, new computational approaches are needed. In this talk I will start with a general introduction to this research area. I will then focus on two case studies. First, I discuss how natural language processing can help to scale up social science research by applying social science theories to large-scale naturalistic text data. Drawing on the Social Identity Model of Collective Action, I discuss how we investigated the impact of participants’ motivations in the public health campaign Movember on the amount of campaign donations raised. Second, I will discuss how advances in machine learning can be used to develop better tools for sociolinguists. Existing approaches to identify variables that exhibit geographical variation (e.g., pop vs. soda vs. coke in the US) have several important drawbacks. I discuss a method to measure geographical language variation based on Reproducing Kernel Hilbert space (RKHS) representations. I then conclude by discussing my perspective on a few big challenges in this area.
The advent of efficiently trainable neural networks has led to striking improvements in the accuracy of next word prediction, machine translation and many other NLP tasks. It has also produced models that are much less interpretable. In particular, the role played by linguistic structure in sequence prediction and sequence-to-sequence models remains hard to gauge. What makes recurrent neural networks work so well for next word prediction? Do neural translation models learn to extract linguistic features from raw data and exploit them in any explicable way? In this talk I will give an overview of recent work, including my own, that aims at answering these questions. I will also present recent experiments on the importance of recurrency for capturing hierarchical structure with sequential models. Answering these questions is important to establish whether injecting linguistic knowledge into neural models is a promising research direction, and to understand how close we are to building intelligent systems that can truly understand and process human language.
Graph Convolutional Networks (GCNs) are an effective tool for modeling graph-structured data. We investigate their applicability in the context of natural language processing (machine translation and semantic role labeling) and modeling relational data (link prediction). For natural language processing, we introduce a version of GCNs suited to modeling syntactic and/or semantic dependency graphs and use them to construct linguistically-informed sentence encoders. We demonstrate that using them results in a substantial boost in machine translation performance and state-of-the-art results on semantic role labeling of English and Chinese. For link prediction, we propose Relational GCNs (RGCNs), GCNs developed specifically to deal with highly multi-relational data, characteristic of realistic knowledge bases. By explicitly modeling neighbourhoods of entities, RGCNs accumulate evidence over multiple inference steps in relational graphs and yield competitive results on standard link prediction benchmarks.
Joint work with Diego Marcheggiani, Michael Schlichtkrull, Joost Bastings, Thomas Kipf, Khalil Sima’an, Max Welling, Rianne van den Berg and Peter Bloem.
One of the major early achievements of linguistic typology was Greenberg’s (1963) discovery of implicational word order universals. While his work was based on a comparatively small sample of languages, later work, such as (Hawkins, 1983; Dryer, 1992), confirmed the existence of implicational word order universals on the basis of broader data collections. In a landmark study using modern quantitative, Bayesian comparative methods and data from four language families (Austronesian, Bantu, Indo-European and Uto-Aztecan), Dunn, Greenhill, Levinson, and Gray (2011) obtained results in stark contrast to the established view. While the authors did find evidence for word order correlations in many cases, the emerging pictures differed fundamentally between the four families. From this they concluded that word order tendencies are lineage-specific rather than universal. The authors did not explicitly compare their lineage-specific model with a universal model, though; they only qualitatively assessed the assumption of universal word-order correlations as not plausible given their findings. In the talk I will present a study addressing this issue by performing a Bayesian model comparison between a universal and a lineage-specific model. It turns out that there is solid support for universal word-order correlations between features that Dryer (1992) classified as "verb patterners", while other correlations are clearly lineage-specific. The broader methodological point to be made is that linguistic typology can benefit immensely from the tools of modern Bayesian statistics and the phylogenetic comparative method.
Recurrent neural networks (RNNs) are remarkably general learning systems that, given appropriate training examples, can handle complex sequential processing tasks, such as those frequently encountered in language and reasoning. However, RNNs are remarkably sample-heavy, typically requiring hundreds of thousands of examples to master tasks that humans can solve after just a few exposures. The first set of experiments I will present shows that modern RNNs, just like their ancestors from the nineties, have problems with systematic compositionality, that is, the ability to extract general rules from the training data, and apply them to new examples. As systematic compositionality allows very fast generalization to unseen cases, lack of compositional learning might be one of the roots of RNNs' thirst for training data. I will next present an ongoing study where RNNs must solve an apparently simple task where correct generalization relies on function composition. Current results suggest that a large random search in RNN space finds a small portion of models that converge on a (limited) compositional solution. However, it's not clear, for the time being, what is special about such models. The quest for compositional RNNs is still on.
Joint work with: Brenden Lake, Adam Liska, Germán Kruszewski
There are multiple contributors to language change that are external to the speaker, such as social or economic drivers, or even accidents of linguistic contact. However, there are also internal constraints that are key to shaping language evolution. In particular, psycholinguistic properties of language can predict which representations are acquired and stored with greatest fidelity by the speaker. For instance, we know that frequency, length, and the age at which a language structure is acquired all contribute to more stable storage and accurate reproduction of that structure. In this talk, I present a series of studies of the English vocabulary to demonstrate how internal cognitive processing has shaped the language, with analyses from corpora of diachronic vocabulary change and morphological change of the past tense forms of verbs, accompanied by laboratory studies of artificial language learning and change that show similar patterns to the diachronic data. These studies provide suggestions for how psycholinguistic properties of the language affect learning and cultural transmission across generations of speakers.
Logic Tensor Networks (LTN) is a theoretical framework and an experimental platform that integrates learning based on tensor neural networks with reasoning using first-order many-valued/fuzzy logic. LTN supports a wide range of reasoning and learning tasks over logical knowledge and data, allowing rich symbolic knowledge representation in first-order logic (FOL) to be combined with efficient data-driven machine learning based on the manipulation of real-valued vectors. In practice, FOL reasoning including function symbols is approximated through the usual iterative deepening of clause depth. Given data available in the form of real-valued vectors, logical soft and hard constraints and relations which apply to certain subsets of the vectors can be specified compactly in FOL. All the different tasks can be represented in LTN as a form of approximated satisfiability: reasoning can help improve learning, and learning from new data may revise the constraints, thus modifying reasoning. We apply LTNs to Semantic Image Interpretation (SII) in order to solve the following tasks: (i) the classification of an image's bounding boxes and (ii) the detection of the relevant part-of relations between objects. The results show that the use of background knowledge improves on the performance of purely data-driven machine learning methods.
If, when asked to "point at the mug", a physically unimpaired person seems unable to identify a potential referent that is standing in front of them, we might hesitate to ascribe knowledge of the meaning of the word "mug" to them, whatever else they may be able to tell us about mugs (e.g., "wooden mugs were produced probably from the oldest time, but most of them have not survived intact.", or "mugs are similar to cups"). And yet computational models of word meaning are good at the latter (e.g., by simply linking to knowledge repositories like Wikipedia, where the previous sentence about wooden mugs was taken from), and fail at the former. In this talk, I will present our recent work on learning a lexicon for referential interaction, where the referential aspects of word meaning are modelled through perceptual classifiers taking real images as input. I show that this representation complements other computational meaning representations such as those derived from distributional patterns, as well as decompositional or attribute-based representations. The lexicon is learned through (observation of) interaction, and is maintained and defended in interaction.
Shared tasks and (shared) corpora have proven themselves highly valuable for NLP. They have allowed us to evaluate our methods and compare them to others, helping us, our readers and reviewers to assess the quality of our methods. A downside of the widespread approach of comparing results on a gold dataset is that it is relatively common practice to draw conclusions based on the highest numbers without looking into what is behind them. However, what goes wrong and why can be highly relevant for end applications and, especially given the well-known difficulties with reproducing results, looking into the details of how and why results improve (or not) is highly relevant. In this talk, I will present two studies taking intrinsic evaluation one step further: 1) investigating error propagation in parsing and 2) diving into the evaluation of distributional semantic methods. Finally, I will outline the importance of deeper evaluation when NLP is used within digital humanities and digital social science.
Bandit structured prediction describes a stochastic optimization framework where learning is performed from partial feedback in the form of a task loss evaluation of a predicted output structure, without having access to gold standard structures. This framework has successfully been applied to various structured prediction tasks in NLP. In this talk I will focus on the application of bandit structured prediction to linear and non-linear machine translation models, where models are adapted to a new domain without seeing reference translations of the new domain. In simulation experiments we showed that partial information in the form of translation quality judgements on predicted translations is sufficient for model adaptation, even for feedback as weak as pairwise preference judgments.
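The idea of learning from a loss on a single predicted output can be sketched with a toy example. The code below assumes a softmax model over a handful of candidate outputs, samples one output, observes only its loss, and takes a stochastic gradient step on the expected loss. The function names, the one-parameter-per-output model, the learning rate, and the loss values are hypothetical and chosen only for illustration; they are not the models or settings used in the talk.

```python
import math
import random

def softmax(theta):
    """Turn scores into a probability distribution (numerically stabilised)."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def bandit_update(theta, sampled, loss, lr):
    """One stochastic gradient step on the expected loss.

    Only the loss of the sampled output is used; the gradient of
    log p(sampled) for a softmax model is onehot(sampled) - probs.
    """
    probs = softmax(theta)
    return [
        t - lr * loss * ((1.0 if i == sampled else 0.0) - p)
        for i, (t, p) in enumerate(zip(theta, probs))
    ]

def simulate(task_loss, steps=2000, lr=0.5, seed=0):
    """Simulated bandit feedback: the learner never sees the full loss vector at once."""
    rng = random.Random(seed)
    theta = [0.0] * len(task_loss)
    for _ in range(steps):
        probs = softmax(theta)
        y = rng.choices(range(len(theta)), weights=probs)[0]
        theta = bandit_update(theta, y, task_loss[y], lr)
    return softmax(theta)

# Hypothetical per-output losses; lower is better.
final_probs = simulate([0.9, 0.1, 0.5])
```

Because each sampled output is pushed down in proportion to its own loss, outputs with low loss tend to gain probability mass over time, although a single run of this toy simulation is stochastic.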
Linguistic quantifiers have been the realm of Formal Semantics. A lot is known about their formal properties and how those properties affect logical entailment, the licensing of polarity items, or scope ambiguities. Less is known about how quantifiers are acquired by children and even less about how computational models can learn to quantify objects in images. In this talk, we will report on our findings in this direction. First of all, we will explain why the task is interesting and challenging for a Language and Vision model. Secondly, we will report our evaluation of state-of-the-art neural network models against this task. Thirdly, we will compare the acquisition of quantifiers with the acquisition of cardinals. We will show that a model capitalizing on a 'fuzzy' measure of similarity is effective for learning quantifiers, whereas the learning of exact cardinals is better accomplished when information about number is provided.
While our ultimate aim in language processing might be making fully unsupervised models that optimally resemble the human way of learning, in many areas of NLP we are still heavily working with high degrees of supervision. Aiming at sparing annotation effort, distant supervision has been explored in the past 10 years as an alternative way to obtain (noisy) training data. This obviously doesn't take us directly to unsupervised models, but in addition to being a cheaper method of labelling instances, it also keeps us closer to the original data and it might give us an indication of the extent to which we can make do with rather spontaneous signals in the data. In the talk, I will present two experiments in the area of affective computing exploiting distant supervision: one on emotion detection, and one on stance detection. In both cases, we acquire silver labels for training by leveraging user-generated social media data, and play with different degrees of supervision in building our models. These are eventually tested on standard benchmarks and compared to state-of-the-art approaches. Our (mixed) results are also discussed in the light of whether supervision is truly necessary or not, and the value of silver versus gold data.
There are a number of interesting challenges in translation to morphologically rich languages (such as German or Czech) from a language like English. I will first present a linguistically rich English to German translation system generalizing over compounds, phenomena of inflectional morphology and syntactic issues, relying on preprocessing and postprocessing techniques. Following this, I'll present approaches addressing similar issues which have been tightly integrated into the Moses SMT decoder, and work well for multiple language pairs. Finally, time allowing, I'll present some thoughts on addressing these and further challenges within the framework of neural machine translation.
Over a century ago, Frege famously introduced the distinction between sense and reference that is one of the theoretical foundations of formal semantics. However, in practice formal semanticists took reference and ran away with it, either eschewing sense-related issues altogether or giving a referential treatment to them (with notable exceptions). In this talk, I argue that we need to go back to Fregean sense, and propose that data-induced, continuous representations provided by distributional semantics and deep learning methods provide a good methodological handle for sense-related aspects of meaning. I support these claims with results from both computational modeling and theoretical studies. I then revisit reference and present ongoing work on the challenging enterprise of tackling it with continuous methods, too.
In this talk I will describe the creation of RELPRON, a dataset of subject and object relative clauses for the evaluation of compositional distributional semantic models. The RELPRON task involves matching terms, such as 'wisdom', with representative properties in relative clause form, such as 'quality that experience teaches'. Relative clauses are an interesting test case for compositional distributional semantic models because they contain a closed class function word and a long-distance dependency. I will present results on RELPRON obtained within a type-based composition framework, using a variety of approaches to simplify the learning of higher-order tensors, as well as results obtained using neural networks for composition. In line with many existing datasets, vector addition provides a challenging baseline for RELPRON, but it is possible to match or improve on the baseline by finding appropriate training data and models for the semantics of the relative pronoun.
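To make the vector addition baseline concrete, the sketch below composes a property by summing the vectors of its content words and ranks candidate properties for a term by cosine similarity. The three-dimensional vectors are invented toy values, the second property is a made-up example, and leaving the relative pronoun out of the sum is an assumption of this sketch rather than a description of the setup used in the talk.

```python
import math

def add_vectors(vectors):
    """Compose by elementwise addition."""
    return [sum(components) for components in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy word vectors (not real distributional vectors).
vectors = {
    "wisdom": [1.0, 0.9, 0.1],
    "quality": [0.8, 0.5, 0.0],
    "experience": [0.9, 0.7, 0.1],
    "teaches": [0.6, 0.9, 0.2],
    "device": [0.1, 0.1, 0.9],
    "astronomer": [0.2, 0.3, 0.8],
    "uses": [0.3, 0.3, 0.5],
}

# Each property maps to its content words; the pronoun "that" is left out here.
properties = {
    "quality that experience teaches": ["quality", "experience", "teaches"],
    "device that astronomer uses": ["device", "astronomer", "uses"],
}

def rank_properties(term):
    """Rank properties by similarity between the term and the summed property vector."""
    scored = [
        (cosine(vectors[term], add_vectors([vectors[w] for w in words])), prop)
        for prop, words in properties.items()
    ]
    return [prop for _, prop in sorted(scored, reverse=True)]

ranked = rank_properties("wisdom")
```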
I will present three semantic parsing approaches for querying Freebase in natural language: 1) training only on a raw web corpus, 2) training on question-answer (QA) pairs and 3) training on both QA pairs and a web corpus. For 1 and 2, we conceptualise semantic parsing as a graph matching problem, where natural language graphs built using CCG/dependency logical forms are transduced to Freebase graphs. For 3, I will present a natural-logic approach combined with relation extraction based on convolutional neural networks. Our methods achieve state-of-the-art results on the WebQuestions and Free917 QA datasets.
In this talk, I will introduce a distributional model for computing the complexity of semantic composition, inspired by recent psycholinguistic research on sentence comprehension. I argue that the comprehension of a sentence is an incremental process driven by the goal of constructing a coherent representation of the event the speaker intends to communicate with the sentence. Semantic complexity is determined by a composition cost depending on the internal coherence of the event model being constructed and on the degree to which that event is activated by linguistic constructions. The model is tested on some psycholinguistic datasets for the study of sentence comprehension.
Figurative expressions such as "break the ice" occur frequently in natural language, even in apparently matter-of-fact texts such as newswire. Many of these expressions are also ambiguous between a figurative and a literal interpretation when taken out of context, e.g. "break the ice (...on the duck pond)" vs. "break the ice (...with wary adolescents)". Being able to automatically detect figurative usages in a given context is potentially useful for a number of tasks, ranging from corpus-based studies of phraseology to applications in automatic natural language processing. In this talk, I will present a method for automatically distinguishing figurative and literal usages of a target expression in a given context. The method exploits the fact that well-formed texts exhibit lexical cohesion, i.e. words are semantically related to other words in the vicinity.
Languages vary in the ways they carve up the world: where English uses the preposition on to describe support relations between objects, Dutch employs two prepositions, op and aan. Underlying such crosslinguistic variation, we also find tendencies in the way unique situations (objects, events) are grouped into linguistic semantic categories. For studying the variation and biases in the word meaning inventories of the world's languages, semantic typology has typically taken recourse to in-person elicitation. This process, however, is tedious, hard to apply to more abstract domains (the meaning of connectives, abstract verbs like think), and displays a researcher bias in the selection of stimuli. Instead, we propose to use parallel corpora to obtain judgments similar to in-person elicitations while avoiding these pitfalls. In my talk, I will describe our pipeline for approaching this issue, discuss the properties of the representational space it yields, and present preliminary results on a typologically diverse corpus of translated subtitles. (joint work with Suzanne Stevenson)
Although distributed semantics has been very successful in various NLP tasks in recent years, the fact that word meanings are represented as a distribution over other words exposes them to the so-called grounding problem. Multi-modal semantics attempts to address this by enhancing textual representations with extra-linguistic perceptual input. Such multi-modal models outperform language-only models on a range of tasks. In this talk I will discuss my PhD work, which has been concerned with advancing this idea, by (1) improving how we mix information through multi-modal fusion, (2) finding better ways to obtain perceptual information through deep learning and (3) obtaining representations for previously untried modalities such as auditory and even olfactory perception. I'll also briefly talk about a new multi-modal features toolkit that NLP researchers can use to experiment with visual and auditory representations.
Real-world data differs radically from the benchmark corpora we use in natural language processing (NLP). As soon as we apply our technology to the real world, performance drops. The reason for this problem is obvious: NLP models are trained on samples from a limited set of canonical varieties that are considered standard, most prominently English newswire. However, there are many dimensions, e.g., socio-demographics, language, genre, sentence type, etc., on which texts can differ from the standard. The solution is not obvious: we cannot control for all factors, and it is not clear how to best go beyond the current practice of training on homogeneous data from a single domain and language.
In this talk, I review the notion of canonicity, and how it shapes our community's approach to language. I argue for the use of fortuitous data. Fortuitous data is data out there that just waits to be harvested. It might be in plain sight, but is neglected (available but not used), or it is in raw form and first needs to be refined (almost ready). It is the unintended yield of a process, or side benefit. Examples include hyperlinks to improve sequence taggers, or annotator disagreement that contains actual signal informative for a variety of NLP tasks. More distant sources include the side benefit of behavior. For example, keystroke dynamics have been extensively used in psycholinguistics and writing research. But do keystroke logs contain actual signal that can be used to learn better NLP models? In this talk I will present recent (on-going) work on keystroke dynamics to improve shallow syntactic parsing. I will also present recent work on using bi-LSTMs for POS tagging, which combines the POS tagging loss function with an auxiliary loss function that accounts for rare words and achieves state-of-the-art performance across 22 languages.
Approaches to the computational analysis of discourse are sensitive to different aspects of textual structure. Some consider topical structure, others focus on rhetorical relations, and still others concern themselves with the functional structure of texts. In this talk I present a new way of approaching the task, following Smith's (2003) work on Discourse Modes. The central idea is that texts are made up of passages - usually several sentences or more - with different modes: Smith's typology includes Narrative, Description, Report, Information, and Argument/Commentary. Smith further identifies specific linguistic correlates of these modes, one of which pertains to the contributions made to the discourse by individual clauses of text. As a first step toward automatic Discourse Mode classification, we address the problem of classifying clauses of written English text according to the type of situation expressed by the clause. The situation entity (SE) classification task as construed here uses a scheme that includes, among others, events, states, abstract entities, and generic sentences. We find that a feature-driven approach to annotating SEs both improves annotation consistency and enriches the annotated data with useful semantic information, such as lexical aspect of verbs, genericity of main referents, and habituality of clauses. This data has been used to develop automatic classifiers for SE types as well as for other semantic phenomena.
Probabilistic grammars are an important model family in natural language processing. They are used in the modeling of many problems, most prominently in syntax and semantics. Latent-variable grammars are an extension of vanilla probabilistic grammars, introducing latent variables that inject additional information into the grammar by using learning algorithms in the incomplete data setting. In this talk, I will discuss work aimed at the development of (four) theoretically-motivated algorithms for the estimation of latent-variable grammars. I will discuss how we applied them to syntactic parsing, and more semantically-oriented problems such as machine translation, conversation modeling in online forums and question answering.
We introduce models for training embeddings that effectively integrate computer vision and natural language processing. The main novelty in our proposal is the utilisation of data that is not only multimodal, but both multimodal and multilingual. The intuition behind our models is that multiple sources of textual information might convey more "facts" about an image than a textual description in only one language. We discuss how incorporating translational evidence might be used in improving the quality of trained embeddings. We use the recently released multimodal Flickr30k dataset and evaluate our models on the tasks of sentence-to-image and image-to-sentence ranking. Our results demonstrate that including multilingual data leads to substantial improvement over the (monolingual) state-of-the-art.
Earlier work on an entirely diagrammatic formulation of quantum theory, which is soon to appear in the form of a textbook, has somewhat surprisingly guided us towards an answer for the following question: how do we produce the meaning of a sentence given that we understand the meaning of its words? This work has practical applications in the area of natural language processing, and the resulting tools have meanwhile outperformed existing methods.
Most research into learning linguistic representations focuses on the distributional hypothesis and exploits linguistic context to embed words in a semantic vector space. In this talk I address two important but often neglected aspects of language learning: compositionality and grounding. Words are important building blocks of language, but what makes language unique is putting them together: how can we build meaning representations of phrases and whole sentences out of representations of words? And how can we make sure that these representations connect to the extralinguistic world that we perceive and interact with? I will present a multi-task gated recurrent neural network model which sequentially processes words in a sentence and builds a representation of its meaning while making concurrent predictions about (a) which words are to follow and (b) what are the features of the corresponding visual scene. Learning is driven by feedback on this multi-task objective. I evaluate the induced representations on tasks such as image search, paraphrasing and textual inference, and present quantitative and qualitative analyses of how they encode certain aspects of language structure.
Incremental shift-reduce parsing with structured perceptron training is an established technique for continuous constituency parsing. The corresponding parsers are very fast and yield results that are close to the state of the art. In this talk, I present a shift-reduce parser which can produce discontinuous constituents by processing the input words out-of-order, a strategy known from dependency parsing. The system yields accurate results. Unlike previous grammar-based parsers for discontinuous constituents, it also achieves very high parsing speeds.
The CLS is happy to announce three talks about Statistical Models of Grammaticality studied within the SMOG project at King’s College London:
I will present two ideas aiming towards 'parser-generalization', the problem of enhancing a supervised grammar and parsing model to accurately cover a wider variety of linguistic data than has been seen in the labeled data, using additional unlabeled data. The first idea concerns the use of the Expectation Maximisation (EM) algorithm for semi-supervised learning of parsing models. While it has long been thought that EM is unsuitable for semi-supervised learning of structured models such as part-of-speech taggers and parsing models (Merialdo 1994, Elworthy 1994), I will present experiments under two grammar formalisms (PCFG and CCG) where we have successfully used EM for semi-supervised learning of generative parsers. These two grammars share the property of being 'strongly lexicalised', in that they have complex lexical categories, and a few simple grammar rules that combine them. This strong lexicalisation makes these grammars more suitable for learning from unlabeled data than grammars which are not lexicalised in this way. In this work, I make the assumption that all lexical category types in the language are *known* from the supervised part of the data, a reasonable assumption to make if the supervised data is large enough. In the second part of the talk, I will discuss ongoing work where we generate *new* category types, based on those types seen in the labeled data. We use a latent-variable PCFG model for generating new CCG types, under the assumption that there is a hidden structure in CCG lexical categories which can be uncovered using such a model.
One approach to representing images is as a bag-of-regions vector, but this representation discards potentially useful information about the spatial and semantic relationships between the parts of the image. The central argument of the research is that capturing and encoding the relationships between parts of an image will improve the performance of downstream tasks. A simplifying assumption throughout the talk is that we have access to gold-standard object annotations. The first part of this talk will focus on the Visual Dependency Representation: a novel structured representation that captures region-region relationships in an image. The key idea is that images depicting the same events are likely to have similar spatial relationships between the regions contributing to the event. We explain how to automatically predict Visual Dependency Representations using a modified graph-based statistical dependency parser. Our approach can exploit features from the region annotations and the description to predict the relationships between objects in an image. The second part of the talk will show that adopting Visual Dependency Representations of images leads to significant improvements on two downstream tasks. In an image description task, we find improvements compared to state-of-the-art models that use either external text corpora or region proximity to guide the generation process. Finally, in a query-by-example image retrieval task, we show improvements in Mean Average Precision and the precision of the top 10 images compared to a bag-of-terms approach.
It has been shown that syntax-based statistical machine translation can produce better translations than phrase-based translation, especially for language pairs with large structural differences. However, constituent-based models are complex and inefficient to implement. Dependency is regarded as a more compact and efficient formalism of syntax and a natural bridge from syntax to semantics, but early dependency-based SMT performed worse than the mainstream approaches. We proposed the first dependency-based SMT model whose performance is comparable with the state-of-the-art models in 2011, and then we developed several improvements based on this model. Recently we tried a new dependency-based transfer-and-generation approach, which we think is promising and which has produced positive results at this preliminary stage.
In Statistical Machine Translation (SMT), inference is performed over a high-complexity discrete distribution defined by the intersection between a translation hypergraph and a target language model. This distribution is too complex to be represented exactly and one typically resorts to approximation techniques either to perform optimisation - the task of searching for the optimum translation - or sampling - the task of finding a subset of translations that is statistically representative of the goal distribution. Beam search is an example of an approximate optimisation technique, where maximisation is performed over a heuristically pruned representation of the goal distribution. In this presentation, I will talk about exact optimisation (decoding) and sampling for SMT based on a form of rejection sampling. In this view, the intractable goal distribution is upper-bounded by a simpler (thus tractable) proxy distribution, which is then incrementally refined to be closer to the goal until the maximum is found, or until the sampling performance exceeds a certain level.
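The basic mechanism of sampling through an upper-bounding proxy can be sketched on a tiny discrete example. The code below is plain rejection sampling over a handful of items: it draws from a proxy `q` that dominates the unnormalised goal weights `p` and accepts each draw with probability p(x)/q(x). The item names and weights are hypothetical, and the incremental refinement of the proxy described in the talk is not implemented here.

```python
import random

def rejection_sample(p, q, n, seed=0):
    """Draw n samples distributed proportionally to p.

    p, q: dicts of unnormalised weights with q[x] >= p[x] for every x.
    Returns the accepted samples and the total number of proposals made.
    """
    assert all(q[x] >= p[x] for x in p), "proxy must upper-bound the goal"
    rng = random.Random(seed)
    items = list(q)
    weights = [q[x] for x in items]
    accepted, proposals = [], 0
    while len(accepted) < n:
        x = rng.choices(items, weights=weights)[0]  # propose from the proxy
        proposals += 1
        if rng.random() * q[x] < p[x]:  # accept with probability p(x) / q(x)
            accepted.append(x)
    return accepted, proposals

# Hypothetical goal weights and a loose uniform proxy that dominates them.
p = {"a": 0.5, "b": 0.3, "c": 0.0}
q = {"a": 0.6, "b": 0.6, "c": 0.6}

samples, proposals = rejection_sample(p, q, n=200)
```

The closer the proxy is to the goal, the fewer proposals are rejected, which is one way to read the refinement step. For optimisation, note that if the proxy's maximiser x* satisfies q(x*) = p(x*), then x* also maximises p, since p(x*) = q(x*) >= q(y) >= p(y) for every y.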
Establishing reference to objects in a shared environment is pivotal to successful communication. By using artificial scenarios where subjects need to choose referential expressions or guess the speaker's intended referent we can study the extent to which speakers and listeners reason pragmatically about each other's perspective. I will present a number of related empirical studies in this paradigm and discuss how different flavors of Bayesian cognitive modeling can be used to analyze the data.
Human beings are excellent at making sense of, and producing, structured sensory input. In particular, cognitive abilities for patterning seem crucial in allowing humans to perceive and produce language and music. The comparative approach, testing a range of animal species, can help unveil the evolutionary history of such patterning abilities. Here, I present experimental data and ongoing work in humans, chimpanzees, squirrel monkeys, pigeons and kea. I compare monkeys' and humans' skills in processing sensory dependencies in auditory stimuli, a crucial feature of human cognition. In order to infer individual and species-specific learning strategies and behavioral heuristics, I analyze data from visual touch-screen experiments in birds. Finally, as pattern production and perception abilities have been shown to differ in humans, the same divide could exist in other species. I present ongoing work using "electronic drums" I developed specifically for apes, which will allow chimpanzees to spontaneously produce non-vocal acoustic patterns.
Probabilistic and stochastic methods have been fruitfully applied to a wide variety of problems in grammar induction, natural language processing, and cognitive modeling. In this talk I will explore the possibility of developing a class of combinatorial semantic representations for natural languages that compute the semantic value of a (declarative) sentence as a probability value which expresses the likelihood of speakers of the language accepting the sentence as true in a given model. Such an approach to semantic representation treats the pervasive gradience of semantic properties as intrinsic to speakers' linguistic knowledge, rather than the result of the interference of performance factors in processing and interpretation. In order for this research program to succeed, it must solve three central problems. First, it needs to formulate a type system that computes the probability value of a sentence from the semantic values of its syntactic constituents. Second, it must incorporate a viable probabilistic logic into the representation of semantic knowledge in order to model meaning entailment. Finally, it must show how the specified class of semantic representations can be efficiently learned from the primary linguistic data available for language acquisition. This research has developed out of recent work with Alex Clark (Royal Holloway, London) on the application of computational learning theory to grammar induction.
Early verb learning in children seems an almost miraculous feat. In learning a verb, children must learn both the basic meaning of the event ("falling" or "eating"), as well as the allowable structures in their language for correctly communicating the participants in that event ("The glass fell", but not "She fell the glass"). Moreover, given the sparsity of evidence, children must be able to abstract away from specific usages they observe in order to use their knowledge of verbs productively. Finally, children must accomplish all this in the face of a high degree of variability among verbs, along with much noise and uncertainty in the input data, and with no explicit teaching. Do children require innate knowledge of language to accomplish this, or are general cognitive learning mechanisms sufficient to the task? We have developed various computational models of verb learning using unsupervised clustering over simple statistical properties of verb usages. Our findings support the claim that general learning mechanisms are able to acquire abstract knowledge of verbs and to generalize that knowledge to novel verbs and situations.