Discrete structures such as dependency trees are often used to inject prior linguistic knowledge into statistical models. Many systems are built on top of a pipeline that starts with predicting a linguistic structure (e.g., syntactic or semantic representations) using a parser and then makes a task-specific prediction relying on this predicted structure (e.g., choose a polarity label in sentiment analysis). Unfortunately, most parsers rely on large amounts of manually-annotated data for training, which is available only for a small fraction of languages and domains. Therefore, it is appealing to rely on other forms of supervision to learn the parameters of the parser. On the one hand, raw text data is available in many languages. It can be used for semi-supervised learning to complement a small set of available annotated data. On the other hand, even when annotated data is not available, assuming a structured representations of sentences can be beneficial, as it provides inductive biases about the structure of the language. In this case, we want to induce task-specific structured representations of language in such a way as to benefit a given downstream task. In other words, an inductive bias is injected in the model, i.e. structures are good for natural languages, but no assumption is made about the appropriate content: the parser is trained end-to-end while optimizing performance on the downstream task. In practice, structures induced in this way tend not to resemble any accepted syntactic or semantic formalism as it lets the model induce the one which is better suited for the particular downstream task. In this talk, I will explain how both problems can be cast as learning the parameters of a statistical model with structured latent variables. During training, exact inference in these models requires marginalizing over latent variables which is intractable (e.g. summing over all dependency trees for a given sentence). Recently, differentiable Monte-Carlo estimation (i.e. the reparametrization trick) has been explored for training statistical models parametrized with neural networks. We follow this line of work and introduce a differentiable relaxation which we use to approximate samples and compute gradients with respect to the parser parameters. Our method (Differentiable Perturb-and-Parse) relies on differentiable dynamic programming over stochastically perturbed arc weights. We show the effectiveness of our approach on several tasks and datasets.
How language, music, and other complex sequences are represented and computed in the human brain is a fundamental area of brain research that continues to stimulate as much research as it does vigorous debate. Some classical questions (and persistent puzzles) - highlighting the tension between neuroscience and cognitive science research - concern the role of structure and abstraction. Recent findings from human neuroscience, across various techniques (e.g. fMRI, MEG, ECoG), suggest that the brain supports hierarchically structured abstract representations.
New data on the role of brain rhythms show that such neural activity appears to underpin the tracking of structure-building operations. If the new approaches are on the right track, they invite closer relations between fields and better linking hypotheses between the foundational questions that animate both the neurosciences and the cognitive sciences.
Expressions like "most" or "big" are known to be vague, that is, their interpretation can be borderline and not generally-agreed. Moreover, their use is context-dependent, in a way that an entity can be "big" in one context, but not in another. Interestingly, the meaning of these expressions is shown to be mostly quantitative when they are used to refer to entities (or sets of entities) in real-world contexts; for example, "few" is used by speakers only to refer to a given range of (low) proportions. By exploiting state-of-the-art, cognitively-inspired computational techniques, I tackle the issue of modelling the meaning of vague expressions from their use in grounded contexts, specifically Vision. In the first, longer part of the talk, I will provide an overview of my recent investigations on vague quantifiers ("few", "many", "all", etc.), both at the behavioural and computational level. In the second part, shorter, I will present ongoing research on gradable adjectives ("big", "small", etc.). Any feedback and comment is more than welcome!
When processing a text, both humans and machines must cope with ambiguity and non-compositionality. These phenomena represent a considerable challenge for NLP systems, while at the same time there is limited evidence from online measures on how humans solve them during natural reading. We approach these two problems as one and hypothesize that obtaining information on how humans process ambiguous and non-compositional phrases can improve the computational treatment of such instances. I will present experiments on using eye-tracking data to improve NLP models for two tasks: classifying the different roles of the pronoun It (nominal anaphoric, clause anaphoric and non-referential), as well as the identification of multi-word expressions. The experiments test whether gaze-based features improve the performance of state-of-the-art NLP models and the extent to which gaze features can be used to partially or entirely substitute the crafting of linguistic ones. The best-performing models are then analysed to better understand the cognitive processing of these linguistic phenomena and findings are discussed with respect to the E-Z model of reading and the processing stages during which disambiguation occurs.
Humans learn to understand speech from weak and noisy supervision: they manage to extract structure and meaning from speech by simply being exposed to utterances situated and grounded in their daily sensory experience. Emulating this remarkable skill has been the goal of numerous studies; however researchers have often used severely simplified settings where either the language input or the extralinguistic sensory input, or both, are small-scale and symbolically represented. I present a series of studies on modelling visually grounded language understanding. Using variations of recurrent neural networks to model the temporal nature of spoken language, we examine how form and meaning-based linguistic knowledge emerges from the input signal.
Distributional models and other supervised models of language focus on the structure of language and are an excellent way to learn general statistical associations between sequences of symbols. However, they do not capture the functional aspects of communication, i.e., that humans have intentions and use words to coordinate with others and make things happen in the real world. In this talk, I will present two studies on multi-agent emergent communication, where agents exist in some grounded environment and have to communicate about objects and their properties. This process requires the negotiation of linguistic meaning in this pragmatic context of achieving their goal. In the first study, I will present experiments in which agents learn to form a common ground that allow them to communicate about disentangled (i.e., feature norm) and entangled (i.e., raw pixels) input. In the second study, I will talk about properties of linguistic communication as arising in the context of self-interested agents.
Psychological stress is a crucial underlying reason for several physical and mental illnesses. The plethora of social media content provides an effective source to monitor stress, both long-term and short-term in nature. Depending on the context, analysis of stress in social media content could help assess customer feedback in businesses, bottlenecks in transportation systems or psychological state of target populations. Situational stress in daily life scenarios such as traffic deserves research attention because : 1) it could potentially be an indicator of a persistent issue in the scenario, requiring corrective measures 2) such short term stress can add up in the long term negatively impacting individual well-being. This talk focuses on stress expressions in Tweets belonging to two domains: Airlines and London traffic. Using topic modeling and word vector representations, I will present an analysis of reasons for stress in these two domains. I will also discuss the features of the language used in high-stress travel Tweets, examining the presence of offensive words, sarcasm and negative emotions in detail, comparing and contrasting the findings with the features of Tweets belonging to other domains.
Neural language models can be evaluated by comparing their performance on task-based evaluations. We discuss several methods for analyzing the cognitive plausibility of computational language representations by comparing them to human brain data. We examined the performance of several evaluation metrics across four fMRI datasets. In this talk, I will present the results of this experiment and compare the performance to a random model. In addition, we discuss the effect of selecting voxels (i.e. relevant regions of the brain to examine) in a model-driven way.
In this talk I will give an overview of my research in machine learning for natural language processing. I will begin by introducing my work on imitation learning, a machine learning paradigm I have used to develop novel algorithms for structure prediction that have been applied successfully to a number of tasks such as semantic parsing, natural language generation and information extraction. Key advantages are the ability to handle large output search spaces and to learn with non-decomposable loss functions. Following this, I will discuss my work on zero-shot learning using neural networks, which enabled us to learn models that can predict labels for which no data was observed during training. I will conclude with my work on automated fact-checking, a challenge we proposed in order to stimulate progress in machine learning, natural language processing and, more broadly, artificial intelligence.
Structured representations are a powerful tool in machine learning, and in particular in natural language processing: The discrete, compositional nature of words and sentences leads to natural combinatorial representations such as trees, sequences, segments, or alignments, among others. At the same time, deep, hierarchical neural networks with latent representations are increasingly widely and successfully applied to language tasks. Deep networks conventionally perform smooth, soft computations resulting in dense hidden representations.
We study deep models with structured and sparse latent representations, without sacrificing differentiability, and thus enabling end-to-end gradient-based training. We demonstrate sparse and structured attention mechanisms, as well as latent computation graph structure learning, with successful empirical results on large scale problems including sentiment analysis, natural language inference, and neural machine translation.
Joint work with Claire Cardie, Mathieu Blondel, and André Martins.
Vlad Niculae, André F. T. Martins, Mathieu Blondel, Claire Cardie. SparseMAP: Differentiable sparse structured inference. In: Proc. of ICML 2018.
Vlad Niculae, André F. T. Martins, Claire Cardie. Towards dynamic computation graphs via sparse latent structure. In: Proc. of EMNLP 2018.
Vlad Niculae and Mathieu Blondel. A regularized framework for sparse and structured neural attention. In: Proc. of NIPS 2017.
Vector space semantics uses contexts of words to reason about their meanings; it is motivated by ideas of Firth and Harris, uses co-occurrence matrices to assign vectors to words, and has applications in diverse NLP tasks, from named entity recognition, to parsing, to disambiguation. Distributional semantics has been extended from words to sentences, using different grammatical formalisms, such as Lambek’s pregroups, the Lambek Calculus, and Combinatorial Categorial Grammar. It has, however, not been considered for incremental and dialogue pheonema. These phenomena cover individual language processing, where hearers incrementally disambiguate word senses before sentences are even complete, and dialogue utterances, where more than one agent contribute to the unfolding of a sequence. In recent joint work with Purver, Hough, and Kempson (SemDial 2018), we defined an incremental vector space semantic model using the formalism of Dynamic Syntax and showed how it can incrementally assign a semantic plausibility measure as it performs word-by-word parses of utterances.
Word embeddings implicitly encode a rich amount of semantic knowledge. The extent to which they can capture relational information, however, is inherently limited. To address this limitation, we propose to learn relation vectors, describing how two words are related based on the distribution of words in sentences where these two words co-occur. In this way, we can capture aspects of word meaning that are complementary to what is captured by word embeddings. For example, by examining clusters of relation vectors, we observe that relational similarities can be identified at a more abstract level than with traditional word vector differences. These relation vectors can be used, among others, to enrich the input to neural text classification models. From a network of relation vectors, we can also learn relational word vectors. These are vector representations of word meaning which, unlike standard word vectors, capture relational properties rather than similarity. On a range of different tasks, we find that combining these relational word vectors with standard word vectors leads to improved results.
The requirement for neural machine translation (NMT) models to use fixed-size input and output vocabularies plays an important role in their accuracy and generalization capability. The conventional approach to cope with this limitation is performing translation based on a vocabulary of sub-word units that are predicted using statistical word segmentation methods. However, these methods have recently shown to be prone to morphological errors, which lead to inaccurate translations. In this paper, we extend the source-language embedding layer of the NMT model with a bi-directional recurrent neural network that generates compositional representations of the source words from embeddings of character n-grams. Our model consistently outperforms conventional NMT with sub-word units on four translation directions with varying degrees of morphological complexity and data sparseness on the source side.
Besides making our thoughts more vivid and filling our communication with richer imagery, metaphor plays a fundamental structural role in our cognition, helping us to organise and project knowledge. For example, when we say “a/well-oiled/political/machine/”, we view the concept of/political system/in terms of a/mechanism/and transfer inferences from the domain of mechanisms onto our reasoning about political processes. Highly frequent in text, metaphorical language represents a significant challenge for natural language processing (NLP) systems. In this talk, I will first present a neural network architecture designed to capture the patterns of metaphorical use and its application to metaphor identification in text. I will then discuss how general-purpose lexical and compositional semantic models can be used to better understand metaphor processing in the human brain.
The task of learning language in a multisensory setting, with weak and noisy supervision, is of interest to scientists trying to understand the human mind as well as to engineers trying to build smart conversational agents or robots. In this talk I will present work on learning language from visually grounded speech using deep recurrent neural networks, and show that these models are able to extract linguistic knowledge at different levels of abstraction from the input signal. I then describe analytical methods which allow us to better understand the nature and localization or representations emerging in such recurrent neural networks. I will also discuss the challenges inherent in fully unsupervised modeling of spoken language and present recent results on this problem.
Massive digital datasets, such as social media data, are a promising source to study social and cultural phenomena. They provide the opportunity to study language use and behaviour in a variety of social situations on a large scale. However, to fully leverage their potential for research in the social sciences, new computational approaches are needed. In this talk I will start with a general introduction to this research area. I will then focus on two case studies. First, I discuss how natural language processing can help to scale-up social science research by applying social science theories on large scale naturalistic text data. I discuss how we investigated the impact of participants’ motivations in the public health campaign Movember on the amount of campaign donations raised based on the Social Identity Model of Collective Action. Second, I will discuss how advances in machine learning can be used to develop better tools for sociolinguists. Existing approaches to identify variables that exhibit geographical variation (e.g., pop vs. soda vs. coke in the US) have several important drawbacks. I discuss a method to measure geographical language variation based on Reproducing Kernel Hilbert space (RKHS) representations. I then conclude with discussing my perspective on a few big challenges in this area.
The advent of efficiently trainable neural networks has led to striking improvements in the accuracy of next word prediction, machine translation and many other NLP tasks. It has also produced models that are much less interpretable. In particular, the role played by linguistic structure in sequence prediction and sequence-to-sequence models remains hard to gauge. What makes recurrent neural networks work so well for next word prediction? Do neural translation models learn to extract linguistic features from raw data and exploit them in any explicable way? In this talk I will give an overview of recent work, including my own, that aims at answering these questions. I will also present recent experiments on the importance of recurrency for capturing hierarchical structure with sequential models. Answering these questions is important to establish whether injecting linguistic knowledge into neural models is a promising research direction, and to understand how close we are to building intelligent systems that can truly understand and process human language.
Graph Convolutional Networks (GCNs) is an effective tool for modeling graph structured data. We investigate their applicability in the context of natural language processing (machine translation and semantic role labelling) and modeling relational data (link prediction). For natural language processing, we introduce a version of GCNs suited to modeling syntactic and/or semantic dependency graphs and use them to construct linguistically-informed sentence encoders. We demonstrate that using them results in a substantial boost in machine translation performance and state-of-the-art results on semantic role labeling of English and Chinese. For link prediction, we propose Relational GCNs (RGCNs), GCNs developed specifically to deal with highly multi-relational data, characteristic of realistic knowledge bases. By explicitly modeling neighbourhoods of entities, RGCNs accumulate evidence over multiple inference steps in relational graphs and yield competitive results on standard link prediction benchmarks.
Joint work with Diego Marcheggiani, Michael Schlichtkrull, Joost Bastings, Thomas Kipf, Khalil Sima’an, Max Welling, Rianna van den Berg and Peter Bloem.
One of the major early achievements of linguistic typology was Greenberg’s (1963) discovery of implicational word order universals. While his work was based on a comparatively small sample of languages, later work, such as (Hawkins, 1983; Dryer, 1992), confirmed the existence of implicational word order universals on the basis of broader data collections. In a landmark study using modern quantitative, Bayesian comparative methods and data from four language families (Austronesian, Bantu, Indo-European and Uto-Aztecan), Dunn, Greenhill, Levinson, and Gray (2011) established results being in stark contrast to the established view. While the authors did find evidence for word order correlations in many cases, the emerging pictures differed fundamentally between the four families. From this they concluded that word order tendencies are lineage specific rather than universal. The authors did not explicitly compare their lineage-specific model with a universal model though; they only qualitatively assessed the assumption of universal word-order correlations as not plausible given their findings. In the talk I will present a study addressing this issue via performing a Bayesian model comparison between a universal and a lineage-specific model. It turns out that there is solid support for universal word-order correlations between features that Dryer (1992) classified as "verb patterners", while other correlations are clearly lineage specific. The broader methodological point to be made is that linguistic typology can immensely benefit from the tools of modern Bayesian statistics and the phylogenetic comparative method.
Recurrent neural networks (RNNs) are remarkably general learning systems that, given appropriate training examples, can handle complex sequential processing tasks, such as those frequently encountered in language and reasoning. However, RNNs are remarkably sample-heavy, typically requiring hundreds of thousands of examples to master tasks that humans can solve after seeing just a few exposures. The first set of experiments I will present shows that modern RNNs, just like their ancestors from the nineties, have problems with systematic compositionality, that is, the ability to extract general rules from the training data, and apply them to new examples. As systematic compositionality allows very fast generalization to unseen cases, lack of compositional learning might be one of the roots of RNN's training data thirst. I will next present an ongoing study where RNNs must solve an apparently simple task where correct generalization relies on function composition. Current results suggest that a large random search in RNN space finds a small portion of models that converged on a (limited) compositional solution. However, it's not clear, for the time being, what is special about such models. The quest for compositional RNNs is still on.
Joint work with: Brenden Lake, Adam Liska, Germán Kruszewski
There are multiple contributors to language change that are external to the speaker, such as social or economic drivers, or even accidents of linguistic contact. However, there are also internal constraints that are key to shaping language evolution. In particular, psycholinguistic properties of language can predict which representations are acquired and stored with greatest fidelity by the speaker. For instance, we know that frequency, length, and the age at which a language structure is acquired all contribute to more stable storage and accurate reproduction of that structure. In this talk, I present a series of studies of the English vocabulary to demonstrate how internal cognitive processing has shaped the language, with analyses from corpora of diachronic vocabulary change and morphological change of the past tense forms of verbs, accompanied by laboratory studies of artificial language learning and change that show similar patterns to the diachronic data. These studies provide suggestions for how psycholinguistic properties of the language affect learning and cultural transmission across generations of speakers.
Logic Tensor Networks (LTN) is a theoretical framework and an experimental platform that integrates learning based on tensor neural networks with reasoning using first-order many-valued/fuzzy logic. LTN supports a wide range of reasoning and learning tasks with logical knowledge and data using rich symbolic knowledge representation in first-order logic (FOL) to be combined with efficient data-driven machine learning based on the manipulation of real-valued vectors. In practice, FOL reasoning including function symbols is approximated through the usual iterative deepening of clause depth. Given data available in the form of real-valued vectors, logical soft and hard constraints and relations which apply to certain subsets of the vectors can be specified compactly in FOL. All the different tasks can be represented in LTN as a form of approximated satisfiability, reasoning can help improve learning, and learning from new data may revise the constraints thus modifying reasoning. We apply LTNs to Semantic Image Interpretation (SII) in order to solve the following tasks: (i) the classification of an image's bounding boxes and (ii) the detection of the relevant part-of relations between objects. The results shows that the usage of background knowledge improves the performance of pure machine learning data driven methods.
If, when asked to "point at the mug", a physically unimpaired person seems unable to identify a potential referent that is standing in front of them, we might hesitate to ascribe knowledge of the meaning of the word "mug" to them, whatever else they may be able to tell us about mugs (e.g., "wooden mugs were produced probably from the oldest time, but most of them have not survived intact.", or "mugs are similar to cups"). And yet computational models of word meaning are good at the latter (e.g., by simply linking to knowledge repositories like wikipedia, where the previous sentence about wooden mugs was taken from), and fail at the former. In this talk, I will present our recent work at learning a lexicon for referential interaction, where the referential aspects of word meaning are modelled through perceptual classifiers taking real images as input. I show that this representation complements other computational meaning representations such as those derived from distributional patterns, as well as decompositional or attribute-based representations. The lexicon is learned through (observation of) interaction, and is maintained and defended in interaction.
Shared tasks and (shared) corpora have proven themselves highly valuable for NLP. They have allowed us to evaluate our methods and compare them to others helping us, our readers and reviewers to assess the quality of our methods. A downside of the wide-spread approach of comparing results on a gold dataset is that it is relative common practice to draw conclusions based on the highest numbers without looking into what is behind this. However, what goes wrong and why can be highly relevant for end-applications and, specially given the well-known difficulties with reproducing results, looking into the details of how and why results improve (or not) is highly relevant. In this talk, I will present two studies taking intrinsic evaluation one step further 1) investigating error propagation in parsing and 2) diving in the evaluation of distributional semantic methods. Finally, I will outline the importance of deeper evaluation when NLP is used within digital humanities and digital social science.
Bandit structured prediction describes a stochastic optimization framework where learning is performed from partial feedback in form of a task loss evaluation to a predicted output structure, without having access to gold standard structures. This framework has successfully been applied to various structured prediction tasks in NLP. In this talk I will focus on the application of bandit structured prediction to linear and non-linear machine translation models where models are adapted to a new domain without seeing reference translations of the new domain. In simulation experiments we showed that partial information in form of translation quality judgements on predicted translations is sufficient for model adaptation, even for feedback as weak as pairwise preference judgments.
Linguistics quantifiers have been the realm of Formal Semantics. A lot is known about their formal properties and how those properties affect logical entailment, the licensing of polarity item, or scope ambiguities. Less is known about how quantifiers are acquired by children and even less about how computational models can learn to quantify objects in images. In this talk, we will report on our findings in this direction. First of all, we will explain why the task is interesting and challenging for a Language and Vision model. Secondly, we will report our evaluation of state-of-the-art neural network models against this task. Thirdly, we will compare the acquisition of quantifiers with the acquisition of cardinals. We will show that a model capitalizing on a `fuzzy' measure of similarity is effective for learning quantifiers, whereas the learning of exact cardinals is better accomplished when information about number is provided.
While our ultimate aim in language processing might be making fully unsupervised models that optimally resemble the human way of learning, in many areas of NLP we are still heavily working with high degrees of supervision. Aiming at sparing annotation effort, distant supervision has been explored in the past 10 years as an alternative way to obtain (noisy) training data. This obviously doesn't take us directly to unsupervised models, but in addition to being a cheaper method to labelling instances, it also keeps us closer to the original data and it might give us an indication into the extent to which we can make do with rather spontaneous signals in the data. In the talk, I will present two experiments in the area of affective computing exploiting distant supervision: one on emotion detection, and one on stance detection. In both cases, we acquire silver labels for training leveraging user generated social media data, and play with different degrees of supervision in building our models. These are eventually tested on standard benchmarks and compared to state-of-the-art approaches. Our (mixed) results are discussed also in the light of whether supervision is truly necessary or not, and the value of silver versus gold data.
There are a number of interesting challenges in translation to morphologically rich languages (such as German or Czech) from a language like English. I will first present a linguistically rich English to German translation system generalizing over compounds, phenomena of inflectional morphology and syntactic issues, relying on preprocessing and postprocessing techniques. Following this, I'll present approaches addressing similar issues which have been tightly integrated into the Moses SMT decoder, and work well for multiple language pairs. Finally, time allowing, I'll present some thoughts on addressing these and further challenges within the framework of neural machine translation.
Over a century ago, Frege famously introduced the distinction between sense and reference that is one of the theoretical foundations of formal semantics. However, in practice formal semanticists took reference and ran away with it, either eschewing sense-related issues altogether or giving a referential treatment to them (with notable exceptions). In this talk, I argue that we need to go back to Fregean sense, and propose that data-induced, continuous representations provided by distributional semantics and deep learning methods provide a good methodological handle for sense-related aspects of meaning. I support these claims with results from both computational modeling and theoretical studies. I then revisit reference and present ongoing work on the challenging enterprise of tackling it with continuous methods, too.
In this talk I will describe the creation of RELPRON, a dataset of subject and object relative clauses for the evaluation of compositional distributional semantic models. The RELPRON task involves matching terms, such as 'wisdom', with representative properties in relative clause form, such as 'quality that experience teaches'. Relative clauses are an interesting test case for compositional distributional semantic models because they contain a closed class function word and a long-distance dependency. I will present results on RELPRON obtained within a type-based composition framework, using a variety of approaches to simplify the learning of higher-order tensors, as well as results obtained using neural networks for composition. In line with many existing datasets, vector addition provides a challenging baseline for RELPRON, but it is possible to match or improve on the baseline by finding appropriate training data and models for the semantics of the relative pronoun.
I will present three semantic parsing approaches for querying Freebase in natural language 1) training only on raw web corpus, 2) training on question-answer (QA) pairs and 3) training on both QA pairs and web corpus. For 1 and 2, we conceptualise semantic parsing as a graph matching problem, where natural language graphs built using CCG/dependency logical forms are transduced to Freebase graphs. For 3, I will present a natural-logic combined with Convolutional Neural-Network based relation extraction. Our methods achieve state-of-the-art on WebQuestions and Free917 QA datasets.
In this talk, I will introduce a distributional model for computing the complexity of semantic composition, inspired by recent psycholinguistic research on sentence comprehension. I argue that the comprehension of a sentence is an incremental process driven by the goal of constructing a coherent representation of the event the speaker intends to communicate with the sentence. Semantic complexity is determined by a compositon cost depending on the internal coherence of the event model being constructed and on the activation degree of such event by linguistic constructions. The model is tested on some psycholinguistic datasets for the study of sentence comprehension.
Figurative expressions such as ''break the ice'' occur frequently in natural language, even in apparently matter-of-fact texts such as news wire. Many of these expressions are also ambiguous between a figurative and a literal interpretation when taken out of context, e.g. ''break the ice (...on the duck pond)'' vs. ''break the ice (...with wary adolescents)''. Being able to automatically detect figurative usages in a given context is potentially useful for a number of tasks, ranging from corpus-based studies of phraseology to applications in automatic natural language processing. In this talk, I will present a method for automatically distinguishing figurative and literal usages of a target expression in a given context. The method exploits the fact that well-formed texts exhibit lexical cohesion, i.e. words are semantically related to other words in the vicinity.
Languages vary in the ways they carve up the world: where English uses the preposition on to describe support relations between object, Dutch employs two prepositions, op and aan. Underlying such crosslinguistic variation, we also find tendencies in the way unique situations (objects, events) are grouped into linguistic semantic categories. For studying the variation and biases in the word meaning inventories of the world's languages, semantic typology has typically taken recourse to in-person elicitation. This process, however, is tedious, hard to apply for more abstract domains (the meaning of connectives, abstract verbs like think), and displays a researcher-bias in the selection of stimuli. Instead, we propose to use parallel corpora to obtain judgments similar to in-person elicitations, but avoiding these pitfalls. In my talk, I will describe our pipeline for approaching this issue, discuss the properties of the representational space it yields, and present preliminary results on a typologically diverse corpus of translated subtitles. (joint work with Suzanne Stevenson)
Although distributed semantics has been very successful in various NLP tasks in recent years, the fact that word meanings are represented as a distribution over other words exposes them to the so-called grounding problem. Multi-modal semantics attempts to address this by enhancing textual representations with extra-linguistic perceptual input. Such multi-modal models outperform language-only models on a range of tasks. In this talk I will discuss my PhD work, which has been concerned with advancing this idea, by (1) improving how we mix information through multi-modal fusion, (2) finding better ways to obtain perceptual information through deep learning and (3) obtaining representations for previously untried modalities such as auditory and even olfactory perception. I'll also briefly talk about a new multi-modal features toolkit that NLP researchers can use to experiment with visual and auditory representations.
Real world data differs radically from the benchmark corpora we use in natural language processing (NLP). As soon as we apply our technology to the real world, performance drops. The reason for this problem is obvious: NLP models are trained on samples from a limited set of canonical varieties that are considered standard, most prominently English newswire. However, there are many dimensions, e.g., socio- demographics, language, genre, sentence type, etc. on which texts can differ from the standard. The solution is not obvious: we cannot control for all factors, and it is not clear how to best go beyond the current practice of training on homogeneous data from a single domain and language.
In this talk, I review the notion of canonicity, and how it shapes our community's approach to language. I argue for the use of fortuitous data. Fortuitous data is data out there that just waits to be harvested. It might be in plain sight, but is neglected (available but not used), or it is in raw form and first needs to be refined (almost ready). It is the unintended yield of a process, or side benefit. Examples include hyperlinks to improve sequence taggers, or annotator disagreement that contains actual signal informative for a variety of NLP tasks. More distant sources include the side benefit of behavior. For example, keystroke dynamics have been extensively used in psycholinguistics and writing research. But do keystroke logs contain actual signal that can be used to learn better NLP models? In this talk I will present recent (on-going) work on keystroke dynamics to improve shallow syntactic parsing. I will also present recent work on using bi-LSTMs for POS tagging, which combines the POS tagging loss function with an auxiliary loss function that accounts for rare words and achieves state-of-the-art performance across 22 languages.
Approaches to the computational analysis of discourse are sensitive to different aspects of textual structure. Some consider topical structure, others focus on rhetorical relations, and still others concern themselves with the functional structure of texts. In this talk I present a new way of approaching the task, following Smith's (2003) work on Discourse Modes. The central idea is that texts are made up of passages - usually several sentences or more - with different modes: Smith's typology includes Narrative, Description, Report, Information, and Argument/Commentary. Smith further identifies specific linguistic correlates of these modes, one of which pertains to the contributions made to the discourse by individual clauses of text. As a first step toward automatic Discourse Mode classification, we address the problem of classifying clauses of written English text according to the type of situation expressed by the clause. The situation entity (SE) classification task as construed here uses a scheme that includes, among others, events, states, abstract entities, and generic sentences. We find that a feature-driven approach to annotating SEs both improves annotation consistency and enriches the annotated data with useful semantic information, such as lexical aspect of verbs, genericity of main referents, and habituality of clauses. This data has been used to develop automatic classifiers for SE types as well as for other semantic phenomena. $Abstract
Probabilistic grammars are an important model family in natural language processing. They are used in the modeling of many problems, mostly prominently in syntax and semantics. Latent-variable grammars are an extension of vanilla probabilistic grammars, introducing latent variables that inject additional information into the grammar by using learning algorithms in the incomplete data setting. In this talk, I will discuss work aimed at the development of (four) theoretically-motivated algorithms for the estimation of latent-variable grammars. I will discuss how we applied them to syntactic parsing, and more semantically-oriented problems such as machine translation, conversation modeling in online forums and question answering.
We introduce models for training embeddings that effectively integrate computer vision and natural language processing. The main novelty in our proposal is the utilisation of data that is not only multimodal, but both multimodal and multilingual. The intuition behind our models is that multiple sources of textual information might convey more "facts" about an image than a textual description in only one language. We discuss how incorporating translational evidence might be used in improving the quality of trained embeddings. We use the recently released multimodal Flickr30k dataset and evaluate our models on the tasks of sentence-to-image and image-to-sentence ranking. Our results demonstrate that including multilingual data leads to substantial improvement over the (monolingual) state-of-the-art.
Earlier work on an entirely diagrammatic formulation of quantum theory, which is soon to appear in the form of a textbook, has somewhat surprisingly guided us towards an answer for the following question: how do we produce the meaning of a sentence given that we understand the meaning of its words? This work has practical applications in the area of natural language processing, and the resulting tools have meanwhile outperformed existing methods.
Most research into learning linguistic representations focuses on the distributional hypothesis and exploits linguistic context to embed words in a semantic vector space. In this talk I address two important but often neglected aspects of language learning: compositionality and grounding. Words are important building blocks of language, but what makes language unique is putting them together: how can we build meaning representations of phrases and whole sentences out of representations of words? And how can we make sure that these representations connect to the extralinguistic world that we perceive and interact with? I will present a multi-task gated recurrent neural network model which sequentially processes words in a sentence and builds a representation of its meaning while making concurrent predictions about (a) which words are to follow and (b) what are the features of the corresponding visual scene. Learning is driven by feedback on this multi-task objective. I evaluate the induced representations on tasks such as image search, paraphrasing and textual inference, and present quantitative and qualitative analyses of how they encode certain aspects of language structure.
Incremental shift-reduce parsing with structured perceptron training is an established technique for continuous constituency parsing. The corresponding parsers are very fast and yield results that are close to the state of the art. In this talk, I present a shift-reduce parser which can produce discontinuous constituents by processing the input words out-of-order, a strategy known from dependency parsing. The system yields accurate results. Unlike previous grammar-based parsers for discontinuous constituents, it also achieves very high parsing speeds.
The CLS is happy to announce three talks about Statistical Models of Grammaticality studied within the SMOG project at King’s College London:
I will present two ideas aiming towards 'parser-generalization', the problem of enhancing a supervised grammar and parsing model to accurately cover a wider variety of linguistic data than has been seen in the labeled data, using additional unlabeled data. The first idea concerns the use of the Expectation Maximisation (EM) algorithm for semi-supervised learning of parsing models. While it has long been thought that EM is unsuitable for semi-supervised learning of structured models such as part-of-speech taggers and parsing models (Merialdo 1994, Elworthy 1994), I will present experiments under two grammar formalisms (PCFG and CCG) where we have successfully used EM for semi-supervised learning of generative parsers. These two grammars share the property of being 'strongly lexicalised', in that they have complex lexical categories, and a few simple grammar rules that combine them. This strong lexicalisation makes these grammars more suitable for learning from unlabeled data than grammars which are not lexicalised in this way. In this work, I make the assumption that all lexical category types in the language are *known* from the supervised part of the data, a reasonable assumption to make if the supervised data is large enough. In the second part of the talk, I will discuss ongoing work where we generate *new* category types, based on those types seen in the labeled data. We use a latent-variable PCFG model for generating new CCG types, under the assumption that there is a hidden structure in CCG lexical categories which can be uncovered using such a model.
One approach to representing images is as a bag-of-regions vector, but this representation discards potentially useful information about the spatial and semantic relationships between the parts of the image. The central argument of the research is that capturing and encoding the relationships between parts of an image will improve the performance of downstream tasks. A simplifying assumption throughout the talk is that we have access to gold-standard object annotations. The first part of this talk will focus on the Visual Dependency Representation: a novel structured representation that captures region-region relationships in an image. The key idea is that images depicting the same events are likely to have similar spatial relationships between the regions contributing to the event. We explain how to automatically predict Visual Dependency Representations using a modified graph-based statistical dependency parser. Our approach can exploit features from the region annotations and the description to predict the relationships between objects in an image. The second part of the talk will show that adopting Visual Dependency Representations of images leads to significant improvements on two downstream tasks. In an image description task, we find improvements compared to state-of-the-art models that use either external text corpora or region proximity to guide the generation process. Finally, in an query-by-example image retrieval task, we show improvements in Mean Average Precision and the precision of the top 10 images compared to a bag-of-terms approach.
It is proofed that syntax-based statistical machine translation can produce better translation than phrase-based translation does, especially for those language pairs with big structural difference. However, constituent-based models are complex and not efficient in implementation. Dependency is regarded as a more compact and efficient formalism of syntax and a nature bridge from syntax to semantics, but early dependency-based SMT has lower performance compared with the mainstream approaches. We proposed the first dependency-based SMT model whose performance is comparable with the state-of-the-art models in 2011, and then we developed several improvements based on this model. Recently we tried a new dependency-based transfer-and-generation approach which we think is promising and got positive results at this preliminary stage.
In Statistical Machine Translation (SMT), inference is performed over a high-complexity discrete distribution defined by the intersection between a translation hypergraph and a target language model. This distribution is too complex to be represented exactly and one typically resorts to approximation techniques either to perform optimisation - the task of searching for the optimum translation - or sampling - the task of finding a subset of translations that is statistically representative of the goal distribution. Beam-search is an example of an approximate optimisation technique, where maximisation is performed over a heuristically pruned representation of the goal distribution. In this presentation, I will talk about exact optimisation (decoding) and sampling for SMT based on a form of rejection sampling. In this view, the intractable goal distribution is upperbounded by a simpler (thus tractable) proxy distribution, which is then incrementally refined to be closer to the goal until the maximum is found, or until the sampling performance exceeds a certain level.
Establishing reference to objects in a shared environment is pivotal to successful communication. By using artificial scenarios where subjects need to choose referential expressions or guess the speaker's intended referent we can study the extent to which speakers and listeners reason pragmatically about each other's perspective. I will present a number of related empirical studies in this paradigm and discuss how different flavors of Bayesian cognitive modeling can be used to analyze the data.
Human beings are excellent at making sense of, and producing, structured sensory input. In particular, cognitive abilities for patterning seem crucial in allowing humans to perceive and produce language and music. The comparative approach, testing a range of animal species, can help unveil the evolutionary history of such patterning abilities. Here, I present experimental data and ongoing work in humans, chimpanzees, squirrel monkeys, pigeons and kea. I compare monkeys' and humans' skills in processing sensory dependencies in auditory stimuli, a crucial feature of human cognition. In order to infer individual and species-specific learning strategies and behavioral heuristics, I analyze data from visual touch-screen experiments in birds. Finally, as pattern production and perception abilities have been shown to differ in humans, the same divide could exist in other species. I present ongoing work using "electronic drums" I developed specifically for apes, which will allow chimpanzees to spontaneously produce non-vocal acoustic patterns.
Probabilistic and stochastic methods have been fruitfully applied to a wide variety of problems in grammar induction, natural language processing, and cognitive modeling. In this talk I will explore the possibility of developing a class of combinatorial semantic representations for natural languages that compute the semantic value of a (declarative) sentence as a probability value which expresses the likelihood of speakers of the language accepting the sentence as true in a given model. Such an approach to semantic representation treats the pervasive gradience of semantic properties as intrinsic to speakers' linguistic knowledge, rather the result of the interference of performance factors in processing and interpretation. In order for this research program to succeed, it must solve three central problems. First, it needs to formulate a type system that computes the probability value of a sentence from the semantic values of its syntactic constituents. Second, it must incorporate a viable probabilitic logic into the representation of semantic knowledge in order to model meaning entailment. Finally, it must show how the specified class of semantic representations can be efficiently learned from the primary linguistic data available for language acquisition. This research has developed out of recent work with Alex Clark (Royal Holloway, London) on the application of computational learning theory to grammar induction.
Early verb learning in children seems an almost miraculous feat. In learning a verb, children must learn both the basic meaning of the event ("falling" or "eating"), as well as the allowable structures in their language for correctly communicating the participants in that event ("The glass fell", but not "She fell the glass"). Moreover, given the sparsity of evidence, children must be able to abstract away from specific usages they observe in order to use their knowledge of verbs productively. Finally, children must accomplish all this in the face of a high degree of variability among verbs, along with much noise and uncertainty in the input data, and with no explicit teaching. Do children require innate knowledge of language to accomplish this, or are general cognitive learning mechanisms sufficient to the task? We have developed various computational models of verb learning using unsupervised clustering over simple statistical properties of verb usages. Our findings support the claim that general learning mechanisms are able to acquire abstract knowledge of verbs and to generalize that knowledge to novel verbs and situations.