The seminar is organised by:
To receive notifications about upcoming seminars, please join the CLS mailing list.
Graph Convolutional Networks (GCNs) are an effective tool for modeling graph-structured data. We investigate their applicability in the context of natural language processing (machine translation and semantic role labelling) and modeling relational data (link prediction). For natural language processing, we introduce a version of GCNs suited to modeling syntactic and/or semantic dependency graphs and use them to construct linguistically-informed sentence encoders. We demonstrate that using them results in a substantial boost in machine translation performance and state-of-the-art results on semantic role labeling of English and Chinese. For link prediction, we propose Relational GCNs (RGCNs), GCNs developed specifically to deal with highly multi-relational data, characteristic of realistic knowledge bases. By explicitly modeling neighbourhoods of entities, RGCNs accumulate evidence over multiple inference steps in relational graphs and yield competitive results on standard link prediction benchmarks.
Joint work with Diego Marcheggiani, Michael Schlichtkrull, Joost Bastings, Thomas Kipf, Khalil Sima’an, Max Welling, Rianne van den Berg and Peter Bloem.
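As a rough illustration of how stacked graph-convolution layers accumulate evidence over several steps, here is a minimal sketch of a plain, unlabelled GCN layer. It is not the syntactic or relational variant from the talk: edge labels, directions and gates are left out, and the mean normalisation, the toy path graph, the one-hot features and the identity weight matrix are all assumptions made for the example.

```python
def gcn_layer(adj, feats, weight):
    """One simplified GCN layer: average each node with its neighbours,
    apply a linear map, then a ReLU. All inputs are plain nested lists."""
    n = len(adj)
    dim_in = len(feats[0])
    dim_out = len(weight[0])
    out = []
    for i in range(n):
        # Neighbourhood of node i, including a self-loop.
        neigh = [j for j in range(n) if adj[i][j] or i == j]
        # Mean of the neighbours' feature vectors.
        agg = [sum(feats[j][k] for j in neigh) / len(neigh) for k in range(dim_in)]
        # Linear transformation followed by ReLU.
        out.append([
            max(0.0, sum(agg[k] * weight[k][m] for k in range(dim_in)))
            for m in range(dim_out)
        ])
    return out

# Hypothetical toy graph: a path 0 - 1 - 2 with one-hot node features.
ADJ = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
FEATS = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0]]
IDENTITY = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

h1 = gcn_layer(ADJ, FEATS, IDENTITY)  # one hop of information
h2 = gcn_layer(ADJ, h1, IDENTITY)     # two hops of information
```

After one layer, node 0 has no information about node 2, which is two edges away. After the second layer, its third feature becomes non-zero: each extra layer extends the neighbourhood by one hop.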
One of the major early achievements of linguistic typology was Greenberg’s (1963) discovery of implicational word order universals. While his work was based on a comparatively small sample of languages, later work, such as Hawkins (1983) and Dryer (1992), confirmed the existence of implicational word order universals on the basis of broader data collections. In a landmark study using modern quantitative, Bayesian comparative methods and data from four language families (Austronesian, Bantu, Indo-European and Uto-Aztecan), Dunn, Greenhill, Levinson, and Gray (2011) obtained results in stark contrast to the established view. While the authors did find evidence for word order correlations in many cases, the emerging pictures differed fundamentally between the four families. From this they concluded that word order tendencies are lineage specific rather than universal. The authors did not explicitly compare their lineage-specific model with a universal model, though; they only qualitatively assessed the assumption of universal word-order correlations as not plausible given their findings. In the talk I will present a study addressing this issue by performing a Bayesian model comparison between a universal and a lineage-specific model. It turns out that there is solid support for universal word-order correlations between features that Dryer (1992) classified as "verb patterners", while other correlations are clearly lineage specific. The broader methodological point to be made is that linguistic typology can benefit immensely from the tools of modern Bayesian statistics and the phylogenetic comparative method.
Recurrent neural networks (RNNs) are remarkably general learning systems that, given appropriate training examples, can handle complex sequential processing tasks, such as those frequently encountered in language and reasoning. However, RNNs are remarkably sample-heavy, typically requiring hundreds of thousands of examples to master tasks that humans can solve after just a few exposures. The first set of experiments I will present shows that modern RNNs, just like their ancestors from the nineties, have problems with systematic compositionality, that is, the ability to extract general rules from the training data and apply them to new examples. As systematic compositionality allows very fast generalization to unseen cases, lack of compositional learning might be one of the roots of RNNs' thirst for training data. I will next present an ongoing study where RNNs must solve an apparently simple task where correct generalization relies on function composition. Current results suggest that a large random search in RNN space finds a small proportion of models that converge on a (limited) compositional solution. However, it's not clear, for the time being, what is special about such models. The quest for compositional RNNs is still on.
Joint work with: Brenden Lake, Adam Liska, Germán Kruszewski
There are multiple contributors to language change that are external to the speaker, such as social or economic drivers, or even accidents of linguistic contact. However, there are also internal constraints that are key to shaping language evolution. In particular, psycholinguistic properties of language can predict which representations are acquired and stored with greatest fidelity by the speaker. For instance, we know that frequency, length, and the age at which a language structure is acquired all contribute to more stable storage and accurate reproduction of that structure. In this talk, I present a series of studies of the English vocabulary to demonstrate how internal cognitive processing has shaped the language, with analyses from corpora of diachronic vocabulary change and morphological change of the past tense forms of verbs, accompanied by laboratory studies of artificial language learning and change that show similar patterns to the diachronic data. These studies provide suggestions for how psycholinguistic properties of the language affect learning and cultural transmission across generations of speakers.
Logic Tensor Networks (LTN) is a theoretical framework and an experimental platform that integrates learning based on tensor neural networks with reasoning using first-order many-valued/fuzzy logic. LTN supports a wide range of reasoning and learning tasks with logical knowledge and data, combining rich symbolic knowledge representation in first-order logic (FOL) with efficient data-driven machine learning based on the manipulation of real-valued vectors. In practice, FOL reasoning including function symbols is approximated through the usual iterative deepening of clause depth. Given data available in the form of real-valued vectors, logical soft and hard constraints and relations which apply to certain subsets of the vectors can be specified compactly in FOL. All the different tasks can be represented in LTN as a form of approximated satisfiability: reasoning can help improve learning, and learning from new data may revise the constraints, thus modifying reasoning. We apply LTNs to Semantic Image Interpretation (SII) in order to solve the following tasks: (i) the classification of an image's bounding boxes and (ii) the detection of the relevant part-of relations between objects. The results show that the use of background knowledge improves the performance of purely data-driven machine learning methods.
If, when asked to "point at the mug", a physically unimpaired person seems unable to identify a potential referent that is standing in front of them, we might hesitate to ascribe knowledge of the meaning of the word "mug" to them, whatever else they may be able to tell us about mugs (e.g., "wooden mugs were produced probably from the oldest time, but most of them have not survived intact.", or "mugs are similar to cups"). And yet computational models of word meaning are good at the latter (e.g., by simply linking to knowledge repositories like Wikipedia, where the previous sentence about wooden mugs was taken from), and fail at the former. In this talk, I will present our recent work on learning a lexicon for referential interaction, where the referential aspects of word meaning are modelled through perceptual classifiers taking real images as input. I show that this representation complements other computational meaning representations such as those derived from distributional patterns, as well as decompositional or attribute-based representations. The lexicon is learned through (observation of) interaction, and is maintained and defended in interaction.
Shared tasks and (shared) corpora have proven themselves highly valuable for NLP. They have allowed us to evaluate our methods and compare them to others, helping us, our readers and reviewers assess the quality of our methods. A downside of the widespread approach of comparing results on a gold dataset is that it is relatively common practice to draw conclusions based on the highest numbers without looking into what is behind them. However, what goes wrong and why can be highly relevant for end-applications and, especially given the well-known difficulties with reproducing results, looking into the details of how and why results improve (or not) is highly relevant. In this talk, I will present two studies taking intrinsic evaluation one step further: 1) investigating error propagation in parsing and 2) diving into the evaluation of distributional semantic methods. Finally, I will outline the importance of deeper evaluation when NLP is used within digital humanities and digital social science.
Bandit structured prediction describes a stochastic optimization framework where learning is performed from partial feedback in the form of a task loss evaluation of a predicted output structure, without having access to gold standard structures. This framework has successfully been applied to various structured prediction tasks in NLP. In this talk I will focus on the application of bandit structured prediction to linear and non-linear machine translation models where models are adapted to a new domain without seeing reference translations of the new domain. In simulation experiments we showed that partial information in the form of translation quality judgements on predicted translations is sufficient for model adaptation, even for feedback as weak as pairwise preference judgments.
Linguistic quantifiers have been the realm of Formal Semantics. A lot is known about their formal properties and how those properties affect logical entailment, the licensing of polarity items, or scope ambiguities. Less is known about how quantifiers are acquired by children and even less about how computational models can learn to quantify objects in images. In this talk, we will report on our findings in this direction. First of all, we will explain why the task is interesting and challenging for a Language and Vision model. Secondly, we will report our evaluation of state-of-the-art neural network models against this task. Thirdly, we will compare the acquisition of quantifiers with the acquisition of cardinals. We will show that a model capitalizing on a 'fuzzy' measure of similarity is effective for learning quantifiers, whereas the learning of exact cardinals is better accomplished when information about number is provided.
While our ultimate aim in language processing might be making fully unsupervised models that optimally resemble the human way of learning, in many areas of NLP we are still heavily working with high degrees of supervision. Aiming at sparing annotation effort, distant supervision has been explored in the past 10 years as an alternative way to obtain (noisy) training data. This obviously doesn't take us directly to unsupervised models, but in addition to being a cheaper method of labelling instances, it also keeps us closer to the original data and it might give us an indication of the extent to which we can make do with rather spontaneous signals in the data. In the talk, I will present two experiments in the area of affective computing exploiting distant supervision: one on emotion detection, and one on stance detection. In both cases, we acquire silver labels for training by leveraging user-generated social media data, and play with different degrees of supervision in building our models. These are eventually tested on standard benchmarks and compared to state-of-the-art approaches. Our (mixed) results are also discussed in the light of whether supervision is truly necessary or not, and the value of silver versus gold data.
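The learning-from-partial-feedback loop described above can be sketched on a toy problem. This is a minimal illustration under strong assumptions, not the speaker's system: the "output structures" are just three hypothetical candidates, the model is a softmax over one score per candidate, and the update is a score-function (expected-loss) stochastic gradient step. The loss values, step count and learning rate are invented for the example.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bandit_train(task_loss, n_outputs, steps=2000, lr=0.5, seed=0):
    """Minimise expected task loss using only the loss of sampled outputs."""
    rng = random.Random(seed)
    theta = [0.0] * n_outputs
    for _ in range(steps):
        probs = softmax(theta)
        # Predict by sampling an output from the current model.
        y = rng.choices(range(n_outputs), weights=probs)[0]
        # Partial feedback: the loss of this one output, no gold standard.
        loss = task_loss(y)
        # Score-function gradient of the expected loss w.r.t. theta.
        for k in range(n_outputs):
            grad = loss * ((1.0 if k == y else 0.0) - probs[k])
            theta[k] -= lr * grad
    return softmax(theta)

# Hypothetical feedback: candidate 1 is the best output (loss 0).
HYPOTHETICAL_LOSSES = [1.0, 0.0, 0.5]
final_probs = bandit_train(lambda y: HYPOTHETICAL_LOSSES[y], n_outputs=3)
```

The learner never sees which candidate is correct. It only observes a loss for the candidate it proposed, yet probability mass shifts towards the lowest-loss output.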
There are a number of interesting challenges in translation to morphologically rich languages (such as German or Czech) from a language like English. I will first present a linguistically rich English to German translation system generalizing over compounds, phenomena of inflectional morphology and syntactic issues, relying on preprocessing and postprocessing techniques. Following this, I'll present approaches addressing similar issues which have been tightly integrated into the Moses SMT decoder, and work well for multiple language pairs. Finally, time allowing, I'll present some thoughts on addressing these and further challenges within the framework of neural machine translation.
Over a century ago, Frege famously introduced the distinction between sense and reference that is one of the theoretical foundations of formal semantics. However, in practice formal semanticists took reference and ran away with it, either eschewing sense-related issues altogether or giving a referential treatment to them (with notable exceptions). In this talk, I argue that we need to go back to Fregean sense, and propose that data-induced, continuous representations provided by distributional semantics and deep learning methods provide a good methodological handle for sense-related aspects of meaning. I support these claims with results from both computational modeling and theoretical studies. I then revisit reference and present ongoing work on the challenging enterprise of tackling it with continuous methods, too.
In this talk I will describe the creation of RELPRON, a dataset of subject and object relative clauses for the evaluation of compositional distributional semantic models. The RELPRON task involves matching terms, such as 'wisdom', with representative properties in relative clause form, such as 'quality that experience teaches'. Relative clauses are an interesting test case for compositional distributional semantic models because they contain a closed class function word and a long-distance dependency. I will present results on RELPRON obtained within a type-based composition framework, using a variety of approaches to simplify the learning of higher-order tensors, as well as results obtained using neural networks for composition. In line with many existing datasets, vector addition provides a challenging baseline for RELPRON, but it is possible to match or improve on the baseline by finding appropriate training data and models for the semantics of the relative pronoun.
I will present three semantic parsing approaches for querying Freebase in natural language: 1) training only on a raw web corpus, 2) training on question-answer (QA) pairs and 3) training on both QA pairs and a web corpus. For 1 and 2, we conceptualise semantic parsing as a graph matching problem, where natural language graphs built using CCG/dependency logical forms are transduced to Freebase graphs. For 3, I will present a natural-logic approach combined with relation extraction based on convolutional neural networks. Our methods achieve state-of-the-art results on the WebQuestions and Free917 QA datasets.
In this talk, I will introduce a distributional model for computing the complexity of semantic composition, inspired by recent psycholinguistic research on sentence comprehension. I argue that the comprehension of a sentence is an incremental process driven by the goal of constructing a coherent representation of the event the speaker intends to communicate with the sentence. Semantic complexity is determined by a composition cost that depends on the internal coherence of the event model being constructed and on the degree to which that event is activated by linguistic constructions. The model is tested on some psycholinguistic datasets for the study of sentence comprehension.
Figurative expressions such as "break the ice" occur frequently in natural language, even in apparently matter-of-fact texts such as news wire. Many of these expressions are also ambiguous between a figurative and a literal interpretation when taken out of context, e.g. "break the ice (...on the duck pond)" vs. "break the ice (...with wary adolescents)". Being able to automatically detect figurative usages in a given context is potentially useful for a number of tasks, ranging from corpus-based studies of phraseology to applications in automatic natural language processing. In this talk, I will present a method for automatically distinguishing figurative and literal usages of a target expression in a given context. The method exploits the fact that well-formed texts exhibit lexical cohesion, i.e. words are semantically related to other words in the vicinity.
Languages vary in the ways they carve up the world: where English uses the preposition "on" to describe support relations between objects, Dutch employs two prepositions, "op" and "aan". Underlying such crosslinguistic variation, we also find tendencies in the way unique situations (objects, events) are grouped into linguistic semantic categories. For studying the variation and biases in the word meaning inventories of the world's languages, semantic typology has typically taken recourse to in-person elicitation. This process, however, is tedious, hard to apply to more abstract domains (the meaning of connectives, abstract verbs like "think"), and displays a researcher bias in the selection of stimuli. Instead, we propose to use parallel corpora to obtain judgments similar to in-person elicitations while avoiding these pitfalls. In my talk, I will describe our pipeline for approaching this issue, discuss the properties of the representational space it yields, and present preliminary results on a typologically diverse corpus of translated subtitles. (joint work with Suzanne Stevenson)
Although distributed semantics has been very successful in various NLP tasks in recent years, the fact that word meanings are represented as a distribution over other words exposes them to the so-called grounding problem. Multi-modal semantics attempts to address this by enhancing textual representations with extra-linguistic perceptual input. Such multi-modal models outperform language-only models on a range of tasks. In this talk I will discuss my PhD work, which has been concerned with advancing this idea, by (1) improving how we mix information through multi-modal fusion, (2) finding better ways to obtain perceptual information through deep learning and (3) obtaining representations for previously untried modalities such as auditory and even olfactory perception. I'll also briefly talk about a new multi-modal features toolkit that NLP researchers can use to experiment with visual and auditory representations.
Real-world data differs radically from the benchmark corpora we use in natural language processing (NLP). As soon as we apply our technology to the real world, performance drops. The reason for this problem is obvious: NLP models are trained on samples from a limited set of canonical varieties that are considered standard, most prominently English newswire. However, there are many dimensions, e.g., socio-demographics, language, genre, sentence type, etc., on which texts can differ from the standard. The solution is not obvious: we cannot control for all factors, and it is not clear how to best go beyond the current practice of training on homogeneous data from a single domain and language.
In this talk, I review the notion of canonicity, and how it shapes our community's approach to language. I argue for the use of fortuitous data. Fortuitous data is data out there that just waits to be harvested. It might be in plain sight, but is neglected (available but not used), or it is in raw form and first needs to be refined (almost ready). It is the unintended yield of a process, or side benefit. Examples include hyperlinks to improve sequence taggers, or annotator disagreement that contains actual signal informative for a variety of NLP tasks. More distant sources include the side benefit of behavior. For example, keystroke dynamics have been extensively used in psycholinguistics and writing research. But do keystroke logs contain actual signal that can be used to learn better NLP models? In this talk I will present recent (on-going) work on keystroke dynamics to improve shallow syntactic parsing. I will also present recent work on using bi-LSTMs for POS tagging, which combines the POS tagging loss function with an auxiliary loss function that accounts for rare words and achieves state-of-the-art performance across 22 languages.
Approaches to the computational analysis of discourse are sensitive to different aspects of textual structure. Some consider topical structure, others focus on rhetorical relations, and still others concern themselves with the functional structure of texts. In this talk I present a new way of approaching the task, following Smith's (2003) work on Discourse Modes. The central idea is that texts are made up of passages - usually several sentences or more - with different modes: Smith's typology includes Narrative, Description, Report, Information, and Argument/Commentary. Smith further identifies specific linguistic correlates of these modes, one of which pertains to the contributions made to the discourse by individual clauses of text. As a first step toward automatic Discourse Mode classification, we address the problem of classifying clauses of written English text according to the type of situation expressed by the clause. The situation entity (SE) classification task as construed here uses a scheme that includes, among others, events, states, abstract entities, and generic sentences. We find that a feature-driven approach to annotating SEs both improves annotation consistency and enriches the annotated data with useful semantic information, such as lexical aspect of verbs, genericity of main referents, and habituality of clauses. This data has been used to develop automatic classifiers for SE types as well as for other semantic phenomena.
Probabilistic grammars are an important model family in natural language processing. They are used in the modeling of many problems, most prominently in syntax and semantics. Latent-variable grammars are an extension of vanilla probabilistic grammars, introducing latent variables that inject additional information into the grammar by using learning algorithms in the incomplete data setting. In this talk, I will discuss work aimed at the development of (four) theoretically-motivated algorithms for the estimation of latent-variable grammars. I will discuss how we applied them to syntactic parsing, and more semantically-oriented problems such as machine translation, conversation modeling in online forums and question answering.
We introduce models for training embeddings that effectively integrate computer vision and natural language processing. The main novelty in our proposal is the utilisation of data that is not only multimodal, but both multimodal and multilingual. The intuition behind our models is that multiple sources of textual information might convey more "facts" about an image than a textual description in only one language. We discuss how incorporating translational evidence might be used in improving the quality of trained embeddings. We use the recently released multimodal Flickr30k dataset and evaluate our models on the tasks of sentence-to-image and image-to-sentence ranking. Our results demonstrate that including multilingual data leads to substantial improvement over the (monolingual) state-of-the-art.
Earlier work on an entirely diagrammatic formulation of quantum theory, which is soon to appear in the form of a textbook, has somewhat surprisingly guided us towards an answer for the following question: how do we produce the meaning of a sentence given that we understand the meaning of its words? This work has practical applications in the area of natural language processing, and the resulting tools have meanwhile outperformed existing methods.
Most research into learning linguistic representations focuses on the distributional hypothesis and exploits linguistic context to embed words in a semantic vector space. In this talk I address two important but often neglected aspects of language learning: compositionality and grounding. Words are important building blocks of language, but what makes language unique is putting them together: how can we build meaning representations of phrases and whole sentences out of representations of words? And how can we make sure that these representations connect to the extralinguistic world that we perceive and interact with? I will present a multi-task gated recurrent neural network model which sequentially processes words in a sentence and builds a representation of its meaning while making concurrent predictions about (a) which words are to follow and (b) what are the features of the corresponding visual scene. Learning is driven by feedback on this multi-task objective. I evaluate the induced representations on tasks such as image search, paraphrasing and textual inference, and present quantitative and qualitative analyses of how they encode certain aspects of language structure.
Incremental shift-reduce parsing with structured perceptron training is an established technique for continuous constituency parsing. The corresponding parsers are very fast and yield results that are close to the state of the art. In this talk, I present a shift-reduce parser which can produce discontinuous constituents by processing the input words out-of-order, a strategy known from dependency parsing. The system yields accurate results. Unlike previous grammar-based parsers for discontinuous constituents, it also achieves very high parsing speeds.
The CLS is happy to announce three talks about Statistical Models of Grammaticality studied within the SMOG project at King’s College London:
I will present two ideas aiming towards 'parser-generalization', the problem of enhancing a supervised grammar and parsing model to accurately cover a wider variety of linguistic data than has been seen in the labeled data, using additional unlabeled data. The first idea concerns the use of the Expectation Maximisation (EM) algorithm for semi-supervised learning of parsing models. While it has long been thought that EM is unsuitable for semi-supervised learning of structured models such as part-of-speech taggers and parsing models (Merialdo 1994, Elworthy 1994), I will present experiments under two grammar formalisms (PCFG and CCG) where we have successfully used EM for semi-supervised learning of generative parsers. These two grammars share the property of being 'strongly lexicalised', in that they have complex lexical categories, and a few simple grammar rules that combine them. This strong lexicalisation makes these grammars more suitable for learning from unlabeled data than grammars which are not lexicalised in this way. In this work, I make the assumption that all lexical category types in the language are *known* from the supervised part of the data, a reasonable assumption to make if the supervised data is large enough. In the second part of the talk, I will discuss ongoing work where we generate *new* category types, based on those types seen in the labeled data. We use a latent-variable PCFG model for generating new CCG types, under the assumption that there is a hidden structure in CCG lexical categories which can be uncovered using such a model.
One approach to representing images is as a bag-of-regions vector, but this representation discards potentially useful information about the spatial and semantic relationships between the parts of the image. The central argument of the research is that capturing and encoding the relationships between parts of an image will improve the performance of downstream tasks. A simplifying assumption throughout the talk is that we have access to gold-standard object annotations. The first part of this talk will focus on the Visual Dependency Representation: a novel structured representation that captures region-region relationships in an image. The key idea is that images depicting the same events are likely to have similar spatial relationships between the regions contributing to the event. We explain how to automatically predict Visual Dependency Representations using a modified graph-based statistical dependency parser. Our approach can exploit features from the region annotations and the description to predict the relationships between objects in an image. The second part of the talk will show that adopting Visual Dependency Representations of images leads to significant improvements on two downstream tasks. In an image description task, we find improvements compared to state-of-the-art models that use either external text corpora or region proximity to guide the generation process. Finally, in a query-by-example image retrieval task, we show improvements in Mean Average Precision and the precision of the top 10 images compared to a bag-of-terms approach.
It has been shown that syntax-based statistical machine translation can produce better translations than phrase-based translation, especially for language pairs with large structural differences. However, constituent-based models are complex and not efficient to implement. Dependency is regarded as a more compact and efficient formalism of syntax and a natural bridge from syntax to semantics, but early dependency-based SMT performed worse than the mainstream approaches. We proposed the first dependency-based SMT model with performance comparable to the state-of-the-art models in 2011, and then we developed several improvements based on this model. Recently we tried a new dependency-based transfer-and-generation approach, which we think is promising and which has given positive results at this preliminary stage.
In Statistical Machine Translation (SMT), inference is performed over a high-complexity discrete distribution defined by the intersection between a translation hypergraph and a target language model. This distribution is too complex to be represented exactly and one typically resorts to approximation techniques either to perform optimisation - the task of searching for the optimum translation - or sampling - the task of finding a subset of translations that is statistically representative of the goal distribution. Beam-search is an example of an approximate optimisation technique, where maximisation is performed over a heuristically pruned representation of the goal distribution. In this presentation, I will talk about exact optimisation (decoding) and sampling for SMT based on a form of rejection sampling. In this view, the intractable goal distribution is upper-bounded by a simpler (thus tractable) proxy distribution, which is then incrementally refined to be closer to the goal until the maximum is found, or until the sampling performance exceeds a certain level.
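The upper-bound-and-refine idea can be sketched on a toy problem. This is an illustrative sketch only, not the decoder from the talk: the "translations" are three hypothetical candidates, the goal score is a cheap translation-model score times an "expensive" language-model score in [0, 1], and the proxy starts out as the translation-model score alone, which is a valid upper bound. All names and numbers are assumptions, and the sampler refines on every rejection instead of stopping at a target acceptance rate.

```python
import random

# Hypothetical toy scores; the language-model factor is assumed to lie in [0, 1].
TM_SCORE = {"a": 0.5, "b": 0.4, "c": 0.1}   # cheap to evaluate
LM_SCORE = {"a": 0.2, "b": 0.9, "c": 1.0}   # treated as expensive

def goal(x):
    """Unnormalised goal score: the quantity we actually care about."""
    return TM_SCORE[x] * LM_SCORE[x]

def exact_argmax():
    """Exact optimisation: refine the proxy only where it is currently maximal."""
    proxy = dict(TM_SCORE)      # optimistic upper bound: LM factor <= 1
    refined = set()
    while True:
        best = max(proxy, key=proxy.get)
        if best in refined:
            # The proxy is exact at its own maximum and bounds the goal
            # everywhere else, so this is the true optimum.
            return best, refined
        proxy[best] = goal(best)    # tighten the bound at this point
        refined.add(best)

def sample(n, seed=0):
    """Rejection sampling from the goal, refining the proxy on each rejection."""
    rng = random.Random(seed)
    proxy = dict(TM_SCORE)
    keys = list(proxy)
    samples = []
    while len(samples) < n:
        x = rng.choices(keys, weights=[proxy[k] for k in keys])[0]
        if rng.random() < goal(x) / proxy[x]:
            samples.append(x)       # accepted: an exact draw from the goal
        else:
            proxy[x] = goal(x)      # rejected: make the proxy tighter here
    return samples

best, evaluated = exact_argmax()
draws = sample(50)
```

In this toy run the optimiser returns candidate "b" after evaluating the expensive score for only "a" and "b". Candidate "c" is never scored, because its upper bound already rules it out.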
Establishing reference to objects in a shared environment is pivotal to successful communication. By using artificial scenarios where subjects need to choose referential expressions or guess the speaker's intended referent we can study the extent to which speakers and listeners reason pragmatically about each other's perspective. I will present a number of related empirical studies in this paradigm and discuss how different flavors of Bayesian cognitive modeling can be used to analyze the data.
Human beings are excellent at making sense of, and producing, structured sensory input. In particular, cognitive abilities for patterning seem crucial in allowing humans to perceive and produce language and music. The comparative approach, testing a range of animal species, can help unveil the evolutionary history of such patterning abilities. Here, I present experimental data and ongoing work in humans, chimpanzees, squirrel monkeys, pigeons and kea. I compare monkeys' and humans' skills in processing sensory dependencies in auditory stimuli, a crucial feature of human cognition. In order to infer individual and species-specific learning strategies and behavioral heuristics, I analyze data from visual touch-screen experiments in birds. Finally, as pattern production and perception abilities have been shown to differ in humans, the same divide could exist in other species. I present ongoing work using "electronic drums" I developed specifically for apes, which will allow chimpanzees to spontaneously produce non-vocal acoustic patterns.
Probabilistic and stochastic methods have been fruitfully applied to a wide variety of problems in grammar induction, natural language processing, and cognitive modeling. In this talk I will explore the possibility of developing a class of combinatorial semantic representations for natural languages that compute the semantic value of a (declarative) sentence as a probability value which expresses the likelihood of speakers of the language accepting the sentence as true in a given model. Such an approach to semantic representation treats the pervasive gradience of semantic properties as intrinsic to speakers' linguistic knowledge, rather than the result of the interference of performance factors in processing and interpretation. In order for this research program to succeed, it must solve three central problems. First, it needs to formulate a type system that computes the probability value of a sentence from the semantic values of its syntactic constituents. Second, it must incorporate a viable probabilistic logic into the representation of semantic knowledge in order to model meaning entailment. Finally, it must show how the specified class of semantic representations can be efficiently learned from the primary linguistic data available for language acquisition. This research has developed out of recent work with Alex Clark (Royal Holloway, London) on the application of computational learning theory to grammar induction.
Early verb learning in children seems an almost miraculous feat. In learning a verb, children must learn both the basic meaning of the event ("falling" or "eating"), as well as the allowable structures in their language for correctly communicating the participants in that event ("The glass fell", but not "She fell the glass"). Moreover, given the sparsity of evidence, children must be able to abstract away from specific usages they observe in order to use their knowledge of verbs productively. Finally, children must accomplish all this in the face of a high degree of variability among verbs, along with much noise and uncertainty in the input data, and with no explicit teaching. Do children require innate knowledge of language to accomplish this, or are general cognitive learning mechanisms sufficient to the task? We have developed various computational models of verb learning using unsupervised clustering over simple statistical properties of verb usages. Our findings support the claim that general learning mechanisms are able to acquire abstract knowledge of verbs and to generalize that knowledge to novel verbs and situations.