# Literature Survey I - Knowledge Infusion in Language Models

This is a literature survey of recent work that attempts to augment language models with external knowledge. I only include studies published after BERT, both to keep this post concise and to draw attention to state-of-the-art solutions. The studies are discussed in reverse chronological order (my favourites are 1, 3, 4, 5, 8, and 9). All figures have been taken from their respective papers.

## Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model - UC Santa Barbara, Facebook AI

Xiong et al. propose a weakly supervised training objective for incorporating knowledge. Entity mentions in the original text are replaced with names of other entities of the same type, and the model is trained to distinguish the correct entity mentions from the randomly chosen replacements, i.e., to tell true textual knowledge from false. Compared to other methods that try to infuse knowledge into language models, this approach has the advantage that it learns directly from unstructured text and doesn’t need any modifications to BERT when fine-tuning on downstream tasks. The study tests the proposed model on Question Answering and fine-grained Entity Typing (two tasks that require entity knowledge) and reports strong results on both.
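
To make the replacement step concrete, here is a minimal sketch of how such training examples could be built. It is my own illustration, not the authors' code: the function name `corrupt_mentions`, the span format, and the `entities_by_type` lookup are all assumptions.

```python
import random

def corrupt_mentions(tokens, mentions, entities_by_type, replace_prob, rng):
    """Replace some entity mentions with other entities of the same type.

    tokens: list of str
    mentions: non-overlapping (start, end, entity_type) token spans
    Returns (new_tokens, labels); labels maps each original span start
    to 1 (mention kept) or 0 (mention replaced).
    """
    new_tokens = list(tokens)
    labels = {}
    # Work right to left so earlier span indices stay valid after splicing.
    for start, end, ent_type in sorted(mentions, reverse=True):
        original = " ".join(tokens[start:end])
        candidates = [e for e in entities_by_type[ent_type] if e != original]
        if candidates and rng.random() < replace_prob:
            new_tokens[start:end] = rng.choice(candidates).split()
            labels[start] = 0
        else:
            labels[start] = 1
    return new_tokens, labels

# Hypothetical toy data, with replace_prob=1.0 so every mention is swapped.
tokens = "Barack Obama was born in Honolulu".split()
mentions = [(0, 2, "person"), (5, 6, "city")]
entities_by_type = {
    "person": ["Barack Obama", "Angela Merkel"],
    "city": ["Honolulu", "Paris"],
}
corrupted, labels = corrupt_mentions(
    tokens, mentions, entities_by_type, 1.0, random.Random(0)
)
print(" ".join(corrupted))  # Angela Merkel was born in Paris
```

A model would then be trained to predict the labels, i.e. whether each mention is the original or a replacement.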

## K-BERT: Enabling Language Representation with Knowledge Graph - Peking University, Tencent Research

Liu et al. propose a knowledge-enabled language representation model (K-BERT) that aims to overcome the issue of knowledge noise, where injecting too much knowledge into a sentence changes its meaning. First, K-BERT “injects knowledge from KG into a sentence, making it a knowledge-rich sentence tree”. Next, it uses soft positions and a visible matrix to “control the scope of knowledge, preventing it from deviating from its original meaning”. K-BERT outperforms BERT on several tasks where domain knowledge is essential, in domains such as medicine, finance, and law. It should be noted that most of the experiments in this paper are conducted on Chinese datasets.
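
The sketch below is my simplified reading of the sentence tree, soft positions, and visible matrix, not the paper's implementation. The function name, the `triples` format, and the exact visibility rule (sentence tokens see each other, injected tokens see only their own branch and the entity they hang from) are assumptions made for illustration.

```python
def build_sentence_tree(tokens, triples):
    """Flatten a sentence plus injected KG branches.

    tokens: list of str (the original sentence)
    triples: dict mapping a token index to a list of (relation, object) pairs
    Returns (flat_tokens, soft_positions, visible), where visible[i][j] is
    True when token i may attend to token j.
    """
    flat, soft, group = [], [], []  # group entries: (anchor_index, branch_id)
    for i, tok in enumerate(tokens):
        flat.append(tok)
        soft.append(i)
        group.append((i, None))  # branch_id None marks a sentence token
        for b, (rel, obj) in enumerate(triples.get(i, [])):
            # Branch tokens continue counting from their anchor's position.
            for offset, branch_tok in enumerate((rel, obj), start=1):
                flat.append(branch_tok)
                soft.append(i + offset)
                group.append((i, b))
    n = len(flat)
    visible = [[False] * n for _ in range(n)]
    for a in range(n):
        for c in range(n):
            (ia, ba), (ic, bc) = group[a], group[c]
            if ba is None and bc is None:
                visible[a][c] = True  # sentence tokens see each other
            elif ia == ic and (ba is None or bc is None or ba == bc):
                visible[a][c] = True  # a branch and the entity it hangs from
    return flat, soft, visible

flat, soft, visible = build_sentence_tree(
    ["Cook", "visits", "Beijing"],
    {0: [("CEO", "Apple")], 2: [("capital", "China")]},
)
print(flat)  # ['Cook', 'CEO', 'Apple', 'visits', 'Beijing', 'capital', 'China']
print(soft)  # [0, 1, 2, 1, 2, 3, 4]
```

In this toy example "Apple" can attend to "Cook" but not to "Beijing", so the injected fact about Cook cannot change how the rest of the sentence is read.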

## Latent Relation Language Models - Carnegie Mellon

Hayashi et al. propose Latent Relation Language Models (LRLMs), “a class of language models that parameterizes the joint distribution over the words in a document and the entities that occur therein via knowledge graph relations”. LRLMs model $$P(X, Z | C)$$, where $$X$$ is some textual data; $$Z$$ is a sequence of latent variables that decides whether to generate words from a fixed word vocabulary or through spans defined according to their relations with a topic entity of interest; and $$C$$ is the context from a knowledge base. The authors use Latent Predictor Networks as a model to learn this distribution and WikiFacts and WikiText-103 as datasets for evaluation. “Experiments demonstrate empirical improvements over both a word-based baseline language model and a previous approach that incorporates knowledge graph information”.
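
Since $$Z$$ is latent, the likelihood of the text alone follows from summing the joint over every possible $$Z$$. This is just the law of total probability written in the notation above; how the sum is computed efficiently is not covered here.

$$P(X | C) = \sum_{Z} P(X, Z | C)$$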

## Knowledge Enhanced Contextual Word Representations - Allen AI, University of Washington, UC Irvine

Peters et al. propose the knowledge-enhanced BERT (KnowBERT), which integrates WordNet and a subset of Wikipedia through a novel Knowledge Attention and Recontextualization component (KAR). KAR accepts as input the contextual representations at a particular layer and computes knowledge-enhanced representations. These representations are fed into the rest of the model as usual in BERT. KnowBERT “demonstrates improved perplexity, ability to recall facts as measured in a probing task and downstream performance on relationship extraction, entity typing, and word sense disambiguation”. KnowBERT’s runtime is also comparable to BERT’s, and it scales to large knowledge bases.

## Giving BERT a Calculator: Finding Operations and Arguments with Reading Comprehension - Google Research

Andor et al. enable BERT to do lightweight numerical reasoning by augmenting it with a predefined set of executable programs. “Rather than having to learn to manipulate numbers directly, the model can pick a program and execute it”. The study mainly evaluates on the DROP dataset, where adding such shallow programs yields a significant improvement, and also shows that the proposed model performs well on the Illinois dataset of math problems and the CoQA dataset.
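
The "pick a program and execute it" idea can be sketched in a few lines. Everything here is a hypothetical stand-in: the program names in `PROGRAMS`, the candidate format, and the scores (which a trained model would produce) are my assumptions, not the paper's actual operation set.

```python
import operator

# A small, hypothetical set of predefined programs.
PROGRAMS = {
    "sum": operator.add,
    "diff": operator.sub,
    "count": lambda *spans: len(spans),
    "span": lambda text: text,
}

def pick_and_execute(candidates):
    """candidates: list of (score, program_name, args) tuples.

    Picks the highest-scoring candidate and runs its program, so the
    arithmetic is done by ordinary code rather than by the network.
    """
    _, name, args = max(candidates, key=lambda c: c[0])
    return PROGRAMS[name](*args)

# "How many more yards was the 35-yard kick than the 21-yard kick?"
candidates = [
    (0.1, "sum", (35, 21)),
    (0.8, "diff", (35, 21)),
    (0.1, "span", ("35-yard",)),
]
print(pick_and_execute(candidates))  # 14
```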

## Barack’s Wife Hillary: Using Knowledge Graphs for Fact-Aware Language Modeling - UC Irvine, Allen AI

Logan et al. introduce the Knowledge Graph Language Model (KGLM), which has “mechanisms for selecting and copying information from an external knowledge graph”. The authors’ proposed model “maintains a dynamically growing local knowledge graph, a subset of the knowledge graph that contains entities that have already been mentioned in the text, and their related entities. When generating entity tokens, the model either decides to render a new entity that is absent from the local graph, thereby growing the local knowledge graph, or to render a fact from the local graph. When rendering, the model combines the standard vocabulary with tokens available in the knowledge graph, thus supporting numbers, dates, and other rare tokens”. The study uses the Linked WikiText-2 dataset for its evaluations. Curiously, the paper makes no mention of transformers, BERT, or attention, and uses only LSTMs throughout.
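
Here is a toy sketch of the bookkeeping behind a growing local graph. It only covers the data structure, not the language model that decides what to render; the class name, method names, and the nested-dict graph format are assumptions of mine.

```python
class LocalKnowledgeGraph:
    """Tracks the part of a knowledge graph that the text has reached so far."""

    def __init__(self, full_kg):
        self.full_kg = full_kg  # {head: {relation: tail}}
        self.facts = {}         # facts of entities mentioned so far

    def _add(self, entity):
        if entity not in self.facts:
            self.facts[entity] = dict(self.full_kg.get(entity, {}))

    def mention_new(self, entity):
        """Render an entity absent from the local graph, growing the graph."""
        self._add(entity)
        return entity

    def render_fact(self, parent, relation):
        """Render the tail of a fact whose parent is already in the local graph."""
        tail = self.facts[parent][relation]
        self._add(tail)  # the tail is now mentioned, so its facts become available
        return tail

# Hypothetical two-fact knowledge graph.
kg = {
    "Barack Obama": {"spouse": "Michelle Obama"},
    "Michelle Obama": {"birthplace": "Chicago"},
}
graph = LocalKnowledgeGraph(kg)
graph.mention_new("Barack Obama")
print(graph.render_fact("Barack Obama", "spouse"))       # Michelle Obama
print(graph.render_fact("Michelle Obama", "birthplace"))  # Chicago
```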

## COMET: Commonsense Transformers for Automatic Knowledge Graph Construction - Allen AI, University of Washington

Bosselut et al. introduce the Commonsense Transformer (COMET), which follows their proposed generative approach to knowledge base construction: “A model must learn to produce new nodes and identify edges between existing nodes by generating phrases that coherently complete an existing seed phrase and relation type”. To do so, they fine-tune OpenAI’s GPT by training it on a seed set of knowledge tuples. Human judges find that COMET is able to produce high-quality tuples on Atomic and ConceptNet (two knowledge bases), approaching human performance.

## Matching the Blanks: Distributional Similarity for Relation Learning - Google Research

Baldini Soares et al. propose a new training objective for learning relation representations directly from unstructured text. Although they propose a method that outperforms previous works for supervised relation extraction, their main contribution is a method of training such a representation without any supervision from a knowledge graph or human annotations. The training objective they introduce works by “matching the blanks”, which is based on the idea that relation statements that share the same two entities probably encode the same semantic relation. Specifically, the entities in both relation statements are replaced with a “blank” token with a certain probability, and the model tries to classify whether the two statements express the same relation or not. Their proposed model achieves state-of-the-art results on various relation-extraction tasks and is particularly effective in low-resource settings.
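
A rough sketch of how such training pairs could be assembled is below. The template format, the `[BLANK]` string, the function names, and the choice to label a pair positive only when both entities match in the same order are my assumptions for illustration.

```python
import random
from itertools import combinations

BLANK = "[BLANK]"

def blank_entities(statement, alpha, rng):
    """Replace each entity with BLANK independently with probability alpha."""
    template, e1, e2 = statement
    shown1 = BLANK if rng.random() < alpha else e1
    shown2 = BLANK if rng.random() < alpha else e2
    return template.format(e1=shown1, e2=shown2)

def make_pairs(statements, alpha, rng):
    """Pair up statements; label 1 if they share the same two entities."""
    pairs = []
    for a, b in combinations(statements, 2):
        label = int((a[1], a[2]) == (b[1], b[2]))
        pairs.append((blank_entities(a, alpha, rng), blank_entities(b, alpha, rng), label))
    return pairs

# Hypothetical relation statements: (template, entity 1, entity 2).
statements = [
    ("{e1} was founded by {e2}", "Apple", "Steve Jobs"),
    ("{e1} co-founder {e2} died in 2011", "Apple", "Steve Jobs"),
    ("{e1} was founded by {e2}", "Microsoft", "Bill Gates"),
]
for left, right, label in make_pairs(statements, 1.0, random.Random(0)):
    print(label, "|", left, "|", right)
```

Because the label comes from the entities before blanking, the model cannot solve the task by memorising names and has to rely on the surrounding context.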

## ERNIE: Enhanced Language Representation with Informative Entities - Tsinghua University

Zhang et al. train an enhanced language representation model (ERNIE). Their model uses algorithms like TransE to encode the graph structure of knowledge bases into vectors. After recognizing named entities, ERNIE “integrates entity representations in the knowledge module into the underlying layers of the semantic module”. On top of the masked language modeling and next sentence prediction training objectives proposed by BERT, the authors propose a new objective by “randomly masking some of the named entity alignments in the input text and asking the model to select appropriate entities from KGs to complete the alignments”. ERNIE performs comparably to BERT on common NLP tasks (i.e., the GLUE benchmark) but achieves significant improvements on various knowledge-driven tasks.
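
For readers unfamiliar with TransE: it embeds entities and relations in the same vector space so that, for a true fact, head + relation lands close to tail. The sketch below shows only that scoring idea with hand-picked toy vectors; it says nothing about how ERNIE trains or uses the embeddings, and the function name is mine.

```python
import math

def transe_distance(head, relation, tail):
    """Euclidean distance between (head + relation) and tail; lower is more plausible."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))

# Hand-picked 2-d toy embeddings, purely illustrative.
head, relation = [1.0, 0.0], [0.0, 1.0]
print(transe_distance(head, relation, [1.0, 1.0]))  # 0.0
print(transe_distance(head, relation, [4.0, 5.0]))  # 5.0
```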

## Knowledge-Augmented Language Model and Its Application to Unsupervised Named-Entity Recognition - Facebook AI Research

Liu et al. propose the Knowledge-Augmented Language Model (KALM), which “works by providing a language model with the option to generate words from a set of entities from a database”. The model does this through a gating mechanism similar to attention in neural machine translation models. It outperforms popular baselines (which don’t include any transformer-based approaches) on the recipe dataset and CoNLL 2003 in terms of perplexity. The authors also propose an unsupervised named entity recognizer that performs comparably to state-of-the-art supervised models.
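
One way to picture the gating is as a mixture: the model first weighs which source the next word comes from (the general vocabulary or one of the entity types), then mixes the per-source word distributions accordingly. The sketch below is my simplified reading with made-up probabilities; the function name and the dictionary format are assumptions.

```python
def next_word_distribution(type_probs, word_probs_by_type):
    """Mix per-type word distributions using the gate's type probabilities."""
    mixed = {}
    for word_type, p_type in type_probs.items():
        for word, p_word in word_probs_by_type[word_type].items():
            mixed[word] = mixed.get(word, 0.0) + p_type * p_word
    return mixed

# Made-up numbers: the gate thinks the next word is a person 25% of the time.
type_probs = {"general": 0.75, "person": 0.25}
word_probs_by_type = {
    "general": {"the": 0.5, "said": 0.5},
    "person": {"Obama": 1.0},
}
print(next_word_distribution(type_probs, word_probs_by_type))
# {'the': 0.375, 'said': 0.375, 'Obama': 0.25}
```

Under this reading, the gate's type probabilities are also what an unsupervised entity recognizer could read off: a word position where the gate strongly prefers an entity type is likely an entity mention.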

Written on May 18, 2020