A curated list of the most important common-sense datasets in NLP

rohola zandie
11 min read · Jan 17, 2021

The problem of common sense and background knowledge is probably one of the most challenging problems in AI. Most of our reasoning about the world is based on unspoken knowledge in partially observable environments. We draw conclusions based on our shared understanding of the world and reason to the best explanation. Here we introduce some basic datasets and show how other datasets are created from them.

At the core of most of these problems lies the “story comprehension” problem: how do we understand stories and answer questions about them by reading between the lines and filling in gaps of knowledge that are not mentioned explicitly in the text but can be “inferred” using common-sense knowledge? For example, if we read a story about someone who is overweight and goes to the doctor, we can guess that the doctor is probably a nutrition specialist and that the person wants a diet program. Even more implicit is the knowledge that you have to make an appointment to go to the doctor, that the doctor is physically distant, and so on.

For the “story comprehension” task, researchers start with ROCStories: a simple dataset containing very short five-sentence stories about different subjects. You can see one sample here:

The simple task that was introduced was to remove the last sentence and let the model choose the right ending. It is a multiple-choice task that can be reduced to a classification task. Basically, the model should figure out which continuation the first four sentences make most plausible. Keep in mind that there is no single true last sentence; we can always find other plausible alternatives. That’s why we sometimes call them hypotheses.
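
To make the setup concrete, here is a rough Python sketch of a Story Cloze instance and its reduction to classification. The story and the scoring function are made up for illustration; a real system would plug in a trained model as the scorer.

```python
# Sketch of a Story Cloze instance (made-up story, not taken from ROCStories).
story_cloze_example = {
    "context": [
        "Tom forgot his umbrella at home.",
        "On his way to work it started to rain.",
        "He ducked into a coffee shop to wait it out.",
        "After ten minutes the rain stopped.",
    ],
    "endings": [
        "Tom walked the rest of the way to work.",  # plausible
        "Tom decided to go surfing instead.",       # implausible
    ],
    "label": 0,  # index of the correct ending
}

def plausibility(context: list[str], ending: str) -> float:
    """Toy scorer based on word overlap; a real system would use a trained model here."""
    context_words = set(" ".join(context).lower().split())
    return float(len(context_words & set(ending.lower().split())))

# Multiple choice reduced to classification: pick the highest-scoring ending.
scores = [plausibility(story_cloze_example["context"], e) for e in story_cloze_example["endings"]]
prediction = scores.index(max(scores))
print(prediction, story_cloze_example["endings"][prediction])
```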

This can be made even more fine-grained: from a single sentence we can infer one or more plausible hypotheses, or exclude other hypotheses as being contradictory or neutral with respect to the evidence (premise) at hand. Natural Language Inference (NLI) datasets like SNLI and MultiNLI label the relationship between a premise and a hypothesis as “entailment”, “contradiction”, or “neutral”. For example:

Premise: A man inspects the uniform of a figure in some East Asian country.

Hypothesis: The man is sleeping.

Judgment: contradiction
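
As a minimal sketch, an NLI instance (mirroring the SNLI example above) is just a premise-hypothesis pair with one of three labels; a model for this task is a three-way classifier over such pairs.

```python
# Sketch of an NLI instance; labels follow the SNLI/MultiNLI scheme.
NLI_LABELS = ("entailment", "neutral", "contradiction")

nli_example = {
    "premise": "A man inspects the uniform of a figure in some East Asian country.",
    "hypothesis": "The man is sleeping.",
    "label": "contradiction",
}

# Most implementations encode the label as an integer class id.
label_id = NLI_LABELS.index(nli_example["label"])  # -> 2
```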

There is also the Choice of Plausible Alternatives (COPA) dataset, which is designed for causal reasoning. Each question is composed of a premise and two hypotheses (or alternatives), and the task is to select the hypothesis that more plausibly has a causal relation with the premise. For example:

Premise: The man broke his toe. What was the CAUSE of this?
Hypothesis 1: He got a hole in his sock.
Hypothesis 2: He dropped a hammer on his foot.
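
In the same spirit, a COPA instance can be represented as below. The field names are mine, chosen to mirror the premise/two-alternatives layout; the label picks the more plausible alternative, and the question type says whether we are asked for the cause or the effect.

```python
# Sketch of a COPA instance (the example above).
copa_example = {
    "premise": "The man broke his toe.",
    "question": "cause",          # COPA asks either for the CAUSE or the EFFECT
    "choice1": "He got a hole in his sock.",
    "choice2": "He dropped a hammer on his foot.",
    "label": 1,                   # index of the more plausible alternative (0 or 1)
}
```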

Based on the above datasets, we can go into more detail on common-sense datasets, most of which were developed by AllenAI.

ATOMIC dataset

ATOMIC is a common-sense dataset of if-then relations covering 877K textual descriptions of everyday events. The relations are constrained to nine types.

This is one example:

More details here. This dataset, together with ConceptNet, has been used in COMET, a simple transformer model fine-tuned on these knowledge sources. The model can infer the nine classes of inferential relations mentioned above for a new premise.
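
For reference, ATOMIC’s nine relation types are xIntent, xNeed, xAttr, xEffect, xWant, xReact, oEffect, oWant, and oReact (the x* relations are about the subject PersonX, the o* relations about other people involved in the event). Below is a rough, hand-written record in that if-then format; the inferences are mine for illustration, not annotations taken from the dataset.

```python
# The nine ATOMIC if-then relation types.
ATOMIC_RELATIONS = [
    "xIntent", "xNeed", "xAttr", "xEffect", "xWant", "xReact",
    "oEffect", "oWant", "oReact",
]

# Illustrative (hand-written) inferences for one event, in ATOMIC's event-relation-tail style.
atomic_example = {
    "event": "PersonX pays PersonY a compliment",
    "xIntent": ["to be nice"],          # why X did this
    "xNeed":   ["to notice PersonY"],   # what X needed beforehand
    "xAttr":   ["friendly"],            # how X is perceived
    "xEffect": ["smiles"],              # effect on X
    "xWant":   ["to chat more"],        # what X wants next
    "xReact":  ["good"],                # how X feels
    "oEffect": ["blushes"],             # effect on others
    "oWant":   ["to thank PersonX"],    # what others want next
    "oReact":  ["flattered"],           # how others feel
}
```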

α-NLI and α-NLG

Abductive reasoning is inference to the most plausible explanation for incomplete observations. It goes from the observation back to the causes or premises, and because of this it is also called backward reasoning. The name contrasts with forward reasoning, which is based on the well-known modus ponens. For example, forward reasoning (or deduction) looks like this:

Rule: All men are mortal.
Observation: Socrates is a man.
Conclusion: Socrates is mortal.

But imagine the following reasoning:

Observation: The road is wet.
Rule: When it rains, the road gets wet.
Conclusion: It was raining.

This conclusion might be wrong (maybe another car made the road wet), but it is the most plausible explanation. Unlike deduction and induction, which are the two most studied modes of reasoning, abduction is much more complicated and involves generating hypotheses. It is non-monotonic reasoning, meaning the conclusions we draw are subject to revision when new data comes in.

The main differences between the three kinds of reasoning are as follows:

1- In deduction, it is impossible for the premises to be true and the conclusion false. The relationship between premises and conclusion is one of “necessity”: all humans are mortal, therefore all men are mortal.

2- In induction, it is improbable for the premises to be true and the conclusion false. The relationship between premises and conclusion is one of “probability”: all the observed swans are white, therefore all swans are white.

3- In abduction, it is implausible for the premises to be true and the conclusion false. The relationship between premises and conclusion is one of “plausibility”: the knife is in the back of the corpse, so it is plausible that the person was killed (as opposed to suicide or an accident).

The third kind of reasoning is the weakest of all.

Abductive reasoning involves understanding different scenarios and background knowledge about the world. For example, in the picture above, if the window is only cracked open, a large bird could not have flown into the house. This involves understanding the size of a typical large bird, the size of a window opening, and the relevant physical limitations.

αNLI is a dataset in which each instance contains two competing hypotheses; it is easy for a human to choose one as the “reasonable explanation” for the observations, but it is challenging for AI systems. (Note that in reality there are many more competing hypotheses at play, but here we restrict them to two.) The dataset was created based on the ROCStories dataset.

The αNLI task requires the model to predict which of the two hypotheses is the right one to choose.
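
Schematically, an αNLI instance pairs two observations (before and after) with two candidate explanations for what happened in between. A sketch, with invented sentences and illustrative field names, looks like this:

```python
# Sketch of an αNLI instance: two observations O1 (before) and O2 (after),
# and two candidate hypotheses that could explain what happened in between.
alpha_nli_example = {
    "obs1": "Jenny left her window open before going to work.",
    "obs2": "When she came home, her room was a mess.",
    "hyp1": "A bird flew in through the open window and knocked things over.",
    "hyp2": "Jenny's neighbor repainted his fence.",
    "label": 0,  # index of the more plausible hypothesis
}
```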

αNLG uses the same data, but the task is to generate the hypothesis. To make this easier for the model, αNLG is augmented with the nine relations from COMET (trained on ATOMIC), which describe the if-then consequences of both observations. To do this, the authors add a beam of five inferences for each relation. A sample of it can be seen below:

Defeasible NLI or δ-NLI

One feature of abductive reasoning is that it is often also defeasible. Defeasible arguments are ones that are acceptable at the moment even though they may be open to defeat in the future: new evidence may come in later that defeats the argument.

The canonical example of a defeasible argument, used so often in AI, is the Tweety argument:

Observation: Tweety is a bird.
Rule: Birds fly.
Conclusion: Tweety flies.

The Tweety argument may be rationally acceptable assuming that we have no information about Tweety except that he is a bird. But suppose new information (an Update) comes in telling us that Tweety is a penguin. A penguin is a bird, but it cannot fly.

The second premise of the Tweety argument (the rule) is not a universal generalization of the absolute kind that can be rendered by the universal quantifier of deductive logic. It is not really an inductive generalization, either. It states that birds normally fly or that one can normally expect a bird to fly, subject to exceptions.

Not all possible exceptions can be predicted in advance. Thus a defeasible argument is open-ended, whereas a deductively valid argument is closed in that it necessarily implies its conclusion. Deductive logic is monotonic, which means new facts or knowledge will not change the conclusion of a valid deductive inference. Defeasible reasoning, on the other hand, is non-monotonic, which means the conclusions can change given new facts.

There is a very close connection between abductive reasoning and defeasible reasoning.

Updates can strengthen or weaken the default hypothesis. An example from the δ-NLI dataset can be seen below:
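
Since the structure is easy to miss in prose, here is a sketch of a defeasible-inference instance built from the Tweety example above; the field names and the extra strengthening update are mine, for illustration only.

```python
# Sketch of a defeasible-inference (δ-NLI-style) instance, reusing the Tweety example.
# An update either strengthens or weakens the default hypothesis.
delta_nli_example = {
    "premise": "Tweety is a bird.",
    "hypothesis": "Tweety can fly.",
    "update": "Tweety is a penguin.",
    "update_type": "weakener",   # the other label is "strengthener"
}

strengthening_update = {
    **delta_nli_example,
    "update": "Tweety was seen gliding between two trees.",
    "update_type": "strengthener",
}
```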

Time-Travel or Counterfactual Reasoning

Based on Wikipedia:

Counterfactual thinking is a concept in psychology that involves the human tendency to create possible alternatives to life events that have already occurred; something that is contrary to what actually happened. Counterfactual thinking is, as it states: “counter to the facts”. These thoughts consist of the “What if?” and the “If I had only…” that occur when thinking of how things could have turned out differently.

Counterfactual reasoning requires predicting how alternative events, contrary to what actually happened, might have resulted in different outcomes. One desired property of AI systems is the ability to predict causal changes in future events given a counterfactual condition applied to the original chain of events.

For example, given the original story in the figure above, where “Pierre loved Halloween. He decided to be a vampire this year. He got a black cape and white face paint…”, and a counterfactual condition, “what if Pierre decided to be a werewolf instead of a vampire?”, an intelligent system should be able to revise the subsequent events in the story appropriately, for example by recognizing that a brown sweater would be more appropriate than a black cape.

In the TimeTravel dataset, which is based on ROCStories, the second sentence of the original story is changed and the story continues toward a different ending.

An important challenge in counterfactual reasoning is causal invariance, namely, the aspects of future events that are invariant under the counterfactual conditions. This is necessary to accurately reason about the new consequences with minimal edits to the original sequence of events, instead of being confounded by spurious correlations.

Similar issues arise in the area of controllable language generation, which involves preserving the content of text while changing it along a single dimension or multiple dimensions, such as theme (Koncel-Kedziorski et al., 2016), style (Lample et al., 2019), and sentiment (Shen et al., 2017). Reasoning in these tasks is limited to discrete axes (e.g., sentiment), which are often categorized with a closed label set ({positive, negative}). Because of controllability motivations, these axes and labels are generally known a priori. In contrast, counterfactual rewriting focuses on the causes and effects of a story, dimensions that can require more complex and diverse, yet potentially subtle, changes to accommodate the counterfactual event.

Each example consists of a five-sentence story S = (s1, …, s5) with a general structure where the first sentence s1 sets up the premise, the second sentence s2 provides more information about the initial context, and the last three sentences s3:5 are the original ending of the story. We are further given an additional sentence s′2, which is counterfactual to the initial context s2. That is, s′2 states something contrary to s2, which in turn can make the original ending s3:5 no longer valid. Thus, the goal of the task is to rewrite the ending such that the edited ending s′3:5 minimally modifies the original one and regains narrative coherence with the new counterfactual context.
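
Putting that structure into code, a TimeTravel instance looks roughly like this. The Pierre story is the one quoted above; the sentences not quoted in the text are invented for illustration, so this is a sketch of the format rather than an actual dataset record. Note how the last sentence is left unchanged: that is the "minimal edit" and causal-invariance idea in action.

```python
# Sketch of a TimeTravel instance, following the s1 / s2 / s3:5 structure described above.
timetravel_example = {
    "premise":         "Pierre loved Halloween.",                # s1
    "initial_context": "He decided to be a vampire this year.",  # s2
    "original_ending": [                                          # s3:5
        "He got a black cape and white face paint.",
        "He practiced his spooky laugh in the mirror.",
        "Everyone at the party loved his costume.",
    ],
    "counterfactual":  "He decided to be a werewolf this year.",  # s'2
    "edited_ending": [                                            # s'3:5
        "He got a brown sweater and some fake fur.",
        "He practiced his spooky howl in the mirror.",
        "Everyone at the party loved his costume.",               # unchanged: minimal edit
    ],
}
```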

More here and here.

CommonGen

CommonGen is a dataset designed for common-sense controlled language generation, or generative common-sense reasoning. Given a set of concepts (e.g., {dog, frisbee, catch, throw}), the task is to generate a coherent sentence describing an everyday scenario using these concepts (e.g., “a man throws a frisbee and his dog catches it”).

The CommonGen task is challenging because it inherently requires:

  1. relational reasoning with background commonsense knowledge,
  2. the compositional generalization ability to work on unseen concept combinations.
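
A CommonGen instance is essentially a concept set paired with a reference sentence. The sketch below uses the frisbee example from above and includes a crude version of the "cover all the concepts" constraint; the coverage check is mine, and real evaluation also uses generation metrics such as BLEU, CIDEr, and SPICE.

```python
# Sketch of a CommonGen instance: a concept set in, a scene-describing sentence out.
commongen_example = {
    "concepts": ["dog", "frisbee", "catch", "throw"],
    "target": "A man throws a frisbee and his dog catches it.",
}

def covers_concepts(sentence: str, concepts: list[str]) -> bool:
    """Crude check that every concept (or an inflected form of it) appears in the sentence."""
    text = sentence.lower()
    return all(c in text for c in concepts)

print(covers_concepts(commongen_example["target"], commongen_example["concepts"]))  # True
```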

A sample of the data can be seen below:

More here

Social Common Sense Reasoning

Here we investigate two main datasets: SocialIQA and Social Chemistry 101.

SocialIQA is the first large-scale benchmark for commonsense reasoning about social situations. SocialIQA contains 38,000 multiple choice questions for probing emotional and social intelligence in a variety of everyday situations.

Performing these inferences is what makes us experts at navigating social situations, and is closely related to Theory of Mind, i.e., the ability to reason about the beliefs, motivations, and needs of others.

This dataset is based on the ATOMIC dataset. SocialIQA contains several question types that cover different kinds of inferential reasoning; the question types are derived from the ATOMIC inference dimensions.
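
Schematically, a SocialIQA instance is a social context, a question about it, and three candidate answers. The example below is invented and the field names are illustrative; it only shows the shape of the data.

```python
# Sketch of a SocialIQA-style instance (made-up example, not from the dataset).
social_iqa_example = {
    "context": "Alex spilled coffee all over Jordan's laptop right before the meeting.",
    "question": "How would Jordan feel as a result?",
    "answers": [
        "annoyed and stressed",     # plausible
        "grateful to Alex",         # implausible
        "excited about the coffee", # implausible
    ],
    "label": 0,  # index of the most plausible answer
}
```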

Here is a sample of the dataset:

Another important dataset on social intelligence is Social Chemistry 101.

Social Chemistry 101 is a new conceptual formalism for studying people’s everyday social norms and moral judgments over a rich spectrum of real-life situations described in natural language.

It is a large-scale corpus that catalogs 292k rules-of-thumb, such as “It is rude to run a blender at 5am”, as the basic conceptual units. Each rule-of-thumb is further broken down along 12 different dimensions of people’s judgments, including social judgments of good and bad, moral foundations, expected cultural pressure, and assumed legality, which together amount to over 4.5 million annotations of categorical labels and free-text descriptions.
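
As a sketch, a rule-of-thumb record pairs a situation with a rule-of-thumb and its judgments. The values and field names below are illustrative, and I show only a few of the twelve dimensions mentioned above rather than the paper's exact annotation schema.

```python
# Sketch of a Social Chemistry 101-style record (illustrative, not the paper's schema).
social_chem_example = {
    "situation": "Running a blender at 5am while my roommates are asleep.",
    "rule_of_thumb": "It is rude to run a blender at 5am.",
    "judgments": {
        "social_judgment": "bad",          # good vs. bad
        "moral_foundation": "care-harm",   # which moral foundation it touches
        "cultural_pressure": "strong",     # expected pressure to conform
        "legality": "legal",               # assumed legality
    },
}
```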

Here is a sample of the data:

It can also be seen as below:

Winogrande

Winogrande is a successor of the Winograd Schema Challenge: a common-sense fill-in-the-blank problem where the blank corresponds to the mention of one of the two names in the context.

On the surface, Winograd Schema questions simply require the resolution of anaphora: the machine must identify the antecedent of an ambiguous pronoun in a statement. This makes it a task of natural language processing, but for Winograd Schemas, the task requires the use of knowledge and commonsense reasoning.

One example is the following:

The difference between Winogrande and Winograd is that the authors removed the language-based biases that make it easier for models to choose the right answer through correlation rather than by really understanding the relationships:

WSC problems are constructed as pairs (called twins) of nearly identical questions with two answer choices. The questions include a trigger word that flips the correct answer choice between the two questions. Examples (1)-(3) are drawn from WSC (Levesque, Davis, and Morgenstern 2011) and (4) from DPR (Rahman and Ng 2012). Examples marked with ✗ have a language-based bias that current language models can easily detect. Example (4) is undesirable since the word “predators” is more often associated with the word “lions” than with the word “zebras”.
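
To make the “twin” idea concrete, here is a sketch using the classic trophy-and-suitcase Winograd schema; the field names are illustrative rather than the dataset's exact ones.

```python
# Sketch of a Winogrande-style twin pair: nearly identical sentences where a single
# trigger word ("big" vs. "small") flips which option fills the blank.
winogrande_twins = [
    {
        "sentence": "The trophy does not fit into the suitcase because _ is too big.",
        "option1": "the trophy",
        "option2": "the suitcase",
        "answer": "option1",   # "big" refers to the trophy
    },
    {
        "sentence": "The trophy does not fit into the suitcase because _ is too small.",
        "option1": "the trophy",
        "option2": "the suitcase",
        "answer": "option2",   # trigger word flipped, so "small" refers to the suitcase
    },
]
```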
