Knowledge graphs (KGs) provide structured sources of information for many downstream tasks, so building a large KG from a large text corpus is an interesting problem. Being able to learn a KG from web-scale corpora means that we could leverage the large amount of unstructured information on websites (e.g. TechCrunch) to build structured knowledge bases. At a large scale, however, a KG is hard to maintain, because it is not easy to keep track of issues like fact coverage, freshness and correctness. This blog post serves as a short introduction to the techniques used in building a simple KG.
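Typically, KG construction using machine learning techniques consists of several stages, each of which can be considered an individual problem. The two main stages we consider for a simple KG are entity identification, which finds the nodes of the graph, and relation extraction, which finds the edges between them.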
To illustrate the techniques and challenges here, we use a short passage from Wikipedia, shown below. We will be building a small knowledge graph using this example.
Gennaro Basile was an Italian painter, born in Naples but active in the German-speaking countries. He settled at Brünn, in Moravia, and lived about 1756. His best picture is the altar-piece in the chapel of the chateau at Seeberg, in Salzburg. Most of his works remained in Moravia. This article about an Italian painter born in the 18th century is a stub.

Passage from the Wikipedia article about Gennaro Basile.
The first step is to identify the nodes in the graph. Each node typically represents a unique entity that is found in the corpus, and as far as possible, no two nodes should represent the same entity. Entities can be specified beforehand (e.g. by scraping Wikipedia article titles) or found using NLP techniques such as Named Entity Recognition (NER). Both approaches additionally require coreference resolution and entity linking across sentences in order to correctly recognise the same entity across a longer stretch of text.
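As a minimal sketch of the first option, assume the entity names are already known. The `KNOWN_ENTITIES` list below is written by hand for illustration rather than scraped from Wikipedia, and `find_mentions` is a hypothetical helper. A statistical NER model would take the place of this plain string lookup in practice.

```python
import re

# Hypothetical gazetteer: entity names assumed to be known beforehand.
KNOWN_ENTITIES = ["Gennaro Basile", "Naples", "Brünn", "Moravia", "Seeberg", "Salzburg"]

def find_mentions(text, entities):
    """Return (start, end, name) for every exact mention of a known entity."""
    mentions = []
    for name in entities:
        # \b stops a name from matching inside a longer word.
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            mentions.append((match.start(), match.end(), name))
    return sorted(mentions)  # ordered by position in the text

text = "Gennaro Basile was an Italian painter, born in Naples. He settled at Brünn, in Moravia."
mentions = find_mentions(text, KNOWN_ENTITIES)
for start, end, name in mentions:
    print(start, end, name)
```

Note that the pronoun "He" is not found by this lookup, which is why a coreference step is still needed.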
From the passage, we see that Gennaro Basile is an entity that is also referred to by the words He and his inside this passage. Using the HuggingFace NeuralCoref library, we can identify and replace those words with the entity:
Gennaro Basile was an Italian painter, born in Naples but active in the German-speaking countries. Gennaro Basile settled at Brünn, in Moravia, and lived about 1756. Gennaro Basile best picture is the altar-piece in the chapel of the chateau at Seeberg, in Salzburg. Most of Gennaro Basile works remained in Moravia. This article about an Italian painter born in the 18th century is a stub.

The Wikipedia passage processed using NeuralCoref, with the inserted entities marked in bold.
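To make the substitution step concrete, here is a deliberately naive stand-in that uses only the standard library. It is not NeuralCoref and does not resolve coreference in general: it assumes every masculine third-person pronoun in the text refers to one known entity, which only holds for short, single-subject passages like this one. The function name `replace_pronouns` is hypothetical.

```python
import re

def replace_pronouns(text, entity):
    """Replace every standalone he/his/him with the entity name (toy heuristic)."""
    # \b ensures that words such as "the" or "this" are left untouched.
    pattern = re.compile(r"\b(he|his|him)\b", flags=re.IGNORECASE)
    return pattern.sub(entity, text)

passage = (
    "Gennaro Basile was an Italian painter. He settled at Brünn, in Moravia. "
    "Most of his works remained in Moravia."
)
resolved = replace_pronouns(passage, "Gennaro Basile")
print(resolved)
```

As in the processed passage, possessives are replaced verbatim, so "his works" becomes "Gennaro Basile works".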
Entity linking refers to the task of identifying that all the instances of Gennaro Basile refer to the same entity, and therefore the same node in the KG. The simplest method of entity linking is just using the exact spelling of the entity name to perform the matching.
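Exact-spelling linking can be sketched in a few lines. The mention list below is written by hand for illustration, and `link_entities` is a hypothetical helper, not part of any library.

```python
def link_entities(mentions):
    """Assign one node id per distinct surface string."""
    node_ids = {}
    linked = []
    for mention in mentions:
        # setdefault creates a new id only the first time a spelling is seen.
        node_id = node_ids.setdefault(mention, len(node_ids))
        linked.append((mention, node_id))
    return node_ids, linked

mentions = ["Gennaro Basile", "Naples", "Gennaro Basile", "Moravia", "Gennaro Basile", "Moravia"]
node_ids, linked = link_entities(mentions)
print(node_ids)
```

The weakness of this approach is visible in the code: a spelling variant would receive its own node, while two different entities that share a spelling would be merged into one.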
By the end of the entity identification process, we have a list of entities that we can use as the nodes in a KG. In our simple example, we used spaCy NER and NeuralCoref to get this list. In more complex examples, additional information may be needed to tell apart two entities with the same spelling.
After we have obtained the list of entities, we can perform the relation extraction. This is a much more challenging process that can be tackled by using a deep learning model. Typically, this involves a model being presented with a sentence and a pair of entities, with the output being a classification of the relation.
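The input and output of such a classifier can be sketched as follows. The cue-phrase lookup is only a placeholder for a trained model, and the label names (`BORN_IN`, `SETTLED_AT`, `NO_RELATION`) as well as `RelationExample` and `classify_relation` are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RelationExample:
    sentence: str
    head: str  # first entity of the pair
    tail: str  # second entity of the pair

# Hypothetical label set; a real model would be trained on an annotated schema.
CUE_PHRASES = {"born in": "BORN_IN", "settled at": "SETTLED_AT"}

def classify_relation(example):
    """Toy stand-in for a learned classifier: look for a cue phrase between the entities."""
    start = example.sentence.find(example.head)
    end = example.sentence.find(example.tail)
    if start == -1 or end == -1 or start >= end:
        return "NO_RELATION"
    # Only the text between the head and the tail is inspected.
    between = example.sentence[start + len(example.head):end].lower()
    for cue, label in CUE_PHRASES.items():
        if cue in between:
            return label
    return "NO_RELATION"

example = RelationExample(
    sentence="Gennaro Basile was an Italian painter, born in Naples.",
    head="Gennaro Basile",
    tail="Naples",
)
label = classify_relation(example)
print(label)
```

A deep learning model would keep the same interface (sentence plus entity pair in, relation label out) while replacing the cue-phrase lookup with learned features.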
In our example, we simply parse the text and the dependency tree created by spaCy to look for two types of relations, one of which is IN, adding new types of entities if needed. The end result, drawn using NetworkX, is shown below:
From the above image, we can see that we've built a reasonably good representation of the knowledge in the Wikipedia paragraph. For example, we can read off the KG that Gennaro Basile best picture is the altar-piece in the chapel in Seeberg (among other possibly equivalent relations; knowledge graph refinement is another type of problem altogether).
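Reading a chain of facts off the graph amounts to following edges of one relation type. The sketch below uses a plain dictionary instead of NetworkX so that it runs with the standard library alone. The triples are written by hand from the passage and are assumed, not the output of the pipeline described in this post; `build_graph` and `follow` are hypothetical helpers.

```python
from collections import defaultdict

# Hand-written triples for illustration only.
TRIPLES = [
    ("altar-piece", "IN", "the chapel"),
    ("the chapel", "IN", "Seeberg"),
    ("Seeberg", "IN", "Salzburg"),
    ("Brünn", "IN", "Moravia"),
]

def build_graph(triples):
    """Map each head node to a list of (relation, tail) edges."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return graph

def follow(graph, start, relation):
    """Follow edges of one relation type from start and return the visited nodes in order."""
    path, seen = [start], {start}
    node = start
    while True:
        targets = [t for r, t in graph.get(node, []) if r == relation and t not in seen]
        if not targets:
            return path
        node = targets[0]  # take the first match; a real KG may branch here
        path.append(node)
        seen.add(node)

graph = build_graph(TRIPLES)
chain = follow(graph, "altar-piece", "IN")
print(" IN ".join(chain))
```

The `seen` set guards against cycles, which can appear when a refinement step has not yet removed contradictory or duplicate edges.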
In this post, we gave a brief overview of how to construct a simple knowledge graph and of some of the challenges involved. The code for this blog post can be found here.