
Introduction

The Temporal Wikipedia project is a long-term project that aims to utilize the temporal information (edit histories) from the WikiMedia data dumps to improve natural language processing (NLP) models in the future.

Objective

The objective of this project is to analyze and utilize the Wikipedia edit histories for future use in the natural language processing (NLP) field. Wikipedia has been built over a long time, and nearly all long documents have grown through complicated editing histories. We believe that this information could help optimize NLP models for generating long documents. The main challenge of this project is handling the large size of Wikipedia's edit history, which is larger than 5 TB. Specifically, we are planning to use distributed processing systems such as Hadoop or Spark to build a parser. In addition, we are going to explore a way to efficiently construct a corpus to train machine learning models.

Current Phase

The first phase is to build a dataset that stores the differences of edit histories in an efficient way while also keeping the dataset useful. This is what we are doing in the Fall 2022 semester.

It is a challenging task because we want to make the dataset useful, which means a traditional git-like difference pattern does not work: knowing the character-wise change set does not help much when training a word-based language model.

Background Problem

This project mainly addresses the following problem: how can we efficiently store the edit history of Wikipedia in a useful way? There are two key points:

  • Efficient: Store the histories in a much smaller size while keeping all information available after reconstruction. This looks like a compression problem, but we want the compressed version to be directly usable for training a language model, so typical compression methods do not work because a compressed file cannot be used directly.

  • Useful: The dataset must make sense to the language model. For example, changing ing to ed makes little sense: we might guess the purpose of the edit, but we cannot get any context around that word, which makes future encoding and understanding difficult. Changing playing to played makes much more sense because we know the context of the change.

Potential Solution

We can achieve both key points by optimizing a git-like difference pattern. The raw data of the Wikipedia edit histories is stored as full revisions, which means there is a lot of duplicated content. If you are familiar with computer science or software engineering, you should be familiar with Git. It is a great way to store the difference between two versions of a file, and it seems like a perfect solution for compressing edit histories while keeping the compressed version usable. However, this method is not suitable for our purpose because we lose the context, even the closest one. Therefore, we are going to optimize the original git-like difference pattern into an ergonomic one, which is understandable without even looking back at its original location.

Here is an example of two patterns parsing the same edit history:

💡

Example edit history: writes -> details

Assume that indexes are stored properly. For a better understanding of the example, I only keep the change type, original content, and new content.

Git-like Difference Pattern

(replace, 'wr', 'deta')
(equal, 'i', 'i')
(replace, 'te', 'l')
(equal, 's', 's')

If we strip the equal part, we can get the following pattern:

(replace, 'wr', 'deta')
(replace, 'te', 'l')
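
For reference, this kind of character-level pattern can be produced with an off-the-shelf differ. Below is a minimal sketch using Python's built-in difflib; it is only an illustration, not necessarily the differ this project will use.

import difflib

old, new = 'writes', 'details'

matcher = difflib.SequenceMatcher(None, old, new)
pattern = []
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag == 'equal':  # strip the equal parts, as described above
        continue
    pattern.append((tag, old[i1:i2], new[j1:j2]))

print(pattern)
# [('replace', 'wr', 'deta'), ('replace', 'te', 'l')]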

How can we understand this? And how can a language model be trained properly with this kind of dataset?

Ergonomic Difference Pattern Proposal

(replace, 'writes', 'details')

This version makes a lot more sense; at least we know exactly which word is changing.

As the content becomes more complex, interpreting a raw git-like difference gets harder and harder, while the ergonomic version remains easy to understand.

A More Complex Example

  • Origin: The quick brown fox jumps over the lazy dog.
  • Edit: A team of quiet brown foxes jump over the lazy dog.

Assume that indexes are stored properly for both patterns. I omit the indexes for a cleaner comparison.

Before putting them into the differ, we preprocess both sentences into all lowercase.

Git-like Difference Pattern

# the -> a team of
(insert, '', 'a ')
(delete, 'h', '')
(insert, '', 'am of')
 
# quick -> quiet
(replace, 'ck', 'et')
 
# fox -> foxes
(insert, '', 'es')
 
# jumps -> jump
(delete, 's', '')

Ergonomic Difference Pattern

(replace, 'the quick', 'a team of quiet')
(replace, 'fox jumps', 'foxes jump')
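
The ergonomic pattern above can be approximated by running the same kind of differ over whole words instead of characters, so every change keeps its word-level context. Here is a minimal sketch using Python's difflib on word tokens; it is an illustrative approximation, not the project's final differ.

import difflib

old = 'the quick brown fox jumps over the lazy dog.'
new = 'a team of quiet brown foxes jump over the lazy dog.'

old_words, new_words = old.split(' '), new.split(' ')
matcher = difflib.SequenceMatcher(None, old_words, new_words)

pattern = []
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag == 'equal':
        continue
    pattern.append((tag, ' '.join(old_words[i1:i2]), ' '.join(new_words[j1:j2])))

print(pattern)
# [('replace', 'the quick', 'a team of quiet'), ('replace', 'fox jumps', 'foxes jump')]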

Storing Pattern

Storing the pattern in the above format still wastes a lot of characters. The replace, insert, and delete status can be inferred directly from the last two fields, so we do not need to store it at all. From the above example, we also need to store the index of each change in order to locate and reconstruct the original content in the future. So, we can store the pattern in the following format:

(0, 'the quick', 'a team of quiet')
(16, 'fox jumps', 'foxes jump')

Explanation

  • 0: The index of the first character of the original content.
  • the quick: The original content.
  • a team of quiet: The new content.
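
To confirm that this format keeps all information available, here is a minimal sketch of reconstructing the edited text from the stored tuples. The indexes refer to positions in the original content, so the changes are applied from right to left to keep earlier indexes valid; the helper name apply_pattern is hypothetical.

def apply_pattern(original, pattern):
    text = original
    # apply changes with larger indexes first so earlier indexes stay valid
    for index, old, new in sorted(pattern, reverse=True):
        assert text[index:index + len(old)] == old  # sanity check against the original
        text = text[:index] + new + text[index + len(old):]
    return text

original = 'the quick brown fox jumps over the lazy dog.'
pattern = [(0, 'the quick', 'a team of quiet'), (16, 'fox jumps', 'foxes jump')]

print(apply_pattern(original, pattern))
# a team of quiet brown foxes jump over the lazy dog.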

Final Goal

If we are able to achieve these designs, we will get a much more useful temporal dataset for training a language model. This is the final goal of the current phase (Fall 2022) of the Temporal Wiki project.

Last updated on December 7, 2022