Make temporal data space-efficient and useful

Temporal data can be huge and highly redundant: successive versions of the same record often differ only slightly, so storing every version in full wastes space. Keeping such data both space-efficient and useful is a challenge. This project provides a set of tools to address that problem for temporal data.
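To illustrate why the redundancy can be exploited, here is a minimal sketch using Python's standard difflib on toy revision texts. The revisions, names, and the line-level delta scheme are illustrative assumptions, not necessarily the storage format this project will adopt.

import difflib

# Two toy "revisions" of an article, as lists of lines. Real Wikipedia
# revisions are much longer and usually differ in only a few places.
rev_0 = [
    "Paris is the capital of France.",
    "It lies on the Seine in northern France.",
    "The city hosted the 1900 Summer Olympics.",
]
rev_1 = [
    "Paris is the capital and most populous city of France.",
    "It lies on the Seine in northern France.",
    "The city hosted the 1900 and 1924 Summer Olympics.",
]

# Keep rev_0 in full, but keep rev_1 only as a line-level delta against rev_0.
delta = list(difflib.ndiff(rev_0, rev_1))

# The newer revision can be reconstructed exactly from the base plus the delta.
reconstructed = list(difflib.restore(delta, 2))  # 2 = the right-hand sequence
assert reconstructed == rev_1

# Only the '-'/'+' lines carry new content; unchanged lines could be referenced
# rather than copied in a more compact on-disk format. For real articles, where
# an edit touches a handful of lines out of hundreds, the delta is far smaller
# than a full copy (it is not for this tiny toy example).
changed = [line for line in delta if line[:1] in "+-"]
unchanged = sum(1 for line in delta if line[:1] == " ")
print(len(changed), "changed lines;", unchanged, "unchanged")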

Objective

The objective of this project is to analyze the Wikipedia edit histories and make them usable for future work in natural language processing (NLP). Wikipedia has been built over a long period, and nearly all of its long documents emerged through complicated editing histories. We believe this information could help optimize NLP models for generating long documents. The main challenge is the sheer size of Wikipedia's edit history, which exceeds 5 TB. Specifically, we plan to build a parser on a distributed processing system such as Hadoop or Spark. In addition, we will explore ways to efficiently construct a corpus from the parsed histories for training machine learning models.
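As a rough illustration of what such a Spark-based parser could look like, here is a minimal PySpark sketch. It assumes the history dump has already been split into per-part .xml.bz2 files; the paths in DUMP_PARTS, the output path, and the XML namespace version are placeholder assumptions, and only a small index of (page, revision, timestamp, text length) records is extracted rather than the full revision text.

import bz2
import xml.etree.ElementTree as ET
from pyspark.sql import SparkSession

# Namespace of the MediaWiki export schema; the exact version varies by dump.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"
DUMP_PARTS = ["part-000.xml.bz2", "part-001.xml.bz2"]  # hypothetical paths

def parse_part(path):
    """Stream one dump part and yield one record per revision."""
    with bz2.open(path, "rb") as f:
        for _event, elem in ET.iterparse(f):  # default: "end" events only
            if elem.tag == NS + "page":
                page_id = elem.findtext(NS + "id")
                title = elem.findtext(NS + "title")
                for rev in elem.findall(NS + "revision"):
                    text = rev.findtext(NS + "text") or ""
                    yield (page_id, title,
                           rev.findtext(NS + "id"),
                           rev.findtext(NS + "timestamp"),
                           len(text))
                elem.clear()  # free memory once the page has been processed

spark = SparkSession.builder.appName("edit-history-parser").getOrCreate()
records = (spark.sparkContext
           .parallelize(DUMP_PARTS, numSlices=len(DUMP_PARTS))
           .flatMap(parse_part))
df = records.toDF(["page_id", "title", "rev_id", "timestamp", "text_len"])
df.write.parquet("edit_history_index.parquet")  # hypothetical output location

Streaming each part with iterparse keeps memory roughly bounded per page; pages with extremely long edit histories may still need a more careful incremental parser, and the real corpus-construction step would keep the revision text rather than just its length.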

Last updated on December 8, 2022