Data mining: UN Docs

Automatic detection of topics in UN documents and agreements, creation of suitable categories and assignment of documents to categories.

Challenges: automatic text recognition, making the system robust, interpreting initial results and visualizing them
Tools: PHP, JavaScript, HTML, CSS, MATLAB
Deliverables: report, proof of concept

Within the UN, a large number of documents and agreements are signed every day. Each one needs to be stored and categorized. Categorization currently requires a lot of manual labor: every document has to be read and placed in one or more categories by the reader. If categorization could be done at least partially by an automatic process, it would greatly speed up the work and reduce the workload. We attempted to solve this problem in our data mining project.

We approached this project from a technical point of view. Automatic categorization requires an understanding of what the documents contain. One way to get this is to find out which words often appear together in the same sentence. For example, if "water" and "malaria" are often found next to each other, it is likely that the document discusses a topic connecting the two. However, before these relations (n-grams) can be calculated, common words first need to be filtered out of the texts and all remaining words need to be reduced to their base form.
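The preprocessing and counting steps described above can be sketched roughly as follows. This is an illustrative Python sketch, not the project's code (which was written with the tools listed above). The stop-word list, the `normalize` helper (a crude stand-in for proper stemming) and the `cooccurring_pairs` function are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "is", "are", "to", "by", "for", "with", "on"}


def normalize(word):
    """Crude stand-in for stemming: lowercase and strip a trailing plural 's'."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return word


def cooccurring_pairs(text):
    """Count unordered pairs of content words that appear in the same sentence."""
    counts = Counter()
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[A-Za-z]+", sentence)
        # Drop common words, reduce the rest to a base form, and deduplicate.
        terms = sorted({normalize(w) for w in words if w.lower() not in STOP_WORDS})
        for i, first in enumerate(terms):
            for second in terms[i + 1:]:
                counts[(first, second)] += 1
    return counts


sample = (
    "Clean water reduces malaria. "
    "Malaria spreads near standing water. "
    "Funding for water projects is limited."
)
pairs = cooccurring_pairs(sample)
print(pairs.most_common(3))
```

In this sample, the pair ("malaria", "water") is the only one that occurs in more than one sentence, so it ranks first. A real system would use a full stop-word list and a proper stemmer or lemmatizer instead of the simple plural rule.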

Our script was able to generate suitable categories for a given set of documents and to create a tree that sorts the documents into these suggested categories.
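How the script built its tree is not described here. As one possible illustration, the Python sketch below groups each document under its most frequent co-occurring word pair, giving a two-level tree (first term, then second term, then documents). The input data, the `build_category_tree` function and the choice of a single category per document are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical input: per-document counts of co-occurring term pairs.
doc_pairs = {
    "doc_a": Counter({("malaria", "water"): 5, ("funding", "water"): 1}),
    "doc_b": Counter({("malaria", "water"): 3, ("health", "malaria"): 2}),
    "doc_c": Counter({("security", "trade"): 4}),
}


def build_category_tree(doc_pairs):
    """Group documents under their most frequent pair: first term -> second term -> documents."""
    tree = defaultdict(lambda: defaultdict(list))
    for doc, counts in sorted(doc_pairs.items()):
        if not counts:
            continue  # no usable pairs, so the document stays uncategorized
        (first, second), _ = counts.most_common(1)[0]
        tree[first][second].append(doc)
    # Convert to plain dicts so the result is easy to print or serialize.
    return {first: dict(children) for first, children in tree.items()}


tree = build_category_tree(doc_pairs)
for first, children in tree.items():
    print(first)
    for second, docs in children.items():
        print("  " + second + ": " + ", ".join(docs))
```

Since a document can belong to more than one category, a fuller version would probably assign each document to several of its top-ranked pairs rather than only the first.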
