Wiki Parser

Wiki Parser: easily process all of Wikipedia into XML and plain text

Wiki Parser is a Wikipedia data extraction tool designed to make this enormous free encyclopedia more readily accessible for data mining. It processes Wikipedia database files that are updated every month and contain the current snapshot of all its articles, redirects and disambiguations. While many similar parsers exist, Wiki Parser (which is written entirely in C++) is in a class of its own when it comes to performance. It can parse the entire current English Wikipedia database (66 GB of data uncompressed as of July 2018) in about 2-3 hours, compared to many days or even weeks needed by other parsers.

Wikipedia articles are written in a fairly tricky markup language, MediaWiki wikitext, which isn’t particularly amenable to text analysis or data extraction. Wiki Parser solves this problem by translating Wikipedia pages into standard XML and plain text while doing its best to preserve structure and textual content. It fully processes each page’s section tree and recreates it in XML, preserves links to other pages and images as XML nodes, parses common templates, and handles quotes, image galleries, and infoboxes.

With its focus on content and structure extraction, Wiki Parser takes care of the first step in data mining Wikipedia pages. Sample Wiki Parser output, a sample parsed page in XML, and the XML schema used are provided.
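
To give a feel for the downstream workflow, here is a minimal Python sketch of how an XML output file might be consumed. The element names (page, title, section) and the file name parsed_wikipedia.xml are illustrative assumptions only; check them against the provided sample output and XML schema before use.

    # Stream pages from a (potentially very large) Wiki Parser XML output file.
    # NOTE: the element names below are assumptions for illustration; the
    # authoritative structure is defined by the published XML schema.
    import xml.etree.ElementTree as ET

    def iter_pages(xml_path):
        for event, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag == "page":                      # assumed element name
                title = elem.findtext("title", "")      # assumed child element
                sections = [s.text or "" for s in elem.iter("section")]
                yield title, sections
                elem.clear()                            # free memory as we go

    for title, sections in iter_pages("parsed_wikipedia.xml"):
        print(title, "-", len(sections), "sections")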

Wiki Parser is free and open source software (FOSS) distributed under the MIT License. The source code is available from the Wiki Parser GitHub repository. You are welcome to install and use the latest 64-bit Windows version:

Download Wiki Parser (Win64)

How to use

Using Wiki Parser is quite straightforward. Simply launch the application (see image below), tell it where the Wikipedia database file is located on your local filesystem (Step 1), indicate where to save the parsed data (Step 2), and press the Start button to begin the parse (Step 6). The full parse of English Wikipedia takes about 2-3 hours on a modern desktop machine with 6-8 processor cores.

After the parse completes, you can click the See files in folder button to navigate to the directory where the parse result files were saved. This button only appears after the parse has finished.

If you’d like to modify the default parsing options, you can do so in Step 3. One important option is the Test run setting, which lets you test the parse quickly on a small sample (just 100 articles are parsed). In addition, in Step 4 you can verify that you have enough disk space to run the parse, and in Step 5 you can specify how many processor cores to use during the parse.

Wiki Parser at a glance

  • Processes articles from Wikipedia database files (written in MediaWiki markup) into XML and plain text
  • Preserves as much document structure in XML as possible
  • Parses common templates, quotes, image galleries, redirects etc.
  • Parses infoboxes
  • Removes non-content sections such as References, Bibliography, etc.
  • Can discard disambiguations and list pages, if needed
  • Convenient graphical interface
  • Incredibly fast: 2 to 3 hours for a full parse of English Wikipedia on a modern processor
  • Designed primarily for English Wikipedia
  • Installer available for 64-bit Windows Vista and later

Downloading Wikipedia database files

Before using Wiki Parser, you need to download a Wikipedia database file onto your computer. Wikipedia updates these database files every month or so. One way to download them is through a BitTorrent client such as µTorrent. The torrents for the English Wikipedia files are located at:

http://meta.wikimedia.org/wiki/Data_dump_torrents#enwiki

You will need the file whose name ends in pages-articles.xml.bz2, meaning it contains the current revisions of all articles but no page histories, talk pages, or other extraneous content. Pick the file with the latest date on it. These files are quite large: about 14 GB as of May 2018.

The torrents, however, may not have the latest database dump. As an alternative, you can download the pages-articles.xml.bz2 file directly from Wikipedia. Wikipedia’s repositories always contain the latest versions. A general description of the various ways to download Wikipedia databases is available here.
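
For those who prefer scripting the download, the following is a minimal Python sketch using only the standard library. The URL follows the usual dumps.wikimedia.org naming pattern for the latest English dump, but verify it against the current dump listing before relying on it.

    # Download the latest English pages-articles dump (~14 GB) to the current
    # directory. urlretrieve streams the file to disk rather than loading it
    # into memory. The URL pattern is the typical dumps.wikimedia.org layout;
    # double-check it against the live dump listing.
    import urllib.request

    DUMP_URL = ("https://dumps.wikimedia.org/enwiki/latest/"
                "enwiki-latest-pages-articles.xml.bz2")

    urllib.request.urlretrieve(DUMP_URL, "enwiki-latest-pages-articles.xml.bz2")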

Important: once you have downloaded the file, there is no need to decompress it. Simply leave it on disk as it was downloaded. Wiki Parser can process these bz2-compressed XML files directly.
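
If you want to sanity-check the download without unpacking it, a quick peek at the start of the compressed file is enough; Wikipedia dumps normally begin with a <mediawiki ...> root element. The sketch below uses only the Python standard library and assumes the file name from the example above.

    # Read a small chunk from the start of the bz2-compressed dump without
    # decompressing the whole file.
    import bz2

    with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rt",
                  encoding="utf-8", errors="replace") as f:
        head = f.read(200)

    print(head)
    print("Looks like a MediaWiki dump:", "<mediawiki" in head)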

What gets parsed

It’s important to understand that the output produced by Wiki Parser may differ from what is shown on Wikipedia web pages. The main focus during Wiki Parser development was the extraction of textual content and page structure (the section tree) for further use in data mining.

For example, Wiki Parser currently omits tables in Wikipedia pages as they are almost impossible to present in textual format. It also flattens multi-level lists (but keeps every list element in its own XML node).

Sometimes Wiki Parser may encounter MediaWiki formatting that it does not understand, or templates that it does not recognize. In those cases, it skips that part of the content. If you have strict requirements for output quality, please download a sample of Wiki Parser output to verify that it is acceptable.

System requirements

Wiki Parser needs the following minimum system specifications to install and run:

  • 64-bit Windows Vista or later
  • 4 GB physical memory (RAM)
  • A modern processor with 4 or more cores (recommended)
  • About 50 GB of disk space to save the XML and plain text output (as of May 2018)