shilad edited this page Aug 11, 2013 · 1 revision
  • Universal parameters for all scripts:
[-c conf] [-l languages...] [-h threads]
-c      Sets configuration file to the specified path
-l      Selects the languages to retrieve from Wikimedia, given as comma-separated language codes
-h      Sets the maximum number of processors to use for parallel language processing
  • Run the requestedlinkgetter.sh file. The parameters should be formatted to match the following:
[-o outputpath] [-n names...] [-d date]
-o      Sets the path to output the tsv file containing all the links
-f      Selects types of dump files to retrieve, separated by commas
-y      Sets the date to retrieve from. Files are retrieved from on or before this date
  • Run the filedownloader.sh file. The parameters should be formatted to match the following:
[-o outputpath] [-t tsvpath]
-o      Sets the directory in which to output the downloaded dumps
-t      Selects the tsv file from which to read the download links
  • Run the dumploader.sh file. The parameters should be formatted to match the following:
[file ...]
file    Selects the dump files to load
  • Run the redirectloader.sh file. The parameters should be formatted to match the following:
[-d]
-d      Drops and recreates all tables and indexes
  • Run the wikitextdumploader.sh file. The parameters should be formatted to match the following:
[-d]
-d      Drops and recreates all tables and indexes
  • Run the conceptmapper.sh file. The parameters should be formatted to match the following:
[-d] [-n algorithms]
-d      Drops and recreates all tables and indexes
-n      Selects the algorithms to use to map concepts
  • Run the universallinkloader.sh file. The parameters should be formatted to match the following:
[-d] [-n algorithms]
-d      Drops and recreates all tables and indexes
-n      Selects the algorithms to use to map concepts
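
The required scripts above run in the order listed, so the sequence can be sketched as a small driver. This is only an illustration, not part of the project: it assumes the scripts sit in the current directory and that each one accepts the universal parameters, as stated above. The configuration file name, language codes, processor limit, paths, and the `run_step` and `run_pipeline` helpers are all hypothetical placeholders. It defaults to a dry run that prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch of a driver for the required scripts, in the order listed above.
# Every value below is a placeholder, not a project default.
DRY_RUN="${DRY_RUN:-1}"                 # 1 = print commands only, 0 = execute them
CONF="${CONF:-my.conf}"                 # hypothetical configuration file
LANGS="${LANGS:-simple,la}"             # example language codes
THREADS="${THREADS:-4}"                 # example processor limit
LINKS_TSV="${LINKS_TSV:-links.tsv}"     # hypothetical tsv file of dump links
DUMP_DIR="${DUMP_DIR:-dumps}"           # hypothetical download directory
DUMP_FILES="${DUMP_FILES:-dumps/example-dump.xml.bz2}"  # hypothetical dump file(s)

# Run one script with the universal parameters followed by its own arguments.
run_step() {
    script="$1"
    shift
    if [ "$DRY_RUN" = "1" ]; then
        echo "./$script -c $CONF -l $LANGS -h $THREADS $*"
    else
        # Stop at the first failing step; later steps depend on earlier ones.
        "./$script" -c "$CONF" -l "$LANGS" -h "$THREADS" "$@" || exit 1
    fi
}

run_pipeline() {
    run_step requestedlinkgetter.sh -o "$LINKS_TSV"
    run_step filedownloader.sh -o "$DUMP_DIR" -t "$LINKS_TSV"
    # DUMP_FILES is unquoted on purpose so that several space-separated
    # paths are passed as separate file arguments.
    run_step dumploader.sh $DUMP_FILES
    run_step redirectloader.sh -d
    run_step wikitextdumploader.sh -d
    run_step conceptmapper.sh -d
    run_step universallinkloader.sh -d
}

run_pipeline
```

Setting `DRY_RUN=0` executes the steps for real. The `-d` flag is passed here on the assumption of a fresh load; omit it to keep existing tables and indexes.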

Optional scripts:

  • Run the phraseloader.sh file. The parameters should be formatted to match the following:
[-n analyzer]
-p      Selects the phrase analyzer to use
  • Run the luceneloader.sh file. The parameters should be formatted to match the following:
[-d] [-n namespace...] [-i index...]
-d      Drops and recreates all Lucene indexes
-p      Specifies the namespaces to index
-i      Selects the types of indexes to use, as described by the configuration file

A Basic Outline of the Process

  • Download dump files
    • Obtain dump links
    • Download files from those links
  • Load the Dump as XML
    • Convert Dump into RawPages
    • Convert RawPages into LocalPages
      • Mark Redirects to be dealt with after this process
  • Resolve Redirects
    • Load into Redirect Table, fully resolved
  • WikiTextParser does the following:
    • Load links into a table with source and destination IDs
    • Load categories, with the source article as a category member
  • Load Concepts
  • Load Concept Links

Optional:

  • Load Phrases Database
  • Load Lucene Database
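
To illustrate what "fully resolved" means in the Resolve Redirects step, here is a minimal sketch. It is not the project's implementation. It assumes a hypothetical input of tab-separated `source_id` and `dest_id` pairs, and the function name `resolve_redirects` and the hop limit of 10 are invented for the example. Each redirect is followed through any chain until it reaches a page that is not itself a redirect. Chains that never terminate, such as cycles, are dropped.

```shell
#!/bin/sh
# Read "source_id<TAB>dest_id" pairs on stdin and print each source with
# its final, non-redirect destination. Illustrative sketch only.
resolve_redirects() {
    awk -F '\t' '
        { dest[$1] = $2 }                      # remember every direct redirect
        END {
            for (src in dest) {
                target = dest[src]
                hops = 0
                # Follow the chain while the target is itself a redirect.
                # The hop limit (an assumption) guards against cycles.
                while ((target in dest) && hops < 10) {
                    target = dest[target]
                    hops++
                }
                # Emit only chains that ended on a real page.
                if (!(target in dest)) print src "\t" target
            }
        }
    ' | sort -n
}

# Example: page 10 redirects to 20, which redirects to 30.
printf '10\t20\n20\t30\n' | resolve_redirects
# prints 10<TAB>30 and 20<TAB>30, one pair per line
```

Storing the final destination for every source means later lookups need a single step rather than a walk along the chain.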
