OldStuff

Universal parameters for all scripts:

[-c conf] [-l languages...] [-h threads]

-c      Sets configuration file to the specified path
-l      Selects languages by language code to retrieve from Wikimedia, separated by commas
-h      Sets the maximum number of processors to use for parallel language processing
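
As an illustration, the universal parameters can be set once and appended to every script invocation. This is a minimal sketch that only prints the resulting command line rather than running anything. The configuration path, language codes, and thread count are placeholder values, and `show_command` is a hypothetical helper, not part of the project.

```shell
#!/bin/sh
# Placeholder values (assumptions): adjust to your own setup.
CONF="conf/custom.conf"   # -c: configuration file path
LANGS="en,fr"             # -l: comma-separated language codes
THREADS=4                 # -h: maximum number of processors

# Hypothetical helper: print a script's command line with the universal
# parameters first, followed by any script-specific parameters.
show_command() {
    script="$1"
    shift
    echo "./$script" -c "$CONF" -l "$LANGS" -h "$THREADS" "$@"
}

# Dry run: shows what would be executed for redirectloader.sh with -d.
show_command redirectloader.sh -d
```

Dropping the `echo` inside `show_command` would turn the dry run into a real invocation.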

Run the requestedlinkgetter.sh file. The parameters should be formatted to match the following:

[-o outputpath] [-n names...] [-d date]

-o      Sets the path to output the tsv file containing all the links
-f      Selects types of dump files to retrieve, separated by commas
-y      Sets the date to retrieve from. Files are retrieved from on or before this date

Run the filedownloader.sh file. The parameters should be formatted to match the following:

[-o outputpath] [-t tsvpath]

-o      Sets the directory in which to output the downloaded dumps
-t      Selects the tsv file from which to read the download links
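
The two download scripts are meant to be chained: the tsv file written by requestedlinkgetter.sh (-o) is the file that filedownloader.sh reads (-t). The sketch below prints the two commands in order without running them. Both paths are hypothetical examples, and `show_download_steps` is a helper introduced only for this illustration.

```shell
#!/bin/sh
# Hypothetical paths (assumptions): not defaults of the project.
LINKS_TSV="download/links.tsv"   # tsv written by requestedlinkgetter.sh (-o)
DUMP_DIR="download/dumps"        # directory filled by filedownloader.sh (-o)

# Print the two download commands in the order they would be run.
# The same tsv path is the output of step one and the input of step two.
show_download_steps() {
    echo "./requestedlinkgetter.sh -o $LINKS_TSV"
    echo "./filedownloader.sh -t $LINKS_TSV -o $DUMP_DIR"
}

show_download_steps
```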

Run the dumploader.sh file. The parameters should be formatted to match the following:

[file ...]

file    Selects the dump files to load

Run the redirectloader.sh file. The parameters should be formatted to match the following:

[-d]

-d      Drops and recreates all tables and indexes

Run the wikitextdumploader.sh file. The parameters should be formatted to match the following:

[-d]

-d      Drops and recreates all tables and indexes

Run the conceptmapper.sh file. The parameters should be formatted to match the following:

[-d] [-n algorithms]

-d      Drops and recreates all tables and indexes
-n      Selects the algorithms to use to map concepts

Run the universallinkloader.sh file. The parameters should be formatted to match the following:

[-d] [-n algorithms]

-d      Drops and recreates all tables and indexes
-n      Selects the algorithms to use to map concepts

Optional scripts:

Run the phraseloader.sh file. The parameters should be formatted to match the following:

[-n analyzer]

-p      Selects the phrase analyzer to use

Run the luceneloader.sh file. The parameters should be formatted to match the following:

[-d] [-n namespace...] [-i index...]

-d      Drops and recreates all Lucene indexes
-p      Specifies the namespaces to index
-i      Selects the types of indexes to use, as described by the configuration file

A Basic Outline of the Process

Download dump files
- Obtain dump links
- Download files from those links
Load the Dump as XML
- Convert Dump into RawPages
- Convert RawPages into LocalPages
  - Mark Redirects to be dealt with after this process
Resolve Redirects
- Load into Redirect Table, fully resolved
Parse WikiText (WikiTextParser)
- Load links into a table with source and destination IDs
- Load categories, with the source article as a category member
Load Concepts
Load Concept Links

Optional:

Load Phrases Database
Load Lucene Database
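
The outline above can be read as a fixed script order. The sketch below prints that order as a numbered plan without executing anything. The mapping of outline steps to scripts is an assumption based on the order in which the scripts are documented on this page, and `print_plan` is a hypothetical helper. Each script still needs its own parameters when actually run.

```shell
#!/bin/sh
# Scripts in the order they are documented on this page (assumed to match
# the outline); the optional loaders come last.
REQUIRED_STEPS="requestedlinkgetter.sh filedownloader.sh dumploader.sh redirectloader.sh wikitextdumploader.sh conceptmapper.sh universallinkloader.sh"
OPTIONAL_STEPS="phraseloader.sh luceneloader.sh"

# Hypothetical helper: print a numbered plan.
# Pass "with-optional" to include the optional loaders.
print_plan() {
    steps="$REQUIRED_STEPS"
    if [ "$1" = "with-optional" ]; then
        steps="$steps $OPTIONAL_STEPS"
    fi
    n=1
    for script in $steps; do
        echo "$n. ./$script"
        n=$((n + 1))
    done
}

print_plan with-optional
```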
