
Support indexing ClueWeb22 B #3006

Open
rankun203 wants to merge 2 commits into castorini:master from rankun203:master

Conversation

@rankun203

This PR adds a new ClueWeb22Collection class and a companion test file for indexing the ClueWeb22 dataset.

Note:

  1. I don't currently have access to ClueWeb22 A, so ClueWeb22Collection has not been tested with category A yet. It has been tested with the txt format of category B.
  2. ClueWeb22 B txt stores its content in json.gz files rather than WARC files.

Sample file (unzipped):

{"URL": "https://en.wikipedia.org/wiki/Missing_Links_Volume_Three\n", "URL-hash": "DBFC4640E6E9A0F1134A9FC27EBE00A6", "Language": "en", "ClueWeb22-ID": "clueweb22-en0000-30-00000", "Clean-Text": "Missing links volume three"}
{"URL": "https://www.amazon.com/Kindle-books-your-iPhone-iPad-ebook/dp/B089K3WDJ1\n", "URL-hash": "6597F5FA4E2237615D07C881C61B28B5", "Language": "en", "ClueWeb22-ID": "clueweb22-en0000-30-00001", "Clean-Text": "Amazon.com: how to buy kindle ebook"}

