TPC data-general-text-code-web group

A proposed initial goal (see also slides from August hackathon)

Assemble a large, de-duplicated collection of scientific articles for use in LLM training. For each article, provide the extracted text (initially; later also figures, entities, citation linkages, etc.) together with descriptive metadata.

Document the contents of this collection in a database that records, for each article, something like: (DOI-if-available, metadata, source, location(s)-on-tpc-storage, info about duplicates).
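As a rough illustration of what such a catalog could look like, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and the `create_catalog` / `add_article` helpers are hypothetical and chosen only to mirror the fields listed above; the group has not settled on a schema or a database engine.

```python
import json
import sqlite3


def create_catalog(conn):
    """Create the (hypothetical) catalog tables if they do not exist yet."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS articles (
            id           INTEGER PRIMARY KEY,
            doi          TEXT,              -- NULL when the source records no DOI
            metadata     TEXT NOT NULL,     -- free-form metadata stored as JSON
            source       TEXT NOT NULL,     -- e.g. 'arXiv', 'OSTI', 'PubMed'
            duplicate_of INTEGER REFERENCES articles(id)  -- NULL if not a known duplicate
        );
        CREATE TABLE IF NOT EXISTS locations (
            article_id INTEGER NOT NULL REFERENCES articles(id),
            path       TEXT NOT NULL,       -- location on TPC storage
            PRIMARY KEY (article_id, path)
        );
        """
    )


def add_article(conn, doi, metadata, source, paths, duplicate_of=None):
    """Insert one article and its storage locations; return the new row id."""
    cur = conn.execute(
        "INSERT INTO articles (doi, metadata, source, duplicate_of) VALUES (?, ?, ?, ?)",
        (doi, json.dumps(metadata), source, duplicate_of),
    )
    article_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO locations (article_id, path) VALUES (?, ?)",
        [(article_id, p) for p in paths],
    )
    conn.commit()
    return article_id
```

An article found in two sources would then appear as two rows, with the second row's `duplicate_of` pointing at the first, so that information about duplicates is kept rather than discarded.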

Potential sources for articles include the Pile, arXiv, DOE OSTI, PubMed, bioRxiv, etc. Different sources may hold data in different formats and may contain duplicates of one another. Not all sources record DOIs or have accurate metadata. Many, but not all, of these datasets can be retrieved from public sources, but in general they cannot be redistributed.

Some potential first steps

To support this work, we are engaged in the following activities:

Development of a curated list of potential data sources with details on how to access each.
Assembly of articles from different sources. E.g., the Argonne group is assembling some datasets at ALCF.
Extraction of text from PDF articles, when not available in other formats. The following are some methods that can be considered.
- GROBID to produce XML, followed by simple extraction of text from the XML.
- PyPDF2 and PyMuPDF (https://pymupdf.readthedocs.io/en/latest/), which Andrew McNaughton has been using.
- Marker for extraction as markdown.
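For the first option, the "simple extraction of text from XML" step might look like the sketch below. It assumes the XML is TEI (the format GROBID emits, in the `http://www.tei-c.org/ns/1.0` namespace) and keeps only section headings and paragraphs from the document body. The function name `tei_to_text` and the sample document are illustrative, not code from this repository.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by GROBID output.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_to_text(tei_xml: str) -> str:
    """Return body headings and paragraphs as plain text, one block per element."""
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:text/tei:body", TEI_NS)
    if body is None:
        return ""
    chunks = []
    for elem in body.iter():
        tag = elem.tag.rsplit("}", 1)[-1]  # strip the '{namespace}' prefix
        if tag in ("head", "p"):
            # itertext() also collects text inside inline children such as <ref>.
            text = " ".join("".join(elem.itertext()).split())
            if text:
                chunks.append(text)
    return "\n\n".join(chunks)


# Tiny hand-written TEI document, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Ignored header title</title></titleStmt></fileDesc></teiHeader>
  <text><body><div>
    <head>Introduction</head>
    <p>Large corpora need <ref type="bibr">[1]</ref> careful cleaning.</p>
  </div></body></text>
</TEI>"""

print(tei_to_text(SAMPLE))
```

This deliberately ignores the header, figures, tables, and the bibliography; a real pipeline would likely want to handle those separately.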
Evaluation of similarity/de-duplication
- We are experimenting with MinHashLSH from the Python datasketch library, as used by the Pile team: see code assembled by Hong Zhi.
- See also the RefinedWeb pipeline, which uses MinHash but also a variety of other methods.
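To make the idea behind MinHash with locality-sensitive hashing concrete, here is a small from-scratch sketch using only the Python standard library. It is not the datasketch API and not the code referenced above; the shingle size, the number of hash functions, the band count, and all function names are assumptions picked for illustration.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 128  # number of hash functions in a signature (assumed value)
BANDS = 32      # LSH bands; each band covers NUM_PERM // BANDS signature rows


def shingles(text, k=5):
    """Set of k-word shingles from case-folded, whitespace-normalized text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text, num_perm=NUM_PERM):
    """MinHash signature: the minimum salted hash over all shingles, per seed."""
    sh = shingles(text)
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in sh
        ))
    return sig


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions, an estimate of shingle-set Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(docs, num_perm=NUM_PERM, bands=BANDS):
    """Pairs of document keys whose signatures agree on at least one full band."""
    rows = num_perm // bands
    buckets = defaultdict(list)
    for key, text in docs.items():
        sig = minhash_signature(text, num_perm)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(key)
    pairs = set()
    for keys in buckets.values():
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                pairs.add(tuple(sorted((keys[i], keys[j]))))
    return pairs
```

Candidate pairs would normally be confirmed with `estimate_jaccard` (or an exact comparison) before anything is marked as a duplicate, since banding trades some false positives for speed. At corpus scale a library implementation such as datasketch is the more practical choice.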

