TPC-AI/data-general-text-code-web

TPC data-general-text-code-web group

Image showing a lot of books

A proposed initial goal (see also slides from August hackathon)

Assemble a large, de-duplicated collection of scientific articles for use in LLM training. Each article should come with its extracted text (initially; later also figures, entities, citation linkages, etc.) and descriptive metadata.

Document the contents of this collection in a database that records, for each article, something like: (DOI if available, metadata, source, location(s) on TPC storage, information about duplicates).
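
As a rough illustration of what such a database could look like, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `articles`, the column names, the `duplicate_of` pointer, and the helper functions are all assumptions made for this example; the group has not fixed a schema.

```python
import json
import sqlite3


def create_catalog(conn):
    """Create the (hypothetical) catalog table if it does not exist yet."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS articles (
            id           INTEGER PRIMARY KEY,
            doi          TEXT,           -- NULL when the source records no DOI
            metadata     TEXT NOT NULL,  -- descriptive metadata, JSON-encoded
            source       TEXT NOT NULL,  -- name of the source the copy came from
            locations    TEXT NOT NULL,  -- JSON list of paths on TPC storage
            duplicate_of INTEGER REFERENCES articles(id)  -- NULL if not a known duplicate
        )
        """
    )


def add_article(conn, doi, metadata, source, locations, duplicate_of=None):
    """Insert one record and return its row id."""
    cur = conn.execute(
        "INSERT INTO articles (doi, metadata, source, locations, duplicate_of) "
        "VALUES (?, ?, ?, ?, ?)",
        (doi, json.dumps(metadata), source, json.dumps(locations), duplicate_of),
    )
    return cur.lastrowid


def find_by_doi(conn, doi):
    """Return (id, source, duplicate_of) for every record with this DOI."""
    return conn.execute(
        "SELECT id, source, duplicate_of FROM articles WHERE doi = ? ORDER BY id",
        (doi,),
    ).fetchall()
```

Storing `duplicate_of` as a pointer to another row is only one way to capture "info about duplicates"; a separate table of duplicate groups would work equally well if many copies of the same article are expected.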

Potential sources for articles include the Pile, arXiv, DOE OSTI, PubMed, bioRxiv, etc. Different sources may provide data in different formats and may contain duplicates. Not all sources record DOIs or have accurate metadata. Many, but not all, of these datasets can be retrieved from public sources, but generally they cannot be redistributed.

Some potential first steps

To support this work, we are engaged in the following activities:

  1. Development of a curated list of potential data sources with details on how to access each.
  2. Assembly of articles from different sources. E.g., the Argonne group is assembling some datasets at ALCF.
  3. Extraction of text from PDF articles, when not available in other formats. The following are some methods that can be considered.
  4. Evaluation of similarity/de-duplication.
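
One simple way to approach the similarity/de-duplication step is to compare the extracted text of articles by the overlap of their word shingles (Jaccard similarity). The sketch below is an illustration only, not the group's chosen method: the function names, the shingle size `k`, and the `threshold` value are assumptions for the example.

```python
import re


def shingles(text, k=5):
    """Return the set of k-word shingles of a text, ignoring case and punctuation."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Very short texts become a single shingle (or no shingle if empty).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.8, k=5):
    """Return (id_a, id_b, similarity) for every pair at or above the threshold.

    `docs` maps an article id to its extracted text.
    """
    ids = sorted(docs)
    sets = {doc_id: shingles(docs[doc_id], k) for doc_id in ids}
    pairs = []
    for x in range(len(ids)):
        for y in range(x + 1, len(ids)):
            score = jaccard(sets[ids[x]], sets[ids[y]])
            if score >= threshold:
                pairs.append((ids[x], ids[y], score))
    return pairs
```

This compares every pair of documents, so it is only practical for small samples, for example when evaluating how much two sources overlap. A collection of the size proposed here would need an approximate technique such as MinHash with locality-sensitive hashing.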
