Merged
4 changes: 2 additions & 2 deletions docs/source/document_dataset.mdx
@@ -1,6 +1,6 @@
# Create a document dataset

-This guide will show you how to create a document with `PdfFolder` and some metadata. This is a no-code solution for quickly creating a document with several thousand pdfs.
+This guide will show you how to create a document dataset with `PdfFolder` and some metadata. This is a no-code solution for quickly creating a document dataset with several thousand pdfs.

<Tip>

@@ -10,7 +10,7 @@ You can control access to your dataset by requiring users to share their contact

## PdfFolder

-The `PdfFolder` is a dataset builder designed to quickly load a document with several thousand pdfs without requiring you to write any code.
+The `PdfFolder` is a dataset builder designed to quickly load a document dataset with several thousand pdfs without requiring you to write any code.
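
As a rough illustration of the folder-plus-metadata layout described here, the sketch below builds a small directory using only the standard library. The `metadata.csv` file name, its `file_name` and `title` columns, the `my_documents` directory, and the `"pdffolder"` builder string in the commented-out call are assumptions that this diff does not show; the `.pdf` files are empty placeholders, not real documents.

```py
import csv
import tempfile
from pathlib import Path


def build_pdf_folder(root: Path, rows: list[dict]) -> Path:
    """Create `root`, one placeholder file per row, and a metadata.csv describing them."""
    root.mkdir(parents=True, exist_ok=True)
    for row in rows:
        # Placeholder only: a real dataset would store actual PDF documents here.
        (root / row["file_name"]).touch()
    metadata_path = root / "metadata.csv"
    with metadata_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return metadata_path


# Hypothetical rows: the file names and titles are made up for illustration.
rows = [
    {"file_name": "0001.pdf", "title": "Hypothetical invoice"},
    {"file_name": "0002.pdf", "title": "Hypothetical report"},
]
root = Path(tempfile.mkdtemp()) / "my_documents"
metadata_path = build_pdf_folder(root, rows)

# Assumed loading step (not run here; it needs `datasets` and real PDFs):
# from datasets import load_dataset
# dataset = load_dataset("pdffolder", data_dir=str(root))
```

The helper only prepares files on disk, so the "no-code" loading step stays a single call once real PDFs replace the placeholders.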

<Tip>

8 changes: 4 additions & 4 deletions docs/source/document_load.mdx
@@ -14,7 +14,7 @@ To work with pdf datasets, you need to have the `pdfplumber` package installed.

</Tip>

-When you load an pdf dataset and call the pdf column, the pdfs are decoded as `pdfplumber` Pdfs:
+When you load a pdf dataset and call the pdf column, the pdfs are decoded as `pdfplumber` Pdfs:

```py
>>> from datasets import load_dataset, Pdf
@@ -26,15 +26,15 @@ When you load an pdf dataset and call the pdf column, the pdfs are decoded as `p

<Tip warning={true}>

-Index into an pdf dataset using the row index first and then the `pdf` column - `dataset[0]["pdf"]` - to avoid creating all the pdf objects in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.
+Index into a pdf dataset using the row index first and then the `pdf` column - `dataset[0]["pdf"]` - to avoid creating all the pdf objects in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.

</Tip>

For a guide on how to load any type of dataset, take a look at the <a class="underline decoration-sky-400 decoration-2 font-semibold" href="./loading">general loading guide</a>.

## Read pages

-Access pages directly from a pdf using the `PDF` using `.pages`.
+Access pages directly from a pdf using the `.pages` attribute.

Then you can use the `pdfplumber` functions to read texts, tables and images, e.g.:

@@ -168,7 +168,7 @@ To ignore the information in the metadata file, set `drop_metadata=True` in [`lo

If you don't have a metadata file, `PdfFolder` automatically infers the label name from the directory name.
If you want to drop automatically created labels, set `drop_labels=True`.
-In this case, your dataset will only contain an pdf column:
+In this case, your dataset will only contain a pdf column:

```py
>>> from datasets import load_dataset
6 changes: 3 additions & 3 deletions docs/source/video_load.mdx
@@ -14,7 +14,7 @@ To work with video datasets, you need to have the `torchvision` and `av` package

</Tip>

-When you load an video dataset and call the video column, the videos are decoded as `torchvision` Videos:
+When you load a video dataset and call the video column, the videos are decoded as `torchvision` Videos:

```py
>>> from datasets import load_dataset, Video
@@ -26,7 +26,7 @@ When you load an video dataset and call the video column, the videos are decoded

<Tip warning={true}>

-Index into an video dataset using the row index first and then the `video` column - `dataset[0]["video"]` - to avoid creating all the video objects in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.
+Index into a video dataset using the row index first and then the `video` column - `dataset[0]["video"]` - to avoid creating all the video objects in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.

</Tip>

@@ -136,7 +136,7 @@ To ignore the information in the metadata file, set `drop_metadata=True` in [`lo

If you don't have a metadata file, `VideoFolder` automatically infers the label name from the directory name.
If you want to drop automatically created labels, set `drop_labels=True`.
-In this case, your dataset will only contain an video column:
+In this case, your dataset will only contain a video column:

```py
>>> from datasets import load_dataset