Merged
70 changes: 37 additions & 33 deletions example/extract/extract_html.ipynb
@@ -67,13 +67,13 @@
{
"data": {
"text/plain": [
"{'extract': ['ExtractImageFlow',\n",
"{'extract': ['ExtractHTMLFlow',\n",
" 'ExtractImageFlow',\n",
" 'ExtractIpynbFlow',\n",
" 'ExtractMarkdownFlow',\n",
" 'ExtractPDFFlow',\n",
" 'ExtractTxtFlow',\n",
" 'ExtractS3TxtFlow',\n",
" 'ExtractHTMLFlow'],\n",
" 'ExtractS3TxtFlow'],\n",
" 'transform': ['TransformAzureOpenAIFlow',\n",
" 'TransformCopyFlow',\n",
" 'TransformHuggingFaceFlow',\n",
@@ -116,7 +116,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
@@ -132,7 +132,7 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
@@ -141,21 +141,14 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" 0%| | 0/1 [00:00<?, ?it/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 1/1 [00:00<00:00, 4.53it/s]\n"
"100%|██████████| 1/1 [00:00<00:00, 10330.80it/s]\n"
]
}
],
@@ -174,40 +167,51 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['22.11. Information Theory — Dive into Deep Learning 1.0.3 documentation22.',\n",
"['22.11. Information Theory — Dive into Deep Learning 1.0.3 documentation',\n",
" 'Appendix: Mathematics for Deep Learning',\n",
" 'navigate_next',\n",
" 'Information Theory',\n",
" 'Quick search',\n",
" 'Show Source',\n",
" 'Preview Version',\n",
" 'Table Of Contents',\n",
" 'Installation',\n",
" '1. Introduction',\n",
" '2. Preliminaries',\n",
" '2.1. Data Manipulation',\n",
" '2.2. Data Preprocessing',\n",
" '2.3. Linear Algebra',\n",
" '2.4. Calculus',\n",
" '2.5. Automatic Differentiation',\n",
" '2.6. Probability and Statistics',\n",
" '3. Linear Neural Networks for Regression',\n",
" '3.1. Linear Regression',\n",
" '3.2. Object-Oriented Design for Implementation',\n",
" '3.3. Synthetic Regression Data',\n",
" '3.4. Linear Regression Implementation from Scratch',\n",
" '3.5. Concise Implementation of Linear Regression',\n",
" '4. Linear Neural Networks for Classification',\n",
" '4.1. Softmax Regression',\n",
" '4.2. The Image Classification Dataset',\n",
" '4.3. The Base Classification Model',\n",
" '4.4. Softmax Regression Implementation from Scratch',\n",
" '4.5. Concise Implementation of Softmax Regression',\n",
" '4.6. Generalization in Classification',\n",
" '4.7. Environment and Distribution Shift']\n"
" '2.7. Documentation']\n"
]
}
],
"source": [
"text = output[0]['output'][0]['text'][0]\n",
"text = [p for p in text.split(\"\\n\") if len(p) > 20]\n",
"pprint.pprint(text[:20])"
"text = output[0]['output'][0]['text'][0:30]\n",
"text = [p for p in text if len(p) > 10]\n",
"pprint.pprint(text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## End of the notebook\n",
"\n",
"Check more Uniflow use cases in the [example folder](https://github.com/CambioML/uniflow/tree/main/example/model#examples)!\n",
"\n",
"<a href=\"https://www.cambioml.com/\" title=\"Title\">\n",
" <img src=\"../image/cambioml_logo_large.png\" style=\"height: 100px; display: block; margin-left: auto; margin-right: auto;\"/>\n",
"</a>"
]
}
],
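The updated post-processing cell in this diff treats `output[0]['output'][0]['text']` as a list of already-split segments and filters them by length, instead of splitting a single newline-joined string. A minimal sketch of that filtering step, run on a mock output whose nested shape is assumed from the indexing used in the notebook (the segment strings here are sample values, not real extraction output):

```python
# Mock of the structure the notebook indexes into: a list of flow results,
# each with an 'output' list whose entries hold a 'text' list of segments.
# The real object comes from ExtractClient.run(data); this shape is inferred
# from the notebook's own indexing.
output = [
    {
        "output": [
            {
                "text": [
                    "22.11. Information Theory",
                    "navigate_next",
                    "Home",
                    "2.1. Data Manipulation",
                ]
            }
        ]
    }
]

# Take the first segments and keep only those longer than 10 characters,
# mirroring the `len(p) > 10` filter in the updated cell. Short navigation
# stubs like "Home" are dropped; longer headings survive.
text = output[0]["output"][0]["text"][0:30]
text = [p for p in text if len(p) > 10]
print(text)
```

Note that a length filter like this is a heuristic: it removes short navigation crumbs but also keeps longer boilerplate such as "navigate_next".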
265 changes: 265 additions & 0 deletions example/extract/extract_pdf_with_recursive_splitter.ipynb
@@ -0,0 +1,265 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Example of loading a PDF using the recursive splitter\n",
"\n",
"Recursive Splitter: splits text by recursively looking at characters.\n",
"It recursively tries different separator characters until it finds one that works."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Before running the code\n",
"\n",
"You will need the `uniflow` conda environment to run this notebook. You can set up the environment by following the instructions: https://github.com/CambioML/uniflow/tree/main#installation. Furthermore, make sure you have the following packages installed:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# pip3 install nougat-ocr"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load packages"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%reload_ext autoreload\n",
"%autoreload 2\n",
"\n",
"import sys\n",
"\n",
"sys.path.append(\".\")\n",
"sys.path.append(\"..\")\n",
"sys.path.append(\"../..\")"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/ubuntu/anaconda3/envs/uniflow/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
" from .autonotebook import tqdm as notebook_tqdm\n"
]
}
],
"source": [
"import os\n",
"import pandas as pd\n",
"import pprint\n",
"from uniflow.flow.client import ExtractClient, TransformClient\n",
"from uniflow.flow.config import TransformOpenAIConfig, ExtractPDFConfig\n",
"from uniflow.op.model.model_config import OpenAIModelConfig, NougatModelConfig\n",
"from uniflow.op.prompt import PromptTemplate, Context\n",
"from uniflow.op.extract.split.splitter_factory import SplitterOpsFactory\n",
"from uniflow.op.extract.split.constants import RECURSIVE_CHARACTER_SPLITTER"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Prepare the input data\n",
"\n",
"First, let's set the current directory and the path to the input PDF file."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"dir_cur = os.getcwd()\n",
"pdf_file = \"1408.5882_page-1.pdf\"\n",
"input_file = os.path.join(f\"{dir_cur}/data/raw_input/\", pdf_file)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### List all the available splitters\n",
"These are the different splitters we can use to post-process the loaded PDF."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['ParagraphSplitter', 'MarkdownHeaderSplitter', 'RecursiveCharacterSplitter']"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"SplitterOpsFactory.list()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Load the PDF using the recursive splitter"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/ubuntu/anaconda3/envs/uniflow/lib/python3.10/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3526.)\n",
" return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]\n"
]
}
],
"source": [
"data = [\n",
" {\"filename\": input_file},\n",
"]\n",
"\n",
"config = ExtractPDFConfig(\n",
" model_config=NougatModelConfig(\n",
" model_name = \"0.1.0-small\",\n",
" batch_size = 1  # When batch_size > 1, Nougat runs on CUDA; otherwise it runs on CPU\n",
" ),\n",
" splitter=RECURSIVE_CHARACTER_SPLITTER,\n",
")\n",
"nougat_client = ExtractClient(config)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" 0%| | 0/1 [00:00<?, ?it/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 1/1 [00:05<00:00, 5.07s/it]\n"
]
}
],
"source": [
"output = nougat_client.run(data)\n",
"contexts = output[0]['output'][0]['text']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Process the output\n",
"\n",
"Let's take a look at the generation output."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('chunk_0: # Convolutional Neural Networks for Sentence Classification Yoon '\n",
" 'KimNew York [email protected]###### AbstractWe report on a series of '\n",
" 'experiments with convolutional neural networks (CNN) traine...')\n",
"('chunk_1: Deep learning models have achieved remarkable results in computer '\n",
" 'vision [11] and speech recognition [1] in recent years. Within natural '\n",
" 'language processing, much of the work with deep learning method...')\n",
"('chunk_2: Convolutional neural networks (CNN) utilize layers with convolving '\n",
" 'filters that are applied to local features [1]. Originally invented for '\n",
" 'computer vision, CNN models have subsequently been shown to b...')\n",
"('chunk_3: In the present work, we train a simple CNN with one layer of '\n",
" 'convolution on top of word vectors obtained from an unsupervised neural '\n",
" 'language model. These vectors were trained by Mikolov et al. (2013)...')\n",
"('chunk_4: Our work is philosophically similar to Razavian et al. (2014) which '\n",
" 'showed that for image classification, feature extractors obtained from a '\n",
" 'pre-trained deep learning model perform well on a variety o...')\n"
]
}
],
"source": [
"for i, _s in enumerate(contexts):\n",
" pprint.pprint(f\"chunk_{i}: {_s[:200]}...\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## End of the notebook\n",
"\n",
"Check more Uniflow use cases in the [example folder](https://github.com/CambioML/uniflow/tree/main/example/model#examples)!\n",
"\n",
"<a href=\"https://www.cambioml.com/\" title=\"Title\">\n",
" <img src=\"../image/cambioml_logo_large.png\" style=\"height: 100px; display: block; margin-left: auto; margin-right: auto;\"/>\n",
"</a>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "uniflow",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
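The new notebook selects `RECURSIVE_CHARACTER_SPLITTER` from `SplitterOpsFactory` but does not show the algorithm itself. A minimal sketch of the general recursive-character idea it names: try a ranked list of separators and, when a piece is still too long, recurse with the next separator. The separator list, `chunk_size`, and function name below are illustrative assumptions, not uniflow's actual implementation:

```python
def recursive_split(text, separators=("\n\n", "\n", " ", ""), chunk_size=200):
    """Split `text` into chunks of at most `chunk_size` characters by
    trying separators in order and recursing on oversize pieces."""
    if len(text) <= chunk_size:
        return [text]
    sep, *rest = separators
    if sep == "":
        # Last resort: no separator left, hard-cut into fixed-size windows.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            # Piece is still too long: retry with the next separator.
            chunks.extend(recursive_split(piece, tuple(rest) or ("",), chunk_size))
    return [c for c in chunks if c]
```

Production splitters typically also support chunk overlap and custom length functions; this sketch keeps only the core recursion that gives the splitter its name.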