Merged
70 changes: 37 additions & 33 deletions example/extract/extract_html.ipynb
@@ -67,13 +67,13 @@
{
"data": {
"text/plain": [
"{'extract': ['ExtractImageFlow',\n",
"{'extract': ['ExtractHTMLFlow',\n",
" 'ExtractImageFlow',\n",
" 'ExtractIpynbFlow',\n",
" 'ExtractMarkdownFlow',\n",
" 'ExtractPDFFlow',\n",
" 'ExtractTxtFlow',\n",
" 'ExtractS3TxtFlow',\n",
" 'ExtractHTMLFlow'],\n",
" 'ExtractS3TxtFlow'],\n",
" 'transform': ['TransformAzureOpenAIFlow',\n",
" 'TransformCopyFlow',\n",
" 'TransformHuggingFaceFlow',\n",
@@ -116,7 +116,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
@@ -132,7 +132,7 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
@@ -141,21 +141,14 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" 0%| | 0/1 [00:00<?, ?it/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 1/1 [00:00<00:00, 4.53it/s]\n"
"100%|██████████| 1/1 [00:00<00:00, 10330.80it/s]\n"
]
}
],
@@ -174,40 +167,51 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['22.11. Information Theory — Dive into Deep Learning 1.0.3 documentation22.',\n",
"['22.11. Information Theory — Dive into Deep Learning 1.0.3 documentation',\n",
" 'Appendix: Mathematics for Deep Learning',\n",
" 'navigate_next',\n",
" 'Information Theory',\n",
" 'Quick search',\n",
" 'Show Source',\n",
" 'Preview Version',\n",
" 'Table Of Contents',\n",
" 'Installation',\n",
" '1. Introduction',\n",
" '2. Preliminaries',\n",
" '2.1. Data Manipulation',\n",
" '2.2. Data Preprocessing',\n",
" '2.3. Linear Algebra',\n",
" '2.4. Calculus',\n",
" '2.5. Automatic Differentiation',\n",
" '2.6. Probability and Statistics',\n",
" '3. Linear Neural Networks for Regression',\n",
" '3.1. Linear Regression',\n",
" '3.2. Object-Oriented Design for Implementation',\n",
" '3.3. Synthetic Regression Data',\n",
" '3.4. Linear Regression Implementation from Scratch',\n",
" '3.5. Concise Implementation of Linear Regression',\n",
" '4. Linear Neural Networks for Classification',\n",
" '4.1. Softmax Regression',\n",
" '4.2. The Image Classification Dataset',\n",
" '4.3. The Base Classification Model',\n",
" '4.4. Softmax Regression Implementation from Scratch',\n",
" '4.5. Concise Implementation of Softmax Regression',\n",
" '4.6. Generalization in Classification',\n",
" '4.7. Environment and Distribution Shift']\n"
" '2.7. Documentation']\n"
]
}
],
"source": [
"text = output[0]['output'][0]['text'][0]\n",
"text = [p for p in text.split(\"\\n\") if len(p) > 20]\n",
"pprint.pprint(text[:20])"
"text = output[0]['output'][0]['text'][0:30]\n",
"text = [p for p in text if len(p) > 10]\n",
"pprint.pprint(text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## End of the notebook\n",
"\n",
"Check more Uniflow use cases in the [example folder](https://github.com/CambioML/uniflow/tree/main/example/model#examples)!\n",
"\n",
"<a href=\"https://www.cambioml.com/\" title=\"Title\">\n",
" <img src=\"../image/cambioml_logo_large.png\" style=\"height: 100px; display: block; margin-left: auto; margin-right: auto;\"/>\n",
"</a>"
]
}
],
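The updated post-processing cell in this diff treats `output[0]['output'][0]['text']` as a list of already-split segments and filters them by length, instead of splitting a single newline-joined string. A minimal sketch of that filtering step, run on a mock output whose nested shape is assumed from the indexing used in the notebook (the segment strings here are sample values, not real extraction output):

```python
# Mock of the structure the notebook indexes into: a list of flow results,
# each with an 'output' list whose entries hold a 'text' list of segments.
# The real object comes from ExtractClient.run(data); this shape is inferred
# from the notebook's own indexing.
output = [
    {
        "output": [
            {
                "text": [
                    "22.11. Information Theory",
                    "navigate_next",
                    "Home",
                    "2.1. Data Manipulation",
                ]
            }
        ]
    }
]

# Take the first segments and keep only those longer than 10 characters,
# mirroring the `len(p) > 10` filter in the updated cell. Short navigation
# stubs like "Home" are dropped; longer headings survive.
text = output[0]["output"][0]["text"][0:30]
text = [p for p in text if len(p) > 10]
print(text)
```

Note that a length filter like this is a heuristic: it removes short navigation crumbs but also keeps longer boilerplate such as "navigate_next".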
265 changes: 265 additions & 0 deletions example/extract/extract_pdf_with_recursive_splitter.ipynb
@@ -0,0 +1,265 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Example of loading a PDF using the recursive splitter\n",
"\n",
"Recursive Splitter: splits text by recursively looking at characters.\n",
"It recursively tries different separator characters until it finds one that works."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Before running the code\n",
"\n",
"You will need the `uniflow` conda environment to run this notebook. You can set up the environment by following the instructions: https://github.com/CambioML/uniflow/tree/main#installation. Furthermore, make sure you have the following packages installed:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# pip3 install nougat-ocr"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load packages"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%reload_ext autoreload\n",
"%autoreload 2\n",
"\n",
"import sys\n",
"\n",
"sys.path.append(\".\")\n",
"sys.path.append(\"..\")\n",
"sys.path.append(\"../..\")"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/ubuntu/anaconda3/envs/uniflow/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
" from .autonotebook import tqdm as notebook_tqdm\n"
]
}
],
"source": [
"import os\n",
"import pandas as pd\n",
"import pprint\n",
"from uniflow.flow.client import ExtractClient, TransformClient\n",
"from uniflow.flow.config import TransformOpenAIConfig, ExtractPDFConfig\n",
"from uniflow.op.model.model_config import OpenAIModelConfig, NougatModelConfig\n",
"from uniflow.op.prompt import PromptTemplate, Context\n",
"from uniflow.op.extract.split.splitter_factory import SplitterOpsFactory\n",
"from uniflow.op.extract.split.constants import RECURSIVE_CHARACTER_SPLITTER"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Prepare the input data\n",
"\n",
"First, let's set the current directory and the path to the input PDF file."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"dir_cur = os.getcwd()\n",
"pdf_file = \"1408.5882_page-1.pdf\"\n",
"input_file = os.path.join(f\"{dir_cur}/data/raw_input/\", pdf_file)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### List all the available splitters\n",
"These are the different splitters we can use to post-process the loaded PDF."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['ParagraphSplitter', 'MarkdownHeaderSplitter', 'RecursiveCharacterSplitter']"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"SplitterOpsFactory.list()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Load the PDF using the recursive splitter"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/ubuntu/anaconda3/envs/uniflow/lib/python3.10/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3526.)\n",
" return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]\n"
]
}
],
"source": [
"data = [\n",
" {\"filename\": input_file},\n",
"]\n",
"\n",
"config = ExtractPDFConfig(\n",
" model_config=NougatModelConfig(\n",
" model_name = \"0.1.0-small\",\n",
" batch_size = 1  # When batch_size > 1, Nougat runs on CUDA; otherwise it runs on CPU\n",
" ),\n",
" splitter=RECURSIVE_CHARACTER_SPLITTER,\n",
")\n",
"nougat_client = ExtractClient(config)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" 0%| | 0/1 [00:00<?, ?it/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 1/1 [00:05<00:00, 5.07s/it]\n"
]
}
],
"source": [
"output = nougat_client.run(data)\n",
"contexts = output[0]['output'][0]['text']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Process the output\n",
"\n",
"Let's take a look at the generation output."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('chunk_0: # Convolutional Neural Networks for Sentence Classification Yoon '\n",
" 'KimNew York [email protected]###### AbstractWe report on a series of '\n",
" 'experiments with convolutional neural networks (CNN) traine...')\n",
"('chunk_1: Deep learning models have achieved remarkable results in computer '\n",
" 'vision [11] and speech recognition [1] in recent years. Within natural '\n",
" 'language processing, much of the work with deep learning method...')\n",
"('chunk_2: Convolutional neural networks (CNN) utilize layers with convolving '\n",
" 'filters that are applied to local features [1]. Originally invented for '\n",
" 'computer vision, CNN models have subsequently been shown to b...')\n",
"('chunk_3: In the present work, we train a simple CNN with one layer of '\n",
" 'convolution on top of word vectors obtained from an unsupervised neural '\n",
" 'language model. These vectors were trained by Mikolov et al. (2013)...')\n",
"('chunk_4: Our work is philosophically similar to Razavian et al. (2014) which '\n",
" 'showed that for image classification, feature extractors obtained from a '\n",
" 'pre-trained deep learning model perform well on a variety o...')\n"
]
}
],
"source": [
"for i, _s in enumerate(contexts):\n",
" pprint.pprint(f\"chunk_{i}: {_s[:200]}...\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## End of the notebook\n",
"\n",
"Check more Uniflow use cases in the [example folder](https://github.com/CambioML/uniflow/tree/main/example/model#examples)!\n",
"\n",
"<a href=\"https://www.cambioml.com/\" title=\"Title\">\n",
" <img src=\"../image/cambioml_logo_large.png\" style=\"height: 100px; display: block; margin-left: auto; margin-right: auto;\"/>\n",
"</a>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "uniflow",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
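The new notebook selects `RECURSIVE_CHARACTER_SPLITTER` from `SplitterOpsFactory` but does not show the algorithm itself. A minimal sketch of the general recursive-character idea it names: try a ranked list of separators and, when a piece is still too long, recurse with the next separator. The separator list, `chunk_size`, and function name below are illustrative assumptions, not uniflow's actual implementation:

```python
def recursive_split(text, separators=("\n\n", "\n", " ", ""), chunk_size=200):
    """Split `text` into chunks of at most `chunk_size` characters by
    trying separators in order and recursing on oversize pieces."""
    if len(text) <= chunk_size:
        return [text]
    sep, *rest = separators
    if sep == "":
        # Last resort: no separator left, hard-cut into fixed-size windows.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            # Piece is still too long: retry with the next separator.
            chunks.extend(recursive_split(piece, tuple(rest) or ("",), chunk_size))
    return [c for c in chunks if c]
```

Production splitters typically also support chunk overlap and custom length functions; this sketch keeps only the core recursion that gives the splitter its name.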