Commit 7fc56b0

[skip test] Adding demo notebook for Reader2Image
1 parent e71c33f commit 7fc56b0

1 file changed

+28
-54
lines changed

examples/python/data-preprocessing/SparkNLP_Reader2Image_Demo.ipynb

Lines changed: 28 additions & 54 deletions
Original file line number | Diff line number | Diff line change
@@ -17,33 +17,7 @@
1717
"source": [
1818
"# Introducing Reader2Image in SparkNLP\n",
1919
"\n",
20-
"This notebook showcases the newly added `Reader2Image` annotator in Spark NLP. It provides a streamlined and user-friendly interface for reading image files and integrating them with VLM annotators in Spark NLP. The annotator is useful for preprocessing data in NLP pipelines that rely on information contained within images. Currently, it supports HTML and Markdown files."
21-
]
22-
},
23-
{
24-
"cell_type": "markdown",
25-
"metadata": {},
26-
"source": [
27-
"## Setup and Initialization\n",
28-
"Let's keep in mind a few things before we start 😊\n",
29-
"\n",
30-
"Support for **Reader2Image** files was introduced in Spark NLP 6.1.3. Please make sure you have upgraded to the latest Spark NLP release."
31-
]
32-
},
33-
{
34-
"cell_type": "markdown",
35-
"metadata": {},
36-
"source": [
37-
"- Let's install and setup Spark NLP in Google Colab. This part is pretty easy via our simple script"
38-
]
39-
},
40-
{
41-
"cell_type": "code",
42-
"execution_count": null,
43-
"metadata": {},
44-
"outputs": [],
45-
"source": [
46-
"! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash"
20+
"This notebook showcases the newly added `Reader2Image` annotator in Spark NLP. It provides a streamlined and user-friendly interface for reading image files and integrating them with VLM annotators in Spark NLP. The annotator is useful for preprocessing data in NLP pipelines that rely on information contained within images."
4721
]
4822
},
4923
{
@@ -54,7 +28,7 @@
5428
"base_uri": "https://localhost:8080/"
5529
},
5630
"id": "xvycj4qAObCw",
57-
"outputId": "1b5e5472-5ed3-45c5-e1df-54e1f3b1c583"
31+
"outputId": "46be2c16-710c-4642-fda9-f3532d51dfb8"
5832
},
5933
"outputs": [
6034
{
@@ -68,7 +42,7 @@
6842
"source": [
6943
"import sparknlp\n",
7044
"\n",
71-
"# let's start Spark with Spark NLP\n",
45+
"# let's start Spark with Spark NLP with GPU enabled. If you don't have GPUs available remove this parameter.\n",
7246
"spark = sparknlp.start()\n",
7347
"print(sparknlp.version())\n",
7448
"\n",
@@ -86,14 +60,14 @@
8660
},
8761
{
8862
"cell_type": "code",
89-
"execution_count": 14,
63+
"execution_count": 5,
9064
"metadata": {
9165
"colab": {
9266
"base_uri": "https://localhost:8080/",
9367
"height": 129
9468
},
9569
"id": "6ZUkBA7rZ1lp",
96-
"outputId": "75d845e9-bcc6-4409-e55c-1ef3877a272e"
70+
"outputId": "9db16c69-c198-47cc-cd15-adf06625a1fe"
9771
},
9872
"outputs": [
9973
{
@@ -192,19 +166,19 @@
192166
"base_uri": "https://localhost:8080/"
193167
},
194168
"id": "pZwclDzKVVX_",
195-
"outputId": "cbbdabb9-c1bc-472a-b478-41ab8516cda8"
169+
"outputId": "18b3ee6a-281f-423e-ddf0-ddbea5332f85"
196170
},
197171
"outputs": [
198172
{
199173
"name": "stdout",
200174
"output_type": "stream",
201175
"text": [
202-
"+--------------------+--------------------+--------------------+-------------------+--------------------+\n",
203-
"| path| content| partition| fileName| image|\n",
204-
"+--------------------+--------------------+--------------------+-------------------+--------------------+\n",
205-
"|file:/content/exa...|\\n<!DOCTYPE html>...|[{Title, Test Ima...|example-images.html|[{image, example-...|\n",
206-
"|file:/content/exa...|\\n<!DOCTYPE html>...|[{Title, Test Ima...|example-images.html|[{image, example-...|\n",
207-
"+--------------------+--------------------+--------------------+-------------------+--------------------+\n",
176+
"+-------------------+--------------------+---------+\n",
177+
"| fileName| image|exception|\n",
178+
"+-------------------+--------------------+---------+\n",
179+
"|example-images.html|[{image, example-...| NULL|\n",
180+
"|example-images.html|[{image, example-...| NULL|\n",
181+
"+-------------------+--------------------+---------+\n",
208182
"\n"
209183
]
210184
}
@@ -262,19 +236,19 @@
262236
"base_uri": "https://localhost:8080/"
263237
},
264238
"id": "i7vMR6AHVt_w",
265-
"outputId": "fb88f19b-2e22-416e-b6de-91e30a351a7a"
239+
"outputId": "fa524ea7-12fe-4179-d770-1f442add9899"
266240
},
267241
"outputs": [
268242
{
269243
"name": "stdout",
270244
"output_type": "stream",
271245
"text": [
272-
"+--------------------+--------------------+--------------------+-------------------+--------------------+--------------------+\n",
273-
"| path| content| partition| fileName| image| text|\n",
274-
"+--------------------+--------------------+--------------------+-------------------+--------------------+--------------------+\n",
275-
"|file:/content/exa...|\\n<!DOCTYPE html>...|[{Title, Test Ima...|example-images.html|[{image, example-...|<|im_start|>syste...|\n",
276-
"|file:/content/exa...|\\n<!DOCTYPE html>...|[{Title, Test Ima...|example-images.html|[{image, example-...|<|im_start|>syste...|\n",
277-
"+--------------------+--------------------+--------------------+-------------------+--------------------+--------------------+\n",
246+
"+-------------------+--------------------+---------+--------------------+\n",
247+
"| fileName| image|exception| text|\n",
248+
"+-------------------+--------------------+---------+--------------------+\n",
249+
"|example-images.html|[{image, example-...| NULL|<|im_start|>syste...|\n",
250+
"|example-images.html|[{image, example-...| NULL|<|im_start|>syste...|\n",
251+
"+-------------------+--------------------+---------+--------------------+\n",
278252
"\n"
279253
]
280254
}
@@ -291,7 +265,7 @@
291265
"base_uri": "https://localhost:8080/"
292266
},
293267
"id": "ufF265kuV0-7",
294-
"outputId": "916d69fb-fcd3-42a5-b23c-ed940df4819f"
268+
"outputId": "fd06edf7-8718-425c-a23a-9a2986e6f315"
295269
},
296270
"outputs": [
297271
{
@@ -334,19 +308,19 @@
334308
"base_uri": "https://localhost:8080/"
335309
},
336310
"id": "XiAw_vbVWqlN",
337-
"outputId": "c1c84963-90b7-4252-efb4-1d0f61218e26"
311+
"outputId": "a7aa778f-b16b-47fd-ad56-d4d97c3f5f81"
338312
},
339313
"outputs": [
340314
{
341315
"name": "stdout",
342316
"output_type": "stream",
343317
"text": [
344-
"+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n",
345-
"|origin |result |\n",
346-
"+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n",
347-
"|[example-images.html]|[The image is a simple, solid-colored background with a gradient effect. The background is composed of of two primary colors: a bright yellow and a slightly darker yellow. The yellow on the left side of the image is brighter and more vivid, while the yellow on the right side is slightly muted and less intense. The gradient effect creates a smooth transition from one color to the other, giving the impression of a gradient background.] |\n",
348-
"|[example-images.html]|[The image depicts a stylized representation of an atom. The atom is composed of three main parts: the nucleus, the electron shell, and the atomic nucleus. The nucleus is represented by a central, circular shape, which is the core of the atom. The electron shell is depicted as a series of concentric circles, each representing a different energy level or shell of electrons. The outermost shell is the highest energy level, and the innermost shell is the lowest energy level. The electron shell is typically divided into two subshells, with one subshell containing one electron and the other containing two electrons. which are held together by a single bond.\\n\\nThe background of the image is a solid red color, which contrasts with the pink outline of the atom. The overall design is minimalistic and focuses on the essential components of an atom.]|\n",
349-
"+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n",
318+
"+---------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n",
319+
"|origin |result |\n",
320+
"+---------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n",
321+
"|[example-images.html]|[The image is a simple, solid-colored background with a gradient effect. The colors blend smoothly from a lighter yellow at the top to a darker yellow at the bottom. The gradient effect creates a subtle visual effect, giving the impression of a gradient background.] |\n",
322+
"|[example-images.html]|[The image depicts a stylized representation of an atom. The atom is composed of three main parts: the nucleus, which is the central core of the atom, and two electron shells, which are the outer shells around the nucleus. electron can orbit. The electron shells are depicted as concentric circles, with the nucleus at the center and the electron shells extending outward. The color scheme is primarily pink and red, with the nucleus being a lighter pink and the electron shells being a darker pink.]|\n",
323+
"+---------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n",
350324
"\n"
351325
]
352326
}
@@ -361,7 +335,7 @@
361335
"id": "mnsyx37VZlUm"
362336
},
363337
"source": [
364-
"Voilà! As you can see above, we have accurate descriptions of the images generated by Qwen2VLTransformer."
338+
"Voilà! As you can see above, we have accurate descriptions of the images generated by `Qwen2VLTransformer`."
365339
]
366340
}
367341
],
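
Review note: the updated comment in the Spark start cell mentions a GPU parameter, while the call shown in the diff is still `sparknlp.start()` with no arguments. A minimal sketch of how the comment and the call could be brought into line, assuming the intent is to use the `gpu` argument of `sparknlp.start`. The helper names `spark_start_kwargs` and `start_spark` are hypothetical and do not appear in the notebook.

```python
def spark_start_kwargs(use_gpu: bool) -> dict:
    """Build keyword arguments for sparknlp.start().

    Hypothetical helper for illustration; not part of Spark NLP or the notebook.
    """
    # Pass gpu=True only when a GPU is available; otherwise keep the CPU default.
    return {"gpu": True} if use_gpu else {}


def start_spark(use_gpu: bool = False):
    """Start a Spark session with Spark NLP, optionally with GPU support."""
    # Imported lazily so spark_start_kwargs works without Spark NLP installed.
    import sparknlp

    return sparknlp.start(**spark_start_kwargs(use_gpu))
```

In the notebook cell this would amount to either calling `sparknlp.start(gpu=True)` or keeping `sparknlp.start()` and dropping the GPU wording from the comment.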
