Environment:
llmware v0.3.8
macOS 15
active DB: SQLite
vector DB: ChromaDB
For illustration, the issue is reproduced with the example file slicing_and_dicing_office_docs.py and the Microsoft Investor Relations sample data; however, it was discovered initially on our private data, which is very OCR-heavy.

Issue:
Run lib.add_files() to ingest documents from which the C parser extracts images, pending downstream OCR.

Next, perform the OCR with llmware's "convenience" method on the images extracted to the image directory:

lib.run_ocr_on_images(add_to_library=True, other_params)
The result is a new collection written to the DB, one entry per image, each referencing the originating doc by 'doc_ID' (and so forth), with block_IDs starting at 100,000 and incrementing, and with the text chunks extracted by Tesseract OCR populating only 'text_search'.
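To make the observed record shape concrete, here is a minimal sketch in plain Python (not llmware internals; the field names 'doc_ID', 'block_ID', 'text_search', and 'text_block' are copied from the DB entries as observed, and the values are invented for illustration) of what one OCR-derived block looks like:

```python
# Illustrative sketch of an OCR-derived block entry as observed in the DB.
# Field names are copied from the collection entries; values are made up.

OCR_BLOCK_ID_START = 100_000  # OCR blocks start at 100,000 and increment

def make_ocr_block(doc_id, ocr_index, ocr_text):
    """Build a dict mirroring one OCR-derived entry in the new collection."""
    return {
        "doc_ID": doc_id,                          # references originating doc
        "block_ID": OCR_BLOCK_ID_START + ocr_index,
        "text_search": ocr_text,                   # Tesseract output lands here
        "text_block": "",                          # ...while this stays empty
    }

block = make_ocr_block(doc_id=7, ocr_index=0, ocr_text="Revenue grew 12% YoY")
print(block["block_ID"], repr(block["text_block"]))  # → 100000 ''
```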
Perform a new embedding with llmware's

lib.install_new_embedding(params)

Chunks/sentences for embedding are retrieved from 'text_search' and collated into batches.
so far so good
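The batch-collation step above can be sketched as follows; this is a plain-Python simulation of the observed behavior, not llmware's implementation. Because batches are built from 'text_search', the resulting embeddings are genuine even though 'text_block' is empty:

```python
# Simulation of batch collation for embedding (not llmware's actual code):
# chunks are pulled from 'text_search', so the OCR text does get embedded.

def collate_batches(blocks, batch_size):
    """Collect non-empty 'text_search' chunks into fixed-size batches."""
    chunks = [b["text_search"] for b in blocks if b["text_search"]]
    return [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]

blocks = [
    {"block_ID": 100_000, "text_search": "Revenue grew 12% YoY", "text_block": ""},
    {"block_ID": 100_001, "text_search": "Operating margin was 43%", "text_block": ""},
    {"block_ID": 100_002, "text_search": "Cloud revenue led growth", "text_block": ""},
]

batches = collate_batches(blocks, batch_size=2)
# → [['Revenue grew 12% YoY', 'Operating margin was 43%'],
#    ['Cloud revenue led growth']]
```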
At query time,

Query.query(query="a query highly pertaining to the corpus", query_type="semantic", other_params)

returns results where 'text' is empty! A little digging reveals that while the query text is indeed being compared to embedded chunks that are bona fide, the 'text' in the returned results is retrieved from 'text_block', which remains empty after OCR.
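The mismatch can be reproduced in miniature (again plain Python that only illustrates the observed behavior; the matching logic here is a keyword stand-in for vector search, not llmware internals): the semantic match is found against content embedded from 'text_search', but each result's 'text' is filled from 'text_block':

```python
# Miniature reproduction of the observed mismatch (not llmware internals):
# matching happens on content from 'text_search', but the result's 'text'
# is filled from 'text_block', which OCR left empty.

blocks = [
    {"block_ID": 100_000, "text_search": "Revenue grew 12% YoY", "text_block": ""},
    {"block_ID": 100_001, "text_search": "Cloud revenue led growth", "text_block": ""},
]

def semantic_query_sim(query, blocks):
    """Stand-in for vector search: match on the embedded 'text_search' content,
    then assemble result records the way the returned results look."""
    hits = [b for b in blocks
            if any(w in b["text_search"].lower() for w in query.lower().split())]
    # the observed bug: 'text' is sourced from 'text_block', not 'text_search'
    return [{"block_ID": b["block_ID"], "text": b["text_block"]} for b in hits]

results = semantic_query_sim("revenue", blocks)
print(results)  # every hit comes back with an empty 'text'
```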
the following images show this clearly...


