Question
I am working on a custom chunking method where I need to identify headings, subheadings, and child headings separately. Here's the detailed explanation:
Current Issue:
I am using Docling to tag headings in a PDF.
Currently, every heading, including nested ones (subheadings and child headings), is tagged with the same `##` marker.
There is no differentiation between parent headings and sub-level headings.
Objective:
I want to store section headings and titles as metadata for the content under each subheading.
Example: For a PDF with 3 chapters, each having multiple subheadings, each chunk should carry:
Chapter Name as the Title.
Subheading as the Section Heading.
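To make the target output concrete, here is a minimal sketch of the chunk metadata I have in mind. It assumes the heading levels are already known, which is exactly the part I cannot get today. The input format (a list of `(level, text)` pairs) and the `build_chunks` helper are hypothetical and not part of Docling.

```python
from typing import Dict, List, Optional, Tuple


def build_chunks(items: List[Tuple[int, str]]) -> List[Dict[str, Optional[str]]]:
    """Attach the current chapter and subheading to each body-text chunk.

    Assumed input: (level, text) pairs in reading order, where
    level 0 is body text, level 1 is a chapter, and level 2+ is a subheading.
    """
    title: Optional[str] = None
    section: Optional[str] = None
    chunks: List[Dict[str, Optional[str]]] = []
    for level, text in items:
        if level == 1:
            # A new chapter starts, so the previous subheading no longer applies.
            title, section = text, None
        elif level >= 2:
            section = text
        else:
            chunks.append({"title": title, "section_heading": section, "text": text})
    return chunks


# Hypothetical document: two chapters with subheadings.
items = [
    (1, "Chapter 1"),
    (2, "1.1 Background"),
    (0, "Some background text."),
    (2, "1.2 Scope"),
    (0, "Some scope text."),
    (1, "Chapter 2"),
    (0, "Chapter intro text."),
]
chunks = build_chunks(items)
```

With this shape, each chunk carries the chapter name as `title` and the nearest subheading as `section_heading`; text that appears directly under a chapter gets `section_heading` set to `None`.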
Current Limitation:
While I can extract the lowest level of headings, I am unable to identify the parent headings since the tags do not differentiate between them.
Assumption for Hierarchy:
I assume that chapter names are typically larger in font size compared to subheadings and child headings.
A hierarchy based on text size or boldness could be useful to identify different levels of headings.
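The font-size assumption could be sketched roughly as follows. This is only an illustration: it assumes each heading has already been paired with a font size by some other means, and the `assign_heading_levels` helper, the `tolerance` parameter, and the sample data are all hypothetical. It ranks the distinct font sizes from largest to smallest and uses that rank as the heading level.

```python
from typing import List, Tuple


def assign_heading_levels(
    headings: List[Tuple[str, float]], tolerance: float = 0.5
) -> List[Tuple[str, int]]:
    """Map (text, font_size) pairs to (text, level); level 1 is the largest font."""
    # Collect distinct font sizes, largest first. A size that is within
    # `tolerance` of the previous tier is merged into it, which absorbs
    # small measurement differences.
    tiers: List[float] = []
    for size in sorted({size for _, size in headings}, reverse=True):
        if not tiers or tiers[-1] - size > tolerance:
            tiers.append(size)

    def level_for(size: float) -> int:
        # Use the tier whose representative size is closest to this heading.
        return min(range(len(tiers)), key=lambda i: abs(tiers[i] - size)) + 1

    return [(text, level_for(size)) for text, size in headings]


# Hypothetical headings with font sizes in points.
headings = [
    ("Chapter 1", 18.0),
    ("1.1 Background", 14.0),
    ("1.1.1 Details", 12.0),
    ("Chapter 2", 18.2),
    ("2.1 Methods", 14.0),
]
levels = assign_heading_levels(headings)
```

Boldness could be used the same way as a tie-breaker when two levels share a font size, but that is not covered in this sketch.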
Question:
Is there a way to distinguish headings, subheadings, and child headings separately based on these characteristics (e.g., font size, boldness)?
Any solution or guidance to achieve this would be highly appreciated.