Identify table of contents for better chunking Hierarchy Identification #287

Question
I am working on a custom chunking method where I need to identify headings, subheadings, and child headings separately. Here's the detailed explanation:

Current Issue:

- I am using Docling to tag headings in a PDF.
- Currently, all nested headings (subheadings and child headings) are marked as regular headings with ##.
- There is no differentiation between parent headings and sub-level headings.

Objective:

- I want to store section headings and titles as metadata for the content under each subheading.
- Example: For a PDF with 3 chapters, each having multiple subheadings, each chunk should have:
  - Chapter Name as the Title.
  - Subheading as the Section Heading.
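
To make the target output concrete, here is a minimal sketch of the chunk structure described above. It assumes the Markdown already distinguishes chapter headings (`#`) from subheadings (`##`), which is exactly what the current export does not provide; `Chunk` and `chunk_by_headings` are hypothetical names for illustration, not part of Docling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    title: str            # nearest chapter heading ("# ...")
    section_heading: str  # nearest subheading ("## ...")
    text: str             # body text found under that subheading


def chunk_by_headings(markdown: str) -> List[Chunk]:
    """Split Markdown into chunks, attaching the enclosing headings as metadata."""
    chunks: List[Chunk] = []
    title, section = "", ""
    body: List[str] = []

    def flush() -> None:
        # Emit the text collected so far under the current title/section.
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(title, section, text))
        body.clear()

    for line in markdown.splitlines():
        if line.startswith("# "):
            flush()
            title, section = line[2:].strip(), ""  # new chapter resets the section
        elif line.startswith("## "):
            flush()
            section = line[3:].strip()
        else:
            body.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    sample = "# Chapter 1\n## Intro\nHello\n## Scope\nWorld\n# Chapter 2\n## Setup\nText"
    for chunk in chunk_by_headings(sample):
        print(chunk)
```

Deeper levels (child headings) could be handled the same way by tracking one more variable per level.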

Current Limitation:

- While I can extract the lowest level of headings, I am unable to identify the parent headings since the tags do not differentiate between them.

Assumption for Hierarchy:

- I assume that chapter names are typically larger in font size compared to subheadings and child headings.
- A hierarchy based on text size or boldness could be useful to identify different levels of headings.
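
The assumption above could be sketched roughly as follows: rank the distinct (font size, bold) styles seen among the headings, largest first, and use the rank as the heading level. The input records with `text`, `size`, and `bold` keys are a hypothetical format; how to obtain them from the PDF is left open, and `assign_heading_levels` is an illustrative name only.

```python
def assign_heading_levels(headings, max_level=3):
    """Map each heading to a level based on its font size and boldness.

    `headings` is a list of dicts with "text", "size" and optional "bold" keys
    (a hypothetical format). Larger sizes get lower level numbers; at equal
    size, bold outranks regular text.
    """
    def style(h):
        # Round sizes so tiny extraction differences do not create extra levels.
        return (round(h["size"], 1), bool(h.get("bold", False)))

    distinct_styles = sorted({style(h) for h in headings}, reverse=True)
    level_of = {s: min(i + 1, max_level) for i, s in enumerate(distinct_styles)}
    return [(level_of[style(h)], h["text"]) for h in headings]


if __name__ == "__main__":
    sample = [
        {"text": "Chapter 1", "size": 18.0, "bold": True},
        {"text": "Overview", "size": 14.0, "bold": True},
        {"text": "Details", "size": 12.0, "bold": False},
        {"text": "Chapter 2", "size": 18.0, "bold": True},
    ]
    for level, text in assign_heading_levels(sample):
        print("#" * level, text)
```

This only works when the document uses font styles consistently; a PDF that styles chapter names and subheadings identically would need another signal, such as section numbering.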

Question:

- Is there a way to distinguish headings, subheadings, and child headings separately based on these characteristics (e.g., font size, boldness)?
- Any solution or guidance to achieve this would be highly appreciated.
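
One possible way to get at font size and boldness, sketched here as an assumption rather than a Docling feature, is to read them with a separate PDF library. PyMuPDF's `page.get_text("dict")` returns blocks, lines, and spans, where each span carries a `size` and a `flags` bitfield (bit 4 set means bold). The helper below, `collect_span_styles`, is a hypothetical name; it only flattens a dict of that shape, so it runs without a PDF at hand.

```python
BOLD_FLAG = 1 << 4  # bit 4 of a PyMuPDF span's "flags" marks bold text


def collect_span_styles(page_dict):
    """Flatten a PyMuPDF-style page dict into text/size/bold records."""
    records = []
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks carry no "lines"
            for span in line.get("spans", []):
                text = span.get("text", "").strip()
                if not text:
                    continue
                records.append({
                    "text": text,
                    "size": span["size"],
                    "bold": bool(span.get("flags", 0) & BOLD_FLAG),
                })
    return records


# Usage with PyMuPDF (not executed here; "report.pdf" is a placeholder path):
#   import fitz
#   doc = fitz.open("report.pdf")
#   records = collect_span_styles(doc[0].get_text("dict"))
```

The resulting records could then be matched by text against the headings Docling has tagged, so that each `##` heading picks up a size and boldness to rank by.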
