feat: Enhance the triplets extraction in the knowledge graph by the batch size #2091

Merged: Aries-ckt merged 20 commits into eosphoros-ai:main from Appointat:feat/async_triplets_extraction on Nov 5, 2024

Conversation

@Appointat Appointat commented Oct 23, 2024

Description

This PR runs the extraction calls asynchronously, in batches, to speed up triplet extraction from chunk text. The batch size is set in .env via TRIPLET_EXTRACTION_BATCH_SIZE (default: 20).
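
For illustration only, here is a minimal sketch of how such a setting could be read from the environment. The helper name `read_batch_size` and the fall-back-to-default behavior on invalid values are assumptions, not code from this PR; only the variable name and the default of 20 come from the description above.

```python
import os
from typing import Mapping


def read_batch_size(env: Mapping[str, str] = os.environ, default: int = 20) -> int:
    """Hypothetical helper: parse TRIPLET_EXTRACTION_BATCH_SIZE as a positive int."""
    raw = env.get("TRIPLET_EXTRACTION_BATCH_SIZE")
    if raw is None or not raw.strip():
        # Unset or empty: use the documented default.
        return default
    try:
        value = int(raw)
    except ValueError:
        # Assumption: a non-numeric value falls back to the default.
        return default
    # Assumption: zero or negative values are treated as invalid.
    return value if value > 0 else default
```

A batch size of 1 would make extraction effectively serial, which is a convenient baseline when comparing running times.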

How Has This Been Tested?

I ran the app server with TRIPLET_EXTRACTION_BATCH_SIZE set to different values (1, 5, and 100), and the running time varied with the setting.

Snapshots:

        batch_size = self._triplet_extraction_batch_size

        for i in range(0, len(chunks), batch_size):
            batch_chunks = chunks[i : i + batch_size]

            extraction_tasks = [
                self._graph_extractor.extract(chunk.content) for chunk in batch_chunks
            ]
            async_graphs: List[List[MemoryGraph]] = await asyncio.gather(
                *extraction_tasks
            )

            for chunk, graphs in zip(batch_chunks, async_graphs):
                for graph in graphs:
                    if document_graph_enabled:
                        # append the chunk id to the edge
                        for edge in graph.edges():
                            edge.set_prop("_chunk_id", chunk.chunk_id)
                            graph.append_edge(edge=edge)

                    # upsert the graph
                    self._graph_store_apdater.upsert_graph(graph)

                    # chunk -> include -> entity
                    if document_graph_enabled:
                        for vertex in graph.vertices():
                            self._graph_store_apdater.upsert_chunk_include_entity(
                                chunk=chunk, entity=vertex
                            )
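
The batching pattern in the snapshot can be tried in isolation. The following is a self-contained sketch, not the project's code: `gather_in_batches` and `fake_extract` are hypothetical names, and the fake extractor only sleeps briefly in place of a real LLM call. It relies on the documented behavior that `asyncio.gather` returns results in the order the awaitables were passed, which is what lets the snapshot `zip` chunks with their graphs.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def gather_in_batches(
    items: Sequence[T],
    worker: Callable[[T], Awaitable[R]],
    batch_size: int,
) -> List[R]:
    """Run `worker` over `items`, at most `batch_size` calls in flight at once."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    results: List[R] = []
    for start in range(0, len(items), batch_size):
        batch = items[start : start + batch_size]
        # gather preserves input order, so results line up with `batch`.
        results.extend(await asyncio.gather(*(worker(item) for item in batch)))
    return results


async def fake_extract(text: str) -> str:
    """Stand-in for an extractor call (assumption: real calls are I/O bound)."""
    await asyncio.sleep(0.01)
    return text.upper()


def demo() -> List[str]:
    chunks = [f"chunk-{i}" for i in range(7)]
    return asyncio.run(gather_in_batches(chunks, fake_extract, batch_size=3))
```

Each batch finishes completely before the next one starts, so one slow call delays its whole batch. That is a simple way to cap concurrency; a semaphore would be an alternative if a steadier flow of requests is wanted.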

Checklist:

  • My code follows the style guidelines of this project
  • I have already rebased the commits and made the commit messages conform to the project standard.
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • Any dependent changes have been merged and published in downstream modules

Co-authored-by: Appointat <appointat@shu.edu.cn>
Co-authored-by: Appointat <appointat@shu.edu.cn>
@github-actions github-actions Bot added the enhancement New feature or request label Oct 23, 2024

Appointat commented Oct 23, 2024

@Aries-ckt @fanzhidongyzby Could you please review this and add some tags? Thanks.

Aries-ckt previously approved these changes Oct 24, 2024
@Aries-ckt Aries-ckt left a comment

LGTM.

@fanzhidongyzby fanzhidongyzby left a comment

The read and write order of chunk_history in _graph_extractor needs to be adjusted; otherwise it will lead to inconsistency between text block recall and serial semantics.
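
One possible reading of this concern, as a hedged sketch: if every concurrent task both reads from and writes to the history, the context a chunk recalls depends on task interleaving rather than on chunk order. The toy code below keeps the history reads and writes serial and in chunk order, then runs only the slow extraction step concurrently. `ChunkHistory`, `fake_llm_extract`, and `extract_batch` are hypothetical stand-ins; the actual fix in this PR is not shown in the thread and may differ.

```python
import asyncio
from typing import List


class ChunkHistory:
    """Toy stand-in for a chunk history store (assumption, not the real class)."""

    def __init__(self) -> None:
        self._chunks: List[str] = []

    def recall(self, limit: int = 2) -> List[str]:
        # Return a copy of the most recent chunks saved so far.
        return self._chunks[-limit:]

    def save(self, text: str) -> None:
        self._chunks.append(text)


async def fake_llm_extract(text: str, context: List[str]) -> str:
    """Stand-in for the slow extraction call."""
    await asyncio.sleep(0.01)
    return f"{text} | context={context}"


async def extract_batch(texts: List[str], history: ChunkHistory) -> List[str]:
    # Phase 1 (serial): for each chunk, read its context, then write the chunk,
    # so chunk i sees exactly the chunks that precede it.
    contexts: List[List[str]] = []
    for text in texts:
        contexts.append(history.recall())
        history.save(text)
    # Phase 2 (concurrent): only the extraction calls run in parallel.
    return await asyncio.gather(
        *(fake_llm_extract(t, c) for t, c in zip(texts, contexts))
    )


def demo() -> List[str]:
    return asyncio.run(extract_batch(["a", "b", "c"], ChunkHistory()))
```

With this ordering, the recalled context matches what a purely serial run would produce, regardless of which extraction call finishes first.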

Appointat and others added 4 commits October 28, 2024 16:30
Co-authored-by: Appointat <appointat@shu.edu.cn>
Co-authored-by: Appointat <appointat@shu.edu.cn>
Co-authored-by: Appointat <appointat@shu.edu.cn>
…thod

Co-authored-by: Appointat <appointat@shu.edu.cn>
@Appointat

> The read and write order of chunk_history in _graph_extractor needs to be adjusted; otherwise it will lead to inconsistency between text block recall and serial semantics.

Thank you for your comment; I have just fixed it.

@fanzhidongyzby fanzhidongyzby left a comment

Please refine the code according to the following comments.

Comment thread .env.template Outdated
Comment thread dbgpt/storage/knowledge_graph/community_summary.py
Comment thread dbgpt/rag/transformer/graph_extractor.py Outdated

Appointat commented Oct 30, 2024

(two attached images)

@fanzhidongyzby fanzhidongyzby left a comment

lgtm

@Aries-ckt Aries-ckt left a comment

LGTM

@Aries-ckt Aries-ckt merged commit 25d47ce into eosphoros-ai:main Nov 5, 2024
@Appointat Appointat deleted the feat/async_triplets_extraction branch November 5, 2024 12:17

Labels

enhancement (New feature or request), hacktoberfest

3 participants