Skip to content

Indexing large tables has very poor throughput #39

@rvause

Description

@rvause

I have a table to index with over 20m rows. The chunked row iteration uses limit/offset queries which gradually take longer to execute as the offset increases.

Indexing these rows using the current get_indexing_queryset took over 15 hours to complete.

Working around the issue, using cursor pagination helped a lot with the bulk indexing job taking about ~65 minutes. The downside to this is that Django's prefetch queryset API would not apply.

Here is the implementation of the get_indexing_queryset method using QuerySet.iterator():

class Document(DODocument):
    def get_indexing_queryset(
        self,
        verbose: bool = False,
        filter_: Optional[Q] = None,
        exclude: Optional[Q] = None,
        count: Optional[int] = None,
        action: OpensearchAction = OpensearchAction.INDEX,
        stdout: TextIO = sys.stdout,
    ) -> Iterable[Any]:
        chunk_size = cast(int, self.django.queryset_pagination)
        qs = self.get_queryset(filter_=filter_, exclude=exclude, count=count)
        qs = qs.order_by("pk") if not qs.query.is_sliced else qs
        count = qs.count()
        model = self.django.model.__name__
        action = action.present_participle.title()

        i = 0
        done = 0
        start = time.time()
        if verbose:
            stdout.write(f"{action} {model}: 0% ({self._eta(start, done, count)})\r")

        iterqs = qs.iterator()
        while done < count:
            if verbose:
                eta = self._eta(start, done, count)
                stdout.write(f"{action} {model}: {round(i / count * 100)}% ({eta})   \r")

            for _ in range(chunk_size):
                done += 1
                try:
                    yield next(iterqs)
                except StopIteration:
                    # Ensure we break out of the outer loop
                    done = count + 1
                    break

            i = min(i + chunk_size, count)
        if verbose:
            stdout.write(f"{action} {count} {model}: OK          \n")

Metadata

Metadata

Assignees

No one assigned

    Labels

    featureNew feature or request

    Type

    No type

    Projects

    Status

    📋 Backlog

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions