I have a table to index with over 20m rows. The chunked row iteration uses limit/offset queries which gradually take longer to execute as the offset increases.
Indexing these rows using the current get_indexing_queryset took over 15 hours to complete.
Working around the issue, using cursor pagination helped a lot with the bulk indexing job taking about ~65 minutes. The downside to this is that Django's prefetch queryset API would not apply.
Here is the implementation of the get_indexing_queryset method using QuerySet.iterator():
class Document(DODocument):
def get_indexing_queryset(
self,
verbose: bool = False,
filter_: Optional[Q] = None,
exclude: Optional[Q] = None,
count: Optional[int] = None,
action: OpensearchAction = OpensearchAction.INDEX,
stdout: TextIO = sys.stdout,
) -> Iterable[Any]:
chunk_size = cast(int, self.django.queryset_pagination)
qs = self.get_queryset(filter_=filter_, exclude=exclude, count=count)
qs = qs.order_by("pk") if not qs.query.is_sliced else qs
count = qs.count()
model = self.django.model.__name__
action = action.present_participle.title()
i = 0
done = 0
start = time.time()
if verbose:
stdout.write(f"{action} {model}: 0% ({self._eta(start, done, count)})\r")
iterqs = qs.iterator()
while done < count:
if verbose:
eta = self._eta(start, done, count)
stdout.write(f"{action} {model}: {round(i / count * 100)}% ({eta}) \r")
for _ in range(chunk_size):
done += 1
try:
yield next(iterqs)
except StopIteration:
# Ensure we break out of the outer loop
done = count + 1
break
i = min(i + chunk_size, count)
if verbose:
stdout.write(f"{action} {count} {model}: OK \n")
I have a table to index with over 20m rows. The chunked row iteration uses limit/offset queries which gradually take longer to execute as the offset increases.
Indexing these rows using the current
get_indexing_querysettook over 15 hours to complete.Working around the issue, using cursor pagination helped a lot with the bulk indexing job taking about ~65 minutes. The downside to this is that Django's prefetch queryset API would not apply.
Here is the implementation of the
get_indexing_querysetmethod usingQuerySet.iterator():