Bypass total frequency check if field uses custom term frequency [LUCENE-10048] #11086

@asfimport

Description

For every field whose index option is not IndexOptions.NONE, there is a check on the per-field total token count (i.e. the field length) to ensure we don't index too many tokens. The count is computed by accumulating the term frequency reported by each token's TermFrequencyAttribute.

Lucene currently allows a custom term frequency to be attached to each token, and how those frequencies are used varies widely. As a result, the check can fail for a field that has only a few tokens if those tokens carry large frequencies. When that happens, Lucene currently skips indexing the whole document. For example:

"foo|<very large number> bar|<very large number>"
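To make the failure mode concrete, here is a minimal, Lucene-free sketch of an accumulate-and-check loop. The class `FieldLengthSketch`, the method `fieldLength`, the `term|freq` parsing, and the exception message are all hypothetical names chosen for illustration. Using an overflow-checked `int` sum as the limit is an assumption about how such a check could behave, not a copy of the indexing chain's code.

```java
// Illustrative only: this does not use Lucene and is not Lucene's implementation.
class FieldLengthSketch {

    // Sums the per-token frequencies of a "term|freq term|freq" string.
    // A token without "|freq" counts as frequency 1.
    static int fieldLength(String text) {
        int length = 0;
        for (String token : text.trim().split("\\s+")) {
            int sep = token.lastIndexOf('|');
            int freq = (sep < 0) ? 1 : Integer.parseInt(token.substring(sep + 1));
            try {
                // Assumed limit: the running sum must fit in an int.
                length = Math.addExact(length, freq);
            } catch (ArithmeticException e) {
                throw new IllegalArgumentException(
                    "field length overflowed while adding token \"" + token + "\"", e);
            }
        }
        return length;
    }

    public static void main(String[] args) {
        // Two ordinary tokens: the sum stays small.
        System.out.println(fieldLength("foo|3 bar|4"));

        // Two tokens with very large frequencies: the sum no longer fits.
        try {
            fieldLength("foo|2147483647 bar|2147483647");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Under these assumptions, a field with just two tokens is rejected even though its actual token count is tiny, which is the situation described above.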

What would be a good way to tell the indexing chain not to check the field length?

A related observation: when custom term frequencies are in use, the user is unlikely to use the similarity for this field. Maybe we can offer a way to specify that, too?


Migrated from LUCENE-10048 by Tony Xu (@Tony-X), 1 vote, resolved Aug 13 2021
