Compute 'length' of fields with custom term frequencies as docFreq #15584
msokolov wants to merge 2 commits into apache:main
|
Oh... interesting. If this |
|
This won't fix the issue of all kinds of crazy/undefined behavior at search-time. It is just allowing it to happen at index-time. |
|
High-level though, the idea of indicating custom term frequencies in the FieldInfo is a good one. I wouldn't try to do it as an "attribute" though, I think. I feel like after that, the remaining concerns are in the scoring system and pruning. It needs lots of testing to make sure they behave reasonably. |
|
I tried using an attribute since it seemed less intrusive, but using the enum to avoid unwanted combinations makes sense. |
|
I swear I posted a comment over on the linked issue, but today I don't see it. Basically just wondering what can go wrong at search time. I added some randomness to the test infra here to see if we can catch some flies. I guess someone might try using BM25 similarity with this field, or some scoring that assumes that totalTermFrequency > termFrequency(doc)?
Searching and pruning would be my concerns. Look at BaseSimilarityTestCase for the former. Enabling the option randomly won't find anything unless you actually trigger the excessive values too; I think the test logic for the similarities must be explicit. |
|
OK, I think we also need to revise the approach somewhat, since we can have overflows when there are multiple instances of the same term in the same document. The intent here is that this field type stores a single score per document per term, so it makes no sense to index the same term twice in the same document. I think we can raise an exception when we detect this in order to prevent it. |
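The duplicate check described above could be sketched roughly as follows. This is a standalone illustration in plain Java, not the PR's code: `TermScoreCollector`, `add`, and `length` are hypothetical names, and the real check would presumably live in the indexing chain rather than in a separate map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-document collector; not the PR's implementation.
class TermScoreCollector {
    // One score per term for the document currently being indexed.
    private final Map<String, Integer> scoreByTerm = new HashMap<>();

    // Records the score for a term, rejecting a second occurrence of the same term.
    void add(String term, int score) {
        Integer previous = scoreByTerm.putIfAbsent(term, score);
        if (previous != null) {
            throw new IllegalArgumentException(
                "term \"" + term + "\" indexed more than once in the same document");
        }
    }

    // Field length under the "count terms" interpretation.
    int length() {
        return scoreByTerm.size();
    }

    public static void main(String[] args) {
        TermScoreCollector doc = new TermScoreCollector();
        doc.add("red", 900);
        doc.add("blue", 1200);
        System.out.println("length: " + doc.length());
        try {
            doc.add("red", 5); // same term again in the same document
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Rejecting the repeat outright, rather than summing or overwriting, keeps the stored value unambiguous and avoids the overflow case entirely.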
|
@msokolov Noticed this is 10.4, do we want this to be in 10.4? |
|
I would like it to be, but I don't think it's going to be ready very soon, so it doesn't seem realistic to hold up a release for it. |
|
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |
fixes #11086
This change introduces a "termdoc" field type whose term frequencies are to be interpreted as scores, not frequencies. For such fields, we want the field length encoded by DefaultIndexingChain to count terms rather than sum term frequencies (same as docFreq).
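To make the difference concrete, here is a minimal sketch in plain Java that computes a field's length both ways for one document. It does not use Lucene's classes; `FieldLengthSketch`, `lengthAsFreqSum`, and `lengthAsTermCount` are hypothetical names, and the score values are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Lucene.
class FieldLengthSketch {

    // Conventional field length: the sum of the per-term frequencies in one document.
    // Returned as long so the demo can show a sum that would not fit in an int.
    static long lengthAsFreqSum(Map<String, Integer> termFreqs) {
        long sum = 0;
        for (int freq : termFreqs.values()) {
            sum += freq;
        }
        return sum;
    }

    // Length for score-carrying fields as described above: the number of distinct terms.
    static int lengthAsTermCount(Map<String, Integer> termFreqs) {
        return termFreqs.size();
    }

    public static void main(String[] args) {
        // Term -> custom "frequency" that is really a score.
        Map<String, Integer> scores = new LinkedHashMap<>();
        scores.put("red", 900);
        scores.put("blue", 1_500_000_000);
        scores.put("green", 2_000_000_000);

        long summed = lengthAsFreqSum(scores);
        System.out.println("sum of frequencies: " + summed);
        System.out.println("fits in an int: " + (summed <= Integer.MAX_VALUE));
        System.out.println("term count: " + lengthAsTermCount(scores));
    }
}
```

When the stored values are scores, their sum can exceed what an `int` can hold after only a few large entries, whereas the term count grows by one per distinct term.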