-
Notifications
You must be signed in to change notification settings - Fork 1.8k
tag based filtering: avoid duplicate rows in results #13442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tag based filtering: avoid duplicate rows in results #13442
Conversation
mtesauro
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved
|
@fopina can you please take a look at this one? It may pique your interest 😄 |
fopina
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very interesting topic, thanks for tagging!
dojo/filters.py
Outdated
| ) | ||
|
|
||
|
|
||
| class TagExistsIContainsFilter(CharFilter): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this filter picked up automatically by something or was it some experiment and forgotten?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed the leftover.
dojo/filters.py
Outdated
| for name, f in self.filters.items(): | ||
| field_name = getattr(f, "field_name", "") or "" | ||
| # filtering on tag names would result duplicate rows, one for each matching tag | ||
| if "tags__name" in field_name: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both tags and tag filters use tags__name field but only tag does a contains.
The exact match of tags should never yield duplicates as tags__name is unique, right?
Maybe a more generic would be if filter lookup_expr is icontains and exclude is False and field is an m2m then use distinct?
Not sure how to check for the latter (field is an m2m), need to inspect filter object to see if it is already bound to the actual model field (if so, it would be clean otherwise a bit hacky)
Of course restricting the distinct to "icontains" would only make sense for joins on unique fields. If it's a join on a non-unique field, it could yield duplicate results even with iexact matches...
Leaving some food for thought but maybe the simple condition (to cover all) would be if not exclude and field is m2m. It would apply to more cases then needed, compared to previous suggestion (eg: exact matches on unique relations) but the same as current condition, extended to any field
Not only covers other existing m2m fields (if there's any, didn't check), but also prepares for future fields that might drop into the models/filters and looks "better" (in sense it doesn't have "tags__name" hardcoded)
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've narrowed the condition which triggers the distinct.
dojo/filters.py
Outdated
| if value not in (None, "", [], (), {}): | ||
| # distinct has a performance impact, so only apply it if needed. | ||
| # we considered Postgress' DISTINCT ON, but it would enforce ordering by id | ||
| # we considered changing to an EXISTS subquery, but it would make |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a shame for DISTINCT ON 😄 it's the price to pay for the performance boost!
The EXISTS option however sounds appealing! What internals would it be pinned to? Just the .tags.through, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just some stuff around mapping the fields through the m2m model. nothing crazy but I feel the current approach is easier to understand for everyone and has less chances or corner cases where some magical combination of filters/ordering might break if we rewrite the tag filter to an EXISTs query.
dojo/filters.py
Outdated
| # we considered changing to an EXISTS subquery, but it would make | ||
| # our code dependant on the some of the django-tagulous internal | ||
| return qs.distinct() | ||
| except Exception: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What exceptions are expected here? Shouldn't we avoid these broad excepts? Especially as it's also returning a .distinct() (in case that would be the source for unexpected exceptions)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed the leftover
|
@fopina Thanks for reviewing. To be honest initially I just planned on slapping on the |
13f587a to
ed3f9d8
Compare
|
Far from guru sorry if I made myself try to sound like one..! I’d likely just slap the distinct() for everything and only notice when production list of findings continuously timed out I guess tag contains is something rare to use as the whole point is to pinpoint them (even a list), makes sense to go unnoticed thanks! |
|
I was just joking. I think it's good to have people around to keep me sharp. |
|
@fopina FWIW I am/was impressed with your insight 😄 |
Because the
tagsfield is a M2M relationship behind the scenes, this can result in duplicate rows in the result due to the nature of SQL JOINs.This PR resolves this by applying DISTINCT when filtering on tag names.
Fixes #13429