Remove redundant and unnecessary sorts
Refactor get_top_k_items to return DataFrame with 'rank' column, same as pyspark's
gramhagen left a comment
this is great, small improvement suggested
    .apply(lambda x: x.nlargest(k, col_rating))
    .reset_index(drop=True)
)
top_k_items["rank"] = top_k_items.groupby(col_user).cumcount() + 1
you can avoid the repeated groupby too
groups = dataframe.groupby(col_user, as_index=False)
top_k_items = groups.apply(lambda x: x.nlargest(k, col_rating)).reset_index(drop=True)
top_k_items["rank"] = groups.cumcount() + 1
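For reference, here is a minimal sketch of the "top k per user plus rank" pattern under discussion. It is not the code in this PR: it swaps `groupby().apply(nlargest)` for a single `sort_values` followed by `groupby().head(k)`. The function name `top_k_with_rank` and the column names `userID`, `itemID` and `rating` are assumptions made up for the example.

```python
import pandas as pd


def top_k_with_rank(df, col_user="userID", col_rating="rating", k=2):
    """Return the k highest-rated rows per user plus a 1-based 'rank' column.

    Hypothetical helper for illustration only; not the repository's function.
    """
    # Sort once: users ascending, ratings descending within each user.
    ordered = df.sort_values([col_user, col_rating], ascending=[True, False])
    # head(k) keeps the first k rows of every group, in the sorted order.
    top_k = ordered.groupby(col_user, sort=False).head(k).reset_index(drop=True)
    # Rows are already ordered within each user, so the position of a row
    # inside its group (counted on the selected rows) is its rank.
    top_k["rank"] = top_k.groupby(col_user, sort=False).cumcount() + 1
    return top_k


# Made-up ratings, deliberately unsorted.
ratings = pd.DataFrame(
    {
        "userID": [2, 1, 2, 1, 2],
        "itemID": [10, 11, 12, 13, 14],
        "rating": [3.0, 5.0, 4.0, 2.0, 1.0],
    }
)
result = top_k_with_rank(ratings, k=2)
print(result)
```

The sketch computes `cumcount()` on the selected rows rather than on a groupby of the original frame. A `cumcount()` taken on the original groups numbers rows in their input order, which only matches the rating order if the input happens to be sorted that way.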
Returns:
    pd.DataFrame: DataFrame of top k items for each user
    pd.DataFrame: DataFrame of top k items for each user, sorted by `col_user` and `"rank"`
i would remove the double quotes from rank to match just the backticks like col_user
also, in the returns section of get_top_k_items =)
gramhagen left a comment
one more "rank" is there, if you can fix that then we're good
@gramhagen Few changes since the last review:
oh interesting, i didn't realize we use the python evaluation to validate test results for spark, we should remove that linkage, I'll add a separate feature request

oh, i take it back, I guess that's an additional check just to ensure they match. i guess it helped in this case.

@loomlike feel free to merge when you think it is convenient
* Python evaluator module fix
  Remove redundant and unnecessary sorts
  Refactor get_top_k_items to return DataFrame with 'rank' column, same as pyspark's
* Update test to catch corner case
Description
The Python evaluation module's ranking metric functions contain redundant and unnecessary sorting code.
E.g.
doesn't need to use `rank()` since `df_hit` is already sorted by user and rating: it is generated by a groupby on user (pandas groupby's `sort` argument is True by default) followed by nlargest on ratings.

This change removes those redundant and unnecessary sorts and also refactors `get_top_k_items` to return a DataFrame with a 'rank' column, making its behavior the same as our pyspark evaluation module.

Related Issues
Checklist: