Conversation

orapic commented Feb 10, 2022

Add the possibility to concatenate simhashes to make a larger one.

This makes it possible to build a kind of "signature" in which multiple simhashes are combined into one. Tests for it are included.

Example with 4 simhashes of 32 bits:

input_to_concat = [simhash1, simhash2, simhash3, simhash4]
new_simhash = MultiSimhash(input_to_concat)
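
To make the idea of concatenation concrete, here is a minimal sketch at the bit level. It does not use the PR's `MultiSimhash` class: `concatenate_fingerprints` is a hypothetical helper, each fingerprint is represented as a plain `(value, f)` pair, and placing the first fingerprint in the highest bits is an assumption, since the ordering used by the PR is not shown here.

```python
def concatenate_fingerprints(parts):
    """Concatenate (value, f) pairs into a single (value, f) pair.

    Hypothetical helper for illustration only. The first pair ends up
    in the most significant bits of the result (an assumed ordering).
    """
    combined_value = 0
    combined_f = 0
    for value, f in parts:
        # Reject values that do not fit in the declared number of bits.
        if value < 0 or value >> f:
            raise ValueError("value does not fit in %d bits" % f)
        # Make room for f more bits, then place this value in them.
        combined_value = (combined_value << f) | value
        combined_f += f
    return combined_value, combined_f


value, f = concatenate_fingerprints([(0xAB, 8), (0xCD, 8)])
print(hex(value), f)  # 0xabcd 16
```

With four 32-bit fingerprints, the same loop would produce one 128-bit value.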

1e0ng (Owner) commented Feb 12, 2022

Hi @orapic, Thanks for the PR!
I can see one build failed. Could you check the cause and fix it?

orapic (Author) commented Mar 6, 2022

Ok, should be fixed now.

for i in simhashes:
    multi_f = multi_f + i.f
if multi_f % 8:
    raise Exception('Simhashes do not the same length (f)')
Owner

Hi, could you explain what you want to check here?

Author

Hi, well, looking at it again, that code doesn't make much sense. I will change it to check each simhash, one by one, against the length of the first simhash.
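
The check described in this reply could look roughly like the sketch below. It is not the code from the PR: `total_length` and the `Fingerprint` stand-in are hypothetical names, and the only thing assumed about a simhash object is that it exposes its bit length as `f`, as in the snippet under review.

```python
from collections import namedtuple

# Hypothetical stand-in exposing only the attributes used in this sketch.
Fingerprint = namedtuple("Fingerprint", ["value", "f"])


def total_length(simhashes):
    """Return the combined bit length, requiring every f to match the first.

    Assumes a non-empty list; handling the empty case is a separate check.
    """
    expected_f = simhashes[0].f
    for index, simhash in enumerate(simhashes[1:], start=1):
        if simhash.f != expected_f:
            raise ValueError(
                "simhash %d has f=%d, expected f=%d"
                % (index, simhash.f, expected_f)
            )
    return expected_f * len(simhashes)


print(total_length([Fingerprint(0x1, 32), Fingerprint(0x2, 32)]))  # 64
```

For comparison, the `multi_f % 8` test would accept lengths of 8 and 16 together, because their sum, 24, is divisible by 8, while a per-item comparison rejects that input.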

if multi_f % 8:
    raise Exception('Simhashes do not the same length (f)')
multi_value = self._concatenate_simhashes(simhashes)
super(MultiSimhash, self).__init__(value=multi_value, f=multi_f, hashfunc=simhashes[0].hashfunc)
Owner

Before using simhashes[0], we need to check the length of simhashes to make sure it's not empty. Also, since you are using the first element's hashfunc, do we assume the hashfunc should be the same for every element in the simhashes list?

Author

True, I will add a check for whether it's empty.
Regarding the hashfunc, I think it's safe to require that they all be the same. If they are not, which one would you choose for the new multihash?
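
A possible shape for the two checks discussed here, written as a standalone sketch rather than as the PR's code. `shared_hashfunc` and the `Fingerprint` stand-in are hypothetical, and comparing hash functions with `!=` is an assumption about what "the same" should mean; the final implementation may compare them differently.

```python
from collections import namedtuple

# Hypothetical stand-in exposing only the attribute used in this sketch.
Fingerprint = namedtuple("Fingerprint", ["value", "hashfunc"])


def shared_hashfunc(simhashes):
    """Return the hashfunc common to all items, or raise ValueError."""
    # Guard against an empty list before touching simhashes[0].
    if not simhashes:
        raise ValueError("at least one simhash is required")
    hashfunc = simhashes[0].hashfunc
    for simhash in simhashes[1:]:
        # Plain functions compare equal only to themselves.
        if simhash.hashfunc != hashfunc:
            raise ValueError("all simhashes must use the same hashfunc")
    return hashfunc


def hash_a(data):
    return data


def hash_b(data):
    return data


print(shared_hashfunc([Fingerprint(1, hash_a), Fingerprint(2, hash_a)]) is hash_a)  # True
```

Rejecting mixed hash functions avoids having to pick one arbitrarily for the combined object.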
