
Compute expected ancestry proof size#6

Closed
serban300 wants to merge 5 commits into paritytech:master from serban300:small_fixes

Conversation

@serban300
Collaborator

@serban300 serban300 commented Aug 16, 2024

Related to: paritytech/polkadot-sdk#4523

This is a follow-up to paritytech/polkadot-sdk#5188

The main purpose of this PR is to add logic for computing the expected ancestry proof size.

This PR also aligns some functions in ancestry_proof.rs with their counterparts in mmr.rs, which should make it easier to upstream the ancestry proof logic in the future.

@serban300
Collaborator Author

Converted to draft. Need to check some test failures

@serban300 serban300 changed the title from "Accept only optimal ancestry proofs" to "Compute expected ancestry proof size" Aug 17, 2024
@serban300 serban300 marked this pull request as ready for review August 17, 2024 12:23
@serban300
Collaborator Author

> Converted to draft. Need to check some test failures

Fixed

@acatangiu
Collaborator

cc @Lederstrumpf

Collaborator

@acatangiu acatangiu left a comment


high-level looks good, would like an audit too for peace of mind

Comment thread src/ancestry_proof.rs Outdated
Comment thread src/ancestry_proof.rs Outdated
Comment thread src/ancestry_proof.rs
@serban300 serban300 requested a review from ordian August 20, 2024 12:32

@ordian ordian left a comment


since i'm new to the code, left a couple of questions mainly about algorithmic complexity.
still need some time to get an understanding of why this change preserves validity

Comment thread src/ancestry_proof.rs Outdated

```rust
let mut sibs_processed_from_back = Vec::new();
{
if !queue.insert_sorted_by(value, cmp_nodes) {
```


Previously, the queue was sorted by `pos` first, and now it's sorted by `(height, pos)`.
My understanding is that the nodes will be sorted by pos, so this algorithm might be running in $\mathcal{O}(n^2)$ — is that correct?

Collaborator Author


In the worst case scenario we want to prove all the nodes. Let's say the number of nodes = n. The complexity of inserting in an ordered queue is O(log(n)). So I think the total complexity would be O(n * log(n)).

But if we want to prove just one node, we need to start from it and go up until we reach the root. Let's say the number of leaves = l. The height of the MMR is log(l). The number of nodes in the proof is at most the height of the MMR (we need at most one node per level), so the queue length is at most log(l). So the complexity should be O(log(l) * log(l)).

I hope I'm not missing anything.
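As a rough sanity check on the bound above, here is a minimal sketch for the perfect-tree case (`single_leaf_proof_len` is a hypothetical helper for illustration, not part of the crate's API):

```rust
// Sketch only: in a perfect Merkle tree with l = 2^h leaves, a single-leaf
// membership proof needs exactly one sibling hash per level, i.e.
// h = log2(l) items. A single-node MMR proof is bounded the same way
// per peak, plus the bagged peaks.
fn single_leaf_proof_len(leaf_count: u64) -> u32 {
    assert!(leaf_count.is_power_of_two());
    leaf_count.trailing_zeros() // log2(leaf_count)
}

fn main() {
    assert_eq!(single_leaf_proof_len(8), 3);
    assert_eq!(single_leaf_proof_len(1 << 20), 20);
}
```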



> The complexity of inserting in an ordered queue is O(log(n)).

This is not true. Insertion at a random position in a VecDeque is $\mathcal{O}(n)$, because if you insert in the middle, you need to shift every element on the right. If you want to guarantee logarithmic complexity, I'd suggest using a BTreeSet instead of a VecDeque.
But if the number in the proof is always bounded to a small number, that's probably fine as is. Just want to make sure we are not introducing a DoS vector in the worst case.
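A minimal sketch of the suggested swap, assuming nodes are keyed by `(height, pos)` as in the PR's comparator (the tuple key here is illustrative, not the PR's actual node type):

```rust
use std::collections::BTreeSet;

// Inserting into a BTreeSet is O(log n), whereas inserting at an arbitrary
// index of a sorted VecDeque is O(n) because later elements must shift.
fn main() {
    let mut queue: BTreeSet<(u32, u64)> = BTreeSet::new();
    // `insert` returns false for a value already present, mirroring the
    // de-duplication done by `insert_sorted_by` in the PR.
    assert!(queue.insert((0, 3)));
    assert!(queue.insert((1, 2)));
    assert!(!queue.insert((0, 3))); // duplicate rejected
    // Iteration yields values ordered by (height, pos).
    let sorted: Vec<_> = queue.iter().cloned().collect();
    assert_eq!(sorted, vec![(0, 3), (1, 2)]);
}
```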

Collaborator Author


Good point, you're right. It's O(n), because it's a vec-based structure. I missed that.

For our use case (ancestry proofs), yes the number of nodes in the proof should be small, but we should probably make this scalable in general. I will change it to a BTreeSet or a LinkedList.


@ordian ordian Aug 27, 2024


LinkedList would still have $\mathcal{O}(n)$ for finding the insertion position, so the right choice is BTreeSet or BinaryHeap 👍🏽

Collaborator Author

@serban300 serban300 Aug 27, 2024


> LinkedList would still have $\mathcal{O}(n)$ for finding the insertion position,

If it's sorted I think it should be O(log(n)). But yes, BTreeSet would work well; it's only that for small sizes it can be slower than a simple sorted structure, and usually the number of nodes should be small.
BinaryHeap is also a good option.

Collaborator Author


Or sorry, in a linked list it's not that easy to take advantage of the fact that it's sorted. You're right.

Collaborator Author


Yes, BTreeSet/BTreeMap seems to be the best option. Done.

Comment thread src/ancestry_proof.rs
```rust
    bagging_peaks_hashes::<_, M>(peaks_hashes)
}

pub fn expected_ancestry_proof_size(prev_mmr_size: u64, mmr_size: u64) -> usize {
```


given the title of the PR, this seems like the main change in the PR. however, the rest seems like a totally unrelated rewriting of algorithm/refactoring. any reason to mix the two in one PR? or is it not a refactoring and actually this function depends on it?

also, what is the complexity of this function (seems non-trivial to guesstimate given nested loops)

Collaborator Author

@serban300 serban300 Aug 26, 2024


> given the title of the PR, this seems like the main change in the PR.

Yes, exactly.

> however, the rest seems like a totally unrelated rewriting of algorithm/refactoring. any reason to mix the two in one PR? or is it not a refactoring and actually this function depends on it?

Yes, it's just a refactoring; no particular reason other than readability. I tried to follow the previous implementation in order to understand how to compute the expected proof size, and it was very hard, so I thought it would be helpful to refactor that as well. Also, if we're doing a security audit for this PR, it would be good to cover both changes. And we'll need to upstream ancestry proofs at some point; this change helps there because it makes the code more similar to the upstream.

> also, what is the complexity of this function (seems non-trivial to guesstimate given nested loops)

Worst-case scenario, for each peak we need to walk from a node inside it up to the peak height. So it should be something like O(num_peaks * log(num_leaves)).


@ordian ordian left a comment


Still don't fully grasp the details (probably best to ask @Lederstrumpf), but let's kick off the audit.

Comment thread src/util.rs Outdated
Comment thread src/ancestry_proof.rs
Comment on lines +291 to +325
```rust
let mut expected_proof_size: usize = 0;
let mut prev_peaks = get_peaks(prev_mmr_size);
let peaks = get_peaks(mmr_size);

for (peak_idx, peak) in peaks.iter().enumerate() {
    let mut local_prev_peaks: Vec<u64> = take_while_vec(&mut prev_peaks, |pos| *pos <= *peak);

    // Pop lowest local prev peak under the current peak
    match local_prev_peaks.pop() {
        Some(mut node) => {
            let mut height = pos_height_in_tree(node);
            while node < *peak {
                if pos_height_in_tree(node + 1) > height {
                    // Node is right sibling
                    node += 1;
                    height += 1;
                } else {
                    // Node is left sibling
                    expected_proof_size += 1;
                    node += parent_offset(height);
                    height += 1;
                }
            }
        }
        None => {
            if peak_idx <= peaks.len() {
                // Account for rhs bagging peaks
                expected_proof_size += 1;
            }
            break;
        }
    }
}

expected_proof_size
```
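For readers unfamiliar with the index helpers used above, here is a minimal self-contained sketch of `pos_height_in_tree` and `parent_offset`, assuming the conventions of the merkle-mountain-range crate (0-based post-order positions, leaves at height 0); the crate's actual implementations may differ in detail:

```rust
// A (pos + 1) value that is all ones in binary marks the leftmost node
// of its height, so the height can be read off its bit length.
fn all_ones(num: u64) -> bool {
    num != 0 && num.count_zeros() == num.leading_zeros()
}

// Height of the node at `pos` (leaves have height 0).
fn pos_height_in_tree(mut pos: u64) -> u32 {
    pos += 1;
    while !all_ones(pos) {
        // Jump to the same-height node in the leftmost perfect subtree.
        let bit_length = 64 - pos.leading_zeros();
        pos -= (1u64 << (bit_length - 1)) - 1;
    }
    64 - pos.leading_zeros() - 1
}

// Distance from a *left* sibling at `height` to its parent.
fn parent_offset(height: u32) -> u64 {
    2 << height
}

fn main() {
    // MMR with 4 leaves: leaves at positions 0, 1, 3, 4;
    // their parents at 2 and 5; the root at 6.
    assert_eq!(pos_height_in_tree(0), 0);
    assert_eq!(pos_height_in_tree(2), 1);
    assert_eq!(pos_height_in_tree(5), 1);
    assert_eq!(pos_height_in_tree(6), 2);
    // Left sibling 2 climbs to its parent at 2 + parent_offset(1) = 6.
    assert_eq!(2 + parent_offset(1), 6);
    // A right sibling is detected by its parent sitting at pos + 1:
    assert!(pos_height_in_tree(5 + 1) > pos_height_in_tree(5));
}
```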


This works by checking the frequency of left-sibling-hood when iterating up from the last prev_peak.
In some sense, it's a myopic climber ascending a dark mountain and feeling out its shape from relative location :P
The mountain peaks here are the co-path items for proving said last prev_peak.
We can also determine the exact shape of the mountain from just knowing its width (i.e. the number of leaves beneath it) though, and skip the climbing step.

I've opened up #7 to show the changes, but feel free to cherry pick the new algorithm over here instead.
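The peak-counting shortcut rests on a standard MMR fact: an MMR with `l` leaves has exactly one peak per set bit of `l` (each set bit contributes one perfect subtree). A minimal sketch (`num_peaks` is an illustrative helper, not part of the crate):

```rust
// Number of peaks of an MMR with `leaf_count` leaves: one perfect
// subtree per set bit of the leaf count.
fn num_peaks(leaf_count: u64) -> u32 {
    leaf_count.count_ones()
}

fn main() {
    // 11 = 0b1011 leaves -> peaks covering 8, 2 and 1 leaves.
    assert_eq!(num_peaks(11), 3);
    // A power-of-two leaf count gives a single perfect tree.
    assert_eq!(num_peaks(8), 1);
    // Per the approach in #7: if prev peaks cover p of a peak's t leaves,
    // the co-path item count equals num_peaks(t - p) of the flipped sub-MMR,
    // e.g. t = 8, p = 5 here.
    assert_eq!(num_peaks(8 - 5), 2);
}
```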

Collaborator Author


Thank you! Sounds good! I'll take a look.

@serban300
Collaborator Author

Closing in favor of #7

@serban300 serban300 closed this Aug 30, 2024
serban300 added a commit that referenced this pull request Oct 15, 2024
* Adjust node proofs verification logic

* Compute optimal ancestry proof size

* Add comments

* Use BTreeSet/BTreeMap

* use entry API

* calculate proof size from flipped sub-mmr

Calculating the number of copath items required for the proof of the
`prev_peaks` living beneath some peak can be done
simultaneously more succinctly & efficiently as follows:
Each peak is a perfect merkle tree. Therefore if we have some `prev_peaks`
living beneath a given `peak`, then the number of co-path items to
prove said `prev_peaks` equals the number of peaks that would be formed
for a sub-mmr with `leaf_count = leaves(peak) - leaves(prev_peaks)`. This is akin to
flipping the construction of the sub-mmr to be right-to-left and
counting its peaks.

* add benchmarks for ancestry & node proofs

* add proof size logic from #6 for benching

* Revert "add proof size logic from #6 for benching"

This reverts commit 6cd3808.

* improve readability with slice comparison

Suggested-by: Serban Iorga <[email protected]>

* Revert "Compute optimal ancestry proof size"

This reverts commit 5e0fa6e.

* Revert "Adjust node proofs verification logic"

This reverts commit dc1f5e8.

* add tests: "Adjust node proofs verification logic"

Co-authored-by: Serban Iorga <[email protected]>

* add tests: "Compute optimal ancestry proof size"

Co-authored-by: Serban Iorga <[email protected]>

* add applicable comment from "Add comments"

Co-authored-by: Serban Iorga <[email protected]>

---------

Co-authored-by: Serban Iorga <[email protected]>