Code for the Location Heatmaps paper. #47

Open
ebagdasa wants to merge 13 commits into google-research:master from ebagdasa:old_m

Conversation

@ebagdasa commented Nov 4, 2021

This code demonstrates the ability to build location heatmaps using a
distributed differential privacy mechanism and the proposed adaptive
algorithm. The code accompanies this paper: https://arxiv.org/abs/2111.02356 .
It also includes a Google Colab example for the experiments.

google-cla bot commented Nov 4, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.



@google-cla google-cla bot added the cla: no label Nov 4, 2021
@ebagdasa (Author) commented Nov 4, 2021

@googlebot I signed it!

@google-cla google-cla bot added cla: yes and removed cla: no labels Nov 4, 2021

To experiment with the code there is a working [notebook](dp_location_heatmaps.ipynb)
with all the examples from the paper, please don't hesitate to contact the
[author](mailto:[email protected]) or raise an issue.1


extraneous '1' at end of line

Author:

fixed

image: Any
level_sample_size: int = 10000
secagg_round_size: int = 10000
threshold: float = 0


how about 'split_threshold' to contrast with 'collapse_threshold' below?

Author:

fixed
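To make the rename concrete, here is a hypothetical sketch of the config after the suggested change (only the fields quoted above are from the patch; everything else is illustrative): `threshold` becomes `split_threshold` so it reads in deliberate contrast with `collapse_threshold`.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class AlgConfig:
    # Illustrative config sketch, not the repo's actual class.
    image: Any = None
    level_sample_size: int = 10000
    secagg_round_size: int = 10000
    split_threshold: float = 0           # was: threshold
    collapse_threshold: Optional[float] = None  # nodes below this get merged
```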

return np.log(2 * self.num_clients / self.lam - 1)

def get_noise_tensor(self, input_shape):
return


IIUC, this is unused for RapporNoise and should not be called by users. Shall we raise a NotImplementedError instead of silently returning? Should this method be _get_noise_tensor instead of get_noise_tensor to discourage direct usage, pointing users toward apply_noise instead?

@ebagdasa (Author) commented Dec 10, 2021:

fixed
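A minimal sketch of the fix the review suggests (the class name follows the discussion above; the bodies are illustrative stubs, not the repo's actual RAPPOR logic): the per-tensor hook fails loudly instead of silently returning None, steering callers toward apply_noise.

```python
class RapporNoise:
    """Illustrative stub; real randomized-response logic is omitted."""

    def _get_noise_tensor(self, input_shape):
        # Unused for RAPPOR; raise instead of silently returning None.
        raise NotImplementedError(
            'RapporNoise has no standalone noise tensor; use apply_noise().')

    def apply_noise(self, vector):
        # Placeholder for the actual noising of the aggregated vector.
        return vector
```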

@@ -36,13 +37,32 @@ class Metrics:
f1: f1 score on the discovered hot spots.
mutual_info: mutual information metric.
"""


add definition/description of new metrics (mape, smape, maape, nmse)


could also note how the zeros are handled (replaced with the next smallest true value from the image)

Author:

updated, added clarification to get_metrics()
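For readers of the thread, a sketch of two of the new ratio metrics and the zero handling described above (helper and function names are illustrative, not the repo's actual API):

```python
import math

def replace_zeros(true_vals):
    # Zeros in the true image are replaced with the smallest positive
    # true value so percentage-style metrics stay defined.
    min_pos = min(v for v in true_vals if v > 0)
    return [v if v > 0 else min_pos for v in true_vals]

def smape(true_vals, est_vals):
    # Symmetric mean absolute percentage error.
    t_vals = replace_zeros(true_vals)
    return sum(abs(e - t) / ((abs(t) + abs(e)) / 2)
               for t, e in zip(t_vals, est_vals)) / len(t_vals)

def maape(true_vals, est_vals):
    # Mean arctangent absolute percentage error; atan bounds each
    # term to [0, pi/2] even when the estimate is far off.
    t_vals = replace_zeros(true_vals)
    return sum(math.atan(abs((t - e) / t))
               for t, e in zip(t_vals, est_vals)) / len(t_vals)
```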

f'MSE: {metric.mse:.2e}')
ax.imshow(test_image)
f'MSE: {metric.mse:.2e}', fontsize=30)
ax.imshow(test_image, interpolation='gaussian')


Why Gaussian interpolation for image display? The statistics calculated and displayed over the image would be different if calculated on the gaussian-interpolated image, wouldn't they?

Author:

This is just for visualization; it doesn't impact the metrics. Mostly it improves rendering of the lines in the contour grid image.

# print(f'Collapsed: {collapsed}, created when collapsing: {created},' + \
# f'new expanded: {fresh_expand},' + \
# f'unchanged: {unchanged}, total: {len(new_tree_prefix_list)}')
if fresh_expand == 0: # len(new_tree_prefix_list) <= len(tree_prefix_list):


remove or uncomment extraneous debugging code

Author:

done

return current_image, pos_image, neg_image


def split_regions(tree_prefix_list,


this function is very long with a lot of nested conditions, which makes it hard to read. can we extract some reasonable helpers to improve readability by making the overall structure of the threshold checks and tree traversal more apparent?

Author:

done
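To illustrate the shape of the refactor (a simplified sketch, not the repo's actual implementation): the traversal loop stays flat, while the threshold check and node expansion live in small helpers, making the structure apparent at a glance.

```python
def split_regions_sketch(prefixes, counts, split_threshold):
    """Illustrative only: split each region whose count clears the threshold."""

    def should_split(count):
        return count > split_threshold

    def expand(prefix):
        # Each region splits into 4 quadrants, encoded as 2 extra bits.
        return [prefix + suffix for suffix in ('00', '01', '10', '11')]

    new_prefixes = []
    for prefix, count in zip(prefixes, counts):
        if should_split(count):
            new_prefixes.extend(expand(prefix))
        else:
            new_prefixes.append(prefix)
    return new_prefixes
```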

return vector


def makeGaussian(image, total_size, fwhm=3, center=None,


make_gaussian for consistent style

Author:

updated

elif level == 98:
z = 2.326
else:
raise ValueError(f'Incorrect confidence level {level}.')


It'd be nicer to just compute the z score analytically, rather than having sparse lookup table. I think the following should do the trick:

from scipy.stats import norm

z = norm.ppf(1-(1-level/100)/2)

Author:

updated
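The suggested formula can be checked against the old lookup table. A stdlib sketch using `statistics.NormalDist`, which provides the same inverse normal CDF as scipy's `norm.ppf`:

```python
from statistics import NormalDist

def z_score(level: float) -> float:
    # Two-sided z score for a confidence level given in percent,
    # replacing the sparse if/elif lookup table.
    return NormalDist().inv_cdf(1 - (1 - level / 100) / 2)
```

z_score(95) recovers 1.960 and z_score(98) recovers 2.326, matching the hard-coded table entries.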


def make_step(samples, eps, threshold, partial,
prefix_len, dropout_rate, tree, tree_prefix_list,
noiser, quantize, total_size, positivity, count_min):


please add docstring

Author:

added

@ebagdasa (Author):

@samellem please take a look, addressed your comments

@samellem left a comment:

Looking pretty good! Just a few additional ideas.

Also, it looks like you missed a few comments from the previous review that got auto-collapsed in the GitHub UI:
[screenshot of the auto-collapsed review comments]

Most of those were small nits, and some may not be as relevant after your changes, but please do take a look at them if you missed them the first time.

count_min: use count-min sketch
Returns:
new_tree, new_tree_prefix_list, finished
new_tree, new_tree_prefix_list, fresh_expand


The meaning of fresh_expand is not obvious, especially since this function both collapses and expands. Maybe num_newly_expanded_nodes?

Author:

great, thanks a lot!

split_threshold: threshold value used to split the nodes.
image_bit_level: stopping criteria once the final resolution is reached.
collapse_threshold: threshold value used to collapse the nodes.
expand_all: expand all regions,


can we achieve this functionality by just passing split_threshold = -np.inf and eliminate the extra parameter & special-casing?

Author:

yeah, great idea, fixed
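A one-line sketch of why this works: with split_threshold set to negative infinity, every node, even one with a zero or negative noisy count, passes the split check, so a separate expand_all flag becomes redundant.

```python
def should_split(count, split_threshold):
    # The same comparison the tree-building code performs per node.
    return count > split_threshold

# expand_all=True is then equivalent to passing:
split_threshold = float('-inf')
```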

return current_image, pos_image, neg_image


def update_tree(prefix, tree, tree_prefix_list):


nit: maybe append_to_tree instead of update_tree?

Author:

added

collapsed = 0
created = 0
fresh_expand = 0
unchanged = 0


collapsed, created, and unchanged do not appear to be used for anything anymore. let's delete them xor do something with them.

Author:

done

Author:

printing the results in the end of the function now

collapsed = 0
created = 0
fresh_expand = 0
unchanged = 0


likewise re. collapsed, created, and unchanged being unused

Author:

done

return new_tree, new_tree_prefix_list, fresh_expand


def split_regions_aux(tree_prefix_list,


I suspect that even more of these two functions could be shared (in particular, the basic structure of looping over prefixes and adding nodes to the tree as appropriate for the splitting & collapsing criteria), but acknowledge that it may not actually improve readability much more to do further surgery. Please consider sharing that prefix-looping structure, but if you can't see a clean and easy way to do so, that's fine.

Author:

Yeah, I agree; it's just that I need to look at both bits in the data, which is hard to unify. Maybe once we go to multiple dimensions we can unify everything.



def compute_conf_intervals(sum_vector: np.ndarray, level=95):
from scipy.stats import norm


I'd prefer not to do imports inline like this.

Author:

done



def make_step(samples, eps, threshold, partial,
def create_confidence_interval_condition(last_result, prefix, count, split_threshold):


we're ultimately just returning a boolean here, so maybe evaluate_confidence_interval_condition instead? the current name makes me think that we're returning some kind of predicate function

Author:

updated and added a docstring

quantize: apply quantization to the vectors.
noise_class: use specific noise, defaults to GeometricNoise.
save_gif: saves all images as a gif.
count_min: use count-min sketch.


resolved

@ebagdasa (Author):

Yeah, thanks a lot, and sorry for the missed comments. Should be all good now.

@samellem left a comment:

approved with some minor suggestions

binary version of the coordinate.
"""
x_coord, y_coord = xy_tuple
if len(xy_tuple) == 2:


I think this comment applies to aux_data now that that is being pulled from xy_tuple.

"""Returns a quad tree first 4 nodes. If aux_data (boolean) provided expands
to 2 more bits or a specific pos/neg nodes.
Args:
aux_data: a boolean to use additional bit for data, e.g. pos/neg.


IMO aux_data sounds like an actual data object rather than a boolean parameter; I would prefer a name like "has_aux_data" or even "has_aux_bit" since a single bit is all that's supported here. This also goes for other usages of "aux_data" as a boolean in other functions, below.

Really it would be ideal to just generalize this to support an arbitrary number of extra bits with an automatic encoding from the value specified in "split", rather than a single extra bit with a predefined 'pos'-->1 and 'neg'-->0 encoding, but I understand that is probably out of scope at present.

Author:

I agree, let me change it to has_aux_bit for now, and maybe can expand it later
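A sketch of the single-bit encoding being discussed (the function name is illustrative): with has_aux_bit set, 'pos' maps to 1 and 'neg' maps to 0, appended after the two coordinate bits of the level.

```python
def coord_to_prefix(x_bit, y_bit, aux_value=None):
    # Two bits per quad-tree level (one from each coordinate); an
    # optional extra bit encodes the predefined 'pos'/'neg' split.
    prefix = f'{x_bit}{y_bit}'
    if aux_value is not None:
        prefix += '1' if aux_value == 'pos' else '0'
    return prefix
```

Generalizing this to an arbitrary number of extra bits, as suggested above, would replace the hard-coded 'pos'/'neg' mapping with an automatic encoding of the split value.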

@@ -227,6 +225,7 @@ def run_experiment(true_image,
noiser = noise_class(dp_round_size, sensitivity, eps)
if ignore_start_eps and start_with_level <= i:
print_output('Ignoring eps spent', flag=output_flag)


Nit: this is a frightening message; it would be nice to have a bit of extra context here (e.g., "Ignoring epsilon spent expanding first {start_with_level} levels, including current level {i}.").

Author:

done
