Code for the Location Heatmaps paper. #47
ebagdasa wants to merge 13 commits into google-research:master from
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with `@googlebot I signed it!` so it can be checked again.

What to do if you already signed the CLA: see the Individual signers / Corporate signers sections.
ℹ️ Googlers: Go here for more info.
@googlebot I signed it!
To experiment with the code there is a working [notebook](dp_location_heatmaps.ipynb)
with all the examples from the paper, please don't hesitate to contact the
[author](mailto:[email protected]) or raise an issue.

    image: Any
    level_sample_size: int = 10000
    secagg_round_size: int = 10000
    threshold: float = 0
how about 'split_threshold' to contrast with 'collapse_threshold' below?
    return np.log(2 * self.num_clients / self.lam - 1)

    def get_noise_tensor(self, input_shape):
      return
IIUC, this is unused for RapporNoise and should not be called by users. Shall we raise a NotImplementedError instead of silently returning? Should this method be _get_noise_tensor instead of get_noise_tensor to discourage direct usage, pointing users toward apply_noise instead?
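A minimal sketch of what the suggested change could look like (the `RapporNoise` class name comes from the diff; the body and the renamed `_get_noise_tensor` method here are hypothetical):

```python
class RapporNoise:
    """Sketch only: the real class lives in the PR's noise code."""

    def _get_noise_tensor(self, input_shape):
        # RAPPOR-style randomized response has no additive noise tensor,
        # so fail loudly instead of silently returning None.
        raise NotImplementedError(
            'RapporNoise does not provide a noise tensor; '
            'use apply_noise instead.')
```

The leading underscore plus the explicit error steers users toward `apply_noise` while making accidental calls visible immediately.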
@@ -36,13 +37,32 @@ class Metrics:
    f1: f1 score on the discovered hot spots.
    mutual_info: mutual information metric.
    """
add definition/description of new metrics (mape, smape, maape, nmse)
could also note how the zeros are handled (replaced with the next smallest true value from the image)
updated, added clarification to get_metrics()
    f'MSE: {metric.mse:.2e}')
    ax.imshow(test_image)
    f'MSE: {metric.mse:.2e}', fontsize=30)
    ax.imshow(test_image, interpolation='gaussian')
Why Gaussian interpolation for image display? The statistics calculated and displayed over the image would be different if calculated on the gaussian-interpolated image, wouldn't they?
this is just for visualization, it doesn't impact metrics. Mostly it improves rendering of lines in the contour grid image.
    # print(f'Collapsed: {collapsed}, created when collapsing: {created},' + \
    #       f'new expanded: {fresh_expand},' + \
    #       f'unchanged: {unchanged}, total: {len(new_tree_prefix_list)}')
    if fresh_expand == 0:  # len(new_tree_prefix_list) <= len(tree_prefix_list):
remove or uncomment extraneous debugging code
    return current_image, pos_image, neg_image


def split_regions(tree_prefix_list,
this function is very long with a lot of nested conditions, which makes it hard to read. can we extract some reasonable helpers to improve readability by making the overall structure of the threshold checks and tree traversal more apparent?
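One way the refactor could go (all names here are hypothetical, not from the PR): pull the per-node threshold decision into a small helper so the traversal loop reads top-down.

```python
def classify_node(count, split_threshold, collapse_threshold):
    # Hypothetical helper: decide what to do with one prefix node.
    if count > split_threshold:
        return 'split'
    if count < collapse_threshold:
        return 'collapse'
    return 'keep'


def split_regions_sketch(prefix_counts, split_threshold, collapse_threshold):
    # Simplified traversal over (prefix, count) pairs; the real function
    # also rebuilds the tree and tracks expansion statistics.
    new_prefixes = []
    for prefix, count in prefix_counts:
        action = classify_node(count, split_threshold, collapse_threshold)
        if action == 'split':
            new_prefixes.extend([prefix + '0', prefix + '1'])
        elif action == 'keep':
            new_prefixes.append(prefix)
        # 'collapse' drops the node; the real code re-adds the parent.
    return new_prefixes
```

With the decision logic isolated, the loop itself only expresses tree traversal, which is the structural point of the reviewer's suggestion.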
    return vector


def makeGaussian(image, total_size, fwhm=3, center=None,
    elif level == 98:
      z = 2.326
    else:
      raise ValueError(f'Incorrect confidence level {level}.')
It'd be nicer to just compute the z score analytically, rather than having sparse lookup table. I think the following should do the trick:
from scipy.stats import norm
z = norm.ppf(1-(1-level/100)/2)
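For reference, the same computation also works without SciPy via the standard library's `statistics.NormalDist` (a sketch; the `z_score` function name is mine):

```python
from statistics import NormalDist


def z_score(level):
    """Two-sided z-score for a confidence level given in percent."""
    return NormalDist().inv_cdf(1 - (1 - level / 100) / 2)
```

`z_score(95)` and `z_score(98)` reproduce the 1.96 and 2.326 entries of the lookup table to three decimals, and any other level comes for free.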
def make_step(samples, eps, threshold, partial,
              prefix_len, dropout_rate, tree, tree_prefix_list,
              noiser, quantize, total_size, positivity, count_min):
@samellem please take a look, addressed your comments
samellem left a comment
Looking pretty good! Just a few additional ideas.
Also, it looks like you missed a few comments from the previous review that got auto-collapsed in the GitHub UI:

Most of those were small nits, and some may not be as relevant after your changes, but please do take a look at them if you missed them the first time.
    count_min: use count-min sketch
  Returns:
    new_tree, new_tree_prefix_list, finished
    new_tree, new_tree_prefix_list, fresh_expand
The meaning of fresh_expand is not obvious, especially since this function both collapses and expands. Maybe num_newly_expanded_nodes?
    split_threshold: threshold value used to split the nodes.
    image_bit_level: stopping criteria once the final resolution is reached.
    collapse_threshold: threshold value used to collapse the nodes.
    expand_all: expand all regions,
can we achieve this functionality by just passing split_threshold = -np.inf and eliminate the extra parameter & special-casing?
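A quick sketch of why the sentinel works (`should_split` is a hypothetical stand-in for the comparison inside `split_regions`):

```python
import math


def should_split(node_count, split_threshold):
    # With split_threshold = -math.inf every node passes the check,
    # which reproduces expand_all without an extra flag.
    return node_count > split_threshold
```

Since every finite count compares greater than negative infinity, passing `-inf` as the threshold makes the splitting branch unconditional, so the `expand_all` parameter and its special-casing can be dropped.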
    return current_image, pos_image, neg_image


def update_tree(prefix, tree, tree_prefix_list):
nit: maybe append_to_tree instead of update_tree?
    collapsed = 0
    created = 0
    fresh_expand = 0
    unchanged = 0
collapsed, created, and unchanged do not appear to be used for anything anymore. let's delete them xor do something with them.
printing the results in the end of the function now
    collapsed = 0
    created = 0
    fresh_expand = 0
    unchanged = 0
likewise re. collapsed, created, and unchanged being unused
    return new_tree, new_tree_prefix_list, fresh_expand


def split_regions_aux(tree_prefix_list,
I suspect that even more of these two functions could be shared (in particular, the basic structure of looping over prefixes and adding nodes to the tree as appropriate for the splitting & collapsing criteria), but acknowledge that it may not actually improve readability much more to do further surgery. Please consider sharing that prefix-looping structure, but if you can't see a clean and easy way to do so, that's fine.
yeah, I agree, it's just that I need to look at both bits in data that is hard to unify. Maybe once we go to multiple dimensions we can just unify everything.
def compute_conf_intervals(sum_vector: np.ndarray, level=95):
  from scipy.stats import norm
I'd prefer not to do imports inline like this.
def make_step(samples, eps, threshold, partial,
def create_confidence_interval_condition(last_result, prefix, count, split_threshold):
we're ultimately just returning a boolean here, so maybe evaluate_confidence_interval_condition instead? the current name makes me think that we're returning some kind of predicate function
    quantize: apply quantization to the vectors.
    noise_class: use specific noise, defaults to GeometricNoise.
    save_gif: saves all images as a gif.
    count_min: use count-min sketch.
yeah, thanks a lot and sorry for the missed comments. Should be all good now.
samellem left a comment
approved with some minor suggestions
    binary version of the coordinate.
  """
  x_coord, y_coord = xy_tuple
  if len(xy_tuple) == 2:
I think this comment applies to aux_data now that that is being pulled from xy_tuple.
  """Returns a quad tree first 4 nodes. If aux_data (boolean) provided expands
  to 2 more bits or a specific pos/neg nodes.
  Args:
    aux_data: a boolean to use additional bit for data, e.g. pos/neg.
IMO aux_data sounds like an actual data object rather than a boolean parameter; I would prefer a name like "has_aux_data" or even "has_aux_bit" since a single bit is all that's supported here. This also goes for other usages of "aux_data" as a boolean in other functions, below.
Really it would be ideal to just generalize this to support an arbitrary number of extra bits with an automatic encoding from the value specified in "split", rather than a single extra bit with a predefined 'pos'-->1 and 'neg'-->0 encoding, but I understand that is probably out of scope at present.
I agree, let me change it to has_aux_bit for now, and maybe can expand it later
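The generalization mentioned above could look roughly like this (`encode_aux_bits` is a hypothetical name; the PR itself keeps a single pos/neg bit):

```python
def encode_aux_bits(value, num_bits):
    # Encode a small integer label into a fixed number of binary digits,
    # generalizing the predefined 'pos' -> 1 / 'neg' -> 0 single-bit scheme.
    if not 0 <= value < 2 ** num_bits:
        raise ValueError(f'{value} does not fit in {num_bits} bits.')
    return format(value, f'0{num_bits}b')
```

With `num_bits=1` this reproduces the current behavior (`'1'` for pos, `'0'` for neg), while larger `num_bits` would let a prefix carry an arbitrary small label automatically.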
@@ -227,6 +225,7 @@ def run_experiment(true_image,
    noiser = noise_class(dp_round_size, sensitivity, eps)
    if ignore_start_eps and start_with_level <= i:
      print_output('Ignoring eps spent', flag=output_flag)
Nit: this is a frightening message; it would be nice to have a bit of extra context here (e.g., "Ignoring epsilon spent expanding first {start_with_level} levels, including current level {i}.").
This code demonstrates the ability to build location heatmaps using a
distributed differential privacy mechanism and the proposed adaptive
algorithm. The code accompanies this paper: https://arxiv.org/abs/2111.02356.
It also includes a Google Colab example for the experiments.