@manavsinghal157
Contributor

Part of the Empirical Analysis of Privacy Preserving Learning Project.

This PR introduces a command-line argument that enables aggregated learning: only features that have been seen by at least a minimum threshold of users are saved, which preserves user privacy.

Methodology:

  • For each feature, a 32-bit vector is defined. (vowpalwabbit/array_parameters.h and vowpalwabbit/array_parameters_dense.h)
  • We calculate a 5-bit hash of the example's tag. (vowpalwabbit/parser.cc)
  • For each feature weight updated by a non-zero value, we use the 5-bit hash to look up a bit in the 32-bit vector and set it to 1. (vowpalwabbit/gd_predict.h -> vowpalwabbit/array_parameters.h and vowpalwabbit/array_parameters_dense.h)
  • When saving the weights to a file, we count the number of bits set to 1 for each feature. If the count is greater than the threshold, the weights for that feature are saved. (vowpalwabbit/gd.cc -> vowpalwabbit/array_parameters.h and vowpalwabbit/array_parameters_dense.h)

(The default value of the threshold is 10)
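
The methodology above can be sketched in isolation as follows. This is a minimal, hypothetical illustration and not the code in this PR: the class name `feature_user_tracker`, its methods, and the use of `std::hash` for the tag are all assumptions made for the example (the PR hashes the tag with VW's own hash function).

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the per-feature bookkeeping described above.
class feature_user_tracker
{
public:
  static constexpr std::size_t bitset_size = 32;  // 32 buckets <=> 5-bit tag hash

  explicit feature_user_tracker(std::size_t threshold) : _threshold(threshold)
  {
    // With a threshold of 32 or more, no feature could ever be saved.
    assert(threshold < bitset_size);
  }

  // Reduce an example tag to a bucket in [0, 32). std::hash is an assumption
  // made for this sketch only.
  static std::size_t tag_bucket(const std::string& tag) { return std::hash<std::string>{}(tag) % bitset_size; }

  // Record that the user in `bucket` changed `feature_index` by `update`.
  // Zero-valued updates leave the bitset untouched.
  void on_update(std::uint64_t feature_index, std::size_t bucket, float update)
  {
    if (update != 0.f) { _bits[feature_index].set(bucket % bitset_size); }
  }

  // A feature is saved only if more than `threshold` distinct buckets touched it.
  bool should_save(std::uint64_t feature_index) const
  {
    auto it = _bits.find(feature_index);
    return it != _bits.end() && it->second.count() > _threshold;
  }

private:
  std::size_t _threshold;
  std::unordered_map<std::uint64_t, std::bitset<bitset_size>> _bits;
};
```

Because tags are reduced to 32 buckets, two different users can land in the same bit, so the bit count is a lower bound on the number of distinct users who touched a feature.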

This PR includes:

  • Command-line arguments to activate privacy preservation and set the threshold. (vowpalwabbit/parse_args.cc)
  • Runtests that check the expected output on a small dataset. (test/core.vwtest.json)
  • Unit tests that check the output both when the threshold is reached for a feature and when it is not. (test/unit_test/weights_test.cc)
  • Benchmarks that measure learning time in privacy-preserving mode. (test/benchmarks/standalone/benchmark_text_input.cc)

Implementation details:

--privacy_activation : To activate the feature
--privacy_activation_threshold arg (=10) : To set the threshold

Future Work:

  • Implement the feature for save_resume.
  • Work on aggregations in the online setting.

Wiki page: https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Privacy-Preserving-Learning

@olgavrou olgavrou added this to the VW 9.0 milestone Nov 24, 2021
uint32_t _stride_shift;
bool _seeded; // whether the instance is sharing model state with others
size_t _privacy_activation_threshold;
std::unordered_map<uint64_t, std::bitset<32>> _feature_bitset; // define the bitset for each feature
Collaborator

this could be a unique_ptr set to nullptr unless the privacy mode is on, to be a bit explicit and avoid extra memory allocations

Collaborator

I like that suggestion, making it a shared ptr due to shallow copy fn

Collaborator

done
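
A minimal sketch of what this suggestion could look like, assuming a simplified container: `weights_sketch`, `enable_privacy`, and `shallow_copy` are hypothetical names for illustration, not the actual VW classes or the code that was eventually committed.

```cpp
#include <bitset>
#include <cstdint>
#include <memory>
#include <unordered_map>

using feature_bitset_map = std::unordered_map<std::uint64_t, std::bitset<32>>;

// Hypothetical, simplified weights container; not the actual VW class.
struct weights_sketch
{
  // nullptr unless privacy mode is on, so the default path allocates nothing.
  std::shared_ptr<feature_bitset_map> feature_bitset;

  // Allocate the map the first time privacy mode is requested.
  void enable_privacy()
  {
    if (!feature_bitset) { feature_bitset = std::make_shared<feature_bitset_map>(); }
  }

  bool privacy_enabled() const { return feature_bitset != nullptr; }

  // Shallow copy: both instances point at the same map, which is why a
  // shared_ptr fits here where a unique_ptr would not.
  void shallow_copy(const weights_sketch& other) { feature_bitset = other.feature_bitset; }
};
```

Checking the pointer for null also doubles as the "is privacy mode on" test, so no separate flag has to be kept in sync with the map.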

{
private:
// struct to store the tag hash and if it is set or not
struct tag_hash_info
Member

I wish it could be an optional
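
For illustration only, here is roughly what an optional-based version might look like if C++17 were available. `tag_state` and its members are hypothetical names; the fields of the real `tag_hash_info` struct are not shown in this excerpt, so nothing here mirrors them.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical replacement for a {hash, is_set} struct; requires C++17.
class tag_state
{
public:
  void set_tag(std::uint64_t tag_hash) { _tag_hash = tag_hash; }
  void unset_tag() { _tag_hash.reset(); }
  bool has_tag() const { return _tag_hash.has_value(); }

  // Precondition: has_tag() is true; value() throws std::bad_optional_access otherwise.
  std::uint64_t tag() const { return _tag_hash.value(); }

private:
  // Empty optional means "no tag set", so no separate boolean is needed.
  std::optional<std::uint64_t> _tag_hash;
};
```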

{
  INTERACTIONS::generate_interactions<audit_regressor_data, const uint64_t, audit_regressor_feature, true,
- audit_regressor_interaction, sparse_parameters>(rd.all->interactions, rd.all->extent_interactions,
+ audit_regressor_interaction, sparse_parameters, true>(rd.all->interactions, rd.all->extent_interactions,
Member

Suggested change
- audit_regressor_interaction, sparse_parameters, true>(rd.all->interactions, rd.all->extent_interactions,
+ audit_regressor_interaction, sparse_parameters, true /*privacy_activation*/>(rd.all->interactions, rd.all->extent_interactions,

GD::foreach_feature<std::pair<float, float>, float, vec_add_with_norm, LazyGaussian>(w, all->ignore_some_linear,
all->ignore_linear, all->interactions, all->extent_interactions, all->permutations, *ec, dotwithnorm,
all->_generate_interactions_object_cache);
if (all->privacy_activation)
Member

It seems not great that this is new global state

float weight = 1.f; // a relative importance weight for the example, default = 1
v_array<char> tag; // An identifier for the example.
size_t example_counter = 0;
uint64_t tag_hash; // Storing the hash of the tag for privacy preservation learning
Member

Initialize

Comment on lines +260 to +277
if (b.all->weights.sparse && privacy_activation)
{
  b.all->weights.sparse_weights.set_tag(
      hashall(ec.tag.begin(), ec.tag.size(), b.all->hash_seed) % b.all->feature_bitset_size);
  GD::foreach_feature<ftrl_update_data, inner_update_proximal>(*b.all, ec, b.data);
  b.all->weights.sparse_weights.unset_tag();
}
else if (!b.all->weights.sparse && privacy_activation)
{
  b.all->weights.dense_weights.set_tag(
      hashall(ec.tag.begin(), ec.tag.size(), b.all->hash_seed) % b.all->feature_bitset_size);
  GD::foreach_feature<ftrl_update_data, inner_update_proximal>(*b.all, ec, b.data);
  b.all->weights.dense_weights.unset_tag();
}
else
{
  GD::foreach_feature<ftrl_update_data, inner_update_proximal>(*b.all, ec, b.data);
}
Member

This is a massive uptick in complexity when just doing an update. Can this be abstracted?
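
One possible shape for such an abstraction, sketched under assumptions: an RAII guard pairs `set_tag` with `unset_tag`, and a small helper wraps the update. `weights_like`, `scoped_tag`, and `update_with_optional_tag` are hypothetical names; only `set_tag`/`unset_tag` come from the diff above, and their real signatures may differ.

```cpp
#include <cstdint>

// Hypothetical weights type exposing the set_tag/unset_tag pair from the diff.
struct weights_like
{
  std::uint64_t tag = 0;
  bool tag_set = false;
  void set_tag(std::uint64_t t) { tag = t; tag_set = true; }
  void unset_tag() { tag_set = false; }
};

// RAII guard: sets the tag on construction and always clears it on scope exit,
// including when the update throws.
template <class WeightsT>
class scoped_tag
{
public:
  scoped_tag(WeightsT& weights, std::uint64_t tag) : _weights(weights) { _weights.set_tag(tag); }
  ~scoped_tag() { _weights.unset_tag(); }
  scoped_tag(const scoped_tag&) = delete;
  scoped_tag& operator=(const scoped_tag&) = delete;

private:
  WeightsT& _weights;
};

// Runs `update` exactly once, wrapped in the tag guard only when privacy is active.
template <class WeightsT, class UpdateFn>
void update_with_optional_tag(WeightsT& weights, bool privacy_activation, std::uint64_t tag, UpdateFn&& update)
{
  if (privacy_activation)
  {
    scoped_tag<WeightsT> guard(weights, tag);
    update();
  }
  else { update(); }
}
```

With a helper along these lines, each call site would need one sparse/dense dispatch instead of three near-identical branches, and the `foreach_feature` call would appear only once, inside the callable.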

{
// iterate through one namespace (or its part), callback function FuncT(some_data_R, feature_value_x, feature_index)
- template <class DataT, void (*FuncT)(DataT&, float feature_value, uint64_t feature_index), class WeightsT>
+ template <class DataT, void (*FuncT)(DataT&, float feature_value, uint64_t feature_index), bool privacy_activation>
Member

privacy_activation seems unused?

@olgavrou
Collaborator

Replaced by #3334

@olgavrou olgavrou closed this Nov 29, 2021