Statistical Tradeoffs between Generalization and Suppression in the De-identification of Large-Scale Data Sets

Citation:

Angiuli O, Waldo J. Statistical Tradeoffs between Generalization and Suppression in the De-identification of Large-Scale Data Sets. In: IEEE 40th Annual Computer Software and Applications Conference (COMPSAC). Vol 2. IEEE; 2016:589-593.

Abstract:

Data sets containing private information about individuals must satisfy privacy standards before being publicly released. One such standard, k-anonymity, reduces the probability of re-identifying individuals by requiring that every combination of personally identifiable attributes appearing in the data be shared by at least k distinct individuals. Records that violate this standard must be altered, which can significantly distort the statistical properties of the data set. In this paper, we discuss improvements to two techniques used to achieve k-anonymity, generalization and suppression, that confer k-anonymity while better preserving the statistical properties of an educational data set taken from edX, a massive open online course (MOOC) platform.
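
The terms in the abstract can be illustrated with a minimal sketch. It is not the authors' method or code. The records, the attribute names (`age`, `country`), the ten-year age bins, and all function names are hypothetical and chosen only for illustration. Generalization coarsens a value so that more records share it; suppression deletes the records that still fall in groups smaller than k.

```python
from collections import Counter


def qi_key(record, quasi_identifiers):
    """Combination of quasi-identifier values that could single out a person."""
    return tuple(record[q] for q in quasi_identifiers)


def violating_records(records, quasi_identifiers, k):
    """Records whose quasi-identifier combination is shared by fewer than k records."""
    counts = Counter(qi_key(r, quasi_identifiers) for r in records)
    return [r for r in records if counts[qi_key(r, quasi_identifiers)] < k]


def generalize_age(record, width=10):
    """Generalization: replace an exact age with a coarser range (23 becomes '20-29')."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


def suppress(records, quasi_identifiers, k):
    """Suppression: drop every record that violates k-anonymity."""
    counts = Counter(qi_key(r, quasi_identifiers) for r in records)
    return [r for r in records if counts[qi_key(r, quasi_identifiers)] >= k]


# Hypothetical toy data, not drawn from the edX data set.
QI = ("age", "country")
K = 2
records = [
    {"age": 23, "country": "US"},
    {"age": 27, "country": "US"},
    {"age": 31, "country": "US"},
    {"age": 38, "country": "US"},
    {"age": 45, "country": "FR"},
]

# Suppression alone: every record is unique, so nothing survives.
suppressed_only = suppress(records, QI, K)

# Generalize first, then suppress what still violates k-anonymity.
generalized = [generalize_age(r) for r in records]
released = suppress(generalized, QI, K)

print(len(suppressed_only), len(released))
```

In this toy case, suppression alone discards all five records, while generalizing ages first leaves only one record to suppress, at the cost of less precise age values. That is the kind of tradeoff between the two techniques the abstract refers to.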
