# Publications

Submitted
Hao Wang, Hsiang Hsu, Mario Diaz, and Flavio P. Calmon. Submitted. “To Split or Not to Split: The Impact of Disparate Treatment in Classification.” In . ArxivAbstract
Disparate treatment occurs when a machine learning model produces different decisions for groups defined by a legally protected or sensitive attribute (e.g., race, gender). In domains where prediction accuracy is paramount, it is acceptable to fit a model which exhibits disparate treatment. We explore the effect of splitting classifiers (i.e., training and deploying a separate classifier on each group) and derive an information-theoretic impossibility result: there exists precise conditions where a group-blind classifier will always have a non-trivial performance gap from the split classifiers. We further demonstrate that, in the finite sample regime, splitting is no longer always beneficial and relies on the number of samples from each group and the complexity of the hypothesis class. We provide data-dependent bounds for understanding the effect of splitting and illustrate these bounds on  real-world datasets.
2020
Mario Diaz, Hao Wang, Flavio P. Calmon, and Lalitha Sankar. 4/2020. “On the Robustness of Information-Theoretic Privacy Measures and Mechanisms.” IEEE Transactions on Information Theory, 66, 4, Pp. 1949 - 1978. ArxivAbstract
Consider a data publishing setting for a dataset composed by both private and non-private features. The publisher uses an empirical distribution, estimated from n i.i.d. samples, to design a privacy mechanism which is applied to new fresh samples afterward. In this paper, we study the discrepancy between the privacy-utility guarantees for the empirical distribution, used to design the privacy mechanism, and those for the true distribution, experienced by the privacy mechanism in practice. We first show that, for any privacy mechanism, these discrepancies vanish at speed $O(1/\sqrt{n})$ with high probability. These bounds follow from our main technical results regarding the Lipschitz continuity of the considered information leakage measures. Then we prove that the optimal privacy mechanisms for the empirical distribution approach the corresponding mechanisms for the true distribution as the sample size n increases, thereby establishing the statistical consistency of the optimal privacy mechanisms. Finally, we introduce and study uniform privacy mechanisms which, by construction, provide privacy to all the distributions within a neighborhood of the estimated distribution and, thereby, guarantee privacy for the true distribution with high probability.
2019
Hao Wang, Lisa Vo, Flavio P. Calmon, Muriel Médard, Ken R. Duffy, and Mayank Varia. 12/2019. “Privacy with Estimation Guarantees.” IEEE Transactions on Information Theory, 65, 12, Pp. 8025 - 8042. ArxivAbstract
We study the central problem in data privacy: how to share data with an analyst while providing both privacy and utility guarantees to the user that owns the data. In this setting, we present an estimation-theoretic analysis of the privacy-utility trade-off (PUT). Here, an analyst is allowed to reconstruct (in a mean-squared error sense) certain functions of the data (utility), while other private functions should not be reconstructed with distortion below a certain threshold (privacy). We demonstrate how chi-square information captures the fundamental PUT in this case and provide bounds for the best PUT. We propose a convex program to compute privacy-assuring mappings when the functions to be disclosed and hidden are known a priori and the data distribution is known. We derive lower bounds on the minimum mean-squared error of estimating a target function from the disclosed data and evaluate the robustness of our approach when an empirical distribution is used to compute the privacy-assuring mappings instead of the true data distribution. We illustrate the proposed approach through two numerical experiments.
Hao Wang, Mario Diaz, José Cândido S. Santos Filho, and Flavio P. Calmon. 2019. “An Information-Theoretic View of Generalization via Wasserstein Distance.” In Proc. IEEE Int. Symp. on Inf. Theory (ISIT).Abstract
We capitalize on the Wasserstein distance to obtain two information-theoretic bounds on the generalization error of learning algorithms. First, we specialize the Wasserstein distance into total variation, by using the discrete metric. In this case we derive a generalization bound and, from a strong data-processing inequality, show how to narrow the bound by adding Gaussian noise to the output hypothesis. Second, we consider the Wasserstein distance under a generic metric. In this case we derive a generalization bound by exploiting the geometric nature of the Kantorovich-Rubinstein duality theorem. We illustrate the use of these bounds with examples. Our bounds can handle certain cases in which existing bounds via mutual information fail.
Hao Wang, Berk Ustun, and Flavio P. Calmon. 2019. “Repairing without Retraining: Avoiding Disparate Impact with Counterfactual Distributions.” In Proc. International Conference on Machine Learning (ICML). ArxivAbstract
When the performance of a machine learning model varies over groups defined by sensitive attributes (e.g., gender or ethnicity), the performance  disparity can be expressed in terms of the probability distributions of the input and output variables over each group. In this paper, we exploit this fact to reduce the disparate impact of a fixed classification model over a population of interest. Given a black-box classifier, we aim to eliminate the performance gap by perturbing the distribution of input variables for the disadvantaged group. We refer to the perturbed distribution as a counterfactual distribution, and characterize its properties for common fairness criteria. We introduce a descent algorithm to learn a counterfactual distribution from data. We then discuss how the estimated distribution can be used to build a data preprocessor that can reduce disparate impact without training a new model. We validate our approach through experiments on real-world datasets, showing that it can repair different forms of disparity without a significant drop in accuracy.
2018
Hao Wang, Berk Ustun, and Flavio P. Calmon. 2018. “Avoiding Disparate Impact with Counterfactual Distributions.” In NeurIPS Workshop on Ethical, Social and Governance Issues in AI.Abstract
When a classification model is used to make predictions on individuals, it may be undesirable or illegal for the performance of the model to change with respect to a sensitive attribute such as race or gender. In this paper, we aim to evaluate and mitigate such disparities in model performance through a distributional approach. Given a black-box classifier that performs unevenly across sensitive groups, we consider a counterfactual distribution of input variables that minimizes the performance gap. We characterize properties of counterfactual distributions for common fairness criteria. We then present novel machinery to efficiently recover counterfactual distributions given a sample of points from its target population. We describe how counterfactual distributions can be used to avoid discrimination between protected groups by: (i) identifying proxy variables to omit in training; and (ii) building a preprocessor that can mitigate discrimination. We validate both use cases through experiments on a real-world dataset.
Hao Wang, Berk Ustun, and Flavio P. Calmon. 2018. “On the Direction of Discrimination: An Information-Theoretic Analysis of Disparate Impact in Machine Learning.” In Proc. IEEE Int. Symp. on Inf. Theory (ISIT). ArxivAbstract
In the context of machine learning, disparate impact refers to a form of systematic discrimination whereby the output distribution of a model depends on the value of a sensitive attribute (e.g., race or gender). In this paper, we propose an information-theoretic framework to analyze the disparate impact of a binary classification model. We view the model as a fixed channel, and quantify disparate impact as the divergence in output distributions over two groups. Our aim is to find a correction function that can perturb the input distributions of each group to align their output distributions. We present an optimization problem that can be solved to obtain a correction function that will make the output distributions statistically indistinguishable. We derive closed-form expressions to efficiently compute the correction function, and demonstrate the benefits of our framework on a recidivism prediction problem based on the ProPublica COMPAS dataset.
Hao Wang, Mario Diaz, Flavio P. Calmon, and Lalitha Sankar. 2018. “The Utility Cost of Robust Privacy Guarantees.” In Proc. IEEE Int. Symp. on Inf. Theory (ISIT). ArxivAbstract
Consider a data publishing setting for a data set with public and private features. The objective of the publisher is to maximize the amount of information about the public features in a revealed data set, while keeping the information leaked about the private features bounded. The goal of this paper is to analyze the performance of privacy mechanisms that are constructed to match the distribution learned from the data set. Two distinct scenarios are considered: (i) mechanisms are designed to provide a privacy guarantee for the learned distribution; and (ii) mechanisms are designed to provide a privacy guarantee for every distribution in a given neighborhood of the learned distribution. For the first scenario, given any privacy mechanism, upper bounds on the difference between the privacy-utility guarantees for the learned and true distributions are presented. In the second scenario, upper bounds on the reduction in utility incurred by providing a uniform privacy guarantee are developed.
2017
Hao Wang and Flavio P. Calmon. 2017. “An Estimation-Theoretic View of Privacy.” In Proc. 55th Annual Allerton Conference on Communication, Control, and Computing. LinkAbstract

We study the central problem in data privacy: how to share data with an analyst while providing both privacy and utility guarantees to the user that owns the data. We present an estimation-theoretic analysis of the privacy-utility trade-off (PUT) in this setting. Here, an analyst is allowed to reconstruct (in a mean-squared error sense) certain functions of the data (utility), while other private functions should not be recon- structed with distortion below a certain threshold (privacy). We demonstrate how a chi-square-based information measure captures the fundamental PUT, and characterize several properties of this function. In particular, we give a sharp bound for the PUT. We then propose a convex program to compute privacy-assuring mappings when the functions to be disclosed and hidden are known a priori. Finally, we evaluate the robustness of our approach to finite samples.