Volume 2022 (2022): Issue 1 (January 2022)
Journal Details
Format: Journal
First Published: 16 Apr 2015
Publication timeframe: 4 times per year
Languages: English
Access type: Open Access

Privacy-Preserving High-dimensional Data Collection with Federated Generative Autoencoder

Published Online: 20 Nov 2021
Page range: 481–500
Received: 31 May 2021
Accepted: 16 Sep 2021
Abstract

Business intelligence and AI services often involve the collection of copious amounts of multidimensional personal data. Since these data usually contain sensitive information about individuals, direct collection can lead to privacy violations. Local differential privacy (LDP) is currently considered a state-of-the-art solution for privacy-preserving data collection. However, existing LDP algorithms are not applicable to high-dimensional data, not only because of the increased computation and communication costs, but also because of poor data utility.
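
For context, the ε-local-differential-privacy guarantee referenced in the abstract follows the standard definition (stated here as background; the paper's own notation may differ):

\[
\Pr[\mathcal{M}(x) = y] \;\le\; e^{\epsilon} \cdot \Pr[\mathcal{M}(x') = y]
\quad \text{for all inputs } x, x' \text{ and all outputs } y,
\]

where \(\mathcal{M}\) is the randomized mechanism run locally by each user; a smaller ε makes any two inputs harder to distinguish from the output and thus gives stronger privacy.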

In this paper, we aim to address the curse-of-dimensionality problem in LDP-based high-dimensional data collection. Based on the ideas of machine learning and data synthesis, we propose DP-Fed-Wae, an efficient privacy-preserving framework for collecting high-dimensional categorical data. By combining a generative autoencoder, federated learning, and differential privacy, our framework is capable of privately learning the statistical distributions of local data and generating high-utility synthetic data on the server side without revealing users' private information. We have evaluated the framework in terms of data utility and privacy protection on a number of real-world datasets containing 68–124 classification attributes. We show that our framework outperforms the LDP-based baseline algorithms in capturing joint distributions and correlations of attributes and generating high-utility synthetic data. With a local privacy guarantee ε = 8, machine learning models trained with the synthetic data generated by the baseline algorithms incur an accuracy loss of 10%–30%, whereas with our framework the accuracy loss is reduced to less than 3%, and in the best case to less than 1%. Extensive experimental results demonstrate the capability and efficiency of our framework in synthesizing high-dimensional data while striking a satisfactory utility-privacy balance.
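
To make the described workflow concrete, the following is a minimal, self-contained NumPy sketch of the kind of pipeline the abstract outlines: clients take a local training step on a generative autoencoder, clip and perturb their model updates before upload, the server aggregates the noisy updates federated-averaging style, and then synthesizes data by decoding samples drawn from the latent prior. The linear autoencoder, the Gaussian perturbation, and all names and hyperparameters here are illustrative assumptions, not the paper's actual DP-Fed-Wae design.

```python
import numpy as np

rng = np.random.default_rng(0)
D, Z = 16, 4   # data dimension (e.g. one-hot encoded categories) and latent dimension

def init_model():
    # Linear encoder (D x Z) and decoder (Z x D); a toy stand-in for the paper's
    # generative (Wasserstein) autoencoder.
    return {"enc": rng.normal(0, 0.1, (D, Z)), "dec": rng.normal(0, 0.1, (Z, D))}

def local_update(model, x, lr=0.1):
    # One gradient step on the reconstruction loss ||x - x W_enc W_dec||^2
    # (constant factors folded into the learning rate).
    h = x @ model["enc"]                     # latent codes
    err = h @ model["dec"] - x               # reconstruction error
    grad_dec = h.T @ err / len(x)
    grad_enc = x.T @ (err @ model["dec"].T) / len(x)
    return {"enc": -lr * grad_enc, "dec": -lr * grad_dec}

def privatize(update, clip=1.0, sigma=0.5):
    # Clip the whole update to L2 norm `clip`, then add Gaussian noise before
    # upload (illustrative mechanism; the paper's local privacy step may differ).
    flat = np.concatenate([u.ravel() for u in update.values()])
    flat *= min(1.0, clip / (np.linalg.norm(flat) + 1e-12))
    flat += rng.normal(0, sigma * clip, flat.shape)
    out, i = {}, 0
    for k, u in update.items():
        out[k] = flat[i:i + u.size].reshape(u.shape)
        i += u.size
    return out

# One federated round over a few simulated clients holding categorical rows.
global_model = init_model()
client_data = [rng.integers(0, 2, (32, D)).astype(float) for _ in range(5)]
updates = [privatize(local_update(global_model, x)) for x in client_data]
for k in global_model:                       # FedAvg-style aggregation of noisy updates
    global_model[k] += np.mean([u[k] for u in updates], axis=0)

# Server-side synthesis: decode samples drawn from the latent prior.
z = rng.normal(0, 1, (10, Z))
synthetic = (z @ global_model["dec"] > 0.5).astype(int)
print(synthetic.shape, synthetic[:2])
```

In the actual framework, the perturbed local updates would be calibrated to satisfy the stated ε-LDP guarantee, and the server-side decoder would be a trained generative model rather than this toy linear map.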

Keywords

