Privacy-preserving distributed statistics with Emnet

Current health data re-use landscape

Health data reuse, meaning the use of data for purposes other than those for which they were initially collected, has enormous potential to improve healthcare quality [1]. However, it raises legitimate privacy concerns among different stakeholders.

The reuse of health data is governed by legal and ethical frameworks that cover aspects such as privacy rights, data protection regulations, and duties of confidentiality [1]. Legal measures such as the GDPR create challenges for the effective reuse of health data [2][3] and require the adoption of data reuse techniques that protect the security and privacy of the people and organisations represented by these data [4].

Emnet: Tackling data re-use challenges with distributed analytics

The common approach to privacy-preserving processing of data distributed across multiple data sources is centralized collection, where each data source de-identifies its data before disclosing them. De-identification requires balancing data utility against privacy: strong privacy protection necessitates significant alterations to the data, which considerably reduce their utility [5].

In ASCLEPIOS, we have designed a privacy-preserving tool, called Emnet, for computing statistics on the combined data of multiple data sources. It is based on an emerging approach called privacy-preserving distributed data mining (PPDDM), also known as privacy-preserving distributed statistical computation (PPDSC) [6][7][8][9]. PPDSC addresses the problem of running statistical algorithms on confidential data divided across two or more data sources without allowing any party to view the private data of another data source. The approach reveals statistics generated from the combined data of a group of data sources without disclosing any sensitive information about the individual inputs.
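To make the idea concrete, the following is a minimal sketch of one textbook PPDSC building block, a secure sum based on additive secret sharing. It is an illustration only: the text above does not say which protocol Emnet uses, and the names `MODULUS`, `make_shares`, and `secure_sum` are hypothetical. The sketch also simulates all parties in a single process, whereas a real deployment would exchange shares over a network.

```python
import secrets

# Hypothetical public modulus; it must exceed any total that can occur.
MODULUS = 2**61 - 1


def make_shares(value, n_parties, modulus=MODULUS):
    """Split `value` into n_parties random shares that sum to it modulo `modulus`."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % modulus
    return shares + [last]


def secure_sum(private_values, modulus=MODULUS):
    """Simulate a secure sum: no party ever sees another party's raw value."""
    n = len(private_values)
    # Each data source splits its private value; share j is sent to party j.
    all_shares = [make_shares(v, n, modulus) for v in private_values]
    # Each party adds the shares it received (one from every data source).
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % modulus for j in range(n)
    ]
    # Only these partial sums are published; together they give the total.
    return sum(partial_sums) % modulus


# Three hypothetical data sources, each holding a private record count.
hospital_counts = [120, 75, 310]
print(secure_sum(hospital_counts))  # prints 505
```

Each individual share is a uniformly random number, so a party that sees one share learns nothing about the value it came from; only the combined total is revealed.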

The current implementation of the tool supports statistics such as count, variance, covariance, ratio, mean, percentile (e.g., median), minimum, and maximum. Future releases will support more statistics, including standard deviation and Pearson’s r.
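Several of these statistics can be derived from a few per-source aggregates, which is one reason they suit distributed computation. The sketch below assumes each data source contributes only its record count, sum, and sum of squares; the `Aggregate` type and the helper functions are hypothetical and do not describe Emnet's actual interface. In a privacy-preserving setting, the three sums would themselves be combined with a secure protocol rather than shared in the clear as they are here.

```python
from dataclasses import dataclass


@dataclass
class Aggregate:
    n: int           # number of records
    total: float     # sum of the values
    total_sq: float  # sum of the squared values


def local_aggregate(values):
    """Computed by each data source on its own records."""
    return Aggregate(len(values), sum(values), sum(v * v for v in values))


def combine(aggregates):
    """Add the per-source aggregates component by component."""
    return Aggregate(
        sum(a.n for a in aggregates),
        sum(a.total for a in aggregates),
        sum(a.total_sq for a in aggregates),
    )


def mean(agg):
    return agg.total / agg.n


def variance(agg):
    """Population variance: E[x^2] - (E[x])^2."""
    m = mean(agg)
    return agg.total_sq / agg.n - m * m


# Two hypothetical sites with private measurements.
site_a = local_aggregate([2.0, 4.0])
site_b = local_aggregate([6.0, 8.0])
pooled = combine([site_a, site_b])
print(pooled.n, mean(pooled), variance(pooled))  # prints 4 5.0 5.0
```

Standard deviation follows as the square root of the variance. Order statistics such as the median, minimum, and maximum do not decompose into sums in this way and need different techniques.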

Emnet maintains the access control of the data sources and can scale to a large number of data sources and records, which has been a challenge for the widespread adoption of PPDDM tools.

References:

[1] The Nuffield Council on Bioethics (NCOB). The collection, linking and use of data in biomedical research and health care: ethical issues. The Nuffield Council on Bioethics (NCOB); 2015.

[2] Kobayashi S, Kane TB, Paton C. The privacy and security implications of open data in healthcare. Yearb Med Inform. 2018.

[3] Malin B, Goodman K, Section SE for the IYS. Between Access and Privacy: Challenges in Sharing Health Data. Yearb Med Inform. 2018.

[4] Holmes J, Soualmia L, Séroussi B. A 21st century embarrassment of riches: the balance between health data access, usage, and sharing. Yearb Med Inform. 2018.

[5] Dankar FK, El Emam K, Neisa A, Roffey T. Estimating the re-identification risk of clinical data sets. BMC Med Inform Decis Mak. 2012;12:66.

[6] Aldeen YAAS, Salleh M, Razzaque MA. A comprehensive review on privacy preserving data mining. SpringerPlus. 2015.

[7] Aggarwal CC, Yu PS. A general survey of privacy-preserving data mining models and algorithms. Privacy-preserving data mining. New York: Springer; 2008.

[8] Kantarcioglu M. A survey of privacy-preserving methods across horizontally partitioned data. Privacy-preserving data mining. New York: Springer; 2008.

[9] Vaidya J. A survey of privacy-preserving methods across vertically partitioned data. Privacy-preserving data mining. New York: Springer; 2008.
