Measuring the impact of anonymization on real-world consolidated health datasets engineered for secondary research use: Experiments in the context of MODELHealth project

Pitoglou, Stavros and Filntisi, Arianna and Anastasiou, Athanasios and Matsopoulos, George K. and Koutsouris, Dimitrios (2022) Measuring the impact of anonymization on real-world consolidated health datasets engineered for secondary research use: Experiments in the context of MODELHealth project. Frontiers in Digital Health, 4. ISSN 2673-253X


Abstract

Introduction: Electronic Health Records (EHRs) are essential data structures: they enable the sharing of valuable medical care information for a diverse patient population and can be reused as input to predictive models for clinical research. However, issues such as the heterogeneity of EHR data and the potential compromise of patient privacy inhibit the secondary use of EHR data in clinical research.

Objectives: This study aims to present the main elements of the MODELHealth project implementation and the evaluation method that was followed to assess the efficiency of its mechanism.

Methods: The MODELHealth project was implemented as an Extract-Transform-Load system that collects data from the hospital databases, harmonizes them to the HL7 FHIR standard and anonymizes them using the k-anonymity method before loading the transformed data into a central repository. The integrity of the anonymization process was validated with a purpose-built database query tool. The information loss caused by the anonymization was estimated with the metrics of generalized information loss, discernibility and average equivalence class size for various values of k.
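
As a rough illustration of two of the metrics named above, the following Python sketch groups records into equivalence classes by their quasi-identifier values, checks k-anonymity, and computes discernibility and average equivalence class size. It is not the project's query tool or evaluation code. The function names, the toy records and the choice of quasi-identifiers are hypothetical, and it assumes the common textbook definitions: discernibility as the sum of squared class sizes with no suppressed records, and average class size as the number of records divided by the number of classes (some formulations divide further by k).

```python
from collections import Counter


def equivalence_classes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def is_k_anonymous(records, quasi_identifiers, k):
    """A table is k-anonymous if every equivalence class holds at least k records."""
    classes = equivalence_classes(records, quasi_identifiers)
    return all(size >= k for size in classes.values())


def discernibility(records, quasi_identifiers):
    """Sum of squared class sizes (assumes no suppressed records)."""
    classes = equivalence_classes(records, quasi_identifiers)
    return sum(size * size for size in classes.values())


def average_class_size(records, quasi_identifiers):
    """Number of records divided by the number of equivalence classes."""
    classes = equivalence_classes(records, quasi_identifiers)
    if not classes:
        raise ValueError("cannot compute an average over an empty table")
    return len(records) / len(classes)


# Hypothetical, already-generalized toy records (not data from the study).
records = [
    {"age": "30-39", "zip": "115**", "diagnosis": "A"},
    {"age": "30-39", "zip": "115**", "diagnosis": "B"},
    {"age": "40-49", "zip": "116**", "diagnosis": "C"},
    {"age": "40-49", "zip": "116**", "diagnosis": "A"},
    {"age": "40-49", "zip": "116**", "diagnosis": "B"},
]
QI = ["age", "zip"]  # assumed quasi-identifiers

print(is_k_anonymous(records, QI, 2))   # True: classes have sizes 2 and 3
print(is_k_anonymous(records, QI, 3))   # False: one class has only 2 records
print(discernibility(records, QI))      # 2*2 + 3*3 = 13
print(average_class_size(records, QI))  # 5 / 2 = 2.5
```

Under these definitions, larger equivalence classes raise both discernibility and average class size, which matches the expectation that the metrics grow with k. Generalized information loss is omitted here because it additionally requires the generalization hierarchy or value range of each attribute.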

Results: The average values of generalized information loss, discernibility and average equivalence class size obtained across all tested datasets and k values were 0.008473 ± 0.006216252886, 115,145,464.3 ± 79,724,196.11 and 12.1346 ± 6.76096647, respectively. As expected, the values of these metrics appear to be correlated with factors such as the k value and the dataset characteristics.

Conclusion: The experimental results of the study demonstrate that it is feasible to perform effective harmonization and anonymization on EHR data while preserving essential patient information.

Item Type: Article
Subjects: SCI Archives > Multidisciplinary
Depositing User: Managing Editor
Date Deposited: 12 Jan 2023 06:42
Last Modified: 13 Aug 2024 06:26
URI: http://science.classicopenlibrary.com/id/eprint/1070
