Predicting the privacy status of potentially private data items using feature selection algorithms

Hidayet Takci¹
¹Cumhuriyet University, Computer Engineering Department, Sivas, Turkey
Copyright © Hidayet Takci. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Privacy risk analysis requires correctly identifying whether a data element is private, non-private, or potentially private. Although some personal characteristics are inherently sensitive, others acquire sensitivity through their statistical and semantic association with known private variables. In this paper, we propose a feature-selection-based technique that identifies potentially private variables by their relevance to known private variables. Our approach treats each feature in turn as the target variable, applies three filter-based feature selection techniques (chi-square filter, correlation-based feature selection, and fast correlation-based feature selection), generates feature-distance matrices from the ranks these techniques produce, and identifies relevant pairs using a distance threshold. Experiments on the Adult dataset show strong relations among the workclass, occupation, marital-status, relationship, race, native-country, gender, income-class, and age attributes. A subset of relevant features for the income-class attribute predicts with 83% accuracy, compared with 85% for the complete feature set. These findings show that feature selection can support privacy risk assessment by revealing implicit privacy dependencies that are not visible when variables are examined in isolation.

Keywords: data privacy, privacy risk assessment, feature selection, filter methods, feature relevance, Adult dataset
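
The pipeline summarized in the abstract (each feature taken in turn as the target, the remaining features ranked by a filter, a feature-distance matrix built from the ranks, and pairs selected by a threshold) can be illustrated with a minimal sketch. It covers only the chi-square filter and makes several assumptions that are not taken from the paper: the distance between two features is defined here as the mean of their mutual ranks, ties are broken alphabetically, and the function names, toy columns, and threshold value are hypothetical.

```python
from collections import Counter
from itertools import combinations


def chi_square(xs, ys):
    """Pearson chi-square statistic for two categorical columns."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x, count_y = Counter(xs), Counter(ys)
    stat = 0.0
    for x in count_x:
        for y in count_y:
            expected = count_x[x] * count_y[y] / n
            observed = joint.get((x, y), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def rank_features(data, target):
    """Rank every other feature by chi-square score against `target`.

    Rank 1 is the most relevant feature. Ties are broken alphabetically,
    which is an arbitrary choice made for this sketch.
    """
    scores = {f: chi_square(data[f], data[target]) for f in data if f != target}
    ordered = sorted(scores, key=lambda f: (-scores[f], f))
    return {f: rank for rank, f in enumerate(ordered, start=1)}


def distance_matrix(data):
    """Assumed distance: mean of the two mutual ranks of a feature pair."""
    ranks = {target: rank_features(data, target) for target in data}
    return {
        (a, b): (ranks[a][b] + ranks[b][a]) / 2
        for a, b in combinations(sorted(data), 2)
    }


def relevant_pairs(data, threshold):
    """Feature pairs whose distance does not exceed `threshold`."""
    return sorted(p for p, d in distance_matrix(data).items() if d <= threshold)


# Hypothetical toy columns (not the Adult dataset): income is fully
# determined by occupation, while race and gender are independent of both.
data = {
    "occupation": ["a", "a", "b", "b", "a", "a", "b", "b"],
    "income": ["lo", "lo", "hi", "hi", "lo", "lo", "hi", "hi"],
    "race": ["x", "y", "x", "y", "x", "y", "x", "y"],
    "gender": ["m", "m", "m", "m", "f", "f", "f", "f"],
}

print(relevant_pairs(data, threshold=1.0))  # [('income', 'occupation')]
```

In this toy example only the income and occupation columns rank each other first, so they form the single pair within the threshold. A fuller sketch would repeat the ranking with the other two filters and compare the resulting matrices.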