Classification mining algorithms can predict the categories or labels of unseen data. In database terms, this practice can infer or approximate a functional dependency between the target column (categories/labels) and other columns. This has an impact on database security if the target columns, or the values of target columns, are protected sensitive data elements.
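As a rough illustration of this kind of inference (not one of the algorithms from the paper under review), the following minimal Python sketch learns the majority label for each combination of visible column values and uses it to guess a protected value. The table, the column names, and the helper functions `fit_region_classifier` and `predict` are all hypothetical, chosen only for the example.

```python
from collections import Counter, defaultdict


def fit_region_classifier(rows, target):
    """Map each combination of non-target values to its majority target value."""
    votes = defaultdict(Counter)
    for row in rows:
        key = tuple(sorted((k, v) for k, v in row.items() if k != target))
        votes[key][row[target]] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}


def predict(model, row, target):
    """Return the learned label for this combination, or None if it was never seen."""
    key = tuple(sorted((k, v) for k, v in row.items() if k != target))
    return model.get(key)


# Hypothetical toy table: "diagnosis" plays the role of the protected column.
visible = [
    {"dept": "oncology", "ward": 3, "diagnosis": "cancer"},
    {"dept": "oncology", "ward": 3, "diagnosis": "cancer"},
    {"dept": "cardiology", "ward": 1, "diagnosis": "arrhythmia"},
]
model = fit_region_classifier(visible, "diagnosis")

# A tuple whose diagnosis is withheld can still be guessed from its other columns.
hidden = {"dept": "oncology", "ward": 3}
guess = predict(model, hidden, "diagnosis")
```

Even this crude lookup approximates a dependency from the visible columns to the protected one, which is the kind of disclosure risk an evaluation algorithm would need to quantify.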
In this paper, the authors first review the evaluation algorithm Exact_OB1, presented by Johnsten and Raghavan, which assesses the risk of disclosure of protected data with respect to decision-region-based classification algorithms. They then introduce the evaluation algorithm Exact_OB2 for extended decision-region-based classification algorithms. Because the execution time of Exact_OB2 is potentially high, they also present an alternative algorithm, APPROX_EVAL, which approximates the exact evaluation performed by Exact_OB2. The experimental results suggest that APPROX_EVAL provides an effective and efficient evaluation of the disclosure risk of protected data. Throughout the paper, a decision tree is used to illustrate both the risk of disclosure of protected data and the effect of the security policies implemented. It is worth noting that all of these evaluation algorithms are applicable only when the data elements (tuples) are integers or categorical values; they are not applicable to continuous data elements.
The goal of implementing security policies is to remove unauthorized inferences from the data, so that classification mining algorithms cannot correctly infer or predict the values of protected data elements. However, implementing such policies may also remove some legitimate inferences. Data miners need to be aware of which dependencies have been removed, and to be much more careful about their findings when working with sanitized data. This also raises an interesting question: What is the impact of security policy on data mining?
This paper is of interest to both the database security and data mining communities. Some of the terms used in the paper are not clearly explained; readers would do well to read Johnsten and Raghavan’s paper first.