Measuring data utility

ARX is able to classify the complete solution space and automatically determine a solution that is optimal in terms of a given general-purpose data utility measure. ARX supports a broad spectrum of such measures.

Utility measures may either be based on equivalence classes (called single-dimensional) or based on the individual information loss of each attribute (called multi-dimensional). In ARX, the latter type of measure can be parameterized with different aggregate functions, which define how the individual measures for each attribute are combined into a global measure for the overall dataset. We currently support the following aggregate functions:

  1. Rank: Ordered list of the utilities of all attributes, which will be compared lexicographically. This is the recommended default aggregate function.
  2. Geometric mean: The geometric mean of the utilities of all attributes.
  3. Arithmetic mean: The arithmetic mean of the utilities of all attributes.
  4. Sum: The sum of the utilities of all attributes.
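
The four aggregate functions can be illustrated with a minimal Python sketch. It is not ARX code: the function name aggregate, the mode strings, and the example values are hypothetical. It also assumes that each attribute contributes a loss value where lower is better, and that the rank list is sorted with the worst attribute first, which the description above does not specify.

```python
import math

def aggregate(losses, how="rank"):
    """Combine per-attribute loss values (lower is better) into one score."""
    if not losses:
        raise ValueError("need at least one attribute")
    if how == "rank":
        # Assumption: worst attribute first. Python compares tuples
        # lexicographically, which matches the comparison described above.
        return tuple(sorted(losses, reverse=True))
    if how == "geometric_mean":
        return math.prod(losses) ** (1.0 / len(losses))
    if how == "arithmetic_mean":
        return sum(losses) / len(losses)
    if how == "sum":
        return sum(losses)
    raise ValueError(f"unknown aggregate function: {how}")

# Two hypothetical candidate solutions with the same total loss.
a = [0.2, 0.5, 0.1]
b = [0.4, 0.3, 0.1]
print(aggregate(a, "sum"), aggregate(b, "sum"))
print(aggregate(b, "rank") < aggregate(a, "rank"))
```

In this example the sum cannot separate the two candidates, while the rank comparison (under the ordering assumed here) prefers b because its worst attribute loses less information.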

Multi-dimensional measures may be further parameterized with attribute weights. This makes it possible to reduce the degree of generalization that will be applied to attributes that are important to your use cases. ARX implements the following multi-dimensional measures:

  1. Loss: This measure summarizes the coverage of the domain of an attribute. ARX implements sophisticated methods for measuring this coverage with functional representations of generalization rules. Moreover, the variant implemented by ARX may be further parameterized to influence the degree of generalization and suppression that will be applied to a dataset. More information can be found here. This is the recommended default measure for data utility.
  2. Non-uniform entropy: This metric measures information loss based on the loss of entropy, i.e., information content. To this end, it utilizes the concept of mutual information to quantify the amount of information which can be obtained about the original variables in the input dataset by observing the variables in the output dataset. The non-uniform variant implemented in ARX is described here.
  3. Height: This metric measures information loss based on the sum of the generalization levels applied to all quasi-identifiers. It is independent of the actual input dataset. More information can be found here.
  4. Precision: This metric measures information loss based on the normalized generalization levels summarized for all values of all quasi-identifiers. More information can be found here.
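
The two level-based measures above, Height and Precision, can be sketched in a few lines of Python. This is an illustration rather than ARX's implementation: the function names, attribute names, and hierarchy heights are hypothetical. It also assumes that every value of an attribute is generalized to the same level and that no records are suppressed, so the per-value average in Precision reduces to a per-attribute average.

```python
def height(levels):
    """Sum of the generalization levels applied to all quasi-identifiers."""
    return sum(levels.values())

def precision_loss(levels, max_levels):
    """Mean of level / max_level over all quasi-identifiers.

    0.0 means no attribute was generalized, 1.0 means every attribute
    was generalized to the top of its hierarchy. Each max level must be > 0.
    """
    ratios = [levels[name] / max_levels[name] for name in levels]
    return sum(ratios) / len(ratios)

# Hypothetical transformation: generalization level chosen per attribute.
levels = {"age": 2, "zipcode": 1, "sex": 0}
# Hypothetical hierarchy heights (highest available level per attribute).
max_levels = {"age": 4, "zipcode": 5, "sex": 1}

print(height(levels))
print(precision_loss(levels, max_levels))
```

Neither function looks at the records themselves, which reflects the statement above that Height is independent of the actual input dataset. Under the simplifying assumptions made here the same holds for this version of Precision.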

Moreover, ARX also supports the following single-dimensional measures for data utility:

  1. Average equivalence class size: This metric measures information loss based on the size of the equivalence classes resulting from a transformation. It does not take into account the actual values of the quasi-identifiers in the input dataset. More information can be found here.
  2. Discernibility: This metric measures information loss based on the size of the equivalence classes resulting from a transformation. It does not take into account the actual values of the quasi-identifiers in the input dataset. More information about the non-monotonic discernibility metric can be found here. A monotonic version is described here.
  3. KL Divergence: This model measures the distance between the distributions of the sizes of equivalence classes induced by transforming the data. More information can be found here.
  4. Ambiguity: This model measures the degree to which the records in the output dataset are ambiguous. More information can be found here.
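
The first two single-dimensional measures depend only on the sizes of the equivalence classes. The following Python sketch shows how these sizes can be derived and summarized. It is not ARX code: the function names and sample records are hypothetical, the discernibility value is the plain sum of squared class sizes, and the extra penalty that published definitions add for suppressed records is left out.

```python
from collections import Counter

def class_sizes(records, quasi_identifiers):
    """Group records by their quasi-identifier values and return the group sizes."""
    keys = (tuple(record[q] for q in quasi_identifiers) for record in records)
    return list(Counter(keys).values())

def average_class_size(sizes):
    """Mean number of records per equivalence class."""
    return sum(sizes) / len(sizes)

def discernibility(sizes):
    """Each record is penalized with the size of its class: sum of squared sizes."""
    return sum(size * size for size in sizes)

# Hypothetical output dataset after generalization.
records = [
    {"age": "20-39", "zipcode": "816**", "diagnosis": "flu"},
    {"age": "20-39", "zipcode": "816**", "diagnosis": "asthma"},
    {"age": "20-39", "zipcode": "816**", "diagnosis": "flu"},
    {"age": "40-59", "zipcode": "817**", "diagnosis": "diabetes"},
    {"age": "40-59", "zipcode": "817**", "diagnosis": "flu"},
]

sizes = class_sizes(records, ["age", "zipcode"])
print(sorted(sizes))
print(average_class_size(sizes))
print(discernibility(sizes))
```

Only the grouping by quasi-identifier values matters here. The generalized values themselves never enter the calculation.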

Since version 3.5.0, ARX also implements a model for maximizing the data publisher’s benefit, following the game-theoretic approach presented in this paper. We have also added support for the record-level entropy-based information loss model proposed in the same paper.