Measuring data utility

Measuring data quality is non-trivial because the usefulness of data depends on the use case. Since it is typically unknown in advance how the data will be analyzed, general-purpose models are needed to quantify data quality. These models typically define a decrease in data quality as an increase in information loss, which can be quantified. ARX supports quality models that focus on individual cells, attributes or records.

Cell-oriented and attribute-oriented models can be parameterized with different aggregate functions, which define how the individual measures will be compiled into a global measure for the overall dataset. We currently support the following aggregate functions:

  1. Rank: Ordered list of the utilities of all attributes, which will be compared lexicographically. This is the recommended default aggregate function.
  2. Geometric mean: The geometric mean of the utilities of all attributes.
  3. Arithmetic mean: The arithmetic mean of the utilities of all attributes.
  4. Sum: The sum of the utilities of all attributes.
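
The four aggregate functions can be illustrated with a minimal Python sketch. This is not ARX's implementation: the function name `aggregate`, the string keys and the descending sort order used for the rank are assumptions made for illustration. Values are treated as per-attribute information loss, so smaller is better.

```python
import math

def aggregate(utilities, function="rank"):
    """Combine per-attribute loss values into one comparable score (illustrative only)."""
    values = list(utilities)
    if function == "rank":
        # Assumption: sort in descending order so the largest loss is compared first.
        # Python compares tuples lexicographically.
        return tuple(sorted(values, reverse=True))
    if function == "geometric_mean":
        return math.prod(values) ** (1.0 / len(values))
    if function == "arithmetic_mean":
        return sum(values) / len(values)
    if function == "sum":
        return sum(values)
    raise ValueError(f"unknown aggregate function: {function}")

# Two hypothetical transformations with the same total loss
a = [0.5, 0.25, 0.25]
b = [0.375, 0.375, 0.25]
print(aggregate(a, "sum") == aggregate(b, "sum"))    # True
print(aggregate(a, "rank") > aggregate(b, "rank"))   # True
```

In this example the sum cannot separate the two transformations, while the rank prefers `b` because its largest per-attribute loss is smaller.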

Cell-oriented and attribute-oriented models may be further parameterized with attribute weights. Weights make it possible to reduce the degree of generalization applied to attributes that are important to your use case. ARX implements the following cell-oriented models:

  1. Loss: This measure summarizes the coverage of the domain of an attribute. ARX implements sophisticated methods for measuring this coverage with functional representations of generalization rules. Moreover, the variant implemented by ARX may be further parameterized to influence the degree of generalization and suppression that will be applied to a dataset. More information can be found here. This is the recommended default measure for data utility.
  2. Precision: This metric measures information loss based on the normalized generalization levels summarized for all values of all quasi-identifiers. More information can be found here.
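
The idea behind both cell-oriented models can be sketched in a few lines of Python. The formulas below are assumptions based on common formulations, not ARX's code: `coverage_loss` takes the share of an attribute's domain covered by a generalized value, and `precision_loss` averages the normalized generalization levels over all quasi-identifier cells. The function names and input formats are hypothetical.

```python
def coverage_loss(covered_leaves, domain_size):
    """Share of an attribute's domain covered by one generalized cell value.

    0.0 means the original value was kept, 1.0 means the whole domain is
    covered (e.g. a suppressed value). Assumed formulation, for illustration.
    """
    if domain_size <= 1:
        return 0.0
    return (covered_leaves - 1) / (domain_size - 1)

def precision_loss(levels, max_levels):
    """Average normalized generalization level over all quasi-identifier cells.

    levels: one list per record holding the generalization level of each cell.
    max_levels: height of each attribute's hierarchy (must be > 0).
    """
    total = 0.0
    cells = 0
    for record in levels:
        for level, max_level in zip(record, max_levels):
            total += level / max_level
            cells += 1
    return total / cells

# An age domain with 100 values, generalized to a 10-year interval
print(coverage_loss(10, 100))

# Two records, two attributes with hierarchy heights 2 and 1
print(precision_loss([[1, 0], [2, 0]], [2, 1]))  # 0.375
```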

Moreover, ARX implements the following attribute-oriented models:

  1. Non-uniform entropy: This metric measures information loss based on the loss of entropy, i.e., information content. To this end, it utilizes the concept of mutual information to quantify the amount of information which can be obtained about the original variables in the input dataset by observing the variables in the output dataset. The non-uniform variant implemented in ARX is described here.
  2. Height: This metric measures information loss based on the sum of the generalization levels applied to all quasi-identifiers. It is independent of the actual input dataset. More information can be found here.
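
A rough sketch of the two attribute-level models follows. It assumes a formulation of non-uniform entropy in which each cell contributes the negative logarithm of the conditional probability of the original value given the generalized value, with probabilities estimated from value frequencies in the column. This is an assumption for illustration and may differ from the variant implemented in ARX. The function names are hypothetical.

```python
import math
from collections import Counter

def height(levels):
    """Sum of the generalization levels applied to the quasi-identifiers."""
    return sum(levels)

def non_uniform_entropy(original, generalized):
    """Information loss of one attribute column, in bits (assumed formulation)."""
    orig_counts = Counter(original)
    gen_counts = Counter(generalized)
    loss = 0.0
    for o, g in zip(original, generalized):
        # Probability of the original value given the generalized value
        loss -= math.log2(orig_counts[o] / gen_counts[g])
    return loss

original = ["a", "a", "b", "c"]
generalized = ["x", "x", "x", "c"]  # "a" and "b" merged into "x"
print(non_uniform_entropy(original, generalized))
print(non_uniform_entropy(original, original))  # 0.0
print(height([1, 0, 2]))                        # 3
```

Note that `height` only looks at the transformation, which is why it is independent of the input dataset, whereas the entropy-based loss depends on the value frequencies in the data.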

Finally, ARX supports the following record-level measures for data utility. These models typically capture variations in the indistinguishability of records:

  1. Average equivalence class size: This model measures information loss based on the size of the equivalence classes resulting from a transformation. It does not take into account the actual values of the quasi-identifiers in the input dataset. More information can be found here.
  2. Discernibility: This model measures information loss based on the size of the equivalence classes resulting from a transformation. Records which are completely suppressed are penalized. It does not take into account the actual values of the quasi-identifiers in the input dataset. More information about the non-monotonic discernibility metric can be found here. A monotonic version is described here.
  3. KL Divergence: This model measures the distance between the distributions of the sizes of equivalence classes in input and output data. More information can be found here.
  4. Ambiguity: This model measures the degree to which the records in the output dataset are ambiguous. More information can be found here.
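
The first two record-level models can be sketched as follows. This is a simplified illustration, not ARX's implementation: it assumes that records are tuples of quasi-identifier values, that a completely suppressed record consists only of `"*"`, that suppressed records form one class in the average, and that each suppressed record is penalized with the size of the dataset.

```python
from collections import Counter

def class_sizes(records):
    """Group records by their quasi-identifier values and count each group."""
    return Counter(tuple(r) for r in records)

def average_class_size(records):
    """Number of records divided by the number of equivalence classes."""
    return len(records) / len(class_sizes(records))

def discernibility(records, suppressed="*"):
    """Sum of squared class sizes, with a penalty for suppressed records."""
    n = len(records)
    penalty = 0
    for key, size in class_sizes(records).items():
        if all(value == suppressed for value in key):
            # Assumption: every suppressed record costs the size of the dataset
            penalty += size * n
        else:
            penalty += size * size
    return penalty

records = [
    ("20-29", "M"), ("20-29", "M"),
    ("30-39", "F"), ("30-39", "F"), ("30-39", "F"),
    ("*", "*"),
]
print(average_class_size(records))  # 2.0
print(discernibility(records))      # 19
```

Both functions use only the class sizes, which reflects that these models do not take the actual values of the quasi-identifiers into account.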

Since version 3.5.0, ARX also implements a model for maximizing the data publisher’s benefit, following the game-theoretic approach presented in this paper. We have also added support for the record-level, entropy-based information loss model used by that approach.