Data quality models

Supported quality/utility models

Measuring data quality is a complex issue, and ARX supports multiple models that can be used as objective functions for optimizing the output data of an anonymization process. Typically, these methods model a decrease in data quality as an increase in information loss, which can be quantified. ARX supports general-purpose models that measure quality based on individual cell values, attributes or records, as well as a workload-aware model that focuses on the generation of classification models.

Cell-oriented and attribute-oriented models can be parameterized with different aggregate functions, which define how the individual measures will be compiled into a global measure for the overall dataset. The following aggregate functions are available:

  1. Rank: Ordered list of measurements for all attributes, which will be compared lexicographically.
  2. Geometric mean: The geometric mean of the measurements for all attributes.
  3. Arithmetic mean: The arithmetic mean of the measurements for all attributes.
  4. Sum: The sum of the measurements for all attributes.
  5. Maximum: The maximum of the measurements for all attributes.
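The aggregate functions listed above can be illustrated with a minimal Python sketch. It is not ARX code: the function name `aggregate` and the string keys are hypothetical, the rank is assumed to be sorted in descending order before comparison, and the geometric mean is the plain textbook definition, which may differ from how ARX handles zero values.

```python
import math


def aggregate(measurements, function):
    """Combine per-attribute measurements into one global measure.

    Hypothetical helper for illustration only, not part of the ARX API.
    """
    values = list(measurements)
    if function == "rank":
        # Assumption: values are ordered from worst to best. Python tuples
        # compare lexicographically, which matches the description above.
        return tuple(sorted(values, reverse=True))
    if function == "geometric_mean":
        return math.prod(values) ** (1.0 / len(values))
    if function == "arithmetic_mean":
        return sum(values) / len(values)
    if function == "sum":
        return sum(values)
    if function == "maximum":
        return max(values)
    raise ValueError(f"Unknown aggregate function: {function}")
```

With this sketch, `aggregate([0.1, 0.9], "rank")` compares as greater than `aggregate([0.5, 0.5], "rank")`, because the first elements of the ordered lists already differ, whereas both inputs have the same arithmetic mean.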

Cell-oriented and attribute-oriented measures can further be parameterized with attribute weights, which specify the importance of the attributes for further analyses. ARX will try to reduce the degree of transformation that is applied to attributes with higher importance.
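The effect of attribute weights can be sketched as follows. This is an illustration under a simplifying assumption, namely that each weight scales the information loss measured for its attribute before a weighted arithmetic mean is taken. The function name `weighted_loss` and the example attributes are hypothetical, and the exact weighting scheme used by ARX may differ.

```python
def weighted_loss(losses, weights):
    """Weighted arithmetic mean of per-attribute losses.

    losses and weights are dicts keyed by attribute name.
    Hypothetical helper for illustration only.
    """
    total_weight = sum(weights[attribute] for attribute in losses)
    weighted = sum(weights[attribute] * losses[attribute] for attribute in losses)
    return weighted / total_weight


# "age" is considered far more important than "zipcode"
weights = {"age": 0.9, "zipcode": 0.1}

# Two candidate outputs with the same unweighted mean loss
candidate_a = {"age": 0.8, "zipcode": 0.2}  # strongly transforms "age"
candidate_b = {"age": 0.2, "zipcode": 0.8}  # strongly transforms "zipcode"

loss_a = weighted_loss(candidate_a, weights)
loss_b = weighted_loss(candidate_b, weights)
```

Under these assumptions, `loss_b` is lower than `loss_a`, so an optimizer that minimizes the weighted loss would prefer the candidate that transforms the important attribute less.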

Cell-oriented general-purpose models

  1. Granularity/loss: This measure summarizes the degree to which transformed attribute values cover the original domain of an attribute. ARX implements sophisticated methods for quantifying this coverage based on functional representations of generalization rules. Moreover, the model can be parameterized to influence the degree of generalization and suppression that will be applied to a dataset. More information can be found here.
  2. Precision: This model estimates data quality based on normalized generalization levels of transformed attribute values. More information can be found here.
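Both cell-oriented models can be approximated with a few lines of Python. The sketch below uses commonly cited textbook formulations rather than the ARX implementation: precision is taken as the average of each cell's generalization level divided by the height of the attribute's hierarchy, and granularity/loss as the fraction of the original domain that a transformed value covers, normalized so that an unmodified value scores 0 and a fully generalized or suppressed value scores 1. All function and parameter names are hypothetical.

```python
def precision_loss(levels, max_levels):
    """Average normalized generalization level over all cells.

    levels: one list per record with the generalization level of each cell.
    max_levels: maximum level (hierarchy height) of each attribute.
    """
    total = 0.0
    cells = 0
    for row in levels:
        for level, max_level in zip(row, max_levels):
            total += level / max_level
            cells += 1
    return total / cells


def granularity_loss(covered, domain_size):
    """Average domain coverage of the transformed cells of one attribute.

    covered: for each cell, the number of original domain values that the
             transformed value stands for (1 means unmodified).
    domain_size: number of distinct values in the attribute's domain.
    """
    return sum((k - 1) / (domain_size - 1) for k in covered) / len(covered)
```

For example, `precision_loss([[0, 1], [2, 1]], [2, 2])` evaluates to 0.5, and an attribute in which every cell still holds its original value yields a granularity loss of 0.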

Attribute-oriented general-purpose models

  1. Non-uniform entropy: This model quantifies loss of information based on mutual information, which measures the amount of information that can be obtained about the original values of variables in the input dataset by observing the values of variables in the output dataset. The non-uniform variant implemented in ARX is described here.
  2. Height: This very simple model quantifies loss of information as the sum of the generalization levels applied to all attribute values.
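The two attribute-oriented models can be sketched in Python as follows. The non-uniform entropy function follows a common formulation from the literature, in which each cell contributes the negative logarithm of the probability of its original value given its transformed value. Whether this matches the ARX implementation in every detail (for example, normalization) is an assumption. The height function assumes a transformation that applies one generalization level per attribute. All names are hypothetical.

```python
import math
from collections import Counter


def height(levels):
    """Sum of the generalization levels applied to the attributes."""
    return sum(levels)


def non_uniform_entropy(original, generalized):
    """Information loss for one attribute.

    original: list of input values, one per record.
    generalized: list of output values for the same records.
    """
    pair_counts = Counter(zip(original, generalized))
    output_counts = Counter(generalized)
    loss = 0.0
    for orig, gen in zip(original, generalized):
        # Probability of the original value among all records
        # that share the same transformed value
        probability = pair_counts[(orig, gen)] / output_counts[gen]
        loss -= math.log2(probability)
    return loss
```

If a column is left unchanged, every probability is 1 and the loss is 0. The loss grows as more distinct original values are merged into the same transformed value.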

Record-oriented general-purpose models

  1. Average equivalence class size: This model estimates data quality by calculating the average size of classes of indistinguishable records. It does not take into account the actual attribute values in the output dataset. More information can be found here.
  2. Discernibility: This model also estimates data quality based on the size of the equivalence classes in the output dataset. Records which are suppressed are penalized. It does not take into account the actual attribute values in the output dataset. For further information, see here and here.
  3. Ambiguity: This model quantifies the degree to which the records in the output dataset are ambiguous. More information can be found here.
  4. Entropy-based model: This model has been proposed here.
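The first two record-oriented models depend only on the sizes of the equivalence classes, which the following Python sketch makes explicit. It uses widely cited definitions as an assumption: the average class size is the number of records divided by the number of classes, and discernibility charges each record in a class the size of that class, while each suppressed record is charged the size of the whole dataset. The function names and the marker used for suppressed records are hypothetical, and suppressed records are counted as one class in the average.

```python
from collections import Counter

# Hypothetical marker for a fully suppressed record with two quasi-identifiers
SUPPRESSED = ("*", "*")


def average_class_size(records):
    """Number of records divided by the number of equivalence classes."""
    classes = Counter(records)
    return len(records) / len(classes)


def discernibility(records, suppressed=SUPPRESSED):
    """Sum of squared class sizes, with a penalty for suppressed records."""
    classes = Counter(records)
    total = 0
    for key, size in classes.items():
        if key == suppressed:
            # Each suppressed record is penalized with the dataset size
            total += size * len(records)
        else:
            total += size * size
    return total


records = [("2*", "M"), ("2*", "M"), ("3*", "F"), ("*", "*")]
```

For the four example records, the class sizes are 2, 1 and 1 (suppressed), so `discernibility(records)` is 4 + 1 + 4 = 9. Neither function looks at the attribute values beyond grouping identical records, which reflects the limitation noted in the descriptions above.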

Special-purpose models

ARX also implements a model for maximizing the data publisher's benefit following the game-theoretic approach implemented in the profitability privacy model. For further details, please see here and here.

Since version 3.7.0, ARX further implements a model for optimizing output data towards suitability as a training set for building classification models. For more details, please see here.