Utility analysis

Overview
Comparing data
Summary statistics
Empirical distribution
Contingency table
Classes and records
Input properties
Output properties
Classification accuracy
Enhancing data utility

Analyzing data utility

When a solution candidate has been selected, this perspective can be used to evaluate its utility for the anticipated usage scenario. For this purpose, the transformed data set can be compared to the original input data set. In the center, this perspective displays the original data (area 1) and the result of the selected transformation (area 2). Both tables are synchronized when they are browsed using the horizontal or vertical scroll bars. Areas 3 and 4 allow comparing statistical information about the currently selected attribute(s).

Utility analysis

The view further displays results of univariate and bivariate statistics as well as basic properties of the input and output data sets. Moreover, it provides access to statistics about the distribution of equivalence classes and about suppressed records. Finally, classification accuracy can be analyzed using a generic logistic regression method.

Note: ARX tries to present comparable data visualizations for original and transformed data sets. For this purpose, it uses information from the attributes' data types and relationships between values extracted from the generalization hierarchies. As a result, specifying reasonable data types and hierarchies will increase the quality and comparability of data visualizations.

Comparing input and output data

In this section, the transformed data set can be compared to the original input data set. Both tables are synchronized when they are browsed using the horizontal or vertical scroll bars.

The check boxes indicate which rows are part of the research sample. The check boxes in the table displaying the output data set indicate the sample that was selected when the de-identification process was performed; they cannot be altered. The check boxes in the table displaying the input data set represent the current research sample; they are editable.

Data analysis

Each table offers a few options that are accessible via buttons in the top-right corner of the view:

  1. Pressing the first button will sort the data according to the currently selected attribute.
  2. Pressing the second button will sort the output data set according to all quasi-identifiers and then highlight the equivalence classes.
  3. Pressing the third button will toggle whether all records of the data set are displayed or only the current research sample.
Buttons

Summary statistics

The first tab on the bottom of the utility analysis perspective shows summary statistics for the currently selected attribute.

Summary statistics

The displayed parameters depend on the scale of measure of the variable. For attributes with a nominal scale, the following parameters will be provided:

  1. Mode.

For attributes with an ordinal scale, the following additional parameters will be displayed:

  1. Median.
  2. Minimum.
  3. Maximum.

For attributes with an interval scale, the following additional parameters will be provided:

  1. Arithmetic mean.
  2. Sample variance.
  3. Population variance.
  4. Standard deviation.
  5. Range.
  6. Kurtosis.

For attributes with a ratio scale, the following additional parameters will be displayed:

  1. Geometric mean.

Note: These statistical parameters are calculated using listwise deletion, which is a method for handling missing data. In this method, an entire record is excluded from analysis if any single value is missing. This behavior can be changed in the project settings dialog.
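
To make the role of listwise deletion concrete, here is a minimal, self-contained sketch (illustrative only, not ARX code) that drops records containing missing values before computing two of the parameters listed above, the arithmetic and the geometric mean, for a hypothetical numeric attribute:

    import java.util.Arrays;
    import java.util.List;

    public class SummaryStatisticsSketch {

        public static void main(String[] args) {

            // Hypothetical records: [age, zipcode]; null marks a missing value
            List<String[]> records = Arrays.asList(
                    new String[]{"34", "81675"},
                    new String[]{"45", null},      // excluded by listwise deletion
                    new String[]{"66", "81925"},
                    new String[]{"70", "81931"});

            // Listwise deletion: drop every record that contains at least one missing value
            double[] ages = records.stream()
                    .filter(r -> Arrays.stream(r).allMatch(v -> v != null))
                    .mapToDouble(r -> Double.parseDouble(r[0]))
                    .toArray();

            // Arithmetic mean (interval scale)
            double arithmeticMean = Arrays.stream(ages).average().orElse(Double.NaN);

            // Geometric mean (ratio scale): exponential of the mean of the logarithms
            double geometricMean = Math.exp(Arrays.stream(ages).map(Math::log).average().orElse(Double.NaN));

            System.out.println("Arithmetic mean: " + arithmeticMean);
            System.out.println("Geometric mean:  " + geometricMean);
        }
    }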

Empirical distribution

This section shows a histogram or a table visualizing the frequency distribution of the values of the currently selected attribute.

From Wikipedia: In statistics, a frequency distribution is a table that displays the frequency of various outcomes in a sample. Each entry in the table contains the frequency or count of the occurrences of values within a particular group or interval, and in this way, the table summarizes the distribution of values in the sample.

From Wikipedia: A histogram is a graphical representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous variable. To construct a histogram, the first step is to "bin" the range of values - that is, divide the entire range of values into a series of intervals - and then count how many values fall into each interval.

Histogram
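
As a minimal illustration of the binning step described above (not ARX's implementation), the following sketch divides a set of numeric values into equal-width intervals and counts how many values fall into each interval:

    import java.util.Arrays;

    public class HistogramSketch {

        public static void main(String[] args) {

            double[] values = {12, 17, 23, 23, 31, 35, 44, 46, 58, 61};
            int bins = 5;

            double min = Arrays.stream(values).min().getAsDouble();
            double max = Arrays.stream(values).max().getAsDouble();
            double width = (max - min) / bins;

            // Count how many values fall into each equal-width interval
            int[] counts = new int[bins];
            for (double v : values) {
                int index = (int) Math.min((v - min) / width, bins - 1); // clamp the maximum into the last bin
                counts[index]++;
            }

            for (int i = 0; i < bins; i++) {
                System.out.printf("[%.1f, %.1f): %d%n", min + i * width, min + (i + 1) * width, counts[i]);
            }
        }
    }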

Note: ARX tries to present comparable data visualizations of properties of the original and transformed data sets. For this purpose, it uses information from the attributes' data types and relationships between their values, which are extracted from the generalization hierarchies. As a consequence, specifying reasonable data types and hierarchies will increase the quality and comparability of data visualizations.

Distribution

Contingency

This area shows a heat map or a table visualizing the contingency of two selected attributes.

From Wikipedia: In statistics, a contingency table is a type of table in a matrix format that displays the multivariate frequency distribution of the variables.

Contingency table
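
In its simplest form, such a table can be computed by counting value pairs, as in the following illustrative sketch (not ARX code; the example attributes are hypothetical):

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class ContingencySketch {

        public static void main(String[] args) {

            // Hypothetical value pairs of two attributes (e.g. gender and diagnosis)
            String[][] pairs = {
                    {"male", "flu"}, {"female", "flu"}, {"male", "cold"},
                    {"female", "cold"}, {"female", "flu"}, {"male", "flu"}};

            // Count the frequency of each combination of values
            Map<String, Integer> table = new LinkedHashMap<>();
            for (String[] pair : pairs) {
                String key = pair[0] + " / " + pair[1];
                table.merge(key, 1, Integer::sum);
            }

            table.forEach((combination, count) -> System.out.println(combination + ": " + count));
        }
    }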

Note: ARX tries to present comparable data visualizations of properties of the original and transformed data sets. For this purpose, it uses information from the attributes' data types and relationships between their values, which are extracted from the generalization hierarchies. As a consequence, specifying reasonable data types and hierarchies will increase the quality and comparability of data visualizations.

Contingency plot

Classes and records

This view summarizes information about the records in the data set. It shows the minimal, maximal and average size of the equivalence classes, the number of classes and the number of remaining records as well as the number of suppressed records.

Equivalence classes

For the output data set, all parameters are calculated in two variants. One variant considers suppressed records and the other variant ignores suppressed records.

Note: The variant that ignores suppressed records uses listwise deletion, which is a method for handling missing data. In this method, an entire record is excluded from analysis if any single value is missing.
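
In simplified form, these statistics can be obtained by grouping records on their (generalized) quasi-identifier values and inspecting the resulting group sizes, as in the following illustrative sketch (not ARX code; using "*" to mark suppressed values is an assumption made for the example):

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.IntSummaryStatistics;
    import java.util.Map;

    public class EquivalenceClassSketch {

        public static void main(String[] args) {

            // Hypothetical generalized quasi-identifier values [age group, zipcode prefix]
            String[][] records = {
                    {"30-39", "816**"}, {"30-39", "816**"},
                    {"60-69", "819**"}, {"60-69", "819**"}, {"60-69", "819**"},
                    {"*", "*"}};                       // a suppressed record

            // Group non-suppressed records by their quasi-identifier values
            Map<String, Integer> classes = new HashMap<>();
            int suppressed = 0;
            for (String[] record : records) {
                if (Arrays.stream(record).allMatch("*"::equals)) {
                    suppressed++;                      // ignored in the variant that excludes suppressed records
                } else {
                    classes.merge(String.join("|", record), 1, Integer::sum);
                }
            }

            IntSummaryStatistics sizes = classes.values().stream().mapToInt(Integer::intValue).summaryStatistics();
            System.out.println("Classes: " + classes.size() + ", min size: " + sizes.getMin()
                    + ", max size: " + sizes.getMax() + ", average size: " + sizes.getAverage()
                    + ", suppressed records: " + suppressed);
        }
    }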

Properties of input data

This section displays basic properties of the input data set and the configuration used for de-identification. These properties include the number of data entries and the suppression limit, as well as a shallow specification of all attributes in the data set, including information about the associated transformation functions.

Input properties

Properties of output data

This section displays basic properties about the selected data transformation as well as the resulting output data set. These properties include the score calculated using the specified utility measure and further settings (e.g. attribute weights), the number of suppressed records, the number of equivalence classes and the number of classes containing outliers. If the transformation is privacy-preserving, a complete specification of all fulfilled privacy models is provided.

Output properties

Note: The information provided in this view is based on the specification which has been defined in the configuration perspective prior to performing the de-identification process. The state of the workbench will remain unchanged, even if these definitions are changed. To incorporate changes (e.g., adjusted generalization hierarchies) into the exploration and utility analysis perspectives, the data de-identification process needs to be executed again.

Classification accuracy

This section can be used to compare the classification accuracy that can be achieved with a generic implementation of logistic regression for both the input and the output data sets. The results are obtained with 10-fold cross-validation. In the view displayed at the bottom, the features and classes that are to be analyzed can be selected.

Classification accuracy

The section displayed at the top shows the results of the classification process. For both input and output data, it displays the number of instances of the class attribute as well as a baseline accuracy. For the input data set, the baseline accuracy is defined as the accuracy that can be achieved with the ZeroR classification method. For the output data set, the baseline accuracy is defined as the accuracy that can be achieved when classifying the input data set. For output data, the view also displays the difference in the number of instances of the class attribute compared to the input data set.

Classification accuracy
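
ZeroR simply predicts the most frequent value of the class attribute for every record. A minimal sketch of how such a baseline accuracy can be computed (illustrative only, not ARX's implementation):

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    public class ZeroRSketch {

        public static void main(String[] args) {

            // Hypothetical values of the class attribute
            String[] classValues = {"yes", "no", "yes", "yes", "no", "yes"};

            // Count the frequency of each class value
            Map<String, Integer> counts = new HashMap<>();
            for (String value : classValues) {
                counts.merge(value, 1, Integer::sum);
            }

            // ZeroR always predicts the most frequent class value
            int mostFrequent = Collections.max(counts.values());

            // Baseline accuracy: fraction of records for which this constant prediction is correct
            double baselineAccuracy = (double) mostFrequent / classValues.length;
            System.out.println("ZeroR baseline accuracy: " + baselineAccuracy);
        }
    }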

Local recoding

This section can be used to perform local recoding to further enhance data utility. It is recommended to start by favoring suppression over generalization, which can be achieved by moving the slider to the left, and then to perform a fixpoint-adaptive recoding process with a parameter of 0.05.

Local recoding

ARX will perform local recoding by recursively executing a global recoding method on all records that have been suppressed in the previous iteration. With this method, a significant improvement in homogeneity and data utility can be achieved compared to other local recoding algorithms. Moreover, the method supports various privacy models, including models that protect against attribute disclosure.
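
The following sketch outlines this iterative scheme in strongly simplified form; the GlobalRecoder interface and all of its method names are hypothetical and not part of the ARX API:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.stream.Collectors;

    public class LocalRecodingSketch {

        // Hypothetical interface for a single global recoding step (not the ARX API)
        interface GlobalRecoder {
            List<String[]> anonymize(List<String[]> records);
            boolean isSuppressed(String[] record);
        }

        // Simplified outline of the iterative scheme described above: repeatedly apply a global
        // recoding method to the records that were suppressed in the previous iteration
        static List<String[]> localRecoding(GlobalRecoder recoder, List<String[]> records, int maxIterations) {
            List<String[]> output = new ArrayList<>();
            List<String[]> remaining = records;
            for (int i = 0; i < maxIterations && !remaining.isEmpty(); i++) {
                List<String[]> result = recoder.anonymize(remaining);
                // Keep all records that could be published with the current transformation
                output.addAll(result.stream().filter(r -> !recoder.isSuppressed(r)).collect(Collectors.toList()));
                // Only the records that are still suppressed are recoded again in the next iteration
                remaining = result.stream().filter(recoder::isSuppressed).collect(Collectors.toList());
            }
            output.addAll(remaining); // records that remain suppressed after the last iteration
            return output;
        }
    }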