Overview of ARX's perspectives
The ARX anonymization tool is divided into four perspectives, which model different aspects of the de-identification process. As shown in the screenshot below, these perspectives support 1) configuring privacy models, utility measures and transformation methods, 2) exploring the solution space, 3) analyzing data utility and 4) analyzing privacy risks.
In the configuration perspective, input data can be loaded, transformation rules can be specified and all further parameters, such as privacy models and utility measures, can be selected and parameterized. If required, this step can be prepared by performing a risk analysis. Next, the solution space is characterized by executing a de-identification algorithm. The result can be inspected in the exploration perspective. Here, it is possible to search for privacy-preserving data transformations that fulfill the user's requirements, i.e. transformations that produce output data suited to the intended usage scenario. To assess this suitability, the utility analysis perspective provides methods for comparing transformed data sets to the original input data set using descriptive statistics and machine learning. In the fourth perspective, risk analyses can be performed for the input data set as well as for transformed output data. Based on the results of these analyses, the suitability of a solution candidate may either be confirmed or the configuration of the de-identification process can be modified.
Configuring the de-identification process
In the configuration perspective, data can be imported, generalization hierarchies can be created, and the anonymization process can be parameterized.
The data import process supports a variety of data sources with user-defined syntax, and allows the specification of attribute metadata, including data type and format.
Generalization hierarchies can be created for quasi-identifier and sensitive attributes in interval, ordinal and mask form. Hierarchies can be created with software assistance or imported from CSV files.
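To make the idea of a generalization hierarchy concrete, the following minimal Python sketch builds a masking-based hierarchy for a ZIP code and an interval-based hierarchy for an age. It is an illustration only and does not use ARX's own hierarchy builders; the function names, the interval widths and the example values are assumptions chosen for this sketch. Each resulting list holds the original value followed by increasingly general representations.

```python
def masking_hierarchy(value, mask_char="*"):
    """Return all generalization levels of a string, masking characters from the right."""
    return [value[:len(value) - i] + mask_char * i for i in range(len(value) + 1)]


def interval_hierarchy(value, widths):
    """Return the value, one enclosing interval per width, and a final '*' level."""
    levels = [str(value)]
    for width in widths:
        lower = (value // width) * width
        levels.append(f"[{lower}, {lower + width})")
    levels.append("*")  # top level: the value is fully suppressed
    return levels


# Hypothetical example values.
zip_levels = masking_hierarchy("81667")
age_levels = interval_hierarchy(34, [5, 10, 20])

print(";".join(zip_levels))  # 81667;8166*;816**;81***;8****;*****
print(";".join(age_levels))  # 34;[30, 35);[30, 40);[20, 40);*
```

Writing one such row per distinct attribute value, with one column per level, gives a tabular hierarchy of the kind that could be stored in a CSV file.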
The anonymization process is parameterized in the privacy and utility sub-panels. The privacy sub-panel allows the selection of one or more privacy models. For applicable privacy models, population and financial data can also be specified. The utility sub-panel allows the selection of a utility measure. For applicable utility measures, a variety of related parameters can be configured (e.g. suppression, generalization, aggregation, monotonicity and precomputation).
Exploration of the solution space
During the de-identification process, ARX constructs a search space consisting of a set of transformations that can be applied to the input data set. This search space is then characterized based on the given privacy models and utility measure. This perspective allows users to browse the solution candidates identified by ARX and to select interesting transformations for further analysis.
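The following Python sketch illustrates what such a search space can look like when every transformation is a combination of generalization levels, one per quasi-identifier. It enumerates all combinations for a toy data set, checks each one against k-anonymity, and picks the candidate with the lowest total generalization level. The data, the hierarchies, the choice of k-anonymity and the use of the level sum as a stand-in for a utility measure are all assumptions made for this example. The exhaustive enumeration is for illustration only and is not meant to describe how ARX traverses the search space.

```python
from collections import Counter
from itertools import product

# Hypothetical toy data: (age, ZIP code) pairs acting as quasi-identifiers.
records = [(34, "81667"), (36, "81675"), (45, "81925"), (47, "81931")]

MAX_LEVELS = (2, 5)  # highest generalization level for age and ZIP code


def generalize_age(age, level):
    """Level 0: exact age, level 1: decade interval, level 2: suppressed."""
    if level == 0:
        return str(age)
    if level == 1:
        lower = age // 10 * 10
        return f"[{lower}, {lower + 10})"
    return "*"


def generalize_zip(zip_code, level):
    """The level is the number of masked trailing digits."""
    return zip_code[:len(zip_code) - level] + "*" * level


def is_k_anonymous(transformation, k):
    """Check that every equivalence class contains at least k records."""
    age_level, zip_level = transformation
    classes = Counter(
        (generalize_age(age, age_level), generalize_zip(zip_code, zip_level))
        for age, zip_code in records
    )
    return min(classes.values()) >= k


# The search space: every combination of generalization levels.
lattice = list(product(range(MAX_LEVELS[0] + 1), range(MAX_LEVELS[1] + 1)))

# Solution candidates: transformations that satisfy the privacy model.
candidates = [t for t in lattice if is_k_anonymous(t, k=2)]

# Crude utility proxy: prefer the candidate with the least generalization.
best = min(candidates, key=sum)
print(best)  # (1, 2): ages as decades, last two ZIP digits masked
```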
Analysis of data utility
The utility analysis perspective is used to assess the suitability of a specific transformation for a given usage scenario. To this end, input and transformed data are displayed side by side, together with attribute-specific univariate and bivariate statistics and machine learning metrics. A variety of graphical and numerical representations are displayed to aid interpretation.
Configurable logistic regression models can be used to classify the data for statistical and machine learning analysis.
The utility analysis perspective also provides an interface for local recoding, which is used to enhance the utility of the global transformation by reassessing outliers.
Analysis of risks
In this perspective, various metrics reflecting privacy risks are presented. Metrics implemented by ARX include re-identification risks for the prosecutor, journalist and marketer attacker models as well as estimates of population uniqueness, which can be calculated using different statistical models. Moreover, the perspective provides access to a method for detecting attributes that must be modified when de-identifying data in compliance with the Safe Harbor method of the US Health Insurance Portability and Accountability Act (the so-called HIPAA identifiers), as well as a method for finding potential quasi-identifiers.
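As a rough illustration of the prosecutor attacker model, the sketch below derives sample-based risks from equivalence class sizes: a record that shares its quasi-identifier values with n records in total is assigned a re-identification risk of 1/n. This is a simplified Python example under stated assumptions and not ARX's implementation. The function name, the toy records and the selection of quasi-identifiers are hypothetical, and the fraction of sample uniques computed here is not an estimate of population uniqueness, which requires a statistical model.

```python
from collections import Counter


def prosecutor_risks(rows, quasi_identifiers):
    """Compute simple sample-based risk figures from equivalence class sizes."""
    keys = [tuple(row[q] for q in quasi_identifiers) for row in rows]
    class_sizes = Counter(keys)
    sizes = [class_sizes[key] for key in keys]   # class size per record
    record_risks = [1 / size for size in sizes]  # risk per record
    return {
        "highest_risk": max(record_risks),
        "average_risk": sum(record_risks) / len(record_risks),
        "sample_uniques": sum(1 for size in sizes if size == 1) / len(rows),
    }


# Hypothetical transformed data set; age and zip act as quasi-identifiers.
rows = [
    {"age": "[30, 40)", "zip": "816**", "diagnosis": "flu"},
    {"age": "[30, 40)", "zip": "816**", "diagnosis": "asthma"},
    {"age": "[40, 50)", "zip": "819**", "diagnosis": "flu"},
    {"age": "[40, 50)", "zip": "819**", "diagnosis": "diabetes"},
    {"age": "[50, 60)", "zip": "820**", "diagnosis": "flu"},
]

risks = prosecutor_risks(rows, ["age", "zip"])
print(risks)
```

In this example the single record in the third equivalence class is unique within the sample, so the highest risk is 1 even though the average risk is lower.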
ARX aims to provide a high degree of interoperability with other software systems. Generalization hierarchies and data sets can be imported from and exported to files containing character-separated values (CSV). Moreover, ARX is able to import data from other sources, including MS Excel spreadsheets and relational database systems, such as MS SQL, DB2, MySQL or PostgreSQL.
The data import wizard also supports the renaming, removing and reordering of columns. During data import, data types are automatically detected and data cleansing may be performed. This means that values that do not conform to the specified data type will be replaced with null values, which are handled correctly by all methods implemented in ARX.
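The cleansing rule described above can be sketched in a few lines of Python: each value is parsed according to the declared data type, and anything that fails to parse becomes a null value. The function name, the type labels and the date format are assumptions made for this illustration and do not correspond to ARX's import API.

```python
from datetime import datetime


def parse_or_none(value, data_type, date_format="%Y-%m-%d"):
    """Parse a raw string as the given type; return None if it does not conform."""
    try:
        if data_type == "integer":
            return int(value)
        if data_type == "decimal":
            return float(value)
        if data_type == "date":
            return datetime.strptime(value, date_format).date()
        return value  # plain strings always conform
    except (ValueError, TypeError):
        return None


# Hypothetical raw column declared as integer.
column = ["34", "36", "n/a", "47", ""]
cleaned = [parse_or_none(value, "integer") for value in column]
print(cleaned)  # [34, 36, None, 47, None]
```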
To promote cross-platform analysis, all tabular data created by ARX can be exported to CSV-formatted files via context menus.