Viewing data sets
Specifying attribute properties
Creating generalization hierarchies
Privacy, population model, risk and benefit
Transformation and utility model
Selecting a research sample
Configuring the de-identification process
In this perspective, first, a data set can be imported into the tool and attribute meta data can be specified, including data types and attribute properties in terms of privacy risks. Second, generalization hierarchies for quasi-identifiers or sensitive attributes can be created semi-automatically with built-in wizards or imported into the tool from CSV files. Third, privacy models, the method for measuring data utility and further parameters, which control the transformation process, can be specified.
The perspective is divided into five main areas. Area 1 shows the current input data set. Area 2 provides means for specifying meta data about its attributes. Area 3 supports configuring the privacy model. Area 4 implements controls for configuring further properties of the transformation process, such as the coding model, how data utility should be measured and how important certain attributes are. Area 5 provides methods for extracting a research sample, which is a subset of the overall data set that is to be de-identified and exported.
Viewing data sets
ARX displays input and output data as tables. In the table headers, attribute types are indicated by displaying different colors:
- Red indicates an identifying attribute. Identifying attributes are associated with a high risk of re-identification. They will be removed from the data set. Typical examples are names or Social Security Numbers.
- Yellow indicates a quasi-identifying attribute. Quasi-identifying attributes are associated with a high risk of re-identification. They will be transformed by means of generalization, suppression or micro-aggregation. Typical examples are gender, date of birth and ZIP codes.
- Purple indicates a sensitive attribute. Sensitive attributes encode properties with which individuals are not willing to be linked. As such, they might be of interest to an attacker and, if disclosed, could cause harm to data subjects. They will be kept unmodified but may be subject to further constraints, such as t-closeness or l-diversity. Typical examples are diagnoses.
- Green indicates an insensitive attribute. Insensitive attributes are not associated with any privacy risks. They will be kept unmodified.
Each row is further associated with a checkbox. These checkboxes indicate which rows have been selected for inclusion in the research sample. The checkboxes in the view for output data sets indicate the sample that was specified when the de-identification process was last executed; they cannot be edited. The checkboxes in the view for the input data set represent the current research sample and are editable.
Each table offers a few options that are accessible via buttons in the top-right corner:
- Pressing the first button will sort the data according to the currently selected column.
- Pressing the second button will sort the output data set according to all quasi-identifiers and then highlight the equivalence classes.
- Pressing the third button will toggle between showing all records of the data set or only the research sample.
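The idea behind the second button (sorting by all quasi-identifiers and then highlighting equivalence classes) can be sketched in a few lines of Python. This is only an illustration of the concept, not ARX code; the function name, attribute names and sample records are hypothetical.

```python
from itertools import groupby

def sorted_equivalence_classes(rows, quasi_identifiers):
    """Sort rows by all quasi-identifiers and split them into equivalence classes."""
    def key(row):
        return tuple(row[attribute] for attribute in quasi_identifiers)
    ordered = sorted(rows, key=key)
    # After sorting, records with identical quasi-identifier values are adjacent,
    # so each run of equal keys forms one equivalence class.
    return [list(group) for _, group in groupby(ordered, key=key)]

# Hypothetical output data set
rows = [
    {"age": "[40, 60)", "zip": "816**", "diagnosis": "asthma"},
    {"age": "[20, 40)", "zip": "816**", "diagnosis": "flu"},
    {"age": "[20, 40)", "zip": "816**", "diagnosis": "cold"},
]
classes = sorted_equivalence_classes(rows, ["age", "zip"])
for number, members in enumerate(classes):
    print(number, [row["diagnosis"] for row in members])
```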
Specifying attribute properties
This area allows assigning privacy risks and data types to attributes and specifying transformation rules in terms of generalization hierarchies and/or micro-aggregation functions. In terms of privacy risks, ARX distinguishes between the following four types of attributes:
- Identifying attributes are associated with a high risk of re-identification. They will be removed from the data set. Typical examples are names or Social Security Numbers.
- Quasi-identifying attributes are associated with a high risk of re-identification. They will be transformed by means of generalization, suppression or micro-aggregation. Typical examples are gender, date of birth and ZIP codes.
- Sensitive attributes encode properties with which individuals are not willing to be linked. As such, they might be of interest to an attacker and, if disclosed, could cause harm to data subjects. They will be kept unmodified but may be subject to further constraints, such as t-closeness or l-diversity. Typical examples are diagnoses.
- Insensitive attributes are not associated with any privacy risks. They will be kept unmodified.
As can be seen in area 2.1, each tab is associated with one attribute of the input data set. The drop-down list in area 2.2 supports specifying the type of the attribute, whereas the drop-down list in area 2.3 allows specifying a data type. Specifying a data type is optional: ARX will happily treat all data as strings. However, an appropriate data type yields more meaningful generalization hierarchies when using ARX's built-in wizard and a more intuitive graphical representation of data properties in the analysis perspective. Currently, the following data types are supported:
- String: a generic sequence of characters. This is the default data type.
- Integer: a data type for numbers without a fractional component.
- Decimal: a data type for numbers with a fractional component.
- Date/time: a data type for dates and time stamps.
- Ordered string: this data type represents strings with ordinal scale.
Some data types require a format string, which can be specified in area 2.4. Decimals and dates/time stamps are typical examples. More information on format strings for decimals can be found here (http://docs.oracle.com/javase/7/docs/api/java/text/DecimalFormat.html) and information on format strings for dates/time stamps can be found here (http://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html).
In area 2.5 the coding model can be configured. Generalization and micro-aggregation are supported. If generalization is selected as the coding model, the drop-down lists in area 2.6 allow specifying minimal and maximal generalization levels for the selected attribute. In the case of micro-aggregation, this area allows choosing the aggregate function and how missing data entries should be handled.
Area 2.7 displays a tabular representation of the associated generalization hierarchy. The values from the original input data set are shown in the first column and the level of generalization increases from left to right. This area also implements a basic editor for generalization hierarchies, which allows moving, adding and deleting columns or rows and altering the labels of individual cells. Please note that any generalization hierarchy must be a mono-hierarchy (i.e. it must form a tree structure).
As a starting point for defining generalization hierarchies, ARX offers a wizard, which can be used to create generalization hierarchies for many common types of attributes. The wizard can be launched via the application menu or via an associated button in the application toolbar.
Creating generalization hierarchies
ARX offers different methods for creating generalization hierarchies for different types of attributes. Generalization hierarchies created with the wizard are represented in a functional manner, meaning that they can be created for the entire domain of an attribute without explicitly addressing the specific values in a concrete data set. This enables the handling of continuous variables. Moreover, hierarchy specifications can be imported and exported and they can thus be reused for de-identifying different data sets with similar attributes. Please make sure that you first select an appropriate data type for the attribute. The wizard supports three different types of hierarchies:
- Masking-based hierarchies: this general-purpose mechanism allows creating hierarchies for a broad spectrum of attributes.
- Interval-based hierarchies: these hierarchies can be applied to variables with a ratio scale.
- Order-based hierarchies: this method can be used for variables with an ordinal scale.
Masking is a highly flexible mechanism that can be applied to many attributes and is especially suitable for alphanumeric codes, such as ZIP codes or telephone numbers. The following image shows a screenshot of the respective wizard:
In ARX, masking follows a two-step process. First, values are aligned either to the left or to the right and adjusted to a common length by introducing padding characters. Then the characters are masked, either from left to right or from right to left. The padding character, as well as the character used for masking, can be specified by the user.
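The two-step masking process can be illustrated with a small Python sketch. It is not the ARX implementation; the function name, parameter names and default characters are assumptions made for illustration only.

```python
def masking_hierarchy(value, max_length, align_left=True, mask_from_right=True,
                      pad_char=" ", mask_char="*"):
    """Return all generalization levels of a value, starting with the original (level 0)."""
    # Step 1: align the value and pad it to the common length.
    if align_left:
        padded = value.ljust(max_length, pad_char)
    else:
        padded = value.rjust(max_length, pad_char)
    # Step 2: mask one additional character per level.
    levels = []
    for masked in range(max_length + 1):
        if mask_from_right:
            levels.append(padded[:max_length - masked] + mask_char * masked)
        else:
            levels.append(mask_char * masked + padded[masked:])
    return levels

# A five-digit ZIP code, masked from right to left
print(masking_hierarchy("81675", 5))
```

Masking from right to left keeps the leading digits of a ZIP code, which typically carry the coarse geographic information, for as long as possible.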
Intervals are a natural means of generalization for values with a ratio scale, such as integers, decimals or dates/time stamps. ARX offers a graphical editor for efficiently defining sets of intervals over the complete range of any of the above data types. As is shown in the following screenshot, first a sequence of intervals can be defined (shown on the left side of the screenshot). In the next step, subsequent levels consisting of groups can be defined. Each group combines a given number of elements from the previous level. Any sequence of intervals or groups is automatically repeated to cover the complete range of the attribute. For example, to generalize arbitrary integers into intervals of length 10, only one interval [0, 10) needs to be defined. Defining a group of size two on the next level automatically generalizes integers into intervals of length 20. As is shown in the image (W1), the editor visually indicates automatically created repetitions of intervals and groups.
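The repetition of a single interval and the grouping on subsequent levels can be sketched as follows. This is a minimal illustration for integers, assuming half-open intervals (lower bound inclusive, upper bound exclusive) that start at 0; the function and parameter names are hypothetical and the label format is not necessarily the one ARX produces.

```python
def interval_levels(value, width=10, group_sizes=(2,)):
    """Return one interval label per generalization level for an integer value."""
    labels = []
    size = width
    # Level 1: the base interval, repeated over the whole range of integers.
    lower = (value // size) * size
    labels.append(f"[{lower}, {lower + size})")
    # Further levels: each group merges `group` elements of the previous level.
    for group in group_sizes:
        size *= group
        lower = (value // size) * size
        labels.append(f"[{lower}, {lower + size})")
    return labels

print(interval_levels(37))   # intervals of length 10, then groups of two
print(interval_levels(-5))   # the repetition also covers negative values
```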
For creating transformation rules, each element is associated with an aggregate function (W2). These functions implement means for creating labels for intervals, groups and values that are to be translated into a common generalized value. In case of intervals, these functions are applied to the boundaries of each interval. Currently, the following aggregate functions are supported:
- Set: a set-representation of input values is returned.
- Prefix: a set of prefixes of the input values is returned. A parameter allows defining the length of these prefixes.
- Common-prefix: returns the largest common prefix.
- Bounds: returns the first and last elements of the set.
- Interval: an interval between the minimal and maximal value is returned.
- Constant: returns a predefined constant value.
Clicking on an interval or a group selects an editor that allows altering its further parameters. Elements can be removed, added and merged by right-clicking their graphical representation. Intervals are defined by a minimum (inclusive) and maximum (exclusive) boundary. Groups are defined by their size, as is shown in the following screenshot:
Interval-based hierarchies may define ranges in which they are to be applied. Any value outside of the "label" range will produce an error message. This can be used for sanity checks. Any value within the "snap" range will be added to the first/last interval within the "repeat" range. Within the "repeat" range, the sequences of intervals and groups will be repeated. The first and last intervals within the "repeat" range will be adjusted to cover the "snap" range.
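To make the difference between these aggregate functions more tangible, the following Python sketch implements one possible reading of each of them. The function names and the exact label formats are assumptions for illustration; the labels generated by ARX may look different.

```python
import os

def agg_set(values):
    """Set: a set-like representation of all input values."""
    return "{" + ", ".join(str(v) for v in values) + "}"

def agg_prefix(values, length=1):
    """Prefix: the distinct prefixes of the given length."""
    prefixes = list(dict.fromkeys(str(v)[:length] for v in values))
    return "{" + ", ".join(prefixes) + "}"

def agg_common_prefix(values):
    """Common-prefix: the largest prefix shared by all values."""
    return os.path.commonprefix([str(v) for v in values])

def agg_bounds(values):
    """Bounds: the first and last elements, in the given order."""
    return f"[{values[0]}, {values[-1]}]"

def agg_interval(values):
    """Interval: the range between the minimal and maximal value."""
    return f"[{min(values)}, {max(values)}]"

def agg_constant(values, constant="*"):
    """Constant: a predefined value, independent of the input."""
    return constant

zip_codes = ["81667", "81675", "81925"]
print(agg_set(zip_codes), agg_prefix(zip_codes, 3), agg_common_prefix(zip_codes))
numbers = [3, 7, 5]
print(agg_bounds(numbers), agg_interval(numbers), agg_constant(numbers))
```

Note that "Bounds" and "Interval" only differ when the input values are not sorted.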
Order-based hierarchies follow a similar principle as interval-based hierarchies, but they can also be applied to attributes that only have an ordinal scale. In addition to the types of attributes covered by interval-based hierarchies this includes strings, using their lexicographical order, and ordered strings. First, the values within the domain are ordered in a user-defined manner, or as defined by the data type. Second, the ordered values can be grouped using a mechanism similar to the one used for interval-based hierarchies. Note that order-based hierarchies are especially useful for ordered strings and therefore display the complete domain of an attribute instead of only the values contained in the concrete data set. The mechanism can be used to create semantic hierarchies from a pre-defined meaningful ordering of the domain of a discrete variable. Subsequent generalizations of values from the domain can be labeled with user-defined constants.
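As a rough sketch of this principle, the following Python snippet groups an ordered domain into groups of a fixed size and labels each group with a user-defined constant. The domain, the labels and the function name are hypothetical examples, not part of ARX.

```python
def order_based_level(ordered_domain, group_size, labels=None):
    """Map each value of an ordered domain to the label of its group."""
    mapping = {}
    for start in range(0, len(ordered_domain), group_size):
        group = ordered_domain[start:start + group_size]
        number = start // group_size
        # Use a user-defined constant if available, else a set-like label.
        label = labels[number] if labels else "{" + ", ".join(group) + "}"
        for value in group:
            mapping[value] = label
    return mapping

# Hypothetical ordered string domain
education = ["Primary school", "High school", "Bachelor", "Master"]
level_1 = order_based_level(education, 2, labels=["School", "University"])
print(level_1["Bachelor"])
```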
In a final step, all wizards show a tabular representation of the resulting hierarchy for the current input data set. Additionally, the number of groups on each level is computed. The abstract specification created in the process can be exported and imported to allow reuse for different data sets with similar attributes.
Privacy, population, costs and benefits
Three types of privacy threats are commonly considered when de-identifying data:
- Membership disclosure means that data linkage allows an attacker to determine whether or not data about an individual is contained in a data set. While this does not directly disclose any information from the data set itself, it may allow an attacker to infer meta-information. While this deals with implicit sensitive attributes (meaning attributes of an individual that are not contained in the data set), other disclosure models deal with explicit sensitive attributes.
- Attribute disclosure may be achieved even without linking an individual to a specific item in a data set. It concerns sensitive attributes, which are attributes from the data set with which individuals are not willing to be linked. As such, they might be of interest to an attacker and, if disclosed, could cause harm to data subjects. As an example, linkage to a set of data entries allows inferring information if all items share a certain sensitive attribute value.
- Identity disclosure (or re-identification) means that an individual can be linked to a specific data entry. This is a very serious type of attack, as it has legal consequences for data owners according to many laws and regulations worldwide. From the definition it also follows that an attacker can learn all sensitive information contained in the data entry about the individual.
ARX supports the following privacy models:
- Against membership disclosure: δ-presence.
- Against attribute disclosure: l-diversity, t-closeness and δ-disclosure privacy.
- Against identity disclosure: k-Anonymity, k-Map and risk-based privacy models for prosecutor, journalist and marketer risks.
Moreover, the profitability privacy model implements a game-theoretic approach for performing monetary cost/benefit analyses (considering re-identification risks and data quality) to create de-identified data sets which maximize the profit of the data publisher. Additionally, the tool supports a non-interactive implementation of (ε,δ)-differential privacy.
This area allows selecting and configuring one or multiple of these privacy models for de-identifying the data set:
Models that have been selected for de-identifying the data set are displayed in a table. Privacy models can be added or removed by clicking the plus and minus button, respectively. The third button allows changing their parameterization. With the up- and down-arrows it is possible to transfer parameterizations between privacy models against attribute disclosure.
Most buttons will bring up the following configuration dialog. Here, the down-arrow can be used to select a parameterization out of a set of common parameterizations for the selected privacy model.
k-Anonymity, k-Map, δ-presence, risk-based privacy models and differential privacy apply to all quasi-identifiers and can therefore always be enabled. In contrast, l-diversity, t-closeness and δ-disclosure privacy protect specific sensitive attributes. They can thus only be enabled if a sensitive attribute is currently selected.
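The difference between models that apply to all quasi-identifiers and models that protect a specific sensitive attribute can be illustrated with a minimal Python sketch. It checks k-anonymity and distinct l-diversity (the simplest l-diversity variant) on a small, hypothetical output data set; it is not how ARX implements these models.

```python
from collections import defaultdict

def equivalence_classes(rows, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[a] for a in quasi_identifiers)].append(row)
    return list(classes.values())

def is_k_anonymous(rows, quasi_identifiers, k):
    """Every equivalence class must contain at least k records."""
    return all(len(c) >= k for c in equivalence_classes(rows, quasi_identifiers))

def is_distinct_l_diverse(rows, quasi_identifiers, sensitive, l):
    """Every class must contain at least l distinct sensitive values."""
    return all(len({row[sensitive] for row in c}) >= l
               for c in equivalence_classes(rows, quasi_identifiers))

rows = [
    {"age": "[20, 40)", "zip": "816**", "diagnosis": "flu"},
    {"age": "[20, 40)", "zip": "816**", "diagnosis": "asthma"},
    {"age": "[40, 60)", "zip": "816**", "diagnosis": "flu"},
    {"age": "[40, 60)", "zip": "816**", "diagnosis": "flu"},
]
qis = ["age", "zip"]
print(is_k_anonymous(rows, qis, 2))
print(is_distinct_l_diverse(rows, qis, "diagnosis", 2))
```

The example data set is 2-anonymous, but the second equivalence class discloses the diagnosis of everyone in it, which is exactly the attribute disclosure that l-diversity is meant to prevent.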
Note: If "hierarchical distance" is used as a ground-distance for t-closeness, a generalization hierarchy must be specified for the respective attribute.
Note: Entropy-l-diversity can be configured to use traditional Shannon entropy or the corrected Grassberger estimator.
Note: k-Map and δ-presence require specifying a research sample and population data.
Note: If a model based on population uniqueness is used, the underlying population must also be specified. This can be done with the following section of the perspective:
Note: Monetary cost/benefit analyses require the configuration of various parameters which can be found in an associated section of the perspective:
The quality model publisher payout can be used to optimize the monetary gain of the data publisher. In the configuration section, the following parameters must be specified:
- Adversary cost: the amount of money needed by an attacker for trying to re-identify a single record.
- Adversary gain: the amount of money earned by an attacker for successfully re-identifying a single record.
- Publisher benefit: the amount of money earned by the data publisher for publishing a single record.
- Publisher loss: the amount of money lost by the data publisher, e.g. due to a fine, if a single record is attacked successfully.
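How these four parameters may interact can be sketched with a strongly simplified model. The sketch assumes that a record in an equivalence class of size n is re-identified with probability 1/n, that a rational adversary only attacks if the expected gain exceeds the cost, and that the publisher benefit does not depend on data quality. These are assumptions made for illustration; the model implemented by ARX is more elaborate.

```python
def expected_payout(class_size, adversary_cost, adversary_gain,
                    publisher_benefit, publisher_loss):
    """Expected payout of the publisher for one record (simplified sketch)."""
    success_probability = 1.0 / class_size
    # A rational adversary only attacks if this is expected to be profitable.
    attacks = adversary_gain * success_probability > adversary_cost
    expected_loss = publisher_loss * success_probability if attacks else 0.0
    return publisher_benefit - expected_loss

# Hypothetical monetary values
print(expected_payout(1, adversary_cost=4, adversary_gain=300,
                      publisher_benefit=1200, publisher_loss=300))
print(expected_payout(100, adversary_cost=4, adversary_gain=300,
                      publisher_benefit=1200, publisher_loss=300))
```

In this reading, larger equivalence classes make an attack unprofitable, so the publisher keeps the full benefit for the affected records.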
Transformation and utility model
In the first tab of this view, general properties of the transformation process can be specified.
The first slider allows defining the suppression limit, which is the maximal proportion of outliers that can be tolerated in the de-identified data set. Records that are outliers will be removed from the data set. The recommended value for this parameter is "100%". The option "Approximate" can be enabled to compute an approximate solution with potentially significantly reduced execution times. The solution is guaranteed to fulfill the given privacy model, but it might not be optimal. The recommended setting is "off". For some utility measures, e.g. Non-Uniform Entropy and Loss, precomputation can be enabled, which may also significantly reduce execution times. Precomputation is switched on when, for each quasi-identifier, the number of distinct data values divided by the total number of records in the data set is lower than the configured threshold. Experiments have shown that 0.3 is often a good value for this parameter. The recommended setting is "on".
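The condition under which precomputation is switched on can be written down directly. The following Python sketch mirrors the rule described above; the function name and the sample data are hypothetical.

```python
def precomputation_applies(rows, quasi_identifiers, threshold=0.3):
    """Check whether every quasi-identifier has a distinct-value ratio below the threshold."""
    total = len(rows)
    if total == 0:
        return False
    for attribute in quasi_identifiers:
        distinct = len({row[attribute] for row in rows})
        if distinct / total >= threshold:
            return False
    return True

# Ten hypothetical records: 2 distinct values for "sex", 5 for "zip"
rows = [{"sex": "Male" if i % 2 else "Female", "zip": str(81600 + i % 5)}
        for i in range(10)]
print(precomputation_applies(rows, ["sex"]))         # ratio 0.2
print(precomputation_applies(rows, ["sex", "zip"]))  # ratio 0.5 for "zip"
```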
The second tab allows specifying the measure that is to be used for estimating data utility.
ARX will display data quality / utility in terms of "scores". A lower score means higher data quality, lower loss of information or higher publisher payout, depending on the selected quality model. ARX currently supports the following models:
- Average equivalence class size.
- Non-uniform entropy.
- Normalized non-uniform entropy.
- KL divergence.
- Publisher payout.
- Entropy-based record-level information loss.
Monotonicity is a property of privacy models and utility measures that can be exploited to make the de-identification process more efficient. In real-world settings, privacy models and utility measures are only rarely monotonic. ARX can be configured to always assume monotonicity, which will speed up the de-identification process significantly but which may reduce the quality of output data. The recommended setting is "off".
Since version 2.3, ARX also supports user-defined aggregate functions for many measures. These aggregate functions are used to compile the utility estimates obtained for the individual attributes of a data set into a global value. The recommended setting is "Arithmetic mean". ARX currently supports the following aggregate functions:
- Arithmetic mean.
- Geometric mean.
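The effect of the two aggregate functions on a global utility value can be compared with a short Python sketch. The per-attribute loss values are hypothetical and assumed to lie between 0 and 1.

```python
import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    # Note: a single value of zero results in a global value of zero.
    return math.prod(values) ** (1.0 / len(values))

# Hypothetical information loss per quasi-identifier
losses = [0.2, 0.8]
print(arithmetic_mean(losses))
print(geometric_mean(losses))
```

For unequal values the geometric mean is always smaller than the arithmetic mean, so it gives more influence to attributes with little loss of information.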
Some utility models also support considering the effect of microaggregation on data quality. If supported, it can be specified whether the mean squared error or a simple notion of information loss should be included.
For most measures, attributes may be weighted as is shown in the following screenshot.
Each of the knobs may be used to associate a weight with a specific quasi-identifier. When de-identifying a data set, ARX will try to reduce the loss of information for attributes with higher weights.
If "Loss" has been selected for measuring data utility, the coding model may also be further specified in an additional tab.
With the slider it can be configured whether ARX should prefer generalization or suppression for de-identifying the data set.
Defining a research sample
In this view, a research sample can be specified. It represents a sample of the underlying population that will be de-identified and exported. The population can be approximated by loading further data into ARX. This information can be used for the δ-presence and k-Map privacy models as well as for analyses of re-identification risks. The buttons in the top-right corner implement different options for extracting a sample from the data set:
- Selecting no data records.
- Selecting all data records.
- Selecting some records by matching the current population data against an external sample.
- Selecting some records by querying the data set.
- Selecting some records with random sampling.
The view will show the size of the current sample and the method with which it was created. At any time, the research sample can be altered by clicking the checkboxes shown in the data tables. The query syntax is as follows: fields and constants must be enclosed in single quotes. In the example 'age' is a field and '50' is a constant. The following operators are supported: >, >=, <, <=, =, or, and, ( and ). The dialog dynamically shows the number of rows matching the query and displays error messages in case of syntactic errors.
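To illustrate the query syntax, the following Python sketch evaluates such a query against a single record. It is not the parser used by ARX. It assumes that a quoted token denotes a field if it matches a column name and a constant otherwise, and that values are compared numerically whenever both sides are numbers; the function names are hypothetical.

```python
import operator
import re

TOKEN = re.compile(r"\s*('[^']*'|>=|<=|>|<|=|\(|\)|and\b|or\b)")
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "=": operator.eq}

def tokenize(query):
    tokens, position = [], 0
    query = query.rstrip()
    while position < len(query):
        match = TOKEN.match(query, position)
        if not match:
            raise ValueError(f"Syntax error at position {position}")
        tokens.append(match.group(1))
        position = match.end()
    return tokens

def matches(query, row):
    """Return True if the record `row` (a dict) matches the query."""
    tokens, pos = tokenize(query), 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("Unexpected end of query")
        pos += 1
        return tokens[pos - 1]

    def operand():
        token = take()
        if not token.startswith("'"):
            raise ValueError(f"Expected a quoted field or constant, got {token}")
        name = token[1:-1]
        return row[name] if name in row else name  # field or constant

    def comparison():
        if peek() == "(":
            take()
            result = disjunction()
            if take() != ")":
                raise ValueError("Expected )")
            return result
        left, op, right = operand(), take(), operand()
        if op not in OPS:
            raise ValueError(f"Unknown operator {op}")
        try:
            left, right = float(left), float(right)
        except ValueError:
            left, right = str(left), str(right)
        return OPS[op](left, right)

    def conjunction():
        result = comparison()
        while peek() == "and":
            take()
            result = comparison() and result
        return result

    def disjunction():
        result = conjunction()
        while peek() == "or":
            take()
            result = conjunction() or result
        return result

    result = disjunction()
    if pos != len(tokens):
        raise ValueError(f"Unexpected token {tokens[pos]}")
    return result

record = {"age": 61, "sex": "Male"}
print(matches("'age' > '50' and ('sex' = 'Male' or 'sex' = 'Female')", record))
```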
The settings window is accessible from either the application menu or the application toolbar. The Project section allows altering general project properties, including the syntax for importing and exporting data.
The Internals section allows adjusting some parameters of the optimizations implemented by ARX. Changing the number of snapshots enables users to control a space-time trade-off used during data de-identification. Choosing a value of 1 significantly reduces the main memory requirements of the tool, but may result in increased execution times. Choosing a larger value will reduce execution times, but increase memory consumption. The default value is 200. Finally, this area allows controlling the visualizations of the solution space in the exploration perspective.
The section Transformation provides options for altering aspects of the data transformations that are performed by ARX. First, it can be specified which types of attributes are to be masked in records that are outliers. The suppression string is used as a masking character.
In the section Search method the maximal size of the search space can be specified. If this size is exceeded, a heuristic search will be performed. The heuristic search process will in turn be terminated after a user-defined amount of time. When a risk-based de-identification process is performed, ARX can either suppress records based on the sizes of their equivalence classes or use the utility measure for this purpose.
In the Utility analysis section, users can control properties of the summary statistics computed by ARX and specify whether functional representations of generalization hierarchies should be used for computing data utility. Moreover, some aspects of the logistic regression classification algorithm implemented by ARX can be specified.
The section Risk analysis allows defining parameters of the Newton-Raphson method used for solving bivariate non-linear equation systems when evaluating population-based risk models. Parameters include the total number of iterations, iterations per try and the required accuracy.
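For readers unfamiliar with the method, the following Python sketch shows a generic Newton-Raphson iteration for a system of two non-linear equations, with a maximal number of iterations per try and a required accuracy. The equation system, the starting point and all names are hypothetical; the sketch does not reproduce the equations solved by ARX.

```python
def newton_raphson_2d(f, jacobian, start, accuracy=1e-9, iterations_per_try=100):
    """Solve f(x, y) = (0, 0); return (x, y) or None if the try fails."""
    x, y = start
    for _ in range(iterations_per_try):
        f1, f2 = f(x, y)
        if abs(f1) < accuracy and abs(f2) < accuracy:
            return x, y
        (a, b), (c, d) = jacobian(x, y)
        det = a * d - b * c
        if det == 0:
            return None  # singular Jacobian, this try fails
        # Solve J * (dx, dy) = -(f1, f2) with Cramer's rule.
        dx = (-f1 * d + f2 * b) / det
        dy = (-a * f2 + c * f1) / det
        x, y = x + dx, y + dy
    return None

# Hypothetical system: x^2 + y^2 = 4 and x * y = 1
def f(x, y):
    return x * x + y * y - 4, x * y - 1

def jacobian(x, y):
    return (2 * x, 2 * y), (y, x)

solution = newton_raphson_2d(f, jacobian, start=(2.0, 0.5))
print(solution)
```

If a try fails, a solver can restart from a different starting point until a total iteration budget is used up, which is one way to read the "total number of iterations" and "iterations per try" parameters.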