Viewing data sets
Specifying attribute properties
Creating generalization hierarchies
Privacy, population, costs and benefits
Transformation and utility model
Defining a research sample
Configuring the de-identification process
In the configuration perspective, data can be imported, generalization hierarchies can be created, and the anonymization process can be parameterized.
The perspective is divided into five main areas. Area 1 shows the current input data set:
- The data import process supports a variety of data sources including Excel, JDBC databases and character-separated value files with user-defined syntax.
- For each attribute, data type, data format, and sensitivity can be specified.
- Generalization hierarchies can be created in interval, ordinal and mask form for quasi-identifier and sensitive attributes. Hierarchies can be created with software assistance or imported from CSV files.
- Multiple privacy models can be selected and configured. For applicable models, population and financial data can also be specified.
- A single utility measure can be selected and configured. For applicable utility measures, settings for suppression, generalization, aggregation, monotonicity and the precomputation threshold can also be specified.
- The research subset can be selected manually, randomly, by query or from a file.
Viewing data sets
ARX displays input and output data as tables. In the table headers, attribute types are indicated by displaying different colors:
- Red indicates an identifying attribute. Identifying attributes are associated with a high risk of re-identification. They will be removed from the data set. Typical examples are names or Social Security Numbers.
- Yellow indicates a quasi-identifying attribute. Quasi-identifying attributes are associated with a high risk of re-identification. They will be transformed by means of generalization, suppression or micro-aggregation. Typical examples are gender, date of birth and ZIP codes.
- Purple indicates a sensitive attribute. Sensitive attributes encode properties with which individuals are not willing to be linked. As such, they might be of interest to an attacker and, if disclosed, could cause harm to data subjects. They will be kept unmodified but may be subject to further constraints, such as t-closeness or l-diversity. Typical examples are diagnoses.
- Green indicates an insensitive attribute. Insensitive attributes are not associated with any privacy risks. They will be kept unmodified.
Each row is further associated with a checkbox. These checkboxes indicate which rows have been selected for inclusion in the research sample. The checkboxes in the view for output data sets indicate the sample that was specified when the de-identification process was last executed; they cannot be edited. The checkboxes in the view for the input data set represent the current research sample and are editable.
Each table offers a few options that are accessible via buttons in the top-right corner:
- Pressing the first button will sort the data according to the currently selected column.
- Pressing the second button will sort the output data set according to all quasi-identifiers and then highlight the equivalence classes.
- Pressing the third button will toggle between showing all records of the data set or only the research sample.
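The highlighting applied by the second button is based on equivalence classes, i.e. groups of records that share the same values on all quasi-identifiers. This grouping can be sketched as follows (a minimal illustration in Python, not ARX's implementation; the example records and attribute names are made up):

```python
from itertools import groupby

def equivalence_classes(records, quasi_identifiers):
    """Group records into equivalence classes: sets of records that
    share the same values on all quasi-identifiers."""
    key = lambda r: tuple(r[q] for q in quasi_identifiers)
    rows = sorted(records, key=key)
    return {k: list(g) for k, g in groupby(rows, key=key)}

records = [
    {"gender": "male",   "zip": "812**", "diagnosis": "flu"},
    {"gender": "male",   "zip": "812**", "diagnosis": "cold"},
    {"gender": "female", "zip": "803**", "diagnosis": "flu"},
]
classes = equivalence_classes(records, ["gender", "zip"])
print({k: len(v) for k, v in classes.items()})
```

Here, the first two records form one equivalence class because they agree on both quasi-identifiers, while the third record forms a class of its own.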
Specifying attribute properties
Attribute properties are set in the Transformation and Metadata tabs within the Configuration tab. To set an attribute's properties, first select an attribute in the input tab (e.g. age in the figure below). The transformation and metadata tabs will be linked with that attribute and updated accordingly.
The transformation tab is used to set the type of privacy risk and transformation method associated with an attribute. The transformation tab also displays an attribute's generalization hierarchy and provides a context menu for in-place editing.
Types of privacy risk include:
- Identifying attributes are associated with a high risk of re-identification. They will be removed from the data set. Typical examples are names or Social Security Numbers.
- Quasi-identifying attributes are associated with a high risk of re-identification. They will be transformed by means of generalization, suppression or micro-aggregation. Typical examples are gender, date of birth and ZIP codes.
- Sensitive attributes encode properties with which individuals are not willing to be linked. As such, they might be of interest to an attacker and, if disclosed, could cause harm to data subjects. They will be kept unmodified but may be subject to further constraints, such as t-closeness or l-diversity. Typical examples are diagnoses.
- Insensitive attributes are not associated with any privacy risks. They will be kept unmodified.
The assigned risk type is also indicated by a color-coded bullet next to the attribute name in the input tab.
Transformation methods include generalization and micro-aggregation. When generalization is selected, minimum and maximum applicable generalization levels can be set. When micro-aggregation is selected, the aggregation function and handling of missing data can be set.
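The effect of micro-aggregation can be illustrated with a minimal sketch for a numeric attribute (the aggregate functions shown are illustrative, not ARX's exact implementations):

```python
def microaggregate(values, function="mean"):
    """Micro-aggregation sketch: the values of an attribute within an
    equivalence class are replaced by a single aggregate value."""
    nums = sorted(float(v) for v in values)
    if function == "mean":
        return sum(nums) / len(nums)
    if function == "median":
        mid = len(nums) // 2
        return nums[mid] if len(nums) % 2 else (nums[mid - 1] + nums[mid]) / 2
    raise ValueError(f"unknown function: {function}")

print(microaggregate(["34", "45", "23"]))              # 34.0
print(microaggregate(["1", "2", "3", "4"], "median"))  # 2.5
```

In contrast to generalization, which replaces values with coarser labels, micro-aggregation keeps the attribute numeric but removes the ability to distinguish individual records within a class.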
When available, an attribute's generalization hierarchy is displayed in tabular form below the risk type and transformation inputs. Attribute values in their most specific form are displayed in the leftmost column, with generalization levels increasing from left to right. Right-clicking the hierarchy table displays an editor that can be used to create a simple hierarchy or perform row, column, and cell updates on an existing hierarchy. To create complex hierarchies, consider using the Hierarchy Wizard, which can be launched from the Edit menu and the application toolbar.
The metadata tab is used to set the data type and format of an attribute.
Supported data types include:
- String: a generic sequence of characters. This is the default data type.
- Integer: a data type for numbers without a fractional component.
- Decimal: a data type for numbers with a fractional component.
- Date/time: a data type for dates and time stamps.
- Ordered string: this data type represents strings with ordinal scale.
The data type is set by right-clicking an attribute's row in the metadata table. For applicable data types, the format is prompted for automatically. Specifying a data type is optional, with String used by default, but doing so yields better results when creating generalization hierarchies and when visualizing data in the analysis perspective.
Information on decimal formats can be found at http://docs.oracle.com/javase/7/docs/api/java/text/DecimalFormat.html.
Information on date formats can be found at http://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html.
Creating generalization hierarchies
ARX offers different methods for creating generalization hierarchies for different types of attributes. Generalization hierarchies created with the wizard are represented in a functional manner, meaning that they can be created for the entire domain of an attribute without explicitly addressing the specific values in a concrete data set. This enables the handling of continuous variables. Moreover, hierarchy specifications can be imported and exported and they can thus be reused for de-identifying different data sets with similar attributes. Please make sure that you first select an appropriate data type for the attribute. The wizard supports three different types of hierarchies:
- Masking-based hierarchies: this general-purpose mechanism allows creating hierarchies for a broad spectrum of attributes.
- Interval-based hierarchies: these hierarchies can be applied to variables with a ratio scale.
- Order-based hierarchies: this method can be used for variables with an ordinal scale.
Masking is a highly flexible mechanism that can be applied to many attributes and is especially suitable for alphanumeric codes, such as ZIP codes or telephone numbers. The following image shows a screenshot of the respective wizard:
In ARX, masking follows a two-step process. First, values are aligned either to the left or to the right. Then the characters are masked, again, either from left to right, or from right to left. All values are adjusted to a common length, by introducing padding characters. This character, as well as the character used for masking can be specified by the user.
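The two-step process can be sketched as follows (a simplified illustration, not ARX's implementation; ARX supports both alignment and masking directions, while this sketch masks from right to left only):

```python
def mask_level(values, level, padding="*", mask_char="*", align_left=True):
    """Masking-based generalization sketch.
    Step 1: pad all values to a common length, aligned left or right.
    Step 2: mask `level` characters, here from right to left."""
    width = max(len(v) for v in values)
    padded = [v.ljust(width, padding) if align_left else v.rjust(width, padding)
              for v in values]
    return [v[:width - level] + mask_char * level if level > 0 else v
            for v in padded]

zips = ["81667", "81675", "816"]
print(mask_level(zips, 2))  # ['816**', '816**', '816**']
```

Each masking level corresponds to one level of the resulting generalization hierarchy: level 0 keeps the padded values, level 1 masks one character, and so on.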
Intervals are a natural means of generalization for values with a ratio scale, such as integers, decimals or dates/time stamps. ARX offers a graphical editor for efficiently defining sets of intervals over the complete range of any of these data types. As shown in the following screenshot, first a sequence of intervals is defined (on the left side of the screenshot). In the next step, subsequent levels consisting of groups can be defined. Each group combines a given number of elements from the previous level. Any sequence of intervals or groups is automatically repeated to cover the complete range of the attribute. For example, to generalize arbitrary integers into intervals of length 10, only one interval [0, 10] needs to be defined. Defining a group of size two on the next level automatically generalizes integers into groups of size 20. As shown in the image (W1), the editor visually indicates automatically created repetitions of intervals and groups.
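The repetition mechanism can be sketched as follows (a minimal illustration assuming a single base interval of fixed width starting at 0; ARX's editor is more general):

```python
def interval_label(value, width, level_group_sizes):
    """Assign a value to a repeating interval and coarsen it through
    grouping levels: each level multiplies the effective interval size
    by its group size."""
    size = width
    for group_size in level_group_sizes:
        size *= group_size
    lower = (value // size) * size
    return f"[{lower}, {lower + size}["

# Base interval of length 10; one grouping level of size two
# yields effective intervals of length 20.
print(interval_label(27, 10, [2]))  # '[20, 40['
print(interval_label(27, 10, []))   # '[20, 30['
```

This mirrors the example in the text: a single interval of length 10 plus a group of size two generalizes arbitrary integers into intervals of length 20.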
For creating transformation rules, each element is associated with an aggregate function (W2). These functions implement means for creating labels for intervals, groups and values that are to be translated into a common generalized value. In case of intervals, these functions are applied to the boundaries of each interval. Currently, the following aggregate functions are supported:
- Set: a set-representation of input values is returned.
- Prefix: a set of prefixes of the input values is returned. A parameter allows defining the length of these prefixes.
- Common-prefix: returns the largest common prefix.
- Bounds: returns the first and last elements of the set.
- Interval: an interval between the minimal and maximal value is returned.
- Constant: returns a predefined constant value.
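Simplified versions of some of these aggregate functions can be sketched as follows (illustrative only; the label formats are assumptions, not ARX's exact output):

```python
import os

def agg_set(values):
    """Set: a set-representation of the input values."""
    return "{" + ", ".join(values) + "}"

def agg_prefix(values, length=3):
    """Prefix: the set of prefixes of a given length."""
    return "{" + ", ".join(sorted({v[:length] for v in values})) + "}"

def agg_common_prefix(values):
    """Common-prefix: the largest common prefix of the input values."""
    return os.path.commonprefix(list(values))

def agg_bounds(values):
    """Bounds: the first and last elements of the set."""
    return f"[{values[0]}, {values[-1]}]"

def agg_interval(values):
    """Interval: an interval between the minimal and maximal value."""
    return f"[{min(values)}, {max(values)}]"

print(agg_common_prefix(["81667", "81675"]))  # prints 816
print(agg_set(["male", "female"]))            # prints {male, female}
```

For intervals, these functions are applied to the interval boundaries rather than to the raw values, as described above.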
Clicking on an interval or a group opens an editor that allows altering its parameters. Elements can be removed, added and merged by right-clicking their graphical representation. Intervals are defined by a minimum (inclusive) and maximum (exclusive) boundary. Groups are defined by their size, as is shown in the following screenshot:
Interval-based hierarchies can define the ranges in which they are to be applied. Any value outside the "label" range will produce an error message, which can be used for sanity checks. Any value within the "snap" range will be added to the first or last interval within the "repeat" range. Within the "repeat" range, the sequence of intervals or groups will be repeated, and the first and last intervals will be adjusted to cover the "snap" range.
Order-based hierarchies follow a similar principle as interval-based hierarchies, but they can only be applied to attributes with an ordinal scale. In addition to the types of attributes covered by interval-based hierarchies, this includes strings, using their lexicographical order, and ordered strings. First, the values within the domain are ordered in a user-defined manner, or as defined by the data type. Second, the ordered values can be grouped using a mechanism similar to the one used for interval-based hierarchies. Note that order-based hierarchies are especially useful for ordered strings and therefore display the complete domain of an attribute instead of only the values contained in the concrete data set. The mechanism can be used to create semantic hierarchies from a pre-defined meaningful ordering of the domain of a discrete variable. Subsequent generalizations of values from the domain can be labeled with user-defined constants.
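The grouping step can be sketched as follows (a minimal illustration; the example domain and group sizes are made up):

```python
def order_based_levels(domain, group_sizes):
    """Build grouping levels over an ordered domain: each level merges
    `size` adjacent groups from the previous level."""
    levels = [[[v] for v in domain]]
    for size in group_sizes:
        prev = levels[-1]
        levels.append([sum(prev[i:i + size], [])
                       for i in range(0, len(prev), size)])
    return levels

# Hypothetical ordered domain of an 'education' attribute
domain = ["primary", "secondary", "bachelor", "master", "phd"]
levels = order_based_levels(domain, [2, 2])
print(levels[1])  # [['primary', 'secondary'], ['bachelor', 'master'], ['phd']]
```

Each resulting group on each level would then be assigned a label via one of the aggregate functions, or via a user-defined constant.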
In a final step, all wizards show a tabular representation of the resulting hierarchy for the current input data set. Additionally, the number of groups on each level is computed. The abstract specification created in the process can be exported and imported to allow reuse for different data sets with similar attributes.
Privacy, population, costs and benefits
Three types of privacy threats are commonly considered when de-identifying data:
- Membership disclosure means that data linkage allows an attacker to determine whether or not data about an individual is contained in a data set. While this does not directly disclose any information from the data set itself, it may allow an attacker to infer meta-information. While this deals with implicit sensitive attributes (meaning attributes of an individual that are not contained in the data set), other disclosure models deal with explicit sensitive attributes.
- Attribute disclosure may be achieved even without linking an individual to a specific item in a data set. It concerns sensitive attributes, which are attributes from the data set with which individuals are not willing to be linked. As such, they might be of interest to an attacker and, if disclosed, could cause harm to data subjects. As an example, linkage to a set of data entries allows inferring information if all items share a certain sensitive attribute value.
- Identity disclosure (or re-identification) means that an individual can be linked to a specific data entry. This is a very serious type of attack, as it has legal consequences for data owners according to many laws and regulations worldwide. From the definition it also follows that an attacker can learn all sensitive information contained in the data entry about the individual.
ARX supports the following privacy models: against membership disclosure, δ-presence; against attribute disclosure, l-diversity, t-closeness and δ-disclosure privacy; against identity disclosure, k-Anonymity, k-Map and risk-based privacy models for prosecutor, journalist and marketer risks. Moreover, the profitability privacy model implements a game-theoretic approach for performing monetary cost/benefit analyses (considering re-identification risks and data quality) to create de-identified data sets which maximize the profit of the data publisher. Additionally, the tool supports a non-interactive implementation of (ε,δ)-differential privacy.
This area allows selecting and configuring one or more of these privacy models for de-identifying the data set:
Models that have been selected for de-identifying the data set are displayed in a table. Privacy criteria can be added or removed by clicking the plus and minus buttons, respectively. The third button allows changing their parameterization. With the up- and down-arrows, it is possible to transfer parameterizations between privacy models against attribute disclosure.
Most buttons will bring up the following configuration dialog. Here, the down-arrow can be used to select a parameterization from a set of common parameterizations for the selected privacy model.
k-Anonymity, k-Map, δ-presence, risk-based privacy models and differential privacy apply to all quasi-identifiers and can therefore always be enabled. In contrast, l-diversity, t-closeness and δ-disclosure privacy protect specific sensitive attributes. They can thus only be enabled if a sensitive attribute is currently selected.
Note: If "hierarchical distance" is used as a ground-distance for δ-presence, a generalization hierarchy must be specified for the respective attribute.
Note: Entropy-l-diversity can be configured to use traditional Shannon entropy or the corrected Grassberger estimator.
Note: k-Map and δ-presence require specifying a research sample and population data.
Note: If a model based on population uniqueness is used, the underlying population must also be specified. This can be done with the following section of the perspective:
Note: Methods for estimating population uniqueness assume that the dataset is a uniform sample of the population. If this is not the case, results may be inaccurate.
Note: Monetary cost/benefit analyses require the configuration of various parameters which can be found in an associated section of the perspective:
The quality model "publisher payout" can be used to optimize the monetary gain of the data publisher. In the configuration section, the following parameters must be specified:
- Adversary cost: the amount of money needed by an attacker for trying to re-identify a single record.
- Adversary gain: the amount of money earned by an attacker for successfully re-identifying a single record.
- Publisher benefit: the amount of money earned by the data publisher for publishing a single record.
- Publisher loss: the amount of money lost by the data publisher, e.g. due to a fine, if a single record is attacked successfully.
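The interplay of these parameters can be illustrated with a simplified sketch of the underlying game-theoretic reasoning (this is an illustration, not ARX's exact model: here, a rational adversary attacks a record only if the expected gain exceeds the attack cost):

```python
def publisher_payout(risks, adversary_cost, adversary_gain,
                     publisher_benefit, publisher_loss):
    """Simplified cost/benefit sketch: the publisher earns a benefit per
    published record; a rational adversary attacks a record only if the
    expected gain (re-identification probability times gain) exceeds the
    attack cost, in which case the publisher expects a proportional loss."""
    payout = 0.0
    for risk in risks:  # per-record re-identification probability
        payout += publisher_benefit
        if adversary_gain * risk > adversary_cost:
            payout -= publisher_loss * risk
    return payout

# Two low-risk records and one high-risk record (hypothetical numbers)
print(publisher_payout([0.05, 0.05, 0.9], adversary_cost=4.0,
                       adversary_gain=10.0, publisher_benefit=2.0,
                       publisher_loss=5.0))  # 1.5
```

In this sketch, only the high-risk record is worth attacking (expected gain 9.0 exceeds the cost of 4.0), so it alone reduces the publisher's payout.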
Transformation and utility model
In the first tab of this view, general properties of the transformation process can be specified.
The first slider defines the suppression limit, which is the maximal number of outliers that can be tolerated in the de-identified data set. Records that are outliers will be removed from the data set. The recommended value for this parameter is "100%". The option "Approximate" can be enabled to compute an approximate solution with potentially significantly reduced execution times. The solution is guaranteed to fulfill the given privacy model, but it might not be optimal. The recommended setting is "off". For some utility measures, e.g. Non-Uniform Entropy and Loss, precomputation can be enabled, which may also significantly reduce execution times. Precomputation is switched on when, for each quasi-identifier, the number of distinct data values divided by the total number of records in the data set is lower than the configured threshold. Experiments have shown that 0.3 is often a good value for this parameter. The recommended setting is "on".
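The precomputation criterion can be sketched as follows (a minimal illustration; the example data is made up):

```python
def precomputation_enabled(columns, num_records, threshold=0.3):
    """Precomputation criterion as described above: enabled only if,
    for every quasi-identifier, the number of distinct values divided
    by the total number of records is below the threshold."""
    return all(len(set(col)) / num_records < threshold for col in columns)

# Hypothetical quasi-identifier columns of a 10-record data set
ages = [34, 45, 34, 34, 45, 34, 34, 34, 45, 34]
zips = ["81667", "81675", "81667", "81667", "81675",
        "81667", "81675", "81667", "81667", "81675"]
print(precomputation_enabled([ages, zips], 10))  # True (ratios 0.2 and 0.2)
```

With many distinct values per quasi-identifier (e.g. a unique value in every record), the ratio approaches 1 and precomputation stays disabled.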
The second tab allows specifying the measure that is to be used for estimating data utility.
ARX will display data quality / utility in terms of "scores". A lower score means higher data quality, lower loss of information or higher publisher payout, depending on the selected quality model. ARX currently supports the following models:
- Average equivalence class size.
- Non-uniform entropy.
- Normalized non-uniform entropy.
- KL divergence.
- Publisher payout.
- Entropy-based record-level information loss.
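As an illustration, a simplified reading of the first model can be sketched as follows (the average equivalence class size is the number of records divided by the number of distinct quasi-identifier combinations; this is an illustration, not ARX's exact implementation):

```python
def average_class_size(records, quasi_identifiers):
    """Average equivalence class size: the number of records divided by
    the number of distinct quasi-identifier combinations."""
    classes = {tuple(r[q] for q in quasi_identifiers) for r in records}
    return len(records) / len(classes)

rows = [{"gender": "m", "zip": "816**"},
        {"gender": "m", "zip": "816**"},
        {"gender": "f", "zip": "816**"},
        {"gender": "f", "zip": "816**"}]
print(average_class_size(rows, ["gender", "zip"]))  # 2.0
```

Smaller average class sizes indicate less generalization and therefore higher data quality, matching the "lower score is better" convention.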
Monotonicity is a property of privacy models and utility measures that can be exploited to make the de-identification process more efficient. In real-world settings, privacy models and utility measures are only rarely monotonic. ARX can be configured to always assume monotonicity, which will speed up the de-identification process significantly but may reduce the quality of output data. The recommended setting is "off". Since version 2.3, ARX also supports user-defined aggregate functions for many measures. These aggregate functions are used to compile the utility estimates obtained for the individual attributes of a data set into a global value. The recommended setting is "Arithmetic mean". ARX currently supports the following aggregate functions:
- Arithmetic mean.
- Geometric mean.
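Combining per-attribute estimates with these functions can be sketched as follows (illustrative; the example scores are made up):

```python
import math

def aggregate_scores(per_attribute_scores, function="arithmetic"):
    """Combine per-attribute utility estimates into a single global
    value using one of the two aggregate functions."""
    n = len(per_attribute_scores)
    if function == "arithmetic":
        return sum(per_attribute_scores) / n
    if function == "geometric":
        return math.prod(per_attribute_scores) ** (1.0 / n)
    raise ValueError(f"unknown aggregate function: {function}")

scores = [0.25, 0.5, 0.8]  # hypothetical per-attribute scores
print(aggregate_scores(scores, "arithmetic"))  # ~0.517
print(aggregate_scores(scores, "geometric"))   # ~0.464
```

The geometric mean penalizes uneven score distributions more strongly than the arithmetic mean, which is why the two functions can rank transformations differently.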
Some utility models also support considering the effect of microaggregation on data quality. If supported, it can be specified whether the mean squared error or a simple notion of information loss should be included.
For most measures, attributes may be weighted as is shown in the following screen shot.
Each of the knobs may be used to associate a weight with a specific quasi-identifier. When de-identifying a data set, ARX will try to reduce the loss of information for attributes with higher weights.
If "Loss" has been selected for measuring data utility, the coding model may also be further specified in an additional tab.
With the slider it can be configured whether ARX should prefer generalization or suppression for de-identifying the data set.
Defining a research sample
In this view, a research sample can be specified. It represents a sample of the underlying population that will be de-identified and exported. The population can be approximated by loading further data into ARX. This information can be used for the δ-presence and k-map privacy models as well as for analyses of re-identification risks. The buttons in the top-right corner implement different options for extracting a sample from the data set:
- Selecting no data records.
- Selecting all data records.
- Selecting some records, by matching the current population data against an external sample.
- Selecting some records by querying the data set.
- Selecting some records with random sampling.
The view will show the size of the current sample and the method with which it was created. At any time, the research sample can be altered by clicking the checkboxes shown in the data tables. The query syntax is as follows: fields and constants must be enclosed in single quotes. For example, in the query 'age' > '50', 'age' is a field and '50' is a constant. The following operators are supported: >, >=, <, <=, =, or, and, ( and ). The dialog dynamically shows the number of rows matching the query and displays error messages in case of syntactic errors.
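A toy evaluator for this syntax can be sketched as follows (a simplified illustration, not ARX's parser: it supports single-quoted fields and constants and combines terms strictly left to right with 'and'/'or', but omits parentheses and operator precedence):

```python
import re

# Terms of the form 'field' OP 'constant'
_TERM = re.compile(r"'([^']*)'\s*(>=|<=|>|<|=)\s*'([^']*)'")
_OPS = {">": lambda a, b: a > b, ">=": lambda a, b: a >= b,
        "<": lambda a, b: a < b, "<=": lambda a, b: a <= b,
        "=": lambda a, b: a == b}

def _coerce(text):
    """Treat a constant as a number when it looks like one."""
    return float(text) if text.replace(".", "", 1).isdigit() else text

def matches(row, query):
    """Evaluate a query against a single row."""
    result, pos = None, 0
    for m in _TERM.finditer(query):
        field, op, const = m.groups()
        const = _coerce(const)
        value = float(row[field]) if isinstance(const, float) else row[field]
        term = _OPS[op](value, const)
        connector = query[pos:m.start()].strip().lower()
        if connector == "and":
            result = result and term
        elif connector == "or":
            result = result or term
        else:
            result = term
        pos = m.end()
    return bool(result)

row = {"age": "63", "gender": "male"}
print(matches(row, "'age' > '50' and 'gender' = 'male'"))  # True
```

Such a predicate, applied to every row, yields the set of records included in the research sample, analogous to the row count shown dynamically by the dialog.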
The settings window is accessible from either the application menu or the application toolbar. The Project section can be used to alter general project properties:
- Project metadata, including name, description and localization, can be updated.
- The default syntax for file I/O can be specified.
- Snapshot settings control the space-time trade-off during de-identification. Larger snapshot values will typically reduce execution time but increase memory consumption.
- To speed up de-identification for larger datasets, utility analyses can be disabled.
- Outliers can be specified to be removed from non-anonymous transformations, leaving only de-identified records in output data.
- Sensitive and insensitive attributes can be masked in suppressed tuples.
- Time limited heuristic search can be enabled by default or for large search spaces.
- Suppression of records with insufficient utility can be disabled.
- Generalization hierarchies can be configured as functional or explicit, where a functional hierarchy represents the entire domain of a given attribute, while an explicit hierarchy represents only values in the dataset.
- Logistic regression performance, used for generic classification and calculation of precision and recall, can be fine-tuned through a variety of parameters.
- Optimization parameters include the total number of iterations, iterations per try and the required accuracy.