Privacy models

In the context of statistical disclosure control, three different types of privacy threats are commonly considered.

Membership disclosure means that an attacker is able to determine whether or not an individual is contained in a dataset by using quasi-identifying attributes. From this, additional personal information can potentially be inferred, e.g., if the data stems from a cancer registry.

The two remaining types of disclosure deal with sensitive attributes. These are attributes in which an attacker might be interested and which, if disclosed, could cause harm to the data subject. Attribute disclosure means that an attacker is able to infer additional information about an individual without necessarily linking it to a specific item in a dataset. For example, the individual may be linked to a set of data items, which potentially allows inferring additional information, e.g., if all items share a certain sensitive attribute value. While membership disclosure can be seen as a specific type of attribute disclosure, it requires different countermeasures. Identity disclosure means that an attacker can link an individual to a specific data item and thereby learn sensitive information about them. This allows disclosing all information contained about the individual.

Disclosure types

Multiple privacy models have been proposed to prevent these types of disclosure, of which the following are currently implemented in the ARX anonymization tool:

k-Anonymity

This well-known privacy model aims at protecting datasets from identity disclosure under the prosecutor attacker model. A dataset is k-anonymous if each data item cannot be distinguished from at least k − 1 other data items with respect to the quasi-identifiers. The tuples with identical values for all quasi-identifiers form an equivalence class. For more details, see this paper.
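To make the definition concrete, here is a minimal sketch in Python (not ARX code) that checks whether a toy table is k-anonymous; the record layout, column names and helper name are hypothetical:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every equivalence class (records sharing the same
    quasi-identifier values) contains at least k records."""
    class_sizes = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(size >= k for size in class_sizes.values())

# Hypothetical, already generalized toy data.
records = [
    {"age": "30-39", "zipcode": "816**", "diagnosis": "flu"},
    {"age": "30-39", "zipcode": "816**", "diagnosis": "cancer"},
    {"age": "40-49", "zipcode": "817**", "diagnosis": "flu"},
    {"age": "40-49", "zipcode": "817**", "diagnosis": "flu"},
]
print(is_k_anonymous(records, ["age", "zipcode"], k=2))  # True
```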

Population uniqueness

ARX also supports several relaxed privacy models for protecting datasets against re-identification attacks following the marketer model. For example, thresholds can be enforced on the proportion of records that are unique within the underlying population. If no explicit information about this population has been loaded into ARX, this information can be estimated with super-population models. These statistical methods estimate characteristics of the overall population using probability distributions that are parameterized with sample characteristics. We provide default settings for populations such as the USA, UK, France or Germany, and support the methods by Pitman and Zayatz as well as the SNB model. ARX also implements the decision rule proposed and validated for clinical datasets by Dankar et al. More information can be found in this paper.
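The super-population estimators themselves (Pitman, Zayatz, SNB) are too involved for a short example, but they are all parameterized with the same sample characteristics: the number of equivalence classes of each size. A hypothetical Python helper that computes these frequencies might look as follows:

```python
from collections import Counter

def class_size_frequencies(records, quasi_identifiers):
    """Returns {class size: number of equivalence classes of that size}.
    freq[1] is the number of sample uniques; together with the other
    frequencies it forms the sample characteristics used to parameterize
    super-population estimators of population uniqueness."""
    sizes = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return dict(Counter(sizes.values()))
```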

Sample uniqueness

ARX also implements a privacy model which restricts the fraction of records that are unique within the dataset.
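A minimal sketch of this check in Python (column names and the threshold are hypothetical; the actual ARX implementation may differ):

```python
from collections import Counter

def sample_uniqueness(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination occurs
    only once in the dataset (equivalence class of size 1)."""
    sizes = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    uniques = sum(1 for r in records
                  if sizes[tuple(r[a] for a in quasi_identifiers)] == 1)
    return uniques / len(records)

# The privacy model is satisfied if the fraction stays below a chosen threshold, e.g.:
# sample_uniqueness(records, ["age", "zipcode"]) <= 0.05
```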

k-Map

This privacy model is a variant of k-anonymity, which considers explicit information about the underlying population.
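Under the usual reading of k-map, every quasi-identifier combination appearing in the published dataset must occur at least k times in the population table. A sketch under that assumption (the population table and column names are hypothetical):

```python
from collections import Counter

def satisfies_k_map(sample, population, quasi_identifiers, k):
    """True if each quasi-identifier combination found in the sample
    maps to at least k records of the population dataset."""
    pop_sizes = Counter(tuple(r[a] for a in quasi_identifiers) for r in population)
    return all(pop_sizes[tuple(r[a] for a in quasi_identifiers)] >= k
               for r in sample)
```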

Strict-average risk

ARX also implements strict-average risk, which combines a threshold on the average re-identification risk with k-anonymity. It can be used to protect datasets from marketer attacks.
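Assuming the common prosecutor risk of 1/(equivalence class size) per record, a sketch of the combined check could look as follows; the exact risk measure used by ARX may differ in detail:

```python
from collections import Counter

def strict_average_risk(records, quasi_identifiers, k, max_avg_risk):
    """k-anonymity plus a bound on the average per-record
    re-identification risk, taken here as 1 / (class size)."""
    sizes = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    risks = [1.0 / sizes[tuple(r[a] for a in quasi_identifiers)] for r in records]
    return min(sizes.values()) >= k and sum(risks) / len(risks) <= max_avg_risk
```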

ℓ-Diversity

This privacy model protects a dataset against attribute disclosure. It ensures that the values of a set of predefined sensitive attributes are at least ℓ-diverse within each equivalence class. ℓ-Diversity also implies ℓ-anonymity, i.e., k-anonymity with k = ℓ. To fulfill the basic definition of ℓ-diversity, a sensitive attribute must have at least ℓ "well represented" distinct values in each equivalence class. Different variants, such as entropy-ℓ-diversity and recursive-(c,ℓ)-diversity, have been proposed, which implement different measures of diversity. It was shown that recursive-(c,ℓ)-diversity delivers the best trade-off between data quality and privacy. For more details, see this paper. Since version 3.5.0, ARX also implements a variant of entropy-ℓ-diversity which uses the corrected Grassberger estimator proposed in this paper.
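The following Python sketch checks distinct and entropy ℓ-diversity on a toy table (not ARX code; function and column names are made up for illustration):

```python
import math
from collections import Counter, defaultdict

def sensitive_values_per_class(records, quasi_identifiers, sensitive):
    """Group the sensitive values by equivalence class."""
    classes = defaultdict(list)
    for r in records:
        classes[tuple(r[a] for a in quasi_identifiers)].append(r[sensitive])
    return classes

def distinct_l_diverse(records, quasi_identifiers, sensitive, l):
    """Each equivalence class contains at least l distinct sensitive values."""
    classes = sensitive_values_per_class(records, quasi_identifiers, sensitive)
    return all(len(set(values)) >= l for values in classes.values())

def entropy_l_diverse(records, quasi_identifiers, sensitive, l):
    """The entropy of the sensitive values in each class is at least log(l)."""
    classes = sensitive_values_per_class(records, quasi_identifiers, sensitive)
    for values in classes.values():
        n = len(values)
        entropy = -sum((c / n) * math.log(c / n) for c in Counter(values).values())
        if entropy < math.log(l):
            return False
    return True
```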

t-Closeness

This privacy model is an alternative measure for protecting against attribute disclosure. The basic idea is that no equivalence class should reveal more about the sensitive attribute than the dataset as a whole: the distribution of the sensitive attribute values within each equivalence class must have a distance of less than t to the distribution of the attribute values in the original dataset. For measuring distances between distributions, the earth mover’s distance (EMD) is used. Different variants exist, which use different ground distances when computing the EMD: (1) equal ground distance, which considers all values to be equally distant from each other, and (2) hierarchical ground distance, which utilizes generalization hierarchies to determine the distance between values. For more details, see this paper. Since version 3.5.0, ARX also supports a variant of t-closeness for numeric attributes.
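For categorical attributes with the equal ground distance, the EMD reduces to half the L1 distance between the two distributions. A minimal sketch of this variant (hypothetical names, not ARX code):

```python
from collections import Counter, defaultdict

def t_close_equal_distance(records, quasi_identifiers, sensitive, t):
    """True if, for every equivalence class, the EMD (with equal ground
    distance) between the class's sensitive-value distribution and the
    overall distribution stays below t."""
    def distribution(values):
        counts = Counter(values)
        return {v: c / len(values) for v, c in counts.items()}

    overall = distribution([r[sensitive] for r in records])
    classes = defaultdict(list)
    for r in records:
        classes[tuple(r[a] for a in quasi_identifiers)].append(r[sensitive])

    for values in classes.values():
        local = distribution(values)
        # Equal ground distance: EMD = 0.5 * L1 distance between distributions.
        emd = 0.5 * sum(abs(local.get(v, 0.0) - p) for v, p in overall.items())
        if emd >= t:
            return False
    return True
```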

δ-Disclosure privacy

This privacy model is a very strict measure for mitigating attribute disclosure. Informally, it restricts, for every equivalence class, how much the distribution of sensitive attribute values may diverge from their distribution in the overall dataset, using a multiplicative (logarithmic) measure.

δ-Presence

This model aims at protecting datasets against membership disclosure. The basic idea is to model the disclosed dataset as a subset of a larger dataset that represents the attacker’s background knowledge. A dataset is (δmin, δmax)-present if the probability that an individual from the global dataset is contained in the research subset lies between δmin and δmax. For more details, see this paper.
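A sketch of the check in Python, assuming the published subset and the population table share the same (generalized) quasi-identifier values; names are hypothetical:

```python
from collections import Counter

def delta_present(subset, population, quasi_identifiers, d_min, d_max):
    """True if, for every quasi-identifier combination occurring in the
    population, the fraction of matching population records that are
    also contained in the published subset lies in [d_min, d_max]."""
    def key(r):
        return tuple(r[a] for a in quasi_identifiers)
    pop_sizes = Counter(key(r) for r in population)
    sub_sizes = Counter(key(r) for r in subset)
    return all(d_min <= sub_sizes.get(qi, 0) / n <= d_max
               for qi, n in pop_sizes.items())
```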

Profitability

Since version 3.5.0, ARX implements the game-theoretic approach presented in this paper for performing monetary cost/benefit analyses in order to create de-identified datasets that maximize the profit of the data publisher.