Supported privacy models
Three types of privacy threats are commonly considered when anonymizing data:
- Membership disclosure means that data linkage allows an attacker to determine whether or not data about an individual is contained in a dataset. While this does not directly disclose any information from the dataset itself, it may allow an attacker to infer meta-information. While this deals with implicit sensitive attributes (meaning attributes of an individual that are not contained in the dataset), other disclosure models deal with explicit sensitive attributes.
- Attribute disclosure may be achieved even without linking an individual to a specific item in a dataset. It protects sensitive attributes, which are attributes from the dataset with which individuals are not willing to be linked with. As such, they might be of interest to an attacker and, if disclosed, could cause harm to data subjects. As an example, linkage to a set of data entries allows inferring information if all items share a certain sensitive attribute value.
- Identity disclosure (or re-identification) means that an individual can be linked to a specific data entry. This is a serious type of attack, as it has legal consequences for data owners according to many laws and regulations worldwide. From the definition it also follows that an attacker can learn all sensitive information contained in the data entry about the individual.
The specifics of the disclosure risks from which a dataset is to be protected can be specified by categorizing the attributes of the input dataset into different types:
- Identifying attributes are associated with a high risk of re-identification. They will be removed from the dataset. Typical examples are names or Social Security Numbers.
- Quasi-identifying attributes can in combination be used for re-identification attacks. They will be transformed. Typical examples are gender, date of birth and ZIP codes.
- Sensitive attributes encode properties with which individuals are not willing to be linked with. As such, they might be of interest to an attacker and, if disclosed, could cause harm to data subjects. They will be kept unmodified but may be subject to further constraints, such as t-closeness or l-diversity. Typical examples are diagnoses.
- Insensitive attributes are not associated with privacy risks. They will be kept unmodified.
Moreover, some privacy and risk models supported by ARX address specific attacker models:
- In the prosecutor model the attacker targets a specific individual and it is assumed that she already knows that data about the individual is contained in the dataset.
- In the journalist model the attacker targets a specific individual but it is not expected that she possesses background knowledge about membership.
- In the marketer model the attacker does not target a specific individual but she aims at re-identifying a high number of individuals. An attack can therefore only be considered successful if a larger fraction of the records could be re-identified.
ARX supports (almost) arbitrary combinations of privacy models and different variants of most models, which focus on different attacker models or which are optimized for attributes with specific data types.
This well-known privacy model aims at protecting datasets from re-identification in the prosecutor model. A dataset is k-anonymous if each record cannot be distinguished from at least k-1 other records regarding the quasi-identifiers. Each group of indistinguishable records forms a so-called equivalence class. More information can be found here, here and here.
This privacy model is related to k-anonymity but risks are calculated based on information about the underlying population. ARX supports a variant which utilizes a user-specified population table as well as two variants based on statistical frequency estimators published here and here.
This privacy model can be used for protecting datasets from re-identification in the marketer model by enforcing a threshold on the average re-identification risk of the records. By combining the model with k-anonymity a privacy model called strict-average risk can be constructed. ARX further supports a variant which permits some records to exceed the risk threshold defined by k.
This privacy model aims at protecting datasets from re-identification in the marketer model by enforcing thresholds on the proportion of records that are unique within the underlying population. For this purpose, basic information about the population has to be specified. Based on this data, statistical super-population models are used to estimate characteristics of the overall population with probability distributions that are parameterized with sample characteristics. ARX supports the methods by Hoshino (Pitman), Zayatz and Chen and McNulty (SNB). Different models may return differently accurate estimates of the number of population uniques. As a rule of thumb, the Pitman model should be used for sampling fractions lower than or equal to 10%. ARX also implements a decision rule proposed and validated for clinical datasets by Dankar et al. More information can be found here.
Note: Methods for estimating population uniqueness assume that the dataset is a uniform sample of the population. If this is not the case, results may be inaccurate.
This privacy model can be used to restrict the fraction of records which are unique regarding the quasi-identifiers.
This privacy model can be used to protect data against attribute disclosure by ensuring that each sensitive attribute has at least ℓ "well represented" values in each equivalence class. Different variants, which implement different measures of diversity, have been proposed of which the following are supported by the software: distinct-ℓ-diversity, recursive-(c, ℓ)-diversity as well as entropy-ℓ-diversity with two different estimators (Shannon or Grassberger). More information can be found here and here.
This privacy model can also be used to protect data from attribute disclosure. It requires that the distributions of values of a sensitive attribute within each equivalence class must have a distance of not more than t to the distribution of the attribute values in the input dataset. For this purpose, it bounds the cumulative absolute difference between the frequency distributions, which is calculated using the earth mover's distance (EMD). Different variants have been proposed for variables with different data types: (1) equal ground distance considers all values to be equally distant from each other, (2) hierarchical ground distance utilizes value generalization hierarchies to determine the distance between values and (3) ordered ground distance calculates distances based on the order of values. More information can be found here.
This privacy model can also be used to protect data against attribute disclosure. It also enforces a restriction on the distances between the distributions of sensitive values but uses a multiplicative definition which is stricter than the definition used by t-closeness. More information can be found here.
This privacy model is related to t-closeness and δ-disclosure privacy and it can also be used to protect data against attribute disclosure. It aims to overcome limitations of prior models by restricting the relative maximal distance between distributions of sensitive attribute values, also considering positive and negative information gain. More information can be found here.
This model can be used to protect data from membership disclosure. A dataset is (δmin, δmax)-present if the probability that an individual from the population is contained in the dataset lies between δmin and δmax. In order to be able to calculate these probabilities, users need to specify a population table. More information can be found here.
This model implements a game-theoretic approach for performing cost/benefit analyses of data publishing to create output datasets which maximize data publisher's monetary benefit. More information can be found here and here.
In this model, privacy protection is not considered a property of a dataset, but a property of a data processing method. Informally, it guarantees that the probability of any possible output of the anonymization process does not change "by much" if data of an individual is added to or removed from input data. Consequently, it becomes very difficult for attackers to derive information about specific individuals and datasets are protected from membership, identity and attribute disclosure. Differential privacy does further not make any strong assumptions about the background knowledge of attackers, e.g. about which attributes could be used for linkage. Instead, all attributes should be defined to be quasi-identifying. More information can be found here.