In this perspective, various privacy risks can be analyzed. These include re-identification risks for the prosecutor, journalist and marketer attacker models as well as population uniqueness estimated with different statistical methods. Moreover, the perspective provides methods for detecting HIPAA identifiers in the data set and for finding potential quasi-identifiers.
Distribution of risks
In this view, the distribution of re-identification risks among the records of the data set can be analyzed. The distribution is displayed for both input and output data, either as a histogram or as a table.
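As a rough illustration of what such a distribution contains, the following Python sketch bins per-record risks into a small table. It assumes that the risk of a record is 1 divided by the size of its equivalence class (the group of records that share the same quasi-identifier values). The function names, bin boundaries and sample records are hypothetical and are not part of ARX.

```python
from collections import Counter


def prosecutor_risks(records, quasi_identifiers):
    """Return one risk value per record: 1 / size of its equivalence class."""
    keys = [tuple(r[a] for a in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    return [1.0 / sizes[k] for k in keys]


def risk_distribution(risks, thresholds=(0.1, 0.2, 0.5, 1.0)):
    """Share of records whose risk falls into each bin (lower, upper]."""
    n = len(risks)
    table = []
    lower = 0.0
    for upper in thresholds:
        share = sum(1 for r in risks if lower < r <= upper) / n
        table.append((lower, upper, share))
        lower = upper
    return table


# Hypothetical sample data with two quasi-identifiers.
records = [
    {"age": 34, "zip": "81667"},
    {"age": 34, "zip": "81667"},
    {"age": 45, "zip": "81675"},
    {"age": 45, "zip": "81675"},
    {"age": 45, "zip": "81675"},
    {"age": 70, "zip": "81925"},
]

risks = prosecutor_risks(records, ["age", "zip"])
for lower, upper, share in risk_distribution(risks):
    print(f"({lower:.1f}, {upper:.1f}]: {share:.0%} of records")
```

Running the same computation on input and output data gives the two distributions that the view places side by side.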
In this section, combinations of attributes can be analyzed for their associated re-identification risks. The view shows the degree to which a combination of variables separates the records from each other and the degree to which it makes records distinct. First, you must select a set of attributes to analyze in the bottom left area.
ARX will then calculate the aforementioned parameters. You may use this information to decide which quasi-identifiers need to be protected.
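The two parameters can be sketched in Python as follows. This is an illustration under assumptions: "distinct" is taken to mean the share of distinct value combinations among all records, and "separate" the share of record pairs that differ in at least one of the selected attributes. The function names and sample records are hypothetical, and ARX may compute the values differently.

```python
from itertools import combinations


def distinction(records, attributes):
    """Number of distinct value combinations divided by number of records."""
    keys = {tuple(r[a] for a in attributes) for r in records}
    return len(keys) / len(records)


def separation(records, attributes):
    """Share of record pairs that differ in at least one selected attribute."""
    keys = [tuple(r[a] for a in attributes) for r in records]
    pairs = list(combinations(keys, 2))
    return sum(1 for a, b in pairs if a != b) / len(pairs)


# Hypothetical sample data.
records = [
    {"sex": "M", "age": 30},
    {"sex": "M", "age": 30},
    {"sex": "F", "age": 30},
    {"sex": "F", "age": 40},
]

for attrs in (["sex"], ["age"], ["sex", "age"]):
    print(attrs, distinction(records, attrs), separation(records, attrs))
```

Attribute combinations with values close to 1 single out many records and are therefore natural candidates for quasi-identifiers.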
This view displays an overview of several measures for re-identification risks implemented by ARX. In the upper area of this perspective, risk estimates are provided for three different attacker models: (1) the prosecutor scenario, (2) the journalist scenario and (3) the marketer scenario. In the prosecutor model it is assumed that the attacker already knows that data about the targeted individual is contained in the data set. In the journalist model, such background knowledge is not assumed. In the marketer model, it is assumed that the attacker is not interested in re-identifying a specific individual but that she aims at attacking a larger number of individuals. An attack can therefore only be considered successful if a larger portion of the records could be re-identified.
Thresholds can be specified for three measures: the highest risk of any record, the proportion of records whose risk exceeds this highest-risk threshold, and the average proportion of records that can successfully be re-identified. More details about the methods implemented by ARX can be found in the book Guide to the De-Identification of Personal Health Information by Khaled El Emam.
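How three such thresholds could be checked against a list of per-record risks is sketched below in Python. The function, its parameter names and the example numbers are hypothetical; the sketch only mirrors the three measures named above and does not reproduce ARX's implementation.

```python
def check_thresholds(risks, max_risk, max_records_at_risk, max_success_rate):
    """Compare three risk measures with user-defined thresholds.

    Returns a dict mapping each measure to (value, threshold_met).
    """
    n = len(risks)
    highest = max(risks)                              # highest risk of any record
    at_risk = sum(1 for r in risks if r > max_risk) / n  # share of records above max_risk
    success = sum(risks) / n                          # average share of re-identified records
    return {
        "highest_risk": (highest, highest <= max_risk),
        "records_at_risk": (at_risk, at_risk <= max_records_at_risk),
        "success_rate": (success, success <= max_success_rate),
    }


# Hypothetical per-record risks, e.g. 1 / equivalence class size.
risks = [0.5, 0.5, 0.25, 0.25, 0.25, 0.25, 1.0, 1.0]
result = check_thresholds(risks, max_risk=0.5,
                          max_records_at_risk=0.3, max_success_rate=0.4)
for measure, (value, ok) in result.items():
    print(f"{measure}: {value:.2f} ({'ok' if ok else 'threshold exceeded'})")
```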
In the lower part of the screen, selected measures of prosecutor re-identification risks are displayed. These measures are based on the sample itself. They are complemented by numbers on population uniqueness from a selected statistical model:
- Lowest prosecutor re-identification risk.
- Individuals affected by lowest risk.
- Highest prosecutor re-identification risk.
- Individuals affected by highest risk.
- Average prosecutor re-identification risk.
- Fraction of unique records.
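The sample-based measures in this list can be derived from the sizes of the equivalence classes alone. The following Python sketch shows one way to do so, assuming that a record's prosecutor risk is 1 divided by the size of its class. The function name, the dictionary keys and the sample rows are hypothetical.

```python
from collections import Counter


def prosecutor_summary(rows):
    """Summarize prosecutor risks; each row is a tuple of quasi-identifier values."""
    sizes = Counter(rows)              # equivalence class -> number of records
    n = len(rows)
    largest = max(sizes.values())
    smallest = min(sizes.values())
    return {
        "lowest_risk": 1 / largest,
        "affected_by_lowest": sum(s for s in sizes.values() if s == largest) / n,
        "highest_risk": 1 / smallest,
        "affected_by_highest": sum(s for s in sizes.values() if s == smallest) / n,
        # The mean of 1/size over all records equals classes / records.
        "average_risk": len(sizes) / n,
        "unique_fraction": sum(1 for s in sizes.values() if s == 1) / n,
    }


# Hypothetical sample: class sizes 2, 3 and 1.
rows = [
    ("F", "1980-1989"), ("F", "1980-1989"),
    ("M", "1970-1979"), ("M", "1970-1979"), ("M", "1970-1979"),
    ("F", "1950-1959"),
]

for name, value in prosecutor_summary(rows).items():
    print(f"{name}: {value:.3f}")
```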
The Safe Harbor method of the US Health Insurance Portability and Accountability Act (HIPAA) specifies 18 identifiers that must be altered or removed in order to derive a de-identified data set. The aim of this perspective is to detect such identifiers.
Note: This method works on a best-effort basis. If no HIPAA identifiers are detected, this does not mean that no HIPAA identifiers are contained in a data set. ARX favors recall over precision and it does not implement methods for detecting all types of HIPAA identifiers. The following types of attributes specified by HIPAA can potentially be detected:
- Geographical subdivisions: regions, states, cities.
- Phone numbers.
- Fax numbers.
- Electronic mail addresses.
- Social Security numbers.
- License plate numbers.
- Universal Resource Locators (URLs).
- Internet Protocol (IP) addresses.
The method computes edit distances between common labels for HIPAA identifiers and the labels of the attributes in the data set. Moreover, it checks the values of the attributes for common patterns (e.g. of license plate numbers, ZIP codes and dates) and common instance values (e.g. first names and last names).
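The general idea of combining label matching with value patterns can be sketched in Python as follows. The label lists, regular expressions and thresholds are hypothetical and far simpler than anything a real detector would need; they are not taken from ARX.

```python
import re


def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical labels and value patterns for two identifier types.
KNOWN_LABELS = {
    "email": ["email", "e-mail", "mail"],
    "ip": ["ip address", "ip-address"],
}
VALUE_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "ip": re.compile(r"(\d{1,3}\.){3}\d{1,3}"),
}


def detect(attribute, values, max_distance=1, min_match_share=0.8):
    """Return the identifier types suggested by the label or by the values."""
    hits = set()
    name = attribute.strip().lower()
    for kind, labels in KNOWN_LABELS.items():
        if any(edit_distance(name, label) <= max_distance for label in labels):
            hits.add(kind)
    for kind, pattern in VALUE_PATTERNS.items():
        matches = sum(1 for v in values if pattern.fullmatch(v))
        if values and matches / len(values) >= min_match_share:
            hits.add(kind)
    return hits


print(detect("emails", ["n/a"]))                           # label match only
print(detect("contact", ["a@example.org", "b@example.org"]))  # value match only
print(detect("host", ["192.168.0.1", "10.0.0.7"]))
print(detect("age", ["34", "45"]))
```

A looser `max_distance` or a lower `min_match_share` flags more attributes at the cost of more false alarms.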
ARX supports the estimation of marketer re-identification risks based on the number of population uniques in a sample. Population uniques are records that are unique within the sample (sample uniques) and also unique within the overall population. Note: Not all sample uniques are also population uniques. When no data about the population has been loaded into ARX, this number can be estimated with statistical models. Super-population models estimate the characteristics of the overall population with probability distributions that are parameterized with sample characteristics. ARX supports the methods by Hoshino (Pitman), by Zayatz, and by Chen and McNulty (SNB).
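The difference between sample uniques and population uniques can be illustrated with a small Python example in which the population is fully known, so that nothing needs to be estimated. The population, the sample and the helper function are hypothetical.

```python
from collections import Counter


def uniques(rows):
    """Return the value combinations that occur exactly once in rows."""
    counts = Counter(rows)
    return {key for key, count in counts.items() if count == 1}


# Hypothetical population described by (sex, year of birth).
population = [
    ("F", 1980), ("F", 1980), ("M", 1975), ("M", 1990),
    ("F", 1962), ("M", 1990), ("F", 1955), ("M", 1975),
]
# A sample containing every second member of the population.
sample = [population[i] for i in (0, 2, 4, 6)]

sample_uniques = uniques(sample)
population_uniques = sample_uniques & uniques(population)

print("sample uniques:", len(sample_uniques))
print("of which population uniques:", len(population_uniques))
```

All four sampled records are unique within the sample, but only two of them are unique within the population. The statistical models estimate this second number when the population itself is not available.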
The accuracy of the estimated number of population uniques may differ between models. As a rule of thumb, the Pitman model should be used for sampling fractions lower than or equal to 10%. ARX also implements a decision rule proposed and validated for clinical data sets by Dankar et al. More information can be found at http://www.biomedcentral.com/1472-6947/12/66.
The tool also provides a perspective for comparing the results of different methods under the assumption of different sampling fractions.
This view displays the estimates by the method of Dankar et al., again for different sampling fractions.
In the process of computing estimates with statistical models, ARX must solve non-linear bivariate equation systems. The solver used by ARX can be configured in the settings dialog. Here, you may specify options such as the total number of iterations, the number of iterations per try and the required accuracy. Changing these settings may influence the precision of the results and the time required to obtain them.
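To make the role of these options more concrete, the following Python sketch solves a system of two non-linear equations with a generic Newton iteration that is restarted from several start values. It is an illustration only and does not reproduce ARX's solver; the function names, parameter names, default values and the example system are assumptions.

```python
def newton_2d(f, jacobian, start, iterations, accuracy):
    """Run one try. Returns (solution or None, number of steps used)."""
    x, y = start
    for step in range(iterations):
        f1, f2 = f(x, y)
        if max(abs(f1), abs(f2)) < accuracy:
            return (x, y), step
        (a, b), (c, d) = jacobian(x, y)
        det = a * d - b * c
        if det == 0:
            return None, step          # singular Jacobian, give up this try
        # Solve J * (dx, dy) = -(f1, f2) with Cramer's rule.
        dx = (-f1 * d + f2 * b) / det
        dy = (-a * f2 + c * f1) / det
        x, y = x + dx, y + dy
    return None, iterations


def solve(f, jacobian, starts, total_iterations=1000,
          iterations_per_try=100, accuracy=1e-9):
    """Try each start value until a solution is found or the budget is spent."""
    used = 0
    for start in starts:
        budget = min(iterations_per_try, total_iterations - used)
        if budget <= 0:
            break
        solution, steps = newton_2d(f, jacobian, start, budget, accuracy)
        used += steps
        if solution is not None:
            return solution
    return None


# Example system: x^2 + y^2 = 4 and x * y = 1.
def system(x, y):
    return x * x + y * y - 4.0, x * y - 1.0


def system_jacobian(x, y):
    return (2.0 * x, 2.0 * y), (y, x)


# The first start value has a singular Jacobian, so a second try is needed.
solution = solve(system, system_jacobian, starts=[(0.0, 0.0), (2.0, 0.5)])
print(solution)
```

A stricter accuracy or a smaller iteration budget makes it more likely that no solution is returned, which reflects the trade-off between precision and run time described above.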
Methods for estimating the number of population uniques in a data set require some basic data about the population from which the data set was sampled. ARX provides default settings for populations, such as the USA, UK, France or Germany, which can be selected in this view.
It is important that the selected population matches the population that the anticipated adversary is likely to know the data was sampled from. If the required data is not provided by ARX, it can also be entered manually.