Introduction to anonymization and pseudonymization

1. Introduction to anonymization and pseudonymization

Many companies and organizations hold personal data that they would like to analyze or process. Data protection regulations make this difficult, because personal data must be handled with the utmost care. Personal data is any data that can identify a person. The obvious examples are first name and surname, perhaps in combination with date of birth, telephone number, email address and postal address. Other details can also identify a person uniquely, for example a reference to a pediatrician in a small town that has only one pediatrician. Only if personal data is sufficiently anonymized and/or pseudonymized may it be forwarded to third parties and processed and evaluated by them. The legal basis for this is the General Data Protection Regulation (GDPR).

2. Definition and differences

What is anonymization?

Anonymization is the alteration of personal data in such a way that individual details about personal or factual circumstances can no longer be attributed to an identified or identifiable natural person, or only with a disproportionate amount of time, cost and effort (BDSG old version § 3 para. 6). If data is completely anonymized, a personal reference and thus re-identification is almost impossible, and the General Data Protection Regulation does not apply. A data set counts as anonymized only when every possible combination of its data matches two or more persons in the data set.
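The "two or more persons" criterion above can be checked mechanically. The following is a minimal sketch, assuming the data set is a list of dictionaries and that the potentially identifying columns are already known. The column names and the function names are hypothetical and chosen for illustration. This kind of check is commonly known as k-anonymity, here with k = 2.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same identifying values."""
    groups = Counter(
        tuple(record[column] for column in quasi_identifiers)
        for record in records
    )
    return min(groups.values()) if groups else 0


def is_k_anonymous(records, quasi_identifiers, k=2):
    """True if every combination of identifying values occurs at least k times."""
    return smallest_group_size(records, quasi_identifiers) >= k


# Hypothetical example data: the third record is unique and therefore identifiable.
records = [
    {"age_group": "30-39", "town": "A", "diagnosis": "flu"},
    {"age_group": "30-39", "town": "A", "diagnosis": "asthma"},
    {"age_group": "40-49", "town": "B", "diagnosis": "flu"},
]
unique_person_present = not is_k_anonymous(records, ["age_group", "town"])
```

The check only covers the columns passed in, so it is a necessary test rather than proof that a data set is anonymized.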

What is pseudonymization?

Pseudonymization is the replacement of the name and other identifying features with an identifier for the purpose of excluding or significantly complicating the identification of the data subject. (BDSG old version § 3 para. 6a)

3. Types of anonymization

Anonymization must guarantee two things: it must be irreversible, and it must be impossible to assign the data unambiguously to a person. In general, a distinction is made between three types of anonymization, which are described below: absolute, formal and factual anonymization.

Absolute anonymization

This is the strongest type of anonymization: all personal details are removed so that identification is impossible. Completely anonymized data can be made publicly available for any data analysis, but the anonymization often distorts the data so heavily that little benefit can be drawn from it.

Formal anonymization

This is the simplest form of anonymization. In this case, only the direct identifiers of a person are removed, such as name, telephone number and address.

Factual or relative anonymization

Anonymization is carried out to such an extent that assigning the data to a person is almost impossible, or possible only with disproportionate effort, while enough information remains to analyze the data with regard to non-personal content. This data may not be made generally available; it may only be used for scientific projects in accordance with the Federal Statistics Act. Whether data is sufficiently anonymized depends on the information contained in the data set and on the conditions and techniques used for anonymization. Additional information such as keys, and whether the data is used externally or internally, play a decisive role. On this basis it can be decided to what extent the information should or must be anonymized. The effort required for anonymization and the benefit of the data should be weighed against each other for each analysis.

4. Anonymization techniques

Various anonymization techniques can be used to achieve factual anonymization. The GDPR does not specify which anonymization techniques are to be used. To avoid violating the GDPR, it is advisable to involve a data protection officer.

Removing the identifier

Data that can identify a person is deleted completely from a data record. Examples are name, address, date of birth, account data, social security number and photograph, but also sensitive attributes such as illnesses or very old age.
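As a minimal sketch of this step, assuming each data record is a dictionary, the identifying fields can be dropped by name. The field names in `DIRECT_IDENTIFIERS` are hypothetical and would have to be adapted to the actual data set.

```python
# Hypothetical field names; adapt them to the columns of the real data set.
DIRECT_IDENTIFIERS = frozenset({
    "name",
    "address",
    "date_of_birth",
    "account_number",
    "social_security_number",
    "photo",
})


def remove_identifiers(record, identifiers=DIRECT_IDENTIFIERS):
    """Return a copy of the record without the listed identifying fields."""
    return {
        field: value
        for field, value in record.items()
        if field not in identifiers
    }


patient = {"name": "Anna Example", "date_of_birth": "1980-01-01", "visits": 3}
cleaned = remove_identifiers(patient)
```

The function returns a new dictionary, so the original record stays unchanged until it is deliberately deleted.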

Randomization

Various techniques distort the data to such an extent that the link between data and persons is broken. In data swapping, the values of one person are exchanged randomly or pseudo-randomly with the values of another person; care must be taken that the swap does not coincidentally reproduce a person's original data. In synthetic data generation, artificial data sets are created using the characteristics of the original data set. In perturbation, data is replaced with artificially generated values in such a way that the statistical characteristics of the original data set are preserved.
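Swapping and perturbation can be sketched in a few lines. The example below is an illustration under stated assumptions, not a recommended implementation: records are dictionaries, `swap_column` reshuffles one column until no record keeps the value from its own position, and `perturb` adds zero-mean Gaussian noise as one simple way of roughly preserving the average. All function and field names are hypothetical.

```python
import random


def swap_column(records, column, rng):
    """Swap the values of one column between records.

    The shuffle is repeated until no record keeps the value from its own
    position. Two persons with identical values can still end up with their
    original value, which has to be checked separately.
    """
    count = len(records)
    if count < 2:
        raise ValueError("need at least two records to swap values")
    order = list(range(count))
    while any(target == source for target, source in enumerate(order)):
        rng.shuffle(order)
    return [
        {**record, column: records[source][column]}
        for record, source in zip(records, order)
    ]


def perturb(values, scale, rng):
    """Add zero-mean Gaussian noise with standard deviation `scale`."""
    return [value + rng.gauss(0, scale) for value in values]


rng = random.Random(42)  # fixed seed only to make the sketch repeatable
people = [{"id": i, "salary": 1000 * (i + 1)} for i in range(5)]
swapped = swap_column(people, "salary", rng)
noisy_salaries = perturb([p["salary"] for p in people], scale=50.0, rng=rng)
```

How much noise is acceptable depends on the analysis: a larger `scale` protects individuals better but blurs the statistics more.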

Aggregation

Different approaches are available to generalize the data set. For example, numerical data, such as age, is categorized in intervals, or a woman's name is replaced with 'woman' or a job title with 'profession'. It should be determined at the beginning to what extent a data set or individual details should be generalized.
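The interval example can be sketched as follows. The interval width of ten years and the replacement label are assumptions chosen for illustration; the function and field names are hypothetical.

```python
def generalize_age(age, width=10):
    """Replace an exact age with the interval it falls into, e.g. 34 -> '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def generalize_record(record):
    """Return a copy with the age binned and the job title generalized."""
    return {
        **record,
        "age": generalize_age(record["age"]),
        "job_title": "profession",
    }


person = {"age": 34, "job_title": "pediatrician", "town": "A"}
generalized = generalize_record(person)
```

Wider intervals make individuals harder to single out, at the cost of less precise analyses.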

5. Pseudonymization techniques

Placeholders and keys

Pseudonymization plays an important role alongside anonymization, but a few points need to be considered when using it. In pseudonymization, the link between a person and the recorded values is not removed completely. Instead, placeholders are used that can be traced back to the person using a key. If the key is not sent along with the pseudonymized data record, the data record is anonymized from the recipient's point of view.

In general, however, pseudonymized data is still personal data, as it can be assigned to a person using a key.

In order to make this data available for analysis, special care must therefore be taken to ensure that the necessary key is stored securely, but is also not lost.
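The placeholder-and-key idea can be sketched as follows, assuming records are dictionaries and that random tokens are acceptable as placeholders. The function name, the `P-` prefix and the field names are hypothetical. The returned key table is exactly the piece that must be stored securely and separately from the data.

```python
import secrets


def pseudonymize(records, identifier_field):
    """Replace an identifying field with a random placeholder.

    Returns the pseudonymized records and the key table that maps each
    placeholder back to the original value.
    """
    key_table = {}        # placeholder -> original value (keep this secret)
    assigned = {}         # original value -> placeholder
    pseudonymized = []
    for record in records:
        original = record[identifier_field]
        if original not in assigned:
            placeholder = "P-" + secrets.token_hex(8)
            while placeholder in key_table:  # extremely unlikely collision
                placeholder = "P-" + secrets.token_hex(8)
            assigned[original] = placeholder
            key_table[placeholder] = original
        pseudonymized.append({**record, identifier_field: assigned[original]})
    return pseudonymized, key_table


visits = [
    {"name": "Anna", "visits": 1},
    {"name": "Ben", "visits": 2},
    {"name": "Anna", "visits": 3},
]
shared_data, secret_key_table = pseudonymize(visits, "name")
```

Only `shared_data` would be passed on. The same person always receives the same placeholder, so analyses per person remain possible without revealing the name.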

If pseudonymization is used instead of anonymization, it is important to take a close look at the Data Protection Act or involve a data protection officer to ensure that no violations occur.

Combined approaches

A combination of anonymization and pseudonymization can also be applied to a data set so that only the data that absolutely cannot be anonymized is replaced by pseudonyms. This increases the certainty that no breaches of data protection law will occur.

The following infographic shows the differences between anonymization and pseudonymization.

6. Artificial intelligence in anonymized data analysis

Advances in the field of artificial intelligence also enable new approaches to anonymized data analysis. Several possibilities can be considered, three of which are described below.

Create synthetic values or data records

This approach creates artificial data that has, for example, statistical properties similar to those of the original data set. The result is a data set whose data is anonymized but which still contains sufficient information for analysis.
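A deliberately simple sketch of the idea, assuming a single numeric column that is roughly normally distributed: the mean and standard deviation are estimated from the original values and new values are drawn from that distribution. This is a stand-in for the AI-based generators the text refers to, it ignores relationships between columns, and the function name is hypothetical.

```python
import random
import statistics


def synthesize_column(values, count, rng):
    """Draw `count` artificial values with a similar mean and spread."""
    mean = statistics.fmean(values)
    spread = statistics.stdev(values)
    return [rng.gauss(mean, spread) for _ in range(count)]


rng = random.Random(7)  # fixed seed only to make the sketch repeatable
original_ages = [30, 40, 50, 60]
synthetic_ages = synthesize_column(original_ages, count=10_000, rng=rng)
```

None of the synthetic values belongs to a real person, but the overall distribution stays close to the original, which is what most aggregate analyses need.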

Federated learning

The idea here is that data records are not copied to a central server for analysis; instead, the training takes place on each individual user's computer. The models created in the process are then collected on a central server and aggregated into a single model. The original data therefore remains on the user's computer and never comes into the hands of the analyst. The big advantage is that the amount of data does not have to be reduced.
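The collect-and-aggregate step can be illustrated with a toy model. The sketch below assumes each user fits a one-parameter linear model y = w * x on local data and sends only the parameter and the number of samples to the server, which then computes a weighted average. The model, the weighting by sample count and all names are assumptions for illustration; real systems train far larger models over many rounds.

```python
def train_local(samples):
    """Fit y = w * x by least squares on one user's local data.

    Only the weight and the sample count leave the user's computer.
    """
    sum_xy = sum(x * y for x, y in samples)
    sum_xx = sum(x * x for x, _ in samples)
    return sum_xy / sum_xx, len(samples)


def aggregate(local_models):
    """Combine local weights into one model, weighted by sample count."""
    total = sum(count for _, count in local_models)
    return sum(weight * count for weight, count in local_models) / total


# Hypothetical local data sets that never leave the respective user.
user_a = [(1.0, 1.0)]
user_b = [(1.0, 3.0), (1.0, 3.0), (1.0, 3.0)]
global_weight = aggregate([train_local(user_a), train_local(user_b)])
```

The server sees only the pairs returned by `train_local`, never the raw samples.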

Automated formal anonymization

Furthermore, artificial intelligence can be used to formally anonymize or pseudonymize data records. In texts, the direct identifiers of persons (name, address, birthday, etc.) are found and deleted automatically. In photos, license plates, faces, etc. are recognized automatically and blurred. In audio recordings, names, addresses, etc. are recognized automatically and masked with noise.

7. Applications of anonymization

All data that identifies a person must be anonymized before data analysis is carried out or the data is passed on for any other purpose. This applies not only to personal data in texts, but also to data in photos or audio files that identify a person.

Anonymization in texts

Text data may contain personal information such as names, addresses or other unique identifiers. This information must be removed or obscured before the data can be further processed or analyzed. AI-supported systems can effectively recognize such identifiers automatically and either remove them or replace them with placeholders. These methods are used in particular when analyzing survey data or documents containing sensitive information.
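The replace-with-placeholder step can be sketched with regular expressions. This is a rule-based stand-in, not the AI-supported recognition described above: it only catches email addresses and phone-like digit sequences, and names or addresses would need a trained recognizer. The patterns and placeholder labels are assumptions.

```python
import re

# Simplified patterns for illustration; real data needs more robust rules.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s/-]{6,}\d")


def redact(text):
    """Replace email addresses and phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


answer = "Contact anna@example.com or +49 30 1234567."
redacted = redact(answer)
```

Placeholders keep the sentence structure intact, which is useful when survey answers are analyzed further.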

Anonymization in photos

Photos and images may also contain personal data, especially if people or vehicles are depicted in them. In such cases, faces, license plates and other unique features must be blurred before the images can be shared or published. AI technologies can be used to automatically recognize and pixelate or blur such identifiers to protect the privacy of the individuals concerned.
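Pixelation itself is simple once a region has been found. The sketch below assumes a grayscale image stored as a list of rows and a rectangle supplied by some detector, which is not shown. The function name and the block size are hypothetical.

```python
def pixelate_region(image, top, left, height, width, block=2):
    """Replace each block inside the rectangle with its average brightness.

    `image` is a list of rows of integer gray values; a new image is returned.
    """
    result = [row[:] for row in image]
    for block_top in range(top, top + height, block):
        for block_left in range(left, left + width, block):
            rows = range(block_top, min(block_top + block, top + height))
            cols = range(block_left, min(block_left + block, left + width))
            cells = [(y, x) for y in rows for x in cols]
            average = sum(image[y][x] for y, x in cells) // len(cells)
            for y, x in cells:
                result[y][x] = average
    return result


# Hypothetical 4x4 image; the "face" is assumed to be the top-left 2x2 area.
picture = [
    [0, 100, 5, 5],
    [100, 200, 5, 5],
    [5, 5, 5, 5],
    [5, 5, 5, 5],
]
blurred = pixelate_region(picture, top=0, left=0, height=2, width=2)
```

Larger blocks remove more detail; the block size should be chosen so that the feature can no longer be recognized.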

Anonymization in audio files

Audio data can contain personal information such as names, addresses or other identifiers that occur in conversations or recordings. To anonymize such data, AI systems can be used to recognize this information and mask it with noise or other forms of sound. This method is often used when processing interviews or customer service recordings.

Anonymization in medical data

Medical data is particularly sensitive as it often contains detailed information about people's health. When anonymizing medical data, personal information must be removed or pseudonymized to protect patient privacy while preserving important data for research. This process requires careful planning and the use of special anonymization techniques to ensure that the data is adequately protected.

8. Conclusions and recommendations

The anonymization and pseudonymization of personal data is a complex but essential topic, especially with regard to the General Data Protection Regulation (GDPR).

Companies and organizations that work with personal data need to be aware of what information is considered personal and how they can adequately protect that data. The choice of the appropriate anonymization technique depends on several factors, including the type of data, the purpose of the analysis and the resources available.

Recommendations

  1. Consult a data protection officer:

    When anonymizing or pseudonymizing data, it is advisable to consult a data protection officer to ensure that all legal requirements are met.

  2. Select suitable anonymization techniques:

    Choosing the right technique is crucial to finding the optimal compromise between data protection and data quality.

  3. Explore AI-powered anonymization:

    Advances in artificial intelligence offer new possibilities for anonymization and data analysis. Companies should consider these technologies to manage their data effectively and securely.

  4. Implement security measures:

    With pseudonymization in particular, it is important to take suitable security measures to protect the key from unauthorized access.

  5. Regular review and adaptation:

    Data protection requirements and anonymization technologies change over time. Companies should regularly review and adapt their data anonymization practices to keep up with current best practices.

By implementing these recommendations, companies can ensure that they handle personal data responsibly and in accordance with the applicable data protection regulations.
