Fundamentals · 7 min read

Can Anonymized Data Be Re-Identified?

Yes, often: naive masking leaves quasi-identifiers (zip code, birth date, sex) that can be cross-referenced to re-identify the person. Here's why and how.

By Pierre de ONYRI

Yes — poorly “anonymized” data is often re-identified. Erasing direct identifiers (name, address, social security number) isn't enough: quasi-identifiers like zip code, date of birth or sex remain, and once cross-referenced with other public datasets they make it possible to find the individual. Academic work has demonstrated this since the 1990s, and the GDPR draws a clear conclusion: data that is merely masked remains personal data. Only irreversible anonymization — or never exposing the data at all — eliminates the risk.

Why naive masking fails

Removing the name and address gives a false sense of safety. The problem is quasi-identifiers: attributes that look harmless in isolation but become unique once combined. Latanya Sweeney (Data Privacy Lab / Identifiability Project, Carnegie Mellon University) quantified it on 1990 U.S. census data: roughly 87% of the population was likely uniquely identifiable from just the triplet 5-digit zip code + sex + full date of birth. At a coarser grain — town instead of zip code — about 53% were still uniquely identifiable. The name is never needed: cross-referencing is enough.
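To make the idea of "unique once combined" concrete, here is a minimal sketch in Python. The records are invented toy values, not census data, and `unique_fraction` is a hypothetical helper name. It only illustrates how such a uniqueness share is measured and how coarsening the attributes lowers it; it does not reproduce Sweeney's methodology.

```python
from collections import Counter

# Toy records (zip_code, sex, date_of_birth). Invented values, no names at all.
records = [
    ("02138", "M", "1945-07-31"),
    ("02139", "M", "1945-02-10"),
    ("02138", "F", "1962-03-14"),
    ("02139", "F", "1962-09-01"),
    ("02139", "F", "1980-11-02"),
    ("02139", "F", "1980-11-02"),
    ("02141", "M", "1991-05-20"),
]


def unique_fraction(rows, key):
    """Share of rows whose key(row) value appears exactly once in rows."""
    counts = Counter(key(r) for r in rows)
    return sum(1 for r in rows if counts[key(r)] == 1) / len(rows)


# Full triplet: 5-digit zip + sex + full date of birth.
full = unique_fraction(records, lambda r: r)

# Coarsened triplet: 3-digit zip prefix + sex + birth year only.
coarse = unique_fraction(records, lambda r: (r[0][:3], r[1], r[2][:4]))

print(f"unique on full triplet: {full:.0%}")
print(f"unique on coarsened triplet: {coarse:.0%}")
```

On this toy data, 5 of the 7 records are unique on the full triplet and only 1 of 7 once the zip code and date are coarsened. The direction of the effect matches the 87% versus 53% contrast above; the magnitudes here are an artifact of the invented data.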

A 2019 study published in Nature Communications pushed the finding further: using a generative model, its authors estimate that 99.98% of Americans would be correctly re-identified in any “anonymized” dataset from just 15 demographic attributes (age, sex, marital status, etc.). Their conclusion is blunt: even heavily sampled datasets are unlikely to meet the GDPR's anonymization standard. Imperial College London summarized it in a release: anonymizing personal data is “not enough to protect privacy.”

Three landmark demonstrations

The history of re-identification is written through a few cases that have become classics:

  1. Governor Weld's medical record (Massachusetts, 1997). The state's Group Insurance Commission had released state employees' hospital records billed as “anonymized” (names, addresses and social security numbers removed). Latanya Sweeney bought the voter registration list for Cambridge, Massachusetts for about $20 (it contains name, address, zip code and date of birth) and cross-referenced the two: in Cambridge, very few people shared the governor's date of birth, fewer still his sex, and only one his zip code. His medical record was found. The demonstration directly shaped the de-identification rules of the HIPAA Privacy Rule (2003).
  2. The Netflix Prize (Narayanan & Shmatikov, 2008 IEEE Symposium on Security and Privacy). The released dataset contained “anonymous” movie ratings from roughly 500,000 subscribers. Using the Internet Movie Database (IMDb) as public auxiliary knowledge, the researchers showed that an adversary knowing just a few of a subscriber's ratings and dates could easily find their record, and infer sensitive information such as political preferences.
  3. The 2019 Nature Communications study (Rocher, Hendrickx, de Montjoye), cited above, which generalizes the mechanism: this is not an isolated accident but a mathematical property of rich demographic data.
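The cross-referencing step in the Weld case is an exact join on quasi-identifiers. The sketch below, in Python, shows the mechanism on invented rows; the names, diagnoses and the `link` helper are all hypothetical, and a real attack on sparse data such as the Netflix ratings relies on approximate matching rather than this exact join.

```python
# "Anonymized" hospital rows: direct identifiers removed, quasi-identifiers kept.
# All values are invented for illustration.
hospital = [
    {"zip": "02138", "sex": "M", "dob": "1950-01-15", "diagnosis": "condition A"},
    {"zip": "02138", "sex": "F", "dob": "1950-01-15", "diagnosis": "condition B"},
    {"zip": "02139", "sex": "M", "dob": "1950-01-15", "diagnosis": "condition C"},
]

# Public auxiliary dataset (for example a voter list) that does carry names.
voters = [
    {"name": "Alex Example", "zip": "02138", "sex": "M", "dob": "1950-01-15"},
    {"name": "Sam Sample", "zip": "02139", "sex": "F", "dob": "1971-06-30"},
]

QUASI_IDENTIFIERS = ("zip", "sex", "dob")


def link(anonymized, auxiliary, keys=QUASI_IDENTIFIERS):
    """Return (name, record) pairs where exactly one anonymized row matches."""
    index = {}
    for row in anonymized:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    matches = []
    for person in auxiliary:
        candidates = index.get(tuple(person[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match is a re-identification
            matches.append((person["name"], candidates[0]))
    return matches


reidentified = link(hospital, voters)
for name, record in reidentified:
    print(name, "->", record["diagnosis"])
```

Only the voter whose triplet matches exactly one hospital row is linked. The name was never in the hospital data; it arrives through the auxiliary dataset.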

Anonymization ≠ pseudonymization: what the GDPR says

European law settles a common confusion. Recital 26 of the GDPR states that data protection principles do not apply to anonymous information — data that does not (or no longer does) relate to an identified or identifiable person. True anonymization is irreversible: the data then falls outside the GDPR. Pseudonymization (Article 4(5)) replaces direct identifiers with codes, but the data can be re-attributed using “additional information” kept separately — a key, a lookup table. So pseudonymized data remains personal data subject to the GDPR. The UK regulator (ICO) puts it plainly: pseudonymization is a security measure, not a method of anonymization. We detail these definitions in our article “Anonymization, pseudonymization, tokenization: what's the difference?”

How do you know whether an “anonymization” holds? The Article 29 Working Party (Opinion 05/2014 on Anonymisation Techniques) names three risks that effective anonymization must withstand, and they are exactly the ones the demonstrations above exploit:

Test (Art. 29 WP) | Question to ask | Warning sign
Singling out | Can you isolate a record matching one person? | The zip code + sex + date of birth triplet is enough (Weld case)
Linkability | Can you link two records about the same person? | Cross-referencing with IMDb or a public voter list
Inference | Can you deduce information about the person? | Political preferences inferred from the Netflix Prize
The three tests of effective anonymization per the Article 29 Working Party (Opinion 05/2014). Techniques split into randomization (noise, permutation, differential privacy) and generalization (k-anonymity, l-diversity, t-closeness).
Diagram: at top, an “anonymized” record with a masked name still leaves quasi-identifiers in the clear (zip code, date of birth, sex, in amber) which, cross-referenced with an auxiliary dataset, let someone find a person under a magnifying glass; at bottom, the same fully tokenized record leaves only interchangeable tokens (cobalt) — no person to cross-reference, a valid checkmark.
After Latanya Sweeney (Data Privacy Lab, Carnegie Mellon), the 2019 Nature Communications study (Rocher, Hendrickx, de Montjoye), Narayanan & Shmatikov (Netflix Prize) and the Article 29 Working Party Opinion 05/2014.
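The singling-out test can be approximated with a k-anonymity check, one of the generalization techniques named in the table caption: every combination of quasi-identifiers should be shared by at least k records. The Python sketch below uses invented rows, and the `generalize` rule (3-digit zip prefix, birth decade) is an arbitrary assumption chosen for illustration.

```python
from collections import Counter


def k_anonymity(rows, keys):
    """Smallest group size over the given quasi-identifier columns."""
    counts = Counter(tuple(r[k] for k in keys) for r in rows)
    return min(counts.values())


def generalize(row):
    """Hypothetical generalization: keep a 3-digit zip prefix and the birth decade."""
    decade = row["dob"][:3] + "0s"  # "1945-07-31" -> "1940s"
    return {**row, "zip": row["zip"][:3] + "**", "dob": decade}


# Invented rows for illustration only.
rows = [
    {"zip": "02138", "sex": "M", "dob": "1945-07-31"},
    {"zip": "02139", "sex": "M", "dob": "1948-02-10"},
    {"zip": "02138", "sex": "F", "dob": "1962-03-14"},
    {"zip": "02139", "sex": "F", "dob": "1967-09-01"},
]
keys = ("zip", "sex", "dob")

k_raw = k_anonymity(rows, keys)
k_generalized = k_anonymity([generalize(r) for r in rows], keys)

print("k before generalization:", k_raw)
print("k after generalization:", k_generalized)
```

A result of k = 1 means at least one record can be singled out. Raising k addresses singling out only; if every record in a group shares the same sensitive value, inference is still possible, which is the gap l-diversity and t-closeness aim to close.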

The case of client-side tokenization

ONYRI's tokenization is reversible, so in GDPR terms it is pseudonymization, not anonymization. The decisive difference lies in one detail: the token ↔ value lookup table (the very “additional information” that makes re-attribution possible) never leaves your browser. The AI provider (ChatGPT, Claude, Gemini…) receives only tokens, with no exploitable quasi-identifiers and no key to reverse them. So it has nothing to correlate: no name, no zip code/date/sex triplet, nothing to link to an auxiliary dataset. Detokenization happens on the client once the model's reply comes back. On the provider's side, the three Article 29 attacks (singling out, linkability, inference) have nothing to work with. To place this approach among neighboring techniques, see “Anonymization, pseudonymization, tokenization: what's the difference?”
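The round trip described above can be sketched in a few lines of Python. This is not ONYRI's implementation: the regex detectors, the `[LABEL_n]` token format and the function names are assumptions made for illustration, and a real engine would run in the browser and cover many more categories.

```python
import re

# Hypothetical detectors; deliberately simplistic.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "ZIP": re.compile(r"\b\d{5}\b"),
}


def tokenize(text):
    """Replace detected values with tokens; return (safe_text, table)."""
    table = {}  # token -> original value; this table stays on the client
    for label, pattern in PATTERNS.items():
        values = {}  # original value -> token, so repeats reuse one token

        def repl(match):
            value = match.group(0)
            if value not in values:
                token = f"[{label}_{len(values) + 1}]"
                values[value] = token
                table[token] = value
            return values[value]

        text = pattern.sub(repl, text)
    return text, table


def detokenize(text, table):
    """Restore original values in a reply, using the client-side table."""
    for token, value in table.items():
        text = text.replace(token, value)
    return text


prompt = "Write to alex@example.com about the patient born 1945-07-31 in zip 02138"
safe_prompt, table = tokenize(prompt)

# Only safe_prompt would be sent to the provider; the reply is simulated here.
simulated_reply = "Draft ready for [EMAIL_1], regarding [DATE_1]."
restored_reply = detokenize(simulated_reply, table)

print(safe_prompt)
print(restored_reply)
```

The point of the design is where `table` lives: as long as it stays on the client, whoever receives `safe_prompt` holds tokens but neither the values nor the means to reverse them.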

  • Re-identification risk comes from quasi-identifiers left in the clear, not just the name.
  • Data that is merely masked remains personal data under the GDPR.
  • Neutralizing cross-referencing at the source — removing the data from the prompt — is the most robust measure.

That's exactly what ONYRI Sanitize is for: the engine replaces sensitive data with reversible tokens before sending, and only tokenized text reaches the model; both detection and the token ↔ value table stay in your browser. The provider receives no quasi-identifier to cross-reference, so there is nothing to re-identify on its side, whatever auxiliary dataset a third party might hold.

Frequently asked questions

Can anonymized data be re-identified?
Yes, often, when “anonymization” is limited to erasing direct identifiers. Quasi-identifiers remain (zip code, date of birth, sex…) and, cross-referenced with other datasets, they let someone find the person. Working from 1990 U.S. census data, Latanya Sweeney estimated that about 87% of Americans were likely uniquely identifiable from just the 5-digit zip code + sex + date of birth triplet. Only irreversible anonymization takes the data outside the GDPR's scope.
What's the difference between anonymization and pseudonymization?
Anonymization is irreversible: the data can no longer be linked to a person, and the GDPR no longer applies (Recital 26). Pseudonymization (Article 4(5)) replaces identifiers with codes, but a separately held key allows re-attribution; it therefore remains personal data subject to the GDPR. Per the ICO, it's a security measure, not a method of anonymization.
Does tokenization really protect against re-identification risk?
Tokenization is technically pseudonymization (reversible). With ONYRI, the token ↔ value table never leaves your browser: the AI provider receives only tokens, with no quasi-identifiers and no cross-referencing key. It has nothing to correlate with an auxiliary dataset, which defeats the three re-identification attacks (singling out, linkability, inference) on the provider's side.
