A 1996 rule is colliding with a 2026 technology. The collision isn’t a bug. It’s structural.
https://arxiv.org/abs/2602.08997

In 1996, the United States government solved the medical privacy problem.
The solution was elegant in its simplicity. The Health Insurance Portability and Accountability Act — HIPAA — established what became known as the Safe Harbour standard, later formalized in the Privacy Rule issued under the Act. If you wanted to share clinical data for research, you removed 18 categories of identifiers from the record: names, addresses, phone numbers, Social Security numbers, dates more specific than the year, geographic identifiers, and so on. Strip those fields, and the law considered the data de-identified. Safe to share. No longer subject to privacy protections.
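In code, the Safe Harbour model is literally a field filter. A minimal sketch, assuming hypothetical field names that stand in for a few of the 18 categories:

```python
# Sketch of the Safe Harbour model: de-identification as field removal.
# Field names here are hypothetical; the actual rule lists 18 identifier categories.

SAFE_HARBOUR_FIELDS = {
    "name", "address", "phone", "ssn",
    "dates", "zip_code", "email", "mrn",
}

def deidentify(record: dict) -> dict:
    """Drop the enumerated identifier fields; everything else passes through."""
    return {k: v for k, v in record.items() if k not in SAFE_HARBOUR_FIELDS}

record = {
    "name": "Jane Doe",
    "ssn": "123-45-6789",
    "zip_code": "02139",
    "note": "58F presents with progressive dyspnea on exertion over 3 weeks.",
}

clean = deidentify(record)
assert "name" not in clean and "note" in clean
```

Everything the filter does not explicitly name survives, including the narrative note, which is exactly where the argument that follows locates the residual signal.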
The standard worked because of an implicit assumption about computational power. Re-identification — figuring out who a stripped record belonged to — required matching it against known identifiers. Without those identifiers, the matching problem became computationally hard. Hard enough to be treated, for legal purposes, as impossible.
That assumption is now gone.
What the 18-Field Model Actually Assumed
To understand why this matters, you need to understand what Safe Harbour was actually doing.
It wasn’t trying to make records anonymous in some absolute philosophical sense. It was trying to make re-identification practically infeasible given the tools available to an adversary. The threat model was someone with a database and a list. Match the name in the clinical record to the name in a voter file. Match the address to a property record. Cross-reference the dates with public records.
Remove those 18 fields and the adversary has nothing to anchor on. The remaining text — clinical observations, diagnostic codes, treatment notes, lab values — was considered informationally inert. Useful for research. Useless for re-identification.
This was a reasonable assumption in 1996. It was probably still a reasonable assumption in 2010. It is not a reasonable assumption in 2026.
What LLMs Actually Learn
Large language models are trained on vast corpora of human-generated text. In doing so, they absorb something that isn’t quite knowledge and isn’t quite memory — it’s more like a statistical model of how different kinds of people express themselves in different contexts.
Clinical notes have a texture. The way a physician describes a patient’s complaint, sequences their observations, orders their differential, and documents their reasoning carries information that goes beyond the explicit content. People write differently. Specialists write differently from generalists. Patients from different demographic and geographic backgrounds present differently in their histories. Chronic conditions leave linguistic fingerprints in the way their course gets documented over time.
None of this is in the 18 fields. All of it survives de-identification.
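To make "linguistic fingerprint" concrete: even a toy character-trigram comparison separates a terse, abbreviation-heavy charting style from narrative prose. The notes below are invented, and real stylometric methods are far more sophisticated; this only illustrates that the signal lives in the text itself, not in any field a checklist could strip.

```python
# Toy stylometry: character-trigram profiles capture the "texture" of writing
# that field removal never touches. All example texts are invented.
from collections import Counter
from math import sqrt

def trigram_profile(text: str) -> Counter:
    """Count overlapping lowercase character trigrams."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two trigram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Two notes in the same terse charting style, one in narrative prose.
note_a1 = "Pt c/o SOB x3d. Denies CP. Lungs CTA bilat. Will obtain CXR."
note_a2 = "Pt c/o HA x2d. Denies fever. Neuro exam WNL. Will obtain CT."
note_b  = "The patient reports a three-day history of shortness of breath."

same = cosine(trigram_profile(note_a1), trigram_profile(note_a2))
diff = cosine(trigram_profile(note_a1), trigram_profile(note_b))
assert same > diff  # the abbreviated charting style clusters together
```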
A paper published in February by researchers including Kyunghyun Cho — one of the architects of the gated recurrent unit and not someone given to overstatement — and Eric Oermann from NYU’s neurosurgery AI lab makes the argument directly. They call it the paradox of de-identification: the same technology that needs de-identified data to train on is also the technology best positioned to break that de-identification.
The paradox cuts deep. It isn’t that LLMs are being deliberately weaponized against medical privacy, though that risk exists. It’s that a model trained on the aggregate texture of human language at scale has implicitly learned things about clinical writing that Safe Harbour never anticipated needing to protect.
Why This Isn’t Fixable With a Patch
When a security vulnerability is discovered in a system, the usual response is a patch. Update the software. Add a new rule. Expand the list.
The instinct with Safe Harbour is the same. If 18 fields aren’t enough, add a 19th. Add a 20th. Make the list longer.
This won’t work. And understanding why it won’t work is the heart of what makes this situation genuinely serious rather than merely technically interesting.
Safe Harbour’s 18-field list is a specification of identifying information as understood in 1996. The fields were chosen because they were the obvious anchors for re-identification given the tools of the time. But “identifying information” is not a fixed property of data. It’s a relationship between data and the tools available to analyze it.
What LLMs have done is make identifiable the statistical fingerprint in the remaining text — the text Safe Harbour never touched because it was considered analytically inert. You can’t patch that by extending the list, because the list is the wrong unit of analysis. The problem isn’t which fields you remove. The problem is that the residual text is now searchable in a fundamentally new way.
Stated plainly: Safe Harbour assumes that linkage remains expensive. LLMs make linkage cheap enough to automate. That’s the actual claim — not that models magically identify patients from prose, but that the transaction cost of assembling the re-identification graph has collapsed. The friction that protected most data in practice, even when theoretical privacy guarantees were thin, is gone.
Fixing this requires rebuilding from a different premise. Approaches like differential privacy — which adds calibrated mathematical noise to data to provide provable privacy guarantees regardless of what analytical tools emerge — or synthetic data generation — which creates statistically representative datasets that never corresponded to any real patient — operate on entirely different foundations. They don’t try to enumerate what’s identifying. They make re-identification mathematically bounded by construction.
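A minimal sketch of the differential-privacy premise, using the Laplace mechanism on a simple count query. The epsilon value and the scenario are illustrative; the point is that the guarantee comes from the noise calibration, not from enumerating identifiers:

```python
# Laplace mechanism: release a statistic plus noise calibrated to how much
# any one patient's presence can change it. The guarantee is mathematical,
# so it holds regardless of what analytical tools emerge later.
import math
import random

def dp_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Epsilon-differentially-private count.

    Adding or removing one patient changes a count by at most `sensitivity`,
    so Laplace noise with scale sensitivity/epsilon bounds what any observer
    can infer about any individual's membership in the dataset.
    """
    scale = sensitivity / epsilon
    u = random.random() - 0.5  # inverse-transform sample from Laplace(0, scale)
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_count + noise

# 120 patients with a given diagnosis; release a noisy count instead.
noisy = dp_count(120, epsilon=0.5)
```

Smaller epsilon means more noise and a stronger guarantee; the privacy budget is a tunable, provable quantity rather than a judgment call about which fields are identifying.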
That’s not a patch. That’s a paradigm replacement. It requires new infrastructure, new legal frameworks, new validation standards, and new institutional habits. It will take years, probably decades, to fully implement. During that time, Safe Harbour will remain the operative legal standard, applied to data environments it was never designed for.
A Fair Objection
The argument above has a gap worth naming.
Stylometry — the analysis of linguistic style — mostly identifies authors, not subjects. Clinical notes are written by clinicians, not patients. A model learning the distinctive documentation patterns of a particular neurosurgeon’s prose doesn’t immediately tell you which patient is which. And re-identification has always required linkage to an external dataset — the signal in the prose alone doesn’t resolve to a name. Latanya Sweeney’s famous demonstration in the 1990s, the foundational result in this field, depended on matching quasi-identifiers to voter rolls. Without the external dataset, you have a statistical portrait, not a person.
HIPAA even anticipated some of this. The expert determination pathway — less used, more expensive, more honest — always asked statisticians to evaluate re-identification risk using contemporary methods rather than a fixed checklist. Safe Harbour was the shortcut. The failure mode may be less “the framework is structurally wrong” and more “the cheap version of privacy compliance is aging badly.”
These are real concessions. Where they run thin is on the linkage assumption.
LLMs don’t conjure external datasets — but they dramatically lower the cost of constructing one. The same capabilities that allow a model to extract structure from clinical prose also allow it to correlate across insurance claims, social media profiles, physician review sites, and appointment scheduling data at a scale no human analyst could manage. The adversary doesn’t need a pre-built linked dataset. They can build one. The pieces that were previously too expensive to assemble are now cheap.
That changes the threat model even if it doesn’t change the underlying logic of re-identification.
The question of how strong the signal actually is in practice remains genuinely open. If it’s weak, Safe Harbour survives as a crude but workable filter. If it’s strong, it doesn’t. But the history of computational privacy is a sequence of signals that were considered “too weak to worry about” until demonstrated otherwise — ZIP code plus birthdate plus sex was negligible until Sweeney showed it wasn’t.
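Sweeney's linkage attack is, mechanically, just a join on quasi-identifiers. A sketch with invented records, matching a "de-identified" clinical release against a public roll:

```python
# Sweeney-style linkage, sketched: join a de-identified clinical release
# against a public record on quasi-identifiers. All data here is invented.

clinical_release = [  # names stripped, quasi-identifiers retained
    {"zip": "02138", "birthdate": "1945-07-31", "sex": "F", "dx": "hypertension"},
    {"zip": "02139", "birthdate": "1971-02-14", "sex": "M", "dx": "asthma"},
]

voter_roll = [  # public record: names present
    {"name": "A. Smith", "zip": "02138", "birthdate": "1945-07-31", "sex": "F"},
    {"name": "B. Jones", "zip": "02139", "birthdate": "1990-01-01", "sex": "M"},
]

def link(release, roll, keys=("zip", "birthdate", "sex")):
    """Re-attach names by exact match on the quasi-identifier tuple."""
    index = {tuple(r[k] for k in keys): r["name"] for r in roll}
    return [
        {**rec, "name": index[key]}
        for rec in release
        if (key := tuple(rec[k] for k in keys)) in index
    ]

matches = link(clinical_release, voter_roll)
assert matches[0]["name"] == "A. Smith"
```

The claim in the preceding paragraphs is not that this join is new; it is that LLMs collapse the cost of building `voter_roll`-like linkage tables from unstructured sources, and of extracting joinable structure from the clinical prose itself.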
There’s a useful parallel in cryptography. Side-channel attacks — a class of techniques that extract secret information not from the algorithm itself but from its physical implementation — were long dismissed as theoretical. Power consumption from a chip looked like noise until researchers demonstrated it leaked secret keys. The signal was always present. The detectors weren’t sensitive enough.
Clinical language may be in a similar position. In a medical record, the structured fields are the algorithm — the part everyone agreed to protect. The narrative text is the side-channel. The identifying information may have always been there, latent in the texture of how medicine gets documented. LLMs are simply the first detectors sensitive enough to extract it at scale.
That reframes the empirical uncertainty. The correct response to “we don’t yet know how strong the signal is” isn’t “therefore Safe Harbour is fine.” It’s “therefore we should find out before committing billions of patient records to a framework built on the assumption it’s negligible.”
The expert determination pathway exists precisely for moments like this. The fact that it’s almost never used is its own kind of answer.
The Regulatory Scramble Has Already Started
The practical consequence of a failing framework isn’t a clean transition to its replacement. It’s improvisation.
Washington State has already passed the My Health My Data Act — described by legal scholars as “super-HIPAA” — which extends health privacy protections to entities and data types that HIPAA doesn’t cover, and crucially grants individuals a private right of action to sue companies directly. The Federal Trade Commission has stepped in under its Section 5 authority to regulate health data practices for apps and services that fall outside HIPAA’s scope entirely. California has its own framework. The EU’s GDPR creates a parallel universe of obligations for any system touching European patients.
This fragmentation is not a feature. It’s what happens when a central framework starts failing and practitioners begin improvising local solutions because the standard can no longer hold. Researchers, hospitals, and technology companies are navigating an increasingly incoherent patchwork, often with legal exposure they don’t fully understand and that their lawyers can’t fully advise on.
The companies building health AI systems right now are, in many cases, making technical and legal bets on a standard that is structurally compromised. That’s a significant risk that isn’t yet widely understood outside the privacy research community.
What Comes Next
Safe Harbour won’t collapse overnight. Legal and regulatory frameworks are slow to change, and the health data ecosystem has enormous institutional inertia. HIPAA won’t be repealed or dramatically revised in the near term. The 18-field standard will continue to be used.
But the academic literature is starting to move. Papers like the one from Cho and Oermann’s group don’t appear in isolation — they tend to cluster, cite each other, and build into a critical mass that eventually forces regulatory attention. The FTC and HHS have both signaled awareness of re-identification risks in the LLM era. Congressional interest in AI and health data is growing, albeit slowly and with limited technical sophistication.
The question isn’t whether Safe Harbour will be replaced. It’s how long the transition takes, how much data exposure occurs in the interim, and whether the replacement frameworks — differential privacy, synthetic data, or something not yet articulated — will be adopted fast enough to matter.
For anyone building in the health data space, or advising those who are: the legal ground is less solid than it appears. A compliance posture built on Safe Harbour is a compliance posture built on a 1996 threat model. That’s not necessarily non-compliant today. It may be tomorrow.
But the more likely trajectory isn’t a dramatic collapse. Frameworks rarely die that way. Safe Harbour may instead lose credibility gradually — as regulators absorb the research, as re-identification demonstrations accumulate, as state laws multiply and courts begin interpreting them. Erosion rather than rupture.
The strange irony in the whole story is that the technology threatening the model also offers the tools to replace it. Differential privacy, synthetic datasets, and secure data enclaves all rely heavily on the same statistical machinery as the models creating the problem. The algorithms that threaten anonymity may end up being the algorithms that enforce it.
The assumption that made Safe Harbour work is already gone. What replaces it is still being written.
But the deeper reason it’s hard to replace is worth sitting with. Safe Harbour treated identity as something attached to specific fields in a record — a name, a date, an address. Remove those fields and the identity was gone. The LLM era treats identity as something that can emerge from patterns across data. You can strip every column and the correlations remain.
Identity used to live in columns.
Now it lives in correlations.
That’s why a list-based fix can’t work. And that’s why whatever comes next will look less like a longer checklist and more like a different kind of mathematics.
The paper referenced is: Jiang, L.Y., Liu, X.C., Cho, K., & Oermann, E.K. (2026). “Paradox of De-identification: A Critique of HIPAA Safe Harbour in the Age of LLMs.” arXiv:2602.08997
