A New Way to Solve Genetic Mysteries — While Protecting People’s DNA Data

Researchers propose a method to balance the power of genomic searches with privacy concerns.

June 30, 2021

By Dave Gilson

Millions of people who have never taken a DNA test can be traced through public genetic databases. | Illustration by Álvaro Domínguez.

In just a few years, public DNA databases have emerged as powerful tools for solving genetic mysteries. They’ve been used to locate long-lost relatives and help adopted children find their biological parents. They’re perhaps best known for helping cops solve cold cases, from IDing anonymous bodies to sniffing out criminal suspects. In the most famous example, the Golden State Killer was identified from a nearly 40-year-old DNA sample that linked him to the genetic profiles of his distant cousins, which had been posted on the website GEDMatch.

While these kinds of breakthroughs have proved the promise of investigative genetic genealogy, they’ve also raised serious privacy concerns — not just for people who have shared their genomic profiles, but for millions of people who have never even taken a DNA test.

GEDMatch and MyHeritage, two of the most popular DNA sites, currently have 1.4 million and 1.3 million members, respectively. However, the number of people whose identities might be traced through these databases is much larger.

“In theory, you could detect genetic relatedness between two genomes if they share ancestors within the past five generations,” explains Mine Su Erturk, a PhD student in operations, information, and technology at Stanford Graduate School of Business. In other words, your DNA links you to hundreds of distant cousins who share a small percentage of their genetic material with you but are otherwise perfect strangers. And, Erturk points out, “This also goes the other way for the next five generations: It affects your children and grandchildren who might not even be alive yet.”

A 2018 study in Science found that 60% of Americans with European ancestry could be traced through data shared on MyHeritage. If just 2% of U.S. adults uploaded their DNA to a genetic database, that information could be used to reconstruct the identities of 90% of the total population.

Yet according to Erturk, genetic investigators can find genealogical needles in the haystack without needlessly exposing people’s sensitive data. In a recent preprint, Erturk and her advisor, Associate Professor Kuang Xu, detail a model for genomic searching that’s designed to minimize privacy risks while remaining effective.

When Erturk first became familiar with genetic databases a few years ago, “there was some discussion in the academic community about the privacy aspect of this, but no one was analyzing the problem from an operational perspective,” she says. She and Xu believe their work breaks new ground in an area whose ethical and legal dimensions are being vigorously debated.

By presenting a rigorous model for addressing genetic search’s privacy flaws, they hope more discussion and policy changes will follow. “The current system does not explicitly take privacy risks into account,” Xu says. “Our first goal is to raise awareness of the importance of tracking privacy exposures. But we also want to propose concrete steps toward a solution.”

The Gene Genie

Currently, access to public DNA databases is virtually unrestricted and all but unregulated. Erturk and Xu say genetic data could be collected by pharmaceutical companies seeking to market drugs or life insurance companies screening customers for inherited conditions. (People who share their genetic code also may be vulnerable to data breaches and attacks. Last year, one million GEDMatch profiles were hacked; some of the stolen data was used to target MyHeritage users in phishing attacks.)

To protect DNA database users and their family networks, Erturk and Xu propose a new way of searching for genetic matches. Currently, genetic searches are “static,” meaning that searchers can compare a DNA sample with any record in a database until they find a match. Erturk and Xu have developed a “sequential” model where searchers would not have unlimited access to a database. Instead, they would look for matches in small, selected batches of data, using publicly available genealogical records, such as birth and marriage certificates, to target and refine their search.

Erturk explains how this approach would work while looking for a DNA match on a site like GEDMatch: “I’ll first look at genealogical records to identify a couple of people who might be related or who might give me some leads. Then I’ll only look at their genomes instead of looking at the entire database. If I can locate my person relative to these genomes, I’m done — and I only exposed a couple of people instead of the entire database. If I don’t succeed, I go back to my genealogical records, try to come up with another list of, say, 10 people, and do this process repeatedly, in a sequence.”

By limiting the searcher’s access to sensitive data, this approach exposes as few people’s data as possible, widening the search only until it hits its target. Erturk and Xu say the mathematical framework detailed in their paper can be controlled precisely and “vastly outperforms” static search in optimizing the trade-off between search time and privacy.
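To make the contrast concrete, here is a minimal toy sketch of the two search styles described above. It is not the mathematical framework from Erturk and Xu’s paper: the 100-person database, the two relatives, the batch order, and the function names `static_search` and `sequential_search` are all hypothetical, chosen only to show how querying small batches can expose fewer records than scanning everything.

```python
def static_search(database, is_match):
    """Compare the sample against every record; the whole database is exposed."""
    hits = [person for person in database if is_match(person)]
    return hits, len(database)


def sequential_search(batches, is_match):
    """Query small batches in order and stop at the first batch with a match."""
    exposed = 0
    for batch in batches:
        exposed += len(batch)  # every genome in a queried batch counts as exposed
        hits = [person for person in batch if is_match(person)]
        if hits:
            return hits, exposed
    return [], exposed


# Hypothetical toy data: 100 profiles, two of which are relatives of the sample.
database = [f"person_{i}" for i in range(100)]
relatives = {"person_42", "person_57"}


def is_match(person):
    return person in relatives


# Hypothetical batches a searcher might assemble from genealogical records.
batches = [database[30:40], database[40:50], database[50:60]]

static_hits, static_exposed = static_search(database, is_match)
seq_hits, seq_exposed = sequential_search(batches, is_match)

print("static exposed:", static_exposed)      # 100 records examined
print("sequential exposed:", seq_exposed)     # 20 records examined
```

In this toy run the sequential search stops after its second batch, having looked at 20 genomes instead of 100. How much it saves in practice would depend on how well the genealogical records guide the choice of batches, which is the part the researchers’ model is meant to optimize.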

Old Rulers, New Rules

When she was ready to test this model’s effectiveness on real-world data, Erturk faced a particular challenge. She wanted to use an actual genealogical network, but did not want to infringe on anyone’s privacy. Her solution: use the family tree of more than 2,500 interconnected members of European royal families. “It’s public and everyone knows it, so there wouldn’t be any privacy concerns,” she says.

Erturk’s work builds on the literature of search problems, which often involve scenarios where searchers are looking for hidden targets like submarines or terrorists. To Erturk and Xu’s knowledge, their genetic search model is the first to factor a privacy dimension into this type of problem.

While their analysis is based on advanced mathematics, its basic concept will be familiar to anyone who’s watched a police procedural like The Wire. Xu compares it to phone tapping: If the cops want to listen to a suspect’s calls, it’s not practical (or legal) to listen in on every phone line in town. Instead, they must target their search — and get a judge to sign a warrant — before they can collect evidence.

Xu thinks that criminal investigators looking for DNA matches in public databases should operate under similar constraints that prevent them from sifting through huge amounts of personal data. (So far, only Montana and Maryland have enacted laws regulating the use of genetic genealogy by law enforcement agencies.)

While Erturk and Xu do not make explicit policy suggestions in their paper, they see their model as a first step toward answering the many logistical and legal questions about how our most personal data is stored and accessed. And the growth of personal DNA collection may require us to adopt a different conception of privacy than we’re used to.

“Because we are all related, society has to think about genetic privacy as a collective responsibility,” Xu says. “You have to protect your mother, your father, your son, your daughter, and your cousins.”
