DNA donors' identities fairly easy to uncover online, study finds

Genetic information stored anonymously in databases doesn't always stay that way, a new study revealed, raising concern about how much privacy participants in research projects can expect in the Internet era.

Tension has long existed between the need to share data to drive medical discoveries and the fact that many people don't want personal health information disclosed. The growing use of genetic sequencing makes this even more challenging because genetic data reveals information not only about an individual, but also about his or her relatives.

In a paper published Thursday in the journal Science, researchers were able to determine the identities of nearly 50 people who had submitted genetic information as part of scientific studies. The people were told that no identifying information would be included in the studies but were warned of the remote possibility that at some point in the future, their identities might become known.

"We have been pretending that by removing enough information from databases that we can make people anonymous. We have been promising privacy, and this paper demonstrates that for a certain percent of a population, those promises are empty," said John Wilbanks, chief commons officer at Sage Bionetworks, a nonprofit organization that promotes data sharing, who wasn't involved in the study.

The public and the scientific community are concerned about DNA privacy because genetic information, which can show susceptibility to certain diseases and other ailments, might be used by insurers, employers or others to discriminate against people.

In the new study, the researchers, led by the Whitehead Institute for Biomedical Research in Cambridge, Mass., used the genetic information of people whose genomes had been anonymously published as part of the 1000 Genomes Project, an international collaboration to create a public catalog of data from at least 1,000 people of different ethnic and population groups.

Using a computer algorithm, the researchers focused on identifying unique genetic markers on the Y chromosome of men in the project. They searched publicly accessible genealogy databases that contain both Y chromosome information and men's surnames.

Such genealogy sites, which people join in hopes of compiling their family tree, sometimes include Y chromosome data because it is passed from father to son and can be traced back generations. Some genealogy sites group such genetic information with surnames.

When they got a match to a surname, the researchers ran numerous Internet searches to collect data on each individual's family tree, including obituaries, which often list the names of a deceased person's family members. They also searched for demographic data on the public website of the Coriell Institute for Medical Research, a nonprofit in Camden, N.J., that houses collections of genetic material.
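The matching steps described above can be illustrated with a small sketch. It is not the researchers' algorithm, which the article does not detail. It assumes a made-up genealogy table that pairs surnames with Y chromosome marker values, and a made-up list of public records carrying a birth year and a state as stand-ins for "demographic data." Every name, value and function here (`infer_surname`, `narrow_candidates`, the surnames, the marker repeat counts) is hypothetical.

```python
from collections import Counter

# Hypothetical genealogy records: (surname, Y chromosome marker profile).
# The marker values are invented for illustration only.
GENEALOGY_RECORDS = [
    ("Alder", {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}),
    ("Alder", {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS393": 13}),
    ("Birch", {"DYS19": 15, "DYS390": 23, "DYS391": 10, "DYS393": 12}),
    ("Cedar", {"DYS19": 16, "DYS390": 25, "DYS391": 11, "DYS393": 14}),
]

# Hypothetical public records of the kind found in obituaries or directories.
PUBLIC_RECORDS = [
    {"name": "J. Alder", "surname": "Alder", "birth_year": 1950, "state": "UT"},
    {"name": "K. Alder", "surname": "Alder", "birth_year": 1972, "state": "NY"},
    {"name": "L. Birch", "surname": "Birch", "birth_year": 1950, "state": "UT"},
]


def marker_distance(a, b):
    """Count mismatching markers among those both profiles share."""
    shared = a.keys() & b.keys()
    if not shared:
        return None  # nothing to compare
    return sum(1 for marker in shared if a[marker] != b[marker])


def infer_surname(profile, records, max_mismatches=1):
    """Return the surname most often seen among close marker matches."""
    votes = Counter()
    for surname, record_profile in records:
        distance = marker_distance(profile, record_profile)
        if distance is not None and distance <= max_mismatches:
            votes[surname] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def narrow_candidates(people, surname, birth_year, state):
    """Keep only people whose surname and demographics all match."""
    return [
        person for person in people
        if person["surname"] == surname
        and person["birth_year"] == birth_year
        and person["state"] == state
    ]


anonymous_profile = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}
surname = infer_surname(anonymous_profile, GENEALOGY_RECORDS)
candidates = narrow_candidates(PUBLIC_RECORDS, surname, 1950, "UT")
```

In this toy data the surname alone still leaves two possible people, and it is the added demographic details that reduce the list to one. That mirrors the article's point that removing names from a database does not by itself make the data anonymous.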

With the family-tree data, they were able to identify nearly 50 men and women who participated in genetic studies. "It only takes one male," said Yaniv Erlich, a Whitehead fellow, who led the research team. "With one male, we can find even distant relatives."

Erlich said the technique works best for the group with the highest participation in genetic genealogy services: upper- and middle-class Caucasian Americans. The researchers estimated that their technique would succeed in identifying the last names of 12 percent of U.S. Caucasian males in similar DNA studies.

The researchers didn't disclose the names of the DNA donors they discovered.

Hank Greely, director of the Center for Law and the Biosciences at Stanford University, said the study raises important questions about expectations of privacy. In an age when genetic information is being collected as part of medical care and can be correlated with personal information people freely post online, Greely said the medical and scientific communities need to be clear that "we cannot promise people confidentiality."