“This is our brand new freezer,” Don Humphries said. “It holds 4 million vials.”

You’d think a freezer big enough to hold 4 million vials of blood would be easy to spot. But to my great embarrassment, I couldn’t see it.

Humphries and I were standing in a lab in the basement of the Veterans Affairs hospital in the Jamaica Plain neighborhood of Boston. He had led me through a labyrinth of windowless rooms, packed with robots handling tubes of blood donated from veterans, pipes roaring with coolant, and gorilla-sized tanks of liquid nitrogen, until he stopped next to a featureless wall.

After a few awkward moments, I admitted my ignorance. “So, where is the freezer?” I asked.

Humphries, the scientific director of the lab, blinked and then looked at the featureless wall. “Right here,” he said. He craned his head upwards. “This is it.”

I followed his gaze, and then it clicked. The wall was actually the side of a vault that seemed to be about as big as a two-story house.

Near the top I could spy a small window. Humphries led me up a mobile staircase so that I could look through it. Inside the vault was a long, dimly lit corridor, flanked on either side by 16 separate compartments cooled to as low as 80 degrees below zero Celsius. A robot inside the freezer ferried vials to their assigned compartments.

This is no walk-in freezer.

The freezer, in fact, is at the heart of one of the most ambitious projects ever undertaken to understand our DNA. The Department of Veterans Affairs is gathering blood from 1 million veterans and sequencing their DNA. At the same time, computer scientists are creating a database that combines those genetic sequences with electronic medical records and other information about veterans’ health.

The ultimate goal of the project, known as the Million Veteran Program, is to uncover clues about disorders ranging from diabetes to post-traumatic stress disorder.

Since the program’s launch in 2010, the VA has spent $30 million building and running MVP. Because it cares for 8.76 million veterans enrolled in the Veterans Health Administration, the VA has a strong interest in understanding the role that genes play in the diseases they develop. The VA is also uniquely situated to carry out this kind of project, in part because veterans tend to have medical records in the system that stretch back decades. But the research being done as part of the MVP, which has already enrolled more than 420,000 participants, could have implications that reach far beyond the VA.

“We’re working in a space where no one has ever worked before on this scale,” said Dr. Michael Gaziano, one of the principal investigators of the Million Veteran Program.

If the project develops as planned, it could fuel discoveries for years to come, leading to new medical treatments not just for veterans, but for all patients. “We hope that folks who follow us come up with ideas that we can’t even think of right now,” said Dr. John Concato, the other principal investigator on the program.

For decades, researchers have been trying to find links between genes and diseases, but for a long time their studies often ended in disappointment. They came to realize that they weren’t comparing enough people’s DNA to get a clear picture. And so, in the United States and abroad, scientists began gathering DNA on a huge scale. The biggest of these so-called biobanks now hold DNA from hundreds of thousands of people. In January, President Obama announced plans for the Precision Medicine Initiative, which will create a database with over 1 million participants.

For now, the Precision Medicine Initiative is a plan. The Million Veteran Program, on the other hand, is up and running. In fact, it’s already mature enough for several teams of scientists to start searching its database for links between genes and diseases. But to reach this point, the MVP had to come up with new solutions for unprecedented challenges — how to recruit participants on a massive scale, how to keep their data safe without impeding scientific research, and how to squeeze hidden information out of their data with new artificial intelligence systems.

That’s not to say that MVP’s work is over. It now brings in about 100,000 new participants a year, and at that pace will enroll its millionth veteran in 2018. The MVP computer scientists are scrambling to assemble enough computing power to store and analyze the growing data. And it’s up to Humphries — whose lab is known as the Core Laboratory at the Massachusetts Veterans Epidemiology Research and Information Center — to store away millions of blood vials in his giant freezer.

On the day of my visit, his team had already stored over half-a-million samples from over 100,000 veterans, and had a backlog of hundreds of thousands more samples waiting in temporary storage.

Yet Humphries seemed strangely at peace with the colossal numbers that now rule his life. “Next year, it might get easier,” he said with a shrug.

The push for big science

Today it’s easy to forget just how little the first generation of geneticists a century ago knew about genes. They couldn’t even study genes directly. Instead, they pieced together clues, such as the way diseases ran in families. They found that parents with Huntington’s disease, for example, passed down the disorder to half their children. Only in the 1990s did geneticists discover that these parents pass down a faulty copy of the gene encoding a protein called huntingtin.

Huntington’s disease is simple as diseases go. Many common disorders such as heart disease and diabetes are the result of a complex interplay between many different genes and the environment. “You’ve got the hand you’ve been dealt, and how you play that hand,” said Gaziano.

For doctors who work at the VA, one of the most important of those complex disorders is post-traumatic stress disorder. An estimated 12 percent to 20 percent of Iraq War veterans are treated for PTSD in a given year, according to the VA. Experiencing trauma in war isn’t a guarantee that troops will develop PTSD, though. “You have two people who sit in the same foxhole and see the exact same thing,” said Gaziano. “One guy sleeps great every night and the next guy relives that over and over again.”

In the 1990s, scientists found the first clues that genes are a source of these differences. They looked at the incidence of PTSD in identical twins, who have virtually identical genes, and found that they tended to experience the same outcomes more often than other siblings — even fraternal twins.

“Environmental factors are probably more important than genetic factors, overall, but not by that much,” said Dr. Joel Gelernter, a psychiatrist at Yale School of Medicine who also does research at the VA hospital in West Haven, Conn.

Over the past decade, Gelernter and his colleagues have searched for the specific genes involved in PTSD. They’ve compared the DNA in people with the disorder and without, looking for variations that turn up unusually often in people who suffer from it. They’ve found a few genes that show some promising hints of being involved. Unfortunately, those hints have a way of melting into air. When researchers look at different groups of people, they find different genes that appear to play a role in PTSD. “They may well turn out to be correct, or they may well not turn out to be correct,” said Gelernter.

The problem is that complex disorders like PTSD involve not just one gene, like Huntington’s disease, but hundreds. The most common variations in those genes tend to have a very small effect on the risk of a disorder, making it hard to distinguish them from harmless mutations. Other mutations have a big impact, but they’re typically so rare that scientists who study small groups of people may never observe them.

To get beyond this impasse, researchers have been developing new methods. They’ve invented more powerful statistical techniques, and they’ve also put in the extra effort to study many more people. “I’ve pounded the message for 10 years that sample size is everything,” said Jeffrey Barrett, a scientist at the Wellcome Trust Sanger Institute, a genome research center in England.

Big sample sizes may make studies more powerful, but they also demand a huge amount of work — to find people to participate in them, to gain their informed consent, and to gather data from them. And every time scientists set out to study a new condition — be it blood pressure or near-sightedness or height — they have to bring together yet another gargantuan cohort.

By the mid-2000s, a number of researchers — including epidemiologists and geneticists who work at the VA — recognized that there was another route to the big numbers they needed. They could create a single, enormous biobank.

Such a biobank would contain sizable numbers of people with a wide range of conditions, ready to be studied. Instead of spending time and money gathering yet another set of 10,000 participants for each new study, scientists could jump straight to the science.

VA researchers recognized that they might be able to build a biobank with some advantages that others would lack. VA medical records provide a rich source of information about patients, because veterans who join the VA health system can continue to use it no matter where they move in the country. As a result, the VA’s medical records are deep, including information ranging from lab tests to X-rays and prescriptions. The VA has been digitizing medical records since the 1980s, which meant that the researchers at the MVP wouldn’t have to spend years scanning millions of pieces of paper before they could start their research.

Yet another advantage was the veterans themselves, who have a long tradition of volunteering in high numbers for research studies. In 2009, Johns Hopkins University researchers surveyed veterans to see what they thought about the idea of a VA biobank, and found that 71 percent said they would definitely or probably participate. When the VA researchers considered the 8.76 million veterans who use the VA health system, they decided to reach for the brass ring: a biobank with 1 million participants.

‘What’s to lose?’

Richard Gray found out about MVP one day in 2011 while he was waiting for a regular doctor’s appointment at the VA hospital in West Haven, Conn. Karen Anderson, an MVP research coordinator, approached him and explained how the program worked. “It made sense to me,” said Gray, who served as a Marine sergeant in Vietnam. “If I can help by being part of the program, then why not?”

Gray later went to the MVP office at the VA, where Anderson took him through the 11-page informed consent form. She explained that he wouldn’t get paid for participating, nor would any information about his participation go into his medical record. And because geneticists still know so little about how genes influence most diseases, MVP would not report the test results to Gray, or to his doctors.

On Gray’s subsequent visits to the VA, he’d watch Anderson approach other vets about MVP. “When people said no, I’d lean in my face and say, ‘What’s to lose?’ ” he said.

As the program geared up, millions of brochures went into the mail. A call center began fielding questions from curious veterans. Recruiting offices opened up at VA facilities across the country, and soon thousands of veterans were signing up each month.

Other biobank experts have been impressed with the turnout. “They’re actually recruiting a big percentage of the veteran population into the MVP, which I think is pretty remarkable,” said Dr. Josh Denny, a Vanderbilt University physician-scientist who currently serves on the Precision Medicine Initiative Working Group.

In the United Kingdom, by comparison, the UK Biobank has signed up half a million volunteers out of a national population of 64 million. MVP will reach that mark next year, recruiting from a population just a tenth the size.

Stacey Whitbourne, the program director for MVP, thinks the veterans themselves deserve the lion’s share of the credit for this high rate. “Oftentimes we may have a vet traveling 250 miles just for a 20-minute study visit to enroll in MVP,” she said. “They feel so strongly about being a part of it.”

Major project, major challenges

In the 2009 survey about MVP, veterans said that privacy and security were major concerns for them. “Giving up something personal is the last thing I want,” said Tom Allen, a retired Air Force master sergeant who enrolled in MVP in 2010. “I’ve had credit cards hacked, a camera bought in Vietnam, a one-way ticket to India, and two weeks in Vegas.”

To protect their privacy, the MVP staff quickly separates the identity of enrolling veterans from their data. “They’re completely anonymous — they’re a tube of blood with a history attached to it,” said Anderson.

The creation of that anonymity — without losing track of all the information tied to each veteran — is handled by MVP’s computer system, known as GenISIS (short for Genomic Information System for Integrative Science). Each time a new set of data is generated for a veteran — a DNA sequence, a health survey, a set of medical records — the system creates a different number to identify it. “Once that connection is made, it stays in GenISIS,” said Saiju Pyarajan, the scientific director of GenISIS.
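The linking scheme Pyarajan describes can be pictured with a minimal sketch. This is not GenISIS code. The class name, the participant label, and the choice of random hexadecimal identifiers are all assumptions made for illustration. It shows only the general idea: each data set gets its own unrelated number, and the table connecting those numbers to a person never leaves the registry.

```python
import secrets


class LinkRegistry:
    """Hypothetical registry; only it knows which data sets belong together."""

    def __init__(self):
        # dataset_id -> (participant, kind); kept private to the registry
        self._links = {}

    def register(self, participant, kind):
        """Issue a fresh random identifier for one new data set."""
        dataset_id = secrets.token_hex(8)
        while dataset_id in self._links:  # retry on the rare collision
            dataset_id = secrets.token_hex(8)
        self._links[dataset_id] = (participant, kind)
        return dataset_id

    def datasets_for(self, participant):
        """Internal lookup: every (kind, dataset_id) tied to one participant."""
        return [
            (kind, dataset_id)
            for dataset_id, (owner, kind) in self._links.items()
            if owner == participant
        ]


registry = LinkRegistry()
dna_id = registry.register("veteran-001", "dna")
survey_id = registry.register("veteran-001", "survey")
# The two identifiers differ and neither reveals the participant label.
```

A researcher handed only `dna_id` and `survey_id` would have no way to tell that they describe the same person. In this sketch, that connection exists only inside the registry.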

But privacy is only one of the challenges facing Pyarajan and his colleagues. For one thing, the program generates huge amounts of data. After Humphries and his laboratory team process the blood samples, they ship out purified DNA from each veteran to gene-sequencing companies. Every veteran’s DNA is scanned for about 750,000 genetic markers scattered across their genome. In addition, they take a closer look at a few percent of the samples. In some cases, they carry out a process called whole-exome sequencing. (Exomes are the regions of the human genome that encode proteins.) In other cases, they sequence the entire genome of veterans.

Even now, with fewer than half a million veterans enrolled, GenISIS is already groaning under its load of data. Currently, Pyarajan has dedicated 4 petabytes of memory to store all the information. If you used that memory to store HD movies, it would take 53 years of continual viewing to watch them all. And yet Pyarajan is already starting to run out of space.

Pyarajan and his colleagues also need to ensure the data that end up in GenISIS are accurate. Blood tubes can get stored in the wrong place. Names can get misspelled. Medical records can get mixed up. DNA can get improperly sequenced. And with hundreds of new veterans volunteering for MVP every day, there simply isn’t enough time for people to manually check every piece of data that goes into GenISIS. Instead, Pyarajan and his colleagues have programmed a series of checkpoints at which the MVP computers scrutinize incoming data. Anything that raises flags has to then get approved by a human.
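A rough sketch of what such a chain of checkpoints might look like follows. The field names, the three checks, and the 0.97 threshold are assumptions chosen to mirror the errors listed above, not details of the MVP system. Clean records pass straight through, and anything flagged lands in a queue for a person to review.

```python
from dataclasses import dataclass

MIN_CALL_RATE = 0.97  # assumed threshold, not an MVP figure


@dataclass
class Sample:
    sample_id: str
    surname: str
    logged_slot: str    # freezer slot recorded at intake
    scanned_slot: str   # slot reported when the tube was stored
    call_rate: float    # fraction of genetic markers read successfully


def check_storage(sample):
    if sample.logged_slot != sample.scanned_slot:
        return "tube is not in its assigned slot"
    return None


def check_name(sample):
    cleaned = sample.surname.replace("-", "").replace("'", "").replace(" ", "")
    if not cleaned.isalpha():
        return "surname contains unexpected characters"
    return None


def check_sequencing(sample):
    if sample.call_rate < MIN_CALL_RATE:
        return "too few markers were read"
    return None


CHECKPOINTS = [check_storage, check_name, check_sequencing]


def triage(samples):
    """Split samples into auto-accepted IDs and a human review queue."""
    accepted, review_queue = [], []
    for sample in samples:
        results = [check(sample) for check in CHECKPOINTS]
        flags = [message for message in results if message is not None]
        if flags:
            review_queue.append((sample.sample_id, flags))
        else:
            accepted.append(sample.sample_id)
    return accepted, review_queue
```

Keeping each check as its own small function means a new kind of error can be caught by adding one entry to the list, without touching the rest.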

“You know ‘garbage in, garbage out’? With big data, it’s bigger garbage to begin with,” said Pyarajan. “If we don’t clean it now, cleaning it later will be humungous.”

Even at its cleanest, though, the MVP’s data have some limits. Only about 8 percent of the participants are women, for example, reflecting the small fraction of US veterans who are female. “You’re not going to do a good ovarian cancer study with it,” said Denny.

Getting information out of electronic health records also creates a challenge of its own. In a traditional genetic study, scientists would carefully examine all their potential participants to make sure they had a particular disorder and not a different one with similar symptoms. Scientists who study PTSD, for example, might have each subject diagnosed by two separate psychiatrists.

Electronic medical records, by contrast, are filled out by doctors in the ordinary course of treating patients. A veteran enrolling in the VA system may already have a chronic hepatitis C infection, for example. Although the patient may get treated for the disease, an official diagnosis may not appear in the records. MVP researchers are now developing methods to make those diagnoses based on the clues sprinkled in electronic health records. If a veteran is getting prescribed drugs typically used for hepatitis C and has lab tests showing liver damage, for example, that could potentially be enough information to classify him as having hepatitis C.
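The hepatitis C example amounts to a simple rule, which can be sketched in a few lines. The drug list, the lab cutoff, and the record layout below are illustrative assumptions only. They are not clinical guidance and not the MVP researchers’ actual method.

```python
# Illustrative placeholders: antivirals used against hepatitis C.
HCV_DRUGS = {"sofosbuvir", "ledipasvir", "ribavirin"}

# Assumed cutoff for the liver enzyme ALT, in units per liter.
ALT_UPPER_LIMIT = 40.0


def likely_hepatitis_c(record):
    """Classify a record as probable hepatitis C from indirect clues.

    `record` is a hypothetical dict with optional keys:
      "diagnoses"     - list of diagnosis strings
      "prescriptions" - list of drug names
      "alt_results"   - list of ALT lab values
    """
    diagnoses = [d.lower() for d in record.get("diagnoses", [])]
    if "hepatitis c" in diagnoses:
        return True  # an explicit diagnosis settles it

    on_hcv_drug = any(
        drug.lower() in HCV_DRUGS for drug in record.get("prescriptions", [])
    )
    liver_damage = any(
        value > ALT_UPPER_LIMIT for value in record.get("alt_results", [])
    )
    # Require both clues, as in the example above.
    return on_hcv_drug and liver_damage
```

Requiring both clues is a deliberate choice in this sketch. A single abnormal liver test has many possible causes, so the rule fires only when a drug and a lab result point the same way.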

“I’d like to think we’re pioneers in this area,” said Gaziano. “We are trying to develop this whole new science.”

Taking data on a test drive

MVP may still be three years from hitting the million-veteran mark, but Gaziano and his colleagues have decided to take it out for a spin around the neighborhood. “It’s best to know what we need by trying out a couple experiments,” he said. “We also want to share with the world in some tangible ways what they’re getting for the money they’re spending.”

Gelernter is among the first scientists to put MVP through its paces. He and his colleagues are conducting a new study on PTSD, hoping to have more success at pinpointing genes than they’ve had in the past. In their previous study, published in 2013, they studied a total of 4,344 people. For their new study, Gelernter and his colleagues plan to study almost five times more people, comparing 10,000 veterans with PTSD to 10,000 controls.

The study will not only be bigger, but more precisely focused. In their 2013 study, Gelernter and his colleagues studied civilians who had experienced a wide range of traumas, from child abuse to house fires. In their new study, Gelernter and his colleagues will be able to study PTSD triggered by one particular kind of stress — namely, combat. “To my point of view, this is a pretty ideal way of doing it,” said Gelernter.

If Gelernter and his colleagues can find strong evidence about the genes involved in PTSD, those genes may point them to the underlying biology. It may turn out, for example, that many of the genes encode proteins that all work together in a particular region of the brain. Drug developers could then test out chemicals that affect one of those proteins to see if any of them can lessen the symptoms of PTSD.

Ultimately, MVP might even become a part of the health care that the VA provides to veterans. “We’re not ready to feed back the information to patient care, but that’s our long-term goal,” said Concato.

By tracking how veterans respond to different drugs, the MVP may pinpoint genetic variations that make certain medications fail. If VA patients someday get their DNA sequenced as part of their care, their doctor could match the best drug to their genetic profile. “The patient doesn’t have to go through a couple of rounds to get a drug that works,” said Pyarajan.

It’s impossible to predict with certainty how many goals MVP will reach. But it’s pretty likely, said Whitbourne, that it will outgrow its name.

“It’s not like we get to a million and then stop,” she said. “We’ll continue to make this a living cohort as long as we can.”