Saturday, January 2, 2010

Finding disease mutations in a sea of noise

Why massive DNA sequencing in search of cancer-related mutations is unlikely to improve cancer treatment in the real world anytime soon (from the Genetic Future blog)

Review of Jones et al. (2009). Exomic Sequencing Identifies PALB2 as a Pancreatic Cancer Susceptibility Gene. Science DOI: 10.1126/science.1171202


"A paper published online today in Science illustrates both the potential and the challenges of using large-scale DNA sequencing to identify rare genetic variants underlying disease risk...

First, the good news: as you might have guessed from the fact that the study is published in Science, the authors did in fact find the likely disease-susceptibility mutation. They were able to distinguish this mutation from the many other variants in the patient's exome (more on those in a second) by a particular quirk of cancer susceptibility variants: they are often found in only a single copy (along with a healthy version of the gene) in normal tissue from a patient, whereas in cancer cells from the same patient the normal copy is disrupted.

Now, the bad news: the researchers also found a whole stack of red herrings. In total, the authors looked at sequence from 20,661 genes, and identified 15,461 genetic variants not found in the reference human genome. Of these, 7,721 changed the sequence of the encoded protein, 64 resulted in abnormal stop codons, 108 were predicted to alter RNA splicing of the gene, and 250 were small deletions or insertions (115 of which would be predicted to dramatically alter the encoded protein through a frameshift). The stop codons, splicing mutations and frameshift insertion/deletions, and many of the protein sequence-altering variants, would all have to be regarded as plausible candidates for a disease-causing mutation.
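To get a feel for the size of the haystack, the counts quoted above can be tallied in a few lines of Python. This is only a back-of-the-envelope sketch: the variable names are mine, and it assumes the stop-codon, splicing and frameshift categories do not overlap, which the excerpt does not state.

```python
# Counts quoted in the excerpt for a single patient's exome.
total_variants = 15_461      # variants not found in the reference genome
protein_changing = 7_721     # change the encoded protein sequence
stop_codons = 64             # introduce abnormal stop codons
splice_altering = 108        # predicted to alter RNA splicing
indels = 250                 # small insertions or deletions
frameshift_indels = 115      # subset of the 250 indels

# Assumption (not stated in the excerpt): these three categories are disjoint.
severe_candidates = stop_codons + splice_altering + frameshift_indels

# Fraction of all novel variants that change the protein sequence.
protein_changing_share = protein_changing / total_variants

print(f"Severe candidates: {severe_candidates}")
print(f"Protein-changing share: {protein_changing_share:.1%}")
```

Under that assumption there are 287 severe candidates before even starting on the protein sequence-altering variants, which make up roughly half of everything found.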

Although it would probably be possible to exclude many of these variants using other sources of information (e.g. functional information about the genes, presence in healthy controls, patterns of evolutionary conservation), this is an enormous number of potential disease-causing variants to filter. The success of the authors in identifying PALB2 as the disease-causing gene relied heavily on the "one bad copy in normal tissue, two bad copies in cancer" rule, but most other severe diseases do not provide such convenient sign-posts.
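The "one bad copy in normal tissue, two bad copies in cancer" rule can be written down as a simple filter. The sketch below is illustrative only: the `Variant` record, its field names, the consequence labels and the example rows are all invented here, and it is not the pipeline the authors of the paper used.

```python
from dataclasses import dataclass

# Hypothetical labels for variant types likely to disrupt a gene.
SEVERE = {"stop_gained", "splice_site", "frameshift"}


@dataclass(frozen=True)
class Variant:
    gene: str
    consequence: str
    disrupted_copies_normal: int   # disrupted gene copies in normal tissue (0-2)
    disrupted_copies_tumour: int   # disrupted gene copies in tumour tissue (0-2)


def two_hit_candidates(variants):
    """Keep severe variants with one bad copy in normal tissue and two in tumour."""
    return [
        v for v in variants
        if v.consequence in SEVERE
        and v.disrupted_copies_normal == 1
        and v.disrupted_copies_tumour == 2
    ]


# Made-up example rows, not data from the paper.
variants = [
    Variant("GENE_A", "missense", 1, 1),
    Variant("GENE_B", "stop_gained", 1, 1),   # severe, but no second hit in tumour
    Variant("PALB2", "frameshift", 1, 2),     # fits the two-hit pattern
    Variant("GENE_C", "frameshift", 2, 2),    # both copies already bad in normal tissue
]

for v in two_hit_candidates(variants):
    print(v.gene, v.consequence)
```

On these invented rows only the PALB2 entry survives, which is the point of the rule: it collapses hundreds of plausible candidates to a handful. A disease without an equivalent tissue comparison has no such column to filter on.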

The sheer scale of the noise variation in the human genome has only really become apparent in the last two years, following the publication of the Watson and Venter genomes. Both of these genomes contained a huge number of variants that could easily be interpreted as disease-causing, often with no clear way of distinguishing the villains from the innocent bystanders.

As such, researchers hunting for disease-causing mutations using genome-scale data will find their traditional problem is now turned on its head: instead of being unable to find plausible mutations, they will be faced with far too many possible candidates.

That problem will only get worse as we move from exome sequences - which at least comprise segments of protein-coding DNA for which we mostly understand the basic biological rules - to the vast, swampy, uncharted morass of non-coding DNA that makes up the other 98% of our genomes. It's clear from recent genome-wide association studies that the majority of disease risk variants are lurking in these regions, but we're currently almost entirely unable to filter out the functional disruptors from the millions of other polymorphisms littering non-coding DNA.

So the message from this paper is mixed. On the one hand, this is a genuine triumph for brute-force genomics, a case where generating staggering amounts of sequence data produced results with very clear clinical relevance. On the other hand, filtering out the true disease mutation from the background noise owed a hefty amount to the special properties of tumour suppressor genes, and more than a little luck; this approach will not be so easy in all cancer patients, and certainly not in patients suffering from other genetic diseases.

There's a dire warning here, as the age of clinical genomics approaches with blinding speed: if we want to be able to convert masses of sequence data into useful clinical information we need to get much better at assigning function to new sequence variants, and we need to learn how to do it fast."
