ENCODE: Some junk DNA isn’t

I’m just beginning to look at these results, a fantastic new source of information and approach to understanding the human genome. Here’s a popular press article and a science article (the source of my embedded quotes) on the same subject.

Like the “god particle”, “junk DNA” is an unfortunate term. The creationists, misunderstanding it, of course, have made much of it. But many lay people have heard the term as well and unless you dig into genomics it’s hard to understand.

After the entire human genome was sequenced, the hunt was on to find “genes” (aka coding sequences). Previously genes were found in reverse: find a protein in a cell, then work backwards to the gene that codes for it. From that earlier research it was possible to build data mining programs that knew what a gene “looked like” (certain patterns in the DNA) and could thus find all or most of the genes in the entire genome, even some that had never been detected before and for which no function (or gene product) is known. This effort turned up a surprisingly small number of genes, seemingly hardly enough to code for something as complex as a human being.

These coding regions comprise only about 5% of the total DNA in the genome (some measure it as 1.5%; perhaps introns explain some of that discrepancy), leaving the rest to be referred to as “junk”. Even as this term became common, functions had already been found for some of the junk. Meanwhile much of the junk is in the form of repeated sequences, in effect a virus in the DNA that can spread and insert itself many thousands of times. Some of the “junk” also plays important structural roles in the chromosomes. But much of the rest still just looked like junk.

The rest of the functional elements in the ENCODE analysis cover other classes of sequence that were thought to be essentially functionless, including introns. “The idea that introns are definitely deadweight isn’t true,” said Birney. Even some repetitive sequences—small chunks of DNA that have the ability to copy themselves and are typically viewed as parasites—are likely to be functional, often containing sequences where proteins can bind to influence the activity of nearby genes. Perhaps their spread across the genome represents not the invasion of a parasite, but a way of spreading control. “These parasites can be subverted sometimes,” Birney said.

Now there is the publication of ENCODE, a project to look more closely at the junk. Fortunately Nature chose to publish the results openly, instead of hiding them as usual behind a paywall that only institutional scientists can get past. So there is a vast amount of new information available for amateurs like me to go digest, including one of my favorite interests: the role of histones (not DNA per se, but part of the total structure of chromatin, the stuff of chromosomes).

Writing the genome out as a string of letters invites a common fallacy: that it’s a two-dimensional, linear entity. In reality, DNA is wrapped around proteins called histones like beads on a string. These are then twisted, folded and looped in an intricate three-dimensional way. In this way, distant parts of the genome can actually be physical neighbors, and can affect each other’s activity.

Genes, or coding regions, are specific sequences of DNA composed of ‘codons’, each of which (well, most; a few are stop signals) selects a specific amino acid to add to a polypeptide, which then folds into a critical 3D shape and possibly combines with other gene products to form a specific protein. For instance, IIRC, hemoglobin is composed of two each of two kinds of subunit, each carrying a heme group. These finished proteins basically make life work, performing a wide variety of functions.
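To make the codon idea concrete, here is a minimal sketch of reading a coding sequence three bases at a time. The table is only a small slice of the standard genetic code, the input sequence is made up for illustration, and the names `CODON_TABLE` and `translate` are my own, not from any library.

```python
# Partial codon table (standard genetic code); only a handful of codons are listed.
CODON_TABLE = {
    "ATG": "Met",  # also the usual start codon
    "GTG": "Val",
    "CAC": "His",
    "CTG": "Leu",
    "ACT": "Thr",
    "TAA": "Stop", "TAG": "Stop", "TGA": "Stop",
}


def translate(dna):
    """Read a coding sequence codon by codon until a stop codon is reached."""
    peptide = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3].upper()
        # Codons missing from this partial table are marked as unknown.
        residue = CODON_TABLE.get(codon, "???")
        if residue == "Stop":
            break
        peptide.append(residue)
    return peptide


# Hypothetical 18-base sequence: five codons followed by a stop codon.
example = "ATGGTGCACCTGACTTAA"
print("-".join(translate(example)))  # Met-Val-His-Leu-Thr
```

A real translation tool would carry all 64 codons and deal with reading frames and introns; this only shows the three-letters-to-one-amino-acid lookup.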

Every cell contains the same genes (some replication errors accumulate over time, so it isn’t precisely true that every cell has identical DNA). But most genes are not needed in every cell type (the approximately 200 differentiated cell types in a person) so they are not “expressed” (a liver cell doesn’t make the same proteins as a neuron). Furthermore some genes are only needed during certain stages of our development; many function only during embryogenesis and then are inactive for the rest of our life. And genes are only expressed when they are needed, based on the internal state of the organism, i.e. no digestive enzymes when there is no food to digest, no repair protein unless cells are damaged.

So if every cell has every gene, why don’t all cells create every gene product all the time? The answer is a very complex system of gene regulation. There are many ways this regulation occurs but often the non-coding (i.e. what was known as junk) sections of the DNA play a critical role. This new publication assembles a huge amount of information about that whole regulatory process and in fact finds regulatory regions of the genome to be far more numerous than the coding regions.

So this will make for some good reading. I’ll have to go back to all the notes I accumulated while in intense study of genomics back when the first sequencing was being completed because there is a huge body of terminology and concepts one must understand before being able to dig, at least deeply, into these new findings. So hopefully I’ll learn something and have some stories to relate, not that anyone needs anything from me, given the experts are the source of the real information, but sometimes having an amateur point out something interesting may be of value to my readers.

Now a brief comment on some creationist “debate” by the creationists, esp. the IDiots, aka intelligent design creationists, most of which I can’t recall in detail and much of which was silly, so I’m not going back to look again. (Yes, they don’t add ‘creationists’ to their use of the term, but a rose by any other name is, well, still creationism no matter how they try to conceal that from the courts, and ID is still vacuous.) Junk DNA is easy to explain through evolution; there are all sorts of ways junk DNA can be added to the genome of any creature and not very many ways it can be removed (prokaryotes are the exception, they have very “efficient” and compact genomes with little junk). But if you’re going to use the term ‘intelligent’ in your creationist myths, junk DNA certainly implies a really poor “designer”, or in short, why would a perfect god put all that junk in our DNA? IDiots have a hard time explaining this, so of course, like all good wingnuts, they simply denied it. Now scientists had fallen into a bit of a trap by labeling non-coding DNA as ‘junk’, since that implies it has no function, when in fact it was always known that some of it did have functions; it just didn’t get expressed as proteins. So undoubtedly the IDiots will be jumping with glee over this new Nature study: “see, we were right, god has infinite wisdom and you heathen scientists just didn’t know it.” Of course, given the creationists’ staggering lack of scientific knowledge and their misstatements of what is known, I suspect they’ll just quote-mine a few bits out of context to apply to their nonsense arguments.

If science were politics some might try to cover up these new findings as inconvenient truths. But since scientists always knew exactly what ‘junk DNA’ meant, it was a mystery to be unraveled, and of course the findings were published, as has now happened. You might wonder why some of this is only coming out ten years after the initial human genome sequencing. Well, IMHO, several reasons: a) a lot of focus on genes before the sequencing, plus the development of many bioinformatic tools, made that part of the problem easier and so results came out sooner, and b) since much of the junk DNA actually is junk and junk DNA is 21 times more common, it’s a lot of work to try to understand it and so a more directed approach (i.e. hypotheses about what junk DNA might do) had to guide the research. I’ll be able to comment more on this after some study.
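How many times more common junk is than coding DNA depends entirely on which coding fraction you start from. As a rough sketch, here is that arithmetic for the two coding fractions mentioned earlier in this post (5% and 1.5%); the function name is just something I made up for illustration.

```python
def noncoding_to_coding_ratio(coding_fraction):
    """How many times more non-coding DNA there is than coding DNA."""
    return (1.0 - coding_fraction) / coding_fraction


# The two coding fractions quoted earlier in the post.
for pct in (5.0, 1.5):
    ratio = noncoding_to_coding_ratio(pct / 100.0)
    print(f"coding = {pct}% -> non-coding is about {ratio:.0f}x as common")
```

With 5% coding the ratio comes out at about 19, and with 1.5% at about 66. A figure of 21 would correspond to a coding fraction of roughly 4.5%.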

So, Dr. Reader, perhaps you too will want to take a look at this site (included in my sidebar). Don’t let terminology get in your way; there are many sources on the net that provide simple and short definitions, sufficient to then get the substance of the article. But this kind of science isn’t for casual reading or quick skimming to get the key points. It does take “study”, but I think you’ll find, as I did, that it can be quite fascinating.

The new ENCODE results are vast, reported in 30 central papers in Nature, Genome Biology, and Genome Research, as well as a slew of secondary articles in Science, Cell, and others. And all of the data are freely available to the public.

The pages of printed journals are a poor repository for such a vast trove of data, so the ENCODE team have devised a new publishing model. On the ENCODE portal site, readers can pick one of 13 topics of interest, such as enhancer sequences, and follow them in special “threads” that pull out all the relevant paragraphs from the 30 main papers. “Rather than people having to skim read all 30 papers, and working out which ones they want to read, we pull out that thread for you,” Birney said.

btw: Why does any of this matter? This new view of the genome is more sophisticated and thus more likely to lead to an understanding of illness and treatments, which the first sequencing promised as well but mostly failed to deliver. In short, it’s more complicated than first thought, and with new light now being shed we’re more likely to be getting close to useful results from studying the genome. For example:

The researchers found that just 12 percent of known SNPs [Single Nucleotide Polymorphisms] lie within protein-coding areas. They also showed that compared to random SNPs, the disease-associated ones are 60 percent more likely to lie within the non-coding but functional regions that ENCODE identified, especially in promoters and enhancers. This suggests that many of these variants are controlling the activity of different genes, and provides many fresh leads for understanding how they affect our risk of disease.


About dmill96

old fat (but now getting trim and fit) guy, who used to create software in Silicon Valley (almost before it was called that), who used to go backpacking and bicycling and cross-country skiing and now geodashes, drives AWD in Wyoming, takes pictures, and writes long blog posts and does xizquvjyk.

2 Responses to ENCODE: Some junk DNA isn’t

  1. dmill96 says:

    The controversy over the ENCODE claims seems to be building. I first spotted this in PZ’s post. That post references an earlier PZ post and then more significantly a paper.

    The paper is hard to follow (esp. without reading the ENCODE paper being criticized) but I boil it down to two basic objections: a) ENCODE ignores evolution in claiming function for DNA and the lack of sequence conservation is itself proof of non-function, and, b) ENCODE uses a bogus definition of function. As far as I can tell from reading the papers I’d agree (and frankly I was skeptical, but believing, when I first saw the ENCODE papers because 80.4% functionality is so radically different than tons of accumulated evidence).

    My skepticism (admittedly amateur) was simply that junk DNA is not just a matter of not knowing any function, but of actually knowing there is non-function (the parasitic DNA elements, introns). But the reason I fell for ENCODE was a quick reading where I picked up the idea that they were saying 3D analysis yields function that was not visible in the linear sequence reading. Somehow I’ve always had that feeling. In short, just looking at linear sequence is too simplistic, too much reductionism. And after the HGP failed to solve all the mysteries, people started going back to basic assumptions, and over-simplification was one of them. We got so wrapped up in high-throughput sequencing technology and big data analysis, the easy and rapidly improving part, that we were ignoring the functional (instead of structural) part and, more importantly, ignoring the systems approach which has now been taking hold. But, OTOH, we also look at DNA not in its natural and active state, but in the manner that the instrumentation needs, namely small fragments reproduced in BACs and YACs. Now shotgun sequencing tripped my trigger because I like the software approach, but maybe I (and a lot of others) were deceived. We were so enamored with our clever technology we may have ignored the biology of this problem.

    When I was considering officially studying bioinformatics a decade ago, some people I talked to presented the basic idea that “biology thinking” and “software thinking” were totally different (sorta like the left/right brain dichotomy) and that few people could synthesize both types of thought (and, falling for the flattery, I believed I was one of them). But perhaps this is just the same thing: the big data / software POV has gotten carried away, found a ton of noise and is claiming it’s signal, while the biologists, driven by the central organizing principle of biology, evolution, are looking at the problem entirely differently.

    So hopefully I’ll get more time to add this debate to my ToDo list (if I ever get finished with nutrition and metabolism study) and see if I can really break through what the meta-debate is really about (and, PZ, no this time it’s not about religion even though you and I tend to see that underlying so many debates).

