ORFans

Explore Evolution claims:

[Molecular biologists] have been surprised to learn that a large number of genes code for proteins whose function we don't understand yet. They call these ORFan genes.
Explore Evolution, p. 60

This is not the definition of ORFans. ORFans are open reading frames (ORFs), sections of a chromosome that begin with a start codon, continue through a stretch of nucleotide triplets, and end with a stop codon, which do not match any known coding DNA sequence in other species. There is no guarantee that these sections even code for a protein, let alone that they have any function. More importantly, ORFans merely have no currently recognized relatives (Siew N, Fischer D. (2003) "Analysis of singleton ORFans in fully sequenced microbial genomes." Proteins. 53:241-51). Function is not a consideration in defining ORFans. Some of these proteins with no known relatives do have recognized functions (e.g. the bacterial virulence factor staphostatin B (1nycA)).
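The structural part of that definition (a start codon, a run of triplets, then a stop codon) can be sketched in a few lines of Python. This is a minimal illustration, not a gene-prediction tool: the function name `find_orfs` and the `min_codons` threshold are hypothetical, only the forward strand is scanned, and the standard start codon ATG and stop codons TAA, TAG and TGA are assumed. Whether an ORF found this way is an ORFan is a separate question that depends on comparing it against sequence databases.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for each ORF on the forward strand.

    An ORF here is a start codon followed in the same reading frame by
    further triplets and closed by the first in-frame stop codon.
    min_codons counts every codon in the ORF, including start and stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence one codon at a time in this reading frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None:
                if codon == START_CODON:
                    start = i
            elif codon in STOP_CODONS:
                end = i + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, dna[start:end]))
                start = None  # resume looking for the next start codon
    return orfs


if __name__ == "__main__":
    # Toy sequence invented for illustration, not real genomic data.
    for start, end, seq in find_orfs("CCATGAAATTTTAGGG"):
        print(start, end, seq)
```

Note that nothing in this procedure says anything about function, or even about whether the stretch is ever expressed. It only identifies a candidate reading frame.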

In contrast, we do have many genes that are in recognizable gene families, but whose functions are not clear from their sequence alone. For example, alpha-beta barrel family proteins have a wide variety of functions, and it is difficult to deduce the function of a member from simple inspection. The incorrect definition given in Explore Evolution artificially inflates the purported number of ORFans.

According to evolutionary theory, new genes arise from old genes by mutation. New genes should resemble the older "ancestor genes." However, these newly discovered genes do not match any sequence that codes for a known protein.
Explore Evolution, p. 61

Most ORFans have relatives found for them rather rapidly as new genomes are sequenced. With the larger databases available now, old ORFans are finding relatives (e.g. in 2004 the hypothetical protein Apc1120 was an ORFan; several relatives have since turned up) and fewer new ORFans are being found. Also, we know that proteins can be generated de novo, so not every protein needs to be traced back to an older ancestor gene.

Thus, there are two claims here:

A substantial number of ORFans have no similarity to other sequences, and
Common descent assumes that all (or a very high proportion) of current proteins originated with the Last Universal Common Ancestor.

The first claim is deeply misleading and the second is wrong.

Explore Evolution gives the impression that there are many genes with no relation to any other genes (especially by selectively quoting from older papers). In fact, while many putative genes in a newly sequenced organism may initially appear to be unrelated to any then-known gene, relatives are usually found rather rapidly. When H. influenzae was first sequenced, 64% of its open reading frames (ORFs, putative genes) were ORFans; as of 2003, only 5.2% were. When Mycoplasma genitalium was first sequenced, roughly 30% of its predicted genes were ORFans, and now all have homologues in other lineages.
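The underlying point, that ORFan status is relative to whatever database is searched, can be illustrated with a toy sketch. Everything here is hypothetical: the function name `orfan_fraction`, the family labels and the "genomes" are invented for illustration and are not real data, and exact label matching stands in for the sequence-similarity searches that real analyses use.

```python
def orfan_fraction(query_families, database_genomes):
    """Fraction of query gene families with no match in any database genome.

    query_families: non-empty list of family labels in the genome of interest.
    database_genomes: list of sets of family labels, one set per sequenced genome.
    """
    seen = set().union(*database_genomes)
    unmatched = [fam for fam in query_families if fam not in seen]
    return len(unmatched) / len(query_families)


if __name__ == "__main__":
    # Invented toy data: four gene families in a newly sequenced genome,
    # and three other genomes added to the database one at a time.
    query = ["famA", "famB", "famC", "famD"]
    genomes = [{"famA"}, {"famB", "famX"}, {"famC"}]
    for k in range(len(genomes) + 1):
        print(k, "genomes in database:", orfan_fraction(query, genomes[:k]))
```

In this toy, the same four families go from all being ORFans to only one being an ORFan purely because the database grew. Nothing about the genes themselves changed.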

Explore Evolution quotes the brief review N. Siew, D. Fischer. 2003 "Twenty Thousand ORFan microbial protein families for the biologist?" Structure 11:7-9.

If proteins in different organisms have descended from common ancestral proteins by duplication and adaptive variation, why is it that so many today show no similarity to each other? Why is it that we do not find today any of the necessary intermediate sequences that must have given rise to these ORFans?
Explore Evolution, p. 62

This citation ignores the following sentences from that paper:

Regardless of their origin, ORFans may be of two types. Some ORFans may correspond to newly evolved (through a yet unknown mechanism) or to unique descendants of ancient proteins, with unique functions and three-dimensional (3D) structures not currently observed in other families. Alternatively, ORFans may correspond to highly diverse members of known protein families, but with functions and/or 3D structures similar to proteins already known.
Siew N, Fischer D. 2003 "Twenty Thousand ORFan microbial protein families for the biologist?" Structure 11:7-9.

As well as the prescient observation:

More sensitive computational methods, such as fold recognition or sequence-to-profile comparisons, may succeed in assigning some ORFans to known families, and thus, their roles and functions may be gained.
Siew N, Fischer D. 2003 "Twenty Thousand ORFan microbial protein families for the biologist?" Structure 11:7-9.

This is what has turned out to be the case. By ignoring work in this area since 2003 (including papers from Siew and Fischer published after this mini-review, such as Siew N, Fischer D. (2003) Proteins. 53:241-51), Explore Evolution gives a highly distorted picture of our current understanding of ORFans.

ORFans versus Genome Number: The proportion of ORFans in the genome, as compared to the total number of sequenced genes. As the number of sequenced genes increases, the percentage of ORFans falls. As of 2003, only 5% of long ORFans (ORFs that are unlikely to be simple sequencing artefacts) were unaccounted for. Figure 1C from Siew N, Fischer D. (2003) "Analysis of singleton ORFans in fully sequenced microbial genomes." Proteins: Structure, Function, and Genetics. 53:241-51.

In an inquiry-based class, a teacher might ask the students to suggest reasons why some putative genes appear to be ORFans. Once students had generated that list, the teacher could encourage them to generate testable hypotheses and even to test those hypotheses. Instead of guiding students and teachers along that path, Explore Evolution encourages students simply to surrender in the face of the unexplained, a decidedly inquiry-averse approach. Some of the reasons scientists have offered for genes to remain ORFans include:

Some ORFans may be artefacts. Many ORFans are very short, 100-150 codons long, and it is likely that many of these represent database or annotation errors. Also, in any genome, one would expect some ORFs to form at random. Fukuchi S and Nishikawa K ("Estimation of the number of authentic orphan genes in bacterial genomes." DNA Res. 2004 Aug 31;11(4):219-31, 311-313) closely examined sequences and estimated that about half of all short ORFans are sequencing or other errors.
Some ORFans may have relatives, but we haven't sampled enough genomes yet. As of 2003, when most of the ORFan comparisons were done, something like 60 complete bacterial genomes had been sequenced. Note the diagram above, with the continuing fall of ORFans as more genomes are sequenced. By 2006 the percentage of ORFans had fallen by a further 5% (Marsden RL, et al., "Comprehensive genome analysis of 203 genomes provides structural genomics with new insights into protein family space." Nucleic Acids Res. 2006 34:1066-80). More genomes have been sequenced since then, but there are many, many more bacteria that are not yet sequenced and that will have genomes quite divergent from the human pathogens that form the majority of current sequences. This will be especially important because a gene acquired by horizontal transfer from a distantly related bacterium that has not been sequenced will look like an ORFan (until that distantly related bacterium is sequenced). A recent paper shows that many E. coli ORFans are the result of horizontal gene transfer from bacteriophages (Daubin and Ochman, 2004; "Bacterial genomes as new gene homes: the genealogy of ORFans in E. coli". Genome Res. (6):1036-42.). Bacteriophages are viruses, which is why they didn't turn up in bacterial database comparisons.
Some ORFans may have relatives, but our tools aren't good enough to detect these relatives yet. Rapidly evolving proteins, especially small proteins, can have their evolutionary history obscured by multiple substitutions during their evolution. More sensitive techniques, usually based on structural recognition, are needed to find the relatives of these proteins. For example, using improved fold recognition software and a large database of fold family structures, Siew et al. have found that in Bacillus sp., some related ORFans are members of the alpha/beta hydrolase superfamily, and most likely derive from the haloperoxidases (N. Siew, H. K. Saini and D. Fischer. (2005) "A Putative Novel Alpha/Beta Hydrolase Family in Bacillus." FEBS Letters, 579:3175-82.).
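The first reason above, that some short ORFs are expected to arise by chance, can be made concrete with a back-of-the-envelope calculation. This is a sketch under a loudly simplifying assumption that is not from the cited papers: all 64 codons are treated as equally likely and independent, so 3 in 64 are stop codons. Real genomes have biased base composition, so the actual numbers will differ. The function name `prob_no_stop` is hypothetical.

```python
def prob_no_stop(n_codons, stop_fraction=3 / 64):
    """Probability that n_codons random codons contain no stop codon.

    Assumes codons are independent and uniformly distributed, an
    idealisation; real genomes deviate from it.
    """
    return (1 - stop_fraction) ** n_codons


if __name__ == "__main__":
    # Compare a few run lengths, in codons.
    for n in (50, 100, 150, 300):
        print(n, "codons:", prob_no_stop(n))
```

Under this idealised model, the chance of a stop-free run shrinks exponentially with length. Short open reading frames are therefore far more likely than long ones to be chance occurrences, which is consistent with treating short ORFans with more suspicion than long ones.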

So most ORFans have been accounted for, and as we study more genomes with better tools we will resolve the status of many more. In an inquiry-based approach, students could recheck Escherichia coli ORFans from 2003 and would find that the vast majority now have identified relatives. Indeed, if some of the non-artefactual ORFans are due to horizontal transfer from bacteriophages, as recent experiments suggest (Daubin and Ochman, 2004), then they may prove to be a valuable tool in understanding the phylogeny of bacteria, in the same way that families of LINEs, SINEs and pseudogenes have been. Far from being a threat to common descent, the nested hierarchies of singleton, lineage-specific and family-specific ORFans are the patterns you would expect from common descent.
Some ORFans may be de novo generated proteins. We fully expect a modest proportion of new genes to be generated de novo during evolution, and we have examples of proteins generated this way. The most famous of these is the nylonase gene, which allows bacteria to metabolise the artificial polymer nylon. It was produced by a mutation in a piece of non-coding DNA that turned it into an expressed protein-coding sequence (Okada H, et al., (1983) "Evolutionary adaptation of plasmid-encoded enzymes for degrading nylon oligomers." Nature. 306(5939):203-6.). The sperm-specific dynein intermediate chain gene (Sdic) was generated by a fusion mutation between two genes (so strictly speaking it falls under the gene duplication rubric), but the coding region of the new Sdic gene derives from non-coding intronic regions, so protein homology studies would have a hard time identifying it (Nurminsky DI, et al., (1998) "Selective sweep of a newly evolved sperm-specific gene in Drosophila." Nature. 396(6711):572-5). Formation of new genes poses no problem for evolutionary biology or common descent, as we do not demand that all, or the vast majority, of genes originate in the Last Universal Common Ancestor. Furthermore, we are quite able to trace common ancestry even when some genes are generated de novo, as this does not disturb the trees generated from other genes.
