Entry: Draft:Split gene theory
Definition

  1. Background

  2. Contrasting discussions

  3. The Split-gene theory

      The hypothesis    Testing the hypothesis    Origin of Splice junctions    Origin of Spliceosome  

  4. Background

  5. Early Speculations

  6. The Split-gene theory

      The hypothesis  

  7. Testing the hypothesis

      Origin of introns and the split gene structure    Origin of Splice junctions    Branch point (lariat) sequence   Gene regulatory sequences   Stop codons are key parts of every genetic element in the eukaryotic gene  

  8. Why exons are short and introns are long?

  9. Why eukaryotic genomes are large?

  10. Origin of the spliceosomal machinery and the eukaryotic cell nucleus

  11. Origin of the eukaryotic cell

  12. The Shapiro-Senapathy algorithm

  13. Bacterial genes could have originated from split genes?

  14. Comprehensive corroborating evidences for the split gene theory


The eukaryotic genes’ coding sequences are split into exons and introns. As the split gene structure is central to eukaryotic biology, the question of how and why eukaryotic genes are split is extremely important.

Background

Genes of all organisms, except bacteria, consist of short protein-coding regions (exons) interrupted by long intervening sequences (introns) [FIGURE - show split gene, →  Transcription (RNA Pol), → Splicing (Spliceosome), Translation (Ribosome) → Protein]. When a gene is expressed, its DNA sequence is copied into a “primary RNA” sequence. The “spliceosome” machinery then physically removes the introns from the RNA copy of the gene, leaving only a contiguous series of exons, which becomes the “messenger” RNA (mRNA). This mRNA is then “read” by another piece of cellular machinery, the “ribosome,” to produce the encoded protein. Thus, although introns are not physically removed from the gene itself, the gene’s sequence is read as if introns never existed.

The length of introns varies widely, from 10 bases to 500,000 bases in a genome (for example, the human genome), but the length of exons has an upper limit of about 600 bases in most eukaryotic genes [REF]. Because exons code for protein sequences, they are very important for the cell, yet constitute only ~2% of the genes’ sequences. Introns, in contrast, constitute 98% of the genes’ sequences but seem to have few crucial functions in genes, apart from containing enhancer sequences and developmental regulators in rare instances (3,4).

Until Philip Sharp [REFs] of MIT and Richard Roberts [REFs], then at CSHL (currently at NEB), discovered in 1977 that introns interrupt genes, it was believed that a gene contained its coding sequence in one stretch, bounded by a single Open Reading Frame (ORF) [FIGURE - contiguous coding gene - in the legend say one line - this type of genes are the norm in prokaryotic organisms]. The discovery that introns interrupted eukaryotic genes was a profound surprise to scientists, and it instantly raised the questions of how, why and when introns came into being, leading to the split structure of genes. As more eukaryotic genes were sequenced, it became apparent that a typical gene was interrupted in many places by introns, dividing the coding sequence into many short exons. Also surprising was that the introns were very long, even as long as hundreds of thousands of bases (e.g., in the human genes NAME, NAME, NAME). These findings prompted the question of not only why introns came into eukaryotic genes but also why many introns occur within a gene (up to 200 introns in human genes, for example, in genes NAME, NAME, NAME), why they are very long, and why exons are very short [ACTUAL FIGURE OF SYN1 OR ANOTHER GENE FROM EXORF].

It was discovered that the spliceosome machinery that spliced together the exons and eliminated the introns from the primary RNA transcript was very large and complex, with ~300 proteins and several snRNA molecules [REF]. So, the questions also extended to the origin of the spliceosome. Soon after the discovery of introns, it became apparent that the junctions between exons and introns on either side exhibited specific sequences that signalled the spliceosome machinery to the exact base position for splicing. How and why these splice junction signals came into being was another important question to be answered.

Contrasting discussions

These questions prompted contrasting discussions in the literature almost immediately.

Were introns introduced when eukaryotic genes evolved from more ancient, intronless prokaryotic genes, or were eukaryotes the more ancient form, having evolved with introns from the start (5-9)?

Although he later retracted it, Dr. F. Doolittle’s thinking that the original structure of genes could be the split gene version turned out to be correct. And James Darnell …

Apparently, none of these publications answered the questions of why and how introns and the split structure of genes originated, what splice junction sequences are, why exons are short and introns long, and why genomes are large.


The Split-gene theory

The hypothesis

Around the same time introns were discovered, Dr. Senapathy was asking how genes themselves could have originated. He surmised that for any gene to come into being, there must have been genetic sequences (RNA or DNA) present in the prebiotic chemistry environment. A basic question he asked was how protein-coding sequences could have originated from primordial DNA sequences at the initial development of the very first cells.

To answer this, he made two basic assumptions: (i) before a self-replicating cell could come into existence, DNA molecules were synthesized in the primordial soup by random addition of the four nucleotides without the help of templates and (ii) the nucleotide sequences that code for proteins were selected from these preexisting DNA sequences in the primordial soup, and not by construction from shorter coding sequences. He also surmised that codons must have been established prior to the origin of the first genes. If primordial DNA did contain random nucleotide sequences, he asked: Was there an upper limit on coding-sequence lengths, and, if so, did this limit play a crucial role in the formation of the structural features of genes at the very beginning of the origin of genes?

His logic was the following. The average length of proteins in living organisms, both eukaryotic and bacterial, was ~400 amino acids. There also existed much longer proteins in both eukaryotic and bacterial organisms, up to 10,000 AAs and longer. However, the coding sequence existed in a single stretch of 1,200 bases to 30,000 bases in bacterial genes, whereas the coding sequence of eukaryotes existed in short exon segments of approx. 120 bases regardless of the length of the protein. If the coding sequence lengths in random DNA sequences were as long as those of the contiguous genes of bacterial organisms, then contiguous coding genes could have originated directly from random DNA. Although three stop codons in the set of 64 codons would lead to a very short average coding sequence length (defined as an ORF) of ~60 bases, the upper limit of ORFs could still be very long, on the order of several thousand bases, matching the lengths of contiguously coding genes in bacterial organisms. This was not known, as the distribution of the lengths of ORFs in a random DNA sequence had never been studied before.
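The average of ~60 bases quoted above can be recovered with a short calculation. This is a sketch under assumptions that are not stated in the original: the four bases are taken as equally frequent and independent, so each codon in a fixed reading frame is a stop codon with probability p = 3/64, and N (a symbol introduced here) is the number of sense codons between two consecutive in-frame stop codons.

```latex
\begin{align*}
p &= \frac{3}{64}, \qquad
P(N = n) = (1 - p)^{n}\, p, \quad n = 0, 1, 2, \dots \\
E[N] &= \frac{1 - p}{p} = \frac{61}{3} \approx 20.3 \text{ codons} \approx 61 \text{ bases}
\end{align*}
```

Under the same assumptions, the most probable value is n = 0, and each additional codon lowers the probability by the constant factor 61/64.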

Testing the hypothesis

Dr. Senapathy first analyzed the distribution of ORF lengths in computer-generated random DNA sequences. Surprisingly, this study revealed that there actually existed an upper limit of about 200 codons (600 bases) in the lengths of ORFs (FIGURE 1). The shortest ORF (zero) was the most frequent. At increasing lengths of ORFs, their frequency decreased exponentially, reaching almost zero at about 600 bases. When the probability of ORF lengths in a random sequence was plotted (see FIGURE 2), it also revealed that the probability of increasing lengths of ORFs decreased exponentially and tailed off at a maximum of about 600 bases. From this “negative exponential” distribution of ORF lengths, it was found that most ORFs are far shorter than even the upper maximum of 600 bases, being closer to zero length.

This finding was surprising because the coding sequence for the average protein length of 400 AAs (~1,200 bases) and for longer proteins of thousands of AAs (requiring >10,000 bases) would not occur in a single stretch in a random sequence. If this was true, a typical gene with a contiguous coding sequence could not originate in a random sequence. The only possible way that any gene coding for a protein longer than 200 AAs could originate from a random sequence was to split the coding sequence into shorter segments and select these segments from short ORFs available in the random sequence. This would lead to a split structure of the gene.

If this hypothesis was true, eukaryotic DNA sequences should show evidence for it. When Senapathy plotted the distribution of ORF lengths in eukaryotic DNA sequences, the plot was remarkably similar to that from random DNA sequences. It was also a negative exponential distribution that tailed off at a maximum of about 600 bases. This finding was striking because the lengths of exons from eukaryotic genes had a maximum of about 600 bases [REF], which coincided exactly with the maximum length of ORFs observed in both random and eukaryotic DNA sequences. These findings indicated that split genes likely originated from random DNA sequences with exons and introns as described above. The Nobel Laureate Dr. Marshall Nirenberg, who deciphered the codons, stated that these findings strongly showed that the split gene theory for the origin of introns and the split structure of genes must be valid, and communicated the paper to PNAS.[103] New Scientist covered this publication in “A long explanation for introns”.[105]

Origin of Splice junctions

The split gene theory thus suggested that genes with long coding sequences originated from random DNA sequences by choosing the best of the short coding segments (exons) and joining them by a process of splicing. The intervening intron sequences were left-over vestiges of the random sequences, and thus were earmarked to be removed by the spliceosome. This split-gene organization would have required a mechanism for recognizing an ORF. As an ORF is defined as a contiguous coding sequence bounded by stop codons, these stop codon ends had to be recognized by this gene recognition system. This system could have defined the exons by the presence of a stop codon at the ends of ORFs. Thus, the introns should contain a stop codon at their ends, which would be part of the splice junction sequences.

If this hypothesis was true, the split genes of today’s living organisms should contain stop codons exactly at the ends of introns. When Senapathy tested this hypothesis in the splice junctions of eukaryotic genes, it was astonishing that almost all splice junctions did contain a stop codon at the ends of introns, right outside of the exons. In fact, these stop codons were found to form the “canonical” AG:GT splicing sequence, with the three stop codons occurring as part of the strong consensus signals. Thus, the basic split gene theory led to the hypothesis that the splice junctions originated from the stop codons.[104]

Surprisingly, all three stop codons (TGA, TAA and TAG) were found after one base (G) at the start of introns. These stop codons are shown in the consensus canonical donor splice junction as AG:GT(A/G)AGT, wherein TAA and TGA are the stop codons, and the additional TAG is also present at this position. Besides the codon CAG, only TAG, which is a stop codon, was found at the ends of introns. The canonical acceptor splice junction is shown as (C/T)AG:GT, in which TAG is the stop codon. These consensus sequences clearly show the presence of the stop codons at the ends of introns bordering the exons in all eukaryotic genes. Dr. Marshall Nirenberg, who was the referee for this paper, again stated that these observations fully supported the split gene theory for the origin of splice junction sequences from stop codons.[104] New Scientist covered this publication in “Exons, Introns and Evolution”.[106]

Origin of Spliceosome

Senapathy proposes that the spliceosome originated at the same time as the split genes originated from random DNA sequences. His concept is that the genes for the spliceosomal proteins also originated from the random sequences.

On the chicken-or-egg question: all of these genes existed in random sequences. The first transcription and translation of these genes was carried out by enzymatic activities that arose in prebiotic chemistry in random polypeptides, RNA, and ribonucleic acid-polypeptide complexes. There is much evidence.

The coding sequences of eukaryotic genes are split into short coding segments (exons) separated by long intervening non-coding sequences (introns). As the split gene structure is central to eukaryotic biology, the questions of why, how and when introns came into eukaryotic genes, what intron sequences are, and why eukaryotic genes are split are extremely important.

Dr. Periannan Senapathy proposed the “split gene” theory to explain the origin of introns.[1][2][3] This theory provides comprehensive and tenable solutions to the key questions concerning the split genes, including the exons, introns, splice junctions, branch points and the entire split gene architecture, based on the origin of split genes from random genetic sequences. It also provides possible solutions to the origin of the spliceosomal machinery, the nuclear boundary and the eukaryotic cell.

Background

Genes of all organisms, except bacteria, consist of short protein-coding regions (exons) interrupted by long intervening sequences (introns).[1][2] When a gene is expressed, its DNA sequence is copied into a “primary RNA” sequence by the enzyme RNA polymerase. The “spliceosome” machinery then physically removes the introns from the RNA copy of the gene by the process of splicing, leaving only a contiguous series of exons, which becomes the “messenger” RNA (mRNA). This mRNA is then “read” by another piece of cellular machinery, the “ribosome,” to produce the encoded protein. Thus, although introns are not physically removed from the gene itself, the gene’s sequence is read as if introns never existed.

Exons are usually very short, with an average length of about 120 bases (e.g. in human genes). The length of introns varies widely, from 10 bases to 500,000 bases in a genome (for example, the human genome), but the length of exons has an upper limit of about 600 bases in most eukaryotic genes. Because exons code for protein sequences, they are very important for the cell, yet constitute only ~2% of the genes’ sequences. Introns, in contrast, constitute 98% of the genes’ sequences but seem to have few crucial functions in genes, apart from containing enhancer sequences and developmental regulators in rare instances.[4][5]

Until Dr. Philip Sharp [6][7] of MIT and Dr. Richard Roberts [8], then at the Cold Spring Harbor Laboratory (currently at New England Biolabs), discovered introns[9] within eukaryotic genes in 1977, it was believed that the coding sequence of all genes was always in one single stretch, bounded by a single long Open Reading Frame (ORF). The discovery of introns was a profound surprise to scientists, and it instantly raised the questions of how, why and when introns came into eukaryotic genes.

It soon became apparent that a typical eukaryotic gene was interrupted at many locations by introns, dividing the coding sequence into many short exons. Also surprising was that the introns were very long, even as long as hundreds of thousands of bases (see table below). These findings also prompted the questions of why many introns occur within a gene (for example, ~312 introns occur in the human gene TTN), why they are very long, and why exons are very short.

Gene symbol    Gene length (bases)    Longest intron length (bases)
ROBO2          1,743,269              1,160,411
KCNIP4         1,220,183              1,097,903
ASIC2          1,161,877              1,043,911
NRG1           1,128,573              956,398
DPP10          1,403,453              866,399
DMD            2,220,382              319,058
TTN            304,813                95,764
The longest introns in the human genes.

It was also discovered that the spliceosome machinery was very large and complex, with ~300 proteins and several snRNA molecules. So, the questions also extended to the origin of the spliceosome. Soon after the discovery of introns, it became apparent that the junctions between exons and introns on either side exhibited specific sequences that signalled the spliceosome machinery to the exact base position for splicing. How and why these splice junction signals came into being was another important question to be answered.

Early Speculations

The discovery of introns and the split gene architecture of eukaryotic genes was startling, and it started a new era of eukaryotic biology. The question of why eukaryotic genes had a genes-in-pieces architecture prompted speculations and discussions in the literature almost immediately.

Dr. Ford Doolittle of Dalhousie University published a paper in 1978 in which he expressed his views.[10] He stated that most molecular biologists assumed that the eukaryotic genome arose from a ‘simpler’ and more ‘primitive’ prokaryotic genome rather like that of Escherichia coli. However, this type of evolution would require that introns be introduced into the contiguous coding sequences of bacterial genes. Regarding this requirement, Doolittle said, “It is extraordinarily difficult to imagine how informationally irrelevant sequences could be introduced into pre-existing structural genes without deleterious effects.” He stated, “I would like to argue that the eukaryotic genome, at least in that aspect of its structure manifested as ‘genes in pieces’ is in fact the primitive original form.”

Dr. James E. Darnell from the Rockefeller University also expressed similar views in 1978.[11] He stated, “The differences in the biochemistry of messenger RNA formation in eukaryotes compared to prokaryotes are so profound as to suggest that sequential prokaryotic to eukaryotic cell evolution seems unlikely. The recently discovered non-contiguous sequences in eukaryotic DNA that encode messenger RNA may reflect an ancient, rather than a new, distribution of information in DNA and that eukaryotes evolved independently of prokaryotes.”

However, in an apparent attempt to reconcile with the idea that RNA preceded DNA in evolution, and with the concept of the three evolutionary lineages of archaea, bacteria and eukarya, both Dr. Doolittle and Dr. Darnell deviated from their original speculation in a paper they published together in 1985.[12] They suggested that the ancestor of all three groups of organisms, the ‘progenote,’ had a genes-in-pieces structure, from which all three lineages evolved. They speculated that the precellular stage had primitive RNA genes with introns, which were reverse transcribed into DNA and formed the progenote. Bacteria and archaea evolved from the progenote by losing introns, and the ‘urkaryote’ evolved from it by retaining introns. Later, the eukaryote evolved from the urkaryote by evolving a nucleus and gaining mitochondria from bacteria. Multicellular organisms then evolved from the eukaryote.

These authors were able to predict that the distinctions between the prokaryote and the eukaryote were so profound that the prokaryote to eukaryote evolution was not tenable, and that both had different origins. However, other than the speculations that the precellular RNA genes must have had introns, they did not address the key questions of where from, how or why the introns could have originated in these genes or what their material basis was. There were no explanations of why exons were short and introns were long, how the splice junctions originated, what the structure and sequence of the splice junctions meant, and why eukaryotic genomes were large.

Around the same time that Dr. Doolittle and Dr. Darnell suggested that introns in eukaryotic genes could be ancient, Dr. Colin Blake[13] of the University of Oxford and Dr. Walter Gilbert[14][15] of Harvard University (who won the Nobel Prize for inventing a DNA sequencing method along with Fred Sanger) independently published their views on intron origins. In their view, introns originated as spacer sequences that enabled the recombination and shuffling of exons that encoded distinct functional domains in order to evolve new genes. Thus, new genes were assembled from exon modules that coded for functional domains, folding regions, or structural elements from preexisting genes in the genome of an ancestral organism, thereby evolving genes with new functions. They did not specify how the exons representing protein structural motifs originated, or how the non-coding introns originated. In addition, even after many years, extensive analysis of several thousands of proteins and genes showed that genes only extremely rarely exhibit the supposed exon shuffling phenomenon.[16][17] Furthermore, several molecular biologists questioned the exon shuffling proposal from a purely evolutionary view, for both methodological and conceptual reasons, and, in the long run, this theory did not materialize.

The Split-gene theory

The hypothesis

Around the same time introns were discovered, Dr. Senapathy was asking how genes themselves could have originated. He surmised that for any gene to come into being, there must have been genetic sequences (RNA or DNA) present in the prebiotic chemistry environment. A basic question he asked was how protein-coding sequences could have originated from primordial DNA sequences at the initial development of the very first cells.

To answer this, he made two basic assumptions: (i) before a self-replicating cell could come into existence, DNA molecules were synthesized in the primordial soup by random addition of the four nucleotides without the help of templates and (ii) the nucleotide sequences that code for proteins were selected from these preexisting random DNA sequences in the primordial soup, and not by construction from shorter coding sequences. He also surmised that codons must have been established prior to the origin of the first genes. If primordial DNA did contain random nucleotide sequences, he asked: Was there an upper limit on coding-sequence lengths, and, if so, did this limit play a crucial role in the formation of the structural features of genes at the very beginning of the origin of genes?

His logic was the following. The average length of proteins in living organisms, both eukaryotic and bacterial, was ~400 amino acids. However, there existed much longer proteins, even longer than 10,000 amino acids and up to ~30,000 amino acids, in both eukaryotes and bacteria.[18] In bacterial genes, a coding sequence of thousands of bases existed in a single stretch. In contrast, the coding sequence of eukaryotes existed only in short exon segments of approx. 120 bases regardless of the length of the protein. If the coding sequence (Open Reading Frame, ORF) lengths in random DNA sequences were as long as those in bacterial organisms, then long contiguous coding genes could have occurred in random DNA. This was not known, as the distribution of the lengths of ORFs in a random DNA sequence had never been studied before.

As random DNA sequences could be generated in the computer, Senapathy thought that he could ask these questions and conduct his experiments in the computer. Furthermore, when he began studying this question in the early 1980s, there was just about enough DNA and protein sequence information in the National Biomedical Research Foundation (NBRF) database.

Testing the hypothesis

Origin of introns and the split gene structure

Dr. Senapathy first analyzed the distribution of ORF lengths in computer-generated random DNA sequences. Surprisingly, this study revealed that there actually existed an upper limit of about 200 codons (600 bases) in the lengths of ORFs. The shortest ORF (zero bases in length) was the most frequent. At increasing lengths of ORFs, their frequency decreased exponentially, reaching almost zero at about 600 bases. When the probability of ORF lengths in a random sequence was plotted, it also revealed that the probability of increasing lengths of ORFs decreased exponentially and tailed off at a maximum of about 600 bases. From this “negative exponential” distribution of ORF lengths, it was found that most ORFs were far shorter than even the maximum of 600 bases.

This finding was surprising because the coding sequence for the average protein length of 400 AAs (~1,200 bases) and for longer proteins of thousands of AAs (requiring >10,000 bases) would not occur in a single stretch in a random sequence. If this was true, a typical gene with a contiguous coding sequence could not originate in a random sequence. Thus, the only possible way that any gene could originate from a random sequence was to split the coding sequence into shorter segments and select these segments from short ORFs available in the random sequence, rather than to increase the length of an ORF by eliminating numerous consecutive stop codons. This process of choosing short segments of coding sequences from the available ORFs to make a long ORF would lead to a split structure of the gene.

The split genes thus originated from random DNA sequences by choosing the best of the short coding segments (exons) and joining them by a process of splicing. The intervening intron sequences were left-over vestiges of the random sequences, and thus were earmarked to be removed by the spliceosome. These findings indicated that split genes could have originated from random DNA sequences with exons and introns as they are found in today’s eukaryotic organisms. The Nobel Laureate Dr. Marshall Nirenberg, who deciphered the codons, stated that these findings strongly showed that the split gene theory for the origin of introns and the split structure of genes must be valid, and communicated the paper to PNAS.[1] New Scientist covered this publication in “A long explanation for introns”.[20]
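The kind of computer experiment described above can be sketched in a few lines of Python. This is only an illustrative reconstruction, not the original analysis: it assumes equally frequent, independent bases, defines an ORF as the stop-free run between two consecutive in-frame stop codons (so zero-length ORFs are counted), and uses hypothetical names (`random_dna`, `orf_lengths`) and an arbitrary sequence length and seed.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}


def random_dna(length, rng):
    """Return a random DNA string with the four bases equally likely (assumption)."""
    return "".join(rng.choices("ACGT", k=length))


def orf_lengths(seq, frame=0):
    """Lengths in bases of stop-free runs between consecutive in-frame stop codons."""
    lengths = []
    run = 0
    seen_stop = False
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            if seen_stop:  # count only runs bounded by a stop codon on both sides
                lengths.append(run)
            seen_stop = True
            run = 0
        else:
            run += 3
    return lengths


rng = random.Random(0)  # fixed seed so the run is repeatable
seq = random_dna(300_000, rng)

# Pool the ORFs found in the three forward reading frames.
lengths = [n for frame in range(3) for n in orf_lengths(seq, frame)]

mean_length = sum(lengths) / len(lengths)
share_zero = lengths.count(0) / len(lengths)
share_over_600 = sum(n > 600 for n in lengths) / len(lengths)

print(f"ORFs counted:            {len(lengths)}")
print(f"mean ORF length (bases): {mean_length:.1f}")
print(f"share of zero-length:    {share_zero:.4f}")
print(f"share longer than 600:   {share_over_600:.6f}")
print(f"longest ORF (bases):     {max(lengths)}")
```

Binning `lengths` into a histogram gives the falling curve the text describes; only the forward strand is scanned here, which is a simplification.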
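How rare long uninterrupted ORFs are in a random sequence can be estimated with a short calculation. This is a sketch under assumptions that are not part of the original text: the bases are equally frequent and independent, so each codon in a given reading frame is a stop codon with probability 3/64, and N (a symbol introduced here) is the number of consecutive sense codons starting at a given position in that frame.

```latex
\begin{align*}
P(N \ge n) &= \left(\frac{61}{64}\right)^{n} \\
P(N \ge 200) &= \left(\frac{61}{64}\right)^{200} \approx 6.8 \times 10^{-5} \\
P(N \ge 400) &= \left(\frac{61}{64}\right)^{400} \approx 4.6 \times 10^{-9}
\end{align*}
```

On this estimate, a stop-free stretch of 200 codons (600 bases) turns up fewer than once in ten thousand starting positions, and a 400-codon stretch, the ~1,200 bases needed for an average protein, is rarer by several further orders of magnitude.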

Noted molecular biologist Dr. Colin Blake, who proposed the Gilbert-Blake hypothesis in 1979 for the origin of introns (see above), stated in his 1987 publication entitled “Proteins, exons and molecular evolution,” that Senapathy’s split gene theory comprehensively explained the origin of the split gene structure. In addition, he stated that it explained several key questions including the origin of the splicing mechanism:[21]

“Recent work by Senapathy, when applied to RNA, comprehensively explains the origin of the segregated form of RNA into coding and non-coding regions. It also suggests why a splicing mechanism was developed at the start of primordial evolution. He found that the distribution of reading frame lengths in a random nucleotide sequence corresponded exactly to that for the observed distribution of eukaryotic exon sizes. These were delimited by regions containing stop signals, the messages to terminate construction of the polypeptide chain, and were thus non-coding regions or introns. The presence of a random sequence was therefore sufficient to create in the primordial ancestor the segregated form of RNA observed in the eukaryotic gene structure. Moreover, the random distribution also displays a cutoff at 600 nucleotides, which suggests that the maximum size for an early polypeptide was 200 residues, again as observed in the maximum size of the eukaryotic exon. Thus, in response to evolutionary pressures to create larger and more complex genes, the RNA fragments were joined together by a splicing mechanism that removed the introns. Hence, the early existence of both introns and RNA splicing in eukaryotes appears to be very likely from a simple statistical basis. These results also agree with the linear relationship found between the number of exons in the gene for a particular protein and the length of the polypeptide chain.”

Origin of Splice junctions

Under the split gene theory, an exon would be defined by an ORF. This would have required a mechanism for recognizing an ORF. As an ORF is defined as a contiguous coding sequence bounded by stop codons, these stop codon ends had to be recognized by this exon-intron gene recognition system. This system could have defined the exons by the presence of a stop codon at the ends of ORFs, which should be included within the ends of the introns and eliminated by the splicing process. Thus, the introns should contain a stop codon at their ends, which would be part of the splice junction sequences.

If this hypothesis was true, the split genes of today’s living organisms should contain stop codons exactly at the ends of introns. When Dr. Senapathy tested this hypothesis in the splice junctions of eukaryotic genes, it was astonishing that the vast majority of splice junctions did contain a stop codon at the ends of introns, right outside of the exons. In fact, these stop codons were found to form the “canonical” AG:GT splicing sequence, with the three stop codons occurring as part of the strong consensus signals. Thus, the basic split gene theory for the origin of introns and the split gene structure led to the understanding that the splice junctions originated from the stop codons.[2]

Codon    Number of occurrences in donor signal    Number of occurrences in acceptor signal
TAA      370                                      0
TGA      293                                      0
TAG      64                                       234
CAG      7                                        746
Other    297*                                     50
Total    1030                                     1030
Frequency of stop codons in donor and acceptor splice-junction sequences.

Surprisingly, all three stop codons (TGA, TAA and TAG) were found after one base (G) at the start of introns. These stop codons are shown in the consensus canonical donor splice junction as AG:GT(A/G)AGT, wherein TAA and TGA are the stop codons, and the additional TAG is also present at this position. Besides the codon CAG, only TAG, which is a stop codon, was found at the ends of introns. The canonical acceptor splice junction is shown as (C/T)AG:GT, in which TAG is the stop codon. These consensus sequences clearly show the presence of the stop codons at the ends of introns bordering the exons in all eukaryotic genes, thus providing a strong corroboration for the split gene theory. Dr. Marshall Nirenberg, who was the referee for this paper, again stated that these observations fully supported the split gene theory for the origin of splice junction sequences from stop codons.[2] New Scientist covered this publication in “Exons, Introns and Evolution”.[21]
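The share of splice junctions carrying a stop codon can be read off the table of donor and acceptor counts above. The following sketch simply tallies those counts; the variable names are hypothetical, and the split of each flattened row into a donor and an acceptor figure is an assumption inferred from the stated totals.

```python
STOP_CODONS = {"TAA", "TGA", "TAG"}

# (donor count, acceptor count) per codon, as read from the table above.
# Assumption: the column split is inferred from the stated totals of 1030.
COUNTS = {
    "TAA": (370, 0),
    "TGA": (293, 0),
    "TAG": (64, 234),
    "CAG": (7, 746),
    "Other": (297, 50),
}
STATED_TOTAL = 1030  # total given in the table for each column

donor_stops = sum(d for codon, (d, _) in COUNTS.items() if codon in STOP_CODONS)
acceptor_stops = sum(a for codon, (_, a) in COUNTS.items() if codon in STOP_CODONS)

# The donor rows as listed add up to 1031, one more than the stated total,
# so the fractions below are taken against the stated total.
donor_fraction = donor_stops / STATED_TOTAL
acceptor_fraction = acceptor_stops / STATED_TOTAL

print(f"donor signals with a stop codon:    {donor_stops} ({donor_fraction:.1%})")
print(f"acceptor signals with a stop codon: {acceptor_stops} ({acceptor_fraction:.1%})")
```

At the acceptor site, the remaining signals are dominated by CAG, which matches the text's statement that only CAG and TAG occur at the ends of introns.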

Soon after the discovery of introns by Drs. Philip Sharp and Richard Roberts, it became known that mutations within splice junctions could lead to diseases. Dr. Senapathy showed that mutations in the stop codon bases (canonical bases) caused more diseases than the mutations in non-canonical bases.[1]

Branch point (lariat) sequence

An intermediate stage in the process of eukaryotic RNA splicing is the formation of a lariat structure. It is anchored at an adenosine residue in the intron, between 10 and 50 nucleotides upstream of the 3' splice site. A short conserved sequence (the branch point sequence) functions as the recognition signal for the site of lariat formation. During the splicing process, this conserved sequence towards the end of the intron forms a lariat structure with the beginning of the intron.[22] The final step of the splicing process occurs when the two exons are joined and the intron is released as a lariat RNA.[23]

Several investigators have identified the branch point sequences in different organisms,[22] including yeast, human, fruit fly, rat, and plants. Senapathy found that, in these branch point sequences, the codon ending at the branch point adenosine is consistently a stop codon. Notably, two of the three stop codons (TAA and TGA) occur at this position almost always.

Organism                   Lariat consensus sequence
Yeast                      TACTAAC
Human beta-globin genes    CTGAC
                           CTAAT
                           CTGAT
                           CTAAC
                           CTCAC
Drosophila                 CTAAT
Rats                       CTGAC
Plants                     (C/T)T(A/G)A(T/C)
Consistent presence of stop codons in branch point signal sequences. Lariat (branch point) sequences have been identified from many different organisms. These sequences consistently show that the codon ending in the branching adenosine is a stop codon, either TAA or TGA.
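
The observation in the table can be checked mechanically. The following is a minimal sketch, not a published method: it assumes that the branch point adenosine is the last A in each listed consensus sequence (the table itself does not mark the position), and the function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TGA", "TAG"}

# Branch point consensus sequences copied from the table above.
LARIAT_SEQUENCES = {
    "Yeast": ["TACTAAC"],
    "Human beta-globin": ["CTGAC", "CTAAT", "CTGAT", "CTAAC", "CTCAC"],
    "Drosophila": ["CTAAT"],
    "Rat": ["CTGAC"],
    # The plant consensus (C/T)T(A/G)A(T/C), expanded into its concrete forms.
    "Plants": [c + "T" + r + "A" + y for c in "CT" for r in "AG" for y in "TC"],
}


def codon_ending_at_branch_point(seq):
    """Return the three bases ending at the last A.

    Assumption: the last A in the consensus is the branch point adenosine.
    Returns None if fewer than two bases precede that A.
    """
    branch_index = seq.rfind("A")
    if branch_index < 2:
        return None
    return seq[branch_index - 2 : branch_index + 1]


def summarize():
    """Map (organism, sequence) to (codon, is_stop_codon)."""
    results = {}
    for organism, seqs in LARIAT_SEQUENCES.items():
        for seq in seqs:
            codon = codon_ending_at_branch_point(seq)
            results[(organism, seq)] = (codon, codon in STOP_CODONS)
    return results


for (organism, seq), (codon, is_stop) in summarize().items():
    print(f"{organism:18} {seq:8} {codon}  stop codon: {is_stop}")
```

Under this assumption, every listed sequence yields TAA or TGA except the human variant CTCAC, which yields TCA.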

These findings led Dr. Senapathy to propose that the branch point signal originated from stop codons. The finding that two different stop codons (TAA and TGA) occur within the lariat signal, with the branching point as the third base of the stop codon, corroborates this proposal. Because the branching point of the lariat occurs at the last adenine of the stop codon, it is possible that the spliceosome machinery, which originated to eliminate the numerous stop codons from the primary RNA sequence, created an auxiliary stop-codon signal in the form of the lariat sequence to aid its splicing function.[2]

The small nuclear U2 RNA found in splicing complexes is thought to aid splicing by interacting with the lariat sequence.[24] Complementary sequences for both the lariat sequence and the acceptor signal are present in a segment of only 15 nucleotides in U2 RNA. Further, the U1 RNA has been proposed to function as a guide in splicing, identifying the precise donor splice junction by complementary base-pairing. The conserved regions of the U1 RNA thus include sequences complementary to the stop codons. These observations enabled Senapathy to predict that stop codons had operated in the origin of not only the splice-junction signals and the lariat signal, but also some of the small nuclear RNAs.

Gene regulatory sequences

Dr. Senapathy also proposed that the gene-expression regulatory sequences (promoter and poly-A addition site sequences) could have originated from stop codons. A conserved sequence, AATAAA, exists in almost every gene a short distance downstream from the end of the protein-coding message and serves as a signal for the addition of poly(A) in the mRNA copy of the gene.[25] This poly(A) sequence signal contains a stop codon, TAA. A sequence shortly downstream from this signal, thought to be part of the complete poly(A) signal, also contains the TAG and TGA stop codons.

Eukaryotic RNA-polymerase-II-dependent promoters can contain a TATA box (consensus sequence TATAAA), which contains the stop codon TAA. Bacterial promoter elements exhibit a TATA box at -10 bases with a consensus of TATAAT (which contains the stop codon TAA), and a consensus of TTGACA at -35 bases (containing the stop codon TGA). Thus, the evolution of the whole RNA processing mechanism seems to have been geared toward the elimination of stop codons, making those stop codons the focal points for RNA processing.

Stop codons are key parts of every genetic element in the eukaryotic gene


Genetic element             Consensus sequence
Promoter                    TATAAT
Donor splice sequence       CAG:GTAAGT
                            CAG:GTGAGT
Acceptor splice sequence    (C/T)9…TAG:GT
Lariat sequence             CTGAC
                            CTAAC
Poly-A addition site        AATAAA
The consistent occurrence of stop codons in genetic elements in eukaryotic genes. The consensus sequences of the different genetic elements in eukaryotic genes are shown. Each of these sequences contains one or more stop codons.
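
A short scan shows where the stop triplets fall in each consensus sequence. This is only an illustrative sketch with hypothetical names: it looks for TAA, TAG and TGA at every offset (not in a fixed reading frame), uses AATAAA for the poly-A signal as given in the text, and for the acceptor uses only the explicitly written bases TAG:GT, leaving out the preceding pyrimidine tract.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")

# Consensus sequences discussed above; ":" marks the exon/intron boundary.
CONSENSUS = {
    "Promoter": ["TATAAT"],
    "Donor splice sequence": ["CAG:GTAAGT", "CAG:GTGAGT"],
    "Acceptor splice sequence": ["TAG:GT"],  # explicit bases only
    "Lariat sequence": ["CTGAC", "CTAAC"],
    "Poly-A addition site": ["AATAAA"],
}


def find_stop_codons(consensus):
    """Return (offset, triplet) for every stop triplet at any offset."""
    seq = consensus.replace(":", "")  # drop the boundary marker before scanning
    return [
        (i, seq[i : i + 3])
        for i in range(len(seq) - 2)
        if seq[i : i + 3] in STOP_CODONS
    ]


for element, sequences in CONSENSUS.items():
    for sequence in sequences:
        print(f"{element:26} {sequence:12} {find_stop_codons(sequence)}")
```

Each listed sequence contains exactly one stop triplet under this scan, for example TAA at offset 4 in CAG:GTAAGT once the boundary marker is removed.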

Dr. Senapathy’s work based on his split gene theory indicates that stop codons occur as key parts of every genetic element in eukaryotic genes. The table above shows that the key parts of the core promoter elements, the lariat (branch point) signal, the donor and acceptor splice signals, and the poly-A addition signal consist of one or more stop codons. This finding corroborates the split gene theory, which holds that the underlying reason for the complete split gene paradigm is the origin of split genes from random DNA sequences, in which the randomly distributed and extremely frequent stop codons were used by nature to define these genetic elements.

Why exons are short and introns are long?

Research based on the split gene theory sheds light on other basic questions about exons and introns. The exons of eukaryotes are generally short (human exons average ~120 bases and can be as short as 10 bases), whereas introns are usually very long (averaging ~3,000 bases, and reaching several hundred thousand bases in genes such as RBFOX1, CNTNAP2, PTPRD and DLG2). Dr. Senapathy has provided a plausible answer to these questions, which has remained the only explanation so far. According to the split gene theory, if exons of eukaryotic genes originated from random DNA sequences, their lengths should match those of ORFs in random sequence, and should possibly be around 100 bases (close to the median length of ORFs in random sequence). The genome sequences of living organisms, for example the human genome, exhibit an average exon length of about 120 bases and a longest exon length of about 600 bases (with few exceptions), which is the same as the length of the longest random ORFs.[1][2][3][19]
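
The length distribution of ORFs in random sequence can be explored with a small simulation. The sketch below is an illustration under stated assumptions, not the procedure used in the cited papers: bases are drawn independently with equal frequencies, an ORF is taken to be the stop-free stretch between two consecutive stop codons in a single reading frame, and all function names are hypothetical. Different base compositions or ORF definitions would give different numbers.

```python
import random
import statistics

STOP_CODONS = {"TAA", "TAG", "TGA"}


def random_dna(length, rng):
    """Random sequence with independent, equally likely bases (an assumption)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def orf_lengths(seq, frame=0):
    """Lengths (in bases) of stop-free stretches between consecutive stop codons.

    Only stretches bounded by a stop codon on both sides are counted, and
    only the given reading frame is scanned. Back-to-back stops give length 0.
    """
    lengths = []
    current = 0
    seen_stop = False
    for i in range(frame, len(seq) - 2, 3):
        if seq[i : i + 3] in STOP_CODONS:
            if seen_stop:
                lengths.append(current)
            seen_stop = True
            current = 0
        else:
            current += 3
    return lengths


rng = random.Random(0)  # fixed seed so the run is repeatable
lengths = orf_lengths(random_dna(300_000, rng))
print("ORFs found:   ", len(lengths))
print("mean length:  ", round(statistics.mean(lengths), 1))
print("median length:", statistics.median(lengths))
print("longest ORF:  ", max(lengths))
```

Plotting a histogram of `lengths` is a simple way to compare the simulated distribution with the exon lengths quoted above.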

If split genes originated in random DNA sequences, then introns would be long for several reasons. The stop codons occur in clusters leading to numerous consecutive very short ORFs, and longer ORFs that could be defined as exons would be rarer. Furthermore, the best of the coding sequence parameters for functional proteins would be chosen from the long ORFs in random sequence, which may occur rarely. In addition, the combination of the donor and acceptor splice junction sequences within short lengths of coding sequence segments that would define exon boundaries would occur rarely in a random sequence. These combined reasons would make introns very long compared to the lengths of exons.   

Why eukaryotic genomes are large?

This work also explains why genomes are very large (for example, the human genome has three billion bases), and why only a very small fraction of the human genome (~2%) codes for proteins and other regulatory elements.[26][27] If split genes originated from random primordial DNA sequences, they would contain a significant amount of DNA in the form of introns. Furthermore, a genome assembled from random DNA containing split genes would also include intergenic random DNA. Thus, the nascent genomes that originated from random DNA sequences had to be large, regardless of the complexity of the organism.

The observation that the genomes of several organisms, such as the onion (~16 billion bases[28]) and the salamander (~32 billion bases[29]), are much larger than that of the human (~3 billion bases[30][31]), although these organisms are no more complex than humans, lends credence to the split gene theory. Furthermore, the finding that the genomes of several organisms are smaller, although they contain essentially the same number of genes as the human genome, such as those of C. elegans (genome size ~100 million bases, ~19,000 genes)[32] and Arabidopsis thaliana (genome size ~125 million bases, ~25,000 genes),[33] adds support to this theory. The split gene theory predicts that the introns in the split genes in these genomes could be “reduced” (or deleted) forms of the larger genes with long introns, thus leading to reduced genomes.[1][19] In fact, researchers have proposed that these smaller genomes are actually reduced genomes, which adds support to the split gene theory.[34]

Origin of the spliceosomal machinery and the eukaryotic cell nucleus

Dr. Senapathy's research also addresses the origin of the spliceosomal machinery that edits out the introns from the RNA transcripts of genes. If the split genes had originated from random DNA, then the introns would have become an unnecessary but integral part of the eukaryotic genes, along with the splice junctions at their ends. The spliceosomal machinery would be required to remove them and to enable the short exons to be linearly spliced together into a contiguously coding mRNA that can be translated into a complete protein. Thus, according to the split gene theory, the whole spliceosomal machinery originated as a consequence of the origin of split genes from random DNA sequences, in order to remove the unnecessary introns.[1][2]

As noted above, Dr. Colin Blake, the author of the Gilbert-Blake theory for the origin of introns and exons, states, “Recent work by Senapathy, when applied to RNA, comprehensively explains the origin of the segregated form of RNA into coding and noncoding regions. It also suggests why a splicing mechanism was developed at the start of primordial evolution.”

Dr. Senapathy also proposed a plausible mechanistic and functional rationale for why the eukaryotic nucleus originated, a major question in biology.[1][2] If the transcripts of the split genes and the spliced mRNAs were present in a cell without a nucleus, the ribosomes would try to bind to both the unspliced primary RNA transcript and the spliced mRNA, which would result in molecular chaos. A boundary that separates the RNA splicing process from mRNA translation would avoid this problem. This is exactly what is found in eukaryotic cells, where the splicing of the primary RNA transcript occurs within the nucleus, and the spliced mRNA is transported to the cytoplasm, where the ribosomes translate it into protein. The nuclear boundary provides a clear separation of primary RNA splicing and mRNA translation.

Origin of the eukaryotic cell

These investigations thus led to the possibility that primordial DNA with essentially random sequence gave rise to the complex structure of the split genes with exons, introns and splice junctions. They also predict that the cells that harbored these split genes had to be complex, with a nuclear-cytoplasmic boundary, and must have had a spliceosomal machinery. Thus, it was possible that the earliest cell was complex and eukaryotic.[1][2][3][19] Findings from extensive comparative genomics research on several organisms over the past 15 years suggest that the earliest organisms could have been highly complex and eukaryotic, and could have contained complex proteins,[35][36][37][38][39][40][41] as predicted by Dr. Senapathy's theory.

The spliceosome is a highly complex machinery within the eukaryotic cell, containing ~200 proteins and several snRNPs. In their paper “Complex spliceosomal organization ancestral to extant eukaryotes,”[39] molecular biologists Dr. Lesley Collins and Dr. David Penny state, “We begin with the hypothesis that ... the spliceosome has increased in complexity throughout eukaryotic evolution. However, examination of the distribution of spliceosomal components indicates that not only was a spliceosome present in the eukaryotic ancestor but it also contained most of the key components found in today's eukaryotes. ... the last common ancestor of extant eukaryotes appears to show much of the molecular complexity seen today.” This suggests that the earliest eukaryotic organisms were highly complex and contained sophisticated genes and proteins, as the split gene theory predicts.

The Shapiro-Senapathy algorithm

Based on the split gene theory, Dr. Senapathy developed computational algorithms to detect donor and acceptor splice sites, exons, and complete split genes in a genomic sequence. He developed the position weight matrix (PWM) method, based on the frequency of the four bases at each position of the donor and acceptor consensus sequences in different organisms, to identify the splice sites in a given sequence. Furthermore, he formulated the first algorithm to find exons, based on the requirement that an exon be bounded by an acceptor sequence (at the 5’ end) and a donor sequence (at the 3’ end) and lie within an ORF, and another algorithm to find a complete split gene. These algorithms are collectively known as the Shapiro-Senapathy algorithm (S&S).[69][70]

The Shapiro-Senapathy algorithm aids in the identification of splicing mutations that cause numerous diseases and adverse drug reactions.[42][43] Using the S&S algorithm, scientists have identified mutations and genes that cause numerous cancers, inherited disorders, immune deficiency diseases and neurological disorders.
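
The general idea of a position weight matrix can be sketched in a few lines. The example below is a generic log-odds PWM with a pseudocount and a uniform background, which are assumptions of this sketch; it is not the scoring formula of the published Shapiro-Senapathy algorithm, and the five training sequences are made-up donor sites used only for illustration.

```python
import math

BASES = "ACGT"


def build_pwm(aligned_sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from equal-length aligned sites.

    Each column holds log2(observed frequency / background frequency),
    with a pseudocount so that unseen bases do not give log(0).
    """
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in aligned_sites]
        total = len(column) + pseudocount * len(BASES)
        pwm.append(
            {
                base: math.log2(((column.count(base) + pseudocount) / total) / background)
                for base in BASES
            }
        )
    return pwm


def score(pwm, candidate):
    """Sum of per-position log-odds scores for a candidate site."""
    return sum(column[base] for column, base in zip(pwm, candidate))


def best_site(pwm, sequence):
    """Return (score, start offset) of the best-scoring window in a sequence."""
    width = len(pwm)
    windows = [
        (score(pwm, sequence[i : i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(windows)


# Hypothetical toy training set of donor sites (3 exon bases + 6 intron bases).
toy_donor_sites = ["CAGGTAAGT", "CAGGTGAGT", "AAGGTAAGT", "CAGGTAAGA", "TAGGTGAGT"]
donor_pwm = build_pwm(toy_donor_sites)

print(score(donor_pwm, "CAGGTAAGT"))   # consensus-like site
print(score(donor_pwm, "CCCCCCCCC"))   # unrelated sequence
print(best_site(donor_pwm, "TTTTCAGGTAAGTCCCC"))
```

In practice the matrix would be built from a large set of annotated splice sites for the organism of interest, and a score threshold would be chosen to separate likely sites from background.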

The widespread use of this algorithm in biological research and clinical applications worldwide adds credence to the split gene theory, as this algorithm emanated from the split gene theory.  

It is increasingly used in clinical practice and research not only to find mutations in known disease-causing genes in patients, but also to discover novel genes that are causal of different diseases.

Furthermore, it is used in defining the cryptic splice sites and deducing the mechanisms by which mutations in them can affect normal splicing and lead to different diseases. It is also employed in addressing various questions in basic research in humans, animals and plants.

These contributions have influenced major questions in eukaryotic biology and their applications to human medicine. These applications may expand as the fields of clinical genomics and pharmacogenomics scale up their research with very large sequencing projects such as the All of Us project,[44] which will sequence a million individuals, and with the sequencing of millions of patients in clinical practice and research in the future.

Bacterial genes could have originated from split genes?

According to the split gene theory, only genes split into short exons and long introns, with a maximum exon length of ~600 bases, could have occurred in random DNA sequences. Genes with long uninterrupted coding sequences, thousands of bases long and in some cases more than 10,000 and up to 90,000 bases long, which occur in many bacterial organisms,[18] would have been practically impossible in random sequence. However, bacterial genes could have originated from split genes by losing introns, which seems to be the only way to arrive at long coding sequences. This route is also more plausible than lengthening very short random ORFs into very long ORFs by specifically removing stop codons through mutation.[1][2][3]

Gene size (bases)    Number of genes
5,000 - 10,000       3,029
10,000 - 15,000      492
15,000 - 20,000      131
20,000 - 25,000      39
>25,000              41
Extremely long coding sequences occur as very long ORFs in bacterial genes. Thousands of genes that are longer than 5,000 bases, coding for proteins that are longer than 2,000 amino acids, exist in many bacterial genomes. The longest genes are ~90,000 bases long, coding for proteins ~30,000 amino acids long. Each of these genes occurs in a single stretch of coding sequence (ORF) without any interrupting stop codons or intervening introns. Data taken from Think big – giant genes in bacteria.[18]
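
The argument that very long ORFs are improbable in random sequence can be made concrete with a back-of-the-envelope calculation. This sketch assumes independent, equally frequent bases, so that 3 of the 64 possible codons are stop codons, and it considers a single reading frame; the function name is hypothetical. Logarithms are used because the probabilities are too small to represent directly.

```python
import math

STOP_FRACTION = 3 / 64  # 3 of 64 codons are stops, assuming equal base frequencies


def log10_prob_stop_free(length_bases):
    """log10 of the chance that one reading frame of this length has no stop codon."""
    codons = length_bases // 3
    return codons * math.log10(1 - STOP_FRACTION)


for length in (600, 5_000, 90_000):
    print(f"{length:>6} bases: probability ~ 10^{log10_prob_stop_free(length):.1f}")
```

Under these assumptions the probability falls by roughly a factor of ten for every 48 codons, so a stop-free stretch of 90,000 bases is vastly less likely than one of 600 bases.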

According to the split gene theory, this process of intron loss could have taken place starting from prebiotic random DNA. The resulting contiguously coding genes could be tightly organized in bacterial genomes, without any introns, and be more streamlined. According to Dr. Senapathy, the nuclear boundary that was required for a cell containing split genes in its genome (see the section Origin of the spliceosomal machinery and the eukaryotic cell nucleus, above) would not be required for a cell containing only contiguously coding genes. Thus, bacterial cells did not develop a nucleus. According to the split gene theory, eukaryotic genomes and bacterial genomes could have originated independently from the split genes in primordial random DNA sequences.

Comprehensive corroborating evidences for the split gene theory

If the split gene theory is correct, the structural features of split genes predicted from computer-simulated random sequences can be expected to occur in actual eukaryotic split genes. This is what is found in most known split genes in eukaryotes living today. Eukaryotic sequences exhibit a nearly perfect negative exponential distribution of ORF lengths, with an upper limit of 600 bases (with rare exceptions).[1][2][19][3] Also, with rare exceptions, the exons of eukaryotic genes fall within this 600-base upper limit.

Moreover, if this theory is correct, exons should be delimited by stop codons, especially at the 3’ ends of exons (that is, the 5’ ends of introns). In most known genes they are indeed delimited more strongly at the 3’ ends of exons and less strongly at the 5’ ends, as predicted.[1][2][19][3] These stop codons are the most important functional parts of both splice junctions (the canonical bases GT:AG). The theory thus provides an explanation for the “conserved” splice junctions at the ends of exons and for the loss of these stop codons along with introns when they are spliced out. If this theory is correct, splice junctions should be randomly distributed in eukaryotic DNA sequences, and they are.[3][22][42][43] The splice junctions present in transfer RNA genes and ribosomal RNA genes, which do not code for proteins and in which stop codons have no functional meaning, should not contain stop codons, and again, this is observed. The lariat signal, another sequence involved in the splicing process, also contains stop codons.[1][2][3][19][22][42][43] These findings show that the predictions of the split gene theory concerning the structure and function of split genes in random DNA sequences are corroborated by the structural and functional characteristics of split genes in modern eukaryotic organisms.

If the split genes originated from random primordial DNA sequences, as proposed in the split gene theory, there could be evidence that they were present in the earliest organisms. Using comparative analysis of modern genome data from several living organisms, scientists have found that the characteristics of split genes present in modern eukaryotes trace back to the earliest organisms that appeared on Earth. These studies show that the earliest organisms could have contained the intron-rich split genes and complex proteins that occur in today’s living organisms.[45][46][47][48][49][50][51][52][53]

In addition, using another computational analytical method known as “maximum likelihood analysis,” scientists have found that the earliest eukaryotic organisms must have contained the same genes as today’s living organisms, with an even higher density of introns.[54] Furthermore, comparative genomics of many organisms, including basal eukaryotes (considered to be primitive eukaryotic organisms, such as Amoeboflagellata, Diplomonadida, and Parabasalia), has shown that the intron-rich split genes and fully formed spliceosome of today’s complex organisms were present in the earliest organisms, and that the earliest organisms were extremely complex, with all of the eukaryotic cellular components.[55][56][57][58][59][60]

These findings are as predicted by the split gene theory and provide support for it. The theory is corroborated by comparative analysis of actual eukaryotic gene sequences against computer-generated random DNA sequences. Furthermore, comparative analysis of genome data from many living organisms by several groups of scientists shows that the earliest organisms that appeared on Earth had intron-rich split genes coding for complex proteins and cellular components, such as those found in modern eukaryotic organisms. Thus, the split gene theory provides a comprehensive account of the structural and functional features of the split gene architecture, with strong corroborating evidence.

1. Senapathy, P. (April 1986). "Origin of eukaryotic introns: a hypothesis, based on codon distribution statistics in genes, and its implications". Proceedings of the National Academy of Sciences of the United States of America. 83 (7): 2133–2137. ISSN 0027-8424. PMID 3457379. PMC 323245.
2. Senapathy, P. (February 1988). "Possible evolution of splice-junction signals in eukaryotic genes from stop codons". Proceedings of the National Academy of Sciences of the United States of America. 85 (4): 1129–1133. ISSN 0027-8424. PMID 3422483. PMC 279719.
3. Senapathy, P. (1995-06-02). "Introns and the origin of protein-coding genes". Science. 268 (5215): 1366–1367; author reply 1367–1369. doi:10.1126/science.7761858. ISSN 0036-8075. PMID 7761858. Bibcode 1995Sci...268.1366S.
4. Gillies, S. D.; Morrison, S. L.; Oi, V. T.; Tonegawa, S. (June 1983). "A tissue-specific transcription enhancer element is located in the major intron of a rearranged immunoglobulin heavy chain gene". Cell. 33 (3): 717–728. ISSN 0092-8674. PMID 6409417.
5. Mercola, M.; Wang, X. F.; Olsen, J.; Calame, K. (1983-08-12). "Transcriptional enhancer elements in the mouse immunoglobulin heavy chain locus". Science. 221 (4611): 663–665. doi:10.1126/science.6306772. ISSN 0036-8075. PMID 6306772. Bibcode 1983Sci...221..663M.
6. Berk, A. J.; Sharp, P. A. (November 1977). "Sizing and mapping of early adenovirus mRNAs by gel electrophoresis of S1 endonuclease-digested hybrids". Cell. 12 (3): 721–732. ISSN 0092-8674. PMID 922889.
7. Berget, S. M.; Moore, C.; Sharp, P. A. (August 1977). "Spliced segments at the 5' terminus of adenovirus 2 late mRNA". Proceedings of the National Academy of Sciences of the United States of America. 74 (8): 3171–3175. ISSN 0027-8424. PMID 269380. PMC 431482.
8. Chow, L. T.; Roberts, J. M.; Lewis, J. B.; Broker, T. R. (August 1977). "A map of cytoplasmic RNA transcripts from lytic adenovirus type 2, determined by electron microscopy of RNA:DNA hybrids". Cell. 11 (4): 819–836. ISSN 0092-8674. PMID 890740.
9. "Online Education Kit: 1977: Introns Discovered". National Human Genome Research Institute (NHGRI). https://www.genome.gov/25520306/online-education-kit-1977-introns-discovered/. Retrieved 2019-01-01.
10. Doolittle, W. Ford (13 April 1978). "Genes in pieces: were they ever together?". Nature. 272 (5654): 581–582. doi:10.1038/272581a0. ISSN 1476-4687. Bibcode 1978Natur.272..581D. https://www.nature.com/articles/272581a0.
11. Darnell, J. E. (1978-12-22). "Implications of RNA-RNA splicing in evolution of eukaryotic cells". Science. 202 (4374): 1257–1260. ISSN 0036-8075. PMID 364651.
12. Doolittle, W. F.; Darnell, J. E. (1986-03-01). "Speculations on the early course of evolution". Proceedings of the National Academy of Sciences. 83 (5): 1271–1275. doi:10.1073/pnas.83.5.1271. ISSN 1091-6490. PMID 2419905. PMC 323057. Bibcode 1986PNAS...83.1271D. https://www.pnas.org/content/83/5/1271.
13. Blake, C.C.F. (1985-01-01). "Exons and the Evolution of Proteins". International Review of Cytology. 93: 149–185. doi:10.1016/S0074-7696(08)61374-1. ISSN 0074-7696. ISBN 9780123644930. https://www.sciencedirect.com/science/article/abs/pii/S0074769608613741.
14. Gilbert, Walter (February 1978). "Why genes in pieces?". Nature. 271 (5645): 501. doi:10.1038/271501a0. ISSN 1476-4687. PMID 622185. Bibcode 1978Natur.271..501G. https://www.nature.com/articles/271501a0.
15. Tonegawa, S.; Maxam, A. M.; Tizard, R.; Bernard, O.; Gilbert, W. (March 1978). "Sequence of a mouse germ-line gene for a variable region of an immunoglobulin light chain". Proceedings of the National Academy of Sciences of the United States of America. 75 (3): 1485–1489. doi:10.1073/pnas.75.3.1485. ISSN 0027-8424. PMID 418414. PMC 411497. Bibcode 1978PNAS...75.1485T.
16. Feng, D. F.; Doolittle, R. F. (1987-01-01). "Reconstructing the Evolution of Vertebrate Blood Coagulation from a Consideration of the Amino Acid Sequences of Clotting Proteins". Cold Spring Harbor Symposia on Quantitative Biology. 52: 869–874. doi:10.1101/SQB.1987.052.01.095. ISSN 1943-4456. PMID 3483343. http://symposium.cshlp.org/content/52/869.
17. Gibbons, A. (1990-12-07). "Calculating the original family--of exons". Science. 250 (4986): 1342. doi:10.1126/science.1701567. ISSN 1095-9203. PMID 1701567. Bibcode 1990Sci...250.1342G. http://science.sciencemag.org/content/250/4986/1342.
18. Reva, Oleg; Tümmler, Burkhard (2008). "Think big – giant genes in bacteria". Environmental Microbiology. 10 (3): 768–777. doi:10.1111/j.1462-2920.2007.01500.x. ISSN 1462-2920. PMID 18237309. hdl:2263/9009.
19. Regulapati, Rahul; Singh, Chandan Kumar; Bhasi, Ashwini; Senapathy, Periannan (2008-10-20). "Origination of the Split Structure of Spliceosomal Genes from Random Genetic Sequences". PLOS ONE. 3 (10): e3456. doi:10.1371/journal.pone.0003456. ISSN 1932-6203. PMID 18941625. PMC 2565106. Bibcode 2008PLoSO...3.3456R.
20. New Scientist. Reed Business Information. 1986-06-26. https://books.google.com/?id=oZjRIhZtINUC&pg=PA34#v=onepage&q&f=false.
21. New Scientist. Reed Business Information. 1988-03-31. https://books.google.com/?id=yGdGbzlA6AQC&pg=PA31#v=onepage&q&f=false.
22. Senapathy, Periannan; Harris, Nomi L. (1990-05-25). "Distribution and consenus of branch point signals in eukaryotic genes: a computerized statistical analysis". Nucleic Acids Research. 18 (10): 3015–9. doi:10.1093/nar/18.10.3015. ISSN 0305-1048. PMID 2349097. PMC 330832. https://academic.oup.com/nar/article/18/10/3015/2388453.
23. Maier, U.-G.; Brown, J.W.S.; Toloczyki, C.; Feix, G. (January 1987). "Binding of a nuclear factor to a consensus sequence in the 5' flanking region of zein genes from maize". The EMBO Journal. 6 (1): 17–22. ISSN 0261-4189. PMID 15981330. PMC 553350.
24. Keller, E. B.; Noon, W. A. (1985-07-11). "Intron splicing: a conserved internal signal in introns of Drosophila pre-mRNAs". Nucleic Acids Research. 13 (13): 4971–4981. ISSN 0305-1048. PMID 2410858. PMC 321838.
25. Birnstiel, M.; Busslinger, M.; Strub, K. (June 1985). "Transcription termination and 3′ processing: the end is in site!". Cell. 41 (2): 349–359. doi:10.1016/s0092-8674(85)80007-6. ISSN 0092-8674.
26. International Human Genome Sequencing Consortium (February 2001). "Initial sequencing and analysis of the human genome". Nature. 409 (6822): 860–921. doi:10.1038/35057062. ISSN 1476-4687. PMID 11237011. Bibcode 2001Natur.409..860L. https://www.nature.com/articles/35057062.
27. Zhu, Xiaohong; Zandieh, Ali; Xia, Ashley; Wu, Mitchell; Wu, David; Wen, Meiyuan; Wang, Mei; Venter, Eli; Turner, Russell (2001-02-16). "The Sequence of the Human Genome". Science. 291 (5507): 1304–1351. doi:10.1126/science.1058040. ISSN 1095-9203. PMID 11181995. Bibcode 2001Sci...291.1304V. http://science.sciencemag.org/content/291/5507/1304.
28. Kang, Byoung-Cheorl; Nah, Gyoungju; Lee, Heung-Ryul; Han, Koeun; Purushotham, Preethi M.; Jo, Jinkwan (2017). "Development of a Genetic Map for Onion (Allium cepa L.) Using Reference-Free Genotyping-by-Sequencing and SNP Assays". Frontiers in Plant Science. 8: 1606. doi:10.3389/fpls.2017.01606. ISSN 1664-462X. PMID 28959273. PMC 5604068.
29. Smith, Jeramiah J.; Voss, S. Randal; Tsonis, Panagiotis A.; Timoshevskaya, Nataliya Y.; Timoshevskiy, Vladimir A.; Keinath, Melissa C. (2015-11-10). "Initial characterization of the large genome of the salamander Ambystoma mexicanum using shotgun and laser capture chromosome sequencing". Scientific Reports. 5: 16413. doi:10.1038/srep16413. ISSN 2045-2322. PMID 26553646. PMC 4639759. Bibcode 2015NatSR...516413K. https://www.nature.com/articles/srep16413.
30. Venter, J. C.; Adams, M. D.; Myers, E. W.; Li, P. W.; Mural, R. J.; Sutton, G. G.; Smith, H. O.; Yandell, M.; Evans, C. A. (2001-02-16). "The sequence of the human genome". Science. 291 (5507): 1304–1351. doi:10.1126/science.1058040. ISSN 0036-8075. PMID 11181995. Bibcode 2001Sci...291.1304V.
31. Lander, E. S.; Linton, L. M.; Birren, B.; Nusbaum, C.; Zody, M. C.; Baldwin, J.; Devon, K.; Dewar, K.; Doyle, M. (2001-02-15). "Initial sequencing and analysis of the human genome". Nature. 409 (6822): 860–921. doi:10.1038/35057062. ISSN 0028-0836. PMID 11237011. Bibcode 2001Natur.409..860L.
32. The C. elegans Sequencing Consortium (1998-12-11). "Genome Sequence of the Nematode C. elegans: A Platform for Investigating Biology". Science. 282 (5396): 2012–2018. doi:10.1126/science.282.5396.2012. ISSN 1095-9203. PMID 9851916. http://science.sciencemag.org/content/282/5396/2012.
33. Arabidopsis Genome Initiative (2000-12-14). "Analysis of the genome sequence of the flowering plant Arabidopsis thaliana". Nature. 408 (6814): 796–815. doi:10.1038/35048692. ISSN 0028-0836. PMID 11130711. Bibcode 2000Natur.408..796T.
34. Bennetzen, Jeffrey L.; Brown, James K. M.; Devos, Katrien M. (2002-07-01). "Genome Size Reduction through Illegitimate Recombination Counteracts Genome Expansion in Arabidopsis". Genome Research. 12 (7): 1075–1079. doi:10.1101/gr.132102. ISSN 1549-5469. PMID 12097344. PMC 186626. http://genome.cshlp.org/content/12/7/1075.
35. Kurland, C. G.; Canbäck, B.; Berg, O. G. (December 2007). "The origins of modern proteomes". Biochimie. 89 (12): 1454–1463. doi:10.1016/j.biochi.2007.09.004. ISSN 0300-9084. PMID 17949885.
36. Caetano-Anollés, Gustavo; Caetano-Anollés, Derek (July 2003). "An evolutionarily structured universe of protein architecture". Genome Research. 13 (7): 1563–1571. doi:10.1101/gr.1161903. ISSN 1088-9051. PMID 12840035. PMC 403752.
37. Glansdorff, Nicolas; Xu, Ying; Labedan, Bernard (2008-07-09). "The last universal common ancestor: emergence, constitution and genetic legacy of an elusive forerunner". Biology Direct. 3: 29. doi:10.1186/1745-6150-3-29. ISSN 1745-6150. PMID 18613974. PMC 2478661.
38. Kurland, C. G.; Collins, L. J.; Penny, D. (2006-05-19). "Genomics and the irreducible nature of eukaryote cells". Science. 312 (5776): 1011–1014. doi:10.1126/science.1121674. ISSN 1095-9203. PMID 16709776. Bibcode 2006Sci...312.1011K.
39. Collins, Lesley; Penny, David (April 2005). "Complex spliceosomal organization ancestral to extant eukaryotes". Molecular Biology and Evolution. 22 (4): 1053–1066. doi:10.1093/molbev/msi091. ISSN 0737-4038. PMID 15659557.
40. Penny, David; Collins, Lesley J.; Daly, Toni K.; Cox, Simon J. (December 2014). "The relative ages of eukaryotes and akaryotes". Journal of Molecular Evolution. 79 (5–6): 228–239. doi:10.1007/s00239-014-9643-y. ISSN 1432-1432. PMID 25179144. Bibcode 2014JMolE..79..228P.
41. ^{{Cite journal|last=Fuerst|first=John A.|last2=Sagulenko|first2=Evgeny|date=2012-05-04|title=Keys to Eukaryality: Planctomycetes and Ancestral Evolution of Cellular Complexity|journal=Frontiers in Microbiology|volume=3|pages=167|doi=10.3389/fmicb.2012.00167|issn=1664-302X|pmc=3343278|pmid=22586422}}
42. ^{{Cite journal|last=Shapiro|first=M. B.|last2=Senapathy|first2=P.|date=1987-09-11|title=RNA splice junctions of different classes of eukaryotes: sequence statistics and functional implications in gene expression|journal=Nucleic Acids Research|volume=15|issue=17|pages=7155–7174|issn=0305-1048|pmid=3658675|pmc=306199}}
43. ^{{Cite journal|last=Senapathy|first=P.|last2=Shapiro|first2=M. B.|last3=Harris|first3=N. L.|date=1990|title=Splice junctions, branch point sites, and exons: sequence statistics, identification, and applications to genome project|journal=Methods in Enzymology|volume=183|pages=252–278|issn=0076-6879|pmid=2314278}}
44. ^{{Cite web|url=https://allofus.nih.gov/|title=National Institutes of Health (NIH) — All of Us|website=allofus.nih.gov|access-date=2019-01-02}}
45. ^{{Cite journal|last=Penny|first=David|last2=Collins|first2=Lesley|date=2005-04-01|title=Complex Spliceosomal Organization Ancestral to Extant Eukaryotes|url=https://academic.oup.com/mbe/article/22/4/1053/1083329|journal=Molecular Biology and Evolution|language=en|volume=22|issue=4|pages=1053–1066|doi=10.1093/molbev/msi091|pmid=15659557|issn=0737-4038}}
46. ^{{Cite journal|last=Caetano-Anollés|first=Derek|last2=Caetano-Anollés|first2=Gustavo|date=2003-07-01|title=An Evolutionarily Structured Universe of Protein Architecture|url=http://genome.cshlp.org/content/13/7/1563|journal=Genome Research|language=en|volume=13|issue=7|pages=1563–1571|doi=10.1101/gr.1161903|issn=1549-5469|pmc=403752|pmid=12840035}}
47. ^{{Cite journal|last=Glansdorff|first=Nicolas|last2=Xu|first2=Ying|last3=Labedan|first3=Bernard|date=2008-07-09|title=The Last Universal Common Ancestor: emergence, constitution and genetic legacy of an elusive forerunner|journal=Biology Direct|volume=3|issue=1|pages=29|doi=10.1186/1745-6150-3-29|issn=1745-6150|pmc=2478661|pmid=18613974}}
48. ^{{Cite journal|date=2007-12-01|title=The origins of modern proteomes|url=https://www.sciencedirect.com/science/article/pii/S0300908407002465|journal=Biochimie|language=en|volume=89|issue=12|pages=1454–1463|doi=10.1016/j.biochi.2007.09.004|pmid=17949885|issn=0300-9084|last1=Kurland|first1=C.G.|last2=Canbäck|first2=B.|last3=Berg|first3=O.G.}}
49. ^{{Cite journal|last=Penny|first=D.|last2=Collins|first2=L. J.|last3=Kurland|first3=C. G.|date=2006-05-19|title=Genomics and the Irreducible Nature of Eukaryote Cells|url=http://science.sciencemag.org/content/312/5776/1011|journal=Science|language=en|volume=312|issue=5776|pages=1011–1014|doi=10.1126/science.1121674|issn=1095-9203|pmid=16709776|bibcode=2006Sci...312.1011K}}
50. ^{{Cite journal|last=Poole|first=A. M.|last2=Jeffares|first2=D. C.|last3=Penny|first3=D.|date=January 1998|title=The path from the RNA world|journal=Journal of Molecular Evolution|volume=46|issue=1|pages=1–17|issn=0022-2844|pmid=9419221}}
51. ^{{Cite journal|last=Forterre|first=Patrick|last2=Philippe|first2=Hervé|date=1999|title=Where is the root of the universal tree of life?|url=https://onlinelibrary.wiley.com/doi/abs/10.1002/%28SICI%291521-1878%28199910%2921%3A10%3C871%3A%3AAID-BIES10%3E3.0.CO%3B2-Q|journal=BioEssays|language=en|volume=21|issue=10|pages=871–879|doi=10.1002/(SICI)1521-1878(199910)21:103.0.CO;2-Q|issn=1521-1878|doi-broken-date=2019-01-10}}
52. ^{{Cite journal|last=Cox|first=Simon J.|last2=Daly|first2=Toni K.|last3=Collins|first3=Lesley J.|last4=Penny|first4=David|date=2014-12-01|title=The Relative Ages of Eukaryotes and Akaryotes|journal=Journal of Molecular Evolution|language=en|volume=79|issue=5–6|pages=228–239|doi=10.1007/s00239-014-9643-y|pmid=25179144|issn=1432-1432|bibcode=2014JMolE..79..228P}}
53. ^{{Cite journal|last=Sagulenko|first=Evgeny|last2=Fuerst|first2=John Arlington|date=2012|title=Keys to eukaryality: planctomycetes and ancestral evolution of cellular complexity|journal=Frontiers in Microbiology|language=English|volume=3|doi=10.3389/fmicb.2012.00167|issn=1664-302X|pmc=3343278|pmid=22586422}}
54. ^{{Cite journal|last=Gilbert|first=Walter|last2=Roy|first2=Scott W.|date=2005-02-08|title=Complex early genes|url=https://www.pnas.org/content/102/6/1986|journal=Proceedings of the National Academy of Sciences|language=en|volume=102|issue=6|pages=1986–1991|doi=10.1073/pnas.0408355101|issn=1091-6490|pmc=548548|pmid=15687506|bibcode=2005PNAS..102.1986R}}
55. ^{{Cite journal|last=Gilbert|first=Walter|last2=Roy|first2=Scott William|date=March 2006|title=The evolution of spliceosomal introns: patterns, puzzles and progress|url=https://www.nature.com/articles/nrg1807|journal=Nature Reviews Genetics|language=en|volume=7|issue=3|pages=211–221|doi=10.1038/nrg1807|pmid=16485020|issn=1471-0064|via=}}
56. ^{{Cite journal|last=Penny|first=David|last2=Collins|first2=Lesley|date=2005-04-01|title=Complex Spliceosomal Organization Ancestral to Extant Eukaryotes|url=https://academic.oup.com/mbe/article/22/4/1053/1083329|journal=Molecular Biology and Evolution|language=en|volume=22|issue=4|pages=1053–1066|doi=10.1093/molbev/msi091|pmid=15659557|issn=0737-4038}}
57. ^{{Cite journal|last=Rogozin|first=Igor B.|last2=Sverdlov|first2=Alexander V.|last3=Babenko|first3=Vladimir N.|last4=Koonin|first4=Eugene V.|date=June 2005|title=Analysis of evolution of exon-intron structure of eukaryotic genes|journal=Briefings in Bioinformatics|volume=6|issue=2|pages=118–134|issn=1467-5463|pmid=15975222}}
58. ^{{Cite journal|last=Sullivan|first=James C.|last2=Reitzel|first2=Adam M.|last3=Finnerty|first3=John R.|date=2006|title=A high percentage of introns in human genes were present early in animal evolution: evidence from the basal metazoan Nematostella vectensis|journal=Genome Informatics. International Conference on Genome Informatics|volume=17|issue=1|pages=219–229|issn=0919-9454|pmid=17503371}}
59. ^{{Cite journal|last=Koonin|first=Eugene V.|last2=Rogozin|first2=Igor B.|last3=Csuros|first3=Miklos|date=2011-09-15|title=A Detailed History of Intron-rich Eukaryotic Ancestors Inferred from a Global Survey of 100 Complete Genomes|journal=PLOS Computational Biology|language=en|volume=7|issue=9|pages=e1002150|doi=10.1371/journal.pcbi.1002150|issn=1553-7358|pmc=3174169|pmid=21935348|bibcode=2011PLSCB...7E2150C}}
60. ^{{Cite journal|last=Gilbert|first=Walter|last2=Roy|first2=Scott W.|date=2005-02-08|title=Complex early genes|url=https://www.pnas.org/content/102/6/1986|journal=Proceedings of the National Academy of Sciences|language=en|volume=102|issue=6|pages=1986–1991|doi=10.1073/pnas.0408355101|issn=1091-6490|pmc=548548|pmid=15687506|bibcode=2005PNAS..102.1986R}}
61. ^{{Cite journal|last=Senapathy|first=P.|date=April 1986|title=Origin of eukaryotic introns: a hypothesis, based on codon distribution statistics in genes, and its implications|journal=Proceedings of the National Academy of Sciences of the United States of America|volume=83|issue=7|pages=2133–2137|issn=0027-8424|pmid=3457379|pmc=323245}}
62. ^{{Cite journal|last=Senapathy|first=P.|date=February 1988|title=Possible evolution of splice-junction signals in eukaryotic genes from stop codons|journal=Proceedings of the National Academy of Sciences of the United States of America|volume=85|issue=4|pages=1129–1133|issn=0027-8424|pmid=3422483|pmc=279719}}
63. ^{{Cite journal|last=Senapathy|first=P.|date=1995-06-02|title=Introns and the origin of protein-coding genes|journal=Science|volume=268|issue=5215|pages=1366–1367; author reply 1367–1369|issn=0036-8075|pmid=7761858|bibcode=1995Sci...268.1366S|doi=10.1126/science.7761858}}
64. ^{{Cite journal|last=Gillies|first=S. D.|last2=Morrison|first2=S. L.|last3=Oi|first3=V. T.|last4=Tonegawa|first4=S.|date=June 1983|title=A tissue-specific transcription enhancer element is located in the major intron of a rearranged immunoglobulin heavy chain gene|journal=Cell|volume=33|issue=3|pages=717–728|issn=0092-8674|pmid=6409417}}
65. ^{{Cite journal|last=Mercola|first=M.|last2=Wang|first2=X. F.|last3=Olsen|first3=J.|last4=Calame|first4=K.|date=1983-08-12|title=Transcriptional enhancer elements in the mouse immunoglobulin heavy chain locus|journal=Science|volume=221|issue=4611|pages=663–665|issn=0036-8075|pmid=6306772|bibcode=1983Sci...221..663M|doi=10.1126/science.6306772}}
66. ^{{Cite journal|last=Berk|first=A. J.|last2=Sharp|first2=P. A.|date=November 1977|title=Sizing and mapping of early adenovirus mRNAs by gel electrophoresis of S1 endonuclease-digested hybrids|journal=Cell|volume=12|issue=3|pages=721–732|issn=0092-8674|pmid=922889}}
67. ^{{Cite journal|last=Berget|first=S M|last2=Moore|first2=C|last3=Sharp|first3=P A|date=August 1977|title=Spliced segments at the 5' terminus of adenovirus 2 late mRNA.|journal=Proceedings of the National Academy of Sciences of the United States of America|volume=74|issue=8|pages=3171–3175|issn=0027-8424|pmid=269380|pmc=431482}}
68. ^{{Cite journal|last=Chow|first=L. T.|last2=Roberts|first2=J. M.|last3=Lewis|first3=J. B.|last4=Broker|first4=T. R.|date=August 1977|title=A map of cytoplasmic RNA transcripts from lytic adenovirus type 2, determined by electron microscopy of RNA:DNA hybrids|journal=Cell|volume=11|issue=4|pages=819–836|issn=0092-8674|pmid=890740}}
69. ^{{Cite web|url=https://www.genome.gov/25520306/online-education-kit-1977-introns-discovered/|title=Online Education Kit: 1977: Introns Discovered|website=National Human Genome Research Institute (NHGRI)|language=en-US|access-date=2019-01-01}}
70. ^{{Cite journal|last=Doolittle|first=W. Ford|date=13 April 1978|title=Genes in pieces: were they ever together?|url=https://www.nature.com/articles/272581a0|journal=Nature|language=en|volume=272|issue=5654|pages=581–582|doi=10.1038/272581a0|issn=1476-4687|via=|bibcode=1978Natur.272..581D}}
71. ^{{Cite journal|last=Darnell|first=J. E.|date=1978-12-22|title=Implications of RNA-RNA splicing in evolution of eukaryotic cells|journal=Science|volume=202|issue=4374|pages=1257–1260|issn=0036-8075|pmid=364651}}
72. ^{{Cite journal|last=Doolittle|first=W. F.|last2=Darnell|first2=J. E.|date=1986-03-01|title=Speculations on the early course of evolution|url=https://www.pnas.org/content/83/5/1271|journal=Proceedings of the National Academy of Sciences|language=en|volume=83|issue=5|pages=1271–1275|doi=10.1073/pnas.83.5.1271|issn=1091-6490|pmid=2419905|pmc=323057|bibcode=1986PNAS...83.1271D}}
73. ^{{Cite book|date=1985-01-01|title=Exons and the Evolution of Proteins|url=https://www.sciencedirect.com/science/article/abs/pii/S0074769608613741|journal=International Review of Cytology|language=en|volume=93|pages=149–185|doi=10.1016/S0074-7696(08)61374-1|issn=0074-7696|last1=Blake|first1=C.C.F.|isbn=9780123644930}}
74. ^{{Cite journal|last=Gilbert|first=Walter|date=February 1978|title=Why genes in pieces?|url=https://www.nature.com/articles/271501a0|journal=Nature|language=en|volume=271|issue=5645|pages=501|doi=10.1038/271501a0|pmid=622185|issn=1476-4687|via=|bibcode=1978Natur.271..501G}}
75. ^{{Cite journal|last=Tonegawa|first=S|last2=Maxam|first2=A M|last3=Tizard|first3=R|last4=Bernard|first4=O|last5=Gilbert|first5=W|date=March 1978|title=Sequence of a mouse germ-line gene for a variable region of an immunoglobulin light chain.|journal=Proceedings of the National Academy of Sciences of the United States of America|volume=75|issue=3|pages=1485–1489|issn=0027-8424|pmid=418414|pmc=411497|bibcode=1978PNAS...75.1485T|doi=10.1073/pnas.75.3.1485}}
76. ^{{Cite journal|last=Feng|first=D. F.|last2=Doolittle|first2=R. F.|date=1987-01-01|title=Reconstructing the Evolution of Vertebrate Blood Coagulation from a Consideration of the Amino Acid Sequences of Clotting Proteins|url=http://symposium.cshlp.org/content/52/869|journal=Cold Spring Harbor Symposia on Quantitative Biology|language=en|volume=52|pages=869–874|doi=10.1101/SQB.1987.052.01.095|issn=1943-4456|pmid=3483343}}
77. ^{{Cite journal|last=Gibbons|first=A.|date=1990-12-07|title=Calculating the original family--of exons|url=http://science.sciencemag.org/content/250/4986/1342|journal=Science|language=en|volume=250|issue=4986|pages=1342|doi=10.1126/science.1701567|issn=1095-9203|pmid=1701567|bibcode=1990Sci...250.1342G}}
78. ^{{Cite journal|last=Reva|first=Oleg|last2=Tümmler|first2=Burkhard|date=2008|title=Think big – giant genes in bacteria|journal=Environmental Microbiology|language=en|volume=10|issue=3|pages=768–777|doi=10.1111/j.1462-2920.2007.01500.x|pmid=18237309|issn=1462-2920|hdl=2263/9009}}
79. ^{{Cite journal|last=Regulapati|first=Rahul|last2=Singh|first2=Chandan Kumar|last3=Bhasi|first3=Ashwini|last4=Senapathy|first4=Periannan|date=2008-10-20|title=Origination of the Split Structure of Spliceosomal Genes from Random Genetic Sequences|journal=PLOS ONE|language=en|volume=3|issue=10|pages=e3456|doi=10.1371/journal.pone.0003456|issn=1932-6203|pmc=2565106|pmid=18941625|bibcode=2008PLoSO...3.3456R}}
80. ^{{Cite book|url=https://books.google.com/?id=oZjRIhZtINUC&pg=PA34#v=onepage&q&f=false|title=New Scientist|last=Information|first=Reed Business|date=1986-06-26|publisher=Reed Business Information|language=en}}
81. ^{{Cite book|url=https://books.google.com/?id=yGdGbzlA6AQC&pg=PA31#v=onepage&q&f=false|title=New Scientist|last=Information|first=Reed Business|date=1988-03-31|publisher=Reed Business Information|language=en}}
82. ^{{Cite journal|last=Senapathy|first=Periannan|last2=Harris|first2=Nomi L.|date=1990-05-25|title=Distribution and consensus of branch point signals in eukaryotic genes: a computerized statistical analysis|url=https://academic.oup.com/nar/article/18/10/3015/2388453|journal=Nucleic Acids Research|language=en|volume=18|issue=10|pages=3015–9|doi=10.1093/nar/18.10.3015|pmid=2349097|pmc=330832|issn=0305-1048}}
83. ^{{Cite journal|last=Maier|first=U.-G.|last2=Brown|first2=J.W.S.|last3=Toloczyki|first3=C.|last4=Feix|first4=G.|date=January 1987|title=Binding of a nuclear factor to a consensus sequence in the 5' flanking region of zein genes from maize|journal=The EMBO Journal|volume=6|issue=1|pages=17–22|issn=0261-4189|pmid=15981330|pmc=553350}}
84. ^{{Cite journal|last=Keller|first=E B|last2=Noon|first2=W A|date=1985-07-11|title=Intron splicing: a conserved internal signal in introns of Drosophila pre-mRNAs.|journal=Nucleic Acids Research|volume=13|issue=13|pages=4971–4981|issn=0305-1048|pmid=2410858|pmc=321838}}
85. ^{{Cite journal|last=BIRNSTIEL|first=M|last2=BUSSLINGER|first2=M|last3=STRUB|first3=K|date=June 1985|title=Transcription termination and 3′ processing: the end is in site!|journal=Cell|volume=41|issue=2|pages=349–359|doi=10.1016/s0092-8674(85)80007-6|issn=0092-8674}}
86. ^{{Cite journal|last=Consortium|first=International Human Genome Sequencing|date=February 2001|title=Initial sequencing and analysis of the human genome|url=https://www.nature.com/articles/35057062|journal=Nature|language=en|volume=409|issue=6822|pages=860–921|doi=10.1038/35057062|pmid=11237011|issn=1476-4687|via=|bibcode=2001Natur.409..860L}}
87. ^{{Cite journal|last=Zhu|first=Xiaohong|last2=Zandieh|first2=Ali|last3=Xia|first3=Ashley|last4=Wu|first4=Mitchell|last5=Wu|first5=David|last6=Wen|first6=Meiyuan|last7=Wang|first7=Mei|last8=Venter|first8=Eli|last9=Turner|first9=Russell|date=2001-02-16|title=The Sequence of the Human Genome|url=http://science.sciencemag.org/content/291/5507/1304|journal=Science|language=en|volume=291|issue=5507|pages=1304–1351|doi=10.1126/science.1058040|issn=1095-9203|pmid=11181995|bibcode=2001Sci...291.1304V}}
88. ^{{Cite journal|last=Kang|first=Byoung-Cheorl|last2=Nah|first2=Gyoungju|last3=Lee|first3=Heung-Ryul|last4=Han|first4=Koeun|last5=Purushotham|first5=Preethi M.|last6=Jo|first6=Jinkwan|date=2017|title=Development of a Genetic Map for Onion (Allium cepa L.) Using Reference-Free Genotyping-by-Sequencing and SNP Assays|journal=Frontiers in Plant Science|language=English|volume=8|pages=1606|doi=10.3389/fpls.2017.01606|issn=1664-462X|pmc=5604068|pmid=28959273}}
89. ^{{Cite journal|last=Smith|first=Jeramiah J.|last2=Voss|first2=S. Randal|last3=Tsonis|first3=Panagiotis A.|last4=Timoshevskaya|first4=Nataliya Y.|last5=Timoshevskiy|first5=Vladimir A.|last6=Keinath|first6=Melissa C.|date=2015-11-10|title=Initial characterization of the large genome of the salamander Ambystoma mexicanum using shotgun and laser capture chromosome sequencing|url=https://www.nature.com/articles/srep16413|journal=Scientific Reports|language=en|volume=5|pages=16413|doi=10.1038/srep16413|issn=2045-2322|pmc=4639759|pmid=26553646|bibcode=2015NatSR...516413K}}
90. ^{{Cite journal|last=Venter|first=J. C.|last2=Adams|first2=M. D.|last3=Myers|first3=E. W.|last4=Li|first4=P. W.|last5=Mural|first5=R. J.|last6=Sutton|first6=G. G.|last7=Smith|first7=H. O.|last8=Yandell|first8=M.|last9=Evans|first9=C. A.|date=2001-02-16|title=The sequence of the human genome|journal=Science|volume=291|issue=5507|pages=1304–1351|doi=10.1126/science.1058040|issn=0036-8075|pmid=11181995|bibcode=2001Sci...291.1304V}}
91. ^{{Cite journal|last=Lander|first=E. S.|last2=Linton|first2=L. M.|last3=Birren|first3=B.|last4=Nusbaum|first4=C.|last5=Zody|first5=M. C.|last6=Baldwin|first6=J.|last7=Devon|first7=K.|last8=Dewar|first8=K.|last9=Doyle|first9=M.|date=2001-02-15|title=Initial sequencing and analysis of the human genome|journal=Nature|volume=409|issue=6822|pages=860–921|doi=10.1038/35057062|issn=0028-0836|pmid=11237011|bibcode=2001Natur.409..860L}}
92. ^{{Cite journal|last=Consortium*|first=The C. elegans Sequencing|date=1998-12-11|title=Genome Sequence of the Nematode C. elegans: A Platform for Investigating Biology|url=http://science.sciencemag.org/content/282/5396/2012|journal=Science|language=en|volume=282|issue=5396|pages=2012–2018|doi=10.1126/science.282.5396.2012|issn=1095-9203|pmid=9851916}}
93. ^{{Cite journal|last=Arabidopsis Genome Initiative|date=2000-12-14|title=Analysis of the genome sequence of the flowering plant Arabidopsis thaliana|journal=Nature|volume=408|issue=6814|pages=796–815|doi=10.1038/35048692|issn=0028-0836|pmid=11130711|bibcode=2000Natur.408..796T}}
94. ^{{Cite journal|last=Bennetzen|first=Jeffrey L.|last2=Brown|first2=James K. M.|last3=Devos|first3=Katrien M.|date=2002-07-01|title=Genome Size Reduction through Illegitimate Recombination Counteracts Genome Expansion in Arabidopsis|url=http://genome.cshlp.org/content/12/7/1075|journal=Genome Research|language=en|volume=12|issue=7|pages=1075–1079|doi=10.1101/gr.132102|issn=1549-5469|pmid=12097344|pmc=186626}}
95. ^{{Cite journal|last=Kurland|first=C. G.|last2=Canbäck|first2=B.|last3=Berg|first3=O. G.|date=December 2007|title=The origins of modern proteomes|journal=Biochimie|volume=89|issue=12|pages=1454–1463|doi=10.1016/j.biochi.2007.09.004|issn=0300-9084|pmid=17949885}}
96. ^{{Cite journal|last=Caetano-Anollés|first=Gustavo|last2=Caetano-Anollés|first2=Derek|date=July 2003|title=An evolutionarily structured universe of protein architecture|journal=Genome Research|volume=13|issue=7|pages=1563–1571|doi=10.1101/gr.1161903|issn=1088-9051|pmid=12840035|pmc=403752}}
97. ^{{Cite journal|last=Glansdorff|first=Nicolas|last2=Xu|first2=Ying|last3=Labedan|first3=Bernard|date=2008-07-09|title=The last universal common ancestor: emergence, constitution and genetic legacy of an elusive forerunner|journal=Biology Direct|volume=3|pages=29|doi=10.1186/1745-6150-3-29|issn=1745-6150|pmc=2478661|pmid=18613974}}
98. ^{{Cite journal|last=Kurland|first=C. G.|last2=Collins|first2=L. J.|last3=Penny|first3=D.|date=2006-05-19|title=Genomics and the irreducible nature of eukaryote cells|journal=Science|volume=312|issue=5776|pages=1011–1014|doi=10.1126/science.1121674|issn=1095-9203|pmid=16709776|bibcode=2006Sci...312.1011K}}
99. ^{{Cite journal|last=Collins|first=Lesley|last2=Penny|first2=David|date=April 2005|title=Complex spliceosomal organization ancestral to extant eukaryotes|journal=Molecular Biology and Evolution|volume=22|issue=4|pages=1053–1066|doi=10.1093/molbev/msi091|issn=0737-4038|pmid=15659557}}
100. ^{{Cite journal|last=Penny|first=David|last2=Collins|first2=Lesley J.|last3=Daly|first3=Toni K.|last4=Cox|first4=Simon J.|date=December 2014|title=The relative ages of eukaryotes and akaryotes|journal=Journal of Molecular Evolution|volume=79|issue=5–6|pages=228–239|doi=10.1007/s00239-014-9643-y|issn=1432-1432|pmid=25179144|bibcode=2014JMolE..79..228P}}
101. ^{{Cite journal|last=Fuerst|first=John A.|last2=Sagulenko|first2=Evgeny|date=2012-05-04|title=Keys to Eukaryality: Planctomycetes and Ancestral Evolution of Cellular Complexity|journal=Frontiers in Microbiology|volume=3|pages=167|doi=10.3389/fmicb.2012.00167|issn=1664-302X|pmc=3343278|pmid=22586422}}
102. ^{{Cite journal|last=Shapiro|first=M. B.|last2=Senapathy|first2=P.|date=1987-09-11|title=RNA splice junctions of different classes of eukaryotes: sequence statistics and functional implications in gene expression|journal=Nucleic Acids Research|volume=15|issue=17|pages=7155–7174|issn=0305-1048|pmid=3658675|pmc=306199}}
103. ^{{Cite journal|last=Senapathy|first=P.|last2=Shapiro|first2=M. B.|last3=Harris|first3=N. L.|date=1990|title=Splice junctions, branch point sites, and exons: sequence statistics, identification, and applications to genome project|journal=Methods in Enzymology|volume=183|pages=252–278|issn=0076-6879|pmid=2314278}}
104. ^{{Cite web|url=https://allofus.nih.gov/|title=National Institutes of Health (NIH) — All of Us|website=allofus.nih.gov|access-date=2019-01-02}}
105. ^{{Cite journal|last=Penny|first=David|last2=Collins|first2=Lesley|date=2005-04-01|title=Complex Spliceosomal Organization Ancestral to Extant Eukaryotes|url=https://academic.oup.com/mbe/article/22/4/1053/1083329|journal=Molecular Biology and Evolution|language=en|volume=22|issue=4|pages=1053–1066|doi=10.1093/molbev/msi091|pmid=15659557|issn=0737-4038}}
106. ^{{Cite journal|last=Caetano-Anollés|first=Derek|last2=Caetano-Anollés|first2=Gustavo|date=2003-07-01|title=An Evolutionarily Structured Universe of Protein Architecture|url=http://genome.cshlp.org/content/13/7/1563|journal=Genome Research|language=en|volume=13|issue=7|pages=1563–1571|doi=10.1101/gr.1161903|issn=1549-5469|pmc=403752|pmid=12840035}}
107. ^{{Cite journal|last=Glansdorff|first=Nicolas|last2=Xu|first2=Ying|last3=Labedan|first3=Bernard|date=2008-07-09|title=The Last Universal Common Ancestor: emergence, constitution and genetic legacy of an elusive forerunner|journal=Biology Direct|volume=3|issue=1|pages=29|doi=10.1186/1745-6150-3-29|issn=1745-6150|pmc=2478661|pmid=18613974}}
108. ^{{Cite journal|date=2007-12-01|title=The origins of modern proteomes|url=https://www.sciencedirect.com/science/article/pii/S0300908407002465|journal=Biochimie|language=en|volume=89|issue=12|pages=1454–1463|doi=10.1016/j.biochi.2007.09.004|pmid=17949885|issn=0300-9084|last1=Kurland|first1=C.G.|last2=Canbäck|first2=B.|last3=Berg|first3=O.G.}}
109. ^{{Cite journal|last=Penny|first=D.|last2=Collins|first2=L. J.|last3=Kurland|first3=C. G.|date=2006-05-19|title=Genomics and the Irreducible Nature of Eukaryote Cells|url=http://science.sciencemag.org/content/312/5776/1011|journal=Science|language=en|volume=312|issue=5776|pages=1011–1014|doi=10.1126/science.1121674|issn=1095-9203|pmid=16709776|bibcode=2006Sci...312.1011K}}
110. ^{{Cite journal|last=Poole|first=A. M.|last2=Jeffares|first2=D. C.|last3=Penny|first3=D.|date=January 1998|title=The path from the RNA world|journal=Journal of Molecular Evolution|volume=46|issue=1|pages=1–17|issn=0022-2844|pmid=9419221}}
111. ^{{Cite journal|last=Forterre|first=Patrick|last2=Philippe|first2=Hervé|date=1999|title=Where is the root of the universal tree of life?|url=https://onlinelibrary.wiley.com/doi/abs/10.1002/%28SICI%291521-1878%28199910%2921%3A10%3C871%3A%3AAID-BIES10%3E3.0.CO%3B2-Q|journal=BioEssays|language=en|volume=21|issue=10|pages=871–879|doi=10.1002/(SICI)1521-1878(199910)21:103.0.CO;2-Q|issn=1521-1878|doi-broken-date=2019-01-10}}
112. ^{{Cite journal|last=Cox|first=Simon J.|last2=Daly|first2=Toni K.|last3=Collins|first3=Lesley J.|last4=Penny|first4=David|date=2014-12-01|title=The Relative Ages of Eukaryotes and Akaryotes|journal=Journal of Molecular Evolution|language=en|volume=79|issue=5–6|pages=228–239|doi=10.1007/s00239-014-9643-y|pmid=25179144|issn=1432-1432|bibcode=2014JMolE..79..228P}}
113. ^{{Cite journal|last=Sagulenko|first=Evgeny|last2=Fuerst|first2=John Arlington|date=2012|title=Keys to eukaryality: planctomycetes and ancestral evolution of cellular complexity|journal=Frontiers in Microbiology|language=English|volume=3|doi=10.3389/fmicb.2012.00167|issn=1664-302X|pmc=3343278|pmid=22586422}}
114. ^{{Cite journal|last=Gilbert|first=Walter|last2=Roy|first2=Scott W.|date=2005-02-08|title=Complex early genes|url=https://www.pnas.org/content/102/6/1986|journal=Proceedings of the National Academy of Sciences|language=en|volume=102|issue=6|pages=1986–1991|doi=10.1073/pnas.0408355101|issn=1091-6490|pmc=548548|pmid=15687506|bibcode=2005PNAS..102.1986R}}
115. ^{{Cite journal|last=Gilbert|first=Walter|last2=Roy|first2=Scott William|date=March 2006|title=The evolution of spliceosomal introns: patterns, puzzles and progress|url=https://www.nature.com/articles/nrg1807|journal=Nature Reviews Genetics|language=en|volume=7|issue=3|pages=211–221|doi=10.1038/nrg1807|pmid=16485020|issn=1471-0064|via=}}
116. ^{{Cite journal|last=Penny|first=David|last2=Collins|first2=Lesley|date=2005-04-01|title=Complex Spliceosomal Organization Ancestral to Extant Eukaryotes|url=https://academic.oup.com/mbe/article/22/4/1053/1083329|journal=Molecular Biology and Evolution|language=en|volume=22|issue=4|pages=1053–1066|doi=10.1093/molbev/msi091|pmid=15659557|issn=0737-4038}}
117. ^{{Cite journal|last=Rogozin|first=Igor B.|last2=Sverdlov|first2=Alexander V.|last3=Babenko|first3=Vladimir N.|last4=Koonin|first4=Eugene V.|date=June 2005|title=Analysis of evolution of exon-intron structure of eukaryotic genes|journal=Briefings in Bioinformatics|volume=6|issue=2|pages=118–134|issn=1467-5463|pmid=15975222}}
118. ^{{Cite journal|last=Sullivan|first=James C.|last2=Reitzel|first2=Adam M.|last3=Finnerty|first3=John R.|date=2006|title=A high percentage of introns in human genes were present early in animal evolution: evidence from the basal metazoan Nematostella vectensis|journal=Genome Informatics. International Conference on Genome Informatics|volume=17|issue=1|pages=219–229|issn=0919-9454|pmid=17503371}}
119. ^{{Cite journal|last=Koonin|first=Eugene V.|last2=Rogozin|first2=Igor B.|last3=Csuros|first3=Miklos|date=2011-09-15|title=A Detailed History of Intron-rich Eukaryotic Ancestors Inferred from a Global Survey of 100 Complete Genomes|journal=PLOS Computational Biology|language=en|volume=7|issue=9|pages=e1002150|doi=10.1371/journal.pcbi.1002150|issn=1553-7358|pmc=3174169|pmid=21935348|bibcode=2011PLSCB...7E2150C}}
120. ^{{Cite journal|last=Gilbert|first=Walter|last2=Roy|first2=Scott W.|date=2005-02-08|title=Complex early genes|url=https://www.pnas.org/content/102/6/1986|journal=Proceedings of the National Academy of Sciences|language=en|volume=102|issue=6|pages=1986–1991|doi=10.1073/pnas.0408355101|issn=1091-6490|pmc=548548|pmid=15687506|bibcode=2005PNAS..102.1986R}}

The coding sequences of eukaryotic genes are split into short coding segments (exons) separated by long non-coding sequences (introns). Because the split gene structure is central to eukaryotic biology, the questions of why, how and when introns entered eukaryotic genes, what intron sequences are, and why eukaryotic genes are split are extremely important.

Dr. Periannan Senapathy proposed the “split gene” theory to explain the origin of introns.[61][62][63] The theory holds that split genes originated from random genetic sequences, and on that basis it provides comprehensive and tenable answers to the key questions concerning split genes, including the origin of exons, introns, splice junctions, branch points and the entire split gene architecture. It also provides possible explanations for the origin of the spliceosomal machinery, the nuclear boundary and the eukaryotic cell.

Background

Genes of all organisms, except bacteria, consist of short protein-coding regions (exons) interrupted by long intervening sequences (introns).[1][2] When a gene is expressed, its DNA sequence is copied into a “primary RNA” sequence by the enzyme RNA polymerase. The “spliceosome” machinery then physically removes the introns from this RNA copy by the process of splicing, leaving only a contiguous series of exons, which becomes the “messenger” RNA (mRNA). The mRNA is then “read” by another piece of cellular machinery, the “ribosome,” to produce the encoded protein. Thus, although introns are not physically removed from the gene itself, the gene’s sequence is read as if the introns never existed.
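The splicing step described above can be sketched in a few lines of Python. This is a toy illustration, not a model of the spliceosome: the sequence is invented, the exon coordinates are supplied by hand rather than recognised from splice signals, and the function name `splice` is hypothetical.

```python
def splice(primary_rna, exons):
    """Join exon segments into one contiguous mRNA string.

    exons is a list of (start, end) pairs, 0-based and end-exclusive,
    given in the order in which they occur in the primary transcript.
    """
    return "".join(primary_rna[start:end] for start, end in exons)


# Invented primary transcript: exons in upper case, introns in lower case.
primary = (
    "AUGGCC"          # exon 1   (positions 0-5)
    "guaaguuuucag"    # intron 1 (positions 6-17)
    "UUUGAA"          # exon 2   (positions 18-23)
    "guaugcccacag"    # intron 2 (positions 24-35)
    "CCCUAA"          # exon 3   (positions 36-41)
)
exons = [(0, 6), (18, 24), (36, 42)]

mrna = splice(primary, exons)
print(mrna)  # the three exons joined, with both introns removed
```

Note that `primary` itself is left unchanged, which mirrors the point made above: the introns are removed from the RNA copy, not from the gene.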

Exons are usually very short, with an average length of about 120 bases (e.g. in human genes). Intron length varies widely, from 10 bases to 500,000 bases within a genome (for example, the human genome), whereas exon length has an upper limit of about 600 bases in most eukaryotic genes. Because exons code for protein sequences, they are very important to the cell, yet they constitute only ~2% of the genes’ sequences. Introns, in contrast, constitute 98% of the genes’ sequences but seem to have few crucial functions, apart from rare instances in which they contain enhancer sequences or developmental regulators.[64][65]

Until Dr. Philip Sharp[66][67] of MIT and Dr. Richard Roberts[68], then at Cold Spring Harbor Laboratory (currently at New England Biolabs), discovered introns[69] within eukaryotic genes in 1977, it was believed that the coding sequence of every gene lay in a single stretch, bounded by a single long open reading frame (ORF). The discovery of introns came as a profound surprise to scientists and instantly raised the questions of how, why and when introns entered eukaryotic genes.

It soon became apparent that a typical eukaryotic gene was interrupted at many locations by introns, dividing the coding sequence into many short exons. Also surprising was that the introns were very long, even as long as hundreds of thousands of bases (see table below). These findings also prompted the questions of why many introns occur within a gene (for example, ~312 introns occur in the human gene TTN), why they are very long, and why exons are very short.

Gene symbol    Gene length (bases)    Longest intron length (bases)
ROBO2          1,743,269              1,160,411
KCNIP4         1,220,183              1,097,903
ASIC2          1,161,877              1,043,911
NRG1           1,128,573              956,398
DPP10          1,403,453              866,399
DMD            2,220,382              319,058
TTN            304,813                95,764
The longest introns in human genes.

It was also discovered that the spliceosome machinery was very large and complex, with ~300 proteins and several snRNA molecules, so the questions extended to the origin of the spliceosome as well. Soon after the discovery of introns, it became apparent that the junctions between exons and introns on either side exhibited specific sequences that directed the spliceosome machinery to the exact base position for splicing. How and why these splice junction signals came into being was another important question to be answered.

Early Speculations

The discovery of introns and the split gene architecture of eukaryotic genes started a new era of eukaryotic biology. The question of why eukaryotic genes had a genes-in-pieces architecture prompted speculation and discussion in the literature almost immediately.

Dr. Ford Doolittle of Dalhousie University published a paper in 1978 in which he expressed his views.[70] He stated that most molecular biologists assumed that the eukaryotic genome arose from a ‘simpler’ and more ‘primitive’ prokaryotic genome rather like that of Escherichia coli. However, this type of evolution would require that introns be introduced into the contiguous coding sequences of bacterial genes. Regarding this requirement, Doolittle said, “It is extraordinarily difficult to imagine how informationally irrelevant sequences could be introduced into pre-existing structural genes without deleterious effects.” He stated, “I would like to argue that the eukaryotic genome, at least in that aspect of its structure manifested as ‘genes in pieces’ is in fact the primitive original form.”

Dr. James E. Darnell from the Rockefeller University also expressed similar views in 1978.[71] He stated, “The differences in the biochemistry of messenger RNA formation in eukaryotes compared to prokaryotes are so profound as to suggest that sequential prokaryotic to eukaryotic cell evolution seems unlikely. The recently discovered non-contiguous sequences in eukaryotic DNA that encode messenger RNA may reflect an ancient, rather than a new, distribution of information in DNA and that eukaryotes evolved independently of prokaryotes.”

However, in an apparent attempt to reconcile their views with the idea that RNA preceded DNA in evolution, and with the concept of the three evolutionary lineages of archaea, bacteria and eukarya, Dr. Doolittle and Dr. Darnell deviated from their original speculation in a paper they published together in 1985.[72] They suggested that the ancestor of all three groups of organisms, the ‘progenote,’ had a genes-in-pieces structure, from which all three lineages evolved. They speculated that the precellular stage had primitive RNA genes containing introns, which were reverse transcribed into DNA and formed the progenote. Bacteria and archaea evolved from the progenote by losing introns, and the ‘urkaryote’ evolved from it by retaining introns. Later, the eukaryote evolved from the urkaryote by evolving a nucleus and acquiring mitochondria from bacteria. Multicellular organisms then evolved from the eukaryote.

These authors were able to predict that the distinctions between the prokaryote and the eukaryote were so profound that the prokaryote to eukaryote evolution was not tenable, and that both had different origins. However, other than the speculations that the precellular RNA genes must have had introns, they did not address the key questions of where from, how or why the introns could have originated in these genes or what their material basis was. There were no explanations of why exons were short and introns were long, how the splice junctions originated, what the structure and sequence of the splice junctions meant, and why eukaryotic genomes were large.

Around the same time that Dr. Doolittle and Dr. Darnell suggested that introns in eukaryotic genes could be ancient, Dr. Colin Blake[73] of the University of Oxford and Dr. Walter Gilbert[74][75] of Harvard University (who won the Nobel Prize, along with Fred Sanger, for inventing a DNA sequencing method) independently published their views on intron origins. In their view, introns originated as spacer sequences that enabled the recombination and shuffling of exons encoding distinct functional domains in order to evolve new genes. Thus, new genes were assembled from exon modules that coded for functional domains, folding regions, or structural elements from preexisting genes in the genome of an ancestral organism, thereby evolving genes with new functions. They did not specify how the exons representing protein structural motifs originated, or how the introns, which do not code for proteins, originated. In addition, even after many years, extensive analysis of several thousand proteins and genes showed that genes exhibit the supposed exon shuffling phenomenon only extremely rarely.[76][77] Furthermore, several molecular biologists questioned the exon shuffling proposal from a purely evolutionary view, for both methodological and conceptual reasons, and in the long run the theory was not borne out.

The Split-gene theory

The hypothesis

Around the same time introns were discovered, Dr. Senapathy was asking how genes themselves could have originated. He surmised that for any gene to come into being, there must have been genetic sequences (RNA or DNA) present in the prebiotic chemistry environment. A basic question he asked was how protein-coding sequences could have originated from primordial DNA sequences at the initial development of the very first cells.

To answer this, he made two basic assumptions: (i) before a self-replicating cell could come into existence, DNA molecules were synthesized in the primordial soup by random addition of the 4 nucleotides without the help of templates and (ii) the nucleotide sequences that code for proteins were selected from these preexisting random DNA sequences in the primordial soup, and not by construction from shorter coding sequences. He also surmised that codons must have been established prior to the origin of the first genes. If primordial DNA did contain random nucleotide sequences, he asked: Was there an upper limit in the coding-sequence lengths, and, if so, did this limit play a crucial role in the formation of the structural features of genes at the very beginning of the origin of genes?

His logic was the following. The average length of proteins in living organisms, both eukaryotic and bacterial, was ~400 amino acids. However, much longer proteins existed, from more than 10,000 up to ~30,000 amino acids, in both eukaryotes and bacteria.[78] In bacterial genes, a coding sequence of thousands of bases existed in a single stretch. In contrast, the coding sequence of eukaryotes existed only in short exon segments of about 120 bases, regardless of the length of the protein. If the coding sequence (open reading frame, ORF) lengths in random DNA sequences were as long as those in bacterial organisms, then long contiguous coding genes could have occurred in random DNA. Whether this was so was not known, as the distribution of ORF lengths in a random DNA sequence had never been studied.

As random DNA sequences could be generated in the computer, Senapathy thought that he could ask these questions and conduct his experiments computationally. Furthermore, when he began studying this question in the early 1980s, the National Biomedical Research Foundation (NBRF) database held just about enough DNA and protein sequence information for the purpose.

Testing the hypothesis

Origin of introns and the split gene structure

Dr. Senapathy first analyzed the distribution of ORF lengths in computer-generated random DNA sequences. Surprisingly, this study revealed an upper limit of about 200 codons (600 bases) on the lengths of ORFs. The shortest ORF (zero bases in length) was the most frequent, and the frequency of ORFs decreased exponentially with increasing length, reaching almost zero at about 600 bases. A plot of the probability of ORF lengths in a random sequence showed the same exponential decrease, tailing off at a maximum of about 600 bases. This “negative exponential” distribution of ORF lengths showed that most ORFs were far shorter than even the maximum of 600 bases.
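The computer experiment described above can be approximated with a short simulation. The following is a minimal sketch, not Senapathy's original program: it assumes that the four bases are equally frequent and independent, treats an ORF as the stop-free run of codons between two consecutive stop codons in one reading frame, and uses hypothetical function names (random_dna, orf_lengths, prob_orf_at_least). Under these assumptions each codon is a stop codon with probability 3/64, so the chance that an ORF reaches n codons is (61/64)^n, which is the exponential decay described in the text.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}


def random_dna(n_bases, rng):
    """Random DNA string with equal, independent base frequencies (assumption)."""
    return "".join(rng.choices("ACGT", k=n_bases))


def orf_lengths(seq, frame=0):
    """Lengths, in codons, of the stop-free runs between consecutive stop codons.

    Only runs bounded by a stop codon on both sides are counted, so two
    adjacent stop codons yield an ORF of length zero.
    """
    lengths = []
    run = 0
    seen_stop = False
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            if seen_stop:
                lengths.append(run)
            seen_stop = True
            run = 0
        else:
            run += 1
    return lengths


def prob_orf_at_least(n_codons):
    """Probability that n consecutive random codons contain no stop codon."""
    return (61 / 64) ** n_codons


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    lengths = orf_lengths(random_dna(1_000_000, rng))
    share_over_200 = sum(1 for n in lengths if n > 200) / len(lengths)
    print("ORFs found:", len(lengths))
    print("mean length (codons):", sum(lengths) / len(lengths))
    print("longest ORF (codons):", max(lengths))
    print("share of ORFs longer than 200 codons:", share_over_200)
    print("theoretical probability of >= 200 codons:", prob_orf_at_least(200))
```

In such a run, ORFs longer than about 200 codons (600 bases) are expected to be very rare, in line with the cutoff described above; the exact figures depend on the seed and on the length of the simulated sequence.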

This finding was surprising because it meant that the coding sequence for a protein of average length (400 amino acids, requiring ~1,200 bases of coding sequence), let alone for longer proteins of thousands of amino acids (requiring >10,000 bases of coding sequence), would not occur in a single stretch in a random sequence. If this were true, a typical gene with a contiguous coding sequence could not originate in a random sequence. Thus, the only possible way that any gene could originate from a random sequence was to split the coding sequence into shorter segments and select these segments from the short ORFs available in the random sequence, rather than to increase the length of an ORF by eliminating numerous consecutive stop codons. This process of choosing short segments of coding sequence from the available ORFs to make a long ORF would lead to a split gene structure.

Under this theory, split genes originated from random DNA sequences by choosing the best of the short coding segments (exons) and joining them by a process of splicing. The intervening intron sequences were left-over vestiges of the random sequences, and were thus earmarked for removal by the spliceosome. These findings indicated that split genes could have originated from random DNA sequences with exons and introns as they are found in today’s eukaryotic organisms. The Nobel Laureate Dr. Marshall Nirenberg, who deciphered the codons, stated that these findings strongly showed that the split gene theory for the origin of introns and the split structure of genes must be valid, and communicated the paper to PNAS.[1] New Scientist covered this publication in “A long explanation for introns”.[80]

Noted molecular biologist Dr. Colin Blake, who proposed the Gilbert-Blake hypothesis in 1979 for the origin of introns (see above), stated in his 1987 publication entitled “Proteins, exons and molecular evolution,” that Senapathy’s split gene theory comprehensively explained the origin of the split gene structure. In addition, he stated that it explained several key questions including the origin of the splicing mechanism:[81]

“Recent work by Senapathy, when applied to RNA, comprehensively explains the origin of the segregated form of RNA into coding and non-coding regions. It also suggests why a splicing mechanism was developed at the start of primordial evolution. He found that the distribution of reading frame lengths in a random nucleotide sequence corresponded exactly to that for the observed distribution of eukaryotic exon sizes. These were delimited by regions containing stop signals, the messages to terminate construction of the polypeptide chain, and were thus non-coding regions or introns. The presence of a random sequence was therefore sufficient to create in the primordial ancestor the segregated form of RNA observed in the eukaryotic gene structure. Moreover, the random distribution also displays a cutoff at 600 nucleotides, which suggests that the maximum size for an early polypeptide was 200 residues, again as observed in the maximum size of the eukaryotic exon. Thus, in response to evolutionary pressures to create larger and more complex genes, the RNA fragments were joined together by a splicing mechanism that removed the introns. Hence, the early existence of both introns and RNA splicing in eukaryotes appears to be very likely from a simple statistical basis. These results also agree with the linear relationship found between the number of exons in the gene for a particular protein and the length of the polypeptide chain.”

Origin of Splice junctions

Under the split gene theory, an exon would be defined by an ORF. It would require that a mechanism to recognize an ORF should have originated. As an ORF is defined by a contiguously coding sequence bounded by stop codons, these stop codon ends had to be recognized by this exon-intron gene recognition system. This system could have defined the exons by the presence of a stop codon at the ends of ORFs, which should be included within the ends of the introns and eliminated by the splicing process. Thus, the introns should contain a stop codon at their ends, which would be part of the splice junction sequences.

If this hypothesis were true, the split genes of today’s living organisms should contain stop codons exactly at the ends of introns. When Dr. Senapathy tested this hypothesis in the splice junctions of eukaryotic genes, he found that the vast majority of splice junctions did contain a stop codon at the ends of introns, immediately outside the exons. In fact, these stop codons were found to form part of the “canonical” AG:GT splicing sequence, with the three stop codons occurring within the strong consensus signals. Thus, the basic split gene theory for the origin of introns and the split gene structure led to the understanding that the splice junctions originated from the stop codons.[2]

Codon    Occurrences in donor signal    Occurrences in acceptor signal
TAA      370                            0
TGA      293                            0
TAG      64                             234
CAG      7                              746
Other    297*                           50
Total    1030                           1030
Frequency of stop codons in donor and acceptor splice-junction sequences.

Surprisingly, all three stop codons (TGA, TAA and TAG) were found one base (G) after the start of introns. These stop codons appear in the consensus canonical donor splice junction AG:GT(A/G)AGT, in which TAA and TGA are the stop codons; TAG also occurs at this position. At the ends of introns, only the stop codon TAG was found besides the codon CAG. The canonical acceptor splice junction is written as (C/T)AG:GT, in which TAG is the stop codon. These consensus sequences show the presence of stop codons at the ends of introns bordering the exons in eukaryotic genes, thus providing a strong corroboration for the split gene theory. Dr. Marshall Nirenberg, who was the referee for this paper, again stated that these observations fully supported the split gene theory for the origin of splice junction sequences from stop codons.[2] New Scientist covered this publication in “Exons, Introns and Evolution”.[21]
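The position of the stop codons within the splice signals can be checked mechanically. The sketch below is illustrative only: the function names are hypothetical, the donor intron starts GTAAGT and GTGAGT are the two forms listed in this article's table of consensus sequences, and the counts are copied from the table of donor and acceptor signals above, with the table's stated total of 1030 used as the denominator.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")


def find_stop_codons(seq):
    """Return (offset, codon) for every stop codon in seq, in any reading frame."""
    seq = seq.upper()
    return [(i, seq[i:i + 3]) for i in range(len(seq) - 2)
            if seq[i:i + 3] in STOP_CODONS]


def stop_codon_share(counts, total):
    """Fraction of tabulated junctions whose listed codon is a stop codon."""
    return sum(n for codon, n in counts.items() if codon in STOP_CODONS) / total


# Intron-side bases of the consensus signals quoted in the text.
DONOR_INTRON_STARTS = ["GTAAGT", "GTGAGT"]
ACCEPTOR_INTRON_ENDS = ["TAG", "CAG"]

# Counts from the table of stop codons in splice-junction sequences.
DONOR_COUNTS = {"TAA": 370, "TGA": 293, "TAG": 64, "CAG": 7, "Other": 297}
ACCEPTOR_COUNTS = {"TAA": 0, "TGA": 0, "TAG": 234, "CAG": 746, "Other": 50}
TABLE_TOTAL = 1030

if __name__ == "__main__":
    for seq in DONOR_INTRON_STARTS + ACCEPTOR_INTRON_ENDS:
        print(seq, find_stop_codons(seq))
    print("donor share:", stop_codon_share(DONOR_COUNTS, TABLE_TOTAL))
    print("acceptor share:", stop_codon_share(ACCEPTOR_COUNTS, TABLE_TOTAL))
```

For both donor forms the stop codon starts at offset 1, that is, one base (G) into the intron. With the tabulated counts, stop codons account for roughly 71% of the donor signals and 23% of the acceptor signals.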

Soon after the discovery of introns by Drs. Philip Sharp and Richard Roberts, it became known that mutations within splice junctions could lead to diseases. Dr. Senapathy showed that mutations in the stop codon bases (canonical bases) caused more diseases than the mutations in non-canonical bases.[1]

Branch point (lariat) sequence

An intermediate stage in the process of eukaryotic RNA splicing is the formation of a lariat structure. It is anchored at an adenosine residue in the intron, between 10 and 50 nucleotides upstream of the 3' splice site. A short conserved sequence (the branch point sequence) functions as the recognition signal for the site of lariat formation. During the splicing process, this conserved sequence towards the end of the intron forms a lariat structure with the beginning of the intron.[82] The final step of the splicing process occurs when the two exons are joined and the intron is released as a lariat RNA.[83]

Several investigators have identified the branch point sequences in different organisms,[22] including yeast, human, fruit fly, rat, and plants. Senapathy found that, in all of these branch point sequences, the codon ending at the branch point adenosine is consistently a stop codon. Notably, two of the three stop codons (TAA and TGA) occur at this position almost always.

Organism                   Lariat consensus sequence
Yeast                      TACTAAC
Human beta globin genes    CTGAC
                           CTAAT
                           CTGAT
                           CTAAC
                           CTCAC
Drosophila                 CTAAT
Rats                       CTGAC
Plants                     (C/T)T(A/G)A(T/C)
Consistent presence of stop codons in branch point signal sequences. Lariat (branch point) sequences have been identified from many different organisms. These sequences consistently show that the codon ending in the branching adenosine is a stop codon, either TAA or TGA.

These findings led Dr. Senapathy to propose that the branch point signal originated from stop codons. The finding that two different stop codons (TAA and TGA) occur within the lariat signal, with the branching point as the third base of the stop codon, corroborates this proposal. As the branching point of the lariat occurs at the last adenine of the stop codon, it is possible that the spliceosome machinery, which originated to eliminate the numerous stop codons from the primary RNA sequence, created an auxiliary stop-codon signal in the form of the lariat sequence to aid its splicing function.[2]
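The observation about the branch point can be restated as a simple check: take the three bases that end at the branching adenosine and test whether they form a stop codon. The sketch below uses the non-degenerate sequences from the table above. The branch position is an assumption of this example (the second-to-last base of each listed sequence), and the names BRANCH_SITES and codon_ending_at are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

# (organism, consensus sequence, 0-based index of the assumed branch adenosine)
BRANCH_SITES = [
    ("Yeast", "TACTAAC", 5),
    ("Human beta globin", "CTGAC", 3),
    ("Human beta globin", "CTAAT", 3),
    ("Human beta globin", "CTGAT", 3),
    ("Human beta globin", "CTAAC", 3),
    ("Human beta globin", "CTCAC", 3),
    ("Drosophila", "CTAAT", 3),
    ("Rats", "CTGAC", 3),
]


def codon_ending_at(seq, index):
    """Return the three bases whose last base sits at the given index."""
    if index < 2 or index >= len(seq):
        raise ValueError("index must leave room for a full codon")
    return seq[index - 2:index + 1]


if __name__ == "__main__":
    for organism, seq, branch in BRANCH_SITES:
        codon = codon_ending_at(seq, branch)
        print(organism, seq, codon, codon in STOP_CODONS)
```

Under this assumption, every listed sequence except CTCAC yields TAA or TGA; CTCAC yields TCA, which is not a stop codon. The degenerate plant consensus (C/T)T(A/G)A(T/C) is omitted here for brevity; its middle bases T(A/G)A expand to TAA or TGA.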

The small nuclear U2 RNA found in splicing complexes is thought to aid splicing by interacting with the lariat sequence.[84] Complementary sequences for both the lariat sequence and the acceptor signal are present in a segment of only 15 nucleotides in U2 RNA. Further, the U1 RNA has been proposed to function as a guide in splicing, identifying the precise donor splice junction by complementary base-pairing. The conserved regions of the U1 RNA thus include sequences complementary to the stop codons. These observations led Senapathy to predict that stop codons had operated in the origin of not only the splice-junction signals and the lariat signal, but also some of the small nuclear RNAs.

Gene regulatory sequences

Dr. Senapathy also proposed that the gene-expression regulatory sequences (promoter and poly-A addition site sequences) could have originated from stop codons. A conserved sequence, AATAAA, exists in almost every gene a short distance downstream from the end of the protein-coding message and serves as a signal for the addition of poly(A) in the mRNA copy of the gene.[85] This poly(A) sequence signal contains a stop codon, TAA. A sequence shortly downstream from this signal, thought to be part of the complete poly(A) signal, also contains the TAG and TGA stop codons.

Eukaryotic RNA-polymerase-II-dependent promoters can contain a TATA box (consensus sequence TATAAA), which contains the stop codon TAA. Bacterial promoter elements exhibit a TATA box at -10 bases with a consensus of TATAAT (which contains the stop codon TAA), and a consensus of TTGACA at -35 bases (which contains the stop codon TGA). Thus, the evolution of the whole RNA processing mechanism seems to have been geared toward the elimination of stop codons, making those stop codons the focal points for RNA processing.

Stop codons are key parts of every genetic element in the eukaryotic gene


Genetic element             Consensus sequence
Promoter                    TATAAT
Donor splice sequence       CAG:GTAAGT
                            CAG:GTGAGT
Acceptor splice sequence    (C/T)9…TAG:GT
Lariat sequence             CTGAC
                            CTAAC
Poly-A addition site        AATAAA
The consistent occurrence of stop codons in genetic elements in eukaryotic genes. The consensus sequences of the different genetic elements in eukaryotic genes are shown; each of them contains one or more stop codons.

Dr. Senapathy’s work based on his split gene theory has revealed that stop codons occur as key parts of every genetic element in eukaryotic genes. The table above shows that the key parts of the core promoter elements, the lariat (branch point) signal, the donor and acceptor splice signals, and the poly-A addition signal consist of one or more stop codons. This finding is presented as strong corroboration of the split gene theory: the underlying reason for the complete split gene paradigm would be the origin of split genes from random DNA sequences, in which the extremely frequent, randomly distributed stop codons were used by nature to define these genetic elements.

Why exons are short and introns are long?

Research based on the split gene theory sheds light on other basic questions about exons and introns. The exons of eukaryotes are generally short (human exons average ~120 bases, and can be as short as 10 bases) and introns are usually very long (averaging ~3,000 bases, and reaching several hundred thousand bases in genes such as RBFOX1, CNTNAP2, PTPRD and DLG2). Dr. Senapathy has provided a plausible answer to these questions, which has so far remained the only explanation. Under the split gene theory, the exons of eukaryotic genes, if they originated from random DNA sequences, have to match the lengths of ORFs in random sequence, and should possibly be around 100 bases long (close to the median length of ORFs in random sequence). The genome sequences of living organisms, for example the human genome, show a comparable average exon length of about 120 bases, and their longest exons are about 600 bases (with few exceptions), the same length as the longest random ORFs.[1][2][3][19]

If split genes originated in random DNA sequences, then introns would be long for several reasons. The stop codons occur in clusters leading to numerous consecutive very short ORFs, and longer ORFs that could be defined as exons would be rarer. Furthermore, the best of the coding sequence parameters for functional proteins would be chosen from the long ORFs in random sequence, which may occur rarely. In addition, the combination of the donor and acceptor splice junction sequences within short lengths of coding sequence segments that would define exon boundaries would occur rarely in a random sequence. These combined reasons would make introns very long compared to the lengths of exons.   

Why eukaryotic genomes are large?

This work also offers an explanation of why genomes are very large (for example, the human genome has three billion bases) and why only a very small fraction of the human genome (~2%) codes for proteins and other regulatory elements.[86][87] If split genes originated from random primordial DNA sequences, a genome would contain a significant amount of DNA in the form of introns. Furthermore, a genome assembled from random DNA containing split genes would also include intergenic random DNA. Thus, the nascent genomes that originated from random DNA sequences had to be large, regardless of the complexity of the organism.

The observation that the genomes of several organisms, such as the onion (~16 billion bases[88]) and the salamander (~32 billion bases[89]), are much larger than that of the human (~3 billion bases[90][91]), although these organisms are no more complex than humans, lends credence to the split gene theory. Furthermore, the findings that the genomes of several organisms are much smaller, although they contain essentially the same number of genes as the human genome, such as those of C. elegans (genome size ~100 million bases, ~19,000 genes)[92] and Arabidopsis thaliana (genome size ~125 million bases, ~25,000 genes),[93] add support to this theory. The split gene theory predicts that the introns of the split genes in these genomes could be a “reduced” (or deleted) form compared to the larger genes with long introns, thus leading to reduced genomes.[1][19] In fact, researchers have recently proposed that these smaller genomes are actually reduced genomes, which adds support to the split gene theory.[94]

Origin of the spliceosomal machinery and the eukaryotic cell nucleus

Dr. Senapathy's research also addresses the origin of the spliceosomal machinery that edits out the introns from the RNA transcripts of genes. If the split genes had originated from random DNA, then the introns would have become an unnecessary but integral part of the eukaryotic genes along with the splice junctions at their ends. The spliceosomal machinery would be required to remove them and to enable the short exons to be linearly spliced together as a contiguously coding mRNA that can be translated into a complete protein. Thus, the split gene theory shows that the whole spliceosomal machinery originated due to the origin of split genes from random DNA sequences, and to remove the unnecessary introns.[1][2]

As noted above, Dr. Colin Blake, the author of the Gilbert-Blake theory for the origin of introns and exons, states, “Recent work by Senapathy, when applied to RNA, comprehensively explains the origin of the segregated form of RNA into coding and noncoding regions. It also suggests why a splicing mechanism was developed at the start of primordial evolution.”

Dr. Senapathy also proposed a plausible mechanistic and functional rationale for why the eukaryotic nucleus originated, a major question in biology.[1][2] If the transcripts of the split genes and the spliced mRNAs were present in a cell without a nucleus, the ribosomes would try to bind to both the unspliced primary RNA transcript and the spliced mRNA, which would result in molecular chaos. A boundary separating the RNA splicing process from mRNA translation would avoid this problem. This is what is found in eukaryotic cells, where the splicing of the primary RNA transcript occurs within the nucleus, and the spliced mRNA is transported to the cytoplasm, where the ribosomes translate it into protein. The nuclear boundary provides a clear separation of primary RNA splicing from mRNA translation.

Origin of the eukaryotic cell

These investigations thus led to the possibility that primordial DNA with essentially random sequence gave rise to the complex structure of the split genes with exons, introns and splice junctions. They also predict that the cells that harbored these split genes had to be complex with a nuclear cytoplasmic boundary, and must have had a spliceosomal machinery. Thus, it was possible that the earliest cell was complex and eukaryotic.[1] [2][3][19] Surprisingly, findings from extensive comparative genomics research from several organisms over the past 15 years are showing overwhelmingly that the earliest organisms could have been highly complex and eukaryotic, and could have contained complex proteins,[95][96][97][98][99][100][101] exactly as predicted by Dr. Senapathy's theory.

The spliceosome is a highly complex machine within the eukaryotic cell, containing ~200 proteins and several snRNPs. In their paper “Complex spliceosomal organization ancestral to extant eukaryotes,”[34] molecular biologists Dr. Lesley Collins and Dr. David Penny state, “We begin with the hypothesis that ... the spliceosome has increased in complexity throughout eukaryotic evolution. However, examination of the distribution of spliceosomal components indicates that not only was a spliceosome present in the eukaryotic ancestor but it also contained most of the key components found in today's eukaryotes. ... the last common ancestor of extant eukaryotes appears to show much of the molecular complexity seen today.” This suggests that the earliest eukaryotic organisms were highly complex and contained sophisticated genes and proteins, as the split gene theory predicts.

The Shapiro-Senapathy algorithm

Based on the split gene theory, Dr. Senapathy developed computational algorithms to detect the donor and acceptor splice sites, exons and complete split genes in a genomic sequence. He developed the position weight matrix (PWM) method, based on the frequency of the four bases at each position of the donor and acceptor consensus sequences in different organisms, to identify the splice sites in a given sequence. Furthermore, he formulated the first algorithm to find exons, based on the requirement that an exon be bounded by an acceptor sequence (at its 5’ end) and a donor sequence (at its 3’ end) and lie within an ORF, and another algorithm to find a complete split gene. These algorithms are collectively known as the Shapiro-Senapathy algorithm (S&S).[42][43]
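The idea of a position weight matrix can be illustrated with a small example. The following is a generic log-odds PWM sketch, not the published Shapiro-Senapathy scoring formula: the training sites are a hypothetical toy set built around the donor consensus CAG:GTAAGT, the background base frequency of 0.25 and the pseudocount of 1 are assumptions, and the function names (build_pwm, score, best_site) are made up for this example.

```python
import math

BASES = "ACGT"

# Hypothetical toy set of aligned donor sites (3 exon bases + 6 intron bases).
TOY_DONOR_SITES = ["CAGGTAAGT", "CAGGTGAGT", "AAGGTAAGT", "CAGGTAAGA", "TAGGTGAGT"]


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds matrix: one {base: weight} dict per aligned position."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all training sites must have the same length")
    pwm = []
    for pos in range(length):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({base: math.log2(counts[base] / total / background)
                    for base in BASES})
    return pwm


def score(pwm, candidate):
    """Sum the position-specific weights of a candidate site."""
    if len(candidate) != len(pwm):
        raise ValueError("candidate length must match the matrix width")
    return sum(column[base] for column, base in zip(pwm, candidate))


def best_site(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    return max((score(pwm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))


if __name__ == "__main__":
    pwm = build_pwm(TOY_DONOR_SITES)
    print("consensus-like site:", score(pwm, "CAGGTAAGT"))
    print("site with GT replaced:", score(pwm, "CAGCCAAGT"))
    print("best window:", best_site(pwm, "TTTTCAGGTAAGTTTTT"))
```

A candidate that matches the frequent base at every position scores higher than one in which the invariant GT is replaced, which is the property a splice-site scanner relies on. A real matrix would be estimated from a large collection of annotated splice sites rather than from five toy sequences.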

This Shapiro-Senapathy algorithm aids in the identification of splicing mutations that cause numerous diseases and adverse drug reactions.[102][103] Using the S&S algorithm, scientists have identified mutations and genes that cause numerous cancers, inherited disorders, immune deficiency diseases and neurological disorders.

The widespread use of this algorithm in biological research and clinical applications worldwide adds credence to the split gene theory, as this algorithm emanated from the split gene theory.  

The algorithm is increasingly used in clinical practice and research, not only to find mutations in known disease-causing genes in patients, but also to discover novel genes that cause different diseases.

Furthermore, it is used in defining the cryptic splice sites and deducing the mechanisms by which mutations in them can affect normal splicing and lead to different diseases. It is also employed in addressing various questions in basic research in humans, animals and plants.

These contributions have influenced major questions in eukaryotic biology and their applications to human medicine. The applications may expand as the fields of clinical genomics and pharmacogenomics scale up their research with very large sequencing projects, such as the All of Us project,[104] which will sequence a million individuals, and with the sequencing of millions of patients in clinical practice and research in the future.

Bacterial genes could have originated from split genes?

Based on the split gene theory, only genes split into short exons and long introns, with a maximum exon length of ~600 bases, could have occurred in random DNA sequences. Genes with long uninterrupted coding sequences, thousands of bases long and in some cases from 10,000 up to 90,000 bases, which occur in many bacterial organisms,[18] would have been practically impossible in random DNA. However, bacterial genes could have originated from split genes by losing introns, which seems to be the only way to arrive at long coding sequences. It is also a more plausible route than lengthening very short random ORFs into very long ORFs by specifically removing the stop codons by mutation.[1][2][3]

Gene size (bases)    Number of genes
5,000 - 10,000       3,029
10,000 - 15,000      492
15,000 - 20,000      131
20,000 - 25,000      39
>25,000              41
Extremely long coding sequences occur as very long ORFs in bacterial genes. Thousands of genes that are longer than 5,000 bases, coding for proteins that are longer than 2,000 amino acids, exist in many bacterial genomes. The longest genes are ~90,000 bases long, coding for proteins ~30,000 amino acids long. Each of these genes occurs as a single stretch of coding sequence (ORF) without any interrupting stop codons or intervening introns. Data taken from “Think big – giant genes in bacteria”.[18]
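The claim that such long ORFs are practically impossible in random DNA can be made concrete with a back-of-the-envelope calculation. The sketch below assumes equally frequent, independent bases, so that each codon avoids being a stop codon with probability 61/64; the function name log10_prob_stop_free is hypothetical. The result is the probability for one given window in one reading frame, and logarithms are used because the values become very small.

```python
import math

P_NON_STOP = 61 / 64  # assumes equally frequent, independent bases


def log10_prob_stop_free(n_bases):
    """log10 of the probability that a window of n_bases has no in-frame stop codon."""
    n_codons = n_bases // 3
    return n_codons * math.log10(P_NON_STOP)


if __name__ == "__main__":
    for n_bases in (600, 5_000, 25_000, 90_000):
        print(n_bases, "bases: log10(probability) =",
              round(log10_prob_stop_free(n_bases), 1))
```

Under these assumptions the probability falls by about one order of magnitude for every 144 bases of stop-free sequence, so a 90,000-base ORF has a probability on the order of 10^-625 per window.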

According to the split gene theory, this process of intron loss could have happened from prebiotic random DNA. These contiguously coding genes could be tightly organized in the bacterial genomes without any introns and be more streamlined. According to Dr. Senapathy, the nuclear boundary that was required for a cell containing split genes in its genome (see the section Origin of the eukaryotic cell nucleus, above) would not be required for a cell containing only contiguously coding genes. Thus, the bacterial cells did not develop a nucleus. Based on split gene theory, the eukaryotic genomes and bacterial genomes could have independently originated from the split genes in primordial random DNA sequences.

Comprehensive corroborating evidence for the split gene theory

If the split gene theory is correct, the structural features of split genes predicted from computer-simulated random sequences should occur in actual eukaryotic split genes. This is what is found in most known split genes of living eukaryotes. Eukaryotic sequences exhibit a nearly perfect negative exponential distribution of ORF lengths, with an upper limit of about 600 bases (with rare exceptions).[1][2][19][3] Likewise, with rare exceptions, the exons of eukaryotic genes fall within this 600-base upper limit.
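The random-sequence side of this comparison can be reproduced with a short simulation. The sketch below is a minimal illustration under stated assumptions, not the procedure used in the cited studies: bases are drawn uniformly and independently, an ORF is taken to be the stop-free stretch between two consecutive in-frame stop codons, and only one reading frame is scanned. The names orf_lengths and simulate are hypothetical.

```python
import random
from collections import Counter

STOPS = {"TAA", "TAG", "TGA"}


def orf_lengths(seq: str, frame: int = 0) -> list[int]:
    """Lengths in bases of stop-free runs between consecutive in-frame stop codons."""
    lengths = []
    run = 0
    seen_stop = False  # skip the partial run before the first stop codon
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            if seen_stop:
                lengths.append(run)
            seen_stop = True
            run = 0
        else:
            run += 3
    return lengths


def simulate(n_bases: int = 1_000_000, seed: int = 0) -> list[int]:
    """Generate a uniform random DNA sequence and return its ORF lengths (assumed model)."""
    rng = random.Random(seed)
    seq = "".join(rng.choices("ACGT", k=n_bases))
    return orf_lengths(seq)


lengths = simulate()
mean_len = sum(lengths) / len(lengths)
share_600_plus = sum(length >= 600 for length in lengths) / len(lengths)

# Coarse histogram in 100-base bins; everything from 600 bases up shares one bin.
bins = Counter(min(length // 100, 6) for length in lengths)
for b in sorted(bins):
    label = f"{b * 100}-{b * 100 + 99}" if b < 6 else "600+"
    print(f"{label:>8}: {bins[b]}")
print(f"mean ORF length: {mean_len:.1f} bases, share at or above 600 bases: {share_600_plus:.5f}")
```

Under this model the counts fall by a roughly constant factor from one bin to the next, which is the negative exponential shape described above, and ORFs of 600 bases or more are rare.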

Moreover, if this theory is correct, exons should be delimited by stop codons, especially at the 3’ ends of exons (that is, the 5’ ends of introns). In most known genes they are indeed delimited by stop codons, more strongly at the 3’ ends of exons and less strongly at the 5’ ends, as predicted.[1][2][19][3] According to the theory, these stop codons are the most important functional parts of both splice junctions (the canonical bases GT:AG). The theory thus provides an explanation for the “conserved” splice junctions at the ends of exons and for the loss of these stop codons along with introns when they are spliced out. If this theory is correct, splice junctions should be randomly distributed in eukaryotic DNA sequences, and they are.[3][22][42][43] The splice junctions present in transfer RNA genes and ribosomal RNA genes, which do not code for proteins and in which stop codons have no functional meaning, should not contain stop codons, and this too is observed. The lariat signal, another sequence involved in the splicing process, also contains stop codons.[1][2][3][19][22][42][43] These findings indicate that the predictions of the split gene theory concerning the structure and function of split genes in random DNA sequences are corroborated by the structural and functional characteristics of split genes in modern eukaryotic organisms.
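The presence of stop codons inside splicing signals can be checked with a small helper. The sketch below is illustrative only: the helper name stop_codons_in is hypothetical, and the three example motifs (two common forms of the intron 5’ consensus beginning with GT, and the yeast branch point consensus TACTAAC) are widely cited consensus sequences chosen here for illustration rather than data taken from the cited studies.

```python
STOPS = ("TAA", "TAG", "TGA")


def stop_codons_in(motif: str) -> list[tuple[int, str]]:
    """Return (offset, codon) for every stop codon found at any position in the motif."""
    motif = motif.upper()
    return [
        (i, motif[i:i + 3])
        for i in range(len(motif) - 2)
        if motif[i:i + 3] in STOPS
    ]


# Example motifs are assumptions chosen for illustration (see the note above).
examples = {
    "intron 5' end, GTAAGT form": "GTAAGT",
    "intron 5' end, GTGAGT form": "GTGAGT",
    "yeast branch point, TACTAAC": "TACTAAC",
}

for name, motif in examples.items():
    print(f"{name}: {stop_codons_in(motif)}")
```

The helper scans every offset rather than a single reading frame, because a short signal on its own does not fix the frame of the surrounding exon.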

If split genes originated from random primordial DNA sequences, as the split gene theory proposes, there should be evidence that they were present in the earliest organisms. Comparative analyses of modern genome data from several living organisms have found that the characteristics of split genes in modern eukaryotes trace back to the earliest organisms on Earth. These studies indicate that the earliest organisms could have contained the intron-rich split genes and complex proteins that occur in today’s living organisms.[105][106][107][108][109][110][111][112][113]

In addition, using another computational method known as “maximum likelihood analysis,” scientists have found that the earliest eukaryotic organisms must have contained the same genes found in today’s living organisms, with an even higher density of introns.[114] Furthermore, comparative genomics of many organisms, including basal eukaryotes (considered to be primitive eukaryotic organisms, such as Amoeboflagellata, Diplomonadida, and Parabasalia), has shown that the intron-rich split genes and fully formed spliceosome of today’s complex organisms were present in the earliest organisms, and that those organisms were extremely complex, with all of the eukaryotic cellular components.[115][116][117][118][119][120]

These findings match the predictions of the split gene theory and provide strong support for it. The theory is corroborated by comparisons of actual eukaryotic gene sequences with computer-generated random DNA sequences. Furthermore, comparative analysis of genome data from many living organisms, carried out by several groups of scientists, shows that the earliest organisms that appeared on Earth had intron-rich split genes coding for complex proteins and cellular components such as those found in modern eukaryotic organisms. Thus, the split gene theory offers an account of the structural and functional features of the split gene architecture as a whole, with strong corroborating evidence.
