MadSci Network: Genetics
The answer to your question is complex, so I will first present some background information. The DNA isolated from the chromosomes of an organism (the genome) contains all of its genes. Different organisms have different overall genome structures, so the answer to your question depends somewhat on the organism.

When genes are expressed (transcribed into RNA), the resulting transcript is often processed in a variety of ways, the worst of which, from the standpoint of your question, is splicing. In this process, parts of the primary transcript are removed, so that a primary transcript with the sequence ABCDEFGH might end up as a spliced RNA with the sequence ACFH. There is often alternative splicing, so that the same gene could also encode a spliced transcript with the sequence ACDH. The parts of the sequence retained in the mRNA are called 'exons', and the parts spliced out are called 'introns'. Once the RNA has been spliced, it is messenger RNA, or mRNA.

The mRNA is then translated into protein using the three-letter genetic code, in which each triplet of nucleotides (a 'codon'; there are 64 possible triplets) is read as one of twenty amino acids or as 'stop'. There are three stop codons and 61 codons that encode amino acids, which is the source of one of the tricks used to find genes. The genetic code doesn't have any 'punctuation' other than 'stop', so nothing in the sequence of an mRNA directly marks which of the three possible 'reading frames' to use. Fortunately, a random sequence of nucleotides that doesn't encode a protein will be interrupted by one of the three stop codons fairly often (about once every 21 codons on average if the four bases are equally frequent, since 3 of the 64 triplets are stops). This means that you can usually pick out the correct reading frame of an mRNA anyway, because only one frame will contain a long 'open' reading frame free of stop codons until near the end of the mRNA sequence.
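The reading-frame trick described above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular gene-finding program works: the function names `longest_open_run` and `best_frame` are made up for this example, the sequence is assumed to be written in the DNA alphabet (T rather than U, as in a cDNA sequence), and only the three forward frames are checked.

```python
# The three stop codons of the standard genetic code, in DNA letters.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_open_run(seq, frame):
    """Return (start, end) of the longest stop-free stretch of codons
    when seq is read in the given frame (0, 1, or 2)."""
    seq = seq.upper()
    best = (frame, frame)   # empty run to begin with
    run_start = frame
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            # A stop codon closes the current open stretch.
            if i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = i + 3
    # Close the final stretch at the end of the last complete codon.
    end = frame + ((len(seq) - frame) // 3) * 3
    if end - run_start > best[1] - best[0]:
        best = (run_start, end)
    return best


def best_frame(seq):
    """Pick the frame whose longest open stretch is the longest overall."""
    runs = {f: longest_open_run(seq, f) for f in range(3)}
    frame = max(runs, key=lambda f: runs[f][1] - runs[f][0])
    return frame, runs[frame]


if __name__ == "__main__":
    toy = "ATGGTAACTGACCTAGCCTAA"   # made-up sequence, not a real gene
    print(best_frame(toy))
```

On a real mRNA the difference between frames is usually dramatic: the correct frame stays open for hundreds of codons, while the other two are broken up by stops every few dozen codons.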
There is a viral enzyme called reverse transcriptase that allows us to 'copy' RNA into DNA, so that we can create (using recombinant DNA techniques) clone libraries of 'complementary DNA' or 'cDNA' from mRNA isolated from various tissues. One of the tricks for spotting genes in genomic sequence is to sequence a very large number of randomly chosen cDNA clones and put the cDNA sequences in a computer database. Then, for any given patch of genomic sequence, we can see whether it matches any known cDNA sequence. If so, we have found a gene. Of course, only the exons will match our cDNA sequence. In some organisms, such as humans, the introns are very large relative to the exons, and most of the genomic DNA isn't part of either introns or exons, just junk.

A second trick for finding genes involves just grinding away at the sequence using computer programs that exploit features of coding sequence (sequence that is part of an exon). Because there are more codons than there are amino acids, most amino acids have a number of different 'synonyms', meaning that different codons encode the same amino acid. Most organisms exhibit a known 'codon bias', meaning that some of the synonyms are used more frequently than others. So one way to try to spot genes in genomic sequence is to look for stretches of sequence that have two properties: open reading frames that are longer than would be expected by chance, and a codon bias that suggests that they aren't random. In addition, in some organisms, there is a pronounced difference in base composition, with genes being relatively GC-rich compared to the rest of the genome; junk tends to be AT-rich. Finally, the sequences at the intron-exon borders can be recognized, sort of. These properties are all exploited one way or another by gene-finding programs.
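Two of the signals just mentioned, base composition and codon usage, are easy to measure. The sketch below is only a rough illustration: the function names, the window size, and the step size are arbitrary choices for this example, and real gene-finding programs combine such signals in far more elaborate statistical models.

```python
from collections import Counter


def gc_fraction(seq):
    """Fraction of bases in seq that are G or C (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window=100, step=50):
    """GC fraction in overlapping windows along the sequence.
    Returns a list of (window_start, gc_fraction) pairs."""
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


def codon_counts(seq, frame=0):
    """Count how often each codon appears when seq is read in one frame.
    Comparing these counts with the organism's known codon usage is one
    way to ask whether a stretch 'looks like' coding sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))


if __name__ == "__main__":
    print(windowed_gc("GGGGAAAA", window=4, step=4))
    print(codon_counts("GCTGCTGCC"))
```

In the toy call, GCT and GCC are both codons for alanine; a stretch that strongly prefers the synonyms an organism favors is more likely to be coding than one that uses all synonyms evenly.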
The annotation of a particular genome involves using a dozen or so different gene-finding programs that work in slightly different ways, usually displaying the results graphically aligned to the sequence.

An even better trick is to sequence the genome of a species at just the right evolutionary distance. For the human genome, the genomic sequence of the mouse is perfect. The mouse genome is mostly sequenced, and was used by the Celera group to annotate (find the genes within) the human sequence. Let's say we have the complete genomic sequence of the mouse and human (we are close). For a given patch of human genomic sequence, you can ask your computer to take windows of 100 nucleotides, compare each one to the entire mouse genome sequence, find the best match, and give you the number of bases that are identical. When you do this for a region containing a human gene, the exons are better than 50% identical to mouse exons, sometimes even better than 75%. Within an intron, the sequence identity drops well below 50%. A plot of the percent identity against position along the sequence reveals the introns.

So to annotate the human genome (find the genes) we would:

1. Feed the sequence to the gene-finding programs, which would produce an array of slightly different predictions about where the genes are.
2. Align the genome as well as possible against the mouse genome to attempt to find genes using the differential evolutionary conservation between coding (exonic) and noncoding (intronic and junk) sequence.
3. Align the genome against all known human cDNA sequences to find some genes the easy way.
4. Take all the predicted genes and try to find out what sort of genes they are by looking for matches in the protein sequence database.

You can view the annotated human genome using a crude browser tool on the web at:

http://genome.ucsc.edu/goldenPath/octTracks.html

The link above has the viewer set to part of chromosome 17, which is as good a place as any to look.
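As an aside on the human-mouse comparison described earlier in this answer, here is a minimal sketch of the percent-identity plot. It makes a big simplifying assumption: the two sequences have already been aligned to equal length (with '-' marking gaps), whereas the real procedure has to search the whole mouse genome for the best match to each window. The function names are made up for this example, and the 50% cutoff is just the rough figure quoted above.

```python
def percent_identity(a, b):
    """Percent of aligned positions where a and b carry the same base.
    Gap characters ('-') never count as a match."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)


def identity_profile(human, mouse, window=100):
    """Percent identity in consecutive, non-overlapping windows.
    Returns a list of (window_start, percent_identity) pairs."""
    if len(human) != len(mouse):
        raise ValueError("aligned sequences must have the same length")
    return [
        (start, percent_identity(human[start:start + window],
                                 mouse[start:start + window]))
        for start in range(0, len(human) - window + 1, window)
    ]


def likely_exon_windows(profile, threshold=50.0):
    """Window starts whose identity exceeds the threshold."""
    return [start for start, pid in profile if pid > threshold]


if __name__ == "__main__":
    # Made-up toy alignment: a conserved stretch, then a diverged one.
    human = "ACGTACGT" + "AAAAAAAA"
    mouse = "ACGTACGA" + "CCCCAACC"
    profile = identity_profile(human, mouse, window=8)
    print(profile)
    print(likely_exon_windows(profile))
```

Plotting the second element of each pair against the first gives the kind of conservation profile in which exons stand out as peaks and introns as valleys.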
In the browser view linked above, notice that the known genes are mostly thin lines (introns) with small portions as thick lines (exons). Notice that they have also placed cDNAs (mRNAs/ESTs) along the map to confirm the gene predictions. The very bottom shows repeating elements found by 'RepeatMasker', which looks for sequence known to be repetitive (junk) in the human genome.

I have used a large number of genetic terms in this answer. You can look up these and others in the MGI Glossary, which includes links to pictures:

http://www.informatics.jax.org/userdocs/glossary.shtml

Thank you for your interesting question. I hope that you have fun learning more about genetics at this exciting time.

Yours,
Paul Szauter
Mouse Genome Informatics