Re: How do you know a sequence of nucleotides is part of a gene?

Date: Mon Feb 26 13:44:06 2001
Posted By: Paul Szauter, Staff, Mouse Genome Informatics, The Jackson Laboratory
Area of science: Genetics
ID: 983086356.Ge
Message:

The answer to your question is complex.

I will have to present some background information to round out this answer.

The sequence of DNA isolated from the chromosomes of an organism (the 
genome) contains all of the genes. Different organisms have different 
overall structures to the genome, so the answer to your question is somewhat 
dependent on the organism.

When genes are expressed (transcribed into RNA), the resulting transcript is 
often processed in a variety of ways, the worst of which, from the 
standpoint of your question, is splicing. In this process, parts of the 
primary transcript are spliced out of the RNA, so that a primary transcript 
with the sequence ABCDEFGH might end up with a spliced RNA with the sequence 
ACFH. There is often alternative splicing, so that the same gene could also 
encode a spliced transcript with the sequence ACDH. The parts of the 
sequence retained in the mRNA are called 'exons', and the parts spliced out 
are called 'introns'.
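
To make the letter example concrete, here is a minimal sketch in Python. The function name splice and the one-letter-per-segment model are my own illustration, not a real bioinformatics tool; real splicing acts on stretches of nucleotides, not single letters.

```python
# Toy model of splicing: each letter stands for one segment of the
# primary transcript (an assumption made only for illustration).
def splice(primary, exons):
    """Return the spliced RNA: the segments listed in `exons`, in transcript order."""
    keep = set(exons)
    return "".join(segment for segment in primary if segment in keep)


def introns(primary, exons):
    """Return the segments that were spliced out."""
    keep = set(exons)
    return "".join(segment for segment in primary if segment not in keep)


primary_transcript = "ABCDEFGH"
print(splice(primary_transcript, "ACFH"))   # one spliced form
print(splice(primary_transcript, "ACDH"))   # an alternatively spliced form
print(introns(primary_transcript, "ACFH"))  # what was removed in the first form
```

Note that the same segment (D, for example) is an intron in one spliced form and an exon in the other.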

Once RNA has been spliced, it is messenger RNA or mRNA. This is then 
translated into protein using the three-letter genetic code, in which 
triplets of nucleotides (there are 64) are interpreted as codes for one of 
twenty amino acids or as 'stop'. There are three stop codons and 61 amino 
acid encoding codons, which is the source of one of the tricks used to find 
genes.
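
The codon arithmetic can be checked with a few lines of Python. This is only a counting sketch; the three stop codons listed are those of the standard genetic code, written in RNA letters.

```python
from itertools import product

# The three stop codons of the standard genetic code (RNA alphabet).
STOP_CODONS = {"UAA", "UAG", "UGA"}

# Every possible triplet of the four nucleotides.
all_codons = ["".join(triplet) for triplet in product("ACGU", repeat=3)]
coding_codons = [codon for codon in all_codons if codon not in STOP_CODONS]

print(len(all_codons), "codons in total")
print(len(STOP_CODONS), "stop codons")
print(len(coding_codons), "codons that encode amino acids")
```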

The genetic code doesn't have any 'punctuation' other than 'stop', so you 
can't even look at the sequence of an mRNA and know which 'reading frame' of 
the three possible ones to use. Fortunately, a random sequence of 
nucleotides that doesn't encode a protein will be interrupted by one of the 
three stop codons fairly often. This means that you can look at the sequence 
of an mRNA and easily pick out which reading frame to use, because there 
will only be one long 'open' reading frame free of stop codons until near 
the end of the mRNA sequence.
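
A rough sketch of that reading-frame test, in Python, might look like the following. The function names and the short example sequence are made up for illustration; a real mRNA is far longer, which is what makes the one open frame stand out so clearly.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}  # standard genetic code, RNA alphabet


def longest_open_run(seq, frame):
    """Longest stop-free run of codons in one frame, as (start index, length in codons)."""
    best_start, best_len = frame, 0
    run_start, run_len = frame, 0
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            # A stop codon closes the current run; the next run starts after it.
            run_start, run_len = i + 3, 0
        else:
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
    return best_start, best_len


def best_frame(seq):
    """Pick the frame (0, 1 or 2) with the longest stop-free run; ties go to the lower frame."""
    return max(range(3), key=lambda frame: longest_open_run(seq, frame)[1])


mrna = "AUGCUAACUGACUAG"  # hypothetical toy sequence
for frame in range(3):
    print(frame, longest_open_run(mrna, frame))
print("best frame:", best_frame(mrna))
```

This only looks for stretches free of stop codons. It does not look for a start codon or for anything else a real gene finder would check.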

There is a viral enzyme called reverse transcriptase that allows us to 
'copy' RNA into DNA, so that we can create (using recombinant DNA 
techniques) clone libraries of 'complementary DNA' or 'cDNA' from mRNA 
isolated from various tissues.

One of the tricks to spot genes in genomic sequence is to sequence a very 
large number of randomly chosen cDNA clones, and put the cDNA sequences in a 
computer database. Then, for any given patch of genomic sequence, we can see 
if there is any match to known cDNA sequence. If so, we have found a gene. 
Of course, the cDNA sequence matches only the exons. In some organisms, 
such as humans, the introns are very large relative to the exons, and 
most of the genomic DNA is neither intron nor exon, just junk.
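
Here is a toy sketch of the cDNA matching idea in Python. It is not how real alignment programs work; it simply marks every stretch of genomic sequence that shares a short exact word (a "k-mer") with the cDNA. The function name cdna_hits, the word length k=8, and all of the sequences are assumptions made for the example.

```python
def cdna_hits(genome, cdna, k=8):
    """Return merged (start, end) genomic intervals covered by k-mers shared with the cDNA."""
    cdna_kmers = {cdna[i:i + k] for i in range(len(cdna) - k + 1)}
    intervals = []
    for i in range(len(genome) - k + 1):
        if genome[i:i + k] in cdna_kmers:
            if intervals and i <= intervals[-1][1]:
                # Overlaps or touches the previous hit, so extend it.
                intervals[-1][1] = i + k
            else:
                intervals.append([i, i + k])
    return [tuple(interval) for interval in intervals]


# Hypothetical sequences: two exons separated by an intron, with flanking junk.
exon1 = "ATGGCCAAGCTT"
intron = "GTAAG" + "T" * 10 + "CAG"
exon2 = "CGATCCGAATTC"
genome = "TTTT" + exon1 + intron + exon2 + "AAAA"
cdna = exon1 + exon2  # the spliced message, copied back into DNA

print(cdna_hits(genome, cdna))
```

The two intervals reported correspond to the two exons. The intron between them is exactly the part of the genomic sequence that the cDNA cannot match.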

A second trick to finding genes involves just grinding away at the sequence 
using computer programs that exploit features of coding sequence (sequence 
that is part of an exon). Because there are more codons than there are amino 
acids, most amino acids have a number of different 'synonyms', meaning that 
different codons encode the same amino acid. Most organisms exhibit a known 
'codon bias', meaning that some of the synonyms are used more frequently 
than others. So one way to try to spot genes among genomic sequence is to 
look for stretches of sequence that have two properties: open reading frames 
that are longer than would be expected on average, and a codon bias that 
suggests that they aren't random. In addition, in some organisms, there is a 
pronounced difference in base composition, with genes being relatively 
GC-rich compared to the rest of the genome; junk tends to be AT-rich. Finally, 
the sequences at the intron-exon borders can be recognized, at least 
approximately.
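
Two of these signals are easy to compute. The sketch below (Python, with function names of my own choosing) measures base composition and tallies codon usage in one reading frame. A real gene finder would compare such tallies against the known codon preferences of the organism, which I have not included here.

```python
from collections import Counter


def gc_fraction(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def codon_counts(seq, frame=0):
    """Count how often each codon occurs when reading in the given frame."""
    return Counter(seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))


# Hypothetical stretch of DNA; CTG and CTA are synonyms (both encode leucine).
stretch = "CTGCTGCTA"
print(gc_fraction(stretch))
print(codon_counts(stretch))
```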

These properties are all exploited one way or another by gene-finding 
programs. The annotation of a particular genome involves using a dozen or so 
different gene-finding programs that work in slightly different ways, 
usually displaying the results graphically aligned to the sequence.

An even better trick is to sequence the genome of a species at just the 
right evolutionary distance. For the human genome, the genomic sequence of 
the mouse is perfect. The mouse genome is mostly sequenced, and was used by 
the Celera group to annotate (find the genes within) the human sequence. 
Let's say we have the complete genomic sequence of the mouse and human (we 
are close). For a given patch of human genomic sequence, you can ask your 
computer to take frames of 100 nucleotides and compare them to the entire 
mouse genome sequence, find the best match, and give you the number of bases 
that are identical. When you do this for a region containing a human gene, 
the exons are better than 50% identical to mouse exons, sometimes even 
better than 75%. Within an intron, the sequence identity drops well below 
50%. A plot of the percent identity vs. position along the sequence 
reveals the introns.
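
The windowed comparison can be sketched in Python as follows. To keep it short I assume the human and mouse sequences have already been lined up base for base, with no gaps, which skips the hard part (searching the whole mouse genome for the best match to each window). The function name window_identity and the toy sequences are hypothetical.

```python
def window_identity(human, mouse, window=100, step=100):
    """Percent identity per window for two pre-aligned, equal-length sequences.

    Returns a list of (window start, percent identity) pairs.
    """
    if len(human) != len(mouse):
        raise ValueError("this sketch assumes pre-aligned sequences of equal length")
    results = []
    for start in range(0, len(human) - window + 1, step):
        matches = sum(
            h == m
            for h, m in zip(human[start:start + window], mouse[start:start + window])
        )
        results.append((start, 100.0 * matches / window))
    return results


# Toy sequences with a window of 10 instead of 100 so the example stays readable.
human = "ACGTACGTACGTACGTACGT"
mouse = "ACGTACGTACGTACGAAAAA"
for start, identity in window_identity(human, mouse, window=10, step=10):
    print(start, identity)
```

Windows with high identity would be candidate exons, and windows where the identity falls off would be candidate introns or junk.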

So to annotate the human genome (find the genes) we would:
1. Feed the sequence to the gene-finding programs, which would produce an 
array of slightly different predictions about where the genes are.
2. Align the genome as well as possible against the mouse genome to attempt 
to find genes using the differential evolutionary conservation between 
coding (exonic) and noncoding (intronic and junk) sequence.
3. Align the genome against all known human cDNA sequences to find some 
genes the easy way.
4. Take all the predicted genes and try to find what sort of genes they are 
by matches to the protein sequence database.

You can view the annotated human genome using a crude browser tool on the 
web at:
 http://genome.ucsc.edu/goldenPath/octTracks.html

The link above has the viewer set to part of chromosome 17, which is as good 
a place as any to look. Notice that the known genes are mostly thin lines 
(introns) with small portions as thick lines (exons). Notice that they have 
also placed cDNAs (mRNAs/ESTs) along the map to confirm the gene 
predictions. The very bottom shows repeating elements found by 
'RepeatMasker', which looks for sequence known to be repetitive (junk) in 
the human genome.

I have used a large number of genetic terms in this answer. You can see 
these and others in the MGI Glossary, which includes links to pictures. See 
the MGI Glossary at:
 http://www.informatics.jax.org/userdocs/glossary.shtml

Thank you for your interesting question. I hope that you have fun learning 
more about genetics at this exciting time.

Yours,

Paul Szauter
Mouse Genome Informatics