Re: What is the average percentage of DNA shared by unrelated individuals?

Date: Tue Jun 17 15:56:24 2003
Posted By: Christopher Carlson, Senior Fellow, Dept. of Molecular Biotechnology
Area of science: Genetics
ID: 1054767234.Ge
Message:

Hi Jane,

  Interesting questions.  As you mentioned, if we compare any two genomes,
they will differ from one another roughly once every 1000 base pairs.  This
is usually expressed by saying that 99.9% of the human genome sequence is
identical between individuals, but it is more formally correct to say that
99.9% of the sequence is identical between any two copies of the genome.

  As we all carry two copies of the genome (one from each parent), the
difference between any pair of individuals can be formally expressed as how
often we observe sites where the pair of alleles carried by each individual
are different.  That is, given two alleles (A and a) there are three
possible genotypes (AA, Aa, and aa), so in comparing two individuals there
are six possible comparisons (AA/AA, AA/Aa, AA/aa, Aa/Aa, Aa/aa, and aa/aa)
and in three of these comparisons the genotype is different.
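
  The count above can be checked by brute force.  Here is a minimal
sketch in Python; the variable names are my own and are only for
illustration.

```python
from itertools import combinations_with_replacement

# The three possible genotypes at a site with two alleles, A and a.
GENOTYPES = ["AA", "Aa", "aa"]

# All unordered comparisons between two individuals
# (AA/Aa and Aa/AA are counted once).
comparisons = list(combinations_with_replacement(GENOTYPES, 2))

# Comparisons in which the two individuals carry different genotypes.
different = [(g1, g2) for g1, g2 in comparisons if g1 != g2]

print(len(comparisons), "comparisons:", comparisons)
print(len(different), "differ:", different)
```

  Note that the six comparisons are not equally likely in a real
population, which is why allele frequencies matter.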

  Now for the tricky part.  Given that any pair of genomes differ at one
base pair in 1000, how often will the genotype for any pair of unrelated
individuals be different?  In fact, at this point we don't have enough
information to answer the question.

  Assuming two alleles at each site (A and a) with allele frequency p and q
(=1-p) respectively, and assuming Hardy Weinberg equilibrium, the expected
genotype frequencies are p^2, 2pq, and q^2.  Thus, the probability that
genotypes are different in two individuals is one minus the probability that
the genotypes are the same, or 1-(p^4 + 4p^2q^2 + q^4).  So for any site
with p=q=0.5, we would expect any random pair of individuals to differ
1-(.0625+4*.0625+.0625) = 62.5% of the time.
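
  The same calculation can be written as a short Python sketch.  It
assumes exactly what the paragraph above assumes (two alleles per site,
Hardy Weinberg equilibrium, unrelated individuals); the function names
are hypothetical.

```python
def genotype_freqs(p):
    """Hardy-Weinberg genotype frequencies (AA, Aa, aa) for allele frequency p."""
    q = 1.0 - p
    return (p * p, 2 * p * q, q * q)


def prob_genotypes_differ(p):
    """Probability that two unrelated individuals have different genotypes."""
    # Two individuals match if both are AA, both Aa, or both aa.
    same = sum(f * f for f in genotype_freqs(p))
    return 1.0 - same


print(prob_genotypes_differ(0.5))  # 0.625
```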

  Now all we need to know is the allele frequency at every polymorphic site
in the genome in order to measure how often a random pair of individuals
would have different genotypes.  This is obviously unreasonable at present,
but at least it gives us a boundary condition.  If all polymorphic sites
are p=q=0.5, then the probability that any pair of genomes will differ at
each site is 50% (=2pq), so if two genomes are observed to differ once
every 1000 bp, then under this model we would expect a polymorphic site in
the population once every 500 bp.  Extrapolating from this, if 62.5% of
such polymorphic sites will differ in genotype between any pair of
individuals, we would expect any random pair of individuals to differ in
genotype once every 800 bp.
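
  For anyone who wants to retrace the arithmetic of this boundary case,
here is a rough sketch in Python.  It only restates the 50/50 model from
this paragraph, and the variable names are mine.

```python
def prob_alleles_differ(p):
    """Chance that two randomly drawn genome copies differ at a site (2pq)."""
    return 2 * p * (1 - p)


def prob_genotypes_differ(p):
    """Chance that two unrelated individuals differ in genotype at a site."""
    q = 1 - p
    return 1 - (p**4 + 4 * p**2 * q**2 + q**4)


p = 0.5                        # assumed allele frequency at every polymorphic site
bp_per_copy_difference = 1000  # two genome copies differ once every 1000 bp

# If only half of the polymorphic sites differ between two copies, the
# polymorphic sites themselves must be twice as dense.
bp_per_polymorphic_site = bp_per_copy_difference * prob_alleles_differ(p)

# Only a fraction of those sites differ in genotype between two individuals.
bp_per_genotype_difference = bp_per_polymorphic_site / prob_genotypes_differ(p)

print(bp_per_polymorphic_site, bp_per_genotype_difference)  # 500.0 800.0
```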

  Unfortunately, assuming that all polymorphisms are 50/50 is not
reasonable, and the true distribution of allele frequencies is a function
of the demographic history of a population.  Growing populations will have
an excess of rare polymorphisms, and collapsing populations will have an
excess of common polymorphisms.  The human species seems to have had a
relatively constant population size for theoretical purposes, so common
polymorphisms do exist, but there are considerably more rare polymorphisms
than common.  Genotype is more likely to differ between individuals at
common polymorphisms, so the 50/50 example explored above is likely to
overestimate the frequency of genotypic differences.
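
  To illustrate the per-site part of this point (that genotypes differ
more often at common polymorphisms than at rare ones), here is a small
Python sketch under the same Hardy Weinberg assumptions as before.  The
allele frequencies in the loop are arbitrary examples, not measured
values.

```python
def prob_genotypes_differ(p):
    """Chance that two unrelated individuals differ in genotype at one site."""
    q = 1 - p
    return 1 - (p**4 + 4 * p**2 * q**2 + q**4)


# Walk from a 50/50 polymorphism toward a rare one.
for p in (0.5, 0.7, 0.9, 0.99):
    print(f"minor allele frequency {1 - p:.2f}: "
          f"{prob_genotypes_differ(p):.3f}")
```

  The probability falls from 0.625 at a 50/50 site to about 0.31 when
the rarer allele has a frequency of 10%.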

  Making a long story short, it's probably reasonable to estimate that any
pair of individuals will differ in genotype once every 1000 bp.  Because
siblings share half of their alleles, sibs will differ once every 2000 bp.
Nonetheless, if we extrapolate to a 3 billion base pair genome, this means
that genotype in unrelated individuals will differ at roughly three million
polymorphic sites, but only 1.5 million sites in siblings.  
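
  The extrapolation is simple division, sketched here in Python.  The
genome size and the two rates are the round figures used above; the
names are mine.

```python
GENOME_SIZE_BP = 3_000_000_000  # round figure for the genome size used above


def expected_differing_sites(bp_per_difference):
    """Number of sites expected to differ in genotype across the genome."""
    return GENOME_SIZE_BP // bp_per_difference


print(expected_differing_sites(1000))  # unrelated individuals: 3000000
print(expected_differing_sites(2000))  # siblings: 1500000
```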

  Getting back to the second part of your question, it's pretty well
established that siblings look more like one another than unrelated people
(even when they are different sexes) so it's likely that only a small
fraction of the existing variants determine phenotypic similarity, although
that number is still likely in the thousands.  If a pair of unrelated
individuals show some
phenotypic similarity, they probably are similar at a number of these loci,
but across the genome I would not expect phenotypically similar strangers
to be much more genotypically similar than phenotypically dissimilar
strangers.  There is a caveat to this: allele frequency differences do
exist between ethnic populations, as well as phenotypic differences,
although the fraction of all variation between ethnicities (as opposed to
within ethnicities) is generally estimated at less than 15%.  Thus,
individuals compared within an ethnic group are more likely to be
genotypically (and phenotypically) similar.

  Chris

  If you want to delve further into the theoretical aspects of nucleotide
diversity in populations of two or more individuals, I'm not aware of any
basic texts on the subject.  Here are some rather technical references:

Hartl and Clark, Principles of Population Genetics
Nei, Molecular Evolutionary Genetics
Kimura, The Neutral Theory of Molecular Evolution