MadSci Network: Genetics
Hi Jane,

Interesting questions. As you mentioned, if we compare any two genomes, they will differ from one another roughly once every 1000 base pairs. This is usually expressed by saying that 99.9% of the human genome sequence is identical between individuals, but it is more formally correct to say that 99.9% of the sequence is identical between any two copies of the genome. As we all carry two copies of the genome (one from each parent), the difference between any pair of individuals can be formally expressed as how often we observe sites where the pair of alleles carried by each individual is different. That is, given two alleles (A and a) there are three possible genotypes (AA, Aa, and aa), so in comparing two individuals there are six possible comparisons (AA/AA, AA/Aa, AA/aa, Aa/Aa, Aa/aa, and aa/aa), and in three of these comparisons the genotypes are different.

Now for the tricky part. Given that any pair of genomes differ at one base pair in 1000, how often will the genotype for any pair of unrelated individuals be different? In fact, at this point we don't have enough information to answer the question. Assuming two alleles at each site (A and a) with allele frequencies p and q (= 1-p) respectively, and assuming Hardy-Weinberg equilibrium, the expected genotype frequencies are p^2, 2pq, and q^2. Thus, the probability that the genotypes are different in two individuals is one minus the probability that the genotypes are the same, or 1 - (p^4 + 4p^2q^2 + q^4). So for any site with p = q = 0.5, we would expect any random pair of individuals to differ 1 - (0.0625 + 4*0.0625 + 0.0625) = 62.5% of the time. Now all we need to know is the allele frequency at every polymorphic site in the genome in order to measure how often a random pair of individuals would have different genotypes. This is obviously unreasonable at present, but at least it gives us a boundary condition.
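As a rough illustration of the calculation above, here is a minimal Python sketch. It assumes a single site with two alleles, Hardy-Weinberg genotype frequencies, and two unrelated individuals whose genotypes are drawn independently. The function names `genotype_frequencies` and `prob_genotypes_differ` are made up for this example and do not come from any library.

```python
def genotype_frequencies(p):
    """Hardy-Weinberg frequencies of genotypes (AA, Aa, aa), where p is the frequency of allele A."""
    q = 1.0 - p
    return (p * p, 2.0 * p * q, q * q)


def prob_genotypes_differ(p):
    """Probability that two unrelated individuals carry different genotypes at one biallelic site."""
    # Both individuals share a genotype with probability p^4 + 4p^2q^2 + q^4,
    # i.e. the sum of the squared genotype frequencies.
    prob_same = sum(f * f for f in genotype_frequencies(p))
    return 1.0 - prob_same


# Compare a common polymorphism (p = 0.5) with progressively rarer ones.
for p in (0.5, 0.9, 0.99):
    print(f"p = {p:.2f}: genotypes differ with probability {prob_genotypes_differ(p):.4f}")
```

At p = 0.5 the function returns 0.625, matching the hand calculation, and the value falls as one allele becomes rarer (about 0.31 at p = 0.9).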
If all polymorphic sites are p = q = 0.5, then the probability that any pair of genomes will differ at each site is 50% (= 2pq), so if two genomes are observed to differ once every 1000 bp, then under this model we would expect a polymorphic site in the population once every 500 bp. Extrapolating from this, if 62.5% of such polymorphic sites differ in genotype between any pair of individuals, we would expect any random pair of individuals to differ in genotype once every 800 bp (500 / 0.625).

Unfortunately, assuming that all polymorphisms are 50/50 is not reasonable, and the true distribution of allele frequencies is a function of the demographic history of a population. Growing populations will have an excess of rare polymorphisms, and collapsing populations will have an excess of common polymorphisms. The human species seems to have had a relatively constant population size for theoretical purposes, so common polymorphisms do exist, but there are considerably more rare polymorphisms than common ones. Genotype is more likely to differ between individuals at common polymorphisms, so the 50/50 example explored above is likely to overestimate the frequency of genotypic differences.

Making a long story short, it's probably reasonable to estimate that any pair of individuals will differ in genotype once every 1000 bp. Because siblings share half of their alleles, sibs will differ once every 2000 bp. Nonetheless, if we extrapolate to a 3 billion base pair genome, this means that genotype in unrelated individuals will differ at roughly three million polymorphic sites, but at only 1.5 million sites in siblings.

Getting back to the second part of your question, it's pretty well established that siblings look more like one another than unrelated people do (even when they are different sexes), so it's likely that only a small fraction of the existing variants determine phenotypic similarity, although that number is still likely in the thousands.
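The boundary-condition arithmetic and the genome-wide extrapolation above can be sketched in a few lines of Python. This is only an illustration under the stated assumptions: every polymorphic site has the same allele frequency p, the genome is taken as 3 billion bp, and the spacings of 1000 bp (unrelated individuals) and 2000 bp (siblings) are the rough estimates given in the text. The names `boundary_condition` and `differing_sites` are hypothetical.

```python
GENOME_SIZE_BP = 3_000_000_000  # approximate genome size used in the text


def boundary_condition(p, bp_per_pairwise_difference=1000.0):
    """Return (bp per polymorphic site, bp per genotype difference) if every site had allele frequency p."""
    q = 1.0 - p
    # Probability that two random genome copies differ at a polymorphic site.
    heterozygosity = 2.0 * p * q
    # If copies differ once per 1000 bp and differ at a fraction 2pq of
    # polymorphic sites, polymorphic sites are spaced 1000 * 2pq bp apart.
    bp_per_polymorphic_site = bp_per_pairwise_difference * heterozygosity
    # Probability that two individuals have different genotypes at such a site.
    prob_differ = 1.0 - (p**4 + 4.0 * p**2 * q**2 + q**4)
    return bp_per_polymorphic_site, bp_per_polymorphic_site / prob_differ


def differing_sites(bp_per_genotype_difference, genome_size_bp=GENOME_SIZE_BP):
    """Number of sites expected to differ in genotype across the whole genome."""
    return genome_size_bp // bp_per_genotype_difference


site_spacing, genotype_spacing = boundary_condition(0.5)
print(f"polymorphic site every {site_spacing:.0f} bp")
print(f"genotype difference every {genotype_spacing:.0f} bp")
print(f"unrelated pair: {differing_sites(1000):,} differing sites")
print(f"siblings:       {differing_sites(2000):,} differing sites")
```

With p = 0.5 this reproduces the 500 bp and 800 bp figures, and the genome-wide counts come out to 3,000,000 sites for unrelated individuals and 1,500,000 for siblings.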
If a pair of unrelated individuals show some phenotypic similarity, they probably are similar at a number of these loci, but across the genome I would not expect phenotypically similar strangers to be much more genotypically similar than phenotypically dissimilar strangers. There is a caveat to this: allele frequency differences do exist between ethnic populations, as do phenotypic differences, although the fraction of all variation that lies between ethnicities (as opposed to within ethnicities) is generally estimated at less than 15%. Thus, individuals compared within an ethnic group are more likely to be genotypically (and phenotypically) similar.

Chris

If you want to delve further into the theoretical aspects of nucleotide diversity in populations of two or more individuals, I'm not aware of any basic texts on the subject. Here are some rather technical references:

Hartl and Clark, Principles of Population Genetics
Nei, Molecular Evolutionary Genetics
Kimura, The Neutral Theory of Molecular Evolution