Human Population Genetics: Basic Principles and Analysis Methods. Xu Shuhua and Jin Li, CAS-MPG Partner Institute for Computational Biology


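1 Human Population Genetics: Basic Principles and Analysis Methods
Graduate course, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences. Xu Shuhua and Jin Li, CAS-MPG Partner Institute for Computational Biology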

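2 Course schedule for "Analysis Methods in Human Population Genetics", second semester of the 2008-2009 academic year
Time: Thursdays, 10:00-11:
Location: Zhongke Building (中科大厦), 4th floor, Room 403, Classroom 7
No. | Date | Topic | Instructor
1 | Feb 26 | Principles and applications of Hardy-Weinberg equilibrium tests | Xu Shuhua
2 | Mar 5 | Statistics of genetic polymorphism |
3 | Mar 12 | Methods and applications of phylogenetic tree construction |
4 | Mar 19 | Coalescence: principles and applications | Li Haipeng
5 | Mar 26 | Genetic drift and estimation of effective population size |
6 | Apr 2 | Analysis of population genetic structure (I) |
7 | Apr 9 | Haplotype inference and linkage disequilibrium analysis |
8 | Apr 16 | Analysis of population genetic structure (II) |
9 | Apr 23 | Association analysis in gene mapping (I) | He Yungang
10 | Apr 30 | Association analysis in gene mapping (II) |
11 | May 7 | Linkage disequilibrium patterns in the human genome and selection of tag loci |
12 | May 14 | Methods for analyzing gene expression data | Yan Jun
13 | May 21 | Genetic studies of population history |
- | May 28 | Dragon Boat Festival |
14 | Jun 4 | Forensic testing and analysis methods | Li Shilin
15 | Jun 11 | Principles and methods of tests for natural selection |
16 | Jun 18 | Methods for detecting positive selection in genome-wide genotype data |
17 | Jun 25 | Course examination |
Education Base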

3 Lecture 2: Statistics of Genetic Polymorphism

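4 Lecture 2
The concept of genetic polymorphism
Types of genetic polymorphism
Statistics describing genetic polymorphism
Estimation of the population genetic polymorphism parameter (θ)
Statistical tests using population genetic polymorphism data: Tajima test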

5 Polymorphism Light-morph Jaguar (typical)
Dark-morph or melanistic Jaguar (about 6% of the South American population)

6 Polymorphism

7 56 ethnic groups in China They have different clothes

8 They speak different languages

9 Human Genetic Diversity
Science 319:1100 (2008)

10 Polymorphism Greek: poly = many, and morph = form
Polymorphism is often defined as the presence of more than one genetically distinct type in a single population. Rare variations are not classified as polymorphisms; and mutations by themselves do not constitute polymorphisms.

11 Sexual dimorphism Why is the ratio ~50/50?

12 DNA polymorphism
RFLP (Restriction Fragment Length Polymorphism)
AFLP (Amplified Fragment Length Polymorphism)
RAPD (Random Amplification of Polymorphic DNA)
VNTR (Variable Number Tandem Repeat, or Minisatellite)
STR (Short Tandem Repeat, or Microsatellite)
SNP (Single Nucleotide Polymorphism)
SFP (Single Feature Polymorphism)
CNV (Copy Number Variation)

13 Intuitive statistics: number of alleles; minor allele frequency (MAF)
The more alleles, the larger the diversity. Minor allele frequency (MAF) is the frequency of the less (or least) frequent allele at a given locus in a given population.

14 Human SNP data A Single Nucleotide Polymorphism (SNP, pronounced "snip") is a single-base variant in DNA. Mutation: minor allele frequency (MAF) ≤1%. SNP: MAF >1%. SNPs are the simplest form and the most common source of genetic polymorphism in the human genome (90% of all human DNA polymorphisms).

15 Heterozygosity The fraction of individuals in a population that are heterozygous at a particular locus. It can also refer to the fraction of loci within an individual that are heterozygous.
Observed: $H_o = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}(a_{i1} \neq a_{i2})$, where n is the number of individuals in the population, and $a_{i1}$, $a_{i2}$ are the alleles of individual i at the target locus.
Expected: $H_e = 1 - \sum_{i=1}^{m} f_i^2$, where m is the number of alleles at the target locus, and $f_i$ is the allele frequency of the ith allele at the target locus.
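The two definitions on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the five-individual sample are hypothetical, and the expected value uses the plain 1 minus the sum of squared allele frequencies, without any small-sample correction.

```python
from collections import Counter


def observed_heterozygosity(genotypes):
    """Fraction of individuals whose two alleles differ (H_o)."""
    return sum(a1 != a2 for a1, a2 in genotypes) / len(genotypes)


def expected_heterozygosity(genotypes):
    """H_e = 1 - sum of squared allele frequencies at the locus."""
    counts = Counter(allele for pair in genotypes for allele in pair)
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


# Hypothetical sample: five diploid individuals typed at one biallelic locus.
genotypes = [("A", "A"), ("A", "G"), ("G", "G"), ("A", "G"), ("A", "A")]
h_obs = observed_heterozygosity(genotypes)
h_exp = expected_heterozygosity(genotypes)
print(h_obs, h_exp)
```

In this made-up sample two of five individuals are heterozygous, while the allele frequencies (0.6 and 0.4) would predict a somewhat higher value under random mating.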

16 Heterozygosity-related issues
Heterozygosity and HWD; comparison of Ho and He; gene diversity

17 Population Mutation Rate (θ)
Under mutation-drift equilibrium: θ = 4Neμ for autosomes; θ = Neμ for the Y chromosome and mtDNA; θ = 3Neμ for the X chromosome; θautosome > θX > θY
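As a rough numerical illustration of these scaling factors, the sketch below evaluates θ for each chromosome type. The values of Ne and μ are illustrative assumptions, not figures from the lecture, and the same μ is used for every chromosome type purely to show the effect of the factor 4, 3, or 1.

```python
def population_mutation_rate(ne, mu, chromosome="autosome"):
    """theta = c * Ne * mu, with c depending on the mode of inheritance."""
    factor = {"autosome": 4, "X": 3, "Y": 1, "mtDNA": 1}[chromosome]
    return factor * ne * mu


# Assumed, illustrative values: Ne = 10,000 and mu = 2.5e-8 per site per generation.
ne, mu = 10_000, 2.5e-8
thetas = {
    c: population_mutation_rate(ne, mu, c)
    for c in ("autosome", "X", "Y", "mtDNA")
}
for chrom, theta in thetas.items():
    print(f"{chrom}: theta = {theta:.2e} per site")
```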

18 Estimators of θ Number of segregating sites (θK);
Average pairwise differences (θ∏); Number of alleles (θE); Mean number of mutations since the MRCA (θΩ); Singleton.

19 Number of segregating sites (K)
Under the infinite site model, K is equal to the number of mutations since the most recent common ancestor of the sequences in the sample. Therefore, K has a clear biological meaning. However, K depends on the sample size.

20 Normalized K
$\theta_K = K / a_n$, where $a_n = \sum_{i=1}^{n-1} 1/i$ and n is the sample size.

21 Under the neutral Wright-Fisher model with constant effective population size, $E(K) = a_n \theta$ with $a_n = \sum_{i=1}^{n-1} 1/i$, so that $E(\theta_K) = \theta$.
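The normalization of K by the harmonic sum a_n (the same K/an that appears later in the Tajima test) can be sketched as follows. The function names and the example numbers are hypothetical.

```python
def harmonic_a(n):
    """a_n = 1 + 1/2 + ... + 1/(n-1) for a sample of n sequences."""
    return sum(1.0 / i for i in range(1, n))


def theta_k(k, n):
    """Estimate theta from the number of segregating sites K."""
    return k / harmonic_a(n)


# Hypothetical example: 10 segregating sites observed in 5 sequences.
estimate = theta_k(10, 5)
print(estimate)
```

Because a_n grows with n, the same K in a larger sample yields a smaller estimate, which is how the normalization removes the dependence of K on sample size.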

22 The properties of θK θK is independent of sample size.
However, the usefulness of θK is not clear under other population genetic models, such as those with natural selection. θK is sensitive to the number of rare alleles, or mutants of low frequency.

23 How many common SNPs in the human genome?
Common SNPs: minor allele frequency (MAF) >0.05
Suppose we have 50 samples each of Africans, Europeans, and Asians
Theta = 1.2/kb for the African population
Theta = 0.8/kb for the European and Asian populations
Autosome length (L) = 2.68 billion bp
We expect 9.8 million common SNPs in 50 African samples
We expect 6.5 million common SNPs in 50 European samples
We expect 6.5 million common SNPs in 50 Asian samples
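The slide's formula did not survive in the transcript, so the following is a hedged sketch of one calculation that reproduces its figures. It assumes the neutral expectation that the number of sites carrying a mutation of size i is θ/i, treats 50 diploid samples as 100 chromosomes, and counts mutations of size 5 through 95 as common. These choices are assumptions, not the authors' stated method.

```python
def expected_common_snps(theta_per_kb, length_bp, n_chromosomes, min_count):
    """Expected SNPs whose minor allele count is at least min_count.

    Assumes E[number of sites with a mutation of size i] = theta / i
    under the neutral model, summed over i = min_count .. n - min_count.
    """
    theta_total = theta_per_kb / 1000.0 * length_bp
    weight = sum(
        1.0 / i for i in range(min_count, n_chromosomes - min_count + 1)
    )
    return theta_total * weight


L = 2.68e9  # autosome length in bp, from the slide
african = expected_common_snps(1.2, L, n_chromosomes=100, min_count=5)
european = expected_common_snps(0.8, L, n_chromosomes=100, min_count=5)
print(f"{african / 1e6:.1f} million, {european / 1e6:.1f} million")
```

Under these assumptions the result rounds to the 9.8 million and 6.5 million quoted on the slide.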

24 ThetaK=1.2/kb

25 ThetaK=0.8/kb

26 Average pairwise differences (∏)
Also known as sequence diversity: the mean number of nucleotide differences between two sequences, averaged over all pairs of sequences in the sample.
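A minimal sketch of this statistic for a set of aligned sequences follows. The function name and the four toy sequences are hypothetical, and the sequences are assumed to be of equal length with no gaps or missing data.

```python
from itertools import combinations


def pairwise_diversity(seqs):
    """Mean number of differing sites over all pairs of aligned sequences."""
    pairs = list(combinations(seqs, 2))
    total_diffs = sum(
        sum(a != b for a, b in zip(s, t)) for s, t in pairs
    )
    return total_diffs / len(pairs)


# Hypothetical alignment of four sequences, eight sites each.
seqs = ["ACGTACGT", "ACGTACGA", "ACCTACGT", "ACGTACGT"]
pi_hat = pairwise_diversity(seqs)
print(pi_hat)
```

Dividing the result by the sequence length would give the per-site value that is usually reported as nucleotide diversity.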

27 The properties of ∏ As a measure of genetic variation, ∏ has a clear biological meaning that does not depend on the underlying evolutionary process. In comparison to θK, it is insensitive to rare alleles, or mutants of low frequency. ∏ is a useful measure of persistent genetic variation, and of neutral genetic variation when purifying selection is operating. However, because its variance is considerably larger than that of θK, it is not as good as θK for a neutral locus.

28 Nucleotide Diversity
Locus (length) | π (×10^-4) | θ (×10^-4) | μ (×10^-9) | Ne | Reference
APOE (5.5kb) | | (S) | | ,300 | Fullerton et al. 2000
Chr.1 (10kb) | | (S) | | ,000 | Yu et al. 2001
Chr.22 (10kb) | | (S) | | ,400 | Zhao et al. 2000
X chr. (10.2kb) | | (S) | | ,300 | Kaessmann et al. 1999
X chr. (4.2kb) | | (ML) | | ,700 | Harris & Hey 1999
Y chr. (64kb) | | (S) | | ,100 | Thomson et al. 2000
mtDNA (15.4kb) | | (π) | | ,200 | Ingman et al. 2000
Alu insertions | | | | ,500 | Sherry et al. 1997

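29 Number of alleles Ewens (1972) shows that under the infinite-alleles model the expected number of alleles k in a sample of size n is $E(k) = \sum_{i=0}^{n-1} \frac{\theta}{\theta + i}$. An estimate of θ can be obtained by solving the above equation for θ with E(k) replaced by the observed k. The estimate is known as Ewens’s estimator θE.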
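Solving for θ has no closed form, so a numerical root finder is needed. The sketch below uses simple bisection on Ewens's expectation E(k) = θ/θ + θ/(θ+1) + ... + θ/(θ+n-1), which increases monotonically in θ. The function names and the example values (5 alleles in a sample of 20) are hypothetical.

```python
def expected_alleles(theta, n):
    """Ewens's expected number of alleles in a sample of size n."""
    return sum(theta / (theta + i) for i in range(n))


def ewens_theta(k, n, iterations=200):
    """Solve expected_alleles(theta, n) = k for theta by bisection."""
    if not 1 < k < n:
        raise ValueError("a finite positive estimate requires 1 < k < n")
    lo, hi = 1e-12, 1.0
    while expected_alleles(hi, n) < k:  # expand until the root is bracketed
        hi *= 2.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if expected_alleles(mid, n) < k:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


theta_e = ewens_theta(k=5, n=20)
print(theta_e, expected_alleles(theta_e, 20))
```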

30 The properties of θE Under the infinite-alleles model, θE is about the best estimator one can devise. However, θE is a slightly upward-biased estimator, particularly when θ is large.

31 Mean number of mutations since the MRCA (Ω)
The mean number Ω of mutations since the most recent common ancestor (MRCA) of a sample is another intuitive summary statistic, but it is seldom used in practice. This is probably partly because its use requires knowing the ancestral nucleotide at each segregating site, and partly because its statistical properties are not well understood.

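32 Let ωl be the number of mutations in sequence l since the MRCA.
Then the average is given by $\Omega = \frac{1}{n}\sum_{l=1}^{n} \omega_l$. Note that a mutation of size i (one carried by i of the n sequences) is counted once in each of those i sequences; writing $\xi_i$ for the number of mutations of size i, we therefore have $\Omega = \frac{1}{n}\sum_{i=1}^{n-1} i\,\xi_i$.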

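33 It follows that, since $E(\xi_i) = \theta / i$ under neutrality, $E(\Omega) = \frac{1}{n}\sum_{i=1}^{n-1} i \cdot \frac{\theta}{i} = \frac{n-1}{n}\,\theta$, so that $\theta_\Omega = \frac{n}{n-1}\,\Omega$ is an unbiased estimator of θ.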

34 Singleton mutations The number ξ1 of mutations of size 1 in a sample is of special interest because it captures mostly the recent mutations in a sample. According to Fu and Li (1993), $E(\xi_1) = \theta$.
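The estimators based on K, ∏, Ω, and singletons can all be written as functions of the counts ξi of mutations of size i. The sketch below assumes that the ancestral state is known at every site (an unfolded spectrum) and uses the standard neutral expectation E(ξi) = θ/i. The function name, the dictionary keys, and the example spectrum are hypothetical.

```python
from math import comb


def sfs_estimators(xi):
    """Estimate theta four ways from an unfolded frequency spectrum.

    xi[i - 1] is the number of mutations carried by exactly i of the
    n sequences, for i = 1 .. n - 1.
    """
    n = len(xi) + 1
    a_n = sum(1.0 / i for i in range(1, n))
    k = sum(xi)  # number of segregating sites
    pi = sum(i * (n - i) * x for i, x in enumerate(xi, start=1)) / comb(n, 2)
    omega = sum(i * x for i, x in enumerate(xi, start=1)) / n
    return {
        "theta_K": k / a_n,
        "theta_Pi": pi,
        "theta_Omega": n * omega / (n - 1),
        "theta_singleton": float(xi[0]),
    }


# Hypothetical spectrum for n = 5 that exactly matches theta / i with theta = 12.
estimates = sfs_estimators([12, 6, 4, 3])
print(estimates)
```

With a spectrum that matches the neutral expectation exactly, all four estimates coincide. Differences between them in real data are what the tests later in the lecture exploit.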

35 Classify the above summary statistics

36 ∏0,0 = θK; ∏1,1 = θ∏; ∏1,0 = θΩ

37 Weight of ∏k,l statistics

38 Distribution of estimators of θ (figure; curves for ξi, Ω, ∏, and θK)
A sample of 100 from a population with θ=5.

39 Neutral hypothesis as the null model
Whether a locus has been evolving under natural selection is often of interest if the locus represents a gene or is linked to one. As is typical in many branches of science, a simpler explanation of a phenomenon is preferred unless there is strong evidence to suggest otherwise. In population genetics, the neutral hypothesis of evolution is arguably simpler than any other hypothesis and is much better understood statistically. As a result, it is now generally used as the null model for analyzing polymorphism. A significant deviation from the null model may signal the presence of forces that are absent from the null model, or of factors that it over-simplifies.

40 Statistical tests using estimators of θ
There are several ways to construct statistical tests of whether the null model adequately explains the observed amount and pattern of polymorphism. Many summary statistics (estimators of θ) have quite different expectations when the null model is violated. This offers an opportunity for testing by considering the difference between two measures of polymorphism.

41 Suppose L1 and L2 are two different summary statistics such that E(L1) = E(L2) under the hypothesis of strict neutrality. Then one way to test the null hypothesis of strict neutrality is to use the normalized difference $T = \frac{L_1 - L_2}{\sqrt{\mathrm{Var}(L_1 - L_2)}}$ as the test statistic. Normalization is intended to minimize the effect of unknown parameter(s) so that the resulting test is more rigorous. Note that Var(L1−L2) is a function of θ, so its value needs to be estimated.

42 Although every pair of statistics L1 and L2 can be used to construct a test as long as E(L1) = E(L2) and Var(L1−L2) can be computed, such a test is useful only if the values of L1 and L2 are likely to differ when the locus under study departs from neutrality. Unfortunately, the distribution of a test statistic of the form above is not well approximated by any standard distribution, so critical values are commonly obtained from a large number of simulated samples. In practice, the best way to apply such tests is therefore to use a computer package that implements them. We will accordingly focus on the rationale of several tests rather than the details of their computation.

43 Tajima test $D = \frac{\Pi - K/a_n}{\sqrt{\widehat{\mathrm{Var}}(\Pi - K/a_n)}}$, where the parameter θ required for computing the variance is estimated by K/an.
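For reference, the statistic can be computed from the sample size n, the number of segregating sites K, and the mean pairwise difference ∏. The sketch below uses the variance constants published by Tajima (1989). The function name and the example inputs are hypothetical, and the code does not assess significance, which requires critical values from tables or simulation.

```python
from math import sqrt


def tajimas_d(n, k, pi):
    """Tajima's D from sample size n, segregating sites k, and mean
    pairwise difference pi (constants as in Tajima 1989)."""
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Estimated variance of (pi - k / a1), with theta replaced by k / a1.
    variance = e1 * k + e2 * k * (k - 1)
    return (pi - k / a1) / sqrt(variance)


# Hypothetical inputs: 10 sequences with 10 segregating sites.
d_rare = tajimas_d(10, 10, 2.0)    # every mutation is a singleton
d_common = tajimas_d(10, 10, 5.0)  # mutations at intermediate frequency
print(d_rare, d_common)
```

An excess of rare variants pulls ∏ below K/an and gives a negative D, while an excess of intermediate-frequency variants gives a positive D.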

44 Rationale of the Tajima test
Since K ignores the frequency of mutants, it is strongly affected by the existence of deleterious alleles, which are usually kept at low frequencies. In contrast, ∏ is not much affected by the existence of deleterious alleles because it takes the frequency of mutants into consideration. Therefore, a D value that is significantly different from 0 suggests that the null hypothesis should be rejected.

45 Indication of Tajima’s D
When a population has experienced selective sweeps (or population growth), K/an will likely be larger than ∏, resulting in a negative value of D. When a population has been under balancing selection (or is structured, with sampling from many populations), K/an will likely be smaller than ∏, resulting in a positive value of D.

46 Tajima’s D Expectations
Neutrality: D = 0
Balancing selection: D > 0 (divergence of alleles, π, increases)
Purifying or positive selection: D < 0 (divergence of alleles decreases)
Bottleneck: D > 0 (S decreases)
Population expansion: D < 0 (divergence of alleles decreases: many low-frequency alleles)

47 Commonly used software: DnaSP, PAML, Arlequin

