Gil McVean Department of Statistics


1 Bioinformatics
Gil McVean, Department of Statistics

2 What is it to be a human?

3 What is it to be an individual?
Species               Diversity (percent)
Humans
Chimpanzees
Drosophila simulans   2
E. coli               5
HIV1                  30
Photos from UN photo gallery

4 Is it your genes?

5 Is it your transcripts?

6 Is it your proteins?

7 Is it your protein interactions?

8 Is it your systems?

9 Bioinformatics and genome biology
Bioinformatics is the analytical wing of genome biology
It concerns itself with large amounts of data (more than you can look at!)
It uses computers and efficient algorithms
It is
  Data assembly
  Data summary
  Data modelling
  Data analysis

10 The raw material

11 The output

12 Classical bioinformatics I: DNA and protein sequence alignment

13 Classical bioinformatics II: Genome assembly

14 Classical bioinformatics III: Gene finding

15 Classical bioinformatics IV: Protein structure prediction

16 Bioinformatics of genetic variation
An area of considerable current attention is human genetic variation
The aim of current experiments is to map the genetic basis of human phenotypic variation
  Disease susceptibility
  Normal variation
It is challenging because of
  The scale of the data
  The structure of the data
  The underlying processes that shape variation
Bioinformatics is needed to
  Assemble, collate, check and summarise data
  Model the data
  Make inferences

17 What does the data look like?
Single Nucleotide Polymorphisms (SNPs)
Insertion-Deletion Polymorphisms (INDELs)
TGCTTGGCAGGGCAGACTGACTGT
TGCTTGGCAGGGCAGACTGACTGT
TGCATGGCAGGGCAG-CTGACTGT
TGCATGGCAGGGCAG-CTGACTGT
TGCATGGCAGGGCAGACTGACTGT
TGCATGGCAGGGCAGACTGACTGT
   SNP         INDEL
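
As a minimal sketch of how the two kinds of variant can be told apart, the code below scans the six aligned sequences from the slide column by column. The function name `classify_variant_columns` and the rule "a variable column containing a gap is an INDEL, any other variable column is a SNP" are illustrative assumptions, not a method taken from the presentation.

```python
def classify_variant_columns(alignment):
    """Return (position, kind, alleles) for every variable column.

    Positions are 1-based. Assumes the sequences are already aligned,
    with '-' marking a gap (hypothetical convention for this sketch).
    """
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    variants = []
    for position, column in enumerate(zip(*alignment), start=1):
        alleles = set(column)
        if len(alleles) == 1:
            continue  # every sequence agrees at this site
        kind = "INDEL" if "-" in alleles else "SNP"
        variants.append((position, kind, sorted(alleles)))
    return variants


# The six aligned sequences shown on the slide.
ALIGNMENT = [
    "TGCTTGGCAGGGCAGACTGACTGT",
    "TGCTTGGCAGGGCAGACTGACTGT",
    "TGCATGGCAGGGCAG-CTGACTGT",
    "TGCATGGCAGGGCAG-CTGACTGT",
    "TGCATGGCAGGGCAGACTGACTGT",
    "TGCATGGCAGGGCAGACTGACTGT",
]

variants = classify_variant_columns(ALIGNMENT)
for position, kind, alleles in variants:
    print(position, kind, "/".join(alleles))
```

On this alignment the sketch reports one SNP (T/A) at position 4 and one INDEL (A/gap) at position 16.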

18 Collections of SNPs
Figure labels: HCB, JPT, YRI, CEU; SNP

19 Engineering challenges
Identifying SNPs
Working out which SNPs will work on a given platform
Controlling the genotyping work-flow
Controlling the output quality
Performing quality-assurance exercises
Identifying problems, gaps and inconsistencies

20 A Bioinformatics problem: How small is my P-value?
The basic idea of association studies is to look for genetic differences between groups
It is easy to ask the question “Is there a significant difference in the frequency of a mutation between groups?”
Figure labels: Cases (D), Controls (C), Locus of interest
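
One common way to ask the question on this slide is a chi-square test on a 2x2 table of allele counts. The sketch below is an assumption about how such a test might be set up, not the presenter's method: the function name `allele_association_test` and the counts are hypothetical, and the P-value uses the fact that a chi-square statistic with one degree of freedom has tail probability erfc(sqrt(x/2)).

```python
import math


def allele_association_test(case_counts, control_counts):
    """Pearson chi-square test (1 d.f.) on a 2x2 table of allele counts.

    Each argument is (mutant_alleles, reference_alleles) for one group.
    Returns (chi_square, p_value).
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column of the table must be non-empty")
    chi_square = n * (a * d - b * c) ** 2 / denominator
    # Tail probability of a chi-square variable with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


# Hypothetical counts: 200 case chromosomes and 200 control chromosomes.
chi_square, p_value = allele_association_test((60, 140), (40, 160))
print(f"chi-square = {chi_square:.3f}, P = {p_value:.4f}")
```

A single test like this is easy to run; the difficulty discussed next is how to interpret its P-value when many such tests are performed.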

21 The problems
In a study of several hundred thousand mutations (or even millions), it is unlikely that we have actually typed the causal variant(s)
In a study of several hundred thousand mutations (or even millions), even if NONE of them is causal, many of them will show significance at the 5%, 1% or even 0.01% level
Differences in the frequency of disease incidence between groups (for example, African Americans and European Americans) will be associated with ANY genetic difference between them
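
The second problem can be made concrete with simple arithmetic: if every tested mutation is non-causal, the expected number that still reach significance level alpha is the number of tests times alpha. The sketch below assumes a hypothetical study of 500,000 mutations. It also shows the Bonferroni per-test threshold, a standard correction that is named here as an illustration and does not appear in the slides.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of null tests with P below alpha (n_tests * alpha)."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, family_alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate at family_alpha."""
    return family_alpha / n_tests


N_TESTS = 500_000  # hypothetical study size

for alpha in (0.05, 0.01, 0.0001):
    count = expected_false_positives(N_TESTS, alpha)
    print(f"alpha = {alpha}: about {count:.0f} false positives expected")

print(f"Bonferroni per-test threshold: {bonferroni_threshold(N_TESTS):.1e}")
```

With these assumed numbers, about 25,000, 5,000 and 50 mutations would be expected to pass the 5%, 1% and 0.01% levels by chance alone.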

22 What we really want to ask
“Does any of the genome show an association with disease over and above any effect I might expect from the correlation between genotype and environmental risk?”
“If so, what is the most likely position for the causal mutation(s)?”
Answering these questions is difficult, but a natural way to approach the problem is to model the process

23 Modelling genetic variation
Figure labels: Evolutionary parameters; Population; Sample; Stochastic evolutionary process; Stochastic sampling process; Selection; Mutation; Genetic drift; Recombination; Migration; MODEL; Inference
Example sequences:
ATGCATGGGCTATTGGACCT
ATGGATGGGCTATTGCACCT
ATGCATGGGCAATTGCACCT
ATGCATGGGCAATTGGACCT
ATGGATGGGCTATTGCACCT

24 Genes in populations
Present day

25 Ancestry of current population
Present day

26 Ancestry of sample
Present day

27 The coalescent: samples in populations
Figure labels: Most recent common ancestor (MRCA); coalescence; ancestral lineages; present day; time
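
As a rough illustration of the coalescent, the sketch below simulates the time back to the MRCA of a sample under the standard neutral coalescent with constant population size and no recombination. These modelling assumptions, the time unit (2N generations for a diploid population of size N) and the function names are all choices made for this example, not details given on the slide. While k ancestral lineages remain, the waiting time to the next coalescence is exponential with rate k(k-1)/2, which gives an expected time to the MRCA of 2(1 - 1/n) for a sample of n.

```python
import random


def expected_tmrca(n):
    """Expected time to the MRCA for n samples, in units of 2N generations."""
    return 2.0 * (1.0 - 1.0 / n)


def simulate_tmrca(n, rng):
    """Simulate one time to the MRCA by summing coalescent waiting times."""
    total = 0.0
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2.0  # number of lineage pairs that can coalesce
        total += rng.expovariate(rate)
    return total


rng = random.Random(1)  # fixed seed so the run is repeatable
N_SAMPLES = 10
REPLICATES = 20_000

mean_tmrca = sum(simulate_tmrca(N_SAMPLES, rng) for _ in range(REPLICATES)) / REPLICATES
print(f"simulated mean: {mean_tmrca:.3f}, expected: {expected_tmrca(N_SAMPLES):.3f}")
```

The simulated mean should sit close to the expected value of 1.8 for a sample of ten, with most of that time spent waiting for the last two lineages to coalesce.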

28 How does this help us to think about mapping disease?
Individuals are related to each other through their genealogical history
Two nearby points on the genome will have similar genealogical histories, so mutations at these positions will also be correlated
Understanding how genealogical history changes along the genome (through recombination) and between populations (through historical demography) will allow us to
  Construct more powerful tests for disease association
  Localise disease-associated mutations
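
The correlation between mutations at nearby positions is often summarised by the squared correlation coefficient r^2 between two sites, a standard measure of linkage disequilibrium. The slide does not name a specific statistic, so the choice of r^2, the function name `r_squared` and the eight haplotypes below are assumptions made for illustration.

```python
def r_squared(haplotypes):
    """Squared correlation between two biallelic sites.

    `haplotypes` is a list of (allele_at_site_1, allele_at_site_2) pairs,
    each allele coded 0 (reference) or 1 (mutant).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    variance_product = p_a * (1 - p_a) * p_b * (1 - p_b)
    if variance_product == 0:
        raise ValueError("both sites must be polymorphic in the sample")
    d = p_ab - p_a * p_b  # covariance between the two sites
    return d * d / variance_product


# Hypothetical sample of eight haplotypes typed at two nearby sites.
HAPLOTYPES = [(1, 1), (1, 1), (1, 1), (0, 0), (0, 0), (0, 0), (1, 0), (0, 1)]

ld = r_squared(HAPLOTYPES)
print(f"r^2 = {ld:.2f}")
```

A value near 1 means one site predicts the other almost perfectly, which is why a typed mutation can show an association even when the causal mutation itself has not been typed.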

29 The bioinformatics module
Genomic technologies
Annotating genomes
Modelling gene evolution
Mapping disease genes
Measuring gene and protein expression
Predicting protein structure

