BINF6201/8201: Molecular Sequence Analysis Dr. Zhengchang Su Office: 351 Bioinformatics Building Office hours: Tuesday and Thursday:


1 BINF6201/8201: Molecular Sequence Analysis Dr. Zhengchang Su Office: 351 Bioinformatics Building Email: zcsu@uncc.edu Office hours: Tuesday and Thursday: 2:00~3:00pm 08-23-2010

2 Textbook and reading materials  Textbook: Bioinformatics and Molecular Evolution by Paul G. Higgs and Teresa K. Attwood, Blackwell Publishing, 2005.  Additional readings from the current literature may be assigned as appropriate.  All lecture slides will be available online at http://bioinfo.uncc.edu/zhx/binf8201/binf8201.html

3 Student Evaluation  Weekly or bi-weekly homework assignments (30%); Ph.D. students may have additional assignments.  Two midterm exams (60%): 10/5 (Tuesday) and 12/14 (Tuesday).  Classroom participation will count for 10% of the grade.

4 Sequence data explosions  Three almost equivalent biological sequence databases form the International Nucleotide Sequence Database Collaboration: 1. GenBank at NCBI; 2. the European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database at the European Bioinformatics Institute (EBI); 3. the DNA Data Bank of Japan (DDBJ).  Features: 1. All published biological sequences are requested to be deposited in one of these three databases; 2. Data are exchanged among the three databases on a daily basis.

5 Data explosions  Both the number/length of sequences and the number of transistors in a CPU increase exponentially with time.  However, the number/length of sequences increases even faster than the number of transistors in a CPU. [Figure: ln N(t) plotted against time t]
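
Exponential growth means that ln N(t) rises as a straight line in t, which is why the slide plots ln N(t). The following is a minimal sketch, not taken from the lecture, of how a doubling time could be estimated from such a plot by fitting a line to ln N(t). The function name `doubling_time` and the data series are hypothetical; the numbers are synthetic and are not real database or transistor counts.

```python
import math


def doubling_time(times, counts):
    """Fit ln N(t) = a + r*t by ordinary least squares and return ln(2)/r."""
    logs = [math.log(n) for n in counts]
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(logs) / n
    # Slope r of the least-squares line through the (t, ln N) points.
    covariance = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, logs))
    variance = sum((t - t_mean) ** 2 for t in times)
    slope = covariance / variance
    return math.log(2) / slope


# Synthetic, made-up series constructed to double every 1.5 time units.
times = [0, 1, 2, 3, 4, 5, 6]
counts = [1000 * 2 ** (t / 1.5) for t in times]

estimated = doubling_time(times, counts)
print(f"estimated doubling time: {estimated:.3f}")
```

Comparing the fitted slopes of two series (for example, sequence counts and transistor counts) is one way to express "increases even faster": the steeper line has the shorter doubling time.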

6 Sequence data explosions are the result of the continuous development of new sequencing technologies:  Chain termination (Sanger) method (1977)  Automation of sequence determination (late 1980s)  Shotgun sequencing strategy (1995)  Next-generation sequencing technologies (2004): 1. 454 pyrosequencing (454 Life Sciences/Roche Diagnostics); 2. Solexa sequencing (Illumina); 3. SOLiD sequencing (Applied Biosystems); 4. Helicos BioSciences; 5. Pacific Biosciences; 6. Polonator (open source)

7 Data explosions  Since 1995, the number of sequenced genomes has also increased exponentially. (Data as of 8-19-2010, from http://www.genomesonline.org)

8 Data explosions  Since 2006, the number of metagenome sequences has increased exponentially thanks to the advent of next-generation sequencing technologies.  As of September 2009, about 200 metagenomes had been sequenced or were in the process of being sequenced. http://www.genomesonline.org

9 Data explosions  The speed of computers also increases exponentially with time.  However, how to use these ever more powerful computers to solve biological problems remains a very challenging task for the computer science and biology research communities.

10 Data explosions  More and more biological research uses computational analyses.

11 What is genomics?  The availability of whole genome sequences of organisms has led to the birth of genomics, which studies organisms based on the genetic information encoded in their genomes.  According to the subject of study, genomics can be divided into: 1. Functional genomics, which is coupled with the development of relevant high-throughput technologies, such as microarray/RNA-Seq (transcriptomics), mass spectrometry (proteomics), and nuclear magnetic resonance (NMR) and mass spectrometry (metabolomics); 2. Comparative/evolutionary genomics.

12 What is Bioinformatics?  For a short answer: “Bioinformatics is the use of computational methods to study biological data and problems”.  For a more detailed answer: Bioinformatics is 1.“The development and use of computational methods for studying the structure, function, and evolution of genes, proteins and whole genomes;” 2.“The development and use of methods for the management and analysis of biological information arising from genomics and high-throughput experiments.”

13 Population genetics, molecular evolution and sequence analysis  According to evolutionary theory, biological sequences are related to one another through heredity and variation.  Sequence analysis methods are thus based on the principles of sequence evolution.  Therefore, to analyze sequences, we must understand 1. how genes (loci) change dynamically within a population of the same species (population genetics); and 2. how gene sequences change during the course of evolution among different species (molecular evolution).

14 Sequence Similarity  The similarity of two sequences can be identified by aligning them with an alignment method/algorithm, such as BLAST or the Smith-Waterman algorithm.  Two parameters describe the similarity of two sequences: 1. Identity 2. Similarity
Identities = 38/139 (27%), Similarity = 66/139 (47%), Gaps = 9/139 (6.5%)
Query  LELTYIVNFGSELAVVSMLPTFFETTFDLPKATAGILASCFAFVNLVARPAGGLISDSVG
       +   Y + FG  +A  + LPT+  T +      AG   + FA   ++ARP GG +SD +
Sbjct  MSFLYAIVFGGFVAFSNYLPTYITTIYGFSTVDAGARTAGFALAAVLARPVGGWLSDRIA

Query  SRKNTMGFLTAGLGVGYLVMSMIKPGTFTGTTGIAVAVVITMLASFFVQSGEGATFALVP
        R   +  L     + +       P  ++  T I +AV + +        G G  FA V
Sbjct  PRHVVLASLAGTALLAFAAALQPPPEVWSAATFITLAVCLGV--------GTGGVFAWVA

Query  -LVKRRVTGQVAGLVGAYGNVG
               G V G+V A G +G
Sbjct  RRAPAASVGSVTGIVAAAGGLG
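
As an illustration of how the identity and gap figures are counted, here is a minimal sketch (not the BLAST implementation) that tallies identical columns and gap columns for two already-aligned strings. The function name `alignment_stats` is hypothetical, and the input is the last block of the alignment shown on this slide. Counting "Similarity" would additionally require a substitution matrix such as BLOSUM62 to decide which non-identical pairs score positively, so it is left out here.

```python
def alignment_stats(query, subject, gap="-"):
    """Return (identities, gaps, length) for two aligned sequences of equal length."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    columns = list(zip(query, subject))
    # A column is an identity when both residues match and neither is a gap.
    identities = sum(1 for q, s in columns if q == s and q != gap)
    # A column counts as a gap when either sequence has the gap character.
    gaps = sum(1 for q, s in columns if q == gap or s == gap)
    return identities, gaps, len(columns)


# Last block of the alignment on the slide (query on top, subject below).
query = "-LVKRRVTGQVAGLVGAYGNVG"
subject = "RRAPAASVGSVTGIVAAAGGLG"

identities, gaps, length = alignment_stats(query, subject)
print(f"Identities = {identities}/{length}, Gaps = {gaps}/{length}")
```

Summing such counts over all blocks of an alignment gives the totals reported in the summary line.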

15 Homologous Sequence  Homology: If the similarity of two sequences is high enough, it is highly likely that they have evolved from a common ancestor, and we say that they are homologous to each other. For example, if two sequences of 100 amino acids have 80% identical residues, the probability that they share this level of similarity by chance is roughly (1/20)^80.  Homology of two sequences can only be inferred computationally; it is difficult to test experimentally.
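
To give a sense of scale for the chance probability in the example above, the sketch below evaluates (1/20)^80 numerically. It assumes, as the slide implicitly does, that all 20 amino acids are equally likely and that positions are independent. The second figure is an assumption that goes beyond the slide: a binomial probability of exactly 80 matches at any 80 of the 100 positions, included only to show that the conclusion does not change.

```python
import math

residue_p = 1 / 20  # assumed chance that two random residues match
length = 100        # alignment length from the slide's example
matches = 80        # number of identical positions

# The slide's estimate: 80 given positions all match by chance.
p_slide = residue_p ** matches

# Assumption beyond the slide: exactly 80 matches anywhere among 100 positions.
p_binomial = (
    math.comb(length, matches)
    * residue_p ** matches
    * (1 - residue_p) ** (length - matches)
)

print(f"(1/20)^80 is about 10^{math.log10(p_slide):.1f}")
print(f"binomial, exactly 80 of 100, is about 10^{math.log10(p_binomial):.1f}")
```

Either way the probability is vanishingly small, which is why such a high identity is taken as strong evidence of common ancestry.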

16 Orthologs and Paralogs There are two distinct types of homologous relationships, which differ in their evolutionary history and functional implications. Orthologs: evolutionary counterparts derived from a single ancestral gene in the last common ancestor of the two given species. Therefore, orthologous genes are related due to vertical evolution. Orthologous genes typically have the same function. Paralogs: homologous genes that evolved through duplication within the same or an ancestral genome. Therefore, paralogous genes are related due to duplication events. Paralogous genes do not necessarily have the same function. [Figure labels: duplication, speciation]

17 Divergent evolution  When the similarity between two sequences is very low, say, 8% identity, they could still be homologous due to divergent evolution.  Divergently evolved genes usually have similar biochemical functions. [Figure labels: speciation or duplication, homologues]

18 Convergent evolution  When the similarity between two sequences is very low, say, 8%, they could be of different origins, with the observed sequence similarity resulting from convergent evolution under functional selection. Two such sequences are called analogues.  Analogues may have similar biochemical functions, and they usually share only a few amino acids, called motifs, in the active site of enzymes. [Figure label: analogues]

19 Horizontal gene transfer (HGT)  During evolution, progeny obtain their genes from their ancestors (vertical gene transfer); however, they can also obtain genes from other species, genera, or even higher taxa. This phenomenon is called horizontal gene transfer or lateral gene transfer.  HGT is pervasive, particularly in prokaryotes, and is believed to be a major driving force of evolution. [Figure: Archaea, Bacteria, and Eukaryota descending from the LCA (last common ancestor), with vertical and horizontal gene transfer indicated]
