Topics in 2nd Part: Biological Information and Tools. Molecular Modeling Technology and Applications. Computer-aided drug design SMA5422: Special Topics.


1 Topics in 2nd Part: Biological Information and Tools. Molecular Modeling Technology and Applications. Computer-aided drug design
SMA5422: Special Topics in Biotechnology
Chen Yu Zong, Department of Computational Science, NUS
Office: Blk SOC1 Room 07-24. Tel.: 65-874-6877. E-mail: yzchen@cz3.nus.edu.sg

2 Schedule
Lecture 6 (Feb 14): Biological information database and data mining.
Lecture 7 (Feb 21): Gene and protein sequence alignment methods.
Lecture 8 (Feb 26): Machine learning techniques in sequence analysis.
Lecture 9 (Mar 5): Computer modeling of biomolecules: structure, motion, and binding.
Lecture 10 (Mar 7): Computer-aided drug design: structure-based approach.
Lecture 11 (Mar 12): Computer-aided drug design: QSAR approach.

3 Lecture 6: Biological information database and data mining
Biology as an information intensive science
Typical databases
Introduction to data mining
Data mining in biology

4 Biology as an information intensive science
Organization of living systems: Ecosystems => Communities => Populations => Organisms => Organ systems => Organs => Tissues => Cells => Molecules.
Ecosystem: All living things in a particular area (such as an island) and all non-living, physical components of the environment that affect living things (such as air, soil, water, sunlight).
Community: All living things in an ecosystem (such as all animals, plants, bacteria, fungi, viruses, etc. in a rain forest).
Population: A group of interbreeding individuals of one species (such as all flying squirrels in a rain forest).
Organism: An individual living thing (such as one flying squirrel).
Organ system: A group of related body components that perform a specific type of function (such as the central nervous system).
Organ: A functional unit of an organ system (such as the brain).

5 Biology as an information intensive science
Fundamental Theory: Evolution: Simple molecules => Organic molecules => RNA-based life systems => Single cells => Multicellular organisms => Higher organisms.
Molecular Basis of Life: DNA (Genes) => RNAs => Proteins.
Roles of proteins: structural organization; chemical reaction, synthesis and destruction of molecules; signal transduction; transportation of molecules; regulation.

6 Biology as an information intensive science
Cell Organization and Function: structural organization; chemical reaction, synthesis and destruction of molecules; signal transduction; transportation of molecules; regulation.

7 Biology as an information intensive science
Information (Molecular Level):
DNA: 30,000 ~ 100,000 genes for human (many with unknown functions); 3x10^9 base pairs for human DNA (< 10% coding region).
Protein: 60,000 ~ 100,000 proteins for human.
Individual level: sequence, 3D structure, molecular function.
Group level: pathways, cellular location, collective function.
Classification:
Family: superfamily, family, subfamily (based on evolution and function).
Type: receptor, ion channel, enzyme, carrier, regulator, structure.
Function: physiological function, diseases, therapeutics, toxicity, pharmacokinetics, agriculture, plant, environmentally relevant.

8 Typical Databases
Category:
General
Sequence
3D structure
Protein function, proteomics, and pathways
Pharmainformatics
Medical informatics and disease information
Reference: Nucleic Acids Res. 30, 1-12 (2002).
Internet links: http://www.cz3.nus.edu.sg/~yzchen/database.html

9 Typical Databases
General:
The National Center for Biotechnology Information (NCBI). Integrated ENTREZ retrieval software and databases for genetics, gene and protein sequences, 3D structures, and the on-line PubMed library. CAM (Complementary and Alternative Medicine) on PubMed.
Pedro's BioMolecular Research Tools. A collection of WWW links to information and services useful to molecular biologists. Other mirror sites in Germany and Switzerland.
The CMS Molecular Biology Resource. A compendium of electronic and Internet-accessible tools and resources for molecular biology, biotechnology, molecular evolution, biochemistry, and biomolecular modeling. Other mirror sites in Japan, Canada, France, Germany, Italy, and the UK.

10 Typical Databases
Sequence:
The Genome Data Base (GDB). Database for genes of human and other species. Located at Johns Hopkins University School of Medicine. Mirror site in Japan.
Genome Sequence DataBase. Located at the National Center for Genome Resources (NCGR) in Santa Fe. Site has info on the Human Genome Project, genetics and public issues, education and references.
SWISS-PROT. Annotated protein sequence database.
Online Mendelian Inheritance in Man. Database that catalogs human genes and genetic disorders. Located at NCBI.
Pfam: Protein families database of alignments and HMMs. A large collection of multiple sequence alignments and hidden Markov models covering many common protein domains.

11 Typical Databases
Structure:
Protein Data Bank (PDB). 3D crystal and NMR structures of proteins, DNA, RNA and ligand-bound complexes. Official mirror sites in Singapore, China, Japan, Taiwan, and several places in the USA (Boston, North Carolina).
Nucleic Acids Database (NDB). 3D crystal structures of DNA and RNA. Mirror sites in the UK, Japan, and the USA (San Diego).
SCOP. Structural classification of proteins. Mirror sites in Singapore, China, the U.S., and Japan.
CATH. Protein structure classification. A hierarchical domain classification of protein structures in PDB.
MODBASE. A database of comparative protein structure models. Models were generated by PSI-BLAST and MODELLER. As of Aug 2000, there are 3,379 reliable models for domains in 2,220 proteins, and 5,433 reliable fold assignments for domains in 3,083 proteins.

12 Typical Databases
Function and pathways:
GeneCards. A database of human genes, their products and their involvement in diseases. It offers concise information about the functions of all human genes that have an approved symbol, as well as selected others.
PROSITE. Protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs. Mirror sites in Australia, Canada, China, and Taiwan.
PRINTS. Protein fingerprint database. A fingerprint is a group of conserved motifs used to characterise a protein family.
PROCAT. A database of 3D enzyme active site templates. It can be thought of as the 3D equivalent of the 1D templates found in sequence motif databases such as PROSITE and PRINTS.
KEGG: Kyoto Encyclopedia of Genes and Genomes. Site contains pathway info, disease catalogs, cell catalogs, molecule catalog, and genomic info. It also provides links to pathway and other databases.
SPAD: Signaling Pathway Database. An integrated database for genetic information and signal transduction systems. Divided into four categories based on the extracellular signal molecules (growth factor, cytokine, and hormone) and stress that initiate the intracellular signaling pathway.

13 Typical Databases
Pharmainformatics:
TTD: Therapeutic Target Database. A database that provides information about the known and newly proposed therapeutic protein and nucleic acid targets, the targeted disease, pathway information and the corresponding drugs/ligands directed at each of these targets. Links to relevant databases are also provided.
MedChem/Biobyte QSAR Database. A collection of 10,000 QSAR datasets covering both biological and physical-organic chemistry.
The NCI Drug Information System 3D Database. A collection of 3D structures for over 400,000 drugs, built and maintained by the Developmental Therapeutics Program, Division of Cancer Treatment, National Cancer Institute. The database is an extension of the NCI Drug Information System.
Drug Discovery Databases Compiled by The Biophysical Pharmacology Group at NCI. Site has links to several therapeutics program databases and tools, and a 2D-gel protein expression database.
Pharmaceutical Information Network. A comprehensive information database about drugs and diseases.
U.S. Food and Drug Administration Center for Drug Evaluation and Research.

14 Introduction to Data Mining
Main Objective: Pattern identification, classification, extraction of related data (character) sets.
Tasks:
Generation of association rules.
Classification and clustering.
Pre-processing and post-processing of relevant datasets.
General Procedure:
1. Understanding of application domain.
2. Data source identification and data selection.
3. Pre-processing: feature selection, discretization, data cleaning.
4. Data mining: pattern extraction and model building.
5. Post-processing: identification of interesting/useful/novel patterns/rules.
6. Incorporation of patterns in real world tasks.

15 Introduction to Data Mining
Example: Generation of association rules
Record of customer purchases:
John: Jacket, Boots
Alfred: Milk, Cheese, Bread, Shoes
Green: Milk, Bread
Brown: Milk, Bread, Shoes, Greeting Cards, Pork
Eric: Cheese, Milk, Shoes, Beef
Bob: Jacket, Boots, Ski Pants
Form of association rules: Item A => Item B [sup, conf]
sup = support = % of records containing both item A and item B
conf = confidence = sup / (% of records containing item A)
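
The purchase records on this slide are small enough to work through by program. The following is a minimal sketch, not part of the original lecture. It uses the standard definitions: support is the fraction of records containing both items, and confidence is the number of records containing both items divided by the number containing the left-hand item A. The function names and the thresholds min_sup and min_conf are illustrative assumptions.

```python
from itertools import permutations

# Purchase records copied from the slide.
records = {
    "John": {"Jacket", "Boots"},
    "Alfred": {"Milk", "Cheese", "Bread", "Shoes"},
    "Green": {"Milk", "Bread"},
    "Brown": {"Milk", "Bread", "Shoes", "Greeting Cards", "Pork"},
    "Eric": {"Cheese", "Milk", "Shoes", "Beef"},
    "Bob": {"Jacket", "Boots", "Ski Pants"},
}


def count(items, records):
    """Number of records that contain every item in `items`."""
    return sum(1 for basket in records.values() if items <= basket)


def support(a, b, records):
    """Fraction of all records containing both item a and item b."""
    return count({a, b}, records) / len(records)


def confidence(a, b, records):
    """Of the records containing a, the fraction that also contain b."""
    n_a = count({a}, records)
    return count({a, b}, records) / n_a if n_a else 0.0


def association_rules(records, min_sup=0.3, min_conf=0.7):
    """Enumerate single-item rules a => b that pass both thresholds.

    min_sup and min_conf are hypothetical cut-offs chosen for this example.
    """
    items = sorted(set().union(*records.values()))
    rules = []
    for a, b in permutations(items, 2):
        sup = support(a, b, records)
        conf = confidence(a, b, records)
        if sup >= min_sup and conf >= min_conf:
            rules.append((a, b, sup, conf))
    return rules


if __name__ == "__main__":
    for a, b, sup, conf in association_rules(records):
        print(f"{a} => {b} [sup={sup:.2f}, conf={conf:.2f}]")
```

For the rule Milk => Bread, three of the six records contain both items and four contain Milk, so the support is 0.50 and the confidence is 0.75. The reverse rule Bread => Milk has the same support but a confidence of 1.00, which shows why the denominator must be the left-hand item.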

16 Data Mining in Biology
Types of Tasks:
Search for a similar pattern in a subsection of each member of a dataset (e.g. protein sequence motifs).
Classification of datasets into groups (e.g. proteins into families).
Search for a dataset matching given characteristics (e.g. alignment of a protein sequence against all entries in a protein sequence database).
Extraction of particular information from literature (e.g. drugs that bind to a particular protein).
References:
Proc. Natl. Acad. Sci. USA 95, 10710-10715 (1998)
Structure 7, 1099-1112 (1999)
Bioinformatics 17, 721-728 (2001)
Bioinformatics 17, 155-161 (2001); 17, 359-363 (2001)
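
The first task type, searching each sequence for a shared motif, can be illustrated with a short sketch. This is not part of the original lecture: the sequences are invented, the function names are hypothetical, and the converter handles only a small subset of PROSITE-style pattern notation (single residues, x, [..] alternatives, {..} exclusions, and repeat counts such as x(2,4)). The example pattern N-{P}-[ST]-{P} is the usual textbook form of an N-glycosylation site motif.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simplified PROSITE-style pattern into a Python regex.

    Supported elements (a subset only): 'A' (one residue), 'x' (any residue),
    '[ST]' (one of), '{P}' (none of), and repeats such as 'x(3)' or 'x(2,4)'.
    """
    parts = []
    for element in pattern.split("-"):
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # excluded residues
        # '[..]' groups and single letters are already valid regex
        if low and high:
            core += "{" + low + "," + high + "}"
        elif low:
            core += "{" + low + "}"
        parts.append(core)
    return "".join(parts)


def find_motif(pattern, sequences):
    """Return {name: [(start, matched_text), ...]} for every sequence.

    A lookahead is used so that overlapping occurrences are also reported.
    """
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return {
        name: [(m.start(), m.group(1)) for m in regex.finditer(seq)]
        for name, seq in sequences.items()
    }


# Invented sequences, for illustration only.
sequences = {
    "seq1": "MKNGSAVLNPTQ",
    "seq2": "AANLTRRNVSP",
    "seq3": "MKVLAAG",
}

if __name__ == "__main__":
    for name, hits in find_motif("N-{P}-[ST]-{P}", sequences).items():
        print(name, hits)
```

Start positions are 0-based. A production scan would use the full pattern syntax and the curated patterns distributed with a motif database such as PROSITE, not a hand-written converter like this one.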

17 Homework
1. Write a very short report about a database assigned to you.
2. Can you give at least two more examples of each type of task in biological data mining?
3. Read the reference about typical biological databases and get a broad picture of the current status of publicly accessible bioinformatics databases.
4. Read at least one of the references about data mining in biology and be prepared to give a brief description of the paper.

