Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner Bioinformatics M.E:800.707.

Introduction to Bioinformatics Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.707

People with very diverse backgrounds in biology Some people with backgrounds in computer science and biostatistics Most people (will) have a favorite gene, protein, or disease, or a high throughput dataset Who is taking this course?

Different user needs, different approaches: from web-based or graphical user interface (GUI) to command line
--NCBI, EBI: central resources
--UCSC, Ensembl: genome browsers
--Galaxy: web implementation of browser data, NGS tools
--Perl, R: manipulate data files
--Linux: next-generation sequencing, other tools
--Software for data analysis, large databases: Partek, MEGA5, RStudio

What are the goals of the course?
--To provide an introduction to bioinformatics with a focus on the National Center for Biotechnology Information (NCBI), UCSC, and EBI
--To focus on the analysis of DNA, RNA and proteins
--To introduce you to the analysis of genomes
--To combine theory and practice to help you solve research problems

Workflow #1: How do we analyze a protein?
--Suppose we are interested in learning about beta globin (HBB), a subunit of hemoglobin
--NCBI offers information at the levels of DNA (e.g. variation in disease), RNA (e.g. gene expression data from microarrays), protein (e.g. 3D structure), and pathways
--EBI offers comparable information
--Other portals such as ExPASy offer hundreds of analysis tools

Workflow #2: How do we analyze a human genome?
--Choose an individual (e.g. a patient), obtain informed consent, get blood, purify genomic DNA
--Obtain raw DNA reads by next-generation sequencing
--Align the reads to a reference human genome
--Call variants (single nucleotide variants, indels)
--Determine the functional significance of variants (deleterious or neutral)
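
The align-then-call steps of this workflow can be illustrated with a deliberately tiny sketch. It assumes the reads are already aligned (each read is given as a 0-based start position on a made-up reference plus its bases), it handles only single nucleotide variants, and the names and thresholds (call_snvs, min_depth, min_fraction) are hypothetical. Real projects rely on dedicated aligners and variant callers.

```python
from collections import Counter, defaultdict


def call_snvs(reference, aligned_reads, min_depth=2, min_fraction=0.5):
    """Toy SNV caller: pile up read bases at each reference position and
    report positions where one non-reference base reaches min_fraction
    of the read depth."""
    pileup = defaultdict(Counter)
    for start, bases in aligned_reads:
        for offset, base in enumerate(bases):
            pos = start + offset
            if pos < len(reference):
                pileup[pos][base] += 1

    variants = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to make a call
        # Most frequent base that differs from the reference, if any.
        alt, alt_count = max(
            ((b, c) for b, c in counts.items() if b != reference[pos]),
            key=lambda pair: pair[1],
            default=(None, 0),
        )
        if alt is not None and alt_count / depth >= min_fraction:
            variants.append((pos, reference[pos], alt, alt_count, depth))
    return variants


# Made-up reference and reads; two of three reads carry T at position 4.
reference = "ACGTACGTAC"
reads = [(0, "ACGTAC"), (2, "GTTCGT"), (3, "TTCGTAC")]
variants = call_snvs(reference, reads)
print(variants)
```

Each reported tuple is (position, reference base, alternate base, supporting reads, depth). Deciding whether such a variant is deleterious or neutral is a separate annotation step that this sketch does not attempt.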

Textbook The course has no required textbook. I wrote Bioinformatics and Functional Genomics (Wiley-Blackwell, 2nd edition 2009). The lectures in this course correspond closely to chapters. I will make pdfs of the chapters available to everyone. You can also purchase a copy at the bookstore, at amazon.com, or at Wiley with a 20% discount through the book’s website www.bioinfbook.org.

Web sites The course website is reached via moodle: http://pevsnerlab.kennedykrieger.org/moodle (or Google “moodle bioinformatics”) --This site contains the powerpoints for each lecture, including black & white versions for printing --The weekly quizzes are here --You can ask questions via the forum --Audiovisual files of each lecture will be posted here The textbook website is: http://www.bioinfbook.org This has powerpoints, URLs, etc. organized by chapter. This is most useful to find “web documents” corresponding to each chapter.

Themes throughout the course: the beta globin gene/protein family We will use beta globin as a model gene/protein throughout the course. Globins including hemoglobin and myoglobin carry oxygen. We will study globins in a variety of contexts including --sequence alignment --gene expression --protein structure --phylogeny --homologs in various species

Computer labs There is no in-class computer lab, but the seven weekly quizzes function as a take-home computer lab. To solve the questions, you will need to go to websites, use databases, and use software. Most quizzes are due in 7 days. Because of Thanksgiving, the first quiz will be due in 9 days (November 28th at noon).

Grading 60% moodle quizzes (your top 6 out of 7 quizzes). Quizzes are taken at the moodle website, and are due one week after the relevant lecture. Special extended due date for quizzes due immediately after Thanksgiving and the New Year. 40% final exam Thursday, January 17 (2pm, in class). Closed book, cumulative, no computer, short answer / multiple choice. Past exams will be made available ahead of time.

Google “moodle bioinformatics” to get here; Click “Bioinformatics” to sign in; The enrollment key you need is…

Outline for the course (all on Mondays)
1. Accessing information about DNA and proteins: Nov. 19
2. Pairwise alignment and BLAST: Nov. 26
3. Advanced BLAST to next-generation sequencing: Dec. 3
4. Multiple sequence alignment: Dec. 10
5. Molecular phylogeny and evolution: Dec. 17
6. Microarrays: Jan. 7
7. Next-generation sequencing: Jan. 14
Final exam (Thursday 2:00 pm): Jan. 17

Outline for today Definition of bioinformatics Overview of the NCBI website Accession numbers, RefSeq, and Entrez Gene Two genome browsers: UCSC and Ensembl From UCSC Table Browser to Galaxy

Learning objectives for today [1] You should be able to explain what accession numbers are and why the RefSeq project is significant [2] You should be able to find accession numbers for any gene (or protein) from any organism via Entrez Gene [3] You should be able to locate any human gene using the UCSC Genome Browser [4] You should be able to locate information about genes (and proteins) using the UCSC Table Browser and the related resource Galaxy

Interface of biology and computers Analysis of proteins, genes and genomes using computer algorithms and computer databases Genomics is the analysis of genomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects. What is bioinformatics?

Three perspectives on bioinformatics The cell The organism The tree of life Page 4

First perspective: the cell

DNA -> RNA -> protein -> phenotype Page 5

Central dogma of molecular biology: DNA -> RNA -> protein. Central dogma of bioinformatics and genomics: genome -> transcriptome -> proteome.
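
As a minimal sketch of the DNA -> RNA -> protein flow, the code below transcribes and translates a short coding sequence using the standard genetic code. The input sequence is chosen only for illustration, the function names are hypothetical, and the sketch ignores introns, reading-frame detection, and alternative genetic codes.

```python
# Standard genetic code, built from the conventional TCAG ordering:
# first base varies slowest, third base fastest.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.upper().replace("T", "U")


def translate(dna):
    """Translate coding-strand DNA from its first base, codon by codon,
    stopping at the first stop codon or when fewer than 3 bases remain."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


coding_dna = "ATGGTGCATCTGACTCCTGAGGAGAAG"  # illustrative input only
print(transcribe(coding_dna))
print(translate(coding_dna))
```

The protein is written in the one-letter amino acid code, which comes up again when we look at Entrez Protein records.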

DNA (genomic DNA databases) -> RNA (cDNA, ESTs, UniGene) -> protein (protein sequence databases) -> phenotype Fig. 2.2 Page 18

Growth of GenBank [Figure: base pairs of DNA (millions) and sequences (millions) by year, 1982 to 2002] Fig. 2.1 Page 17

Sequence Read Archive (SRA) at NCBI: over 900 terabases of sequence (~1 petabase). A project to sequence 10,000 human genomes requires about 3 petabytes of storage (3,000 terabytes). A server with 100 TB of storage now costs $10,000. [Figure: size of the SRA in terabases by year, 2008 to 2013]
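
The storage figures on this slide imply a per-genome footprint and a rough hardware bill. The arithmetic below is a back-of-the-envelope sketch: it reads the server capacity as 100 terabytes, uses 1 petabyte = 1,000 terabytes as the slide does, and ignores backups, redundancy, and running costs. The variable names are illustrative.

```python
TB_PER_PB = 1000                     # the slide equates 3 PB with 3,000 TB

genomes = 10_000
project_storage_tb = 3 * TB_PER_PB   # about 3 petabytes for the project
server_capacity_tb = 100             # assumed to mean 100 terabytes
server_cost_usd = 10_000

# Average storage per genome, in terabytes.
storage_per_genome_tb = project_storage_tb / genomes

# Whole servers needed (ceiling division), and their purchase cost.
servers_needed = -(-project_storage_tb // server_capacity_tb)
hardware_cost_usd = servers_needed * server_cost_usd

print(f"{storage_per_genome_tb:.1f} TB per genome")
print(f"{servers_needed} servers, ${hardware_cost_usd:,}")
```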

Arrival of next-generation sequencing: we have sequenced hundreds of terabases (November 2012). Six years ago GenBank celebrated reaching 100 billion base pairs of DNA. Now when my lab sequences the genome of one patient, we obtain ~150 billion base pairs of sequence and ~1 terabyte of data for $3,000. This course will reflect the major impact of increased DNA sequencing on all areas of biology.
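
To put these numbers in perspective, the sketch below works out the approximate depth of coverage and the data volume per sequenced base. It assumes a haploid human genome of roughly 3.1 billion base pairs (a figure not given on the slide) and treats 1 terabyte as 10^12 bytes; all variable names are illustrative.

```python
sequenced_bp = 150e9         # ~150 billion base pairs per patient genome
data_bytes = 1e12            # ~1 terabyte of data, taken as 10**12 bytes
genome_size_bp = 3.1e9       # assumption: approximate haploid human genome
genbank_milestone_bp = 100e9 # the 100 billion bp milestone from the slide

# Average number of times each position in the genome is read.
coverage = sequenced_bp / genome_size_bp

# Storage per sequenced base (reads, quality scores, and overhead together).
bytes_per_base = data_bytes / sequenced_bp

# One patient genome relative to the GenBank milestone.
milestone_ratio = sequenced_bp / genbank_milestone_bp

print(f"~{coverage:.0f}x coverage")
print(f"~{bytes_per_base:.1f} bytes per sequenced base")
print(f"{milestone_ratio:.1f} times the 100 billion bp milestone")
```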

There are three major public DNA databases: GenBank, EMBL, and DDBJ. The underlying raw DNA sequences are identical. Page 14

There are three major public DNA databases: GenBank (housed at NCBI, the National Center for Biotechnology Information), EMBL (housed at EBI, the European Bioinformatics Institute), and DDBJ (housed in Japan). Page 14

The storage of next-generation sequence data is an emerging challenge. NCBI offers a sequence read archive (SRA), but the best storage strategies are uncertain. http://www.ncbi.nlm.nih.gov/Traces/sra/ The European Bioinformatics Institute (EBI) offers the European Nucleotide Archive (ENA), http://www.ebi.ac.uk/ena/ For individual labs, tens of terabytes of storage are routinely needed.

Time of development Body region, physiology, pharmacology, pathology Page 5 Second perspective: the organism

After Pace NR (1997) Science 276:734 Page 6 Third perspective: the tree of life

Taxonomy at NCBI: >250,000 species are represented in GenBank Page 16 http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi 11/12

The most sequenced organisms in GenBank
Homo sapiens                    16.3 billion bases
Mus musculus                    10.0b
Rattus norvegicus                6.5b
Bos taurus                       5.4b
Zea mays                         5.1b
Sus scrofa                       4.9b
Danio rerio                      3.1b
Strongylocentrotus purpuratus    1.4b
Macaca mulatta                   1.3b
Oryza sativa (japonica)          1.3b
Updated Nov. 2012, GenBank release 192.0. Excluding WGS, organelles, metagenomics. Table 2-2 Page 17

National Center for Biotechnology Information (NCBI) www.ncbi.nlm.nih.gov Fig. 2.4 Page 24

NCBI key features: PubMed
--National Library of Medicine's search service
--22 million citations in MEDLINE (as of 2012)
--links to participating online journals
--PubMed tutorial on the site http://www.ncbi.nlm.nih.gov/pubmed or visit NLM: http://www.nlm.nih.gov/bsd/disted/pubmed.html
Be sure to access PubMed via Welch Library! http://welch.jhmi.edu/ Also get to know your informationists: Peggy Gross, Rob Wright et al. Page 23

Entrez integrates… the scientific literature; DNA and protein sequence databases; 3D protein structure data; population study data sets; assemblies of complete genomes Page 24 NCBI key features: Entrez search and retrieval system

Accession numbers are labels for sequences NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data. Page 26

What is an accession number? An accession number is a label that is used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence. Examples (all for beta globin, HBB):
DNA
X02775          GenBank genomic DNA sequence
NG_000007.3     RefSeqGene
rs192792910     dbSNP (single nucleotide polymorphism)
RNA
AA970968.1      An expressed sequence tag (1 of 2,345)
NM_000518.4     RefSeq DNA sequence (from a transcript)
Protein
NP_000509.1     RefSeq protein
CAA00182.1      GenBank protein
Q14473          SwissProt protein
1YE0|B          Protein Data Bank structure record
Page 27

NCBI’s important RefSeq project: best representative sequences RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence. RefSeq identifiers include the following formats:
Complete genome       NC_######
Complete chromosome   NC_######
Genomic contig        NT_######
mRNA (DNA format)     NM_######   e.g. NM_000518
Protein               NP_######   e.g. NP_000509
Page 27

NCBI’s RefSeq project: many accession number formats for genomic, mRNA, protein sequences
Accession         Molecule   Method           Note
AC_123456         Genomic    Mixed            Alternate complete genomic
AP_123456         Protein    Mixed            Protein products; alternate
NC_123456         Genomic    Mixed            Complete genomic molecules
NG_123456         Genomic    Mixed            Incomplete genomic regions
NM_123456         mRNA       Mixed            Transcript products; mRNA
NM_123456789      mRNA       Mixed            Transcript products; 9-digit
NP_123456         Protein    Mixed            Protein products
NP_123456789      Protein    Curation         Protein products; 9-digit
NR_123456         RNA        Mixed            Non-coding transcripts
NT_123456         Genomic    Automated        Genomic assemblies
NW_123456         Genomic    Automated        Genomic assemblies
NZ_ABCD12345678   Genomic    Automated        Whole genome shotgun data
XM_123456         mRNA       Automated        Transcript products
XP_123456         Protein    Automated        Protein products
XR_123456         RNA        Automated        Transcript products
YP_123456         Protein    Auto. & Curated  Protein products
ZP_12345678       Protein    Automated        Protein products
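
Because the two-letter prefix encodes the molecule type, a RefSeq accession can be classified without a database lookup. The sketch below is one way to do that, using only the prefixes in the table above; the function name parse_accession and the returned dictionary layout are illustrative choices, not an NCBI interface. The number after the dot is treated as the record's version.

```python
# Prefix -> molecule type, copied from the RefSeq table above.
REFSEQ_MOLECULE = {
    "AC": "Genomic", "NC": "Genomic", "NG": "Genomic",
    "NT": "Genomic", "NW": "Genomic", "NZ": "Genomic",
    "NM": "mRNA", "XM": "mRNA",
    "NR": "RNA", "XR": "RNA",
    "AP": "Protein", "NP": "Protein", "XP": "Protein",
    "YP": "Protein", "ZP": "Protein",
}


def parse_accession(accession):
    """Split an accession such as 'NM_000518.4' into its parts and, for
    RefSeq-style prefixes, look up the molecule type."""
    base, _, version = accession.strip().partition(".")
    prefix, underscore, _ = base.partition("_")
    is_refseq = bool(underscore) and prefix in REFSEQ_MOLECULE
    return {
        "accession": base,
        "version": int(version) if version.isdigit() else None,
        "prefix": prefix if is_refseq else None,
        "molecule": REFSEQ_MOLECULE[prefix] if is_refseq else None,
    }


for example in ["NM_000518.4", "NP_000509.1", "NG_000007.3", "X02775"]:
    print(example, parse_accession(example))
```

Accessions without a RefSeq prefix, such as the GenBank record X02775, come back with prefix and molecule set to None, which mirrors the point that non-RefSeq identifiers say little about what they label.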

Access to sequences: Entrez Gene at NCBI Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms. RefSeq provides a curated, optimal accession number for each DNA (NM_000518.4 for beta globin DNA corresponding to mRNA) or protein (NP_000509.1) Page 29

From the NCBI home page, type “beta globin” and hit “Search” Fig. 2.5 Page 28

Fig. 2.5 Page 28 Follow the link to “Gene”

Entrez Gene is in the header Note the “Official Symbol” HBB for beta globin Note the “limits” option

Entrez Gene (top of page): Note a useful summary, and links to other databases

“Gene” page at NCBI offers a wealth of information
--Genomic context
--Bibliography
--Phenotypes
--Gene Ontology (organizing principles of biological process, molecular function, cellular component)
--Reference sequences
--Additional (non-RefSeq) sequences
--Many, many links to NCBI resources (e.g. HomoloGene)
--Many, many links to external resources
Page 29

Entrez Gene (bottom of page): non-RefSeq accessions (it’s unclear what these are, highlighting usefulness of RefSeq)

Fig. 2.8 Page 31 Entrez Protein: accession, organism, literature…

Fig. 2.8 Page 31 Entrez Protein: …features of a protein, and its sequence in the one-letter amino acid code

You should learn the one-letter amino acid code!
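
As a study aid, here is the one-letter code for the 20 standard amino acids as a lookup table, with a small helper that spells out a sequence in three-letter abbreviations. The helper name to_three_letter is an illustrative choice.

```python
# One-letter code -> three-letter abbreviation for the 20 standard amino acids.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_three_letter(sequence):
    """Expand a one-letter protein sequence; unknown letters become 'Xaa'."""
    return "-".join(ONE_TO_THREE.get(letter, "Xaa") for letter in sequence.upper())


print(to_three_letter("MVHLTPEEK"))
```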

Entrez Protein: You can change the display (as shown)…

FASTA format: versatile, compact with one header line followed by a string of nucleotides or amino acids in the single letter code Fig. 2.9 Page 32
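
Since a FASTA record is just a header line followed by sequence lines, it is easy to read with a few lines of code. The sketch below is a minimal reader, assuming the header line starts with ">" and that a sequence may be wrapped over several lines; the example headers are made up.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>example_protein made-up header
MVHLTPEEK
SAVTALWGK
>example_dna another made-up header
ACGTACGT
"""
for name, sequence in parse_fasta(example):
    print(name, len(sequence), sequence)
```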

While FASTA is one file format, there are many others
FASTA   Sequences in one-letter DNA or protein code
FASTQ   DNA sequences with quality scores for each base
BAM     Compressed binary version of SAM
SAM     Sequence Alignment/Map file (tab-delimited)
VCF     Variant Call Format (genomic variants; indels)
(See genome.ucsc.edu/FAQ/FAQformat.html for the following:)
BED     A table including chromosome, start, end
WIG     Wiggle format (displays dense, continuous data)
GFF     General Feature Format (tab-separated)
Besides Excel (.xls, .xlsx), spreadsheets can also be:
.txt    Tab-delimited text file (or space-delimited)
.csv    Comma-separated text file

FASTQ format specification The FASTQ format has four lines per read (and a file typically has millions of reads). Each record carries sequencing run information, the sequence read (like FASTA), and quality scores (per base). http://maq.sourceforge.net/fastq.shtml
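
A minimal FASTQ reader follows directly from the four-lines-per-read layout. This sketch assumes the common arrangement of an "@" header line, the bases, a "+" separator line, and a quality string, and it assumes qualities are encoded as ASCII code minus 33 (the Sanger convention; some older files use a different offset). The example read is made up.

```python
def parse_fastq(text, offset=33):
    """Yield (header, sequence, qualities) for each four-line FASTQ record.
    Qualities are decoded as ord(character) - offset."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per read")
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, [ord(ch) - offset for ch in quality]


example = "@read1 made-up run information\nACGT\n+\nII5!\n"
for name, bases, scores in parse_fastq(example):
    print(name, bases, scores)
```

With this encoding the characters "I", "5", and "!" decode to quality scores 40, 20, and 0.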

Genome Browsers: increasingly important resources Genomic DNA is organized in chromosomes. Genome browsers display ideograms (pictures) of chromosomes, with user-selected “annotation tracks” that display many kinds of information. The two most essential human genome browsers are at Ensembl and UCSC. We will focus on UCSC (but the two are equally important). The browser at NCBI is less commonly used.

Ensembl genome browser (www.ensembl.org): click human; note BioMart

enter beta globin

Ensembl output for beta globin includes views of chromosome 11 (top), the region (middle), and a detailed view (bottom). There are various horizontal annotation tracks.

A quiz question about Ensembl/BioMart For this week’s quiz/computer lab, one question asks you to use BioMart at Ensembl to find all the microRNAs on chromosome 11, as well as their GC content. There are step-by-step instructions and it should take no more than a couple minutes.
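
BioMart reports GC content for you, but it helps to know what the number means: the fraction of bases in a sequence that are G or C. The sketch below computes it for a made-up sequence; the function name gc_content is illustrative, and ambiguous bases such as N are simply left out of the denominator.

```python
def gc_content(sequence):
    """Fraction of unambiguous bases (A, C, G, T or U) that are G or C."""
    sequence = sequence.upper()
    gc = sequence.count("G") + sequence.count("C")
    at = sequence.count("A") + sequence.count("T") + sequence.count("U")
    total = gc + at
    return gc / total if total else 0.0


example = "GGCCAATT"  # made-up sequence
print(f"{100 * gc_content(example):.1f}% GC")
```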

The UCSC Genome Browser: an increasingly important resource
--This browser’s focus is on humans and other eukaryotes
--You can select which tracks to display (and how much information for each track)
--Tracks are based on data generated by the UCSC team and by the broad research community
--You can create “custom tracks” of your own data! Just format a spreadsheet properly and upload it
--The Table Browser is as important as the more visual Genome Browser, and you can move between the two
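
One common way to "format a spreadsheet properly" is to write it out as BED: a track line followed by tab-separated rows of chromosome, start, end, and an optional name. The sketch below builds such text; the coordinates and feature names are made up for illustration, and the helper name make_custom_track is hypothetical. Note that BED starts are 0-based and ends are exclusive.

```python
def make_custom_track(name, description, intervals):
    """Build BED-format custom track text from (chrom, start, end, label) rows."""
    lines = [f'track name="{name}" description="{description}"']
    for chrom, start, end, label in intervals:
        if not 0 <= start < end:
            raise ValueError(f"bad interval for {label}: {start}-{end}")
        lines.append(f"{chrom}\t{start}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


# Made-up regions on chromosome 11, for illustration only.
my_regions = [
    ("chr11", 5000000, 5001000, "regionA"),
    ("chr11", 5002500, 5003000, "regionB"),
]
track_text = make_custom_track("myData", "Example custom track", my_regions)
print(track_text)
```

Saving this text to a file gives something that could be supplied through the browser's custom track option. Because coordinates are assembly-specific, the assembly selected in the browser needs to match the one the coordinates came from.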

[1] Visit http://genome.ucsc.edu/, click Genome Browser [2] Choose organisms, enter query (beta globin), hit submit Page 36

Note that there are choices of assemblies such as hg19. An assembly (or “build”) is a fixed version of a genome. Builds are released every several years. In practice, you should always be aware whether you are using hg18 or hg19 for the human genome. They are annotated with different types of information such as experimental data sets. To learn more visit http://www.ncbi.nlm.nih.gov/assembly/basics/ http://www.ncbi.nlm.nih.gov/assembly/model/

[3] Choose the RefSeq beta globin gene (HBB)

[4] On the UCSC Genome Browser: --choose which tracks to display --add custom tracks --the Table Browser is complementary

Exploring the UCSC Genome Browser The human genome can be viewed with different “assemblies” (hg18, hg19). These contain different data sets. You can get information about a track by clicking its header (e.g. “RefSeq Genes”). You can choose the density to display each track (e.g. hide, dense, squish, pack).

You can reach the Table Browser from the Genome Browser

Use the UCSC Table Browser to get tabular data related to any of the Genome Browser tracks. It is quantitative, not visual. [Screenshot labels: assembly; species; position (where you were on the Genome Browser); check box to send a BED file to Galaxy; get output; get a summary of the output]

Step 2: click “Send query to Galaxy”

[Galaxy layout: Tools panel, Display panel, History panel] Step 3: In the history panel, click the eye to view the data in the main panel; click edit attributes to change the name to snps

Step 3 (continued): Thus, there are ~890,000 SNPs on human chromosome 11 Step 4: To get coding exons, click “Get Data” (under Tools) then “UCSC Main table browser”

Step 4 (continued): To get coding exons, set the group to Genes and the track to RefSeq Genes, then “get output”

Step 4 (continued): Set the output to “Coding Exons” then send the query to Galaxy

Step 5: In Galaxy, a new data set appears. Rename it “coding exons” by clicking the pencil. Under Tools click “Operate on Genomic Intervals.”

Step 6: Click “Intersect the intervals of two datasets.” In the central panel, choose snps and coding exons. Execute.
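
Conceptually, the intersect operation keeps each interval from the first dataset that overlaps at least one interval in the second. The sketch below shows that logic on a handful of made-up BED-style intervals (0-based start, exclusive end). It is a simple illustration with hypothetical names, not Galaxy's implementation, and it makes no attempt to be fast on genome-scale data.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def intersect(first, second):
    """Keep intervals from `first` that overlap any interval in `second`
    on the same chromosome. Intervals are (chrom, start, end) tuples."""
    by_chrom = {}
    for chrom, start, end in second:
        by_chrom.setdefault(chrom, []).append((start, end))
    kept = []
    for chrom, start, end in first:
        candidates = by_chrom.get(chrom, [])
        if any(overlaps(start, end, s, e) for s, e in candidates):
            kept.append((chrom, start, end))
    return kept


# Made-up coordinates: four SNPs and two coding exons.
snps = [("chr11", 100, 101), ("chr11", 250, 251),
        ("chr11", 400, 401), ("chr12", 100, 101)]
coding_exons = [("chr11", 90, 200), ("chr11", 380, 420)]

snps_in_exons = intersect(snps, coding_exons)
print(len(snps_in_exons), "of", len(snps), "SNPs fall in coding exons")
```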

Step 7. Here’s the answer. There are 11,652 SNPs in coding exons. Finally, click “display at UCSC main.”

Step 7. This returns us to the UCSC Genome Browser, where our Galaxy results are displayed as a custom track (entitled “User Supplied Track”).

Summary: Galaxy
Galaxy’s URL is http://usegalaxy.org. Its features include:
--Integration with UCSC and many other major genomics resources such as BioMart
--It is intuitive and does not require knowledge of computer programming, but it offers access to sophisticated software
--There are excellent videocasts and tutorials
--It fosters reproducibility. Your history and workflows are saved. See the “User” tab and sign up for a free account!

Summary: today’s lecture
[1] NCBI is a central resource for bioinformatics.
[2] We described accession numbers and the RefSeq project, which offers a trusted version of a sequence. Use Entrez Gene as a key source of gene information.
[3] We used the UCSC Genome Browser to visualize information along chromosomes.
[4] We then used the UCSC Table Browser to obtain tabular outputs of queries. The underlying data are the same in the Genome and Table Browsers.
[5] Next we used Galaxy to solve a problem, and sent the final result back to the UCSC Genome Browser.

Reminder: Please enroll! Google “moodle bioinformatics” to get here; click “Bioinformatics” to sign in; The enrollment key is…
