Scientific publications and archives: media, content and access Lesk, Ch 3 (Lesk, 2008)


1 Scientific publications and archives: media, content and access Lesk, Ch 3 (Lesk, 2008)

2 Scientific literature Scientific publications began as interpersonal communications – lectures, seminars and discussions – that is, oral communication. Formal written articles and books constitute the scientific literature. Today: journals, presentations at meetings, books, book chapters, Web material, films, radio and television programs, podcasts. Formal academic publications must pass the test of ‘peer review’ – quality control. Before the Internet, scientific literature appeared on paper (journals). Today, journals appear electronically as well as on paper (some readers rarely visit a library to read journals). The result: delocalized literature delivery and computational methods of information retrieval.

3 Economic factors governing access to scholarly publications Traditional economic model of scientific journals: a scientific organization or publisher produces and distributes, at regular intervals, a paper-bound ‘issue’ of articles. – Costs: editorial office; preparation of manuscripts; printing/distribution. – Support (income): sales (subscriptions), page charges to authors, donations, subsidies, advertisements, etc. Recent changes: – More papers are published, driving up costs. – The larger volume of publication puts libraries under financial pressure. – Electronic facilities reduce costs. – Electronic distribution extends the potential format of journal articles. – The user community supports open access.

4 Open access / traditional and digital libraries Open access: a redefinition of the author/publisher/reader relationship. – Retains the peer-review process. – Accepted articles are placed on the Web, with free access. – Authors retain copyright (instead of the publisher). – Costs of publication are transferred from readers to authors. Traditional libraries – you know what they are. Digital libraries. – Electronic form, on-line. – Raise economic questions. – Large-scale digital libraries by scanning?

5 The information explosion / Databases Efficient delivery can be a mixed blessing: it is impossible for anyone to read all the literature in a given field. The Web adds a further dimension – literature is no longer linear, and there are new media, new ways of searching, bibliography management, and ways of organizing and sharing the harvest. Databases: contents, ontology, logical structure, format of the data, routes for retrieval of data, links to other resources. Literature as a database: e.g. Medline (Medical Literature Analysis and Retrieval System Online), a bibliographic database that is now part of PubMed.

6 Databases Database organization / design – e.g. design of a relational database of amino acids. Annotation: a typical entry in a molecular biology database might contain information beyond the primary data (say, gene sequences). – Reference information (citations of publications). – Interpretative information. – Links to other information. Database quality control (errors?) – “Get it right the first time”: database curation and annotation – a new profession. – Identifying errors – external curators / users. – Tracking database changes.
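The relational design mentioned on this slide can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the table names, column names, and the `codons_for` helper are assumptions made for the example, not the schema used in Lesk's text, and only a handful of rows are loaded.

```python
import sqlite3

# Hypothetical two-table schema: amino acids, and the codons that encode them.
# All table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE amino_acid (
    one_letter   TEXT PRIMARY KEY,
    three_letter TEXT NOT NULL UNIQUE,
    name         TEXT NOT NULL
);
CREATE TABLE codon (
    triplet    TEXT PRIMARY KEY,
    amino_acid TEXT NOT NULL REFERENCES amino_acid(one_letter)
);
""")

# A small subset of the standard genetic code (RNA alphabet).
conn.executemany(
    "INSERT INTO amino_acid VALUES (?, ?, ?)",
    [("G", "Gly", "glycine"), ("A", "Ala", "alanine"), ("W", "Trp", "tryptophan")],
)
conn.executemany(
    "INSERT INTO codon VALUES (?, ?)",
    [("GGU", "G"), ("GGC", "G"), ("GGA", "G"), ("GGG", "G"),
     ("GCU", "A"), ("UGG", "W")],
)


def codons_for(name):
    """Return the loaded codons for an amino acid, joined across both tables."""
    rows = conn.execute(
        "SELECT c.triplet FROM codon AS c "
        "JOIN amino_acid AS a ON c.amino_acid = a.one_letter "
        "WHERE a.name = ? ORDER BY c.triplet",
        (name,),
    ).fetchall()
    return [row[0] for row in rows]


print(codons_for("glycine"))
```

Keeping codons in their own table, keyed back to `amino_acid`, avoids repeating the amino acid's names on every codon row; that is the kind of decision "database design" refers to here.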

7 Databases Database access: an issue to consider. Links (utility of a database): internal links and external links. Database interoperability: what about questions that require appeal to multiple databases at once? – Merge several databases? – Methods for intercommunication between databases? Data mining. – Knowledge discovery: description/explanation. – Successful forecasting / predictive modeling. – Statistical techniques. – Artificial neural networks. – Support vector machines.

8 Programming languages and tools Traditional programming languages: FORTRAN, C, C++. Scripting languages: Perl, Python, Ruby… Program libraries: standard libraries (numerical analysis and text processing) and libraries specialized for molecular biology (e.g. bioperl.org). Java – Java Virtual Machine – computing over the Web? Markup languages, e.g. XML: a way to implement data structures.
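As a rough illustration of how a markup language can carry a data structure, the sketch below parses a small XML record with Python's standard `xml.etree.ElementTree` module. The element names, attribute names, and the entry itself are invented for the example and do not follow any real database's XML schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical database entry; every tag, attribute and value is made up.
XML_ENTRY = """
<entry id="EXAMPLE_1">
  <name>hypothetical protein</name>
  <organism>Homo sapiens</organism>
  <sequence>MKTAYIAKQR</sequence>
  <link db="literature" ref="example-citation"/>
  <link db="structure" ref="example-structure"/>
</entry>
"""


def parse_entry(xml_text):
    """Turn the XML record into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    return {
        "id": root.get("id"),
        "name": root.findtext("name"),
        "organism": root.findtext("organism"),
        "sequence": root.findtext("sequence"),
        # Each <link> element becomes a (database, reference) pair.
        "links": [(e.get("db"), e.get("ref")) for e in root.findall("link")],
    }


entry = parse_entry(XML_ENTRY)
print(entry["id"], len(entry["sequence"]), entry["links"])
```

The nesting of tags mirrors the nesting of the data structure, which is why the same record can be read by programs written in any of the languages listed on the slide.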

9 Natural language processing Natural language: verbal-oral and/or textual forms of human-human communication. Natural language processing has long been a goal of computing. Difficulty: ambiguity of words and phrases. Identifying keywords and combinations of keywords: e.g. names of genes and names of diseases. Knowledge extraction: protein-protein interactions (automatic text-mining software). Text mining: – Identification of references to genes and proteins. – Identification of interactions. – Interaction networks and diseases. – Hypothesis generation (unsuspected relationships between genes and diseases).
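The keyword-combination idea on this slide can be sketched as a sentence-level co-occurrence count. This is a deliberately naive illustration: the tiny gene and disease vocabularies, the sample text, and the `cooccurrences` function are assumptions for the example, and real text-mining software must also handle synonyms and the ambiguity the slide mentions.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical, hand-picked vocabularies; real systems use curated lexicons.
GENES = {"BRCA1", "TP53", "CFTR"}
DISEASES = {"breast cancer", "cystic fibrosis"}


def cooccurrences(text):
    """Count (gene, disease) pairs that are mentioned in the same sentence."""
    counts = Counter()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        # Gene symbols are matched case-sensitively as whole words.
        genes = {g for g in GENES if re.search(rf"\b{re.escape(g)}\b", sentence)}
        diseases = {d for d in DISEASES if d in lowered}
        for pair in product(sorted(genes), sorted(diseases)):
            counts[pair] += 1
    return counts


sample = (
    "Mutations in BRCA1 increase the risk of breast cancer. "
    "TP53 is also frequently altered in breast cancer. "
    "CFTR variants cause cystic fibrosis."
)
for pair, n in sorted(cooccurrences(sample).items()):
    print(pair, n)
```

Pairs that co-occur often across many abstracts are candidates for the "hypothesis generation" step; co-occurrence alone does not establish an interaction.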

10 Archives and information retrieval Lesk, Ch 4 (Lesk, 2008)

11 Database indexing and specification of search terms An index: a set of pointers to information in a database. Information retrieval programs accept multiple query terms and keywords. It is possible to ask for logical combinations of indexing terms; many database search engines allow complex logical expressions. Follow-up questions: modifying a query, cumulative searches, links between entries in different databases. Analysis and processing of retrieved data: using results retrieved in one search as input for another one (some information retrieval systems provide such facilities).
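An index as "a set of pointers" and logical combinations of terms can be sketched with an inverted index over a few records. The entry identifiers, their text, and the `must` / `any_of` / `exclude` parameter names are assumptions for this example, not the query syntax of any particular retrieval system.

```python
from collections import defaultdict

# Hypothetical entries: identifier -> free-text description.
ENTRIES = {
    "E1": "human hemoglobin alpha chain",
    "E2": "horse hemoglobin beta chain",
    "E3": "human myoglobin",
}


def build_index(entries):
    """Map each word to the set of entry identifiers that contain it."""
    index = defaultdict(set)
    for entry_id, text in entries.items():
        for word in text.lower().split():
            index[word].add(entry_id)
    return index


def search(index, all_ids, must=(), any_of=(), exclude=()):
    """Combine terms logically: AND over must, OR over any_of, NOT over exclude."""
    hits = set(all_ids)
    for term in must:
        hits &= index.get(term.lower(), set())
    if any_of:
        hits &= set().union(*(index.get(t.lower(), set()) for t in any_of))
    for term in exclude:
        hits -= index.get(term.lower(), set())
    return hits


index = build_index(ENTRIES)
print(search(index, ENTRIES, must=["hemoglobin"], exclude=["horse"]))
print(sorted(search(index, ENTRIES, must=["human"], any_of=["hemoglobin", "myoglobin"])))
```

Because each query returns a set of identifiers, the result of one search can be passed as `all_ids` to the next, which is one simple way to picture the cumulative searches mentioned on the slide.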

12 Nucleic acid sequence databases Archiving of bioinformatics data was originally carried out by individual research groups. As requirements grew, projects became very large-scale. Primary data collections related to biological macromolecules: – Nucleic acid sequences, including whole-genome projects. – Amino acid sequences of proteins. – Protein and nucleic acid structures. – Small-molecule crystal structures. – Protein functions. – Expression patterns of genes. – Networks: of metabolic pathways, of gene and protein interactions, and of control cascades. – Publications.

13 Nucleic acid sequence databases A triple partnership of the National Center for Biotechnology Information (USA), EMBL-Bank (European Bioinformatics Institute, UK) and the DNA Data Bank of Japan (National Institute of Genetics, Japan) curates, archives and distributes DNA and RNA sequences. Entries have a life history: – Unannotated -> Preliminary -> Unreviewed -> Standard. A sample entry includes properties of specific regions (e.g. coding sequences, and regions that perform or affect function, interact with other molecules, affect replication, etc.).

14 Genome databases and genome browsers Genome browsers (full-genome sequences): databases that bring together all molecular information available about a particular species. E.g. ensembl.org: intended to be the universal information source for the human and other genomes.

15 Protein sequence databases In 2002, three protein sequence databases, the Protein Information Resource (PIR, USA), SWISS-PROT (Switzerland) and TrEMBL (Europe), formed the UniProt consortium. They share the database but continue to offer separate information-retrieval tools for access. Databases associated with SWISS-PROT: – ENZYME DB and PROSITE. PIR and associated databases: – PIRSF: protein family classification system. – iProClass: protein knowledge, access to over 90 biological databases. – iProLINK: gateway to protein literature.

16 Databases of protein families Evolutionary relationships / homology detection. Two full-length protein sequences (>=100 residues) that have >=25% identical residues in an optimal alignment are likely to be related. This requires sequence alignment algorithms. A group of related proteins is referred to as a family.
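The rule of thumb on this slide can be sketched as a check on a pair of sequences that has already been aligned. Several assumptions are made here: the alignment is supplied by some other tool (computing the optimal alignment is not shown), gaps are written as `-`, and percent identity is taken over the aligned columns without gaps, which is one of several conventions in use. The function names are illustrative.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def likely_related(aligned_a, aligned_b, min_length=100, min_identity=25.0, gap="-"):
    """Apply the slide's thresholds: both sequences >=100 residues, >=25% identity."""
    length_a = len(aligned_a.replace(gap, ""))
    length_b = len(aligned_b.replace(gap, ""))
    if min(length_a, length_b) < min_length:
        return False
    return percent_identity(aligned_a, aligned_b, gap) >= min_identity


# Toy alignment: one gap column, one mismatch, five matches.
print(round(percent_identity("ACDE-FG", "ACDQKFG"), 1))
```

Passing the thresholds only suggests that the two proteins are related; short sequences, or pairs below 25% identity, need further evidence either way.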

17 Databases of structures Structure databases archive, annotate and distribute sets of atomic coordinates. World-wide Protein Data Bank (wwPDB.org). – Joint effort of the Research Collaboratory for Structural Bioinformatics (RCSB) and the Protein Data Bank Japan. – Contains the structures of proteins. – It overlaps several other databases. Several websites offer hierarchical classifications of all proteins of known structure – SCOP, CATH, DALI, CE.

18 Other databases Classification and assignment of protein function. – The Enzyme Commission. – The Gene Ontology Consortium protein function classification. Specialized, or ‘boutique’, databases. Expression (mRNA levels) and proteomics databases (interpretation in terms of protein patterns). Databases of metabolic pathways (flow of molecules and energy through pathways of chemical reactions). Bibliographic databases. These are only a few of the many databases…

