Published by Bartholomew Quinn. Modified over 8 years ago.
1
Introduction to medical informatics. Information retrieval
2
Alpe Adria Master Course :: Medical Informatics :: Dr. J. Dimec - Introduction, information retrieval
The reasons for information explosion
An important characteristic of research work is its historic continuity. Scientists use the knowledge produced by their predecessors. Scientific knowledge spreads by publishing. All research policies are based on publishing ("publish or perish"). The information explosion is rooted in the last four centuries of scientific work but happened in the 20th century.
3
Manifestations of information explosion
Many phenomena in scientific information grow exponentially:
- number of classical (paper) publications,
- number of e-documents,
- number of web servers,
- usage of web information tools...
4
How did it all start?
From: Derek J. de Solla Price. Little Science, Big Science. Columbia University Press, 1963.
The first scientific societies established scientific journals (2nd half of the 17th century). Within 100 years exponential growth was established. Within 200 years information tools were needed. The first abstract journals appeared in the 1st half of the 19th century.
5
History of scientific information in the 20th century
- 1950s: Specialised information centres and services; fast growth of secondary bibliographic publications.
- 1960s: Introduction of computers to scientific information. First bibliographic databases.
- 1970s: On-line access to bibliographic databases. Successful experiments in automatic indexing.
- 1980s: Bibliographic databases on CD-ROMs.
- 1990s and later: Access to information is independent of its location. Networks prevail.
6
Manifestations of information explosion
[Figure: number of Internet hosts.]
7
Manifestations of information explosion
[Figure: number of web sites (1993 to 2003).]
8
The role of scientific information
Information tools alleviate the consequences of the information explosion. Approximately 10% to 15% of research deals with problems that have already been solved and whose results have already been published. Without information tools this proportion would be much higher, and the absolute amount of new research done would be much lower.
9
Types of information tools
- Library catalogues (originating in ancient times),
- secondary publications,
- bibliographic databases,
- document (full-text) databases, mostly on the Web:
  - access organised by web search engines,
  - access organised by web directories,
  - documents organised by digital libraries,
- compound web information tools (web portals).
The central problem of them all: how to describe the document's content.
10
Secondary publications
The oldest of the "modern" information tools. Designed to help users find primary documents. The "pointers" to primary documents were bibliographic records. Users were directed to bibliographic records:
- from abstracts of primary documents (abstract journals),
- from subject descriptions of primary documents with key-words or key-phrases (index journals).
Index Medicus is a well-developed index journal.
11
Bibliographic databases
Basically computerised secondary (index) journals. Born in the early 1960s. Great advantage: searching instead of browsing, and searching through long periods of time instead of a single issue.
12
Bibliographic databases
Ways of usage:
- the user needs information on a specific subject: search using a subject description (key-words, descriptors);
- the user needs all papers of a specific author, research group or institution: production of bibliographies;
- policy makers need to assess the quality of an individual, group or institution and use citation analysis of their work (parallel use of a bibliographic database and the Science Citation Index).
13
Bibliographic databases
Connection with the library:
- in a bibliographic database we learn about the existence of a document that suits our information need;
- in a library we get the document, possibly by interlibrary loan.
In recent years bibliographic records increasingly serve as hypertext pointers to full e-documents.
14
Bibliographic database vs. library catalogue
A bibliographic database is not a library catalogue. The essence of a library catalogue is its location data: the positions and holdings of library units. Library catalogues normally hold data on books, proceedings and journals, and very rarely on journal articles. Bibliographic databases and library catalogues are complementary.
15
Describing the subject of documents
Most information needs can be fulfilled with documents on a specific subject. To find such documents we need to describe their subject in a database prior to searching. The procedure is called indexing. In bibliographic databases it is done intellectually ("manually"). In big databases of full-text documents it is done automatically.
16
Indexing bibliographic databases
Indexing and searching are mirror images of the same procedure. While indexing a document D, the indexer tries to guess the key-words that a searcher would use to find documents with a subject like that of D. While searching for documents with subject S, the searcher tries to guess the key-words that an indexer would use to index documents with subject S.
17
Thesaurus
Guessing is relatively easy if the indexer and the searcher use key-words (descriptors) from the same thesaurus. A thesaurus is a list of subject concepts and instructions for their use. Subject concepts in a thesaurus are connected by semantic relations, most often hierarchical relations.
18
Thesaurus
What are subject concepts? The smallest units of knowledge, written as words or phrases. Each subject concept has an independent meaning and describes a distinct object or conception. Each subject concept includes all of its synonyms and lexical variants. One synonym or variant is chosen as the "preferred term" and is called the descriptor.
19
Thesaurus
Descriptors form an artificial language, which is used for indexing and searching:
- for each object or conception only one descriptor exists (control of synonyms), and
- each descriptor describes only one object or conception (control of homonyms).
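Control of synonyms can be sketched as a lookup from every entry term to one descriptor. The following is only an illustration: the terms and the names `ENTRY_TERMS` and `to_descriptor` are made up for this example and are not taken from any real thesaurus.

```python
# Hypothetical miniature thesaurus: every entry term (synonym or lexical
# variant) maps to exactly one preferred term, the descriptor.
ENTRY_TERMS = {
    "heart attack": "myocardial infarction",
    "myocardial infarct": "myocardial infarction",
    "myocardial infarction": "myocardial infarction",
    "high blood pressure": "hypertension",
    "hypertension": "hypertension",
}


def to_descriptor(term):
    """Return the descriptor for a term, or None if the thesaurus lacks it."""
    return ENTRY_TERMS.get(term.strip().lower())


# The indexer and the searcher start from different words
# but arrive at the same descriptor.
indexer_choice = to_descriptor("Heart attack")
searcher_choice = to_descriptor("myocardial infarct")
print(indexer_choice, "|", searcher_choice)
```

Because both parties are mapped to the same descriptor, neither has to guess which synonym the other one used.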
20
Document databases
A bibliographic record is only a substitute for the real carrier of information, the document. A bibliographic record is an implicit pointer to a document. An information need can be fulfilled only with a full document. Contemporary textual databases organise full documents instead of bibliographic records.
21
Document databases
The gap between the worlds of bibliographic and document databases is closing:
- good news: bibliographic databases include more and more pointers to full documents;
- bad news: users can access the full document only if they or their institution have a commercial contract with the publisher.
22
Subject description in document databases
Intellectual indexing is expensive and takes time. An indexer is a rare kind of person, with at least superficial knowledge of the domain the documents come from and profound knowledge of information tools and procedures. Since the 1970s we have had very efficient algorithms for automatic indexing of full texts.
23
Automatic indexing
The subject of a document is represented by the document itself, not by the indexer's understanding of the document. Automatic procedures select from the document the words that best represent its subject. The most successful are statistical procedures. Natural language understanding (by computer) is not economical for large numbers of documents. A large part of automatic indexing methods are language-dependent.
24
Automatic indexing
Usual steps in automatic indexing:
- use of stop-words (leaving out words that carry no information: conjunctions, prepositions, pronouns…);
- stemming (normalisation of word forms into a common form, the stem);
- weighting of the remaining stems (calculating the amount of information in them).
Stemming is not as important for English as it is for more inflected languages, e.g. Slavonic languages.
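The first two steps can be sketched in a few lines of Python. This is a toy under stated assumptions: the stop-word list is tiny, and `crude_stem` is a hypothetical suffix stripper invented for this example, not a real stemming algorithm. The third step, weighting, is reduced here to raw stem counts.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; real lists hold hundreds of words.
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to", "with"}


def crude_stem(word):
    """Toy stemmer: strip a plural ending if at least 3 letters remain."""
    for suffix in ("es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Lower-case, tokenise, drop stop-words, stem the remaining words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


stems = index_terms("Pictures of thorax in the anatomy atlases on the web")
counts = Counter(stems)  # raw frequencies, the input to any weighting step
print(stems)
```

A stemmer for a strongly inflected language would need far more rules than this, which is one reason automatic indexing methods tend to be language-dependent.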
25
Automatic indexing
Weighting of word stems: A stem that represents an important subject of a document should have a high weight. The weights are calculated from the frequencies of word stems in the document and in the document collection. A high weight is attributed to a stem that is frequent in the document and is found in a small number of documents.
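One common way to express this rule is the tf-idf weight: the frequency of the stem in the document multiplied by the logarithm of the inverse document frequency. The slides do not name a specific formula, so the sketch below is only one possible instance of the idea, and the sample documents are invented.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """docs: list of stem lists. Returns one {stem: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each stem occur?
    doc_freq = Counter(stem for doc in docs for stem in set(doc))
    weights = []
    for doc in docs:
        term_freq = Counter(doc)
        weights.append(
            {s: term_freq[s] * math.log(n_docs / doc_freq[s]) for s in term_freq}
        )
    return weights


# Invented mini-collection of already stemmed documents.
docs = [
    ["thor", "anatomi", "thor", "lectur"],
    ["anatomi", "pictur"],
    ["anatomi", "web"],
]
w = tf_idf_weights(docs)
print(w[0])
```

In the first document "thor" gets the highest weight (frequent here, absent elsewhere), while "anatomi" gets weight zero because it occurs in every document and so cannot tell them apart.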
26
Search models
Boolean search model: Used mostly for searching in databases where the subject is described with a few key-words or key-phrases (e.g. bibliographic databases). It divides the database into two simple sets: relevant documents (hits) and non-relevant documents. Relevancy is a binary property.
Non-Boolean search model: Relevancy is a non-binary property; documents can be more or less relevant.
27
Boolean search model
Boolean operators AND, OR and NOT.
- Query diabetes AND insulin finds documents that contain both descriptors.
- Query diabetes OR insulin finds documents that contain at least one of the descriptors.
- Query diabetes NOT insulin finds documents that contain the first descriptor but not the second one.
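The three operators correspond directly to set intersection, union and difference. A minimal sketch, assuming a hypothetical index that maps each descriptor to the set of document numbers indexed with it (the numbers are invented):

```python
# Hypothetical index: descriptor -> set of document numbers indexed with it.
INDEX = {
    "diabetes": {1, 2, 3},
    "insulin": {2, 3, 4},
}

both = INDEX["diabetes"] & INDEX["insulin"]        # diabetes AND insulin
either = INDEX["diabetes"] | INDEX["insulin"]      # diabetes OR insulin
only_first = INDEX["diabetes"] - INDEX["insulin"]  # diabetes NOT insulin

print(sorted(both), sorted(either), sorted(only_first))
```

Every document is either in the result set or not, which is exactly the binary notion of relevancy of the Boolean model.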
28
Boolean search model
Critique of the Boolean model: With the query d1 AND d2 AND d3 AND d4, only documents that contain all four descriptors will be found. A document with three or even two of the descriptors could probably be useful, but it will never be among the hits.
29
Boolean search model
Critique of the Boolean model: With the query d1 OR d2 OR d3 OR d4, all documents that contain at least one of the descriptors will be found. All the found documents will be served to the searcher as equivalent. Documents with all four descriptors would probably be more relevant than documents with only one of them.
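Both weaknesses show up on a small invented collection. In the sketch below the document names and descriptor sets are hypothetical; the AND query misses a document with three of the four descriptors, and the OR query returns its hits as an unordered set even though they match the query to very different degrees.

```python
# Hypothetical collection: document name -> descriptors assigned to it.
DOCS = {
    "A": {"d1", "d2", "d3", "d4"},
    "B": {"d1", "d2", "d3"},
    "C": {"d1"},
    "D": {"d5"},
}
QUERY = {"d1", "d2", "d3", "d4"}

# d1 AND d2 AND d3 AND d4: the document must contain every descriptor.
and_hits = {name for name, descs in DOCS.items() if QUERY <= descs}

# d1 OR d2 OR d3 OR d4: the document must contain at least one descriptor.
or_hits = {name for name, descs in DOCS.items() if QUERY & descs}

# What the Boolean model ignores: how many descriptors each hit matches.
matches = {name: len(QUERY & DOCS[name]) for name in or_hits}

print(and_hits, or_hits, matches)
```

Document B, with three matching descriptors, is lost by the AND query, while the OR query treats A (four matches) and C (one match) as equivalent.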
30
Non-Boolean search models
Best for searching in databases with many key-words per document (probably the result of automatic indexing). Relevancy is computed as the similarity between the query and the document. Similarity is computed from the number of words (stems) that appear in both the query and the document. The relevancy score is more precise if the weights of the words (stems) common to the query and the document are used.
31
Non-Boolean search models
If relevancy is a non-binary property, the user can be served a list of hits ranked by relevancy score; the user inspects the documents starting at the top of the list and descending until relevant documents become too rare. Usually the user has to inspect a smaller number of documents to find the same number of relevant ones than with Boolean searches. This is how the big web search engines work, e.g. Google, AltaVista, Teoma…
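A ranked list can be sketched with the simplest similarity measure mentioned above, the number of stems shared by the query and the document. The function name `rank` and the sample documents are assumptions made for this illustration; real engines use weighted and far more elaborate scores.

```python
def rank(query_stems, documents):
    """Return (doc_id, score) pairs, best first; score = number of shared stems."""
    scored = [(doc_id, len(query_stems & stems)) for doc_id, stems in documents.items()]
    hits = [(doc_id, score) for doc_id, score in scored if score > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Invented collection of already stemmed documents.
documents = {
    "X": {"diabet", "insulin", "therapi"},
    "Y": {"insulin", "pump"},
    "Z": {"thorax", "anatomi"},
}
ranking = rank({"diabet", "insulin", "therapi"}, documents)
print(ranking)
```

Document Y is still returned although it matches only one query stem, but it is listed below X, so the user meets the most promising document first.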
32
Relevancy score: an example
Q: Pictures of thorax in the anatomy atlases on the web.
D1: Department of the anatomy of thorax got new lecture room.
D2: Educational collection with the pictures of thorax now on the web.

document #   common stems                  stem weights   relevancy
D1           anatomi, thor                 2, 3           sum: 5
D2           pictur, anatomi, thor, web    4, 5, 9, 3     sum: 21
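The sums in this example can be reproduced with a short script. The pairing of each weight with a stem follows the order in which they are listed on the slide, which is an assumption; the stem "atlas" in the query and the names `WEIGHTS`, `QUERY_STEMS` and `relevancy` are likewise introduced only for this sketch.

```python
# Stem weights per document, paired in the order listed in the example
# (the pairing is an assumption; the slide gives only the two sequences).
WEIGHTS = {
    "D1": {"anatomi": 2, "thor": 3},
    "D2": {"pictur": 4, "anatomi": 5, "thor": 9, "web": 3},
}

# Stems of the query Q after stop-word removal ("atlas" is assumed).
QUERY_STEMS = {"pictur", "thor", "anatomi", "atlas", "web"}


def relevancy(doc_weights, query_stems):
    """Sum the weights of the stems the document shares with the query."""
    return sum(w for stem, w in doc_weights.items() if stem in query_stems)


scores = {doc: relevancy(w, QUERY_STEMS) for doc, w in WEIGHTS.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```

D2 scores 21 against 5 for D1, so it is placed first in the hit list.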