Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where,


1 Prof. Ray R. Larson University of California, Berkeley School of Information Developing a Metadata Infrastructure for Information Access: What, Where, When and Who?

2 Overview
• Metadata as Infrastructure
  – What, Where, When and Who?
• What are Entry Vocabulary Indexes?
  – Notion of an EVI
  – How are EVIs built
• Time Period Directories
  – Mining metadata for new metadata
• 4W Demo
• New Project: Bringing Lives to Light

3 Metadata as Infrastructure
• The difference between memorization and understanding lies in knowing the context and relationships of whatever is of interest. When setting out to learn about a new topic, a well-tested practice is to follow the traditional “5Ws and the H”: Who?, What?, When?, Where?, Why?, and How?

4 Metadata as Infrastructure
• The reference collections of paper-based libraries provide a structured environment for resources, with encyclopedias and subject catalogs, gazetteers, chronologies, and biographical dictionaries, offering direct support for at least What, Where, When, and Who.
• The digital environment does not yet provide an effective and easily exploited infrastructure comparable to the traditional reference library.

5 What?
• Searching texts by topic, e.g. Dewey, LCSH, any subject index, or category scheme applied to documents.
• Two kinds of mapping in every search:
  – Documents are assigned to topic categories, e.g. Dewey.
  – Queries have to map to topic categories, e.g. Dewey’s Relativ Index from ordinary words/phrases to Decimal Classification numbers.
• Also mapping between topic systems, e.g. US Patent classification and International Patent Classification.

6 ‘What’ searches involve mapping to controlled vocabularies. [Diagram labels: Texts; Thesaurus/Ontology]

7 Building a Search Term Recommender: Start with a collection of documents.

8 Classify and index with a controlled vocabulary, or use a pre-indexed collection. [Diagram label: Index]

9 Problem: Controlled vocabularies can be difficult for people to use.
• Index term: “pass mtr veh spark ign eng”
• In Library of Congress subject headings, use “Economic Policy” for “Wirtschaftspolitik”

10 Solution: Entry Level Vocabulary Indexes. “pass mtr veh spark ign eng” = “Automobile” [Diagram labels: Index; EVI]

11 “What” and Entry Vocabulary Indexes
• EVIs are a means of mapping from the user’s vocabulary to the controlled vocabulary of a collection of documents…

12 Building and Searching EVIs
• User has a question but is unfamiliar with the domain he wants to search.
• User selects a subject domain of interest. Domains to select from: Engineering, Medicine, Biology, Social science, etc.
• Has an Entry Vocabulary Module been built?
  – NO: Building an Entry Vocabulary Module (EVI)
    1. Download a set of training data (Internet DB indexed with a controlled vocabulary).
    2. Part of speech tagging (for noun phrases).
    3. Extract terms (words and noun phrases) from titles and abstracts.
    4. Build associations between extracted terms & controlled vocabularies.
  – YES: Searching
    1. Use an existing EVI.
    2. Map user’s query to ranked list of controlled vocabulary terms.
    3. User selects search terms from the ranked list of terms returned by the EVI.

13 Technical Details: Building an Entry Vocabulary Module (EVI)
• Download a set of training data (Internet DB indexed with a controlled vocabulary).
• Part of speech tagging (for noun phrases).
• Extract terms (words and noun phrases) from titles and abstracts.
• Build associations between extracted terms & controlled vocabularies.
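
The association-building step can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual code: the training records are invented, and plain word tokenization stands in for the part-of-speech tagging and noun-phrase extraction the slide describes. For each (term, class) pair it produces the four co-occurrence counts a, b, c, d that an association measure needs.

```python
import re
from collections import Counter

# Hypothetical training records: (title/abstract text, assigned controlled-vocabulary classes).
TRAINING = [
    ("Spark ignition engines for passenger cars", {"pass mtr veh spark ign eng"}),
    ("Fuel economy of the modern automobile", {"pass mtr veh spark ign eng"}),
    ("Monetary policy and inflation", {"Economic Policy"}),
]

def extract_terms(text):
    """Lowercased word tokens; a simplified stand-in for POS-tagged noun-phrase extraction."""
    return set(re.findall(r"[a-z]+", text.lower()))

def count_cooccurrences(records):
    """Count, per record, which terms and classes occur and which occur together."""
    term_counts, class_counts, pair_counts = Counter(), Counter(), Counter()
    for text, classes in records:
        terms = extract_terms(text)
        term_counts.update(terms)
        class_counts.update(classes)
        pair_counts.update((t, c) for t in terms for c in classes)
    return term_counts, class_counts, pair_counts, len(records)

def contingency(term, cls, term_counts, class_counts, pair_counts, n):
    """Return (a, b, c, d): records with term & class, term only, class only, neither."""
    a = pair_counts[(term, cls)]
    b = term_counts[term] - a
    c = class_counts[cls] - a
    d = n - a - b - c
    return a, b, c, d

counts = count_cooccurrences(TRAINING)
print(contingency("automobile", "pass mtr veh spark ign eng", *counts))
```

Counting per record (rather than per token) keeps the four cells summing to the number of training records, which is what a 2×2 term/class table assumes.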

14 Association Measure
        C    ¬C
  t     a    b
  ¬t    c    d
where t is the occurrence of a term and C is the occurrence of a class in the training set.

15 Association Measure
• Maximum likelihood ratio (see Dunning):
\[ W(C,t) = 2\,[\log L(p_1, a, a+b) + \log L(p_2, c, c+d) - \log L(p, a, a+b) - \log L(p, c, c+d)] \]
where
\[ \log L(p, k, n) = k \log p + (n - k) \log(1 - p) \]
and
\[ p_1 = \frac{a}{a+b}, \qquad p_2 = \frac{c}{c+d}, \qquad p = \frac{a+c}{a+b+c+d} \]
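
A minimal sketch of this measure, assuming a, b, c, d are the counts of training records with term and class, term only, class only, and neither, and that both row totals (a+b and c+d) are nonzero. The function names are illustrative, and terms of the form 0·log(0) are taken as 0.

```python
import math

def log_l(p, k, n):
    """Binomial log-likelihood k*log(p) + (n - k)*log(1 - p), treating 0*log(0) as 0."""
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - p)
    return total

def association(a, b, c, d):
    """Likelihood-ratio association W(C, t) for a 2x2 term/class table."""
    p1 = a / (a + b)                 # P(class | term present)
    p2 = c / (c + d)                 # P(class | term absent)
    p = (a + c) / (a + b + c + d)    # P(class) overall
    return 2.0 * (log_l(p1, a, a + b) + log_l(p2, c, c + d)
                  - log_l(p, a, a + b) - log_l(p, c, c + d))

print(association(10, 10, 10, 10))   # term and class independent
print(association(30, 5, 5, 60))     # term and class strongly associated
```

When the term tells you nothing about the class (p_1 = p_2 = p) the statistic is zero, and it grows as the two conditional rates diverge, so sorting classes by W for a given term yields the ranked list an EVI returns.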

16 Alternatively  Because the “evidence” terms in EVIs can be considered a document, you can also use IR techniques and use the top-ranked classes for classification or query expansion
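
One way to picture this alternative is sketched below, with each class's evidence terms treated as a small document and classes ranked by a TF-IDF weighted match against the query. The evidence lists are invented for illustration, and TF-IDF is only one of many IR weightings that could be used here; the slide does not say which technique was applied.

```python
import math
from collections import Counter

# Hypothetical evidence "documents": terms observed with each controlled-vocabulary class.
EVIDENCE = {
    "pass mtr veh spark ign eng": "automobile car engine spark ignition automobile".split(),
    "Economic Policy": "economy policy inflation monetary policy".split(),
    "Hand-to-hand fighting, oriental, in motion pictures": "kung fu movies martial arts film".split(),
}

def rank_classes(query, evidence):
    """Rank classes by length-normalized TF-IDF similarity to the query terms."""
    n = len(evidence)
    df = Counter(t for terms in evidence.values() for t in set(terms))
    idf = {t: math.log((n + 1) / (count + 1)) + 1.0 for t, count in df.items()}  # smoothed IDF
    q = Counter(query.lower().split())
    scores = {}
    for cls, terms in evidence.items():
        tf = Counter(terms)
        vec = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        dot = sum(q[t] * idf.get(t, 0.0) * vec.get(t, 0.0) for t in q)
        scores[cls] = dot / norm if norm else 0.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

for cls, score in rank_classes("kung fu movies", EVIDENCE):
    print(f"{score:.3f}  {cls}")
```

The top-ranked classes can then be offered to the user as suggested search terms or appended to the query for expansion.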

17 Find “Plutonium” in Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil. [Diagram labels: Statistical association; Digital library resources]

18 EVI example
• User query: “Automobile”
• EVI 1 index term: “pass mtr veh spark ign eng”
• EVI 2 index term: “automobiles” OR “internal combustion engines”

19 But why stop there? [Diagram labels: Index; EVI]

20 “Which EVI do I use?” [Diagram: three Index/EVI pairs]

21 EVI to EVIs [Diagram: three Index/EVI pairs and EVI 2]

22 Find “Plutonium” in Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil. Why not treat language the same way?

23 Support for the Learner with a Query
• Any resource: Audio, Images, Texts, Numeric data, Objects, Virtual reality, Webpages
• Any catalog: Archives, Libraries, Museums, TV, Publishers
Facet   Vocabulary                         Displays
WHAT    Thesaurus, e.g. LCSH               Cross-references
WHERE   Gazetteer                          Map
WHEN    Period directory                   Timeline
WHO     Biograph. dict., e.g. Who’s Who    Personal relations

24 It is also difficult to move between different media forms. [Diagram labels: Texts; Numeric datasets; Thesaurus/Ontology; EVI]

25 Searching across data types  Different media can be linked indirectly via metadata, but often (e.g. for socio-economic numeric data series) you also need to specify WHERE to get correct results

26 But texts associated with numeric data can be mapped as well… [Diagram labels: Texts; Numeric datasets; Thesaurus/Ontology; captions; EVI]

27 But there are also geographic dependencies… [Diagram labels: Texts; Numeric datasets; Thesaurus/Ontology; captions; Maps/Geo Data; EVI]

28 WHERE: Place names are problematic…
• Variant forms: St. Petersburg, Санкт Петербург, Saint-Pétersbourg, ...
• Multiple names: Cluj, in Romania / Roumania / Rumania, is also called Klausenburg and Kolozsvar.
• Name changes: Bombay → Mumbai.
• Homographs: Vienna, VA, and Vienna, Austria
  – 50 Springfields
• Anachronisms: No Germany before 1870
• Vague: e.g. Midwest, Silicon Valley
• Unstable boundaries: 19th century Poland; Balkans; USSR
• Use a gazetteer!

29 WHERE. Geo-temporal search interface. Place names found in documents. Gazetteer provided lat. & long. Places displayed on map. Timebar 

30 Zoom on map. Click on place for a list of records. Click on record to display text.

31 So geographic search becomes part of the infrastructure. [Diagram labels: Texts; Numeric datasets; Thesaurus/Ontology; Gazetteers; captions; Maps/Geo Data; EVI]

32 WHEN: Search by time is also weakly supported…
• Calendars are the standard for time
• But people use the names of events to refer to time periods
• Named time periods resemble place names in being:
  – Unstable: European War, Great War, First World War
  – Multiple: Second World War, Great Patriotic War
  – Ambiguous: “Civil war” in different centuries in England, USA, Spain, etc.
• Places have temporal aspects and periods have geographical aspects: when the Stone Age occurred varies by region

33 Vocabularies are the key!
• Want: Kung-fu movies? Use LCSH: Hand-to-hand fighting, oriental, in motion pictures.
• Linking vocabularies: WHAT, WHERE, WHEN
  – Library subject headings: Topic – Geographic subdivision – Chronological subdivision
  – Place name gazetteer: Place name – Type – Spatial markers (Lat & long) – When
  – Time Period Directory: Period name – Type – Time markers (Calendar) – Where
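
The parallel record structures of the gazetteer and the time period directory might be modeled along these lines. This is a hypothetical sketch: the class names, field names, and sample entries are assumptions chosen to mirror the fields listed on the slide, not the project's actual schema, and the coordinates are approximate.

```python
from dataclasses import dataclass, field

@dataclass
class GazetteerEntry:
    name: str            # place name
    feature_type: str    # type
    lat: float           # spatial markers
    lon: float
    when: tuple          # (start_year, end_year or None) during which the name applies

@dataclass
class PeriodEntry:
    name: str            # period name
    period_type: str     # type
    start: int           # time markers (calendar years)
    end: int
    where: list          # place names, resolvable through the gazetteer
    variants: list = field(default_factory=list)

# Illustrative entries only.
GAZETTEER = [
    GazetteerEntry("Bombay", "city", 19.08, 72.88, (None, 1995)),
    GazetteerEntry("Mumbai", "city", 19.08, 72.88, (1995, None)),
]
DIRECTORY = [
    PeriodEntry("First World War", "war", 1914, 1918, ["Europe"], ["European War", "Great War"]),
    PeriodEntry("Second World War", "war", 1939, 1945, ["Europe"]),
]

def periods_for(place, year, directory):
    """Named periods that cover the given year and are tied to the given place."""
    return [p.name for p in directory if place in p.where and p.start <= year <= p.end]

print(periods_for("Europe", 1916, DIRECTORY))
```

Because each record type carries a pointer into the other vocabulary (When in the gazetteer, Where in the period directory), a search can start from either a place or a period and reach the other.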

34 Time period directories link via the place (or time). [Diagram labels: Texts; Numeric datasets; Thesaurus/Ontology; Gazetteers; captions; Maps/Geo Data; EVI; Time Period Directory; Time lines, Chronologies]

35 WHEN: Time Period Directory Timeline Link to Catalog Link to Wikipedia

36 WHO: Biographical Dictionary
• Complex relationships
• Life events metadata
  – WHAT: Actions (prisoner)
  – WHERE: Places (Holstein)
  – WHEN: Times (1261-1262)
  – WHO: People (Margaret Sambiria)
• Need external links

37 Any document, object, or performance: connect it with its context – and other resources.
• Any resource: Audio, Images, Texts, Numeric data, Objects, Virtual reality, Webpages
• Any catalog: Archives, Libraries, Museums, TV, Publishers
Facet   Vocabulary                         Displays
WHAT    Thesaurus, e.g. LCSH               Cross-references
WHERE   Gazetteer                          Map
WHEN    Period directory                   Timeline
WHO     Biograph. dict., e.g. Who’s Who    Personal relations

38 Demo of search interface

39 Entry Vocabulary Index suggests correct LCSH with different spelling

40 Related places

41 Potentially related people

42 Potentially related periods

43 Mostly in India, 16th-18th century

44 Find out more about this area.

45 Different Browsing Options!

46 Zooming in to South Asia Restricting time frame Select

47 More information about the country of India…

48 Wikipedia, CIA Factbook, BBC, Ethnologue, Berkeley Natural History Museums

49 Historical events, linked to Library catalog & Wikipedia: none available for this time period

50 ECAI Cultural Atlases: presenting history in its geographical & chronological contexts

51 Mongol Empire Video

52 Demo Interface  http://ecai.berkeley.edu/imls2004/imls4w/

53 New Project: Bringing Lives to Light: Biography in Context Ray R. Larson, Michael Buckland, Fredric Gey University of California, Berkeley

54 Overview  Focussing on the Who in Who, What, Where and When  Types of Biographical Markup

55 WHEN, WHERE and WHO  Catalog records found from a time period search commonly include names of persons important at that time. Their names can be forwarded to, e.g., biographies in the Wikipedia encyclopedia.

56 Place and time are broadly important across numerous tools and genres, including language atlases, library catalogs, biographical dictionaries, bibliographies, archival finding aids, museum records, etc.
• Biographical dictionaries are also heavy on place and time: Emanuel Goldberg, Born Moscow 1881. PhD under Wilhelm Ostwald, Univ. of Leipzig, 1906. Director, Zeiss Ikon, Dresden, 1926-33. Moved to Palestine 1937. Died Tel Aviv, 1970.
• Life as a series of episodes involving Activity (WHAT), WHERE, WHEN, and WHO else.
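
The idea of a life as a series of episodes can be made concrete with a small record type. The sketch below is an assumption about how such episodes could be represented, not the project's markup; the `LifeEvent` class and `events_in` helper are hypothetical, and the data is taken from the Emanuel Goldberg entry on the slide.

```python
from dataclasses import dataclass

@dataclass
class LifeEvent:
    what: str          # activity
    where: str         # place
    when: tuple        # (start_year, end_year)
    who: tuple = ()    # other people involved

GOLDBERG = [
    LifeEvent("born", "Moscow", (1881, 1881)),
    LifeEvent("PhD", "Univ. of Leipzig", (1906, 1906), ("Wilhelm Ostwald",)),
    LifeEvent("Director, Zeiss Ikon", "Dresden", (1926, 1933)),
    LifeEvent("moved", "Palestine", (1937, 1937)),
    LifeEvent("died", "Tel Aviv", (1970, 1970)),
]

def events_in(events, start, end):
    """Episodes whose time span overlaps the inclusive year range [start, end]."""
    return [e for e in events if e.when[0] <= end and e.when[1] >= start]

for event in events_in(GOLDBERG, 1920, 1940):
    print(event.when, event.what, event.where)
```

With each episode carrying its own WHAT, WHERE, WHEN, and WHO, the same record can be reached from a map, a timeline, or another person's entry.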

57 A new form of biographical dictionary would link to all. [Diagram labels: Texts; Numeric datasets; Thesaurus/Ontology; Gazetteers; captions; Maps/Geo Data; EVI; Time Period Directory; Time lines, Chronologies; Biographical Dictionary]

58 Projected Work
• Develop XML markup for Biographical Events
• Most likely to be an adaptation and extension of existing biographical event markup
  – Example: EAC/EAD
• Harvest biographical resources
  – Wikipedia, etc.
• Integrate as next generation of current interface

59 EAC/EAD Biographical Note
1892, May 7    Born, Glencoe, Ill.
1915           A.B., Yale University, New Haven, Conn.
1916           Married Ada Hitchcock
1917-1919      Served in United States Army

60 Wikipedia data
• Life events metadata
  – WHAT: Actions (prisoner)
  – WHERE: Places (Holstein)
  – WHEN: Times (1261-1262)
  – WHO: People (Margaret Sambiria)
• Need external links

61

62 A Metadata Infrastructure
• CATALOGS: Archives, Historical Societies, Libraries, Museums, Public Television, Publishers, Booksellers
• RESOURCES: Audio, Images, Numeric Data, Objects, Texts, Virtual Reality, Webpages
• INTERMEDIA INFRASTRUCTURE:
Facet   Authority Control          Special Display Tools
WHAT    Thesaurus                  Syndetic Structure
WHERE   Gazetteer                  Maps
WHEN    Time Period Directory      Timelines
WHO     Biographical Dictionary
• Other diagram labels: Learners; Dossiers

63 Acknowledgements
• Electronic Cultural Atlas Initiative project
• This work is being supported by the Institute of Museum and Library Services through a National Leadership Grant for Libraries
• Contact: ray@ischool.berkeley.edu

