1 2010.04.05 - SLIDE 1, IS 240 – Spring 2010
Prof. Ray Larson, University of California, Berkeley, School of Information
Principles of Information Retrieval
Lecture 19: DLs and GIR

2 2010.04.05 - SLIDE 2IS 240 – Spring 2010 Today Digital Libraries and IR Image Retrieval in DL From paper presented at the 1999 ASIS Annual Meeting More on Geographic Information Retrieval

3 UCB Digital Library Project: Research Agenda
Funded by the NSF/NASA/DARPA Digital Library Initiative (Phases I and II), ~1993–2004.
Research agenda:
–Understand user needs.
–Extend functionality of documents. “Enliven” legacy documents.
–Improve access to information.
–Scale to large systems.
–Re-invent scholarly information access and use.

4 Testbed: An Environmental Digital Library
Collection: Diverse material relevant to California’s key habitats.
Users: A consortium of state agencies, development corporations, private corporations, regional government alliances, educational institutions, and libraries.
Potential: Impact on the state-wide environmental system (CERES).

5 The Environmental Library - Users/Contributors
–California Resources Agency, California Environmental Resources Evaluation System (CERES)
–California Department of Water Resources
–The California Department of Fish & Game
–SANDAG
–UC Water Resources Center Archives
–New partners: CDL and SDSC

6 The Environmental Library - Contents
–Environmental technical reports, bulletins, etc.
–County general plans
–Aerial and ground photography
–USGS topographic maps
–Land use and other special purpose maps
–Sensor data
–“Derived” information
–Collection databases for the classification and distribution of the California biota (e.g., SMASCH)
–Supporting 3-D, economic, traffic, etc. models
–Videos collected by the California Resources Agency

7 2010.04.05 - SLIDE 7IS 240 – Spring 2010 The Environmental Library - Contents As of mid 1999, the collection represents about three quarters of a terabyte of data, including over 70,000 digital images, over 300,000 pages of environmental documents, and over a million records in geographical and botanical databases.

8 Botanical Data
The CalFlora Database contains taxonomical and distribution information for more than 8000 native California plants. The Occurrence Database includes over 300,000 records of California plant sightings from many federal, state, and private sources. The botanical databases are linked to our CalPhotos collection of California plants, and are also linked to external collections of data, maps, and photos.

9 Geographical Data
Much of the geographical data in our collection is being used to develop our web-based GIS Viewer.
–The Street Finder uses 500,000 TIGER records of S.F. Bay Area streets along with the 70,000 records from the USGS GNIS database.
–California Dams is a database of information about the 1395 dams under state jurisdiction.
–An additional 11 GB of geographical data represents maps and imagery that have been processed for inclusion as layers in our GIS Viewer. This includes Digital Ortho Quads and DRG maps for the S.F. Bay Area.

10 2010.04.05 - SLIDE 10IS 240 – Spring 2010 Documents: Most of the 300,000 pages of digital documents are environmental reports and plans that were provided by California state agencies. This collection includes documents, maps, articles, and reports on the California environment including Environmental Impact Reports (EIRs), educational pamphlets, water usage bulletins, and county plans. Documents in this collection come from the California Department of Water Resources (DWR), California Department of Fish and Game (DFG), San Diego Association of Governments (SANDAG), and many other agencies. Among the most frequently accessed documents are County General Plans for every California county and a survey of 125 Sacramento Delta fish species.

11 2010.04.05 - SLIDE 11IS 240 – Spring 2010 Documents - cont. The collection also includes about 20Mb of full-text (HTML) documents from the World Conservation Digital Library. In addition to providing online access to important environmental documents, the document collection is the testbed for our Multivalent Document research.

12 Photographs
The photo collection includes:
–17,000 images of California natural resources from the state Department of Water Resources
–several hundred aerial photos
–17,000 photos of California native plants from St. Mary's College, the California Academy of Sciences, and others
–a small collection of California animals
–40,000 Corel stock photos

13 Testbed Success Stories
LUPIN: CERES’ Land Use Planning Information Network
–California County General Plans and other environmental documents.
–Enter at Resources Agency Server, documents stored at and retrieved from UCB DLIB server.
California flood relief efforts
–High demand for some data sets only available on our server (created by document recognition).
CalFlora: Creation and interoperation of repositories pertaining to plant biology.
Cloning of services at Cal State Library, FBI

14 Research Highlights
Documents
–Multivalent Document prototype: page images, structured documents, GIS data, photographs
Intelligent Access to Content
–Document recognition
–Vision-based Image Retrieval: stuff, thing, scene retrieval
–Natural Language Processing: categorizing the web, Cheshire II, TileBar Interfaces

15 2010.04.05 - SLIDE 15IS 240 – Spring 2010 User Interface Paradigms: Multivalent Documents An approach to new document types and their authoring. Supports active, distributed, composable transformations of multimedia documents. Enables sophisticated annotations, intelligent result handling, user-modifiable interface, composite documents.

16 Multivalent Documents
[Diagram: a scanned page image (“History of The Classical World”) overlaid with a Cheshire Layer, an OCR Layer, an OCR Mapping Layer, a GIS Layer, and a Table Layer, connected to Network Protocols & Resources.]
Valence: 2: The relative capacity to unite, react, or interact (as with antigens or a biological substrate). Webster’s 7th Collegiate Dictionary

18 GIS in the MVD Framework
Layers are georeferenced data sets.
Behaviors are
–display semi-transparently
–pan
–zoom
–issue query
–display context
–“spatial hyperlinks”
–annotations
Written in Java (to be merged with MVD-1 code line?)

19 2010.04.05 - SLIDE 19IS 240 – Spring 2010 GIS Viewer Example http://elib.cs.berkeley.edu/annotations/gis/buildings.html

20 Overview of Cheshire II
The Cheshire II system is intended to provide an easy-to-use, standards-compliant system capable of retrieving any type of information in a wide variety of settings.

21 Overview of Cheshire II
–It supports SGML and XML.
–It is a client/server application.
–Uses the Z39.50 Information Retrieval Protocol.
–Server supports a Relational Database Gateway.
–Supports Boolean searching of all servers.
–Supports probabilistic ranked retrieval in the Cheshire search engine.
–Search engine supports “nearest neighbor” searches and relevance feedback.
–GUI interface on X window displays.
–WWW/CGI forms interface for DL, using combined client/server CGI scripting via WebCheshire.
–Image content retrieval using BlobWorld.
–Support for the SDLIP (Simple Digital Library Interoperability Protocol) for search and as Z39.50 Gateway.

22 Cheshire II Searching
[Diagram: local and remote searching via Z39.50 and the Internet, covering images and scanned text.]

23 Current Usage of Cheshire II
Web clients for:
–NSF/NASA/ARPA Digital Library: includes support for full-text and page-level search, and experimental Blob-World image search
–SunSite
–University of Liverpool
–University of Essex, HDS (part of AHDS)
–California Sheet Music Project
–Cha-Cha (Berkeley Intranet Search Engine)
–Univ. of Virginia
Cheshire ranking algorithm is basis for Inktomi (i.e., Yahoo, Hotbot, MSN? and others)

24 2010.04.05 - SLIDE 24IS 240 – Spring 2010 Image Retrieval Research Finding “Stuff” vs “Things” BlobWorld Other Vision Research

25 Blobworld: use regions for retrieval
We want to find general objects → represent images based on coherent regions

26 2010.04.05 - SLIDE 26IS 240 – Spring 2010 Outline Why regions? Creating Blobworld: segmentation and description Using Blobworld: query experiments Indexing blobs for faster querying Conclusions

27 Creating and using Blobworld
Create: extract features → segment image → describe regions
Use: query

28 Extract features for each pixel
Color
–Take average color (L*a*b*) at the selected scale → ignore local color variations due to texture
–“zebra = gray horse + stripes”
Texture
–Find contrast, anisotropy, polarity at the selected scale
Position

29 2010.04.05 - SLIDE 29IS 240 – Spring 2010 Find groups in feature space Model feature distribution as a mixture of Gaussians using Expectation-Maximization (EM)

30 Find regions in the image
Label each pixel based on its Gaussian cluster
Find connected components → regions
[Figure: image with pixels labeled by cluster number.]
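
The connected-components step can be sketched as a flood fill over the grid of per-pixel cluster labels. This is a minimal illustration, not the Blobworld code: the 4-connectivity rule, the function name `connected_regions`, and the toy label grid are all assumptions made for the example.

```python
from collections import deque


def connected_regions(labels):
    """Split a 2-D grid of cluster labels into 4-connected regions.

    Returns (region_grid, region_count), where region_grid assigns each
    pixel a region id 0, 1, 2, ... in scan order.
    """
    rows, cols = len(labels), len(labels[0])
    region = [[-1] * cols for _ in range(rows)]
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if region[r][c] != -1:
                continue  # pixel already belongs to a region
            # Flood-fill every pixel reachable from (r, c) with the same label.
            region[r][c] = next_id
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and region[ny][nx] == -1
                            and labels[ny][nx] == labels[y][x]):
                        region[ny][nx] = next_id
                        queue.append((ny, nx))
            next_id += 1
    return region, next_id


# Toy grid of Gaussian-cluster labels (made up for illustration).
cluster_labels = [
    [1, 1, 3, 3],
    [1, 2, 2, 3],
    [4, 2, 1, 1],
]
regions, region_count = connected_regions(cluster_labels)
```

Note that cluster 1 appears in two places that do not touch, so it yields two separate regions: pixels in the same cluster only form one blob when they are spatially connected.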

31 Describe regions by color, texture, shape
Color
–Color histogram within region
–Quadratic distance: encode similarity between color bins
$d^2_{\mathrm{hist}}(x, y) = (x - y)^\top A\,(x - y)$
Texture
–Mean contrast and anisotropy → stripes vs. spots vs. smooth
(Basic) Shape
–Fourier descriptors of contour

32 Select appropriate scale for processing
Polarity: do all the gradient vectors point in the same direction?
Choose scale where polarity stabilizes → include one approximate period

33 Initialize means using image data
Before, we picked random initialization
Now, choose initial means based on image tiles
Add noise to means and restart EM (4 runs per K)
[Figure: segmentations for K = 2, 3, 4, 5.]

34 Grouping: Expectation-Maximization
Given class characteristics (μ, Σ), find class membership
Given class membership, find class characteristics (μ, Σ)
Iterate: update labels, update μ, Σ, update labels, update μ, Σ, ...
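
The alternation between "find class membership" and "find class characteristics" can be sketched with a one-dimensional Gaussian mixture. This is a simplified illustration under stated assumptions: Blobworld works on multi-dimensional color/texture/position features with full covariances, whereas this sketch uses scalar data, scalar variances, and hypothetical names (`em_gmm_1d`, `responsibilities`).

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def responsibilities(data, weights, means, variances):
    """E-step: soft class membership of every point in every component."""
    resp = []
    for x in data:
        p = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(p) or 1e-300  # guard against numerical underflow
        resp.append([pj / total for pj in p])
    return resp


def em_gmm_1d(data, initial_means, n_iter=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture by EM, starting from the given means."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        resp = responsibilities(data, weights, means, variances)
        # M-step: re-estimate each component from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0:
                continue  # empty component: keep its old parameters
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(min_var, spread)
    # Final hard labels: the most probable component for each point.
    resp = responsibilities(data, weights, means, variances)
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return weights, means, variances, labels


# Two well-separated groups of made-up feature values.
values = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
weights, means, variances, labels = em_gmm_1d(values, initial_means=[0.0, 6.0])
```

With this toy data the two means settle near 1.0 and 5.0 and the hard labels split the points into the two obvious groups.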

35 How many Gaussians?
Model selection: Minimum Description Length
–Prefer fewer Gaussians if performance is comparable
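
One common way to write a Minimum Description Length criterion for mixture models is the negative log-likelihood plus a penalty of half the number of free parameters times log n. The sketch below assumes that form and full-covariance Gaussians; the slide does not give the exact formula, and the log-likelihood values are invented for illustration.

```python
import math


def gmm_free_parameters(k, d):
    """Free parameters of a K-component, d-dimensional full-covariance mixture.

    (k - 1) mixing weights, k * d mean entries, and k symmetric d x d covariances.
    """
    return (k - 1) + k * d + k * d * (d + 1) // 2


def description_length(log_likelihood, k, d, n):
    """Assumed MDL score: lower is better."""
    return -log_likelihood + 0.5 * gmm_free_parameters(k, d) * math.log(n)


def choose_k(log_likelihoods, d, n):
    """Pick the K with the smallest description length.

    log_likelihoods maps each candidate K to the log-likelihood of its fitted model.
    """
    return min(log_likelihoods,
               key=lambda k: description_length(log_likelihoods[k], k, d, n))


# Hypothetical fits: K = 4 barely improves on K = 3, so the penalty favors K = 3.
fits = {2: -1000.0, 3: -950.0, 4: -949.0}
best_k = choose_k(fits, d=2, n=500)
```

The penalty term is what implements "prefer fewer Gaussians if performance is comparable": an extra component must buy enough likelihood to pay for its additional parameters.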

36 Find groups in feature space
Model feature distribution as a mixture of Gaussians using Expectation-Maximization (EM)

37 EM math
Probability density and update equations [equation images not preserved in the transcript]
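
For reference, the textbook form of the Gaussian mixture density and its EM updates is given below. The slide's own equations did not survive in the transcript, so the notation here ($\alpha_k$ for mixing weights, $N$ data points $x_i$, $K$ components) is an assumption and may differ from what was shown.

```latex
\begin{align*}
% Mixture density
f(x \mid \Theta) &= \sum_{k=1}^{K} \alpha_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) \\[4pt]
% E-step: posterior probability that point x_i belongs to component k
p(k \mid x_i, \Theta) &=
  \frac{\alpha_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}
       {\sum_{j=1}^{K} \alpha_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)} \\[4pt]
% M-step: update equations
\alpha_k^{\text{new}} &= \frac{1}{N} \sum_{i=1}^{N} p(k \mid x_i, \Theta) \\
\mu_k^{\text{new}} &=
  \frac{\sum_{i=1}^{N} x_i \, p(k \mid x_i, \Theta)}
       {\sum_{i=1}^{N} p(k \mid x_i, \Theta)} \\
\Sigma_k^{\text{new}} &=
  \frac{\sum_{i=1}^{N} p(k \mid x_i, \Theta)\,
        (x_i - \mu_k^{\text{new}})(x_i - \mu_k^{\text{new}})^\top}
       {\sum_{i=1}^{N} p(k \mid x_i, \Theta)}
\end{align*}
```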

38 Encode similarity between color bins
Quadratic distance
Distance between histograms x and y: $d^2_{\mathrm{hist}}(x, y) = (x - y)^\top A\,(x - y)$
$A_{ij}$ is based on the similarity between bins i and j
–Neighboring bins have $A_{ij} = 0.5$
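
A small worked example of the quadratic distance, assuming a one-dimensional chain of color bins in which each bin is fully similar to itself (1.0), half similar to its immediate neighbors (0.5, as on the slide), and dissimilar to everything else (0.0). Real color bins sit in a three-dimensional color space, so the chain layout and the function names are simplifications for illustration.

```python
def similarity_matrix(n_bins, neighbor_similarity=0.5):
    """Build A for a 1-D chain of bins: 1 on the diagonal, 0.5 for neighbors."""
    return [
        [1.0 if i == j else neighbor_similarity if abs(i - j) == 1 else 0.0
         for j in range(n_bins)]
        for i in range(n_bins)
    ]


def quadratic_distance_sq(x, y, a):
    """Squared quadratic-form distance (x - y)' A (x - y)."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    n = len(diff)
    return sum(diff[i] * a[i][j] * diff[j] for i in range(n) for j in range(n))


A = similarity_matrix(3)
all_in_bin0 = [1.0, 0.0, 0.0]
all_in_bin1 = [0.0, 1.0, 0.0]  # neighboring bin
all_in_bin2 = [0.0, 0.0, 1.0]  # non-neighboring bin

near = quadratic_distance_sq(all_in_bin0, all_in_bin1, A)
far = quadratic_distance_sq(all_in_bin0, all_in_bin2, A)
```

Here `near` evaluates to 1.0 and `far` to 2.0. A plain Euclidean distance would treat both pairs as equally different; the off-diagonal entries of A are what make mass in a neighboring bin count as a closer match.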

39 Fourier descriptors for shape
[Zahn & Roskies ’72, Kuhl & Giardina ’82]
Find (x,y) representation of outer contour
Find Fourier series of (x,y)
–Coefficients specify an ellipse (4 parameters): major axis, minor axis, orientation, starting point
Remove starting point ambiguity
Store first ten Fourier coefficients

40 Creating and using Blobworld
Create: extract features → segment image → describe regions
Use: query

41 Querying: let user see the representation
Current systems are unsatisfying
–User can’t see what the computer sees
–Unclear how parameters relate to the image
User should interact with the representation
–Helps in query formulation
–Makes results understandable
–Minimizes disappointment
http://elib.cs.berkeley.edu/photos/blobworld

48 Query experiments
Collection of 10,000 Corel stock photos
Five query images in each of ten categories (e.g., cheetahs, polar bears, airplanes)
Compare Blobworld to global histogram queries
Precision (% of retrieved images that are correct) vs. Recall (% of correct images that are retrieved)
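
The two measures can be computed directly from the set of retrieved images and the set of images known to be correct for the query category. The image identifiers below are invented for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = fraction of retrieved items that are relevant
    recall    = fraction of relevant items that were retrieved
    """
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & set(relevant))
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query for tigers: three of the four results are tigers,
# out of five tiger images in the collection.
retrieved_images = ["tiger1", "tiger2", "pumpkin1", "tiger3"]
tiger_images = {"tiger1", "tiger2", "tiger3", "tiger4", "tiger5"}
precision, recall = precision_recall(retrieved_images, tiger_images)
```

For this query, precision is 3/4 = 0.75 and recall is 3/5 = 0.6.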

49 Distinctive objects
Tigers, cheetahs, and zebras:
–Blobworld does better than global histograms
[Figure panels: cheetahs, zebras.]

50 Distinctive objects and backgrounds
Eagles and black bears:
–Blobworld does better than global histograms
[Figure panel: black bears.]

51 Distinctive scenes
Airplanes and brown bears:
–Global histograms do better than Blobworld
–But Blobworld has room to grow (shape, etc.)
[Figure panel: airplanes.]

52 Index to search huge collections
Indexing is trickier than for traditional data
We can afford some mistakes: even with full search, we’ll miss some tigers and include some pumpkins
Two approaches we have tried:
–Store terms and treat image as a document
–Store features and index using a tree
Final (“correct”) ranking of images from index

53 Index using conventional IR methods
Treat each database blob as a document
–Store “terms” (bins) for color, texture, location, and shape
–Repeat color terms based on histogram weights
Index using Cheshire II
Treat each query blob as a document
–Repeat “terms” according to query weights
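
The "blob as document" idea can be sketched as a function that turns a blob's quantized features into a bag of pseudo-words, repeating each color term in proportion to its histogram weight so that a text engine's term-frequency statistics reflect the color distribution. The term naming scheme (`color0`, `texture2`, ...) and the `max_repeats` scaling are assumptions for illustration; the slide does not describe the actual encoding used with Cheshire II.

```python
def blob_to_terms(color_hist, texture_bin, location_bin, shape_bin, max_repeats=10):
    """Encode one blob as a list of index terms.

    color_hist is a normalized histogram (weights summing to about 1).
    Each color bin's term is repeated round(weight * max_repeats) times.
    """
    terms = []
    for bin_index, weight in enumerate(color_hist):
        terms.extend([f"color{bin_index}"] * round(weight * max_repeats))
    # Texture, location, and shape each contribute a single quantized term.
    terms.append(f"texture{texture_bin}")
    terms.append(f"location{location_bin}")
    terms.append(f"shape{shape_bin}")
    return terms


# Hypothetical blob: mostly color bin 0, some bin 1, a little bin 2.
tiger_blob_terms = blob_to_terms([0.6, 0.3, 0.1], texture_bin=2, location_bin=5, shape_bin=1)
```

The resulting term list can be joined into a string and handed to any text indexer; a query blob would be encoded the same way, with repeats reflecting the query weights.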

54 Indexing and Retrieval with Cheshire II
Originally used the same probabilistic algorithm used for text
–Blobs are not distributed like text words or stems
Now using a weighting based on coordination level match with a minimum threshold (a blob must have at least half of the characteristics of the query cluster).
Still eyeballing data, but seems much better for many types of queries
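
Coordination level matching ranks items by how many distinct query terms they share with the query. A minimal sketch with the "at least half" threshold follows; the in-memory dictionary index, the term names, and the tie-breaking by blob id are assumptions for the example, not a description of Cheshire II internals.

```python
def coordination_level_rank(query_terms, blob_index, min_fraction=0.5):
    """Rank blobs by the number of distinct query terms they contain.

    Blobs sharing fewer than min_fraction of the query's terms are dropped.
    Returns a list of (blob_id, overlap) pairs, best match first.
    """
    query = set(query_terms)
    needed = min_fraction * len(query)
    scored = []
    for blob_id, terms in blob_index.items():
        overlap = len(query & set(terms))
        if overlap >= needed:
            scored.append((overlap, blob_id))
    # Highest overlap first; ties broken alphabetically by blob id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(blob_id, overlap) for overlap, blob_id in scored]


# Hypothetical index of blobs encoded as feature terms.
index = {
    "tiger1": ["color3", "color7", "texture2", "location1"],
    "tiger2": ["color3", "color7", "texture2", "location5"],
    "pumpkin": ["color3", "texture9", "location0"],
    "sky": ["color7", "location5"],
}
query = ["color3", "color7", "texture2", "location5"]
ranking = coordination_level_rank(query, index)
```

With four query terms the threshold is two shared terms, so the pumpkin blob (one shared term) is excluded and the remaining blobs are ordered tiger2, tiger1, sky.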

59 Conclusions
Image retrieval in general collections requires region segmentation and description
Blobworld yields high precision in queries for distinctive objects
Blobworld can be indexed to allow fast querying

67 2010.04.05 - SLIDE 67IS 240 – Spring 2010 Geographic Information Retrieval and Spatial Browsing Ray R. Larson School of Library and Information Studies University of California, Berkeley

68 Concerns for Digital Libraries
Excellent summary in Distributed Geolibraries from NRC.
–Distributed resources
–Distributed users
–Distributed services
Access for a broad population is critical for many Digital Libraries

69 Concerns for Digital Libraries
Georeferenced Information (geoinformation) provides one organizational perspective
Other common perspectives include Topical Classification schemes, Temporal/Historical organization (ECAI)
DLs can provide multiple views of the same information

70 Concerns for Digital Libraries
Most DLs are intended for a broad user base:
–varying levels of expertise in the contents
–varying requirements for access methods
–simple expressions of interest in natural language should be supported
–Mapping NL to controlled vocabularies (including Digital Gazetteers)

71 2010.04.05 - SLIDE 71IS 240 – Spring 2010 Digital Library Needs Geographic and Spatial Querying Spatial Browsing Geographic and Spatial Indexing (Berkeley DL contents and examples)

72 2010.04.05 - SLIDE 72IS 240 – Spring 2010 Overview What is Geographic Information Retrieval? Geographic and Spatial Querying and Browsing. Geographic and Spatial Indexing. Examples of GIR Systems and Geographically Indexed Information.

73 2010.04.05 - SLIDE 73IS 240 – Spring 2010 Introduction What is Geographic Information Retrieval? –GIR is concerned with providing access to georeferenced information sources. It includes all of the areas of traditional IR research with the addition of spatially and geographically oriented indexing and retrieval. –It combines aspects of DBMS research, User Interface Research, GIS research, and Information Retrieval research.

74 Introduction
The need for Geographic and Spatial Information Retrieval.
–Digital Libraries: Sequoia 2000; UC Berkeley NSF/NASA/ARPA Digital Library Project; UC Santa Barbara Alexandria Project; NSDI - National Spatial Data Infrastructure
–Next-Generation Online Catalogs: Cheshire II

75 2010.04.05 - SLIDE 75IS 240 – Spring 2010 Geographic and Spatial Querying Both imply querying on relationships within a particular coordinate system Spatial querying is the more general term Can be defined as queries about the spatial relationships (intersection, containment, boundary, adjacency, proximity) of entities geometrically defined and located in space

76 Geographic and Spatial Querying
Geographical coordinates are geometric relationships (distance and direction can be measured on a continuous scale)
–E.g. “5.21 miles north of Champaign”
Spatial relations may be both geometric and topological (spatially related but without measurable distance or absolute direction)
–E.g.: “inside the city limits”
–“left side of Beckman Institute”

77 Geographic and Spatial Querying
Types of spatial queries
–Point-in-polygon: “What do we have at this X,Y point?”
–Region Queries: “What do we have in this region?”
Which point-encoded items lie within the region?
What lines (borders, etc.) lie within or cross the region?
What areas overlap the region area?
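
Both query types can be built on a point-in-polygon test. The sketch below uses the standard ray-casting method on planar coordinates; the function names, the square region, and the item names are invented for illustration, and points lying exactly on a boundary are not treated specially.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a ray to the right of (x, y) crosses.

    polygon is a list of (x, y) vertices in order; an odd crossing count means inside.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def points_in_region(points, polygon):
    """Region query over point-encoded items: names of the items inside the polygon."""
    return [name for name, (x, y) in points.items() if point_in_polygon(x, y, polygon)]


# Hypothetical region and point-encoded items.
region = [(0, 0), (4, 0), (4, 4), (0, 4)]
items = {"report_a": (1, 1), "photo_b": (5, 1), "map_c": (2, 3)}
found = points_in_region(items, region)
```

Here `found` contains `report_a` and `map_c`. The same test answers the point-in-polygon question in the other direction: given one X,Y point, check it against each stored polygon to see which areas contain it.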

78 Geographic and Spatial Querying
Types of spatial queries, cont.
–Distance and Buffer Zone Queries
What cities lie within 40 miles of the border of Northern and Southern Ireland?
What wetlands lie within 50 miles of London?
–Path Queries
What is the shortest route from San Francisco to Los Angeles?
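
A distance query around a single point can be sketched with the haversine great-circle formula. This is a simplification: it treats every place as a point, so it does not handle buffers around a line such as a border or around an area such as a wetland, and the straight-line distance it returns is not a road route. The function names, the mean Earth radius in miles, and the approximate city coordinates are assumptions for the example.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_distance(places, center, max_miles):
    """Names of the places no farther than max_miles from the center point."""
    lat0, lon0 = center
    return [name for name, (lat, lon) in places.items()
            if haversine_miles(lat0, lon0, lat, lon) <= max_miles]


# Approximate city-center coordinates.
london = (51.5074, -0.1278)
places = {"Reading": (51.4543, -0.9781), "Manchester": (53.4808, -2.2426)}
near_london = within_distance(places, london, max_miles=50)

sf_to_la = haversine_miles(37.7749, -122.4194, 34.0522, -118.2437)
```

Reading falls inside the 50-mile radius and Manchester does not. The San Francisco to Los Angeles figure is only the great-circle distance; a path query needs a road network and a shortest-path algorithm on top of it.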

79 Geographic and Spatial Querying
Types of spatial queries, cont.
–Multimedia Queries: Use non-map georeferenced information.
What are the names of farmers affected by flooding in Monterey and Santa Cruz Counties?

80 2010.04.05 - SLIDE 80IS 240 – Spring 2010 Spatial Browsing Combines ad hoc spatial querying with interactive displays HyperMap concept Pseudo-HyperMaps

81 Spatial Browsing
Advantages:
–May not need the accuracy of a full GIS
–Comprehensible searching metaphor for many materials
Problems:
–Clutter and differing scales.
–Requires good (and preferably accurate) geographical indexing
–Assumes that the user knows some geography

82 Geographic and Spatial Indexing
Traditional geographic indexing involves using place names from LCSH and name authorities. These have some problems:
–Names are not unique
–The places referred to change size, shape and names over time
–Spelling variations
–Some places are temporary conventions (study areas, etc.)

83 Digital Gazetteers
Geographic names are and will remain the primary Entry Vocabulary for DL spatial queries
–The gazetteer must support as many variant forms of the name as possible, including temporal ranges for particular names
–Querying must support spatial reasoning based on gazetteer and other geographic and temporal information in the system or accessible by network access

85 Geographic and Spatial Indexing
Geographic coordinates have some advantages over names:
–They are persistent regardless of name, political boundary or other changes
–They can be simply connected to spatial browsing interfaces and GIS data.
–They provide a consistent framework for GIR applications and spatial queries.
However, the geographic extents and boundaries of entities also change over time
–This may be the primary interest of historical scholarship

86 Geographic and Spatial Indexing
GIPSY: Automatic georeferencing of texts (Geographic Info Processing System)
–The work of Allison Woodruff and Christian Plaunt; later DBMS-based version by Jolly Chen; new version planned
–Designed to operate on the full text of documents
–Extracts geographic terms and attempts to identify the coordinates of the places discussed in the text using a combination of evidence

87 2010.04.05 - SLIDE 87IS 240 – Spring 2010 Geographic and Spatial Indexing GIPSY cont. –Used the USGS Geographic Names Information System (GNIS) and Geographic Information Retrieval and Analysis System (GIRAS) to associate names with coordinates of named places, geographic features and land use characteristics.

88 2010.04.05 - SLIDE 88IS 240 – Spring 2010 Geographic and Spatial Indexing GIPSY cont. –Identified places are added as “elevations” with each place adding a weight based on its frequency in the text and database characteristics –The resulting map is analysed to identify the most likely locations, and coordinates for those locations are extracted
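
The "elevation" idea can be sketched as accumulation on a coarse grid: each identified place raises the cells it covers by a weight, and the highest cells mark the most likely location. This is a toy illustration under stated assumptions, not the GIPSY implementation: footprints are rectangular blocks of grid cells rather than polygons from GNIS/GIRAS, the weight is the mention count alone (GIPSY also used database characteristics), and all names and numbers are invented.

```python
def accumulate_elevations(mention_counts, footprints, grid_size):
    """Stack place footprints on a grid, weighting each by its mention count.

    mention_counts maps place name -> number of mentions in the text.
    footprints maps place name -> (row0, col0, row1, col1), inclusive cell bounds.
    """
    rows, cols = grid_size
    grid = [[0.0] * cols for _ in range(rows)]
    for place, count in mention_counts.items():
        if place not in footprints:
            continue  # name not found in the gazetteer
        row0, col0, row1, col1 = footprints[place]
        for r in range(row0, row1 + 1):
            for c in range(col0, col1 + 1):
                grid[r][c] += count
    return grid


def highest_cells(grid):
    """Return the peak elevation and the list of cells that reach it."""
    peak = max(max(row) for row in grid)
    cells = [(r, c) for r, row in enumerate(grid)
             for c, value in enumerate(row) if value == peak]
    return peak, cells


# Hypothetical footprints on a 4 x 4 grid and mention counts from one document.
footprints = {
    "Santa Barbara County": (2, 1, 3, 3),
    "San Luis Obispo": (1, 1, 2, 2),
}
mentions = {"Santa Barbara County": 2, "San Luis Obispo": 1, "coastal branch": 1}
elevation = accumulate_elevations(mentions, footprints, grid_size=(4, 4))
peak, peak_cells = highest_cells(elevation)
```

The cells where the two footprints overlap reach the highest elevation, so they are the best guess at the area the passage is about; a term with no gazetteer entry contributes nothing.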

89 Geographic and Spatial Indexing
GIPSY Map Overlay
“The proposed project is the construction of a new State Water Project facility, the coastal branch... by water purveyors of northern Santa Barbara County... delivering water to San Luis Obispo...”
[Figure: map overlay for the places named in this passage.]

90 2010.04.05 - SLIDE 90IS 240 – Spring 2010 Geographic and Spatial Indexing To be useful for the range of cultural and humanities materials being collected in digital libraries, the GIPSY gazetteer must –Support many different time ranges, location and boundary changes –Support synonymous and variant names with differing locations for the same entity –Support names in multiple languages, scripts and usages

91 ECAI
The Electronic Cultural Atlas Initiative is a collaboration between IT professionals and humanities scholars
ECAI is developing a globally distributed spatio-temporal library of cultural and historical resources with a centralized metadata catalogue and a GIS viewer
Currently the ECAI consortium includes over 250 projects

92 ECAI
Projects range from small works by individual scholars to large nationally and internationally funded efforts. E.g.:
–geography of Greco-Roman culture (Perseus project)
–toponym locations for over 300,000 images of Buddhist art and architecture
–Seals of the Sasanian Empire
–historical trade routes of Eurasia
–the map of Hideyoshi’s invasion of Korea
–historical GIS projects for China, Great Britain, the United States, the Black Sea and Tibet

93 2010.04.05 - SLIDE 93IS 240 – Spring 2010 Perseus

94 2010.04.05 - SLIDE 94IS 240 – Spring 2010 The Sasanian Empire

95 2010.04.05 - SLIDE 95IS 240 – Spring 2010 Opening shot of the Sasanian Empire ECAI project, showing a map with diverse resources, a timeline, and a menu of available map layers.

96 2010.04.05 - SLIDE 96IS 240 – Spring 2010 Users may zoom in to see resources that are only visible at a higher level of detail.

97 2010.04.05 - SLIDE 97IS 240 – Spring 2010 Spatial objects on the map are linked to a table of attributes, which may include any information about the objects. Note that this is a scholarly tool. By creating a “name quality” field, the author has noted that there is disagreement about the locations and names of places in the Sasanian Empire.

98 2010.04.05 - SLIDE 98IS 240 – Spring 2010 Sites on the map may be linked to resources elsewhere on the internet. In this case, important archaeological sites on the map are linked to web-based tours.

99 2010.04.05 - SLIDE 99IS 240 – Spring 2010 The map interface may be used to show change over time. The “Sasanian Empire ca. 270s” resource is highlighted, and the “Sasanian Empire ca. 570s” is greyed out. If a user slides the timeline bar, the new boundary of the empire will appear.

100 2010.04.05 - SLIDE 100IS 240 – Spring 2010 In a different time range, not only do the boundaries of the empire appear different, but the sites that were active during the earlier era (the red dots) have moved as well.

101 2010.04.05 - SLIDE 101IS 240 – Spring 2010 TimeMap is a user authoring tool, not merely a viewer. Users can control the look of the icons, the map layers that comprise a project, and, as shown here, the map scale at which different layers will become visible.

102 This screen displays the metadata for a part of the Sasanian Empire project. The metadata includes functional (tm.) metadata to enable connection to the map interface in addition to cataloguing (dc. and ecai.) metadata. Using the menu on the left, users may choose to map individual map layers or packaged projects.

103 2010.04.05 - SLIDE 103IS 240 – Spring 2010 Historic Sydney

104 2010.04.05 - SLIDE 104IS 240 – Spring 2010 Google Earth GIR - Demo

105 2010.04.05 - SLIDE 105IS 240 – Spring 2010 The Mongol Empire

106 2010.04.05 - SLIDE 106IS 240 – Spring 2010 Prof. Ray Larson University of California, Berkeley School of Information Tuesday and Thursday 10:30 am - 12:00 pm Spring 2007 http://courses.ischool.berkeley.edu/i240/s07 Principles of Information Retrieval Lecture 23: GIR Continued

107 Today
Review
–Geographic Information Retrieval
Parts of this lecture were presented at the invitational conference “The ‘I’ in Geographic Information Science”, Manchester, U.K., July 2001
GIR Algorithms and evaluation based on a presentation to the 2004 European Conference on Digital Libraries, held in Bath, U.K.
