2013.04.03 – IS 240, Spring 2013 – Prof. Ray Larson, University of California, Berkeley School of Information – Principles of Information Retrieval (presentation transcript)


SLIDE 1 IS 240 – Spring 2013 – Prof. Ray Larson, University of California, Berkeley School of Information – Principles of Information Retrieval – Lecture 17: Introduction to GIR

SLIDE 2: Overview – Review: NLP and IR; Geographic Information Retrieval (GIR)

SLIDE 3: General Framework of NLP (slides from Prof. J. Tsujii, Univ. of Tokyo and Univ. of Manchester)

SLIDE 4: General Framework of NLP – the pipeline (Morphological and Lexical Processing → Syntactic Analysis → Semantic Analysis → Context Processing → Interpretation) illustrated on "John runs.": morphological analysis gives "John run+s" (P-N, V, 3-pre, N plu); syntactic analysis gives the tree S → NP (P-N John), VP (V run); semantic analysis gives Pred: RUN, Agent: John; context processing handles "John is a student. He runs." (Slides from Prof. J. Tsujii, Univ. of Tokyo and Univ. of Manchester)
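A minimal sketch of the first stages of this pipeline (morphological/lexical and syntactic analysis) applied to "John runs.", using spaCy as the toolkit – an assumption on my part, since the slide names no particular tool:

    # Requires: pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")   # tagger, lemmatizer, dependency parser
    doc = nlp("John runs.")

    for token in doc:
        # morphological/lexical level: surface form, lemma ("run"), part of speech
        # syntactic level: dependency relation and head ("runs" is the root, "John" its subject)
        print(token.text, token.lemma_, token.pos_, token.dep_, token.head.text)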

SLIDE 5: Framework of IE – Information Extraction as a "compromise" form of NLP (slides from Prof. J. Tsujii, Univ. of Tokyo and Univ. of Manchester)

SLIDE 6–7: Difficulties of NLP (1) – Robustness: incomplete knowledge – incomplete domain knowledge (interpretation rules) and predefined aspects of information – shown against the general NLP pipeline (Morphological and Lexical Processing → Syntactic Analysis → Semantic Analysis → Context Processing → Interpretation). (Slides from Prof. J. Tsujii, Univ. of Tokyo and Univ. of Manchester)

SLIDE 8: NLP & IR
– Indexing: use of NLP methods to identify phrases (test weighting schemes for phrases); use of more sophisticated morphological analysis
– Searching: use of two-stage retrieval – statistical retrieval followed by more sophisticated NLP filtering (sketched below)
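The two-stage idea can be pictured in a few lines: a cheap statistical ranker produces candidates, then an NLP step filters them. The TF-IDF ranker and the lemma-based filter below are illustrative stand-ins, not the specific techniques the lecture evaluates:

    import spacy
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["Information retrieval ranks documents by estimated relevance.",
            "Berkeley is in the San Francisco Bay Area.",
            "Natural language processing studies human language."]
    query = "ranking documents by relevance"

    # Stage 1: statistical retrieval (TF-IDF + cosine similarity)
    vec = TfidfVectorizer()
    doc_matrix = vec.fit_transform(docs)
    scores = cosine_similarity(vec.transform([query]), doc_matrix)[0]
    top = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:2]

    # Stage 2: NLP filtering, here requiring every content-word lemma of the
    # query to appear (lemmatized) somewhere in the candidate document
    nlp = spacy.load("en_core_web_sm")
    query_lemmas = {t.lemma_.lower() for t in nlp(query) if t.is_alpha and not t.is_stop}
    keep = [i for i in top
            if query_lemmas <= {t.lemma_.lower() for t in nlp(docs[i])}]
    print(keep)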

SLIDE 9: NLP & IR – Lewis and Sparck Jones suggest research in three areas:
– Examination of the words, phrases and sentences that make up a document description and express the combinatory, syntagmatic relations between single terms
– The classificatory structure over the document collection as a whole, indicating the paradigmatic relations between terms and permitting controlled-vocabulary indexing and searching
– Using NLP-based methods for searching and matching

SLIDE 10: NLP & IR Issues – Is natural language indexing using more NLP knowledge needed, or should controlled vocabularies be used? Can NLP in its current state provide the improvements needed? How do we test?

SLIDE 11: NLP & IR – The new "Question Answering" track at TREC has been exploring these areas: usually statistical methods are used to retrieve candidate documents, and NLP techniques are then used to extract the likely answers from the text of those documents.

SLIDE 12: Mark's idle speculation – "What people think is going on" (diagram contrasting Keywords and NLP; from Mark Sanderson, University of Sheffield)

SLIDE 13: Mark's idle speculation – "What's usually actually going on" (diagram contrasting Keywords and NLP; from Mark Sanderson, University of Sheffield)

SLIDE 14: What we really need is… – NLP fails to help because the machine lacks the human flexibility of interpretation and knowledge of context and content. So what about AI? There are many debates on whether human-like AI is or is not possible: "the question of whether machines can think is no more interesting than the question of whether submarines can swim" – Edsger Dijkstra

SLIDE 15: The Challenge – "the open domain QA is attractive as it is one of the most challenging in the realm of computer science and artificial intelligence, requiring a synthesis of information retrieval, natural language processing, knowledge representation and reasoning, machine learning and computer-human interfaces." – "Building Watson: An Overview of the DeepQA Project", AI Magazine, Fall 2010

SLIDE 16: DeepQA – DeepQA high-level architecture (figure from "Building Watson", AI Magazine, Fall 2010)

SLIDE 17: Question Analysis – attempts to discover what kind of question is being asked, usually meaning the desired type of result, or LAT (Lexical Answer Type). E.g. "Who is…" needs a person, "Where is…" needs a location. DeepQA uses a number of experts and combines their results using the confidence framework.
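A toy illustration of the idea: map a question's leading wh-phrase to an expected answer type. DeepQA combines many such "experts" under a confidence framework; this single hand-written rule table is only a sketch:

    # Hypothetical rule table, not DeepQA's actual LAT detection
    LAT_RULES = {"who": "PERSON", "where": "LOCATION", "when": "DATE", "how many": "NUMBER"}

    def lexical_answer_type(question: str) -> str:
        q = question.lower().strip()
        for cue, lat in LAT_RULES.items():
            if q.startswith(cue):
                return lat
        return "UNKNOWN"

    print(lexical_answer_type("Who wrote Moby-Dick?"))    # PERSON
    print(lexical_answer_type("Where is the Louvre?"))    # LOCATION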

SLIDE 18: Hypothesis Generation – takes the results of Question Analysis and produces candidate answers by searching the system's sources and extracting answer-sized snippets from the search results. Each candidate answer, plugged back into the question, is considered a hypothesis. A "lightweight scoring" step trims down the hypothesis set: what is the likelihood that the candidate answer is an instance of the LAT from the first stage?

SLIDE 19: Hypothesis and Evidence Scoring – candidate answers that pass the lightweight scoring undergo a rigorous evaluation that gathers additional supporting evidence for each candidate answer (hypothesis) and applies a wide variety of deep scoring analytics to evaluate that evidence. This involves more retrieval and scoring; one method uses the IDF scores of the words common to the hypothesis and the source passage (sketched below).
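A sketch of that IDF-weighted overlap score: sum the (smoothed) IDF of the terms shared between a hypothesis and a supporting passage. The tiny corpus and tokenizer are illustrative only:

    import math
    import re

    corpus = ["Watson won the Jeopardy challenge in 2011.",
              "DeepQA generates and scores candidate answers.",
              "IBM built Watson on a massively parallel architecture."]

    def tokens(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def idf(term, docs):
        df = sum(1 for d in docs if term in tokens(d))
        return math.log((len(docs) + 1) / (df + 1)) + 1   # smoothed IDF

    def evidence_score(hypothesis, passage, docs):
        return sum(idf(t, docs) for t in tokens(hypothesis) & tokens(passage))

    print(evidence_score("Watson won Jeopardy", corpus[0], corpus))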

SLIDE 20: Final Merging and Ranking – based on the deep scoring, the hypotheses and their supporting sources are ranked and merged to select the single best-supported hypothesis. Equivalent candidate answers are merged; after merging, the system ranks the hypotheses and estimates confidence from their merged scores. (A machine-learning approach using a set of known training answers is used to build the ranking model.)
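The merge-then-rank step in miniature: candidates that normalize to the same string are merged by combining their scores, and the merged hypotheses are ranked. DeepQA learns how to combine evidence; the simple summation and string normalization here are placeholders:

    from collections import defaultdict

    candidates = [("J.F. Kennedy", 0.42), ("John F. Kennedy", 0.38), ("Nixon", 0.15)]

    def normalize(answer: str) -> str:
        # Illustrative normalization only; real answer merging is far richer
        return answer.lower().replace(".", "").replace("jf ", "john f ")

    merged = defaultdict(float)
    for answer, score in candidates:
        merged[normalize(answer)] += score

    ranking = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    print(ranking)   # the two Kennedy variants merge and rank first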

SLIDE 21: Running DeepQA – a single question on a single-processor implementation of DeepQA could take up to 2 hours to complete. The Watson system used a massively parallel version of the UIMA framework and Hadoop (both now open source from Apache) running 2,500 processors in parallel. It won the public Jeopardy challenge (easily, it seemed).

SLIDE 22: Ray R. Larson, University of California, Berkeley School of Information – Geographic Information Retrieval (GIR): Algorithms and Approaches

SLIDE 23: Overview
– What is GIR?
– Spatial approaches to GIR
– A logistic regression approach to GIR: model; testing and results; example using Google Earth as an interface
– GIR evaluation tests: GeoCLEF, GikiCLEF, NTCIR GeoTime

SLIDE 24: Geographic Information Retrieval (GIR)
– Geographic information retrieval (GIR) is concerned with spatial approaches to the retrieval of geographically referenced, or georeferenced, information objects (GIOs) about specific regions or features on or near the surface of the Earth.
– Geospatial data are a special type of GIO that encode a specific geographic feature or set of features along with associated attributes: maps, air photos, satellite imagery, digital geographic data, photos, text documents, etc. (Source: USGS)

SLIDE 25: Georeferencing and GIR – within a GIR system, e.g. a geographic digital library, information objects can be georeferenced by place names or by geographic coordinates (i.e. longitude and latitude). Example: the San Francisco Bay Area, identified by its place name or by its coordinate extent.
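A minimal sketch of what a georeferenced information object might carry: a place-name reference, a coordinate footprint (here a longitude/latitude bounding box), or both. The field names and the rough Bay Area box are my own illustration, not a schema from any particular GIR system:

    from dataclasses import dataclass
    from typing import Optional, Tuple

    BBox = Tuple[float, float, float, float]   # (min_lon, min_lat, max_lon, max_lat)

    @dataclass
    class GeoreferencedObject:
        title: str
        place_names: Tuple[str, ...] = ()
        footprint: Optional[BBox] = None       # geometric georeference, if known

    doc = GeoreferencedObject(
        title="Bay Area water quality report",
        place_names=("San Francisco Bay Area",),
        footprint=(-123.0, 36.9, -121.2, 38.9),  # rough illustrative extent
    )
    print(doc)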

SLIDE 26: GIR is not GIS – GIS is concerned with spatial representations, relationships, and analysis at the level of the individual spatial object or field. GIR is concerned with the retrieval of geographic information resources (and geographic information objects at the set level) that may be relevant to a geographic query region.

SLIDE 27: Spatial Approaches to GIR – a spatial approach to geographic information retrieval is one based on the integrated use of spatial representations and spatial relationships. A spatial approach to GIR can be quantitative (based on the geometric spatial properties of a geographic information object) or qualitative (based on its non-geometric spatial properties).

SLIDE 28: Spatial Matching and Ranking – spatial similarity can be considered an indicator of relevance: documents whose spatial content is more similar to the spatial content of the query are considered more relevant to the information need the query represents. Both kinds of spatial attributes need to be considered: qualitative, non-geometric attributes and quantitative, geometric attributes (topological relationships and metric details). We focus on the latter.

SLIDE 29: Simple Quantitative Ranking – the simplest measure is the distance from one point to another, usually the center point of a spatial object: the "great circle distance" (orthodromic distance).

SLIDE 30: Great Circle Distance – programming a simple distance calculation, where (start_lat, start_long) is the start location in decimal degrees, (latitude, longitude) is the destination, and 3959 is the approximate radius of the Earth in miles. The "radians" function is simply radians = degrees * π/180.
Distance = 3959 * acos( cos( radians( start_lat ) ) * cos( radians( latitude ) ) * cos( radians( longitude ) - radians( start_long ) ) + sin( radians( start_lat ) ) * sin( radians( latitude ) ) )
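The same spherical-law-of-cosines calculation as a small Python function (the variable names are mine; 3959 is again the Earth's approximate radius in miles, use 6371 for kilometres):

    import math

    def great_circle_miles(start_lat, start_lon, dest_lat, dest_lon):
        lat1, lon1 = math.radians(start_lat), math.radians(start_lon)
        lat2, lon2 = math.radians(dest_lat), math.radians(dest_lon)
        # clamp guards against floating-point values just outside acos's domain
        cos_angle = (math.cos(lat1) * math.cos(lat2) * math.cos(lon2 - lon1)
                     + math.sin(lat1) * math.sin(lat2))
        return 3959 * math.acos(max(-1.0, min(1.0, cos_angle)))

    # Berkeley to Los Angeles, roughly 350 miles along the great circle
    print(great_circle_miles(37.87, -122.27, 34.05, -118.24))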

SLIDE 31: Simple Distance Problems – the distance is usually measured between the centers of geographic objects (the center point of a city or zip code, for example), but this very simple ranking method is used almost everywhere for mobile apps, store locators, etc., where the geographic coordinates are known. Issues: the Earth is not really a sphere, so large distances are not quite correct; more complex versions of the formula can be used to compensate.

SLIDE 32: Spatial Similarity Measures and Spatial Ranking – when dealing with areas, things get more interesting. Three basic approaches to areal spatial similarity measures and ranking: Method 1, simple overlap; Method 2, topological overlap; Method 3, degree of overlap.

SLIDE 33: Method 1: Simple Overlap – candidate geographic information objects (GIOs) that have any overlap with the query region are retrieved; the result set includes any GIOs that are contained within, overlap, or contain the query region. The spatial score for every GIO is either relevant (1) or not relevant (0), so the result set cannot be ranked (topological relationship only, no metric refinement).

SLIDE 34: Method 2: Topological Overlap – spatial searches are constrained to only those candidate GIOs that either are completely contained within the query region, overlap with the query region, or contain the query region. Each category is exclusive, and all retrieved items are considered relevant. The result set cannot be ranked (categorized topological relationship only, no metric refinement). A sketch of this categorization follows.
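The categorization, using Shapely's topological predicates (Shapely and the toy rectangles are my choices, not part of the lecture):

    from shapely.geometry import box

    query = box(0, 0, 10, 10)
    candidates = {"contained": box(2, 2, 4, 4),
                  "overlapping": box(8, 8, 14, 14),
                  "containing": box(-5, -5, 20, 20),
                  "disjoint": box(30, 30, 40, 40)}

    for name, gio in candidates.items():
        if gio.within(query):
            category = "contained within query region"
        elif gio.contains(query):
            category = "contains query region"
        elif gio.intersects(query):
            category = "overlaps query region"
        else:
            category = None   # not retrieved
        print(name, "->", category)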

SLIDE 35: Method 3: Degree of Overlap – candidate GIOs that have any overlap with the query region are retrieved, and a spatial similarity score is determined by the degree to which the candidate GIO overlaps the query region: the greater the overlap with respect to the query region, the higher the spatial similarity score. This method provides a score by which the result set can be ranked (topological relationship: overlap; metric refinement: area of overlap).
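A sketch of degree-of-overlap scoring, again with Shapely as an illustrative library: the score is the overlap area relative to the area of the query region, which gives a rankable value in [0, 1]:

    from shapely.geometry import box

    query = box(0, 0, 10, 10)                                  # area 100
    gios = {"A": box(5, 5, 15, 15), "B": box(2, 2, 4, 4), "C": box(30, 30, 40, 40)}

    def degree_of_overlap(query_geom, gio_geom):
        return query_geom.intersection(gio_geom).area / query_geom.area

    ranked = sorted(((degree_of_overlap(query, g), name) for name, g in gios.items()),
                    reverse=True)
    print(ranked)   # A covers 25% of the query region, B 4%, C none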

SLIDE 36: Example – results display from CheshireGeo (screenshot)

SLIDE 37: Geometric Approximations – the decomposition of spatial objects into approximate representations is a common approach to simplifying complex, often multi-part coordinate representations. Types of geometric approximations: conservative (superset), progressive (subset), generalizing (could be either), concave or convex. Geometric operations on convex polygons are much faster.

SLIDE 38: Other convex, conservative approximations, presented in order of increasing quality; the number in parentheses is the number of parameters needed to store the representation: 1) minimum bounding circle (3); 2) MBR: minimum aligned bounding rectangle (4); 3) minimum bounding ellipse (5); 4) rotated minimum bounding rectangle (5); 5) 4-corner convex polygon (8); 6) convex hull (varies). (After Brinkhoff et al., 1993b)
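Two of these approximations are one attribute access away in Shapely (an illustrative library choice); the made-up L-shaped region below also shows why the MBR carries more "false area" than the convex hull:

    from shapely.geometry import Polygon

    region = Polygon([(0, 0), (10, 0), (10, 4), (4, 4), (4, 10), (0, 10)])  # concave

    mbr = region.envelope          # minimum aligned bounding rectangle
    hull = region.convex_hull      # smallest convex polygon containing the region

    for name, approx in (("MBR", mbr), ("Convex hull", hull)):
        false_area = (approx.area - region.area) / region.area
        print(name, round(false_area, 2))   # the MBR adds noticeably more false area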

SLIDE 39: Our Research Questions – Spatial ranking: how effectively can the spatial similarity between a query region and a document region be evaluated and ranked based on the overlap of the geometric approximations of those regions? Geometric approximations and spatial ranking: how do different geometric approximations affect the rankings? (MBRs: the most popular approximation; convex hulls: the highest-quality convex approximation.)

SLIDE 40: Spatial Ranking – methods for computing spatial similarity

SLIDE 41: Proposed Ranking Method – probabilistic spatial ranking using logistic inference. Probabilistic models: a rigorous formal model attempts to predict the probability that a given document will be relevant to a given query; retrieved documents are ranked according to this probability of relevance (the Probability Ranking Principle); such models rely on accurate estimates of the probabilities.

SLIDE 42: Logistic Regression – the probability of relevance is based on a logistic regression fitted to a sample set of documents to determine the values of the coefficients. At retrieval time the probability estimate is obtained from the m attribute measures Xi (listed on the following slide): log O(R | Q, D) = c0 + c1·X1 + … + cm·Xm, and P(R | Q, D) = exp(log O(R | Q, D)) / (1 + exp(log O(R | Q, D))).

SLIDE 43: Probabilistic Models: Logistic Regression attributes –
X1 = area of overlap(query region, candidate GIO) / area of query region
X2 = area of overlap(query region, candidate GIO) / area of candidate GIO
X3 = 1 – abs(fraction of the overlap region that is onshore – fraction of the candidate GIO that is onshore)
where the range for all variables is 0 (not similar) to 1 (same). A small worked sketch follows.
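Turning those attribute values into a probability of relevance is the logistic function applied to the fitted linear combination. The coefficients below are placeholders chosen for illustration; the real values come from training the regression and are not given in the transcript:

    import math

    def relevance_probability(x1, x2, x3, coef=(-2.0, 3.0, 1.5, 1.0)):
        b0, b1, b2, b3 = coef                  # placeholder coefficients
        log_odds = b0 + b1 * x1 + b2 * x2 + b3 * x3
        return 1.0 / (1.0 + math.exp(-log_odds))

    # A candidate GIO covering 60% of the query region, with the query covering
    # 30% of the GIO, and similar onshore fractions (x3 close to 1):
    print(relevance_probability(0.6, 0.3, 0.9))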

SLIDE 44: Probabilistic Models –
Advantages: strong theoretical basis; in principle should supply the best predictions of relevance given the available information; computationally efficient, straightforward implementation (if based on LR).
Disadvantages: relevance information is required, or must be "guestimated"; important indicators of relevance may not be captured by the model; optimally requires ongoing collection of relevance information.

SLIDE 45: Test Collection – California Environmental Information Catalog (CEIC); approximately 2,500 records selected from the collection (August 2003) of about 4,000.

SLIDE 46: Test Collection Overview – 2,554 metadata records indexed by 322 unique geographic regions (represented as MBRs) and associated place names.
– 2,072 records (81%) indexed by 141 unique CA place names: 881 records indexed by 42 unique counties (out of a total of 46 unique counties indexed in the CEIC collection); 427 records indexed by 76 cities (of 120); 179 records by 8 bioregions (of 9); 3 records by 2 national parks (of 5); 309 records by 11 national forests (of 11); 3 records by 1 regional water quality control board region (of 1); 270 records by 1 state (CA).
– 482 records (19%) indexed by 179 unique user-defined areas (of approx. 240) for regions within or overlapping CA: 12% represent onshore regions (within the CA mainland); 88% (158 of 179) are offshore or coastal regions.

SLIDE 47: CA Named Places in the Test Collection (complex polygons): counties, cities, national parks, national forests, water QCB regions, bioregions (maps)

SLIDE 48: CA Counties – Geometric Approximations (maps of MBRs and convex hulls). Average false area of approximation: MBRs 94.61%, convex hulls 26.73%.

SLIDE 49: CA User-Defined Areas (UDAs) in the Test Collection (map)

SLIDE 50: Test Collection Query Regions: CA Counties – 42 of the 58 counties are referenced in the test collection metadata; 10 counties were randomly selected as query regions to train the LR model, and 32 counties were used as query regions to test the model.

SLIDE 51: Test Collection Relevance Judgements – determining the reference set of candidate GIO regions relevant to each county query region:
– Complex polygon data was used to select all CA place-named regions (i.e. counties, cities, bioregions, national parks, national forests, and state regional water quality control boards) that overlap each county query region.
– All overlapping regions were reviewed (semi-automatically) to remove sliver matches, i.e. regions that only overlap due to differences in the resolution of the 6 data sets. Automated review: overlaps where overlap area / GIO area exceeded a threshold were considered relevant, else not relevant. Cases manually reviewed: overlap area / query area < .001 and overlap area / GIO area < .02. (A sketch of this review logic follows.)
– The MBRs and metadata for all information objects referenced by UDAs (user-defined areas) were manually reviewed to determine their relevance to each query region; this could not be automated because, unlike the CA place-named regions, there are no complex polygon representations that delineate the UDAs.
– This process resulted in a master file of CA place-named regions and UDAs relevant to each of the 42 CA county query regions.
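The sliver-review logic, as code. The transcript omits the threshold used in the automated step, so AUTO_RELEVANT_RATIO below is a placeholder; the two manual-review cutoffs (.001 and .02) are the ones stated on the slide:

    AUTO_RELEVANT_RATIO = 0.05      # placeholder: value missing from the slide
    MANUAL_QUERY_RATIO = 0.001
    MANUAL_GIO_RATIO = 0.02

    def review(overlap_area, query_area, gio_area):
        # tiny overlaps relative to both regions were flagged for manual review
        if (overlap_area / query_area < MANUAL_QUERY_RATIO
                and overlap_area / gio_area < MANUAL_GIO_RATIO):
            return "manual review"
        # otherwise the overlap/GIO ratio decides relevance automatically
        if overlap_area / gio_area > AUTO_RELEVANT_RATIO:
            return "relevant"
        return "not relevant"

    print(review(overlap_area=0.5, query_area=1000.0, gio_area=100.0))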

SLIDE 52: LR model – X1 = area of overlap(query region, candidate GIO) / area of query region; X2 = area of overlap(query region, candidate GIO) / area of candidate GIO; where the range for both variables is 0 (not similar) to 1 (same).

SLIDE 53: Some of our Results – mean average query precision (the average of the precision values after each new relevant document is observed in a ranked list), reported for metadata indexed by CA named place regions and for all metadata in the test collection (tables). These results suggest that convex hulls perform better than MBRs (expected, since the convex hull is a higher-quality approximation), and that a probabilistic ranking based on MBRs can perform as well as, if not better than, a non-probabilistic ranking method based on convex hulls. This is interesting: since any approximation other than the MBR comes at greater expense, it suggests that exploring new ranking methods based on the MBR is a good way to go.

SLIDE 54: Some of our Results (continued) – the same mean-average-query-precision comparison for all metadata in the test collection (tables). BUT: including the UDA-indexed metadata reduces precision, because coarse approximations of onshore or coastal geographic regions necessarily include much irrelevant offshore area, and vice versa.

SLIDE 55: Results for MBRs – named data (precision–recall graph)

SLIDE 56: Results for Convex Hulls – named data (precision–recall graph)

SLIDE 57: Offshore / Coastal Problem – California EEZ sonar imagery map, GLORIA Quad 13. Problem: the MBR for GLORIA Quad 13 overlaps with several counties that are completely inland.

SLIDE 58: Adding a Shorefactor Feature Variable – Shorefactor = 1 – abs(fraction of the query region approximation that is onshore – fraction of the candidate GIO approximation that is onshore).
Candidate GIO MBRs: A) GLORIA Quad 13, fraction onshore = .55; B) WATER Project Area, fraction onshore = .74. Query region MBR: Q) Santa Clara County, fraction onshore = .95.
Computing shorefactor: Q–A shorefactor = 1 – abs(.95 – .55) = .60; Q–B shorefactor = 1 – abs(.95 – .74) = .79.
Even though A and B have the same area of overlap with the query region, B has a higher shorefactor, which weights this GIO's similarity score higher than A's. Note: the geographic content of A is completely offshore, while that of B is completely onshore.
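The shorefactor arithmetic from the slide as a one-line helper, checked against the worked numbers above (Q = .95 onshore, A = .55, B = .74):

    def shorefactor(query_onshore_fraction, gio_onshore_fraction):
        return 1 - abs(query_onshore_fraction - gio_onshore_fraction)

    print(round(shorefactor(0.95, 0.55), 2))   # 0.6  for GLORIA Quad 13
    print(round(shorefactor(0.95, 0.74), 2))   # 0.79 for WATER Project Area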

SLIDE 59: About the Shorefactor Variable – it characterizes the relationship between the query and candidate GIO regions based on the extent to which their approximations overlap with onshore areas (or offshore areas). Assumption: a candidate region is more likely to be relevant to the query region if the extent to which its approximation is onshore (or offshore) is similar to that of the query region's approximation.

SLIDE 60: About the Shorefactor Variable (continued) – the shorefactor variable is presented as an example of how geographic context can be integrated into the spatial ranking process. Performance: the onshore fraction of each GIO approximation can be pre-indexed, so for each query only the onshore fraction of the query region needs to be calculated with a geometric operation. The computational complexity of that operation depends on the complexity of the coordinate representations of the query region (we used the MBR and convex hull approximations) and of the onshore region (we used a very generalized concave polygon with only 154 points).

SLIDE 61: Shorefactor Model – X1 = area of overlap(query region, candidate GIO) / area of query region; X2 = area of overlap(query region, candidate GIO) / area of candidate GIO; X3 = 1 – abs(fraction of the query region approximation that is onshore – fraction of the candidate GIO approximation that is onshore); where the range for all variables is 0 (not similar) to 1 (same).

SLIDE 62: Some of our Results, with Shorefactor – mean average query precision for all metadata in the test collection (tables). These results suggest that adding the shorefactor variable improves the model (LR 2), especially for MBRs; the improvement is less dramatic for convex hull approximations, because the problem that shorefactor addresses is not as significant when areas are represented by convex hulls.

SLIDE 63: Results for All Data – MBRs (precision–recall graph)

SLIDE 64: Results for All Data – Convex Hulls (precision–recall graph)

SLIDE 65: GIR Examples – the following screen captures are from a GIR application using the algorithms described above (the 2-variable logistic regression model) and the CEIC database data; it uses a Google Earth network link to provide a GIR search interface.

SLIDES 66–71: Screenshots of the Google Earth GIR search interface