Project 3 CS652 Information Extraction and Information Integration.

Presentation transcript:

Project 3 CS652 Information Extraction and Information Integration

Project 3 Presented by: Reema Al-Kamha

Results: Name Matcher

1) Baseline:

  Application  T        S         P  R
  Faculty      cornell  berkeley  1  1
  Faculty      texas    berkeley  1  1
  Course       Rice     reed      1  9/11
  Course       uwm      reed      1  4/8

2) Improvements: adding many synonyms for each word.

  Application  T        S         P  R
  Faculty      cornell  berkeley  1  1
  Faculty      texas    berkeley  1  1
  Course       Rice     reed      1  1
  Course       uwm      reed      1  1

Results: NB Model

1) Baseline: I treated the contents of each row as one token.

  Application  T        S         P    R
  Faculty      cornell  berkeley  1    1/10
  Faculty      texas    berkeley  1    2/10
  Course       Rice     reed      1    1/11
  Course       uwm      reed      1    2/8

2) Improvements:

  Application  T        S         P    R
  Faculty      cornell  berkeley  1    6/10
  Faculty      texas    berkeley  1    6/10
  Course       Rice     reed      1    6/11
  Course       uwm      reed      1/5  5/8

Combination:

  Application  T        S         P    R
  Faculty      cornell  berkeley  1    6/10
  Faculty      texas    berkeley  1    6/10
  Course       Rice     reed      1    1/11
  Course       uwm      reed      1    2/8

Comments
- I could not figure out how to distinguish start_time and end_time.
- I parsed each row in the XML into tokens.
- I removed all stop words (and also removed punctuation such as ., ; and # from the vocabulary vector).
- I stripped suffixes, e.g. reducing "Introduction" to "Intro".
- I did not insert files that are in the source but not in the target.
- Sometimes I extracted the keywords in a document and treated the document as if it contained only those words, as with the award attribute.
- For some attributes, such as the code attribute, I separated the numeric part from the letter part so that code could match subject in the course application, and then dropped the numeric part.
- I had a lot of difficulty using Java for this project because it was very slow.
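
The row preprocessing described above can be pictured with a small sketch. This is an assumed Python illustration (the project itself was written in Java); the stop-word list and the five-character prefix truncation are stand-ins, not the actual values used.

```python
# A hypothetical preprocessing sketch: drop stop words and punctuation,
# truncate words so "Introduction" and "Intro" collide, and split values
# like "CS652" into a letter part and a numeric part.
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "for"}  # assumed list

def preprocess_row(row_text):
    tokens = []
    for raw in re.split(r"[\s.,;#]+", row_text):      # punctuation acts as a separator
        if not raw or raw.lower() in STOP_WORDS:       # drop stop words
            continue
        # separate the letter part from the numeric part, e.g. "CS652" -> "CS", "652"
        for part in re.findall(r"[A-Za-z]+|\d+", raw):
            if part.isalpha():
                part = part[:5].lower()                # "Introduction" -> "intro"
            tokens.append(part)
    return tokens

print(preprocess_row("CS652 Introduction to Information Extraction; Room #232"))
# ['cs', '652', 'intro', 'infor', 'extra', 'room', '232']
```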

Muhammed Al-Muhammed

Two schema matching techniques were implemented in Java: name matching and Naïve Bayes (NB). In general, the type of the data helps in achieving good matching results. Two improvements were made; more in the conclusions.

Name Matching

  Application  Target      Source    Recall          Precision  F-M
  Course       Washington  Reed      8/12 (9/12 *)   100%       80.3% (85.7% *)
  Course       WSU         Reed      6/16 (13/16 *)  100%       55% (89.3% *)
  Faculty      Washington  Berkeley  10/10           100%       100%
  Faculty      Michigan    Berkeley  10/10           100%       100%

  * after doing some improvement

NB

  Application  Target      Source    Recall          Precision  F-M
  Course       Washington  Reed      6/12 (8/12 *)   100%       62.9% (79.5% *)
  Course       WSU         Reed      9/16            100%       72.1%
  Faculty      Washington  Berkeley  7/10 (9/10 *)   90%        78.02% (90% *)
  Faculty      Michigan    Berkeley  7/10 (9/10 *)   90%        78.02% (90% *)

  * one element wrongly mapped to a different one

Conclusions
- In general, NM performed better than NB.
- Two small improvements:
  - a numerical ratio for the name matching
  - building expected patterns for the data ("helped in improving NB matching")
- Combining the two methods was helpful, but the results are still not significant enough to argue for the combination.

Tim – Project 3 Results

Name Matcher Improvements: Word Similarity Function
- Convert to lower case
- Combine (a sketch follows below):
  - Levenshtein edit distance, normalized to give a %
  - similar_text(), the % of characters the same
  - Soundex
  - Longest Common Subsequence: checks for substrings, normalized to give a %
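
A possible shape for such a combined word-similarity function, as a Python sketch: it averages a normalized Levenshtein score, a difflib ratio standing in for similar_text(), and a normalized longest-common-subsequence score. Soundex is omitted here, and the equal weighting is an assumption, not Tim's actual formula.

```python
# A sketch of a combined word-similarity function (assumed weighting).
from difflib import SequenceMatcher

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def lcs_len(a, b):
    # longest common subsequence length (rewards shared substrings)
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def word_similarity(a, b):
    a, b = a.lower(), b.lower()                         # convert to lower case
    if not a or not b:
        return 0.0
    longest = max(len(a), len(b))
    lev = 1.0 - levenshtein(a, b) / longest             # normalized to give a %
    sim = SequenceMatcher(None, a, b).ratio()           # % of characters the same
    lcs = lcs_len(a, b) / longest                       # normalized to give a %
    return (lev + sim + lcs) / 3.0

print(word_similarity("instructor", "instr"))  # a score in [0, 1]; higher means more similar
```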

Naïve Bayes Improvements

Classify data instances:
- Use regular expression classifiers (a sketch follows below)
- 24 general classes
  - Correspond to datatypes
  - No domain-specific classes
  - long_string, small_int, big_int, short_all_caps, med_all_caps, init_cap, init_caps, …, short_string
- Used only Course data to create the REs
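
A sketch of what such regular-expression instance classifiers could look like. The patterns, class boundaries, and first-match-wins ordering below are illustrative assumptions, not the original 24 classes.

```python
# Hypothetical regex classifiers that map raw data values to general,
# datatype-style classes before Naive Bayes training.
import re

# class name -> pattern; first match wins (assumed ordering)
CLASSES = [
    ("small_int",      re.compile(r"^\d{1,3}$")),
    ("big_int",        re.compile(r"^\d{4,}$")),
    ("short_all_caps", re.compile(r"^[A-Z]{1,4}$")),
    ("med_all_caps",   re.compile(r"^[A-Z]{5,10}$")),
    ("init_caps",      re.compile(r"^([A-Z][a-z]+\s+)+[A-Z][a-z]+$")),
    ("init_cap",       re.compile(r"^[A-Z][a-z]+$")),
    ("short_string",   re.compile(r"^.{1,15}$")),
    ("long_string",    re.compile(r"^.{16,}$")),
]

def classify(value):
    value = value.strip()
    for name, pattern in CLASSES:
        if pattern.match(value):
            return name
    return "other"

# The NB learner then sees class tokens instead of raw strings:
print([classify(v) for v in ["CS", "652", "Schema Matching", "10:00-11:15 MWF"]])
# ['short_all_caps', 'small_int', 'init_caps', 'short_string']
```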

Course Results (Domain: Course; columns: Test 1 P, R; Test 2 P, R; All 10 Tests P, R, F in %)

  Name Matcher Base:      8/8, 8/9, 7/7, 7/
  Naïve Bayes Base:       3/9, 5/9, 48
  Combined:               3/3, 3/9, 4/4, 4/
  Name Matcher Improved:  9/
  Naïve Bayes Improved:   5/9, 7/9, 57
  Combined:               9/9, 97

Faculty Results (Domain: Faculty; columns: Test 1 P, R; Test 2 P, R; All 10 Tests P, R, F in %)

  Name Matcher Base:      10/
  Naïve Bayes Base:       3/10, 30
  Combined:               3/3, 3/10, 3/3, 3/
  Name Matcher Improved:  10/
  Naïve Bayes Improved:   5/10, 8/10, 73
  Combined:               10/10, 100

Schema Matching Helen Chen CS652 Project 3 06/14/2002

Results from Name Matcher

  Application  Target  Source  # of Attr.  # of Missing Attr.  Matched  Recall  Precision
  Course       wsh     uwm     12          1                   11 (8)*  11/11
  Course       wsu     uwm     16          7                   9 (8)*   9/9
  Faculty      wsh     texas   10          0                            10/10
  Faculty      mch     texas   10          0                            10/10

  * The number in ( ) is the number matched before improvement.

Results from Naïve Bayes

  Application  Target  Source  # of Attr.  # of Missing Attr.  Recall  Precision
  Course       wsh     uwm     12          1                   5/11
  Course       wsu     uwm     16          7                   5/9
  Faculty      wsh     texas   10          0                   5/10    5/7
  Faculty      mch     texas   10          0                   5/10    5/7

Comments
- The name matcher works fine in the given two domains with an appropriate dictionary.
  - Add stemmed words, synonyms, etc. to the dictionary; make the words case insensitive.
- Naïve Bayes is not a good schema matching method in the given domains.
  - Use words instead of tuples as tokens.
  - Use a thesaurus (count stemmed words and synonyms as one token, ignore case).
- Improvements that could be made:
  - Use value characteristics (string length, numeric ratio, space ratio); a sketch follows below.
  - Use an ontology.
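
As a sketch of the value-characteristics idea, the hypothetical helper below (not from Helen's project) computes average string length, numeric ratio, and space ratio over a column's instance values.

```python
# Assumed helper: summarize a column of values by simple value characteristics.
def value_characteristics(values):
    stats = {"avg_len": 0.0, "numeric_ratio": 0.0, "space_ratio": 0.0}
    if not values:
        return stats
    for v in values:
        if not v:
            continue
        stats["avg_len"] += len(v)
        stats["numeric_ratio"] += sum(c.isdigit() for c in v) / len(v)
        stats["space_ratio"] += sum(c.isspace() for c in v) / len(v)
    n = len(values)
    return {k: round(s / n, 3) for k, s in stats.items()}

# Columns with similar characteristics (e.g. two phone-number columns) become
# candidate matches even when their names and vocabularies differ.
print(value_characteristics(["801-422-1234", "801-422-5678"]))
print(value_characteristics(["Introduction to Databases", "Data Mining"]))
```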

Yihong’s Project 3
- Course Domain:
  - Rice (11) → Washington (12): 11/11 directly mapped
  - Rice (11) → WSU (16): 9/11 directly mapped, 1/11 indirectly mapped, 1/11 not mapped
- Faculty Domain:
  - Cornell (10) → Washington (10): 10/10 directly mapped
  - Cornell (10) → Michigan (10): 10/10 directly mapped

Name Matcher
- Baseline:
  - Synonym list for each attribute name, built by training
  - Add most common synonyms and abbreviations
  - Compare case-insensitively
- Improvements:
  - Add more synonyms using WordNet
  - String similarity computation
  - Add a new category, “UNKNOWN”

Naïve Bayes
- Baseline:
  - Each entry in Raw_text used as a training unit (a sketch of the basic matcher follows below)
- Improvements:
  - Remove stopwords
  - Cluster special strings
  - String similarity computation
  - Add a new category, “UNKNOWN”
  - Training-size experiment
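
For reference, the basic multinomial Naïve Bayes matcher these slides keep referring to can be sketched as below. The class and method names are invented, and the original implementations were in Java; this is only an assumed illustration.

```python
# Hypothetical sketch: train one class per target attribute from its instance
# values, then assign a source column to the class with the highest log-score.
import math
from collections import Counter, defaultdict

class NaiveBayesMatcher:
    def __init__(self):
        self.word_counts = defaultdict(Counter)   # target attribute -> token counts
        self.totals = Counter()                   # target attribute -> total tokens
        self.priors = Counter()                   # target attribute -> # training values
        self.vocab = set()

    def train(self, target_attr, values):
        for v in values:
            tokens = v.lower().split()
            self.word_counts[target_attr].update(tokens)
            self.totals[target_attr] += len(tokens)
            self.priors[target_attr] += 1
            self.vocab.update(tokens)

    def classify(self, values):
        # score every target attribute against the source column's values
        n = sum(self.priors.values())
        best, best_score = None, float("-inf")
        for attr in self.word_counts:
            score = math.log(self.priors[attr] / n)
            for v in values:
                for t in v.lower().split():
                    # Laplace smoothing for unseen tokens
                    p = (self.word_counts[attr][t] + 1) / (self.totals[attr] + len(self.vocab))
                    score += math.log(p)
            if score > best_score:
                best, best_score = attr, score
        return best

nb = NaiveBayesMatcher()
nb.train("title", ["Introduction to Databases", "Data Mining"])
nb.train("room", ["232 TMCB", "120 Kimball Tower"])
print(nb.classify(["Advanced Databases", "Information Retrieval"]))  # 'title'
```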

Results Conclusion

Course Domain: Rice → Washington [P (11/11), R (11/11)], Rice → WSU [P (11/11), R (9/9)]
Faculty Domain: Cornell → Washington [P (10/10), R (10/10)], Cornell → Michigan [P (10/10), R (10/10)]

  Name Matcher Base Line:  8/11, 8/9, 10/10
  Naïve Bayes Base Line:   3/11, 3/9, 3/10
  Combined:                8/11, 8/9, 10/10
  Name Matcher Improved:   11/11, 10/11 *, 9/9, 10/10
  Naïve Bayes Improved:    6/11, 6/11 *, 6/9, 7/10, 8/10
  Combined:                11/11, 10/11 *, 9/9, 10/10

Combination: random selection weighted by experimental accuracies.

David Marble CS 652 Project 3

Baseline Results (P/R values by source, under the WSU and Michigan test sets)

  Name Matcher:  WSU: Reed 0.58, Rice 0.73 | Michigan: Berkeley 1.00, Cornell 1.00
  Naïve Bayes:   WSU: Reed 0.42, Rice 0.45 | Michigan: Berkeley 0.45, Cornell 0.45
  Combined:      WSU: Reed 0.17, Rice 0.27 | Michigan: Berkeley 0.45, Cornell 0.45

Improved Results

NB: improved precision by tokenizing, separating text/numbers, and removing leading 0’s in numbers. Name Matcher: word stemming.

  Name Matcher:  WSU: Reed 0.83, Rice 0.82 | Michigan: Berkeley 1.00, Cornell 1.00
  Naïve Bayes:   WSU: Reed -, Rice - | Michigan: Berkeley 0.73, Cornell 0.73
  Best of Both:  WSU: Reed 0.92, Rice 0.91 | Michigan: Berkeley 1.00, Cornell 1.00

Comments
- WSU happened to be the “weird” one.
  - Building names completely different
  - Faculty with odd last names; only a few first names matched (not a lot of training names)
- Telephone #’s only matched when changing digits to “digit” instead of using the value.
- Start time / end time dilemma: why can’t schools run their schedule like BYU?

Craig Parker

Baseline Results
- Course 1: Recall = .6, Precision = 1
- Course 2: Recall = .66, Precision = 1
- Faculty: Recall = .8, Precision = 1

Modified Results
- Course 1: Recall = .7, Precision = 1
- Course 2: Recall = .78, Precision = 1
- Faculty: Recall = .8, Precision = 1

Discussion
- Modification of the name matching involved a number of substring comparisons.
- Modifications improved results for both Course tests.
- Modifications did not change results for the Faculty tests.
- The Naïve Bayesian classifier is not well suited for all types of data (buildings, sections, phone numbers).

Schema Matching results Lars Olson

Baseline test data
- Test 1 (Course: Washington → Reed)
  - R = 3/9 (33%), P = 3/3 (100%)
  - room, title, days
- Test 2 (Course: Washington → Rice)
  - R = 4/9 (44%), P = 4/4 (100%)
  - room, credits, title, days
- Test 3 (Faculty: Washington → Berkeley)
  - R = 8/10 (80%), P = 8/8 (100%)
  - name, research, degrees, fac_title, award, year, building, title
- Test 4 (Faculty: Washington → Cornell): identical to Test 3

After Improvements
- Test 1 (Course: Washington → Reed)
  - Name matcher: R = 8/9 (89%), P = 8/8 (100%) (missed schedule_line → reg_num)
  - Bayes: R = 4/9 (44%), P = 4/12 (33%) (also missed schedule_line)
- Test 2 (Course: Washington → Rice)
  - Name matcher: R = 9/9 (100%), P = 9/9 (100%)
  - Bayes: R = 4/9 (44%), P = 4/12 (33%)
- Test 3 (Faculty: Washington → Berkeley | Cornell)
  - Name matcher: R = 10/10 (100%), P = 10/10 (100%)
  - Bayes: R = 8/10 (80%), P = 8/10 (80%)

Comments
- Improvements made:
  - Name matcher:
    - Remove all symbols (e.g. ‘_’) from the string
    - Build a thesaurus based on the training set
  - Bayes learner (a sketch of the digit-masking idea follows below):
    - Attempt 1: classify all numbers together
    - Attempt 2: replace all digits with ‘#’
    - Idea: an FSA tokenizer (to recognize phone numbers ###-####, times ##:##)
- Difficulties:
  - What are the correct matches? (e.g. restrictions → comments)
  - Aggregate matches were not included in the recall measures
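
The digit-masking and pattern-token ideas can be sketched as follows; the specific patterns below are assumptions, not Lars's FSA.

```python
# Hypothetical sketch: mask digits with '#' so values like phone numbers and
# times collapse into a small set of shape tokens the classifier can learn.
import re

def mask_digits(value):
    # Attempt 2: replace every digit with '#'
    return re.sub(r"\d", "#", value)

def shape_token(value):
    # the "FSA tokenizer" idea, approximated with regular expressions
    masked = mask_digits(value.strip())
    if re.fullmatch(r"###-####", masked):
        return "PHONE"
    if re.fullmatch(r"##?:##", masked):
        return "TIME"
    if re.fullmatch(r"#+", masked):
        return "NUMBER"          # Attempt 1: all plain numbers together
    return masked                # otherwise keep the digit-masked string

print([shape_token(v) for v in ["422-3027", "9:30", "10:50", "232", "CS 652"]])
# ['PHONE', 'TIME', 'TIME', 'NUMBER', 'CS ###']
```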

Jeff Roth Project 3

Basic Results
- Course: Target = Reed; Training = Rice, uwm, Washington; Source = wsu
  - Naïve Bayes: 7/12 correct, 6/16 FP
  - Name Classifier: 12/15 correct, 0/19 FP
- Faculty: Target = Berkeley; Training = Cornell, Texas, Washington; Source = Michigan
  - Naïve Bayes: 6/10 correct, 3/10 FP
  - Name Classifier: 14/14 correct, 0/14 FP
- Course: Target = Rice; Training = Reed, uwm, Washington; Source = wsu
  - Naïve Bayes: 7/10 * correct, 5/16 FP
  - Name Classifier: 12/13 correct, 0/19 FP
- Faculty: Target = Cornell; Training = Berkeley, Texas, Washington; Source = Michigan
  - Naïve Bayes: 5/10 correct, 3/10 FP
  - Name Classifier: 14/14 correct, 0/14 FP

“Improved” Naïve Bayes
- Course: Target = Reed; Training = Rice, uwm, Washington; Source = wsu
  - Naïve Bayes: 7/12 correct, 7/16 FP
- Faculty: Target = Berkeley; Training = Cornell, Texas, Washington; Source = Michigan
  - Naïve Bayes: 6/10 correct, 3/10 FP
- Course: Target = Rice; Training = Reed, uwm, Washington; Source = wsu
  - Naïve Bayes: 7/10 * correct, 5/16 FP
- Faculty: Target = Cornell; Training = Berkeley, Texas, Washington; Source = Michigan
  - Naïve Bayes: 5/10 correct, 3/10 FP

Improvements (a scoring sketch follows below):
1. Classification = argmax over v_j of [ log P(v_j) + Σ log P(a_i | v_j) ] (included in basic)
2. If a word in the classification document has no match, use P = 1 / (2 * |vocabulary|) (no help)
3. Divide by the number of words in the test document and find the global max (scratched)
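
Improvements 1 and 2 can be sketched as below; the helper names and the toy class table are hypothetical, not Jeff's code.

```python
# Log-space scoring with unseen words backed off to 1 / (2 * |vocabulary|).
import math

def score(tokens, prior, word_probs, vocab_size):
    # word_probs: P(word | class) estimated from training data
    s = math.log(prior)
    for t in tokens:
        p = word_probs.get(t, 1.0 / (2 * vocab_size))   # improvement 2
        s += math.log(p)                                 # improvement 1
    return s

def classify(tokens, classes):
    # classes: {name: (prior, word_probs, vocab_size)}
    return max(classes, key=lambda c: score(tokens, *classes[c]))

classes = {
    "title": (0.5, {"databases": 0.2, "introduction": 0.2}, 50),
    "room":  (0.5, {"tmcb": 0.3, "kimball": 0.1}, 50),
}
print(classify(["introduction", "databases", "xml"], classes))  # 'title'
```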

Combination
- Course: Target = Reed; Training = Rice, uwm, Washington; Source = wsu
  - Name Classifier: 13/15 correct, 0/19 FP
- Faculty: Target = Berkeley; Training = Cornell, Texas, Washington; Source = Michigan
  - Name Classifier: 14/14 correct, 0/14 FP
- Course: Target = Rice; Training = Reed, uwm, Washington; Source = wsu
  - Name Classifier: 12/13 correct, 0/19 FP
- Faculty: Target = Cornell; Training = Berkeley, Texas, Washington; Source = Michigan
  - Name Classifier: 14/14 correct, 0/14 FP

Combination algorithm (a sketch follows below):
1. Match source to target if both Naïve Bayes and the name matcher agreed.
2. Match remaining unmatched target elements to the source by the name matcher.
3. Match any remaining unmatched target elements to the source by Naïve Bayes.
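
A sketch of the three-step combination algorithm; the nm and nb dictionaries stand for the (hypothetical) outputs of the name matcher and the Naïve Bayes classifier.

```python
# Assumed illustration of the combination steps listed above.
def combine(nm, nb):
    mapping = {}
    used_targets = set()

    # Step 1: keep matches on which both methods agree
    for src, tgt in nm.items():
        if nb.get(src) == tgt:
            mapping[src] = tgt
            used_targets.add(tgt)

    # Steps 2 and 3: fill remaining slots, preferring the name matcher
    for proposals in (nm, nb):
        for src, tgt in proposals.items():
            if src not in mapping and tgt not in used_targets:
                mapping[src] = tgt
                used_targets.add(tgt)
    return mapping

nm = {"crs_title": "title", "bldg": "building", "instr": "instructor"}
nb = {"crs_title": "title", "bldg": "room", "instr": "name"}
print(combine(nm, nb))
# {'crs_title': 'title', 'bldg': 'building', 'instr': 'instructor'}
```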

Schema Matching by Using Name Matcher and Naïve Bayesian Classifier (NB) Cui Tao CS652 Project 3

Name Matcher

Results:
  Application  Mapping                Precision / Recall
  Course       UWM → Washington       9/9
  Course       WSU → Washington       9/9
  Faculty      Texas → Washington     10/10
  Faculty      Michigan → Washington  10/10

Techniques (a sketch follows below):
- Tokenization of names: SectionNr → Section, Nr; Start_time → Start, time
- Expansion of short forms and acronyms: nr → number, bldg → building, rm → room, sect → section, crse or crs → course
- Thesaurus of synonyms, hypernyms, and acronyms: Nr → Code, restriction → limit, etc.
- Ignore case
- Heuristic name matching (Cupid)
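
The tokenization, short-form expansion, and thesaurus steps might look like the sketch below; the expansion table reuses the examples from the slide, and the thesaurus entries are assumptions.

```python
# Hypothetical name-matcher sketch: split names into tokens, expand short
# forms, and consult a small thesaurus before comparing token sets.
import re

EXPANSIONS = {"nr": "number", "bldg": "building", "rm": "room",
              "sect": "section", "crse": "course", "crs": "course"}
THESAURUS = {"number": {"code"}, "restriction": {"limit"}}   # assumed entries

def name_tokens(name):
    # "SectionNr" -> ["section", "number"]; "Start_time" -> ["start", "time"]
    parts = re.findall(r"[A-Z][a-z]+|[a-z]+|\d+", name.replace("_", " "))
    return [EXPANSIONS.get(p.lower(), p.lower()) for p in parts]

def names_match(a, b):
    ta, tb = set(name_tokens(a)), set(name_tokens(b))
    if ta & tb:                       # shared token after expansion
        return True
    for t in ta:                      # thesaurus: synonyms / hypernyms
        if THESAURUS.get(t, set()) & tb:
            return True
    return False

print(name_tokens("SectionNr"))               # ['section', 'number']
print(names_match("SectionNr", "Code"))       # True, via nr -> number -> code
print(names_match("Start_time", "End_Time"))  # True (shared token 'time')
```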

Naïve Bayesian Classifier

Improvements:
- Use tokens instead of tuples
  - Name: “Richard Anderson”, “Thomas Anderson”, “Thomas F. Coleman” → “Thomas”, “Richard”, “Anderson”, “F.”, “Coleman”
  - Building, degree, research, etc.
- Eliminate stopwords
- Stemming: treat words as one token when they share a substring at least 80% as long as the whole word (one interpretation is sketched below)
- Ignore case

Problems:
- Names, buildings, etc.
- Numbers: room, time, code
- Keyword confusions: research, award, title
- Different systems: room, section number, etc.
- Phone numbers (cannot be matched by NB, but the match is easy to find using pattern recognition)

Results:
  Application  Mapping                Precision  Recall
  Course       UWM → Washington       5/10       5/9
  Course       WSU → Washington       6/7        6/9
  Faculty      Texas → Washington     8/8        8/10
  Faculty      Michigan → Washington  8/8        8/10
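
One possible reading of the 80% shared-substring stemming rule, as a sketch (an assumed interpretation, not the original code):

```python
# Treat two words as the same token when their longest common substring
# covers at least 80% of the shorter word.
from difflib import SequenceMatcher

def same_stem(a, b, threshold=0.8):
    a, b = a.lower(), b.lower()
    match = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return match.size >= threshold * min(len(a), len(b))

print(same_stem("databases", "database"))     # True
print(same_stem("introduction", "intro"))     # True
print(same_stem("room", "building"))          # False
```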

Conclusion
- Combine them together:
  - How: on a conflict, follow the name matcher
  - Result: all 100%
- Name matcher: works better for this application
- NB: may work better for indirect mappings

Project 3: Schema Matching Alan Wessman

Baseline Results
- Course test set: UWM
- Faculty test set: Texas

Improved Results
- Name matcher improvements (a sketch follows below):
  - Lower case, trim whitespace
  - Remove vowels
  - Match if exact, prefix, or edit distance = 1
- Naïve Bayes improvements:
  - Lower case, trim whitespace
  - Consider only the first 80 chars
  - Consider only the first alphanumeric token in the string
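
A sketch of this matching rule (an assumed illustration, not Alan's code): normalize by lower-casing, trimming, and removing vowels, then accept exact matches, prefix matches, or names within edit distance 1.

```python
import re

def normalize(name):
    name = name.strip().lower()
    return re.sub(r"[aeiou]", "", name)      # remove vowels: "section" -> "sctn"

def edit_distance(a, b):
    # standard dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def names_match(a, b):
    na, nb = normalize(a), normalize(b)
    return (na == nb
            or na.startswith(nb) or nb.startswith(na)   # prefix match
            or edit_distance(na, nb) <= 1)

print(names_match("Section", " sect "))   # True: "sctn" starts with "sct"
print(names_match("Room", "Credits"))     # False
```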

Commentary
- Improved name matcher effective
  - But performance decreases if it is too general
- Naïve Bayes not very useful
  - Fails when different attributes have similar values (start_time, end_time, room, section_num)
  - Fails when the same attribute has different values or formats across data sources (room, comments)
- “Sophisticated” string classifier for NB failed miserably; worse than baseline, so I threw it out!

CS 652 Project #3 Schema Element Mapping -- By Yuanqiu (Joe) Zhou

Baseline Experimental Results

  Application  Target Schema  # Target Elem.  Source Schema  # Source Elem.  # Expected Mappings  Recall  Precision
  Course       UWM            15              WSU            16              10                   6/10    6/13
  Course       Washington     12              WSU            16              10                   7/10    7/13
  Faculty      Texas          10              Michigan       10                                   8/10
  Faculty      Washington     10              Michigan       10                                   8/10

CS 652 Project #3 Schema Element Mapping -- By Yuanqiu (Joe) Zhou

Improvements (at least tried)
- Name Matcher
  - Using simple text transformation functions, such as sub-string, prefix, and abbreviation
- NB Classifier
  - Positive word density (does not work at all)
  - Regular expressions for common data types, such as time, small integers, and large integers
- Combination
  - Favor the name matcher over the NB classifier
  - The NB classifier can be used to break ties left by the name matcher (such as sect → section vs. sect → section_note)

CS 652 Project #3 Schema Element Mapping -- By Yuanqiu (Joe) Zhou

Experimental Results with Improvements

  Application  Target Schema  # Target Elem.  Source Schema  # Source Elem.  # Expected Mappings  Recall  Precision
  Course       UWM            15              WSU            16              10                   10/10
  Course       Wash           12              WSU            16              10                   10/10
  Faculty      Texas          10              Michigan       10                                   10/10
  Faculty      Wash           10              Michigan       10                                   10/10

CS 652 Project #3 Schema Element Mapping -- By Yuanqiu (Joe) Zhou

Comments
- High precision and recall result mostly from the improvements to the name matcher.
- Improvements to the NB classifier did not contribute much (they only corrected one missed mapping for one course application).
- The NB classifier is not suited to distinguishing elements with similar data types (such as time and number) or elements that share many common values.
- Reducing the size of the training data can achieve the same precision and recall with less running time.