Published by Andrew Pearson. Modified over 9 years ago.
1
By: Namrata Lele
Mentors: Dave Vieglais, Bruce Wilson
VDC/TWG Meeting, August 09
2
Problem Overview
Implementation Details
Deliverables
Results
3
Document A, Document B
Parser to parse the document format and extract terms
Measure semantic relatedness of terms from documents A and B
4
Implemented parsers for the following metadata documents: DublinCore, DarwinCore, EML
Deliverables:
DublinCore Parser
DarwinCore Parser
EML Parser
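A minimal sketch of what "parse the document and extract terms" could look like. It assumes the vocabulary is available as an XSD-style XML file in which each term is an xs:element carrying its description in xs:documentation. The sample fragment and the function name extract_terms are hypothetical; the actual parsers from this project may work differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: a tiny XSD-style fragment. The real DublinCore,
# DarwinCore and EML files are larger and may be structured differently.
SAMPLE = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="title">
    <xs:annotation>
      <xs:documentation>A name given to the resource.</xs:documentation>
    </xs:annotation>
  </xs:element>
  <xs:element name="country">
    <xs:annotation>
      <xs:documentation>The full, unabbreviated name of the country.</xs:documentation>
    </xs:annotation>
  </xs:element>
</xs:schema>"""

XS = "{http://www.w3.org/2001/XMLSchema}"


def extract_terms(xml_text):
    """Return a list of (term, description) pairs from an XSD-style document."""
    root = ET.fromstring(xml_text)
    pairs = []
    for element in root.iter(XS + "element"):
        name = element.get("name")
        if name is None:
            continue  # skip references to elements defined elsewhere
        doc = element.find(XS + "annotation/" + XS + "documentation")
        has_text = doc is not None and doc.text
        # Collapse internal whitespace so descriptions compare cleanly.
        description = " ".join(doc.text.split()) if has_text else ""
        pairs.append((name, description))
    return pairs


for term, description in extract_terms(SAMPLE):
    print(term, "->", description)
```

Each (term, description) pair is the unit that the later mapping steps compare across documents A and B.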
5
Measure semantic relatedness of terms using the following libraries:
Lucene: a full-featured text search engine written entirely in Java. It is an open source project available for free download.
GTM (General Text Matcher): measures the similarity between texts. It is written in Java and is open source, released under the BSD license.
6
Lucene scoring uses a combination of the Vector Space Model (VSM) of Information Retrieval and the Boolean model to determine how relevant a given Document is to a User's query.
The idea behind the VSM: the more times a query term appears in a document, relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query.
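The VSM idea can be illustrated with a textbook TF-IDF score. This is a simplified sketch, not Lucene's actual scoring formula (which adds further factors such as normalization and boosts); the function name tf_idf_scores and the toy documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, documents):
    """Score each document (a list of tokens) against the query terms."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if counts[term] == 0:
                continue  # term absent from this document
            tf = counts[term] / len(tokens)  # frequent in this document
            idf = math.log(n_docs / doc_freq[term]) + 1.0  # rare in the collection
            score += tf * idf
        scores.append(score)
    return scores


# Toy collection: one token list per description.
docs = [
    ["name", "of", "the", "country"],
    ["name", "of", "the", "creator"],
    ["date", "of", "publication"],
]
scores = tf_idf_scores(["country", "name"], docs)
print(scores)
```

The first document contains both query terms, and "country" occurs nowhere else in the collection, so it scores highest; the third contains neither term and scores zero.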
7
Lucene Index Builder:
Input: parsed input document B (Xml file)
Stemming and stop-word filtering
Expand description for synonyms using Wordnet index
Output: Lucene index
Vocabulary Term Mapper:
Input: parsed input document A (Xml file)
1) Read term-description from document A
2) Stem, stop-word filter and expand description (query) for synonyms using Wordnet index
3) Search term and description in Lucene index for document B
Get matching terms (Lucene documents)
Output: create xml output file
8
Lucene Index Builder
Lucene Vocabulary Term Mapper
Wordnet Index Builder
9
Original description: the full, unabbreviated name of the country
Stop-word filtered, stemmed and synonym-expanded description: the full entire fully good total undivided wax wide unabbrevi name advert appoint call cite constitute describe diagnose discover distinguish epithet figure gens identify key list make mention nominate refer of countri
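A rough sketch of the stop-word filter, stem and synonym-expansion step. Everything here is a stand-in: the stop-word list is assumed (it is not the one that produced the output above), SYNONYMS is a toy replacement for the Wordnet index, and naive_stem is a crude suffix stripper rather than a real stemmer.

```python
import re

# Assumed stop-word list; the list used in the project is not given.
STOP_WORDS = {"a", "an", "the", "of", "and", "in"}

# Toy synonym table standing in for a lookup in the Wordnet index.
SYNONYMS = {
    "full": ["entire", "total"],
    "name": ["call", "identify"],
}


def naive_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ated", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand_description(description):
    """Lower-case, drop stop words, stem each word and append its synonyms."""
    tokens = re.findall(r"[a-z]+", description.lower())
    expanded = []
    for token in tokens:
        if token in STOP_WORDS:
            continue
        expanded.append(naive_stem(token))
        expanded.extend(SYNONYMS.get(token, []))
    return expanded


print(expand_description("the full, unabbreviated name of the country"))
```

With these toy tables the result is a short token list in which each surviving word is followed by its synonyms, which can then be used as the query against the index for document B.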
10
Vocabulary Term Mapper (GTM):
Input: parsed input document A (Xml file), parsed input document B (Xml file)
Read term-description from document A (stem and stop-word filter)
Read term-description from document B (stem and stop-word filter)
Get similarity score using GTM
Maintain top 5 scores for every term-description in document A
Output: Score, Term, Desc, Xpath to xml file
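The "maintain top 5 scores" step can be sketched as follows. The similarity function here is a plain unigram F-measure used as a stand-in; it is not the GTM library's algorithm. The names unigram_f1 and top_matches and the toy data are hypothetical.

```python
import heapq
from collections import Counter


def unigram_f1(tokens_a, tokens_b):
    """Harmonic mean of unigram precision and recall (stand-in for GTM)."""
    if not tokens_a or not tokens_b:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(tokens_a) & Counter(tokens_b)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(tokens_b)
    recall = overlap / len(tokens_a)
    return 2 * precision * recall / (precision + recall)


def top_matches(description_a, terms_b, k=5):
    """Return up to k (score, term) pairs from terms_b, best first."""
    scored = [
        (unigram_f1(description_a, description_b), term_b)
        for term_b, description_b in terms_b.items()
    ]
    return heapq.nlargest(k, scored)


# Toy data: already stemmed and stop-word filtered token lists.
description_a = ["name", "countri"]
terms_b = {
    "country": ["name", "countri"],
    "creator": ["name", "creator"],
    "date": ["date", "publish"],
}
print(top_matches(description_a, terms_b, k=2))
```

Running top_matches once per term-description in document A yields the ranked candidates that would be written to the output file along with term, description and XPath.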
11
Modified GTM Library
GTM Vocabulary Term Mapper
12
The results obtained from the various mappings are as follows:
DublinCore – DarwinCore
DarwinCore – DublinCore
EML – DarwinCore
DarwinCore – EML
EML – DublinCore
DublinCore – EML
13
The following terms returned by the Lucene Vocabulary Term Mapper match the existing mapping between DublinCore and EML:
Title – Title
Creator – Creator
Publisher – Publisher
Format – Physical
Coverage – Coverage
Rights – Intellectual Rights
14
The results obtained from the various mappings are as follows:
DublinCore – DarwinCore
DarwinCore – DublinCore
16
Fix a bug in the EML parser.
Provide two versions of the EML parser: one that has descriptions for all terms in the hierarchy and one that has a description for only the current term.
17
Got a chance to learn new libraries (Lucene and GTM)
Learnt new concepts about semantic similarity
Honed my XML skills
Enjoyed working with this team
18
General Text Matcher: http://nlp.cs.nyu.edu/GTM/
http://lucene.apache.org/java/2_4_0/scoring.html
http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/wordnet/package-summary.html