By: Namrata Lele. Mentors: Dave Vieglais, Bruce Wilson. VDC/TWG Meeting, August 09.


1 By: Namrata Lele. Mentors: Dave Vieglais, Bruce Wilson. VDC/TWG Meeting August 09

2
- Problem Overview
- Implementation Details
- Deliverables
- Results

3 [Diagram: Document A and Document B are passed to a parser that parses the document format and extracts terms; the semantic relatedness of the terms from documents A and B is then measured.]

4
- Implemented parsers to parse the following metadata documents:
  - DublinCore
  - DarwinCore
  - EML
- Deliverables:
  - DublinCore Parser
  - DarwinCore Parser
  - EML Parser

5
- Measure semantic relatedness of terms using the following libraries:
  - Lucene: a full-featured text search engine written entirely in Java. It is an open source project available for free download.
  - GTM (General Text Matcher): measures the similarity between texts. GTM is written in Java and is open source, released under the BSD license.

6
- Lucene scoring uses a combination of the Vector Space Model (VSM) of Information Retrieval and the Boolean model to determine how relevant a given Document is to a User's query.
- The idea behind the VSM is that the more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query.
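The VSM idea on this slide can be illustrated with a textbook TF-IDF weighting. This is a minimal sketch, not Lucene's actual scoring formula, which includes further factors that are not modelled here. The function name `tf_idf_scores` and the three sample descriptions are assumptions made up for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, documents):
    """Score each document against the query with a basic TF-IDF sum.

    documents: dict mapping a document id to its list of tokens.
    Returns a dict mapping document id to score.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in documents.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if counts[term] == 0:
                continue
            # Term frequency, normalised by document length.
            tf = counts[term] / len(tokens)
            # Terms that are rare across the collection weigh more.
            idf = math.log(n_docs / doc_freq[term]) + 1.0
            score += tf * idf
        scores[doc_id] = score
    return scores


# Hypothetical term descriptions, already tokenised.
docs = {
    "country": ["full", "name", "of", "the", "country"],
    "creator": ["name", "of", "the", "person", "who", "created", "the", "resource"],
    "format": ["file", "format", "of", "the", "resource"],
}
scores = tf_idf_scores(["country", "name"], docs)
best = max(scores, key=scores.get)
```

Here `best` is the description that contains both query terms, and the description that shares no query term scores zero.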

7 [Flow diagram: Lucene-based term mapping]
- Inputs: parsed input document A (XML file) and parsed input document B (XML file).
- Lucene Index Builder (stemming and stop-word filtering; expand description for synonyms using Wordnet index), producing the Lucene index.
- Vocabulary Term Mapper:
  1) Read term-description from document A.
  2) Stem, stop-word filter and expand description (query) for synonyms using Wordnet index.
  3) Search term and description in Lucene index for document B.
- Get matching terms (Lucene documents).
- Output: create XML output file.
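The index-then-search flow of this slide can be sketched without Lucene as a toy inverted index. Everything here is an assumption for illustration: the function names, the two small vocabularies, and the ranking by number of shared tokens, which stands in for Lucene's scoring. Stemming, stop-word filtering and synonym expansion are left out, so common words such as "the" still count as matches.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET


def build_index(vocabulary):
    """Map each token to the set of terms whose description contains it."""
    index = defaultdict(set)
    for term, description in vocabulary.items():
        for token in description.lower().split():
            index[token].add(term)
    return index


def search(index, query):
    """Rank indexed terms by the number of distinct query tokens they share."""
    hits = defaultdict(int)
    for token in set(query.lower().split()):
        for term in index.get(token, ()):
            hits[term] += 1
    # Highest count first; ties broken alphabetically for a stable order.
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


def map_terms(vocab_a, vocab_b):
    """Build an XML tree listing, for each term in A, its matches in B."""
    index_b = build_index(vocab_b)
    root = ET.Element("mappings")
    for term, description in vocab_a.items():
        node = ET.SubElement(root, "term", name=term)
        # Query with both the term and its description, as on the slide.
        for match, shared in search(index_b, term + " " + description):
            ET.SubElement(node, "match", name=match, shared=str(shared))
    return root


# Hypothetical vocabularies: term -> description.
vocab_a = {"title": "name given to the resource"}
vocab_b = {
    "title": "the title of the data set",
    "creator": "person who created the data set",
}
root = map_terms(vocab_a, vocab_b)
xml_text = ET.tostring(root, encoding="unicode")
```

`xml_text` holds one `term` element per entry of document A, with its candidate matches from document B as child elements.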

8
- Lucene Index Builder
- Lucene Vocabulary Term Mapper
- Wordnet Index Builder

9
- Original description: the full, unabbreviated name of the country
- Stop-word filtered, stemmed and synonym-expanded description: the full entire fully good total undivided wax wide unabbrevi name advert appoint call cite constitute describe diagnose discover distinguish epithet figure gens identify key list make mention nominate refer of countri
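The transformation shown on this slide can be approximated with a toy analysis chain. This is only a sketch: the stop-word list, the synonym table (standing in for a WordNet-backed index) and the suffix-stripping `crude_stem` are all hypothetical, so the output does not reproduce the slide's expanded description exactly.

```python
STOP_WORDS = {"a", "an", "of", "the"}

# Hypothetical synonym table standing in for a WordNet-backed index.
SYNONYMS = {
    "full": ["entire", "total"],
    "name": ["call", "identify"],
}


def crude_stem(word):
    """Strip a few common suffixes; a toy substitute for a real stemmer."""
    for suffix in ("ated", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand_description(description):
    """Drop stop words, then emit each stemmed word followed by its synonyms."""
    tokens = []
    for raw in description.lower().replace(",", " ").split():
        if raw in STOP_WORDS:
            continue
        tokens.append(crude_stem(raw))
        tokens.extend(crude_stem(s) for s in SYNONYMS.get(raw, []))
    return tokens


expanded = expand_description("the full, unabbreviated name of the country")
```

Each surviving word is followed directly by its synonyms, so the expanded query keeps the word order of the original description.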

10 [Flow diagram: GTM-based term mapping]
- Inputs: parsed input document A (XML file) and parsed input document B (XML file).
- Vocabulary Term Mapper:
  - Read term-description from document A (stem and stop-word filter).
  - Read term-description from document B (stem and stop-word filter).
  - Get similarity score using GTM.
  - Maintain top 5 scores for every term-description in document A.
- Output: score, term, description and XPath written to an XML file.
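The "maintain top 5 scores" step can be sketched as follows. The similarity function is a plain unigram F-measure used as a stand-in; it is not GTM's algorithm, and the sample vocabulary and function names are assumptions for illustration.

```python
import heapq
from collections import Counter


def unigram_f_measure(candidate, reference):
    """Harmonic mean of unigram precision and recall (a stand-in score)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared word at most as often as it occurs.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def top_matches(description_a, vocab_b, k=5):
    """Return the k best-scoring (score, term) pairs from vocabulary B."""
    scored = ((unigram_f_measure(description_a, desc), term)
              for term, desc in vocab_b.items())
    return heapq.nlargest(k, scored)


# Hypothetical term descriptions for document B.
vocab_b = {
    "Country": "name of the country",
    "Continent": "name of the continent",
    "County": "full name of the county",
    "Locality": "specific description of the place",
    "Collector": "name of the person",
    "Year": "four digit year",
    "Remarks": "free text comments",
}
top = top_matches("full name of the country", vocab_b)
top_terms = [term for _, term in top]
```

`heapq.nlargest` keeps only the best `k` candidates, so the mapper never needs to sort the full list of scores for each term of document A.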

11
- Modified GTM Library
- GTM Vocabulary Term Mapper

12
- The results obtained from various mappings are as follows:
  - DublinCore – DarwinCore
  - DarwinCore – DublinCore
  - EML – DarwinCore
  - DarwinCore – EML
  - EML – DublinCore
  - DublinCore – EML

13
- The following resulting terms, obtained from the Lucene Vocabulary Term Mapper, match the existing mapping between DublinCore and EML:
  - Title – Title
  - Creator – Creator
  - Publisher – Publisher
  - Format – Physical
  - Coverage – Coverage
  - Rights – Intellectual Rights

14
- The results obtained from various mappings are as follows:
  - DublinCore – DarwinCore
  - DarwinCore – DublinCore


16
- Fix a bug in the EML parser.
- Provide two versions of the EML parser: one that has descriptions for all terms in the hierarchy and one that has a description for only the current term.

17
- Got a chance to learn new libraries (Lucene and GTM)
- Learnt new concepts about semantic similarity
- Honed my XML skills
- Enjoyed working with this team

18
- General Text Matcher: http://nlp.cs.nyu.edu/GTM/
- http://lucene.apache.org/java/2_4_0/scoring.html
- http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/wordnet/package-summary.html

19 VDC/CCIT Meeting June 09

