GLOSSARY COMPILATION
Alex Kotov (akotov2), Hanna Zhong (hzhong), Hoa Nguyen (hnguyen4), Zhenyu Yang (zyang2)

Roadmap
- Problem definition
- Motivation
- Solution framework
- Demo
- Conclusion

Problem definition
The purpose of an automatic glossary compiler is to aid in the construction of a list of definitions across a large collection of documents.
A definition is a concise description of what an entity is.
Challenges:
- There are multiple ways to phrase a definition
- A single term can have multiple definitions, so clustering is needed

Motivation
Benefits for everyone:
- Construct a glossary without marking index words by hand
- Quickly look up the definition of a term in a book, journal articles, a set of books, or a collection of papers on a particular topic
No similar tool currently exists.

Solution framework
- Query processing: Yahoo API
- Definition extraction: MINIPAR
- Clustering algorithm: K-means
- Technology: IE Toolbar

Page processing
Goals:
- Fetch pages for a given query (use multi-threading to accelerate)
- Convert multiple formats into text format, e.g., PDF files
- Filter: remove HTML tags, incomplete tokens…; detect sentence boundaries; remove garbage

Page processing (cont.)
[Flowchart; recoverable labels: query string, Process Query (Yahoo API), result set, Fetch URL, "pdf?" / "html?" branch, Convert to TXT, Remove Tag, Sentence Segmentation, Garbage Cleaning, pages (.TXT)]
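The filtering steps named on the page-processing slides (tag removal, sentence segmentation, garbage cleaning, threaded fetching) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every function name, the sentence-boundary rule, and the garbage thresholds are assumptions, and the Yahoo API query and PDF conversion steps are left out. The `fetch` argument is a caller-supplied function so that no particular HTTP library is assumed.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_tags(html):
    """Remove HTML tags and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def split_sentences(text):
    """Naive rule (assumption): split after . ! ? when an uppercase letter or quote follows."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+(?=[A-Z"“])', text) if s.strip()]


def is_garbage(sentence, min_words=4, min_alpha=0.6):
    """Heuristic (assumption): too few words, or too few alphabetic characters."""
    if len(sentence.split()) < min_words:
        return True
    alpha = sum(ch.isalpha() for ch in sentence)
    return alpha / len(sentence) < min_alpha


def process_page(html):
    """Tag removal, then sentence segmentation, then garbage cleaning."""
    return [s for s in split_sentences(strip_tags(html)) if not is_garbage(s)]


def fetch_and_process(urls, fetch, workers=4):
    """Fetch pages concurrently with the caller-supplied fetch(url) -> html, then clean each."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda url: process_page(fetch(url)), urls))
```

Threads suit this step because fetching is I/O-bound; `pool.map` also keeps the results in the same order as the input URLs.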

Definition extraction
Dependency parser (MINIPAR):
- Based on the theory of dependency grammars
- Broad-coverage parser
- Output is a parse tree representing head-modifier relations
Generic definition patterns:
- Use generic semantic patterns to overcome syntactic variability (expressing the same meaning with the same set of words through different syntactic structures of a sentence)
- Extensible, easily coded in XML, and require minimal knowledge of linguistics

Definition extraction
Example sentence: “Data Mining, also known as knowledge discovery in data bases, is the process of automatically searching large volumes of data for patterns.”

Definition extraction
Simple and complex definitions:
- Although it is usually used in relation to analysis of data, data mining, like artificial intelligence, is an umbrella term and is used with varied meaning in a wide range of contexts
- Data Mining can be defined as "The nontrivial extraction of implicit, previously unknown, and potentially useful information from data"
Simple and complex terms being defined:
- Data Mining
- Core of comparative genome analysis
Extensible
High accuracy (limited by the parser)
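To make the idea of a definition pattern concrete, here is a deliberately simplified sketch. It does not use MINIPAR or XML rules: it swaps in plain regular expressions over the surface text, which is a much weaker technique than the dependency-based patterns the slides describe. The pattern list and the name `extract_definition` are hypothetical.

```python
import re

# Hypothetical surface patterns (assumption). The slides' actual patterns are
# XML rules matched against MINIPAR dependency parses, not regular expressions.
TERM = r"(?P<term>[A-Z][\w-]*(?: [\w-]+){0,3}?)"
PATTERNS = [
    re.compile(TERM + r"(?:, also known as [^,]+,)? is (?P<definition>(?:the|a|an) .+?)\.?$"),
    re.compile(TERM + r" can be defined as (?P<definition>.+?)\.?$"),
    re.compile(TERM + r" refers to (?P<definition>.+?)\.?$"),
]


def extract_definition(sentence):
    """Return (term, definition) if the sentence matches a pattern, else None."""
    text = sentence.strip().strip('"“”')
    for pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return match.group("term"), match.group("definition").strip(' "“”')
    return None
```

Applied to the example sentences from the slides, this sketch handles the two simple cases ("X, also known as Y, is the ..." and "X can be defined as ...") but returns `None` for the "Although it is usually used ..." sentence, where the defined term is buried mid-sentence. That gap is the kind of syntactic variability a parse-based pattern is meant to cover.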

Clustering
Algorithm:
- K-means
Similarity measure:
- Vector space model
Challenges:
- Define k
- Define similarity measure
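A minimal sketch of k-means over a vector space model, assuming unit-length term-frequency vectors compared by cosine similarity (the slides do not say which weighting or similarity function was used). The caller must supply k, which is the first challenge listed above. The seeding by document index is an assumption made only to keep the example deterministic.

```python
import math
import re
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector, scaled to unit length."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {term: c / norm for term, c in counts.items()}


def cosine(u, v):
    """Sparse dot product; equals cosine similarity for unit-length vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(term, 0.0) for term, w in u.items())


def mean_vector(vectors):
    """Centroid of the given vectors, rescaled to unit length."""
    total = Counter()
    for vec in vectors:
        for term, w in vec.items():
            total[term] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {term: w / norm for term, w in total.items()}


def kmeans(texts, k, seeds=None, max_iter=20):
    """Cluster texts into k groups; returns one cluster label per text."""
    vectors = [tf_vector(t) for t in texts]
    seeds = list(range(k)) if seeds is None else seeds  # initial centroids (assumption)
    centroids = [vectors[i] for i in seeds]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector goes to its most similar centroid.
        new_labels = [max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: recompute each centroid from its current members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels
```

For example, four candidate definitions of an ambiguous term, two about an island and two about a programming language, separate into two clusters with `kmeans(texts, k=2)`.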

Demo

Thank you! Questions?