Trends in NL Analysis
Jim Critz, University of New York in Prague
EurOpen.CZ, 12 December 2008



General Technologies
- Information Retrieval: identifies the relevant documents in a collection of documents
- Text Mining: measures word-word, word-passage and passage-passage relations. It considers not just the frequency of a word in a document but also the word in its local contexts (in the sentence, in the paragraph) over very large numbers of occurrences.
- Computational Linguistics
- Natural Language Processing
- Human Language Technologies
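To make the Information Retrieval item concrete, here is a minimal sketch of ranking documents against a query. The TF-IDF weighting, the function names and the sample documents are illustrative assumptions, not something the slides specify.

```python
import math
from collections import Counter

PUNCT = ".,;:!?()\"'"

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (w.strip(PUNCT).lower() for w in text.split())
    return [t for t in tokens if t]

def rank_documents(query, docs):
    """Return document indices sorted by descending TF-IDF overlap with the query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = tokenize(query)
    scored = []
    for idx, toks in enumerate(tokenized):
        tf = Counter(toks)
        # Smoothed inverse document frequency keeps the logarithm well defined.
        score = sum(tf[t] * math.log((1 + n_docs) / (1 + df[t])) for t in query_terms)
        scored.append((score, idx))
    # Highest score first; ties keep the original document order.
    return [idx for score, idx in sorted(scored, key=lambda p: (-p[0], p[1]))]

# Hypothetical three-document collection.
docs = [
    "Open source code repositories can be searched by keyword.",
    "Mailing list archives of open source projects.",
    "A recipe for apple juice.",
]
ranking = rank_documents("search code repositories", docs)
```

Note that the query word "search" does not match "searched" here: plain keyword matching has no notion of word forms, which is one of the gaps the later slides on semantics address.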

Examples where NL Analysis Can Help in the Future
- Search open source code repositories: www.krugle.org
- Search mail of open source projects: www.markmail.org
Both of these sites use XML tagging based on simple text analysis of the source documents to add more structure and improve the searchability of the documents. If you try these sites out, think about how you might improve your ability to find the information you need by extending text analysis and semantic metadata.


Identification of Text Content
Indexing: going beyond keywords to capture semantics
Distributional Hypothesis: “a word is recognized by the company it keeps”
Concept Discovery: extraction to a semantic category
- “this article is about water, juice and Pepsi” => “drink”
- Latent Semantic Analysis
- Lexicons
- Category trees, ontologies
- WordNet, FrameNet, Semantic Web
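The "water, juice and Pepsi" => "drink" example can be sketched with a lexicon lookup, the simplest of the options listed. The lexicon below is a hypothetical hand-built one, and the function name is an assumption made for illustration; a real system would draw on a resource such as WordNet.

```python
from collections import Counter

# Hypothetical hand-built lexicon: word -> semantic category.
LEXICON = {
    "water": "drink",
    "juice": "drink",
    "pepsi": "drink",
    "bread": "food",
    "cheese": "food",
}

def discover_concepts(text, lexicon=LEXICON):
    """Count how often each semantic category is evoked by words in the text."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    return Counter(lexicon[w] for w in words if w in lexicon)

concepts = discover_concepts("This article is about water, juice and Pepsi.")
# The most frequently evoked category serves as the article's concept.
top_concept = concepts.most_common(1)[0][0]
```

The result for this sentence is the category "drink", even though the word "drink" never appears in the text.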

Text Content: Named Entity Recognition (NER)
- People: people by, people from, people in, births, deaths, by occupation, surname, given name, biography, ...
- Organizations: company, team, business, media by, political party, club, union, newspaper, church, ...
- Places / GeoPolitical Entities: city, town, village, state, province, country, territory, ...
- Money
- Vehicles
- Dates
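One way to read the cue words above is as triggers that map a descriptive label (for instance a category name attached to an entity) to an entity type. The sketch below makes that assumption; the `CUES` table reuses a subset of the slide's cue words, while the function name and the crude substring matching are illustrative choices only.

```python
# Cue words taken from the slide; grouping them into this table is an assumption.
CUES = {
    "Person": ["births", "deaths", "people", "surname", "biography"],
    "Organization": ["company", "team", "political party", "club", "union",
                     "newspaper", "church"],
    "Place": ["city", "town", "village", "state", "province", "country",
              "territory"],
}

def entity_type(label):
    """Guess an entity type from a descriptive label, or None if no cue matches."""
    lowered = label.lower()
    # Types are checked in the order listed; the first matching cue wins.
    for etype, cues in CUES.items():
        if any(cue in lowered for cue in cues):
            return etype
    return None

person = entity_type("1950 births")
organization = entity_type("Newspapers published in Prague")
place = entity_type("Towns in Bohemia")
unknown = entity_type("Money")
```

Substring matching lets "towns" match the cue "town", but it also produces false hits on longer words, so a practical recognizer would need tokenization and disambiguation on top of this.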

Identification of Text Content: Indexing (cont'd)
Latent Semantic Analysis: measures associations of words by frequency of occurrence within a document
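The idea of measuring word associations from per-document frequencies can be sketched as follows. This is a simplified stand-in: it compares raw word-by-document count vectors with cosine similarity and omits the singular value decomposition that Latent Semantic Analysis applies to reduce the matrix. The toy documents and function names are assumptions.

```python
import math
from collections import Counter

def word_vectors(docs):
    """Map each word to its vector of per-document occurrence counts."""
    counts = [Counter(d.lower().split()) for d in docs]
    vocab = set().union(*counts)
    return {w: [c[w] for c in counts] for w in vocab}

def cosine(u, v):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical four-document corpus with two topics.
docs = [
    "juice water drink",
    "water drink bottle",
    "engine car wheel",
    "car wheel road",
]
vecs = word_vectors(docs)
sim_related = cosine(vecs["water"], vecs["drink"])   # occur in the same documents
sim_unrelated = cosine(vecs["water"], vecs["car"])   # never occur together
```

Words that occur in the same documents get a similarity near 1, and words that never share a document get 0. The decomposition step in full LSA additionally lets words be associated when they share neighbours without ever co-occurring directly.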

Text Content
Key Problems Addressed:
✔ Measure similarity in word meanings
✔ Classify relations between words
✔ Discover different senses of words
✔ Extract keywords from documents
Major Approaches to Identifying Meaning:
✔ Lexicon based: labor intensive
✔ Statistical semantics: focus on the meanings of common words, and on relations between common words, discovered through algorithms applied to corpora
There is strong interest in incorporating learning mechanisms into the methods being used.

Categorization and correlations of words

Identification of Text Content: Semantic Analysis
FrameNet -


Text Summarization: Traditional Approach
Selection of specific sentences or phrases from the source document
Example: Snippets in search results
- First sentence: assumed to contain basic orientation for the user
- Whole or partial sentences with query phrase keywords
- Shallow domain knowledge (such as standard form) may guide selection and output
Process: Selected relevant segments are extracted
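The snippet example above can be sketched as a small extractive summarizer: keep the first sentence, then add later sentences that contain a query keyword. The function name, the sentence-splitting regular expression and the `max_sentences` limit are assumptions made for illustration.

```python
import re

def snippet(document, query, max_sentences=3):
    """Return the first sentence plus later sentences containing a query keyword."""
    # Split after sentence-ending punctuation followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", document)
    sentences = [s.strip() for s in parts if s.strip()]
    keywords = {w.lower() for w in query.split()}
    # The first sentence is assumed to orient the reader.
    selected = sentences[:1]
    for sentence in sentences[1:]:
        if len(selected) >= max_sentences:
            break
        words = {w.strip(".,;:!?").lower() for w in sentence.split()}
        if words & keywords:
            selected.append(sentence)
    return " ".join(selected)

# Hypothetical source text.
memo = (
    "The committee met on Monday. Lunch was served at noon. "
    "The budget proposal was approved. The next meeting is in March."
)
summary = snippet(memo, "budget")
```

For the query "budget", the summary consists of the opening sentence followed by the one sentence that mentions the budget; nothing is rephrased, which is what distinguishes extraction from the restatement discussed later.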

Text Summarization: Basic Problem
A summary should accurately reflect the source content and expression. Thus for less structured texts, there can be significant variation. Compare:
- Form-based documents, e.g., insurance
vs.
- Free text: office memos, discussion threads
Good summarization of free text is typically human labor intensive. Consider: summarizing discussion threads.
Emphasis is on identifying key content across multiple sentences, paragraphs and documents, rather than having good representations of each individual sentence.

Text Summarization: Example of Domain Knowledge
Co-occurrence of “unidentified assailant” and “terrorist attack” leads to the assumption that the assailant performed the attack.
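A shallow domain rule of this kind can be written as a co-occurrence check. The sketch below hard-codes the single rule from the slide; the function name and the triple it returns are illustrative assumptions.

```python
def infer_attacker(text):
    """Apply one shallow domain rule: link the two cue phrases if both occur."""
    lowered = text.lower()
    if "unidentified assailant" in lowered and "terrorist attack" in lowered:
        # The relation is assumed from co-occurrence alone, with no parsing.
        return ("unidentified assailant", "performed", "terrorist attack")
    return None

report = "Police said an unidentified assailant fled after the terrorist attack."
relation = infer_attacker(report)
no_relation = infer_attacker("The terrorist attack was condemned worldwide.")
```

Because the rule never parses the sentence, it would also fire on a text that explicitly denies the link, which is the price of shallow processing.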

Text Summarization: Challenge
Process:
- Categorization (key classification terms)
- Identification of specific domain(s)
- Patterns (e.g., script: joint venture creation; marketing campaign; steps in argumentation)
- Restatement of content
Therefore, summarization draws on shallow statistical processing, a lexicon, shallow parsing, semantic webs, and natural language generation.
A user profile (query keywords, domain expertise, etc.) may constrain the output.

Thank you for your attention!