TopicTrend: Discover Emerging and Novel Research Topics. By: Jovian Lin


Introduction
- Formulating a research idea is the first step toward success in academia.
- A worthy research idea must be original and innovative.
- To come up with innovative research ideas, researchers have to read many published articles, which is time-consuming.

“Is there any shortcut to success?” “No.” “There are efficient ways to achieve success.”
Search engines in digital libraries:

Introduction
Search engines support information seeking and retrieval: “search query” → search engine → list of titles (of articles).

Search Results

Introduction
Search engines support information seeking and retrieval. However, is this enough for the junior researcher?
Junior researchers (FYP students, first-year PhD students):
- Define a research topic (from zero knowledge).
- Help in survey.
- Identify emerging/new research areas to explore.
- Determine related topics.
How useful is this result to the junior researcher?

Problem Definition
Search engines support information seeking and retrieval.
- Input: “search query” (e.g. machine learning, DNA, polymerase).
- Output: list of titles + other info, ranked by semantic closeness to the “search query”.
However, search engines:
- Cannot help users understand research trends.
- Cannot help users recognize “hot” topics.
- Cannot help users understand how topics interact and influence research activity.

Problem Definition
Junior researchers want to:
- Understand research topics and trends.
- Recognize HOT topics.
- Understand how topics interact and influence research activity.
Current Inefficient Method:
1. Enter a search query.
2. View results.
3. Select a few articles to read.
4. Extract new terms from the selected articles.

Search Results

Information overload!


Problem Definition
Junior researchers want to:
- Understand research topics and trends.
- Recognize HOT topics.
- Understand how topics interact and influence research activity.
Desired Efficient Method:
1. Enter a search query.
2. View results:
   - Visualization of the research topics.
   - List of HOT research topics (related to the search query).
Do it quickly! → TopicTrend

Our Solution
1. Enter a search query.
2. View results:
   - Visualization of the research topics.
   - List of HOT research topics (related to the search query).

Quick Demo

Evaluation
Recruited 4 participants:
- Chemistry / PhD
- Engineering (Transportation) / PhD
- Comp Science (AI) / PhD
- Engineering / FYP
Participants:
- Tested TopicTrend using queries from their respective domains.
- Rated TopicTrend’s output (w.r.t. their query). [Quantitative]
- Filled in a questionnaire. [Qualitative]

Evaluation (example for the query “machine learning”)
[Figure: the ten output topics, Topic A to Topic J, as rated by a participant. Score: 9/10.]

Evaluation Average score = % Quantitative

Evaluation (Qualitative)
Questionnaire using a five-point Likert scale (1 = Disagree, 5 = Agree). Some examples:
- “The system was easy to use.”
- “The system gave interesting results.”
- “I was able to get a better understanding of the topics.”
- “I was able to discover trends.”
- “I was able to discover relationships between topics.”
- “I was able to discover potential, novel topics.”
Details in Project Report.
Scores shown on the slide: 4.75 / 5 and 4 / 5.

Conclusion
TopicTrend is a visualization tool that helps junior researchers:
- Understand research topics and trends.
- Recognize HOT topics.
- Understand how topics interact and influence research activity.
However, results were mediocre, due to the presence of stop phrases (e.g., “problem set”, “proposed model”).
Solutions and Future Work:
- TF-IDF weighting: no need to enter stop words manually. TF-IDF is a statistical measure of how important a word is to a document. The importance increases in proportion to the number of times the word appears in the document, but is offset by the frequency of the word in the corpus.
- Latent Dirichlet Allocation (LDA): view each abstract as a mixture of topics. (David Blei)
- Online LDA: finds topics faster than normal LDA; analyzes documents in a stream.
- Dynamic Topic Models (DTM): capture the word evolution of each topic over time.
- Search by exemplar (instead of search by keyword): benefits users who have difficulty expressing their query.
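The TF-IDF idea listed under future work can be illustrated with a minimal sketch. It assumes one common variant (relative term frequency multiplied by the logarithm of the inverse document frequency), which is not necessarily the weighting TopicTrend would adopt. The function name `tf_idf` and the toy abstracts are hypothetical; each abstract is represented as a list of already extracted phrases.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of term lists. Weight = (count / doc length) * log(N / df),
    where df is the number of documents containing the term.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document

    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus: "proposed model" occurs in every abstract.
abstracts = [
    ["proposed model", "topic model", "proposed model"],
    ["proposed model", "search engine"],
    ["proposed model", "noun phrase"],
]
scores = tf_idf(abstracts)
```

In this toy corpus the stop phrase “proposed model” appears in every abstract, so its inverse document frequency is log(3/3) = 0 and its weight vanishes without any manually maintained stop list, while “topic model” keeps a positive weight.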


Thank You

Backup Slides

Implementation
OpenNLP: a machine-learning-based toolkit for the processing of natural language text. Used OpenNLP to retrieve a list of NPs (noun phrases).
OpenNLP Tools pipeline:
An article →
1. Sentence Detection
2. Tokenization
3. Part-of-Speech (POS) Tagging
4. Chunking and Retrieving NPs
→ NP A, NP B, NP C, NP D, NP E, NP F

Sentence Detection (Implementation)
Input:
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate. Those contraction-less sentences don't have boundary/odd cases...this one does.
Output (detected sentences):
1. Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
2. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
3. Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.
4. Those contraction-less sentences don't have boundary/odd cases...this one does.
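Why sentence detection is not a plain split on periods can be seen in a small rule-based sketch. This is not how OpenNLP works (OpenNLP uses a trained model); the abbreviation list and the function name `split_sentences` are assumptions made only for this example text.

```python
# Hypothetical abbreviation list, covering only the example text.
ABBREVIATIONS = {"Nov.", "Mr.", "N.V."}


def split_sentences(text):
    """Split text at words ending in . ! or ? unless the word is a known abbreviation."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        if word.endswith((".", "!", "?")) and word not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing words without closing punctuation
        sentences.append(" ".join(current))
    return sentences


text = (
    "Pierre Vinken, 61 years old, will join the board as a nonexecutive "
    "director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch "
    "publishing group. Rudolph Agnew, 55 years old and former chairman of "
    "Consolidated Gold Fields PLC, was named a director of this British "
    "industrial conglomerate. Those contraction-less sentences don't have "
    "boundary/odd cases...this one does."
)
sentences = split_sentences(text)
```

The rule breaks as soon as an unlisted abbreviation appears or a sentence genuinely ends with one, which is the kind of boundary case a learned sentence detector is meant to handle.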

Tokenization (Implementation)
Input:
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
Output:
[Pierre] [Vinken] [,] [61] [years] [old] [,] [will] [join] [the] [board] [as] [a] [nonexecutive] [director] [Nov.] [29] [.] [Mr.] [Vinken] [is] [chairman] [of] [Elsevier] [N.V.] [,] [the] [Dutch] [publishing] [group] [.]
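The token sequence above can be reproduced with a simple stand-in tokenizer. This is a sketch under stated assumptions, not OpenNLP's learned tokenizer: it splits on whitespace and peels trailing punctuation off each piece unless the piece is a known abbreviation. The abbreviation list and the name `tokenize` are hypothetical.

```python
# Hypothetical abbreviation list, covering only the slide's example.
ABBREVIATIONS = {"Nov.", "Mr.", "N.V."}
PUNCTUATION = ",.;:!?"


def tokenize(text):
    """Whitespace split, then separate trailing punctuation from each piece."""
    tokens = []
    for piece in text.split():
        trailing = []
        # Strip punctuation from the end until an abbreviation or a bare word remains.
        while piece and piece not in ABBREVIATIONS and piece[-1] in PUNCTUATION:
            trailing.append(piece[-1])
            piece = piece[:-1]
        if piece:
            tokens.append(piece)
        tokens.extend(reversed(trailing))  # restore original punctuation order
    return tokens


text = (
    "Pierre Vinken, 61 years old, will join the board as a nonexecutive "
    "director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch "
    "publishing group."
)
tokens = tokenize(text)
```

For this input the sketch yields 31 tokens, keeping “Nov.”, “Mr.” and “N.V.” whole while splitting the commas and sentence-final periods into their own tokens.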

Part-of-Speech Tagging (Implementation)
Input:
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
Output (one tag per token):
[NNP] [NNP] [,] [CD] [NNS] [JJ] [,] [MD] [VB] [DT] [NN] [IN] [DT] [JJ] [NN] [NNP] [CD] [.] [NNP] [NNP] [VBZ] [NN] [IN] [NNP] [NNP] [,] [DT] [JJ] [NN] [NN] [.]

Text Chunking and Extracting NPs (Implementation)
Text chunking divides a text into syntactically correlated groups of words. It uses the tokenization and POS tagging data.
For example:
He reckons the current account deficit will narrow to only # 1.8 billion in September.
Becomes:
[NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only # 1.8 billion ] [PP in ] [NP September ].
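Once the chunker has produced bracketed output like the example above, retrieving the NPs is a matter of keeping the chunks labelled NP. A minimal sketch, assuming the bracket notation shown on the slide; the pattern and the function name `noun_phrases` are illustrative, not part of OpenNLP.

```python
import re

# Matches "[LABEL words ... ]" and captures the label and the words.
CHUNK_PATTERN = re.compile(r"\[(\w+)\s+([^\]]*)\]")


def noun_phrases(chunked):
    """Return the text of every chunk labelled NP, in order of appearance."""
    return [
        words.strip()
        for label, words in CHUNK_PATTERN.findall(chunked)
        if label == "NP"
    ]


chunked = (
    "[NP He ] [VP reckons ] [NP the current account deficit ] "
    "[VP will narrow ] [PP to ] [NP only # 1.8 billion ] [PP in ] "
    "[NP September ]."
)
nps = noun_phrases(chunked)
```

For the example sentence this keeps “He”, “the current account deficit”, “only # 1.8 billion” and “September”, and discards the VP and PP chunks.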

Text Chunking and Extracting NPs (Implementation)
Text chunking divides a text into syntactically correlated groups of words. It uses the tokenization and POS tagging data.
Note the B-Chunk and I-Chunk tags.
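In the usual chunk-tag convention, a B- tag marks the first token of a chunk and an I- tag marks a token that continues it. The sketch below groups such tags back into noun phrases. The tag sequence is written out by hand to match the bracketed example (“He reckons the current account deficit ...”), and the function name `np_chunks` is hypothetical.

```python
def np_chunks(tokens, tags):
    """Group tokens tagged B-NP / I-NP into noun-phrase strings."""
    phrases, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-NP":
            if current:  # a new NP starts right after another one
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I-NP" and current:
            current.append(token)
        else:
            # Any other tag (or a stray I-NP with no open NP) closes the current NP.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


tokens = ["He", "reckons", "the", "current", "account", "deficit", "will",
          "narrow", "to", "only", "#", "1.8", "billion", "in", "September", "."]
tags = ["B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "I-NP", "B-VP",
        "I-VP", "B-PP", "B-NP", "I-NP", "I-NP", "I-NP", "B-PP", "B-NP", "O"]
phrases = np_chunks(tokens, tags)
```

The B-/I- distinction matters when two noun phrases are adjacent: without the B- tag there would be no way to tell where the first one ends and the second begins.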


Implementation: an algorithm to calculate the score of an NP.
The score of each NP (NP A to NP F) is computed from three counts by age: # (0 ~ 2 years), # (2 ~ 4 years), # (4 yrs & beyond).
[Score formula and example counts not recoverable from the slide; the worked example ends with Score = 3 / 33 ≈ 0.09.]

Implementation: the scoring algorithm is applied to every NP in the list (NP A, NP B, NP C, NP D, NP E, NP F).

Implementation: re-rank the list of NPs based on the score.
Original list: NP A, NP B, NP C, NP D, NP E, NP F.
Re-ranked list: NP B, NP D, NP E, NP C, NP A, NP F. (The slide also carries a “New!” label.)
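The re-ranking step amounts to sorting the NPs by score in descending order. The scores in this sketch are invented purely so that the result matches the order shown on the slide (NP B, NP D, NP E, NP C, NP A, NP F); they are not values from the project.

```python
# Hypothetical scores, chosen only to reproduce the slide's re-ranked order.
scores = {
    "NP A": 0.05,
    "NP B": 0.40,
    "NP C": 0.09,
    "NP D": 0.30,
    "NP E": 0.20,
    "NP F": 0.01,
}

# Highest score first; sorted() is stable, so ties keep their original order.
ranked = sorted(scores, key=scores.get, reverse=True)
```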

Implementation: calculate the relationship strength between NPs by considering the articles (PIIs) that they have in common. The more articles two NPs share, the thicker the edge between them.
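One way to read this step is as a set intersection: each NP maps to the set of article identifiers (PIIs) in which it occurs, and the edge weight between two NPs is the size of the overlap. The sketch below assumes exactly that; the identifiers, the function name `edge_weights` and the choice to omit zero-weight edges are all hypothetical.

```python
from itertools import combinations


def edge_weights(articles_by_np):
    """Map each NP pair to the number of articles the two NPs share."""
    weights = {}
    for a, b in combinations(sorted(articles_by_np), 2):
        shared = len(articles_by_np[a] & articles_by_np[b])
        if shared:  # no edge is drawn between NPs with nothing in common
            weights[(a, b)] = shared
    return weights


# Hypothetical data: NP -> set of article identifiers (PIIs) containing it.
articles_by_np = {
    "NP A": {"pii-1", "pii-2", "pii-3"},
    "NP B": {"pii-2", "pii-3"},
    "NP C": {"pii-3", "pii-4"},
    "NP D": {"pii-5"},
}
weights = edge_weights(articles_by_np)
```

A visualization could then scale the stroke width of each edge by its weight, so that NP A and NP B (two shared articles) are joined by a thicker line than NP A and NP C (one shared article), while NP D stays unconnected.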

The End