Citation-Based Retrieval for Scholarly Publications. Advisor: 郭建明. Student: 蘇文正 (M9490216).



Content
- Introduction
- The scholarly publication retrieval system
- The citation database
- Document clustering using KSOM
- Adapting KSOM for retrieving scholarly publications
- Training process
- Retrieval process
- Performance evaluation
- Conclusion
- Q&A

Introduction
Scholarly publications are available online and in digital libraries, but existing search engines are mostly ineffective at finding relevant publications. Researchers have developed autonomous citation indexing agents, such as CiteSeer. These agents extract citation information from the literature and store it in a database.

Introduction (cont.1)
Citation databases contain rich information that you can mine to retrieve publications. Our intelligent retrieval technique for scholarly publications stored in a citation database uses Kohonen's self-organizing map (KSOM). The technique consists of two processes:
- Training process: generates cluster information.
- Retrieval process: ranks publications' relevance on the basis of user queries.

The scholarly publication retrieval system

The scholarly publication retrieval system (cont. 1)
Citation indexing agent
- Uses search engines to locate Web sites containing publication keywords.
- Downloads the publications from the Web sites and converts them from PDF or PostScript format to text using the pstotext tool.
- Identifies each Web publication's bibliographic section through keywords such as "bibliography" or "references", then extracts citation information from that section and stores it in the citation database.
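The bibliography-detection step can be illustrated with a minimal sketch. It assumes the publication has already been converted to plain text. The function name `extract_citations`, the heading pattern, and the "[1]" / "1." entry numbering are assumptions made for illustration; the slides give only the two keywords and do not describe the agent's actual parsing rules.

```python
import re

# The slides name only "bibliography" and "references" as heading keywords.
HEADING = re.compile(r"^\s*(references|bibliography)\s*$",
                     re.IGNORECASE | re.MULTILINE)

def extract_citations(text):
    """Return the raw citation strings found after the last bibliography heading."""
    matches = list(HEADING.finditer(text))
    if not matches:
        return []
    section = text[matches[-1].end():]
    # Assumption: entries start a line with "[1]" or "1."; real papers vary widely.
    entries = re.split(r"^\s*(?:\[\d+\]|\d+\.)\s+", section, flags=re.MULTILINE)
    # Collapse line breaks inside an entry and drop empty fragments.
    return [" ".join(entry.split()) for entry in entries if entry.strip()]
```

A real agent would need to handle unnumbered reference lists and headings embedded in running text, which this sketch ignores.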

The scholarly publication retrieval system (cont. 2)
Indexing clients
- Let users specify publication Web sites.
- Let users determine how often the citation indexing agent visits these sites.
Intelligent retrieval agent
- Mines the citation database to identify hidden relationships and explore useful knowledge to improve the efficiency and effectiveness of scholarly literature retrieval.
Retrieval client
- Provides the user interface for inputting queries, which are passed to the intelligent retrieval agent for further processing.

The citation database
A citation database is a data warehouse for storing citation indices. You can use citation indices to identify existing research fields or newly emerging areas, analyze research trends, assess scholarly impact, and avoid duplicating previously reported work. For our experiment, we set up a test citation database by downloading papers published from 1987 to 1997 that were stored in the information retrieval field of ISI's Social Sciences Citation Index, which includes all the journals on library and information science. We selected 1,466 IR-related papers from 367 journals with 44,836 citations.

The citation database (cont. 1)

The citation database (cont. 2)
Figure 2 shows the citation database's structure:
- Source table: stores information from the source papers.
- Citation table: stores all the citations extracted from the source papers.
The primary keys in the two tables are paper_ID in the source table and citation_ID in the citation table. In the source table, no_of_citation indicates the number of references the source paper contains. In the citation table, source_ID links to paper_ID in the source table to identify the source paper that cites the particular publication stored in that citation row. If two different source papers cite the same publication, it is stored in the citation table twice, with two different citation_IDs.
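The two-table layout can be sketched with Python's built-in sqlite3 module. Only paper_ID, no_of_citation, citation_ID, and source_ID come from the slide; the `title` and `cited_title` columns and the sample rows are hypothetical, since Figure 2 is not reproduced in this transcript.

```python
import sqlite3

def build_citation_db():
    """Create an in-memory database with the source and citation tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE source (
            paper_ID       INTEGER PRIMARY KEY,
            title          TEXT,     -- hypothetical column
            no_of_citation INTEGER   -- number of references in the source paper
        );
        CREATE TABLE citation (
            citation_ID INTEGER PRIMARY KEY,
            source_ID   INTEGER REFERENCES source(paper_ID),
            cited_title TEXT         -- hypothetical column
        );
    """)
    return conn

conn = build_citation_db()
conn.executemany("INSERT INTO source VALUES (?, ?, ?)",
                 [(1, "Paper A", 1), (2, "Paper B", 1)])
# Two source papers cite the same publication, so it is stored twice.
conn.executemany("INSERT INTO citation (source_ID, cited_title) VALUES (?, ?)",
                 [(1, "Self-Organizing Maps"), (2, "Self-Organizing Maps")])
rows = conn.execute(
    "SELECT citation_ID, source_ID FROM citation "
    "WHERE cited_title = ? ORDER BY citation_ID",
    ("Self-Organizing Maps",),
).fetchall()
```

Here `rows` holds one entry per citing paper, each with its own citation_ID, which mirrors the duplication rule stated above.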

Document clustering using KSOM
KSOM
- Is a neural network model used for document clustering.
- Orders high-dimensional statistical data so that similar input items are mapped close to each other.
- Lets users display a colorful map of topic concentrations that they can further explore by drilling down to browse a specific topic.
- AuthorLink, which provides a visual display of related authors, demonstrates the benefits of using KSOM to generate a visualization interface for co-citation mapping.
- Performs a nonlinear projection of patterns of arbitrary dimensionality onto a one- or two-dimensional array of neurons. The neighborhood refers to the best-matching neuron's spatial neighbors, which are also updated to react to the same input.
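A single self-organizing map update can be sketched as follows. This is a generic textbook-style step, not the configuration used in the presentation: the Gaussian neighborhood function, the learning rate, and the names `best_matching_unit` and `train_step` are assumptions made for illustration.

```python
import math

def best_matching_unit(weights, x):
    """Return (row, col) of the neuron whose weight vector is closest to x."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            dist = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best

def train_step(weights, x, learning_rate, radius):
    """Move the best-matching neuron and its grid neighbors toward input x."""
    br, bc = best_matching_unit(weights, x)
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            grid_dist_sq = (r - br) ** 2 + (c - bc) ** 2
            # Gaussian neighborhood (an assumption): influence decays with grid distance.
            influence = math.exp(-grid_dist_sq / (2 * radius ** 2))
            for i in range(len(w)):
                w[i] += learning_rate * influence * (x[i] - w[i])
    return br, bc
```

In a full training loop, the learning rate and radius would typically shrink over the iterations so that the map settles; that schedule is omitted here.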

Adapting KSOM for retrieving scholarly publications
Citation information lets us judge document relevance because authors cite articles that are related. CCIDF (common citation × inverse document frequency) assigns a weight to each citation that equals the inverse of the citation's frequency in the entire database.
- For every two documents, the weights of their common citations are summed.
- The resulting value indicates how related the two documents are: the higher the summed value, the stronger the relationship.
However, we find that this method is not very practical.
- Identifying citations to the same article is difficult because they can be formatted differently in different source papers.
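The CCIDF score described above can be sketched in a few lines. The sketch assumes citations have already been normalized to shared identifiers, which is exactly the step the slide calls difficult, and it interprets "frequency in the entire database" as the number of documents that cite the item. The function name `ccidf` and the sample data are hypothetical.

```python
from collections import Counter

def ccidf(doc_citations, a, b):
    """Sum inverse-frequency weights over the citations shared by documents a and b."""
    # Count how many documents cite each item (assumed meaning of "frequency").
    freq = Counter(c for cites in doc_citations.values() for c in set(cites))
    common = set(doc_citations[a]) & set(doc_citations[b])
    return sum(1.0 / freq[c] for c in common)

docs = {
    "d1": {"c1", "c2"},
    "d2": {"c1", "c2", "c3"},
    "d3": {"c1"},
}
score_d1_d2 = ccidf(docs, "d1", "d2")  # shares c1 (cited by 3) and c2 (cited by 2)
score_d1_d3 = ccidf(docs, "d1", "d3")  # shares only c1
```

A rarely cited common reference contributes more to the score than a widely cited one, which is the intended effect of the inverse-frequency weight.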

Adapting KSOM for retrieving scholarly publications (cont.1)
If two documents share the same citation, they must also share the same keywords. So instead of extracting keywords from the document itself, we extract keywords from its citations. Each extracted keyword forms an element of a document vector. If d denotes the document vector, then each keyword is denoted by d_i, where i is between 1 and N, and N is the total number of distinct keywords. For each document, we extract the 20 most frequently occurring keywords from its citations as the feature factors. (We determined experimentally that using 20 keywords gives the best result.) We can then adopt the TFIDF (term frequency × inverse document frequency) method to represent the document vector. After solving the document representation problem, we use KSOM to categorize documents in the citation database.
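A minimal sketch of this representation step follows. The slides do not give the exact TFIDF formula or the keyword preprocessing, so the weight tf × log(N_docs / df), the tiny stopword list, and the names `top_keywords` and `tfidf_vectors` are assumptions. The vectors are stored as sparse dictionaries, which is equivalent to a dense vector of length N with zeros for absent keywords.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}

def top_keywords(citations, k=20):
    """Return the k most frequent keywords in a document's citation strings."""
    words = [w for c in citations
             for w in re.findall(r"[a-z]+", c.lower())
             if w not in STOPWORDS]
    return Counter(words).most_common(k)

def tfidf_vectors(docs_citations, k=20):
    """Map each document to {keyword: tf * log(n_docs / df)} over its top-k keywords."""
    tf = {doc: dict(top_keywords(cites, k)) for doc, cites in docs_citations.items()}
    df = Counter(word for counts in tf.values() for word in counts)
    n_docs = len(tf)
    return {doc: {w: f * math.log(n_docs / df[w]) for w, f in counts.items()}
            for doc, counts in tf.items()}
```

With this weighting, a keyword that appears in every document's citations gets weight zero, so it does not influence clustering.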

Adapting KSOM for retrieving scholarly publications (cont.2)

Adapting KSOM for retrieving scholarly publications (cont.3)
Figure 3 shows the citation-based retrieval technique using KSOM. The citation indexing agent generates the citation database. The KSOM retrieval technique consists of two processes:
- Training process: mines the citation database to generate cluster information.
- Retrieval process: retrieves and ranks the publications according to user queries submitted through the retrieval client.

Training process

Training process (cont.1)

Training process (cont.2)

Training process (cont.3)

Retrieval process
A user first submits a query as free-text natural language. The system then preprocesses, parses, and encodes the query in a way similar to the keyword preprocessing and encoding in the training process. We feed these query vectors into the KSOM network to determine which clusters should be activated. Once the best cluster is found, the ranking process computes the query vector's Euclidean distance from every document in the cluster. Documents with a lower Euclidean distance to the query are semantically and conceptually closer to it. Given a document vector d and a query vector q, their similarity function, sim(d,q), is

sim(d, q) = \sqrt{\sum_{i=1}^{N} (w_{di} - w_{qi})^2}

where w_{di} and w_{qi} are the weights of the ith element in the document vector d and the query vector q, respectively.
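The two retrieval steps, cluster selection followed by distance-based ranking, can be sketched as follows. The sketch assumes the query has already been encoded as a vector of the same length as the document vectors and that each cluster is represented by one weight vector. The names `retrieve`, `cluster_centers`, and `cluster_docs` are hypothetical, and the 20-result cutoff is only a default for the example.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def retrieve(query, cluster_centers, cluster_docs, top_n=20):
    """Pick the cluster closest to the query, then rank its documents by distance."""
    best = min(range(len(cluster_centers)),
               key=lambda i: euclidean(cluster_centers[i], query))
    # Smaller distance means the document is closer to the query.
    ranked = sorted(cluster_docs[best].items(),
                    key=lambda item: euclidean(item[1], query))
    return [doc_id for doc_id, _ in ranked[:top_n]]

centers = [[0.0, 0.0], [1.0, 1.0]]
docs_by_cluster = [
    {"a": [0.1, 0.0]},
    {"b": [0.9, 1.0], "c": [1.0, 0.6]},
]
result = retrieve([1.0, 0.9], centers, docs_by_cluster)
```

Because only one cluster is searched, the ranking cost depends on the cluster size rather than on the size of the whole citation database.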

Retrieval process (cont.1)

Retrieval process (cont.2)

Performance evaluation
We evaluated our citation-based retrieval technique's performance on the basis of its training performance, retrieval speed, and retrieval precision. The experiments were carried out on a Pentium II 450-MHz machine with 128 Mbytes of RAM. During the experiments, we used the following test data:
- The number of keywords in the keyword list was 1,614.
- The number of words to be searched in the WordNet dictionary was 121,962.
- The number of clusters was predefined at 100.
- The total number of documents used for training (the training set) was 1,000.

Performance evaluation (cont. 1)
Training performance
- The number of iterations the neural network requires to reach the convergent state.
- The total training time.
Retrieval performance
- The average online retrieval speed.
- The retrieval precision, measured with system-based relevance measures and user-based relevance measures.
We conducted the experiment as follows.
- We submitted 50 queries to the system and measured the average online retrieval speed as the time between query submission and the moment the user received the results.
- For each query result, we examined only the first 20 publications to check their relevance to the query.
- We obtained the user-based relevance measure from an average of 10 users.
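The precision check over the first 20 results can be sketched as below. The slides do not state the formula, so this assumes the standard definition, the fraction of examined results judged relevant, with the number of results actually examined as the denominator. The names `precision_at_k` and `mean` are hypothetical.

```python
def precision_at_k(ranked_ids, relevant_ids, k=20):
    """Fraction of the first k results that were judged relevant."""
    top = ranked_ids[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant_ids) / len(top)

def mean(values):
    """Arithmetic mean, e.g. over per-query precision or retrieval times."""
    return sum(values) / len(values)
```

Averaging `precision_at_k` over all 50 queries, and then over the judging users for the user-based measure, would give figures comparable in form to those the evaluation describes.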

Performance evaluation (cont. 2)

Performance evaluation (cont. 3)

Conclusion
Although we used Web publications as a case study for analysis, our proposed method is generic enough to retrieve relevant publications from other repositories, such as digital libraries. Apart from the proposed citation-based document retrieval technique, we are investigating other techniques, such as co-citation analysis, to support document clustering and author clustering in the publication retrieval system. Additionally, we are using KSOM to explore techniques that can enhance the visual representation of cluster maps.

Q&A Questions & Answers