A Text Filtering Method For Digital Libraries
Mustafa Zafer BOLAT, Hayri SEVER
Baskent University, Computer Engineering Department

Presentation transcript:

Slide 1. A Text Filtering Method For Digital Libraries
Mustafa Zafer BOLAT, Hayri SEVER
Baskent University, Computer Engineering Department

Slide 2. Introduction
Information filtering (IF)
- Incoming non-relevant documents are filtered out.
Information retrieval (IR)
- Provides a list of documents ordered by their similarity to the user query.

Slide 3. Introduction (continued)
- Linear separation: partitions relevant and non-relevant documents into distinct blocks.
- Optimal queries: rank all relevant documents ahead of the non-relevant ones.
- Steepest Descent Algorithm (SDA)

Slide 4. Preliminaries
An information retrieval system S can be defined as a 5-tuple $S = (T, D, Q, V, f)$:
- $T$: set of ordered index terms
- $D$: set of documents
- $Q$: set of queries
- $V$: set of real numbers
- $f: D \times Q \to V$: retrieval function

Slide 5. Preliminaries (continued)
Vector Space Model
- Transforms raw text into a more computationally useful form.
- Documents and queries are represented as vectors of weighted terms:
$d = (t_1, w_{d1}; t_2, w_{d2}; \ldots; t_n, w_{dn}), \quad t_i \in T \cap d$
$q = (q_1, w_{q1}; q_2, w_{q2}; \ldots; q_m, w_{qm}), \quad q_i \in T \cap q$
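As a rough illustration of the vector space model, documents and queries can be held as sparse term-to-weight mappings. The sketch below assumes that the retrieval function f is a plain inner product, which the slides do not state explicitly; the name `retrieval_value` and the sample weights are hypothetical.

```python
from typing import Dict

# A weighted-term vector: index term -> weight (zero-weight terms are omitted).
Vector = Dict[str, float]

def retrieval_value(d: Vector, q: Vector) -> float:
    """Inner product of a document and a query (one possible choice of f: D x Q -> V)."""
    # Iterate over the shorter vector; only terms present in both contribute.
    small, large = (d, q) if len(d) <= len(q) else (q, d)
    return sum(w * large[t] for t, w in small.items() if t in large)

# Hypothetical weights, for illustration only.
doc = {"meat": 5.0, "group": 1.0, "standard": 1.0}
query = {"meat": 2.0, "trade": 1.0}
score = retrieval_value(doc, query)
```

Only the shared term "meat" contributes to the score here, so documents that share no terms with the query receive a value of zero.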

Slide 6. Preliminaries (continued)
Rnorm value for effectiveness
- It measures how relevant documents are distributed relative to non-relevant ones in the ranking, so rank matters.

Slide 7. Preliminaries (continued)
Rnorm value for effectiveness
- It measures how relevant documents are distributed relative to non-relevant ones in the ranking, so rank matters.
- $S^+$: number of document pairs where the preferred document is ranked higher
- $S^-$: number of document pairs where the non-preferred document is ranked higher
- $S^+_{max}$: maximal possible value of $S^+$
Example: $\Delta = (rnrn \mid rnnnnn)$ gives $S^+ = 10$, $S^- = 2$, $S^+_{max} = 21$
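The counts in the example can be reproduced with a short sketch. Two things are assumed here because the transcript does not show them: the bar in the ranking separates rank levels (documents within a level are tied and form no counted pairs), and Rnorm is the commonly used $R_{norm} = \frac{1}{2}\left(1 + \frac{S^+ - S^-}{S^+_{max}}\right)$. The function names are hypothetical.

```python
from typing import List, Tuple

def pair_counts(levels: List[str]) -> Tuple[int, int, int]:
    """Return (S+, S-, S+max) for a ranking given as rank levels, best first.

    Each level is a string of 'r' (relevant) and 'n' (non-relevant) documents.
    Documents inside the same level are treated as tied and are not counted.
    """
    rel = [lvl.count("r") for lvl in levels]
    non = [lvl.count("n") for lvl in levels]
    s_plus = s_minus = 0
    for i in range(len(levels)):
        for j in range(i + 1, len(levels)):
            s_plus += rel[i] * non[j]    # relevant ranked above non-relevant
            s_minus += non[i] * rel[j]   # non-relevant ranked above relevant
    return s_plus, s_minus, sum(rel) * sum(non)

def rnorm(levels: List[str]) -> float:
    """Assumed definition: 0.5 * (1 + (S+ - S-) / S+max)."""
    s_plus, s_minus, s_max = pair_counts(levels)
    if s_max == 0:
        return 1.0  # convention chosen here when no relevant/non-relevant pair exists
    return 0.5 * (1 + (s_plus - s_minus) / s_max)

counts = pair_counts(["rnrn", "rnnnnn"])
value = rnorm(["rnrn", "rnnnnn"])
```

Under these assumptions the example ranking yields the slide's counts (10, 2, 21), and a ranking that places every relevant document above every non-relevant one reaches the maximum value of 1.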

Slide 8. Preliminaries (continued)
Contingency table

                          actual relevant   actual non-relevant
predicted relevant               a                   b
predicted non-relevant           c                   d

Precision = a / (a + b)
Recall = a / (a + c)
Breakeven point: the point where precision and recall are equal.
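The two measures follow directly from the contingency table cells. The sketch below uses the slide's cell names a, b, c; returning 0.0 for an empty denominator is a convention chosen here, and the sample counts are hypothetical.

```python
def precision(a: int, b: int) -> float:
    """Fraction of documents predicted relevant that are actually relevant."""
    return a / (a + b) if (a + b) else 0.0

def recall(a: int, c: int) -> float:
    """Fraction of actually relevant documents that were predicted relevant."""
    return a / (a + c) if (a + c) else 0.0

# Hypothetical counts: 40 correct hits, 10 false alarms, 20 misses.
p = precision(40, 10)
r = recall(40, 20)
```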

Slide 9. Overview of experiment
[Diagram: Reuters data set with topics, split into train and test sets -> preprocessing (parsing, removing stop words, stemming, transform to vectors, reducing, normalizing) -> training with SDA -> optimal query -> effectiveness measures]

Slide 10. Overview of experiment: Reuters data set
- Consists of economic news stories that originally appeared on the Reuters newswire in 1987.
- Each story has been manually assigned one or more indexing labels from a fixed list.
- There are 135 TOPIC labels for classification.
- To use a text corpus for machine learning research, it is split into sets of training and testing examples.

Slide 11. Overview of experiment: sample Reuters document
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="9944" NEWID="5031">
13-MAR :45:35.38
livestock carcass
usa
ec
U.S. MEAT GROUP TO FILE TRADE COMPLAINTS
WASHINGTON, March 13 - The American Meat Institute, AME, said it intended to ask the U.S. government to retaliate against a European Community meat inspection requirement. AME President C. Manly Molpus also said the industry would file a petition challenging Korea's ban of U.S. meat products. Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section 301 of the General Agreement on Tariffs and Trade against an EC directive that, effective April 30, will require U.S. meat processing plants to comply fully with EC standards. Reuter

Slide 12. Overview of experiment: preprocessing, parsing
HAS TOPICS=YES LEWISSPLIT=TRAIN TOPICS:livestock,carcass
Body: U.S. MEAT GROUP TO FILE TRADE COMPLAINTS The American Meat Institute, AME, said it intended to ask the U.S. government to retaliate against a European Community meat inspection requirement. AME President C. Manly Molpus also said the industry would file a petition challenging Korea's ban of U.S. meat products. Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section 301 of the General Agreement on Tariffs and Trade against an EC directive that, effective April 30, will require U.S. meat processing plants to comply fully with EC standards

Slide 13. Overview of experiment: preprocessing, after parsing
HAS TOPICS=YES LEWISSPLIT=TRAIN TOPICS:livestock,carcass
Body: U S MEAT GROUP TO FILE TRADE COMPLAINTS The American Meat Institute AME said it intended to ask the U S government to retaliate against a European Community meat inspection requirement AME President C Manly Molpus also said the industry would file a petition challenging Korea's ban of U S meat products Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section of the General Agreement on Tariffs and Trade against an EC directive that effective April will require U S meat processing plants to comply fully with EC standards

Slide 14. Overview of experiment: preprocessing, removing stop words
HAS TOPICS=YES LEWISSPLIT=TRAIN TOPICS:livestock,carcass
Body: U.S. MEAT GROUP FILE TRADE COMPLAINTS The American Meat Institute, AME, said it intended to ask the U.S. government to retaliate against a European Community meat inspection requirement. AME President C. Manly Molpus also said the industry would file a petition challenging Korea's ban of U.S. meat products. Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section 301 of the General Agreement on Tariffs and Trade against an EC directive that, effective April 30, will require U.S. meat processing plants to comply fully with EC standards

Slide 15. Overview of experiment: preprocessing, after removing stop words
HAS TOPICS=YES LEWISSPLIT=TRAIN TOPICS:livestock,carcass
Body:. MEAT GROUP FILE TRADE COMPLAINTS American Meat Institute AME intended ask government retaliate European Community meat inspection requirement. AME President Manly Molpus industry file petition challenging Korea's ban U.S. meat products Molpus Senate Agriculture subcommittee AME livestock farm groups intended file petition Section General Agreement Tariffs Trade EC directive effective April require meat processing plants comply fully EC standards
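The parsing and stop-word steps can be approximated as follows. This is a minimal sketch: the regular expression and the small `STOP_WORDS` set are assumptions made for illustration, not the tokenizer or stop list used in the experiment.

```python
import re
from typing import List

# Tiny illustrative stop list (hypothetical; a real list is much longer).
STOP_WORDS = {
    "the", "to", "it", "said", "a", "an", "of", "and", "that",
    "with", "against", "also", "would", "will", "under", "other", "told",
}

def tokenize(text: str) -> List[str]:
    """Keep runs of letters and apostrophes; punctuation and digits are dropped."""
    return re.findall(r"[A-Za-z']+", text)

def remove_stop_words(tokens: List[str]) -> List[str]:
    """Drop tokens whose lowercase form is in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

sentence = "The American Meat Institute said it intended to ask the government"
kept = remove_stop_words(tokenize(sentence))
```

For this fragment the surviving tokens are "American Meat Institute intended ask government", in line with the corresponding words on the slide.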

Slide 16. Overview of experiment: preprocessing, stemming
HAS TOPICS=YES LEWISSPLIT=TRAIN TOPICS:livestock,carcass
Body: MEAT GROUP FILE TRADE COMPLAINT American Meat Institute AME intend ask government retaliate European Community meat inspection require. AME President Manly Molpus industry file petition challeng Korea ban meat product Molpus Senate Agriculture subcommittee AME livestock farm group intended file petition Section General Agreement Tariff Trade EC direct effect April require meat process plant compli fulli EC standard

Slide 17. Overview of experiment: preprocessing, transform to vectors
HAS TOPICS=YES LEWISSPLIT=TRAIN TOPICS:livestock,carcass
meat 5
group 1
...
Molpus 1
...
standard 1
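Turning a token list into a term-frequency vector such as "meat 5, group 1" is a counting step. The sketch below uses `collections.Counter`; folding case before counting is an assumption (it is what makes MEAT, Meat and meat count as one term), and `to_term_vector` is a hypothetical name.

```python
from collections import Counter
from typing import Iterable

def to_term_vector(tokens: Iterable[str]) -> Counter:
    """Count how often each (lowercased) term occurs in a document."""
    return Counter(t.lower() for t in tokens)

vector = to_term_vector(["MEAT", "group", "Meat", "meat", "standard"])
```

A `Counter` returns 0 for terms that never occur, which matches the sparse-vector view in which absent terms have zero weight.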

Slide 18. Overview of experiment: preprocessing, create dictionary (only in training)
approv 1236 chairman ptd 5

Slide 19. Overview of experiment: preprocessing, reducing
HAS TOPICS=YES LEWISSPLIT=TRAIN TOPICS:livestock,carcass
...
group 1
meat 5
Molpus
...
standard 1
...

Slide 20. Overview of experiment: preprocessing, after reducing
HAS TOPICS=YES LEWISSPLIT=TRAIN TOPICS:livestock,carcass
...
group 1
meat 5
...
standard 1
...

Slide 21. Overview of experiment: preprocessing, normalizing
HAS TOPICS=YES LEWISSPLIT=TRAIN TOPICS:livestock,carcass
...
group 1
meat 5
...
standard 1
...
$w_k = t_k \times \log(N_D / n_k)$
- $t_k$: term frequency
- $N_D$: number of documents in the collection
- $n_k$: number of documents containing term k
The slide also defines the normalized weight of term k in terms of the unnormalized weight of term k; that formula is missing from the transcript.

Slide 22. Overview of experiment: preprocessing, after normalizing
HAS TOPICS=YES LEWISSPLIT=TRAIN TOPICS:livestock,carcass
...
group
meat
...
standard
...
(The normalized weight values are missing from the transcript.)
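The weighting step $w_k = t_k \times \log(N_D / n_k)$ can be sketched as below. The normalization formula did not survive in the transcript, so the sketch assumes cosine (unit-length) normalization, a common choice that may differ from what the authors used; the natural logarithm is also an assumption, and the document frequencies are hypothetical.

```python
import math
from typing import Dict

def tf_idf(tf: Dict[str, int], n_docs: int, df: Dict[str, int]) -> Dict[str, float]:
    """Unnormalized weights: w_k = t_k * log(N_D / n_k)."""
    return {term: freq * math.log(n_docs / df[term]) for term, freq in tf.items()}

def cosine_normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Assumed normalization: divide each weight by the vector's Euclidean length."""
    length = math.sqrt(sum(w * w for w in weights.values()))
    if length == 0:
        return dict(weights)
    return {term: w / length for term, w in weights.items()}

# Hypothetical collection statistics, for illustration only.
tf = {"meat": 5, "group": 1, "standard": 1}
df = {"meat": 10, "group": 100, "standard": 50}
weights = tf_idf(tf, n_docs=1000, df=df)
normalized = cosine_normalize(weights)
```

With unit-length vectors, long documents do not score higher merely because they contain more term occurrences.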

Slide 23. Overview of experiment: training with SDA
1. Choose a starting query vector $Q_0$; let $k = 0$.
2. Let $Q_k$ be the query vector at the start of the (k+1)th iteration; identify the following set of difference vectors:
   $\Gamma(Q_k) = \{\, b = d - d' : d \succ d' \text{ and } f(Q_k, b) \le 0 \,\}$
   If $\Gamma(Q_k) = \emptyset$, $Q_{opt} = Q_k$ is a solution and exit; otherwise,
3. Let $Q_{k+1} = Q_k + \ldots$ (the update term is missing from the transcript).
4. $k = k + 1$; go back to Step 2.
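The iteration can be sketched as follows. Several points are assumptions, since the transcript does not preserve them: f(Q, b) is taken to be the inner product, every relevant document is preferred to every non-relevant one, and the update in Step 3 adds the sum of all difference vectors in $\Gamma(Q_k)$ to the query. All function names are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = Dict[str, float]

def dot(u: Vector, v: Vector) -> float:
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def add(u: Vector, v: Vector) -> Vector:
    return {t: u.get(t, 0.0) + v.get(t, 0.0) for t in set(u) | set(v)}

def sub(u: Vector, v: Vector) -> Vector:
    return {t: u.get(t, 0.0) - v.get(t, 0.0) for t in set(u) | set(v)}

def sda(relevant: List[Vector], non_relevant: List[Vector],
        q0: Vector, max_iter: int = 150) -> Tuple[Vector, bool]:
    """Return (query, converged) after at most max_iter update steps."""
    # Difference vectors b = d - d' for every preferred pair (d relevant, d' not).
    diffs = [sub(d, d2) for d in relevant for d2 in non_relevant]
    q = dict(q0)
    for _ in range(max_iter):
        # Gamma(Q_k): pairs the current query fails to order correctly.
        gamma = [b for b in diffs if dot(q, b) <= 0]
        if not gamma:
            return q, True
        # Assumed update: Q_{k+1} = Q_k + sum of the vectors in Gamma(Q_k).
        for b in gamma:
            q = add(q, b)
    return q, all(dot(q, b) > 0 for b in diffs)

rel_docs = [{"meat": 1.0}]
non_docs = [{"oil": 1.0}]
q_opt, converged = sda(rel_docs, non_docs, q0={})
```

When the loop ends without convergence, this sketch simply returns the last query; keeping the best query seen so far would need an extra effectiveness check per iteration.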

Slide 24. Overview of experiment: training
- All examples of the category are used as positive examples.
- A random 60% of the examples from other topics are used as negative examples.
- If the maximum Rnorm value (1) is not reached within the maximum of 150 iterations, the optimal query is set to the query that produced the highest Rnorm value.

Slide 25. Overview of experiment: topics
There are 135 topics. Number of positive examples for the ten largest topics:

Topic      # of + (train)   # of + (test)
earn       2877             1087
acq        1650             719
moneyfx    538              179
grain      433              149
crude      389              189
trade      369              118
interest   347              131
wheat      212              71
ship       197              89
corn       182              56

Slide 26. Overview of experiment: effectiveness measures
- Create contingency tables.
- Find breakeven points.
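One way to locate a breakeven point from ranked test documents is sketched below. It relies on the observation that when exactly as many documents are retrieved as there are relevant ones, a + b = a + c, so precision and recall coincide. Whether the experiment located the breakeven point this way is not stated in the slides; the function name and sample scores are hypothetical.

```python
from typing import List

def breakeven_point(scores: List[float], labels: List[bool]) -> float:
    """Precision (= recall) when the top-R documents are retrieved, R = number of relevant."""
    n_relevant = sum(labels)
    if n_relevant == 0:
        raise ValueError("breakeven point is undefined without relevant documents")
    # Rank documents by descending retrieval value.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(1 for _, is_relevant in ranked[:n_relevant] if is_relevant)
    return hits / n_relevant

# Hypothetical scores and relevance labels for five test documents.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [True, False, True, True, False]
bep = breakeven_point(scores, labels)
```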

Slide 27. Results
Breakeven points (%)

Topic         Findsim   NBayes   SDA     BNets   Trees   SVM
earn          92.9      95.9     96.32   95.8    97.8    98.0
acq           64.7      87.8     85.26   88.3    89.7    93.6
money-fx      46.7      56.6     68.72   58.8    66.2    74.5
grain         67.5      78.8     71.81   81.4    85.0    94.6
crude         70.1      79.5     82.54   79.6    85.0    88.9
trade         65.1      63.5     65.25   69.0    72.5    75.9
interest      63.4      64.9     61.07   71.3    67.1    77.7
wheat         68.9      69.7     76.06   82.7    92.5    91.9
ship          49.2      85.4     65.17   84.4    74.2    85.6
corn          48.2      65.3     75.00   76.4    91.8    90.3
Avg. Top 10   64.6      81.5     84.54   85.…    …       ….0
Avg. All      61.7      75.2     76.37   80.0    N/A     87.0

Slide 28. Thank you!