1 A Text Filtering Method for Digital Libraries
Mustafa Zafer BOLAT, Hayri SEVER
2 Introduction
Information filtering (IF)
- Incoming relevant documents are routed to user profiles (standing queries).
Information retrieval (IR)
- Returns a list of documents ranked by their similarity to the user query.
3 Introduction (continued)
Linear separation - a query partitions the relevant and non-relevant documents into distinct blocks.
Optimal queries - queries that rank all relevant documents ahead of the non-relevant ones.
Steepest Descent Algorithm (SDA) - used here to train an optimal query.
4 Preliminaries
An information retrieval system S can be defined as a 5-tuple $S = (T, D, Q, V, f)$, where
- $T$ is a set of ordered index terms
- $D$ is a set of documents
- $Q$ is a set of queries
- $V$ is a set of real numbers
- $f: D \times Q \to V$ is the retrieval function
5 Preliminaries (continued)
Vector Space Model
- Transforms raw text into more computationally useful forms.
- Documents and queries are represented as vectors of weighted terms:
$d = (t_1, w_{d1};\ t_2, w_{d2};\ \dots;\ t_n, w_{dn}), \quad t_i \in T_d$
$q = (q_1, w_{q1};\ q_2, w_{q2};\ \dots;\ q_m, w_{qm}), \quad q_i \in T_q$
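A minimal sketch of the vector representation, assuming documents and queries are stored as sparse term-to-weight mappings and that the retrieval function f is the inner product. The slides do not say which f is used, so `retrieval_score` and the sample weights are illustrative assumptions.

```python
def retrieval_score(doc, query):
    """Inner product of two sparse term-weight vectors.

    Assumption: f(d, q) is the inner product; the slides leave f unspecified.
    """
    return sum(weight * query[term] for term, weight in doc.items() if term in query)


# Made-up weights, for illustration only.
doc = {"meat": 0.278, "group": 0.127, "standard": 0.012}
query = {"meat": 1.0, "livestock": 0.5}
score = retrieval_score(doc, query)  # only "meat" is shared by both vectors
```

Storing only the non-zero weights keeps each vector small even when the term set T is large.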
6 Preliminaries (continued)
Rnorm value for effectiveness
- Measures how the relevant documents are distributed among the non-relevant ones in the ranked output.
- Rank matters.
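The slide does not reproduce the Rnorm formula. The sketch below assumes the usual pairwise definition, Rnorm = 1/2 (1 + (S+ - S-) / S+max), where S+ counts relevant/non-relevant pairs ranked in the right order, S- counts pairs ranked in the wrong order, and S+max is the total number of such pairs. The function name and its score-list inputs are illustrative.

```python
def rnorm(relevant_scores, nonrelevant_scores):
    """Pairwise Rnorm over retrieval scores (assumed definition, see lead-in)."""
    s_plus = sum(1 for r in relevant_scores for n in nonrelevant_scores if r > n)
    s_minus = sum(1 for r in relevant_scores for n in nonrelevant_scores if r < n)
    s_max = len(relevant_scores) * len(nonrelevant_scores)
    if s_max == 0:
        raise ValueError("need at least one relevant and one non-relevant document")
    return 0.5 * (1 + (s_plus - s_minus) / s_max)


perfect = rnorm([3.0, 2.0], [1.0, 0.0])   # every relevant document ranked first
mixed = rnorm([3.0, 1.0], [2.0, 0.0])     # one pair out of four is inverted
```

A value of 1 means every relevant document is ranked ahead of every non-relevant one, which is the condition an optimal query must satisfy.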
7 Preliminaries (continued)
Contingency table

                         actual relevant   actual non-relevant
predicted relevant              a                   b
predicted non-relevant          c                   d

Precision = a / (a + b)
Recall = a / (a + c)
Breakeven point: the point where precision and recall are equal.
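As a rough illustration of these measures, the sketch below computes precision and recall from the contingency counts and estimates the breakeven point by cutting a ranked list at the number of relevant documents. At that cutoff a + b = a + c, so b = c and precision equals recall. This cutoff rule is one common way to find the breakeven point, not necessarily the one used in the experiment, and the function names are illustrative.

```python
def precision_recall(a, b, c):
    """Precision and recall from contingency counts a, b, c."""
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + c) if a + c else 0.0
    return precision, recall


def breakeven_point(scored):
    """scored: list of (score, is_relevant) pairs.

    Cut the ranking at R = number of relevant documents; precision and
    recall coincide there because both denominators equal R.
    """
    n_relevant = sum(1 for _, is_relevant in scored if is_relevant)
    if n_relevant == 0:
        return 0.0
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    hits = sum(1 for _, is_relevant in ranked[:n_relevant] if is_relevant)
    return hits / n_relevant


p, r = precision_recall(a=8, b=2, c=4)
bep = breakeven_point([(0.9, True), (0.8, False), (0.7, True), (0.6, False)])
```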
8 Overview of experiment
[Diagram: experiment pipeline]
- Reuters-21578 data set, split into train and test sets
- Preprocessing: parsing, removing stop words, stemming, transform to vectors, reducing, normalizing
- Training with SDA on the train set, producing the optimal query
- Test set: category labels, then effectiveness measures
9 Overview of experiment: Reuters-21578 data set
- Consists of 21,578 economic news stories that originally appeared on the Reuters newswire in 1987.
- Each story has been manually assigned one or more indexing labels from a fixed list.
- There are 135 TOPICS labels for classification.
- To use the corpus for machine learning research, it is split into sets of training and testing examples.
10 Overview of experiment: sample Reuters-21578 document
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="9944" NEWID="5031">
13-MAR-1987 15:45:35.38
livestock carcass usa ec
U.S. MEAT GROUP TO FILE TRADE COMPLAINTS
WASHINGTON, March 13 - The American Meat Institute, AME, said it intended to ask the U.S. government to retaliate against a European Community meat inspection requirement. AME President C. Manly Molpus also said the industry would file a petition challenging Korea's ban of U.S. meat products. Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section 301 of the General Agreement on Tariffs and Trade against an EC directive that, effective April 30, will require U.S. meat processing plants to comply fully with EC standards.
Reuter
11 Overview of experiment: preprocessing, parsing
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
Body: U.S. MEAT GROUP TO FILE TRADE COMPLAINTS The American Meat Institute, AME, said it intended to ask the U.S. government to retaliate against a European Community meat inspection requirement. AME President C. Manly Molpus also said the industry would file a petition challenging Korea's ban of U.S. meat products. Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section 301 of the General Agreement on Tariffs and Trade against an EC directive that, effective April 30, will require U.S. meat processing plants to comply fully with EC standards
12 Overview of experiment: preprocessing, after parsing
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
Body: U S MEAT GROUP TO FILE TRADE COMPLAINTS The American Meat Institute AME said it intended to ask the U S government to retaliate against a European Community meat inspection requirement AME President C Manly Molpus also said the industry would file a petition challenging Korea's ban of U S meat products Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section of the General Agreement on Tariffs and Trade against an EC directive that effective April will require U S meat processing plants to comply fully with EC standards
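Judging from the before and after text, parsing drops punctuation and digits while keeping apostrophes inside words such as "Korea's". A minimal sketch of that step, assuming a single regular-expression pass; the function name is illustrative and the actual parser is not described on the slides.

```python
import re


def strip_punctuation_and_digits(text):
    """Replace anything that is not a letter, apostrophe, or whitespace
    with a space, then collapse runs of whitespace."""
    cleaned = re.sub(r"[^A-Za-z'\s]", " ", text)
    return " ".join(cleaned.split())


sample = "AME,said it intended to ask the U.S. government; Section 301, April 30, Korea's ban"
parsed = strip_punctuation_and_digits(sample)
```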
13 Overview of experiment: preprocessing, removing stop words
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
Body: U.S. MEAT GROUP FILE TRADE COMPLAINTS The American Meat Institute, AME, said it intended to ask the U.S. government to retaliate against a European Community meat inspection requirement. AME President C. Manly Molpus also said the industry would file a petition challenging Korea's ban of U.S. meat products. Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section 301 of the General Agreement on Tariffs and Trade against an EC directive that, effective April 30, will require U.S. meat processing plants to comply fully with EC standards
14 Overview of experiment: preprocessing, after removing stop words
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
Body: MEAT GROUP FILE TRADE COMPLAINTS American Meat Institute AME intended ask government retaliate European Community meat inspection requirement. AME President Manly Molpus industry file petition challenging Korea's ban U.S. meat products Molpus Senate Agriculture subcommittee AME livestock farm groups intended file petition Section General Agreement Tariffs Trade EC directive effective April require meat processing plants comply fully EC standards
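Stop-word removal can be sketched as a simple filter over the parsed tokens. The stop list below is a tiny hypothetical one chosen to reproduce the first sentence of the sample; the list actually used in the experiment is not given on the slides.

```python
# Hypothetical stop list, for illustration only.
STOP_WORDS = {
    "the", "to", "it", "a", "an", "of", "and", "that", "said", "also",
    "would", "will", "with", "against", "under", "other",
}


def remove_stop_words(text):
    """Drop every token whose lowercase form is in STOP_WORDS."""
    return [token for token in text.split() if token.lower() not in STOP_WORDS]


tokens = remove_stop_words(
    "The American Meat Institute AME said it intended to ask the government"
)
```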
15 Overview of experiment: preprocessing, stemming
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
Body: MEAT GROUP FILE TRADE COMPLAINT American Meat Institute AME intended ask government retaliate European Community meat inspection requirement. AME President Manly Molpus industry file petition challeng Korea ban meat product Molpus Senate Agriculture subcommittee AME livestock farm group intended file petition Section General Agreement Tariff Trade EC direct effect April require meat process plant compli fulli EC standard
16 Overview of experiment: preprocessing, transform to vectors
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
meat 5
group 1
...
Molpus 1
...
standard 1
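Transforming a document into a vector amounts to counting how often each stemmed term occurs. A minimal sketch using `collections.Counter`; folding terms to lowercase is an assumption made here, and the sample token string is shortened for illustration.

```python
from collections import Counter


def term_frequencies(tokens):
    """Map each term (lowercased, an assumption) to its frequency in the document."""
    return Counter(token.lower() for token in tokens)


tf = term_frequencies(
    "MEAT GROUP FILE TRADE COMPLAINT American Meat Institute meat inspection".split()
)
```

A `Counter` returns 0 for terms that do not occur, which is convenient when comparing a document against a larger dictionary.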
17 Overview of experiment: preprocessing, create dictionary (only in training)
approv 1236
chairman 1225
...
ptd 5
18 Overview of experiment: preprocessing, reducing
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
...
group 1
meat 5
Molpus
...
standard 1
...
19 Overview of experiment: preprocessing, after reducing
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
...
group 1
meat 5
...
standard 1
...
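The slides show that reducing removes some terms (here "Molpus") from the document vector but do not state the criterion. The sketch below assumes that a term is kept only if its count in the training dictionary reaches a minimum; `min_count`, the function name, and all counts are hypothetical.

```python
def reduce_vector(tf, dictionary, min_count=5):
    """Keep only terms whose dictionary count is at least min_count.

    Assumption: the real reduction rule is not given on the slides.
    """
    return {term: freq for term, freq in tf.items()
            if dictionary.get(term, 0) >= min_count}


# Made-up dictionary counts, for illustration only.
dictionary = {"meat": 300, "group": 900, "standard": 400, "molpus": 2}
tf = {"group": 1, "meat": 5, "molpus": 1, "standard": 1}
reduced = reduce_vector(tf, dictionary)
```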
20 Overview of experiment: preprocessing, normalizing
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
...
group 1
meat 5
...
standard 1
...
$w_k = t_k \times \log(N_D / n_k)$
- $t_k$: term frequency
- $N_D$: number of documents in the collection
- $n_k$: number of documents containing term $k$
- $w_k$: unnormalized weight of term $k$, from which the normalized weight of term $k$ is computed
21 Overview of experiment: preprocessing, after normalizing
HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass
...
group 0.127
meat 0.278
...
standard 0.012
...
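The weighting step can be sketched as follows. The unnormalized weight follows the formula on the normalizing slide, $w_k = t_k \times \log(N_D / n_k)$. The normalization itself is not legible in the slides, so scaling to unit Euclidean length is assumed here, as is the natural logarithm. The document frequencies and collection size are made up.

```python
import math


def tfidf_weights(tf, df, n_docs):
    """Unnormalized weights: w_k = t_k * log(N_D / n_k)."""
    return {term: freq * math.log(n_docs / df[term]) for term, freq in tf.items()}


def normalize(weights):
    """Scale the vector to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {term: w / norm for term, w in weights.items()}


tf = {"group": 1, "meat": 5, "standard": 1}
df = {"group": 900, "meat": 300, "standard": 400}  # made-up document frequencies
weights = normalize(tfidf_weights(tf, df, n_docs=10000))
```

Normalizing keeps long documents from dominating the ranking simply because they contain more term occurrences.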
22 Overview of experiment: training with SDA
1. Choose a starting query vector $Q_0$; let $k = 0$.
2. Let $Q_k$ be the query vector at the start of the $(k+1)$th iteration; identify the following set of difference vectors:
   $\Gamma(Q_k) = \{\, b = d - d' : d \succ d' \text{ and } f(Q_k, b) \le 0 \,\}$
   where $d \succ d'$ means that document $d$ is preferred to document $d'$. If $\Gamma(Q_k) = \emptyset$, then $Q_{opt} = Q_k$ is a solution; exit. Otherwise, continue with Step 3.
3. Let $Q_{k+1} = Q_k + \sum_{b \in \Gamma(Q_k)} b$.
4. Let $k = k + 1$; go back to Step 2.
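The four steps can be sketched in a few lines. This is an illustrative reading, not the authors' implementation: it assumes dense vectors of equal length, takes f to be the inner product, treats every relevant document as preferred to every non-relevant one, and starts from the zero vector. The function name and the `max_iter` cap are assumptions.

```python
def steepest_descent(relevant, nonrelevant, max_iter=150):
    """Return (query, converged) for dense, equal-length document vectors."""
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    # One difference vector b = d - d' per (relevant, non-relevant) pair.
    diffs = [[x - y for x, y in zip(d, d_prime)]
             for d in relevant for d_prime in nonrelevant]
    query = [0.0] * len(relevant[0])            # Step 1: starting query Q_0
    for _ in range(max_iter):
        # Step 2: difference vectors the current query ranks incorrectly.
        gamma = [b for b in diffs if dot(query, b) <= 0]
        if not gamma:
            return query, True                  # Q_k already separates the sets
        # Step 3: add the sum of the violating difference vectors.
        query = [q + sum(b[i] for b in gamma) for i, q in enumerate(query)]
    return query, False                         # no solution within max_iter


relevant = [[1.0, 0.0], [1.0, 1.0]]
nonrelevant = [[0.0, 1.0], [0.0, 0.0]]
query, converged = steepest_descent(relevant, nonrelevant)
```

When the loop ends without convergence, the caller still gets the last query back, so a separate effectiveness check can decide which iterate to keep.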
23 Overview of experiment: training with SDA
- All examples of the category are used as positive examples.
- A random 60% of the documents from other topics are used as negative examples.
- If the maximum Rnorm value (1) is not reached within the maximum of 150 iterations, the optimal query is set to the query that produces the highest Rnorm value found.
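A small sketch of how such a training set might be assembled for one category. The data layout (a list of document ids paired with topic sets), the function name, and the fixed seed are assumptions made for illustration.

```python
import random


def build_training_examples(docs, category, negative_fraction=0.6, seed=0):
    """docs: list of (doc_id, set_of_topics). Returns (positives, negatives)."""
    positives = [doc_id for doc_id, topics in docs if category in topics]
    others = [doc_id for doc_id, topics in docs if category not in topics]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    negatives = rng.sample(others, round(negative_fraction * len(others)))
    return positives, negatives


# Toy collection: documents 0-3 are "earn", documents 4-13 are "acq".
docs = [(i, {"earn"} if i < 4 else {"acq"}) for i in range(14)]
positives, negatives = build_training_examples(docs, "earn")
```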
24 Overview of experiment: category labels
There are 135 categories. Number of positive examples for the ten topics shown:

Topic       # of + (train)   # of + (test)
earn             2877             1087
acq              1650              719
money-fx          538              179
grain             433              149
crude             389              189
trade             369              118
interest          347              131
wheat             212               71
ship              197               89
corn              182               56
25 Overview of experiment: effectiveness measures
- Create contingency tables.
- Find breakeven points.
26 Results
Breakeven points (%)

Topic         Findsim   NBayes   SDA     BayesNets   Trees   SVM
earn          92.9      95.9     96.32   95.8        97.8    98.0
acq           64.7      87.8     85.26   88.3        89.7    93.6
money-fx      46.7      56.6     68.72   58.8        66.2    74.5
grain         67.5      78.8     71.81   81.4        85.0    94.6
crude         70.1      79.5     82.54   79.6        85.0    88.9
trade         65.1      63.5     65.25   69.0        72.5    75.9
interest      63.4      64.9     61.07   71.3        67.1    77.7
wheat         68.9      69.7     76.06   82.7        92.5    91.9
ship          49.2      85.4     65.17   84.4        74.2    85.6
corn          48.2      65.3     75.00   76.4        91.8    90.3
Avg. Top 10   64.6      81.5     84.54   85.0        88.4    92.0
Avg. All      61.7      75.2     76.37   80.0        N/A     87.0
27 Thank you!