A Text Filtering Method For Digital Libraries. Mustafa Zafer BOLAT, Hayri SEVER.




1 A Text Filtering Method For Digital Libraries. Mustafa Zafer BOLAT, Hayri SEVER

2 introduction
- Information filtering (IF): incoming relevant documents are routed to profiles (queries).
- Information retrieval (IR): provides a list of documents ordered by their similarity to the user query.

3 introduction (continued)
- Linear separation: partitions relevant and non-relevant documents into distinct blocks.
- Optimal queries: all relevant documents are ranked ahead of non-relevant ones.
- Steepest Descent Algorithm (SDA)

4 preliminaries
An information retrieval system S can be defined as a 5-tuple S = (T, D, Q, V, f):
- T: set of ordered index terms
- D: set of documents
- Q: set of queries
- V: set of real numbers
- f: D x Q → V, the retrieval function

5 preliminaries (continued)
Vector Space Model
- Transformation of raw text into more computationally useful forms
- Documents and queries are represented as vectors of weighted terms:
  d = (t_1, w_d1; t_2, w_d2; ...; t_n, w_dn), where each t_i is an index term in T occurring in d
  q = (q_1, w_q1; q_2, w_q2; ...; q_m, w_qm), where each q_i is an index term in T occurring in q
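A minimal sketch of the weighted-term representation above, assuming sparse vectors stored as Python dicts and an inner product as the retrieval function f. The slides do not state which similarity function was used, so `inner_product` and the example weights are illustrative only.

```python
def inner_product(doc, query):
    """Similarity of two sparse term -> weight vectors (assumed form of f)."""
    # Iterate over the shorter vector and look its terms up in the longer one.
    if len(doc) > len(query):
        doc, query = query, doc
    return sum(w * query.get(term, 0.0) for term, w in doc.items())


# Hypothetical weights, chosen only to show the data layout.
d = {"meat": 0.5, "group": 0.25, "trade": 0.25}
q = {"meat": 1.0, "livestock": 2.0}
score = inner_product(d, q)  # only "meat" is shared by d and q
```

Storing only the terms that actually occur keeps each vector small even when the dictionary T is large.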

6 preliminaries (continued)
Rnorm value for effectiveness
- It measures how relevant documents are distributed relative to non-relevant ones in the ranked output, so rank matters.
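The Rnorm formula itself is not in the transcript. The sketch below assumes the commonly cited definition Rnorm = 1/2 (1 + (S+ - S-) / S+max), where S+ counts (relevant, non-relevant) pairs ranked in the right order, S- counts pairs ranked in the wrong order, and S+max is the number of such pairs. The function name and the binary relevance labels are assumptions.

```python
def rnorm(scores, relevant):
    """Rnorm for one ranking, under the assumed definition described above.

    scores:   retrieval scores, one per document
    relevant: booleans of the same length, True for relevant documents
    """
    rel = [s for s, r in zip(scores, relevant) if r]
    non = [s for s, r in zip(scores, relevant) if not r]
    s_max = len(rel) * len(non)
    if s_max == 0:
        return 1.0  # no (relevant, non-relevant) pair can be mis-ordered
    s_plus = sum(1 for a in rel for b in non if a > b)
    s_minus = sum(1 for a in rel for b in non if a < b)
    return 0.5 * (1 + (s_plus - s_minus) / s_max)
```

Under this definition a value of 1 means every relevant document outranks every non-relevant one, which is the condition the optimal query is meant to satisfy.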

7 preliminaries (continued)
Contingency table

                          actual relevant   actual non-relevant
predicted relevant               a                  b
predicted non-relevant           c                  d

Precision = a / (a + b)
Recall = a / (a + c)
Breakeven point: the point where precision and recall are equal
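As a small illustration of the contingency table, the following sketch counts a, b, and c from parallel lists of predicted and actual labels. The function name and the convention of returning 0.0 for an empty denominator are assumptions.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) from parallel lists of booleans."""
    pairs = list(zip(predicted, actual))
    a = sum(1 for p, t in pairs if p and t)        # predicted and actually relevant
    b = sum(1 for p, t in pairs if p and not t)    # predicted relevant, actually not
    c = sum(1 for p, t in pairs if not p and t)    # relevant documents that were missed
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + c) if a + c else 0.0
    return precision, recall
```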

8 overview of experiment
[Pipeline diagram: Reuters-21578 data set, split into train and test sets; preprocessing (parsing, removing stop words, stemming, transform to vectors, reducing, normalizing); training with SDA, which produces the optimal query; category labels; effectiveness measures.]

9 overview of experiment (Reuters-21578 data set)
- Consists of 21578 economic news stories that originally appeared on the Reuters newswire in 1987.
- Each story has been manually assigned one or more indexing labels from a fixed list.
- There are 135 TOPIC labels for classification.
- To use the text corpus for machine learning research, it is split into sets of training and testing examples.

10 overview of experiment (sample Reuters-21578 document)
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="9944" NEWID="5031">
13-MAR-1987 15:45:35.38
livestock carcass usa ec
U.S. MEAT GROUP TO FILE TRADE COMPLAINTS
WASHINGTON, March 13 - The American Meat Institute, AME,said it intended to ask the U.S. government to retaliate against a European Community meat inspection requirement. AME President C. Manly Molpus also said the industry would file a petition challenging Korea's ban of U.S. meat products. Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section 301 of the General Agreement on Tariffs and Trade against an EC directive that, effective April 30, will require U.S. meat processing plants to comply fully with EC standards. Reuter

11 overview of experiment (preprocessing: parsing)
HAS TOPICS=YES LEWISSPLIT=TRAIN TOPICS:livestock,carcass
Body: U.S. MEAT GROUP TO FILE TRADE COMPLAINTS The American Meat Institute, AME,said it intended to ask the U.S. government to retaliate against a European Community meat inspection requirement. AME President C. Manly Molpus also said the industry would file a petition challenging Korea's ban of U.S. meat products. Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section 301 of the General Agreement on Tariffs and Trade against an EC directive that, effective April 30, will require U.S. meat processing plants to comply fully with EC standards

12 overview of experiment (preprocessing: after parsing)
HAS TOPICS=YES LEWISSPLIT=TRAIN TOPICS:livestock,carcass
Body: U S MEAT GROUP TO FILE TRADE COMPLAINTS The American Meat Institute AME said it intended to ask the U S government to retaliate against a European Community meat inspection requirement AME President C Manly Molpus also said the industry would file a petition challenging Korea's ban of U S meat products Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section of the General Agreement on Tariffs and Trade against an EC directive that effective April will require U S meat processing plants to comply fully with EC standards
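Comparing the two parsing slides, punctuation and digits disappear while an apostrophe inside a word (Korea's) survives. A rough sketch of a tokenizer with that behavior, assuming a regular expression; the authors' actual parser is not shown, and `parse_body` is a hypothetical name.

```python
import re

# Runs of letters, optionally joined by an internal apostrophe (e.g. "Korea's").
TOKEN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def parse_body(text):
    """Split raw body text into word tokens, dropping punctuation and numbers."""
    return TOKEN.findall(text)


tokens = parse_body("U.S. MEAT GROUP, Section 301 of Korea's ban")
```

Here "U.S." falls apart into "U" and "S" and "301" is dropped, as in the "After Parsing" slide.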

13 overview of experiment (preprocessing: removing stop words)
HAS TOPICS=YES LEWISSPLIT=TRAIN TOPICS:livestock,carcass
Body: U.S. MEAT GROUP FILE TRADE COMPLAINTS The American Meat Institute, AME,said it intended to ask the U.S. government to retaliate against a European Community meat inspection requirement. AME President C. Manly Molpus also said the industry would file a petition challenging Korea's ban of U.S. meat products. Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section 301 of the General Agreement on Tariffs and Trade against an EC directive that, effective April 30, will require U.S. meat processing plants to comply fully with EC standards

14 overview of experiment (preprocessing: after removing stop words)
HAS TOPICS=YES LEWISSPLIT=TRAIN TOPICS:livestock,carcass
Body:. MEAT GROUP FILE TRADE COMPLAINTS American Meat Institute AME intended ask government retaliate European Community meat inspection requirement. AME President Manly Molpus industry file petition challenging Korea's ban U.S. meat products Molpus Senate Agriculture subcommittee AME livestock farm groups intended file petition Section General Agreement Tariffs Trade EC directive effective April require meat processing plants comply fully EC standards
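Stop-word removal as illustrated above can be sketched as a filter against a fixed word list. The stop list used in the experiment is not given, so `STOP_WORDS` below is a small hypothetical sample drawn from words that vanish between the two slides.

```python
# Hypothetical sample only; the real stop list is not shown in the slides.
STOP_WORDS = frozenset({
    "a", "an", "and", "also", "it", "of", "other",
    "said", "that", "the", "to", "will", "with", "would",
})


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]


kept = remove_stop_words(["MEAT", "GROUP", "TO", "FILE", "TRADE", "COMPLAINTS"])
```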

15 overview of experiment (preprocessing: stemming)
HAS TOPICS=YES LEWISSPLIT=TRAIN TOPICS:livestock,carcass
Body: MEAT GROUP FILE TRADE COMPLAINT American Meat Institute AME intended ask government retaliate European Community meat inspection requirement. AME President Manly Molpus industry file petition challeng Korea ban meat product Molpus Senate Agriculture subcommittee AME livestock farm group intended file petition Section General Agreement Tariff Trade EC direct effect April require meat process plant compli fulli EC standard

16 overview of experiment (preprocessing: transform to vectors)
HAS TOPICS=YES LEWISSPLIT=TRAIN TOPICS:livestock,carcass
meat 5
group 1
...
Molpus 1
...
standard 1
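The term counts above can be produced by counting stemmed tokens. A minimal sketch, assuming case folding before counting (the count "meat 5" suggests that MEAT, Meat, and meat are merged, but the slides do not say so explicitly); `to_term_frequencies` is a hypothetical name.

```python
from collections import Counter


def to_term_frequencies(tokens):
    """Map each (lowercased) term to the number of times it occurs."""
    return Counter(t.lower() for t in tokens)


tf = to_term_frequencies(["MEAT", "GROUP", "Meat", "meat", "standard"])
```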

17 overview of experiment (preprocessing: create dictionary, only in training)
approv 1236
chairman 1225
...
ptd 5

18 overview of experiment (preprocessing: reducing)
HAS TOPICS=YES LEWISSPLIT=TRAIN TOPICS:livestock,carcass
...
group 1
meat 5
Molpus
...
standard 1
...

19 overview of experiment (preprocessing: after reducing)
HAS TOPICS=YES LEWISSPLIT=TRAIN TOPICS:livestock,carcass
...
group 1
meat 5
...
standard 1
...
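The dictionary and reducing steps can be sketched together. This assumes the dictionary counts are document frequencies and that terms below a cutoff are left out, so that reducing simply drops terms missing from the dictionary (as happens to Molpus above). The cutoff of 5 is a guess based on the smallest count shown (ptd 5); the function names are hypothetical.

```python
from collections import Counter


def build_dictionary(training_vectors, min_df=5):
    """Count in how many training documents each term occurs; keep frequent terms.

    min_df=5 is an assumption, not a value stated in the slides.
    """
    df = Counter()
    for vec in training_vectors:
        df.update(vec.keys())  # each document contributes at most 1 per term
    return {term: n for term, n in df.items() if n >= min_df}


def reduce_vector(vec, dictionary):
    """Keep only the terms of a document vector that appear in the dictionary."""
    return {term: f for term, f in vec.items() if term in dictionary}


train = [{"meat": 5, "group": 1, "molpus": 1}, {"meat": 1, "group": 2}]
dictionary = build_dictionary(train, min_df=2)
reduced = reduce_vector(train[0], dictionary)
```

Building the dictionary from the training set only, as the slide notes, keeps test documents from influencing which terms are retained.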

20 overview of experiment (preprocessing: normalizing)
HAS TOPICS=YES LEWISSPLIT=TRAIN TOPICS:livestock,carcass
...
group 1
meat 5
...
standard 1
...
w_k = t_k × log(N_D / n_k)
- t_k: term frequency
- N_D: number of documents in the collection
- n_k: number of documents containing term k
The unnormalized weight of term k is then converted to a normalized weight of term k.

21 overview of experiment (preprocessing: after normalizing)
HAS TOPICS=YES LEWISSPLIT=TRAIN TOPICS:livestock,carcass
...
group 0.127
meat 0.278
...
standard 0.012
...
w_k = t_k × log(N_D / n_k), where t_k is the term frequency, N_D the number of documents in the collection, and n_k the number of documents containing term k. The values above are the normalized weights obtained from the unnormalized weights.
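A sketch of the weighting step, using the formula w_k = t_k × log(N_D / n_k) from the slide. The normalization formula is missing from the transcript, so cosine (unit-length) normalization is used here purely as an assumption, as is the natural logarithm; the function names are hypothetical.

```python
import math


def tfidf(tf, df, n_docs):
    """w_k = t_k * log(N_D / n_k) for every term k in the document."""
    return {term: f * math.log(n_docs / df[term]) for term, f in tf.items()}


def cosine_normalize(weights):
    """Scale weights to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {term: w / norm for term, w in weights.items()}


# Hypothetical document frequencies and collection size.
w = tfidf({"meat": 5, "group": 1}, {"meat": 10, "group": 100}, n_docs=1000)
normalized = cosine_normalize(w)
```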

22 overview of experiment (training with SDA)
1. Choose a starting query vector Q_0; let k = 0.
2. Let Q_k be the query vector at the start of the (k+1)th iteration; identify the following set of difference vectors:
   Γ(Q_k) = {b = d - d' : d ≻ d' and f(Q_k, b) ≤ 0};
   if Γ(Q_k) = ∅, Q_opt = Q_k is a solution and exit; otherwise,
3. Let Q_{k+1} = Q_k + Σ_{b ∈ Γ(Q_k)} b.
4. Let k = k + 1; go back to Step 2.
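The steps above can be sketched in a few lines. This sketch assumes dense vectors as Python lists, an inner product for f, binary relevance (every relevant document is preferred to every non-relevant one), and an update that adds all mis-ordered difference vectors to the query. The names `sda` and `dot` are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length dense vectors (assumed form of f)."""
    return sum(a * b for a, b in zip(u, v))


def sda(relevant, nonrelevant, q0, max_iters=150):
    """Return (query, converged) after at most max_iters update steps."""
    q = list(q0)
    for _ in range(max_iters):
        # Difference vectors b = d - d' for every preferred pair (d over d').
        diffs = [[x - y for x, y in zip(d, dp)]
                 for d in relevant for dp in nonrelevant]
        # Keep only the pairs that the current query fails to order correctly.
        gamma = [b for b in diffs if dot(q, b) <= 0]
        if not gamma:
            return q, True  # every relevant document outranks every non-relevant one
        # Add the sum of the mis-ordered difference vectors to the query.
        for b in gamma:
            q = [qi + bi for qi, bi in zip(q, b)]
    return q, False


rel = [[1.0, 0.0], [0.9, 0.1]]
non = [[0.0, 1.0], [0.2, 0.8]]
q_opt, converged = sda(rel, non, q0=[0.0, 0.0])
```

When the two classes are linearly separable the loop stops with an empty set of mis-ordered pairs; otherwise it stops at the iteration cap.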

23 overview of experiment (training with SDA)
- All examples of the category are used as positive examples.
- A random 60% of the examples from other topics are used as negative examples.
- If the maximum Rnorm value (1) is not reached within the maximum of 150 iterations, the optimal query is set to the query that produced the highest Rnorm value found.
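One reading of the sampling rule above, sketched under the assumption that each document is a dict with a "topics" list and that "random 60%" means 60% of the training documents not labeled with the category. `build_training_set`, the document layout, and the fixed seed are all hypothetical.

```python
import random


def build_training_set(docs, category, negative_fraction=0.6, seed=0):
    """Split docs into positives for `category` and a random sample of negatives."""
    positives = [d for d in docs if category in d["topics"]]
    others = [d for d in docs if category not in d["topics"]]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    negatives = rng.sample(others, round(negative_fraction * len(others)))
    return positives, negatives


docs = [{"id": i, "topics": ["earn"]} for i in range(10)]
docs += [{"id": 10, "topics": ["livestock", "carcass"]},
         {"id": 11, "topics": ["livestock"]}]
positives, negatives = build_training_set(docs, "livestock")
```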

24 overview of experiment (category labels)
There are 135 categories. Number of positive examples (# of +) for the ten largest topics:

Topic      train   test
earn       2877    1087
acq        1650    719
money-fx   538     179
grain      433     149
crude      389     189
trade      369     118
interest   347     131
wheat      212     71
ship       197     89
corn       182     56

25 overview of experiment (effectiveness measures)
- Create contingency tables
- Find breakeven points
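When a cutoff is swept down a ranked list, precision a / (a + b) and recall a / (a + c) are equal exactly when the number of retrieved documents equals the number of relevant ones. The sketch below uses that observation to locate a breakeven point; how the authors actually computed it (for example, by interpolation) is not stated, so this is only one plausible approach, and `breakeven_point` is a hypothetical name.

```python
def breakeven_point(scores, relevant):
    """Precision (= recall) when as many documents are retrieved as are relevant."""
    n_rel = sum(1 for r in relevant if r)
    if n_rel == 0:
        return 0.0
    # Rank documents by descending score and retrieve the top n_rel.
    ranked = sorted(zip(scores, relevant), key=lambda pair: pair[0], reverse=True)
    hits = sum(1 for _, r in ranked[:n_rel] if r)
    return hits / n_rel
```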

26 Results
Breakeven points:

Topic         Findsim   NBayes   SDA     BNets   Trees   SVM
earn          92.9      95.9     96.32   95.8    97.8    98.0
acq           64.7      87.8     85.26   88.3    89.7    93.6
money-fx      46.7      56.6     68.72   58.8    66.2    74.5
grain         67.5      78.8     71.81   81.4    85.0    94.6
crude         70.1      79.5     82.54   79.6    85.0    88.9
trade         65.1      63.5     65.25   69.0    72.5    75.9
interest      63.4      64.9     61.07   71.3    67.1    77.7
wheat         68.9      69.7     76.06   82.7    92.5    91.9
ship          49.2      85.4     65.17   84.4    74.2    85.6
corn          48.2      65.3     75.00   76.4    91.8    90.3
Avg. Top 10   64.6      81.5     84.54   85.0    88.4    92.0
Avg. All      61.7      75.2     76.37   80.0    N/A     87.0

27 Thank you!

