Amy Dai Machine learning techniques for detecting topics in research papers.


1 Amy Dai Machine learning techniques for detecting topics in research papers

2 The Goal Build a web application that allows users to easily browse and search papers

3 Project Overview
1. Part I – Data Processing
   Convert PDF to text
   Extract information from documents
2. Part II – Discovering topics
   Index documents
   Group documents by similarity
   Learn underlying topics

4 Part I - Data Processing How do we extract information from PDF documents?

5 Pdf to Text Research papers are in PDF PDFs are images Computer sees colored lines and dots Conversion process loses some of the formatting
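The conversion step can be sketched in Python as a thin wrapper around the pdftotext command listed under Useful Tools. This is only an illustration: the function names and file paths are hypothetical, and using the `-layout` option (which asks pdftotext to preserve the physical layout) is an assumption, not something the slides state.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for a pdftotext call (hypothetical helper)."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the original physical layout.
        cmd.append("-layout")
    cmd += [pdf_path, txt_path]
    return cmd


def pdf_to_text(pdf_path, txt_path):
    """Convert one PDF to a text file; requires pdftotext on the PATH."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)


# Show the command that would be run for a hypothetical file.
print(build_pdftotext_command("paper.pdf", "paper.txt"))
```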

6 Getting what we need Construct heuristic rules to extract info Title: first line Authors: between title and abstract Abstract: preceded by “Abstract” Keywords: preceded by “Keywords”
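A minimal sketch of such heuristic rules, assuming the four rules map to title, authors, abstract and keywords in that order. The function name `extract_fields` and the sample text are made up for illustration; real converted PDFs are messier, and without a "Keywords" line this sketch would treat everything after "Abstract" as the abstract.

```python
import re


def extract_fields(text):
    """Apply simple positional heuristics to the plain text of one paper."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    fields = {"title": None, "authors": None, "abstract": None, "keywords": None}
    if not lines:
        return fields

    # Rule 1: the title is the first non-empty line.
    fields["title"] = lines[0]

    # Find the lines that start with "Abstract" and "Keywords" (case-insensitive).
    abstract_idx = next(
        (i for i, ln in enumerate(lines) if re.match(r"abstract\b", ln, re.I)), None)
    keywords_idx = next(
        (i for i, ln in enumerate(lines) if re.match(r"keywords\b", ln, re.I)), None)

    if abstract_idx is not None:
        # Rule 2: the author block sits between the title and the abstract.
        fields["authors"] = lines[1:abstract_idx]
        # Rule 3: the abstract follows the "Abstract" marker, up to "Keywords".
        if keywords_idx is not None and keywords_idx > abstract_idx:
            end = keywords_idx
        else:
            end = len(lines)
        first = re.sub(r"^abstract\b[\s:.\-]*", "", lines[abstract_idx], flags=re.I)
        body = ([first] if first else []) + lines[abstract_idx + 1:end]
        fields["abstract"] = " ".join(body)

    if keywords_idx is not None:
        # Rule 4: keywords follow the "Keywords" marker on the same line.
        raw = re.sub(r"^keywords\b[\s:.\-]*", "", lines[keywords_idx], flags=re.I)
        fields["keywords"] = [k.strip() for k in raw.split(",") if k.strip()]
    return fields


# Made-up sample; the abstract text is a placeholder.
sample = """Spam, Damn Spam, and Statistics
Dennis Fetterly
Mark Manasse
Abstract
Placeholder abstract text.
Keywords: Web search, adversarial information retrieval, web spam
"""
print(extract_fields(sample))
```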

7 Finding Names

8 Can we predict names? Named Entity Tagger by the Cognitive Computation Group at the University of Illinois at Urbana-Champaign.
Spam, Damn Spam, and Statistics: Using statistical analysis to locate spam web pages
Dennis Fetterly, Microsoft Research, 1065 La Avenida, Mountain View, CA 94043, USA, fetterly@microsoft.com
Mark Manasse, Microsoft Research, 1065 La Avenida, Mountain View, CA 94043, USA, manasse@microsoft.com
Marc Najork, Microsoft Research, 1065 La Avenida, Mountain View, CA 94043, USA, najork@microsoft.com

9 Accuracy To determine how well my extraction script worked: accuracy = (# right + # needing minor changes) / total # of documents. Example: 30 were correctly extracted, 10 needed minor changes, 60 total documents, so (30 + 10) / 60 = 66.7%
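The accuracy measure above is simple enough to express as a small function. The function and parameter names below are hypothetical; the numbers are the ones from the slide's example.

```python
def extraction_accuracy(num_right, num_minor_changes, total_documents):
    """Fraction of documents extracted correctly or needing only minor changes."""
    if total_documents <= 0:
        raise ValueError("total_documents must be positive")
    return (num_right + num_minor_changes) / total_documents


# Slide example: 30 correct, 10 needing minor changes, 60 documents in total.
print(f"{extraction_accuracy(30, 10, 60):.1%}")  # prints 66.7%
```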

10 Accuracy and Error

Field     Perfect Match (%)  Partial Match (%)  No Match (%)
Title     78                 5                  17
Abstract  63                 12                 35
Keywords  68.75              12.5               18.75
Authors   3831

11 Part II – Learning Topics Can we use machine learning to discover underlying topics?

12 The Underlying Topic “Security and Trust in the Web” “Security of World Wide Web Search Engines” “When Documents Deceive” “The Quest for Correct Information on the Web”

13 Indexing Documents Index documents Remove common words, leaving better descriptors for clustering Compare to corpus Brown Corpus: A Standard Corpus of Present-Day Edited American English, from the Natural Language Toolkit Reduce from 19,100 to 12,400 words Documents contain between 100 and 1,700 words after common word removal
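A rough sketch of common word removal against a reference corpus. It assumes the "common word frequency cutoff" means: drop any word that occurs more than `cutoff` times in the reference corpus. The slides do not spell this out, so treat it as a guess. The reference counts below are made-up toy numbers; with NLTK installed and the Brown Corpus downloaded, real counts could be built from `nltk.corpus.brown.words()`.

```python
from collections import Counter


def remove_common_words(doc_tokens, reference_counts, cutoff):
    """Keep tokens whose count in the reference corpus is at most `cutoff`."""
    return [t for t in doc_tokens if reference_counts.get(t.lower(), 0) <= cutoff]


# Toy stand-in for reference corpus counts (made-up values, not Brown Corpus data).
reference_counts = Counter({"the": 1000, "in": 800, "search": 18, "web": 12, "quality": 7})

tokens = "defining quality in web search results".split()
for cutoff in (5, 10, 15, 20):
    print(cutoff, remove_common_words(tokens, reference_counts, cutoff))
```

Under this reading, a higher cutoff removes fewer words and leaves a larger index.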

14 Effect on Index Size

Common Word Frequency Cutoff  Index Size
20                            357
15                            318
10                            276
5                             230

Changes in document index size for “Defining quality in web search results”

15 Keeping What’s Important Common Word Frequency Cutoff: 5, 10, 15, 20. querying, web; google, querying; yahoo, google; metrics, yahoo, controversial; retrieval, evaluating, yahoo, engines; metrics, evaluating, yahoo; retrieval, metrics, evaluating; retrieval, metrics; retrieval. Words in abstract of “Defining quality in web search results”

16 Documents as Vectors Represent documents as numerical vectors by transforming words to numbers using tf-idf Length is normalized Vector length is the length of index for corpus Mostly sparse
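A small pure-Python sketch of turning tokenized documents into length-normalized tf-idf vectors. The weighting used here, raw term count times log(N / document frequency), is one common variant and an assumption; the slides do not say which formula was used. The toy documents are made up.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, unit-length tf-idf vectors)."""
    vocab = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[t] * math.log(n_docs / df[t]) for t in vocab]
        # Normalize to unit length so long documents do not dominate.
        norm = math.sqrt(sum(x * x for x in vec))
        vectors.append([x / norm for x in vec] if norm else vec)
    return vocab, vectors


docs = [["web", "spam", "spam"], ["web", "search"], ["trust", "web"]]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
print(vectors[0])
```

Each vector has one entry per word in the corpus index, so most entries are zero for any single document. A word that appears in every document ("web" here) gets weight zero.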

17 Clustering using Machine Learning Use machine learning algorithms to cluster by: K-means Group Average Agglomerative (GAA) Unsupervised learning Cosine similarity
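To make the clustering step concrete, here is a minimal sketch of agglomerative clustering with cosine similarity. It uses the average-link simplification (mean similarity between members of two clusters) and stops at a requested number of clusters. The slides used NLTK's clustering algorithms, so this is a stand-in for illustration, not the author's code; all names and the toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def average_similarity(c1, c2, vectors):
    """Mean pairwise cosine similarity between members of two clusters."""
    sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
    return sum(sims) / len(sims)


def agglomerative_cluster(vectors, num_clusters):
    """Repeatedly merge the two most similar clusters until num_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > num_clusters:
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_similarity(clusters[p[0]], clusters[p[1]], vectors),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters


# Toy 2-D vectors: two point roughly along one axis, two along the other.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
print(agglomerative_cluster(vectors, 2))
```

Stopping at a fixed number of clusters corresponds to choosing where to cut the dendrogram.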

18 Clustering Results

Documents
A: SpamRank – Fully Automatic Link Spam Detection
B: An Approach to Confidence Based Page Ranking for User Oriented Web Search
C: Spam, Damn Spam, and Statistics
D: Web Spam, Propaganda and Trust
E: Detecting Spam Web Pages through Content Analysis
F: A Survey of Trust and Reputation Systems for Online Service Provision

K-Means
Group 1: A
Group 2: B, C, D, E
Group 3: F

GAA
Group 1: B
Group 2: A, C, D, E
Group 3: F

19 Challenges K-Means: finding K. Group Average Agglomerative: the depth at which to cut the dendrogram

20 Labeling Clusters Compare term frequency in a cluster with the collection A frequent word within the cluster and in the collection isn’t a good discriminative label A good label is one that is infrequent in the collection
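One way to sketch this comparison: score each word by how much more frequent it is in the cluster than in the whole collection, and take the top-scoring words as labels. The scoring rule (difference of relative frequencies) is an assumption chosen for simplicity, not the method from the slides, and the toy documents are made up.

```python
from collections import Counter


def label_cluster(cluster_docs, all_docs, top_n=3):
    """Rank cluster words by (relative freq in cluster) - (relative freq in collection).

    Assumes every document in cluster_docs also appears in all_docs.
    """
    cluster_tf = Counter(t for d in cluster_docs for t in d)
    collection_tf = Counter(t for d in all_docs for t in d)
    cluster_total = sum(cluster_tf.values())
    collection_total = sum(collection_tf.values())
    scores = {
        t: c / cluster_total - collection_tf[t] / collection_total
        for t, c in cluster_tf.items()
    }
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_n]


all_docs = [
    ["spam", "web", "link"],
    ["spam", "web", "content"],
    ["trust", "web", "reputation"],
]
spam_cluster = all_docs[:2]
print(label_cluster(spam_cluster, all_docs, top_n=1))
```

In this toy data "web" is frequent in the cluster but just as frequent in the collection, so it scores zero and is not chosen as a label.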

21 Summary
1. Part I – Data Processing
   PDF to text conversion isn’t perfect, and its imperfections make it difficult to extract text
   Documents don’t follow one formatting standard, so heuristic rules are needed to extract info
2. Part II – Discovering topics
   Indexes are large; to keep the important words we need a good reference corpus to compare against
   There are many clustering algorithms and each has limitations
   How do I choose the best label?

22 Ongoing work Use Bigrams Keywords: Web search, adversarial information retrieval, web spam Limit the number of topic labels by ranking Use algorithm that clusters based on probabilistic distributions Logistic normal distribution
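As a small illustration of the bigram idea: multi-word keywords such as "web spam" only show up when adjacent word pairs are indexed. The helper name and the toy token list below are hypothetical.

```python
from collections import Counter


def bigrams(tokens):
    """Return the list of adjacent word pairs in a token list."""
    return list(zip(tokens, tokens[1:]))


tokens = "web spam and web search and web spam".split()
counts = Counter(bigrams(tokens))
print(counts[("web", "spam")])  # prints 2
```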

23 Useful Tools
1. pdftotext – Unix command for converting PDF to text
2. Python libraries
   Unicode
   re – regular expressions
3. NLTK – Natural Language Toolkit
   Software and datasets for natural language processing
   Used for clustering algorithms and reference corpus

