
Cross-Corpus Analysis with Topic Models
Padhraic Smyth, Mark Steyvers, Dave Newman, Chaitanya Chemudugunta
University of California, Irvine


Analysis, Exploration, and Retrieval of Information across Multiple Corpora

E.g. emails, intelligence reports, news articles. We looked at:
  New York Times Articles: 3000 articles that mention “Enron”
  PubMed: 15,000,000 articles; queries can return 100k or more articles
  Enron email data: 500,000 emails, 11k different authors, 1999-2002

Probabilistic Topic Models
  topic = distribution over words
  document = mixture of topics
Topic models can be learned automatically using statistical learning [e.g. Griffiths and Steyvers (2004)].

Applications:
  Corpus comparison: automatically compare topics across 2 different corpora
  Cross-corpus retrieval: given a document in corpus A, find similar documents in corpus B
  “GateKeeper”: given a document in corpus A, compute the likelihood of finding matching documents in corpus B, without looking at individual document records.
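The two definitions above (topic = distribution over words, document = mixture of topics) can be combined into one formula. As a sketch in common topic-model notation (the symbols w, d, z, and T are assumptions for illustration and do not appear on the poster), the probability of word w in document d is:

```latex
P(w \mid d) = \sum_{z=1}^{T} P(w \mid z)\, P(z \mid d)
```

Here P(w | z) is the word distribution of topic z, P(z | d) is the weight of topic z in the mixture for document d, and T is the number of topics.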
Collocation Topic Model

New model combines frequent word combinations (collocations) with topics
Model automatically extracts topics and word combinations
Collocations in topics improve interpretability: e.g. “United_States”, “Sept_11”, “Osama_Bin_Laden”

[Diagram: TOPIC MIXTURE, followed by repeated TOPIC / WORD / X nodes]

For each document, choose a mixture of topics
For every word slot, sample a topic
If x=0, sample a word from the topic
If x=1, sample a word from the distribution based on previous word

Example topics:
  Stock Market: WEEK, DOW_JONES, POINTS, 10_YR_TREASURY_YIELD, PERCENT, CLOSE, NASDAQ_COMPOSITE, STANDARD_POOR, CHANGE, FRIDAY, DOW_INDUSTRIALS, GRAPH_TRACKS, EXPECTED, BILLION, NASDAQ_COMPOSITE_INDEX, EST_02, PHOTO_YESTERDAY, YEN, 10, 500_STOCK_INDEX
  Wall Street: WALL_STREET, ANALYSTS, INVESTORS, FIRM, GOLDMAN_SACHS, FIRMS, INVESTMENT, MERRILL_LYNCH, COMPANIES, SECURITIES, RESEARCH, STOCK, BUSINESS, ANALYST, WALL_STREET_FIRMS, SALOMON_SMITH_BARNEY, CLIENTS, INVESTMENT_BANKING, INVESTMENT_BANKERS, INVESTMENT_BANKS
  Terrorism: SEPT_11, WAR, SECURITY, IRAQ, TERRORISM, NATION, KILLED, AFGHANISTAN, ATTACKS, OSAMA_BIN_LADEN, AMERICAN, ATTACK, NEW_YORK_REGION, NEW, MILITARY, NEW_YORK, WORLD, NATIONAL, QAEDA, TERRORIST_ATTACKS
  Bankruptcy: BANKRUPTCY, CREDITORS, BANKRUPTCY_PROTECTION, ASSETS, COMPANY, FILED, BANKRUPTCY_FILING, ENRON, BANKRUPTCY_COURT, KMART, CHAPTER_11, FILING, COOPER, BILLIONS, COMPANIES, BANKRUPTCY_PROCEEDINGS, DEBTS, RESTRUCTURING, CASE, GROUP

Corpus Comparison

What are the topical similarities and differences between two large sets of documents?

Example: PubMed papers before 1980 compared with 2003 …

  Pre 1980 Topics            2003 Topics                Common Topics
  Cattle diseases (6.7)      SARS (11.0)                Child mortality
  Ricin binding (6.1)        Gene mutations (5.5)       Cell marrow
  Brucellosis (4.1)          Biological agents (5.5)    Plague study
  Animal infections (3.7)    Gene sequences (5.0)       Patient diagnosis
  Proteins (3.3)             HIV (4.5)                  Cases reported

Example: PubMed papers from China and Israel…

  China Topics               Israel Topics               Common Topics
  Cell marrow (30.0)         Biological agents (24.5)    Animal infections
  Serum levels (24.5)        Terrorist injuries (14.9)   Acid mass detection
  Gene sequences (22.2)      West nile virus (12.2)      Cattle diseases
  Antibodies (13.5)          Public health (8.2)         Nerve motor study
  SARS (10.0)                September 11 (11.0)         Vaccination

Cross Corpus Retrieval

Example: two corpora, Enron emails and New York Times articles that mention “Enron”
Problem: how to find Enron emails relevant to New York Times article (or vice versa)?
Approach:
  1) Train two separate topic models
  2) Map the query into the topic space of the other corpus
  3) Calculate relevance by proximity in topic space (e.g. using Jensen-Shannon divergence)

GateKeeper

Example Application: analyst wants to check whether some report X (query) has any similar documents in secure database at a different agency. Analyst uses “gatekeeper” to assess whether there are any relevant documents before going through lengthy process of securing access
Problem: information retrieval model cannot have access to individual documents either -- only has summaries of topics across whole database
Solution: use log likelihood of query document with the topic model using only the topics.
Simulation: assume Biobase docs as secure database. Probe with (relevant) new Biobase docs or (irrelevant) computer science docs from CiteSeer. Figure shows that relevant documents can be discriminated from irrelevant documents based on this global measure.

[Figure: BIOBASE vs. CITESEER probe documents; TOPIC MODEL and WORD MODEL]
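The retrieval steps and the GateKeeper log-likelihood score described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a topic-word matrix `phi` (topics by vocabulary, strictly positive rows summing to 1) has already been learned for the target corpus, that the query shares that vocabulary, and that the query's topic mixture is estimated with a simple EM fold-in; the poster does not say how this mapping is done. All function names are hypothetical.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def infer_topic_mixture(word_counts, phi, n_iter=100):
    """Assumed EM fold-in: estimate one document's topic weights with phi held fixed."""
    counts = np.asarray(word_counts, dtype=float)
    n_topics = phi.shape[0]
    theta = np.full(n_topics, 1.0 / n_topics)
    for _ in range(n_iter):
        joint = theta[:, None] * phi            # (topics, vocab)
        resp = joint / joint.sum(axis=0)        # topic responsibilities per word type
        theta = (resp * counts).sum(axis=1)     # expected topic counts
        theta /= theta.sum()
    return theta

def gatekeeper_score(word_counts, phi):
    """Per-word log likelihood of a query using only the topics, no document records."""
    counts = np.asarray(word_counts, dtype=float)
    theta = infer_topic_mixture(counts, phi)
    p_word = theta @ phi
    mask = counts > 0
    return float(np.sum(counts[mask] * np.log(p_word[mask])) / counts.sum())

def rank_by_topic_proximity(query_counts, phi_b, doc_thetas_b):
    """Map the query into corpus B's topic space, then rank B's documents by JS divergence."""
    theta_q = infer_topic_mixture(query_counts, phi_b)
    scores = np.array([js_divergence(theta_q, t) for t in doc_thetas_b])
    return np.argsort(scores), scores

if __name__ == "__main__":
    # Toy data, invented for illustration: 2 topics over a 5-word vocabulary.
    phi_b = np.array([[0.44, 0.44, 0.05, 0.05, 0.02],
                      [0.05, 0.05, 0.44, 0.44, 0.02]])
    doc_thetas_b = np.array([[0.9, 0.1], [0.1, 0.9]])
    query = np.array([5, 4, 0, 1, 0])
    order, scores = rank_by_topic_proximity(query, phi_b, doc_thetas_b)
    print("ranking:", order, "JS divergences:", scores)
    print("gatekeeper score:", gatekeeper_score(query, phi_b))
```

A lower Jensen-Shannon divergence means a closer match in topic space, while a higher GateKeeper score suggests the query is better explained by the target corpus's topics. In practice a threshold on that score would have to be calibrated on known relevant and irrelevant probes, as in the Biobase/CiteSeer simulation.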

