Cross-Corpus Analysis with Topic Models
Padhraic Smyth, Mark Steyvers, Dave Newman, Chaitanya Chemudugunta
University of California, Irvine

Presentation transcript:

Analysis, Exploration, and Retrieval of Information across Multiple Corpora
E.g. emails, intelligence reports, news articles.

We looked at:
New York Times articles: 3,000 articles that mention "Enron"
PubMed: 15,000,000 articles; queries can return 100k or more articles
Enron data: 500,000 emails, 11k different authors

Probabilistic Topic Models
topic = distribution over words
document = mixture of topics
Topic models can be learned automatically using statistical learning [e.g. Griffiths and Steyvers (2004)].

Applications:
Corpus comparison: automatically compare topics across 2 different corpora
Cross-corpus retrieval: given a document in corpus A, find similar documents in corpus B
"GateKeeper": given a document in corpus A, compute the likelihood of finding matching documents in corpus B, without looking at individual document records

Collocation Topic Model
The new model combines frequent word combinations (collocations) with topics.
The model automatically extracts topics and word combinations.
Collocations in topics improve interpretability: e.g. "United_States", "Sept_11", "Osama_Bin_Laden".

Generative process (figure labels: TOPIC MIXTURE, TOPIC, WORD, X):
For each document, choose a mixture of topics.
For every word slot, sample a topic.
If x=0, sample a word from the topic.
If x=1, sample a word from the distribution based on the previous word.

Example topics with collocations:
Stock Market: WEEK, DOW_JONES, POINTS, 10_YR_TREASURY_YIELD, PERCENT, CLOSE, NASDAQ_COMPOSITE, STANDARD_POOR, CHANGE, FRIDAY, DOW_INDUSTRIALS, GRAPH_TRACKS, EXPECTED, BILLION, NASDAQ_COMPOSITE_INDEX, EST_02, PHOTO_YESTERDAY, YEN, _STOCK_INDEX
Wall Street: WALL_STREET, ANALYSTS, INVESTORS, FIRM, GOLDMAN_SACHS, FIRMS, INVESTMENT, MERRILL_LYNCH, COMPANIES, SECURITIES, RESEARCH, STOCK, BUSINESS, ANALYST, WALL_STREET_FIRMS, SALOMON_SMITH_BARNEY, CLIENTS, INVESTMENT_BANKING, INVESTMENT_BANKERS, INVESTMENT_BANKS
Terrorism: SEPT_11, WAR, SECURITY, IRAQ, TERRORISM, NATION, KILLED, AFGHANISTAN, ATTACKS, OSAMA_BIN_LADEN, AMERICAN, ATTACK, NEW_YORK_REGION, NEW, MILITARY, NEW_YORK, WORLD, NATIONAL, QAEDA, TERRORIST_ATTACKS
Bankruptcy: BANKRUPTCY, CREDITORS, BANKRUPTCY_PROTECTION, ASSETS, COMPANY, FILED, BANKRUPTCY_FILING, ENRON, BANKRUPTCY_COURT, KMART, CHAPTER_11, FILING, COOPER, BILLIONS, COMPANIES, BANKRUPTCY_PROCEEDINGS, DEBTS, RESTRUCTURING, CASE, GROUP

Corpus Comparison
What are the topical similarities and differences between two large sets of documents?
Example: PubMed papers before 1980 compared with 2003 …
Example: PubMed papers from China and Israel …

Cross-Corpus Retrieval
Example: two corpora, Enron emails and New York Times articles that mention "Enron"
Problem: how to find Enron emails relevant to a New York Times article (or vice versa)?
Approach:
1) Train two separate topic models
2) Map the query into the topic space of the other corpus
3) Calculate relevance by proximity in topic space (e.g. using Jensen-Shannon divergence)

GateKeeper
Example application: an analyst wants to check whether some report X (the query) has any similar documents in a secure database at a different agency. The analyst uses the "gatekeeper" to assess whether there are any relevant documents before going through the lengthy process of securing access.
Problem: the information retrieval model cannot have access to individual documents either; it only has summaries of topics across the whole database.
Solution: use the log likelihood of the query document under the topic model, using only the topics.
Simulation: treat Biobase docs as the secure database. Probe with (relevant) new Biobase docs or (irrelevant) computer science docs from CiteSeer. The figure shows that relevant documents can be discriminated from irrelevant documents based on this global measure.
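The GateKeeper scoring idea can be illustrated with a minimal sketch. It assumes the secure database exposes only a topic-word matrix (called `phi` here, a hypothetical name), and it estimates the query's topic mixture with a simple EM fold-in. The poster does not say how the likelihood was computed, so the fold-in procedure, the function name, and the toy numbers are all assumptions made for illustration.

```python
import numpy as np

def fold_in_log_likelihood(word_ids, phi, n_iter=50, smoothing=1e-12):
    """Average per-token log likelihood of a query under fixed topics.

    word_ids: vocabulary indices of the query's tokens.
    phi: (T, V) array; row t is topic t's distribution over the vocabulary.
    Only phi is used; no individual database documents are needed.
    """
    word_ids = np.asarray(word_ids)
    n_topics = phi.shape[0]
    theta = np.full(n_topics, 1.0 / n_topics)      # query's topic mixture
    p_w_given_t = phi[:, word_ids] + smoothing     # shape (T, N)
    for _ in range(n_iter):
        # E-step: responsibility of each topic for each token
        resp = theta[:, None] * p_w_given_t
        resp /= resp.sum(axis=0, keepdims=True)
        # M-step: re-estimate the mixture from the responsibilities
        theta = resp.mean(axis=1)
    token_probs = theta @ p_w_given_t              # shape (N,)
    return float(np.mean(np.log(token_probs)))

# Toy topics (hypothetical): words 0-3 are well covered, words 4-5 are not.
phi = np.array([
    [0.45, 0.45, 0.04, 0.04, 0.01, 0.01],
    [0.04, 0.04, 0.45, 0.45, 0.01, 0.01],
])
relevant_score = fold_in_log_likelihood([0, 1, 0, 1, 0], phi)
irrelevant_score = fold_in_log_likelihood([4, 5, 4, 5], phi)
```

A query whose words are well explained by the database's topics receives a higher average log likelihood than one whose words are rare under every topic, which is the kind of separation the simulation figure reports.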
[Figure: GateKeeper simulation. Labels: BIOBASE, CITESEER, TOPIC MODEL, WORD MODEL]

Corpus comparison: PubMed papers before 1980 vs. 2003
Pre-1980 topics: Cattle diseases (6.7), Ricin binding (6.1), Brucellosis (4.1), Animal infections (3.7), Proteins (3.3)
2003 topics: SARS (11.0), Gene mutations (5.5), Biological agents (5.5), Gene sequences (5.0), HIV (4.5)
Common topics: Child mortality, Cell marrow, Plague study, Patient diagnosis, Cases reported

Corpus comparison: PubMed papers from China vs. Israel
China topics: Cell marrow (30.0), Serum levels (24.5), Gene sequences (22.2), Antibodies (13.5), SARS (10.0)
Israel topics: Biological agents (24.5), Terrorist injuries (14.9), West nile virus (12.2), Public health (8.2), September 11 (11.0)
Common topics: Animal infections, Acid mass detection, Cattle diseases, Nerve motor study, Vaccination
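Step 3 of the cross-corpus retrieval approach (relevance by proximity in topic space) can be sketched as follows. The sketch assumes the query has already been mapped to a topic mixture under the other corpus's model, and that each candidate document has a topic mixture from that same model. The function names and the toy mixtures are hypothetical, chosen only for illustration.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log(p / m))
    kl_qm = np.sum(q * np.log(q / m))
    return 0.5 * (kl_pm + kl_qm)

def rank_by_topic_proximity(query_theta, corpus_thetas):
    """Return document indices sorted from most to least similar."""
    scores = np.array([js_divergence(query_theta, t) for t in corpus_thetas])
    order = np.argsort(scores)   # smaller divergence = closer in topic space
    return order, scores[order]

# Toy topic mixtures over three topics (hypothetical values).
query_theta = [0.8, 0.1, 0.1]
corpus_thetas = [
    [0.1, 0.1, 0.8],
    [0.7, 0.2, 0.1],
    [0.34, 0.33, 0.33],
]
order, scores = rank_by_topic_proximity(query_theta, corpus_thetas)
```

Jensen-Shannon divergence is symmetric and bounded above by ln 2 with the natural log, so scores are comparable across queries; other distances between topic mixtures could be substituted in the same ranking function.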