The Million Book Digital Library Project: Research Issues in Data Mining and Text Mining
Jaime Carbonell and Raj Reddy, Carnegie Mellon University
Talk presented at the International Conference on Data Mining, Nov 28, 2005, and the MSR India TechVista Symposium, Jan 12, 2006
Digital Libraries and Universal Access to Information
Create a Universal Digital Library containing all the books ever published.
Unfortunately, many of the books are in English, and thus not readable by over 80% of the population.
Information Overload
If we read a book every day, we can read at most about 40,000 books in a lifetime.
Having millions of books online and accessible creates an information overload: "we have a wealth of information and a scarcity of (human) attention!" (Herbert Simon)
Multilingual search technology can help reduce the overload: it permits users to search very large databases quickly and reliably, independent of language and location.
Understanding Language
Books in non-native languages remain incomprehensible to most people; translation and summarization are essential for worldwide use.
Current translation systems are not yet perfect, but language understanding systems have improved significantly in the past few decades: systems based on statistical and linguistic techniques have shown significant performance gains, and machine learning improves performance further.
Digitization projects, e.g. the Million Book Digital Library Project, will act as test beds for validating language understanding systems research.
The Million Book Digital Library
A collaborative venture among many countries, including the USA, China, and India.
So far, 400,000 books have been scanned in China and 200,000 in India.
Content is made freely available around the globe.
Those wishing to see the video in the next slide can download it from http://www.rr.cs.cmu.edu/MSRI.zip
The Grand Challenge
Create access to all published works online:
Instantly available
In any language
Anywhere in the world
Searchable, browsable, navigable
By humans and machines
The Challenge:
One Step at a Time…
The Million Book DL covers only about 1% of all the world's books:
Harvard University: 12M
Library of Congress: 30M
OCLC catalog: 42M
All multilingual books: ~100M
At the digitization rate of the last decade, it would take 100 years!
Million Book Project: Issues
Time: At one page per second (a 20,000-pages-per-day shift), it would take 100 years (at 200 working days per year) to scan a million books of 400 pages each.
Cost: 100M books at US$100 per book would cost $10B; even in India and China the cost would be $1B. The annual cost is currently expected to be close to $10M per year, with support from the US, India, and China.
Selection: Selecting appropriate books for scanning is time-consuming and expensive.
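The time and cost estimates above can be checked with a little arithmetic. This sketch simply encodes the slide's stated assumptions (one page per second, 20,000 pages per working day, 200 working days per year):

```python
# Back-of-the-envelope check of the scanning-time and cost estimates.
BOOKS = 1_000_000
PAGES_PER_BOOK = 400
PAGES_PER_DAY = 20_000          # one page per second over one shift
WORKING_DAYS_PER_YEAR = 200

def years_to_scan(books=BOOKS, pages_per_book=PAGES_PER_BOOK):
    # Total pages divided by daily throughput, converted to working years.
    total_pages = books * pages_per_book
    days = total_pages / PAGES_PER_DAY
    return days / WORKING_DAYS_PER_YEAR

def total_cost(books=100_000_000, cost_per_book=100):
    # The slide's $100-per-book estimate applied to 100M books.
    return books * cost_per_book
```

With the defaults this reproduces the figures on the slide: 100 years of single-scanner effort for a million books, and $10B to digitize 100M books.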
Million Book Project: Issues (cont.)
Logistics: Each container holds 10,000 to 20,000 books; shipping and handling costs about $10,000.
Metadata: Accessing and/or creating metadata requires professionals trained in library science.
Optical Character Recognition: OCR technology is essential for searching, translation, and summarization, yet many languages don't have OCR.
Million Book Project: Status
21 centers in India, 17 centers in China, 1 center in Egypt; Australia and Europe planned.
About 600,000 books scanned; about 120,000+ accessible on the web from India at http://dli.iiit.ac.in/ using 8 TB of storage.
A 10 TB server at the CMU Library planned for July 2005.
1,000,000 books by the end of 2007; capacity to scan a million pages a day expected to be operational by the end of 2006.
Million Book Project: Policy Challenges
Compensating for creative works:
5% out of copyright
92% out-of-print and in-copyright
3% in-print and in-copyright
Options:
Tax credit
Usage-based, government-funded compensation (analogous to the Public Lending Right in the UK and Australia)
Usage charges to the user
Compulsory licensing
Digital submission to national archives of all books that are "born-digital"
Million Book Project: Research Challenges
Providing access to billions every day:
Distributed cached servers in every country and region
Easy-to-use interfaces for billions
Text mining challenges:
Multilingual information retrieval
Summarization
Text categorization
Named-entity identification
Novelty detection
Translation
What is Text Mining?
Search: documents, web, news
Categorize: by topic or taxonomy; enables filtering, routing, multi-text summaries, …
Extract: names, relations, … (who did what to whom, and where?)
Summarize: text, rules, trends, …
Detect: redundancy, novelty, anomalies, …
Predict: outcomes, behaviors, trends, …
Data Mining vs. Text Mining
Data mining:
Data: relational tables
DM universe: huge
DM tasks: DB "cleanup"; taxonomic classification; supervised learning with predictive classifiers; unsupervised learning (clustering, anomaly detection); visualization of results
Text mining:
Text: HTML, free form
TM universe: ~10^3 times the DM universe
TM tasks: all the DM tasks, plus extraction of roles, relations, and facts; machine translation for multilingual sources; parsing of NL queries (vs. SQL); NL generation of results
A New Bill of Rights
Get the right information
To the right people
At the right time
On the right medium
In the right language
With the right level of detail
Relevant Text Mining Technologies
"…right information": IR (search engines)
"…right people": classification, routing
"…right time": anticipatory analysis
"…right medium": information extraction, speech
"…right language": machine translation
"…right level of detail": summarization
"…right information": Information Retrieval
Beyond Pure Relevance in IR
Information retrieval maximizes relevance to the query. But what about information novelty, timeliness, appropriateness, validity, comprehensibility, density, medium, …?
Novelty is approximated by non-redundancy!
What we really want to maximize is relevance to the query, given the user profile and interaction history:
P(U(f_1, …, f_n) | Q & {C} & U & H)
where Q = query, {C} = collection set, U = user profile, H = interaction history
…but we don't yet know how. Darn.
Maximal Marginal Relevance vs. Standard Information Retrieval
(Diagram: starting from a query over a document collection, standard IR ranks documents by relevance alone, while MMR selects documents that are relevant yet non-redundant with respect to those already retrieved.)
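A minimal sketch of MMR re-ranking may make the contrast concrete. It assumes documents and the query are represented as sparse term-weight dictionaries; the function names and the greedy formulation are illustrative, not the authors' implementation:

```python
import math

def cosine(a, b):
    # Cosine similarity between two sparse term-weight dicts.
    num = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return num / (na * nb) if na and nb else 0.0

def mmr_rank(query, docs, lam=0.5, k=3):
    # Greedily pick the document maximizing
    #   lam * sim(doc, query) - (1 - lam) * max_j sim(doc, selected_j)
    # so each pick balances relevance against redundancy.
    selected, candidates = [], list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i):
            rel = cosine(docs[i], query)
            red = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With lam = 1.0 this degenerates to standard relevance ranking; lower values of lam penalize documents that merely repeat what has already been retrieved.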
"…right information": Novelty Detection
Detecting Novelty in Streaming Data
Find the first report of a new event:
(Unconditional) dissimilarity with the past
Decision threshold on the most-similar story
(Linear) temporal decay
Length filter (for teasers)
Cosine similarity with standard weights
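The ingredients above (cosine similarity, a threshold on the most-similar past story, linear temporal decay) can be combined in a short sketch. The threshold and decay values are illustrative assumptions, not figures from the slides:

```python
import math

def cosine(a, b):
    # Cosine similarity between two sparse term-weight dicts.
    num = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return num / (na * nb) if na and nb else 0.0

def is_first_story(story, past, threshold=0.5, decay=0.01):
    # Flag a story as novel if its best linearly time-decayed similarity
    # to any earlier story falls below the decision threshold.
    # past: list of (term-vector, age_in_days) pairs.
    best = max((cosine(story, vec) * max(0.0, 1.0 - decay * age)
                for vec, age in past), default=0.0)
    return best < threshold
```

A story that closely matches a recent one is rejected as old news; a story unlike anything seen before (or only like very old stories, thanks to the decay) is flagged as a first story.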
New First-Story Detection Directions
Topic-conditional models: e.g. "airplane," "investigation," "FAA," "FBI," "casualties" signal the topic, not the event; "TWA 800," "March 12, 1997" signal the event. First categorize into a topic, then use maximally discriminative terms within the topic.
Rely on situated named entities: e.g. "Arcan as victim," "Sharon as peacemaker."
Link Detection in Texts
Find texts (e.g. news stories) that mention the same underlying events. This could be combined with novelty detection (e.g. something new about an interesting event).
Techniques: text similarity, NEs, situated NEs, relations, topic-conditioned models, …
"…right people": Text Categorization
Text Categorization
Assign labels to each document or web page:
Labels may be topics, such as Yahoo categories (finance; sports; News > World > Asia > Business)
Labels may be genres (editorials, movie reviews, news)
Labels may be routing codes (send to marketing, send to customer service)
Text Categorization Methods
Manual assignment, as in Yahoo
Hand-coded rules, as in Reuters
Machine learning (the dominant paradigm):
Words in the text become predictors
Category labels become the values to be predicted
Predictor-feature reduction (SVD, χ², …)
Apply any inductive method: kNN, Naive Bayes, decision trees, …
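Of the inductive methods named above, kNN is the simplest to sketch: classify a document by a majority vote among its nearest labeled neighbors. This toy version (illustrative names, cosine over term-frequency dicts) is a sketch of the idea, not a production classifier:

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two sparse term-weight dicts.
    num = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return num / (na * nb) if na and nb else 0.0

def knn_classify(doc, train, k=3):
    # train: list of (term-vector, label) pairs.
    # Vote among the k training documents nearest to doc by cosine similarity.
    neighbors = sorted(train, key=lambda ex: cosine(doc, ex[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

Here the words of the document act as the predictors and the category label is the value to be predicted, exactly as the slide describes.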
Multi-tier Event Classification
"…right medium": Named-Entity Identification
Named-Entity Identification
Purpose: to answer questions such as:
Who is mentioned in these 100 society articles?
What locations are listed in these 2,000 web pages?
What companies are mentioned in these patent applications?
What products were evaluated by Consumer Reports this year?
Named-Entity Identification: Example Text
President Clinton decided to send special trade envoy Mickey Kantor to the special Asian economic meeting in Singapore this week. Ms. Xuemei Peng, trade minister from China, and Mr. Hideto Suzuki from Japan's Ministry of Trade and Industry will also attend. Singapore, who is hosting the meeting, will probably be represented by its foreign and economic ministers. The Australian representative, Mr. Langford, will not attend, though no reason has been given. The parties hope to reach a framework for currency stabilization.
Methods for NE Extraction
Finite-state transducers with variables. Example output: TITLE: "President", FNAME: "Bill", LNAME: "Clinton"
FSTs learned from labeled data
Statistical learning (also from labeled data):
Hidden Markov Models (HMMs)
Exponential (maximum-entropy) models
Conditional Random Fields [Lafferty et al.]
Extracted Named Entities (NEs)
People: President Clinton, Mickey Kantor, Ms. Xuemei Peng, Mr. Hideto Suzuki, Mr. Langford
Places: Singapore, Japan, China, Australia
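A tiny finite-state-style pattern gives the flavor of the transducer approach: an honorific or title followed by one or two capitalized name tokens. This is a deliberately crude illustration (the pattern and function name are assumptions), nothing like the learned FSTs, HMMs, or CRFs named above:

```python
import re

# Toy pattern: an honorific/title, then one or two capitalized name tokens.
TITLE_PATTERN = re.compile(
    r"\b(President|Mr\.|Ms\.|Dr\.)\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)?)"
)

def extract_people(text):
    # Return (title, name) pairs for every match in the text.
    return [(m.group(1), m.group(2)) for m in TITLE_PATTERN.finditer(text)]
```

Run on a fragment of the sample text, it recovers the title-marked people, but misses untitled mentions such as "Mickey Kantor"; that brittleness is one reason the field moved to statistical models trained on labeled data.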
Role-Situated NEs
Motivation: it is useful to know the roles of NEs:
Who participated in the economic meeting?
Who hosted the economic meeting?
Who was discussed in the economic meeting?
Who was absent from the economic meeting?
Emerging Methods for Extracting Relations
Link parsers at the clause level: based on dependency grammars, with probabilistic enhancements [Lafferty, Venable]
Island-driven parsers: GLR* [Lavie], Chart [Nyberg, Placeway], LC-Flex [Rosé]
Treebank-trained probabilistic CF parsers [IBM, Collins]
These herald the return of deep(er) NLP techniques; they are relevant to the new Q/A-from-free-text initiative, but (today) too complex for inductive learning.
Relational NE Extraction (Who Does What to Whom): Example
"John Snell reporting for Wall Street. Today Flexicon Inc. announced a tender offer for Supplyhouse Ltd. for $30 per share, representing a 30% premium over Friday's closing price. Flexicon expects to acquire Supplyhouse by Q4 2001 without problems from federal regulators."
Fact Extraction Application
Useful for relational DB filling, to prepare data for "standard" DM/machine-learning methods:

Acquirer   Acquiree     Share price  Year
Flexicon   Logi-truck   18           1999
Flexicon   Supplyhouse  30           2001
buy.com    reel.com     10           2000
…
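For the tender-offer example, a template-style extractor can turn the sentence into a row for such a table. The template string, field names, and company-suffix list are illustrative assumptions; real systems use the parsers and learned extractors discussed above rather than a single regular expression:

```python
import re

# Toy template for "X announced a tender offer for Y for $N per share".
OFFER = re.compile(
    r"([A-Z][\w.]*(?:\s+(?:Inc|Ltd|Corp)\.?)?)\s+announced a tender offer for\s+"
    r"([A-Z][\w.]*(?:\s+(?:Inc|Ltd|Corp)\.?)?)\s+for \$(\d+) per share"
)

def extract_offer(sentence):
    # Return a relational-row dict, or None if the template does not match.
    m = OFFER.search(sentence)
    if not m:
        return None
    return {"acquirer": m.group(1),
            "acquiree": m.group(2),
            "share_price": int(m.group(3))}
```

The extracted dict corresponds directly to one row of the Acquirer/Acquiree/Share price table, ready for standard DM methods.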
"…right language": Translation
"…in the Right Language": Approaches to MT
Knowledge-engineered MT:
Transfer-rule MT (commercial systems)
High-accuracy interlingual MT (domain-focused)
Parallel-corpus-trainable MT:
Statistical MT (noisy channel, exponential models)
Example-based MT (generalized G-EBMT)
Transfer-rule-learning MT (corpus & informants)
Multi-engine MT:
An omnivorous approach: combines the above to maximize coverage and minimize errors
Types of Machine Translation
(Diagram: from a source language (Arabic) to a target language (English), translation can proceed directly via EBMT; via transfer rules after syntactic parsing; or via an interlingua after semantic analysis, followed by sentence planning and text generation.)
EBMT Example
English: I would like to meet her.
Mapudungun: Ayükefun trawüael fey engu.
English: The tallest man is my father.
Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw.
English: I would like to meet the tallest man.
Mapudungun (new): Ayükefun trawüael Chi doy fütra chi wentru
Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentruengu.
Multi-Engine Machine Translation
MT systems have different strengths:
Rapidly adaptable: statistical, example-based
Good grammar: rule-based (linguistic) MT
High precision in narrow domains: KBMT
Minority-language MT: learnable from an informant
Combine the results of parallel-invoked MT engines: select the best of the multiple translations.
Selection is based on optimizing a combination of:
A target-language joint-exponential model
Confidence scores of the individual MT engines
Illustration of Multi-Engine MT
Source: El punto de descarge se cumplirá en el puente Agua Fria
Output 1: The drop-off point / will comply with / The cold Bridgewater
Output 2: The discharge point / will self comply in / the "Agua Fria" bridge
Output 3: Unload of the point / will take place at / the cold water of bridge
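The selection step described on the previous slide can be sketched as a scoring function over candidate translations. The weighted-sum form, the toy in-vocabulary language model, and all names here are illustrative assumptions standing in for the joint-exponential model and engine confidences the slides mention:

```python
def select_translation(candidates, lm_score, alpha=0.7):
    # candidates: list of (translation, engine_confidence in [0, 1]).
    # lm_score: target-language fluency score in [0, 1].
    # Pick the candidate maximizing a weighted combination of target-language
    # fluency and the producing engine's confidence.
    return max(candidates,
               key=lambda c: alpha * lm_score(c[0]) + (1 - alpha) * c[1])[0]

def vocabulary_lm(vocab):
    # Toy stand-in for a real language model: fraction of in-vocabulary words.
    def score(sentence):
        words = sentence.lower().split()
        return sum(w in vocab for w in words) / len(words) if words else 0.0
    return score
```

Applied to outputs like those above, a fluent target-language sentence wins over a garbled one even if the garbled engine reported somewhat lower confidence in the fluent candidate.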
State of the Art in MEMT for New "Hot" Languages
What we can do now:
Gisting MT for any new language in 2-3 weeks (given parallel text)
Medium-quality MT in 6 months (given more parallel text, an informant, and a bilingual dictionary)
Improve-as-you-go MT
Fielded MT systems on PCs
What we cannot do yet:
High-accuracy MT for open domains
Coping with spoken-only languages
Reliable speech-to-speech MT (but BABYLON is coming)
MT on your wristwatch
"…right level of detail": Summarization
Types of Summaries
INDICATIVE, for filtering (Do I read further?):
Query-relevant (focused): filter search-engine results
Query-free (generic): short abstracts
CONTENTFUL, for reading in lieu of the full document:
Query-relevant (focused): solve problems for busy professionals
Query-free (generic): executive summaries
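The query-relevant/query-free distinction can be sketched as a single extractive summarizer whose scoring function depends on whether a query is supplied. This naive sentence-ranking approach and its names are illustrative assumptions, not the summarization systems behind the slides:

```python
from collections import Counter

def extractive_summary(text, query=None, n=1):
    # Query-relevant (focused) if a query is given: score sentences by
    # query-term overlap. Query-free (generic) otherwise: score by overall
    # term frequency. Return the top-n sentences as the summary.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if query:
        qset = set(query.lower().split())
        score = lambda s: len(qset & set(s.lower().split()))
    else:
        freqs = Counter(text.lower().split())
        score = lambda s: sum(freqs[w] for w in s.lower().split())
    ranked = sorted(sentences, key=score, reverse=True)
    return ". ".join(ranked[:n]) + "."
```

The same document thus yields a focused summary for a searcher's query and a generic abstract when no query is given, matching the two columns of the table above.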
Conclusion