
Slide 1 (2007.04.26): IS 240 – Spring 2007
Prof. Ray Larson, University of California, Berkeley, School of Information
Tuesday and Thursday 10:30 am - 12:00 pm, Spring 2007
http://courses.ischool.berkeley.edu/i240/s07
Principles of Information Retrieval
Lecture 24: NLP for IR

Slide 2: Today
Review
– Web Search Processing
– Parallel Architectures (Inktomi - Brewer)
– Cheshire III Design – GRID-based DLs
NLP for IR
Text Summarization
Credit for some of the slides in this lecture goes to Marti Hearst and Eric Brewer

Slide 3: Google
Google maintains (probably) the world's largest Linux cluster (over 15,000 servers)
These are partitioned between index servers and page servers
– Index servers resolve the queries (massively parallel processing)
– Page servers deliver the results of the queries
Over 8 billion web pages are indexed and served by Google

Slide 4: Ranking: Link Analysis
Assumptions:
– If the pages pointing to this page are good, then this is also a good page
– The words on the links pointing to this page are useful indicators of what this page is about
– References: Page et al. 98, Kleinberg 98

Slide 5: Ranking: PageRank
Google uses PageRank
We assume page A has pages T1...Tn which point to it (i.e., are citations).
The parameter d is a damping factor which can be set between 0 and 1; d is usually set to 0.85.
C(A) is defined as the number of links going out of page A.
The PageRank of a page A is given as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one
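
The formula on this slide can be applied repeatedly until the values settle. The following is a minimal Python sketch, not Google's implementation: the function name `pagerank`, the `links` dictionary format, the toy graph, and the fixed iteration count are all assumptions made for illustration, and every page is assumed to have at least one outgoing link. With the (1-d) term written as on the slide, the scores in this sketch sum to the number of pages rather than to one; dividing (1-d) by the number of pages would give values that sum to one.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over the pages T that link to A.

    `links` maps each page to the list of pages it links to (assumed input format).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}  # start every page at 1
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each linking page T contributes PR(T) divided by its out-link count C(T).
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_links)
for page in sorted(ranks):
    print(page, round(ranks[page], 4))
```

In this toy graph C is pointed to by both A and B, so it ends up ranked above B.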

Slide 6: PageRank
[Figure: example link graph with nodes T1 (Pr=.725), T2 (Pr=1), T3 (Pr=1), T4 (Pr=1), T5 (Pr=1), T6 (Pr=1), T7 (Pr=1), T8 (Pr=2.46625), X1, X2 and A (Pr=4.2544375)]
Note: these are not real PageRanks, since they include values >= 1


Slide 11: Digital Library Grid Initiatives: Cheshire3 and the Grid
Ray R. Larson, University of California, Berkeley, School of Information Management and Systems
Rob Sanderson, University of Liverpool, Dept. of Computer Science
Thanks to Dr. Eric Yen and Prof. Michael Buckland for parts of this presentation
Presentation from DLF Forum, April 2005

Slide 12: Overview
The Grid, Text Mining and Digital Libraries
– Grid Architecture
– Grid IR Issues
Cheshire3: Bringing Search to Grid-Based Digital Libraries
– Overview
– Grid Experiments
– Cheshire3 Architecture
– Distributed Workflows

Slide 13: Grid Architecture -- (Dr. Eric Yen, Academia Sinica, Taiwan.)
[Layered diagram, top to bottom]
Applications: Chemical Engineering, Climate, High energy physics, Cosmology, Astrophysics, Combustion, ...
Application Toolkits: Data Grid, Remote Computing, Remote Visualization, Collaboratories, Portals, Remote sensors, ...
Grid Services (Grid middleware): Protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
Grid Fabric: Storage, networks, computers, display devices, etc. and their associated local services

Slide 14: Grid Architecture (ECAI/AS Grid Digital Library Workshop)
[Layered diagram, top to bottom]
Applications: Chemical Engineering, Climate, High energy physics, Cosmology, Astrophysics, Combustion, Humanities computing, Digital Libraries, Bio-Medical, ...
Application Toolkits: Data Grid, Remote Computing, Remote Visualization, Collaboratories, Portals, Remote sensors, Text Mining, Metadata management, Search & Retrieval, ...
Grid Services (Grid middleware): Protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
Grid Fabric: Storage, networks, computers, display devices, etc. and their associated local services

Slide 15: Grid IR Issues
Want to preserve the same retrieval performance (precision/recall) while hopefully increasing efficiency (i.e., speed)
Very large-scale distribution of resources is a challenge for sub-second retrieval
Unlike most typical Grid processes, IR is potentially less compute-intensive and more data-intensive
In many ways Grid IR replicates the process (and problems) of metasearch or distributed search
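
One of the metasearch problems that distributed search inherits is that each node scores documents on its own scale, so the ranked lists have to be made comparable before they are merged. The sketch below shows one simple way to do that, min-max normalisation followed by keeping the best score per document. It is an illustration only: the function name, the result format and the sample scores are assumptions, and nothing here is taken from Cheshire3.

```python
def merge_results(per_node_results, k=3):
    """Merge ranked (doc_id, score) lists from several nodes into one top-k list."""
    combined = {}
    for results in per_node_results:
        if not results:
            continue
        scores = [score for _, score in results]
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0  # avoid dividing by zero when all scores are equal
        for doc_id, score in results:
            normalised = (score - lo) / span  # rescale this node's scores to [0, 1]
            combined[doc_id] = max(combined.get(doc_id, 0.0), normalised)
    # Highest normalised score first; ties are broken by document id.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))[:k]


# Hypothetical results from two nodes that use very different score ranges.
node_a = [("d1", 12.0), ("d2", 6.0), ("d3", 0.0)]
node_b = [("d4", 0.9), ("d2", 0.1)]
merged = merge_results([node_a, node_b])
print(merged)
```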

Slide 16: Today
Natural Language Processing and IR
– Based on papers in the Reader and on David Lewis & Karen Sparck Jones, “Natural Language Processing for Information Retrieval”, Communications of the ACM, 39(1), Jan. 1996
Text summarization: lecture from Ed Hovy (USC)

Slide 17: Natural Language Processing and IR
The main approach in applying NLP to IR has been to attempt to address
– Phrase usage vs individual terms
– Search expansion using related terms/concepts
– Attempts to automatically exploit or assign controlled vocabularies

Slide 18: NLP and IR
Much early research showed that (at least in the restricted test databases tested)
– Indexing documents by individual terms corresponding to words and word stems produces retrieval results at least as good as when indexes use controlled vocabularies (whether applied manually or automatically)
– Constructing phrases or “pre-coordinated” terms provides only marginal and inconsistent improvements
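
To make "indexing by individual terms corresponding to words and word stems" concrete, here is a minimal sketch of an inverted index keyed on stems. The suffix stripper is deliberately crude and only stands in for a real stemmer; the function names and the three sample documents are assumptions made for the example.

```python
import re
from collections import defaultdict


def crude_stem(word):
    """Strip a few common suffixes; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            index[crude_stem(token)].add(doc_id)
    return index


# Hypothetical mini-collection.
docs = {1: "Russian tanks invaded", 2: "The tank invades", 3: "Local heroes"}
index = build_index(docs)
print(sorted(index["tank"]), sorted(index["invad"]))
```

Because "tanks"/"tank" and "invaded"/"invades" reduce to the same stems, documents 1 and 2 are retrieved by the same index entries without any phrase or controlled-vocabulary processing.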

Slide 19: NLP and IR
It is not clear why intuitively plausible improvements to document representation have had little effect on retrieval results when compared to statistical methods
– E.g., use of syntactic role relations between terms has shown no improvement in performance over “bag of words” approaches

Slides 20-25: General Framework of NLP
Processing stages, built up step by step on the example "John runs."
– Morphological and Lexical Processing: John run+s. (John: P-N; run+s: V 3-pre or N plu)
– Syntactic Analysis: [S [NP [P-N John]] [VP [V run]]]
– Semantic Analysis: Pred: RUN, Agent: John
– Context processing, Interpretation: John is a student. He runs.
Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester
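
The morphological step in the "John runs." example (run+s read either as a verb in the third person present or as a plural noun) can be sketched with a toy analyser. The two-entry lexicon, the function name and the output format below are assumptions for illustration; a real analyser would use a full lexicon and proper inflection rules.

```python
# Hypothetical mini-lexicon: stem -> possible parts of speech.
STEMS = {"run": {"V", "N"}, "john": {"P-N"}}


def analyze(word):
    """Return every (stem, part of speech, feature) reading the toy lexicon allows."""
    w = word.lower()
    analyses = [(w, pos, "base") for pos in sorted(STEMS.get(w, ()))]
    if w.endswith("s") and w[:-1] in STEMS:
        stem = w[:-1]
        if "V" in STEMS[stem]:
            analyses.append((stem, "V", "3-pre"))  # verb, third person present
        if "N" in STEMS[stem]:
            analyses.append((stem, "N", "plu"))    # noun, plural
    return analyses


print(analyze("John"))
print(analyze("runs"))
```

Both readings of "runs" are returned; choosing between them is left to the later syntactic stage.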

Slide 26: General Framework of NLP
Stages: Morphological and Lexical Processing; Syntactic Analysis; Semantic Analysis; Context processing, Interpretation
Morphological and lexical processing includes
– Tokenization
– Part of Speech Tagging
– Term recognition (Ananiadou)
– Inflection/Derivation
– Compounding
Domain Analysis (Appelt: 1999)
Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Slides 27-33: General Framework of NLP: Difficulties of NLP
Stages: Morphological and Lexical Processing; Syntactic Analysis; Semantic Analysis; Context processing, Interpretation
(1) Robustness: Incomplete Knowledge
– Incomplete Lexicons: open class words; terms (term recognition); named entities (company names, locations, numerical expressions)
– Incomplete Grammar: syntactic coverage; domain specific constructions; ungrammatical constructions
– Incomplete Domain Knowledge: interpretation rules; predefined aspects of information
(2) Ambiguities: Combinatorial Explosion
– Most words in English are ambiguous in terms of their parts of speech, e.g. runs: v/3pre, n/plu; clubs: v/3pre, n/plu, and two meanings
– Structural Ambiguities
– Predicate-argument Ambiguities
Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester
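
The link between part-of-speech ambiguity and combinatorial explosion is simple arithmetic: the number of candidate tag sequences for a sentence is the product of the number of readings of each word. The sketch below enumerates them for a toy lexicon built from the slide's "runs" and "clubs" examples; the lexicon contents, the `UNK` fallback tag and the function name are assumptions.

```python
from itertools import product

# Hypothetical toy lexicon using the tag names from the slides.
LEXICON = {
    "john": ["P-N"],
    "runs": ["V/3pre", "N/plu"],
    "clubs": ["V/3pre", "N/plu"],
}


def tag_sequences(words):
    """Enumerate every combination of part-of-speech readings for the words."""
    options = [LEXICON.get(word.lower(), ["UNK"]) for word in words]
    return list(product(*options))


seqs = tag_sequences(["John", "runs", "clubs"])
print(len(seqs))
for seq in seqs:
    print(seq)
```

Three words with 1, 2 and 2 readings already give 1 x 2 x 2 = 4 candidate sequences, and the count keeps multiplying with every additional ambiguous word.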

Slide 34: Structural Ambiguities
(1) Attachment Ambiguities
– John bought a car with large seats.
– John bought a car with $3000.
– The manager of Yaxing Benz, a Sino-German joint venture
– The manager of Yaxing Benz, Mr. John Smith
(2) Scope Ambiguities
– young women and men in the room
(3) Analytical Ambiguities
– Visiting relatives can be boring.
Semantic Ambiguities (1)
– John bought a car with Mary.
– $3000 can buy a nice car.
Semantic Ambiguities (2)
– Every man loves a woman.
Co-reference Ambiguities
Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Slide 35: General Framework of NLP: Difficulties of NLP
(1) Robustness: Incomplete Knowledge
(2) Ambiguities: Combinatorial Explosion
– Structural Ambiguities
– Predicate-argument Ambiguities
– Combinatorial Explosion
Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Slide 36: Note: Ambiguities vs Robustness
More comprehensive knowledge: more robust (big dictionaries, comprehensive grammar)
More comprehensive knowledge: more ambiguities
Adaptability: tuning, learning
Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Slide 37: Framework of IE
IE as compromise NLP
Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Slide 38: General Framework of NLP: Difficulties of NLP
(1) Robustness: Incomplete Knowledge
– Incomplete Domain Knowledge: interpretation rules; predefined aspects of information
Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester


Slide 40: Techniques in IE
(1) Domain Specific Partial Knowledge: knowledge relevant to information to be extracted
(2) Ambiguities: ignoring irrelevant ambiguities; simpler NLP techniques
(3) Robustness: coping with incomplete dictionaries (open class words); ignoring irrelevant parts of sentences
(4) Adaptation Techniques: machine learning, trainable systems
Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Slide 41: General Framework of NLP
Stages: Morphological and Lexical Processing; Syntactic Analysis; Semantic Analysis; Context processing, Interpretation
Open class words: named entity recognition (e.g. locations, persons, companies, organizations, position names)
– Domain specific rules: ", Inc.", "Mr."
– Machine learning: HMM, decision trees
– Rules + machine learning
Part of speech tagger
– FSA rules
– Statistical taggers
95 %; F-Value 90; domain dependent
Local context; statistical bias
Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester
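
The two domain-specific cues named on this slide, ", Inc." after a company name and "Mr." before a person name, are easy to express as patterns. The sketch below is a rule-only illustration; the regular expressions, the function name and the sample sentence are assumptions, and a practical recogniser would combine many more rules with the machine learning methods listed above.

```python
import re

# A run of capitalised words directly followed by ", Inc." is taken as a company name.
COMPANY = re.compile(r"([A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*), Inc\.")
# A run of capitalised words directly after "Mr. " is taken as a person name.
PERSON = re.compile(r"Mr\. ([A-Z][a-z]+(?: [A-Z][a-z]+)*)")


def find_entities(text):
    """Return (type, name) pairs found by the two hand-written rules."""
    entities = [("PERSON", m.group(1)) for m in PERSON.finditer(text)]
    entities += [("COMPANY", m.group(1)) for m in COMPANY.finditer(text)]
    return entities


sample = "Mr. John Smith joined Acme Widgets, Inc. last year."
print(find_entities(sample))
```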

Slide 42: General Framework of NLP: FASTUS
Stages: Morphological and Lexical Processing; Syntactic Analysis; Semantic Analysis; Context processing, Interpretation
Based on finite-state automata (FSA)
1. Complex Words: recognition of multi-words and proper names
2. Basic Phrases: simple noun groups, verb groups and particles
3. Complex Phrases: complex noun groups and verb groups
4. Domain Events: patterns for events of interest to the application; basic templates are to be built
5. Merging Structures: templates from different parts of the texts are merged if they provide information about the same entity or event
Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester
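
As an illustration of the finite-state idea behind the "Basic Phrases" stage, the sketch below finds simple noun groups by matching a regular pattern (optional determiner, any adjectives, one or more nouns) over part-of-speech tags. It is not FASTUS code: the tag-to-letter encoding, the function name and the sample input are assumptions, and only noun groups are handled.

```python
import re

# Encode each tag as one letter so token positions line up with string positions.
TAG_CODES = {"dt": "D", "jj": "J", "nn": "N", "np": "N"}


def noun_groups(tagged):
    """Return simple noun groups: optional determiner, adjectives, then nouns."""
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged)
    return [
        " ".join(word for word, _ in tagged[match.start():match.end()])
        for match in re.finditer(r"D?J*N+", codes)
    ]


# Hypothetical tagged input.
tagged = [
    ("the", "dt"), ("joint", "jj"), ("venture", "nn"), ("announced", "vbd"),
    ("a", "dt"), ("new", "jj"), ("plant", "nn"),
]
print(noun_groups(tagged))
```

Later stages of a cascade would take these groups as single units and match larger patterns over them.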


Slide 45: Using NLP
Strzalkowski (in Reader)
[Diagram: Text -> NLP -> repres -> Dbase search, where NLP: TAGGER -> PARSER -> TERMS]

Slide 46: Using NLP
INPUT SENTENCE
The former Soviet President has been a local hero ever since a Russian tank invaded Wisconsin.
TAGGED SENTENCE
The/dt former/jj Soviet/jj President/nn has/vbz been/vbn a/dt local/jj hero/nn ever/rb since/in a/dt Russian/jj tank/nn invaded/vbd Wisconsin/np ./per

Slide 47: Using NLP
TAGGED & STEMMED SENTENCE
the/dt former/jj soviet/jj president/nn have/vbz be/vbn a/dt local/jj hero/nn ever/rb since/in a/dt russian/jj tank/nn invade/vbd wisconsin/np ./per

Slide 48: Using NLP
PARSED SENTENCE
[assert [[perf [have]][[verb[BE]] [subject [np[n PRESIDENT][t_pos THE] [adj[FORMER]][adj[SOVIET]]]] [adv EVER] [sub_ord[SINCE [[verb[INVADE]] [subject [np [n TANK][t_pos A] [adj [RUSSIAN]]]] [object [np [name [WISCONSIN]]]]]]]]]

Slide 49: Using NLP
EXTRACTED TERMS & WEIGHTS
president          2.623519
soviet             5.416102
president+soviet  11.556747
president+former  14.594883
hero               7.896426
hero+local        14.314775
invade             8.435012
tank               6.848128
tank+invade       17.402237
tank+russian      16.030809
russian            7.383342
wisconsin          7.785689
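
Several of the compound terms in this list pair a head noun with one of its modifiers. The sketch below shows how the adjective+noun pairs could be read off the tagged and stemmed sentence from the earlier slide. It is a simplified stand-in for illustration: Strzalkowski's system derives its pairs from the parse (which is how it also gets tank+invade), and the weights are not reproduced because the slides do not give the weighting formula. The function names are assumptions.

```python
def parse_tagged(text):
    """Split 'word/tag' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word.lower(), tag))
    return pairs


def head_modifier_pairs(tagged):
    """Pair each noun with the adjectives that immediately precede it."""
    pairs, modifiers = [], []
    for word, tag in tagged:
        if tag == "jj":
            modifiers.append(word)
        elif tag in ("nn", "np"):
            pairs.extend(f"{word}+{modifier}" for modifier in modifiers)
            modifiers = []
        else:
            modifiers = []  # any other tag ends the current run of adjectives
    return pairs


TAGGED = (
    "the/dt former/jj soviet/jj president/nn have/vbz be/vbn a/dt local/jj "
    "hero/nn ever/rb since/in a/dt russian/jj tank/nn invade/vbd wisconsin/np ./per"
)
print(head_modifier_pairs(parse_tagged(TAGGED)))
```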

Slide 50: Same Sentence, Different System
INPUT SENTENCE
The former Soviet President has been a local hero ever since a Russian tank invaded Wisconsin.
TAGGED SENTENCE (using uptagger from Tsujii)
The/DT former/JJ Soviet/NNP President/NNP has/VBZ been/VBN a/DT local/JJ hero/NN ever/RB since/IN a/DT Russian/JJ tank/NN invaded/VBD Wisconsin/NNP ./.

Slide 51: Same Sentence, Different System
CHUNKED Sentence (chunkparser – Tsujii)
(TOP
  (S
    (NP (DT The) (JJ former) (NNP Soviet) (NNP President))
    (VP (VBZ has)
      (VP (VBN been)
        (NP (DT a) (JJ local) (NN hero))
        (ADVP (RB ever))
        (SBAR (IN since)
          (S
            (NP (DT a) (JJ Russian) (NN tank))
            (VP (VBD invaded)
              (NP (NNP Wisconsin)))))))
    (. .)))
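
Bracketed output of this kind is straightforward to consume programmatically, for example to pull out noun phrases as candidate index terms. The sketch below parses a bracketed tree into nested tuples and collects the words under each NP node. The function names are assumptions, and the input is a shortened version of the tree above rather than the chunker's full output.

```python
def parse_tree(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()


def leaves(node):
    """Return the words under a node, left to right."""
    words = []
    for child in node[1]:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


def noun_phrases(node):
    """Collect the text of every NP node in the tree."""
    found = [" ".join(leaves(node))] if node[0] == "NP" else []
    for child in node[1]:
        if isinstance(child, tuple):
            found.extend(noun_phrases(child))
    return found


TREE = (
    "(S (NP (DT The) (JJ former) (NNP Soviet) (NNP President)) "
    "(VP (VBZ has) (VP (VBN been) (NP (DT a) (JJ local) (NN hero)))))"
)
print(noun_phrases(parse_tree(TREE)))
```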

Slide 52: Same Sentence, Different System
Enju Parser
[Columns: word, base form, POS, base POS, position, relation, then the same five fields for the argument]
ROOT     ROOT    ROOT  ROOT  -1  ROOT  been       be         VBN  VB   5
been     be      VBN   VB     5  ARG1  President  president  NNP  NNP  3
been     be      VBN   VB     5  ARG2  hero       hero       NN   NN   8
a        a       DT    DT     6  ARG1  hero       hero       NN   NN   8
a        a       DT    DT    11  ARG1  tank       tank       NN   NN   13
local    local   JJ    JJ     7  ARG1  hero       hero       NN   NN   8
The      the     DT    DT     0  ARG1  President  president  NNP  NNP  3
former   former  JJ    JJ     1  ARG1  President  president  NNP  NNP  3
Russian  russian JJ    JJ    12  ARG1  tank       tank       NN   NN   13
Soviet   soviet  NNP   NNP    2  MOD   President  president  NNP  NNP  3
invaded  invade  VBD   VB    14  ARG1  tank       tank       NN   NN   13
invaded  invade  VBD   VB    14  ARG2  Wisconsin  wisconsin  NNP  NNP  15
has      have    VBZ   VB     4  ARG1  President  president  NNP  NNP  3
has      have    VBZ   VB     4  ARG2  been       be         VBN  VB   5
since    since   IN    IN    10  MOD   been       be         VBN  VB   5
since    since   IN    IN    10  ARG1  invaded    invade     VBD  VB   14
ever     ever    RB    RB     9  ARG1  since      since      IN   IN   10

Slide 53: NLP & IR
Indexing
– Use of NLP methods to identify phrases
  Test weighting schemes for phrases
– Use of more sophisticated morphological analysis
Searching
– Use of two-stage retrieval
  Statistical retrieval
  Followed by more sophisticated NLP filtering
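
The two-stage retrieval idea can be sketched as a cheap statistical pass that selects candidates, followed by a more selective check applied only to those candidates. In the illustration below the first stage counts overlapping query terms and the second stage is an exact phrase match, which merely stands in for real NLP filtering; the function names, the parameter `k` and the sample documents are all assumptions.

```python
def statistical_stage(query, docs, k=2):
    """Stage 1: rank documents by the number of query terms they share."""
    q_terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        overlap = len(q_terms & set(text.lower().split()))
        if overlap:
            scored.append((overlap, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


def phrase_filter(phrase, docs, candidates):
    """Stage 2: keep only candidates containing the phrase as a contiguous unit."""
    return [doc_id for doc_id in candidates if phrase.lower() in docs[doc_id].lower()]


# Hypothetical mini-collection.
docs = {
    1: "a russian tank invaded wisconsin",
    2: "the tank of the russian president",
    3: "local hero wins",
}
candidates = statistical_stage("russian tank", docs)
final = phrase_filter("russian tank", docs, candidates)
print(candidates, final)
```

Both documents 1 and 2 pass the bag-of-words stage because they contain the same two terms, but only document 1 survives the second stage.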

Slide 54: NLP & IR
Lewis and Sparck Jones suggest research in three areas
– Examination of the words, phrases and sentences that make up a document description and express the combinatory, syntagmatic relations between single terms
– The classificatory structure over a document collection as a whole, indicating the paradigmatic relations between terms and permitting controlled vocabulary indexing and searching
– Using NLP-based methods for searching and matching

Slide 55: NLP & IR Issues
Is natural language indexing using more NLP knowledge needed?
Or should controlled vocabularies be used?
Can NLP in its current state provide the improvements needed?
How to test?

Slide 56: NLP & IR
New “Question Answering” track at TREC has been exploring these areas
– Usually statistical methods are used to retrieve candidate documents
– NLP techniques are used to extract the likely answers from the text of the documents

