
1 Lecture 9: The Future of Web Mining (Chap. 9, Chakrabarti)
Wen-Hsiang Lu (盧文祥)
Department of Computer Science and Information Engineering, National Cheng Kung University
2004/12/23

2 Important Issues
– Information Extraction
– Natural Language Processing
– Question Answering
– Profiles, Personalization, and Collaboration

3 Information Extraction
– An HR firm may wish to monitor the Web sites of businesses in a specific sector for available job positions with salaries and locations, and build and maintain a structured database containing this data to help design their pay packages.
– A market analyst may wish to monitor management changes in companies from a specified sector and get updates of the form “X replaced Y in position P of company C.”
– A researcher may wish to monitor a set of university and journal Web sites for articles that claim to improve on a specific technique and to be notified with the title, authors, and a URL where the article is available online.
– An academic department may wish to monitor other universities for promising doctoral candidates to hire in specified areas, with related faculty being notified about significant publications by the candidates.
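
The market-analyst template can be approximated with a surface pattern. The sketch below is only an illustration: it assumes the text uses the wording "X replaced Y as P of C", and the pattern, the function name, and the sample sentence are hypothetical, not part of the lecture. Real extractors need far more robust patterns or learned models.

```python
import re

# Hypothetical surface pattern for the template
# "X replaced Y in position P of company C", assuming the wording
# "<X> replaced <Y> as <P> of <C>" with capitalized names.
NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"
PATTERN = re.compile(
    rf"(?P<new>{NAME}) replaced (?P<old>{NAME}) "
    rf"as (?P<position>[A-Za-z ]+?) of (?P<company>{NAME})"
)


def extract_management_changes(text):
    """Return one dict per matched management change in the text."""
    return [match.groupdict() for match in PATTERN.finditer(text)]


sample = ("The board said that Jane Doe replaced John Roe "
          "as chief financial officer of Acme Widgets.")
changes = extract_management_changes(sample)
```

Each returned dict is one row of the structured database that the slide describes (new person, old person, position, company).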

4 Lexical Networks and Ontologies
WordNet: an English dictionary in which unique concepts are represented by nodes called synsets (synonym sets)
– bronco: bronco, mustang, pony, horse, equine, odd-toed ungulate, placental mammal, mammal, vertebrate, chordate, animal, organism
The opposite-of (antonym) relation holds between words, not between synsets, for example
– wet: watery, damp, moist, humid, soggy
– dry: parched, arid, anhydrous, sere
– Only dry and wet are antonyms
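
A toy data structure makes the two kinds of relation concrete. This sketch reads the bronco list as a chain in which each term is more general than the one before it, and it hard-codes only the words on the slide; it does not use the real WordNet database, and all function names are illustrative.

```python
# Toy "is-a" chain from the bronco example (assumed to run from the
# most specific term to the most general one).
CHAIN = ["bronco", "mustang", "pony", "horse", "equine",
         "odd-toed ungulate", "placental mammal", "mammal",
         "vertebrate", "chordate", "animal", "organism"]
HYPERNYM = dict(zip(CHAIN, CHAIN[1:]))

# Synsets group near-synonyms; antonymy is recorded between words only.
SYNSETS = [
    {"wet", "watery", "damp", "moist", "humid", "soggy"},
    {"dry", "parched", "arid", "anhydrous", "sere"},
]
ANTONYM_PAIRS = {frozenset(("wet", "dry"))}


def hypernym_chain(term):
    """Follow is-a links upward from term to the most general concept."""
    chain = [term]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def is_a(term, ancestor):
    return ancestor in hypernym_chain(term)


def same_synset(a, b):
    return any(a in synset and b in synset for synset in SYNSETS)


def are_antonyms(a, b):
    # A word-level lookup: membership in opposite synsets is not enough.
    return frozenset((a, b)) in ANTONYM_PAIRS
```

Here `are_antonyms("wet", "dry")` holds, while `are_antonyms("damp", "arid")` does not, even though the two words sit in the same synsets as wet and dry.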

5 Lexical Networks and Ontologies
An ontology is a kind of schema describing specific roles of entities and the relations between entities
For example
– A PC troubleshooting site may use a custom ontology: hard disk, PCI bus, CPU, CPU fan, SCSI cables, jumper settings, device drivers, CD-ROMs, software, installation, etc.
– A university department comprises entities such as faculty, student, administrative staff, research project, sponsor organization, research paper, journal, conference, and the like, together with relations
A great deal of manual labor is needed to build lexical networks and ontologies

6 Part-of-Speech and Sense Tagging
The extent of ambiguity in common words
– Run: 11 noun senses, 42 verb senses
Delimiting regions of sentences with part-of-speech (POS) tags
A manually designed tag set and a collection of hand-tagged documents are needed to train a supervised tagger.

7 Part-of-Speech and Sense Tagging
Approaches to IE and POS tagging are very similar
HMMs can be used for POS tagging
Over 130 POS tags are used regularly
http://www.comp.lancs.ac.uk/ucrel/claws1tags.html

Word    Possible POS
The     article
man     noun, verb
still   noun, verb, adjective, adverb
saw     noun, past-tense verb
her     object pronoun, possessive pronoun
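
As a rough illustration of HMM-based tagging, the sketch below runs the Viterbi algorithm over a four-tag toy model. The tag names and every probability are invented for the example; a real tagger would estimate them from a hand-tagged corpus and use a much larger tag set.

```python
# Toy HMM: all probabilities below are made-up assumptions.
TAGS = ["DET", "NOUN", "VERB", "PRON"]
START = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.05, "PRON": 0.05}
TRANS = {
    "DET":  {"NOUN": 0.9, "VERB": 0.05, "PRON": 0.04, "DET": 0.01},
    "NOUN": {"VERB": 0.6, "NOUN": 0.2, "PRON": 0.1, "DET": 0.1},
    "VERB": {"PRON": 0.4, "DET": 0.4, "NOUN": 0.15, "VERB": 0.05},
    "PRON": {"NOUN": 0.5, "VERB": 0.4, "DET": 0.05, "PRON": 0.05},
}
EMIT = {
    "DET":  {"the": 1.0},
    "NOUN": {"man": 0.6, "saw": 0.4},
    "VERB": {"man": 0.2, "saw": 0.8},
    "PRON": {"her": 1.0},
}


def viterbi(words, tags, start, trans, emit):
    """Return the most probable tag sequence for words under the HMM."""
    # best[i][t]: probability of the best path for words[:i+1] ending in t
    best = [{t: start.get(t, 0.0) * emit[t].get(words[0], 0.0) for t in tags}]
    back = [{}]
    for word in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            # Best previous tag to transition from into tag t.
            prev = max(tags, key=lambda p: best[-1][p] * trans[p].get(t, 0.0))
            scores[t] = (best[-1][prev] * trans[prev].get(t, 0.0)
                         * emit[t].get(word, 0.0))
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    return list(reversed(path))


tagged = viterbi("the man saw her".split(), TAGS, START, TRANS, EMIT)
```

With these toy numbers the ambiguous words "man" and "saw" are resolved by their neighbors: "man" follows an article and is tagged as a noun, and "saw" follows a noun and is tagged as a verb.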

8 Part-of-Speech and Sense Tagging
Accuracy of 96% to 99% is not uncommon in statistical POS tagging
Word sense disambiguation (WSD) is initiated after POS tagging
Ambiguous tokens are tagged with a sense identifier
Consider a word w in the training text, which may be represented using a set of features
– E.g.: Interest

Usage   Sense
53%     money paid for use of money
21%     a share in a business or company
18%     readiness to give attention
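
One simple way to use such features is a naive Bayes classifier over the context words around the ambiguous token. The sketch below is an assumption about how this could be done, not the method from the lecture; the four training contexts are invented, and a real system would train on a sense-tagged corpus whose sense priors would reflect usage frequencies like those in the table.

```python
import math
from collections import Counter, defaultdict

MONEY = "money paid for use of money"
SHARE = "a share in a business or company"
ATTENTION = "readiness to give attention"

# Invented training contexts for the word "interest".
EXAMPLES = [
    ("bank pays interest on deposits", MONEY),
    ("loan interest rate rose", MONEY),
    ("acquired a controlling interest in the company", SHARE),
    ("showed great interest in the lecture", ATTENTION),
]


def train(examples):
    """Count senses and context words (the features) per sense."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for context, sense in examples:
        sense_counts[sense] += 1
        for word in context.lower().split():
            word_counts[sense][word] += 1
            vocab.add(word)
    return sense_counts, word_counts, vocab


def disambiguate(context, model):
    """Pick the sense with the highest naive Bayes log-score."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())

    def score(sense):
        # Add-one smoothing so unseen (sense, word) pairs are not zero.
        denom = sum(word_counts[sense].values()) + len(vocab)
        log_prob = math.log(sense_counts[sense] / total)
        for word in context.lower().split():
            if word in vocab:  # ignore words never seen in training
                log_prob += math.log((word_counts[sense][word] + 1) / denom)
        return log_prob

    return max(sense_counts, key=score)


model = train(EXAMPLES)
```

For instance, `disambiguate("interest rate on the loan", model)` selects the money sense, because "rate" and "loan" occur only in the money contexts of the toy data.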

9 Parsing and Knowledge Representation
Morphological and syntactic analyses are only the initial steps on the long path to parsing the input and then representing natural language in a form that can be manipulated and searched by a computer

10 Parsing and Knowledge Representation
The sentences are quite simple, but it is nontrivial to infer that "him" refers to Raja in the passage
Pronoun resolution is a special case of the general resolution of references in sentences

11 Parsing and Knowledge Representation
Pragmatics also plays an important role in correct parsing
– Raja ate bread with jam
– Raja ate bread with Ravi
Syntactic analysis can offer clues but cannot completely resolve such ambiguity

12 Parsing and Knowledge Representation
Most grammars for natural language are ambiguous
Parsers are not always context-free, and some may backtrack over the input

13 Parsing and Knowledge Representation
Link Parser by Sleator and Temperley
The Link Parser has a dictionary that stores terms, each associated with one or more linking requirements or constraints

14 Parsing and Knowledge Representation
(a) A set of words from the dictionary, each with one or more linking requirements
(b) An illegal sentence and its unsuccessful parse
(c) A legal sentence and its successful parse
(d) A simpler way to show a legal parse graph
(e) A relatively complex sentence parsed by the Link Parser

15 Parsing and Knowledge Representation
A successful parse introduces links among the terms in the sentence so that three properties hold:
– Satisfaction: Each linking requirement for each term in the sentence needs to be satisfied by some connector of the opposite polarity emerging from some other word in the sentence
– Connectivity: The links introduced should connect all the terms in the sentence
– Planarity: The links introduced by the parser cannot cross when drawn above the sentence written on a line
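
Two of the three properties can be checked directly on a candidate set of links. The sketch below represents each link as a pair of word positions and tests connectivity and planarity only; the satisfaction check is omitted because it needs the dictionary of linking requirements. The function names and the example linkage are illustrative and are not taken from the Link Parser itself.

```python
def is_planar(links):
    """True if no two links cross when drawn above the sentence."""
    spans = [tuple(sorted(link)) for link in links]
    for i, (a, b) in enumerate(spans):
        for c, d in spans[i + 1:]:
            # Links cross when exactly one endpoint of one link lies
            # strictly inside the span of the other.
            if a < c < b < d or c < a < d < b:
                return False
    return True


def is_connected(num_words, links):
    """True if the links join all word positions into one component."""
    if num_words == 0:
        return True
    adjacency = {i: set() for i in range(num_words)}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, stack = {0}, [0]
    while stack:
        for neighbor in adjacency[stack.pop()]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return len(seen) == num_words


# Hypothetical linkage for "the cat chased a snake" (positions 0..4):
# the-cat, cat-chased, chased-snake, a-snake.
words = "the cat chased a snake".split()
links = [(0, 1), (1, 2), (2, 4), (3, 4)]
valid = is_planar(links) and is_connected(len(words), links)
```

A linkage such as `[(0, 2), (1, 3)]` fails the planarity test because the two arcs would cross, and `[(0, 1), (2, 3)]` over four words fails connectivity because it leaves two separate components.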

16 Parsing and Knowledge Representation
The parses produced by the Link Parser or some other parser can be a foundation for representing textual content in a uniform graph formalism
Once this is accomplished, the challenge would be in matching parse graphs to query graphs and ranking the responses
Suitably annotated parse graphs can also be used as an interlingua for translation between many languages

