Published by Willa Anderson. Modified over 9 years ago.
1
Mining Topic-Specific Concepts and Definitions on the Web
Bing Liu et al., KDD03
CS591CXZ Web mining: Lexical relationship mining
2
Lexical relationship mining
A lexical relationship is a relationship between words, such as synonymy, antonymy, or hypernymy (“dog” is a hypernym of “poodle”).
A lexical relationship is also a connection between the meanings of two words in a text that helps the text hold together.
Relevant connections include (rough) synonymy (e.g. woman - person, win - victory) and connections within a field of meaning (e.g. plane - pilot).
Thus, subtopic mining is in this category, but definition mining is not.
3
Information Extraction
MUC: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/
Information extraction: the extraction, or pulling out, of pertinent information from large volumes of text
Items of information   Percentile reliability
Entities               90
Attributes             80   (definition falls here)
Facts                  70
Events                 60
Attribute: a property of an entity, such as its name, alias, descriptor, or type
4
Mining Topic-Specific Concepts and Definitions on the Web
Goal: systematically learn an unfamiliar topic from the Web
–Definitions
–Topic hierarchy
Input: a term, e.g. “data mining”, “Web mining”
Tasks
–Identify sub-topics or salient concepts
  Like building an ontology, but with no clear hierarchy
  E.g.: Genetic Algorithm Algorithms
–Find and organize definition pages
  Definition question answering
–Concept disambiguation
5
Techniques
A lot of heuristics
–Simple linguistic patterns
  {concept} {-|:} {definition}
  {concept} {refer(s) to | satisfy(ies)} …
  …
–Web page tags, …
Frequent pattern mining
–A classic data mining technique
6
Algorithm WebLearn(T)
1. Submit T to a search engine and get the relevant pages
2. Mine the subtopics or salient concepts of T
3. Find definition pages
4. Output the concepts and definition pages to the user
If the user wants to know more about a subtopic T’, run WebLearn(T’)
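The WebLearn loop can be sketched roughly as follows. This is only an illustration of the control flow on the slide: the names `web_learn`, `explore`, `search`, `mine_subtopics`, `find_definition_pages`, and `choose` are hypothetical, and the three mining steps are passed in as callables rather than implemented here.

```python
from typing import Callable, Dict, List, Optional


def web_learn(
    topic: str,
    search: Callable[[str], List[str]],
    mine_subtopics: Callable[[str, List[str]], List[str]],
    find_definition_pages: Callable[[str, List[str]], List[str]],
) -> Dict[str, List[str]]:
    """One round of WebLearn(T). All three callables are hypothetical stand-ins."""
    pages = search(topic)                               # submit T to a search engine
    concepts = mine_subtopics(topic, pages)             # subtopics / salient concepts
    definitions = find_definition_pages(topic, pages)   # definition pages
    return {"concepts": concepts, "definition_pages": definitions}


def explore(
    topic: Optional[str],
    search,
    mine_subtopics,
    find_definition_pages,
    choose: Callable[[List[str]], Optional[str]],
) -> Dict[str, Dict[str, List[str]]]:
    """Rerun WebLearn on the subtopic the user picks; `choose` returns None to stop."""
    results: Dict[str, Dict[str, List[str]]] = {}
    while topic is not None and topic not in results:
        results[topic] = web_learn(topic, search, mine_subtopics, find_definition_pages)
        topic = choose(results[topic]["concepts"])
    return results
```

Passing the steps in as functions keeps the sketch independent of any particular search engine or mining method.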
7
Mining subtopic/salient concept (1)
Input: a set of top-ranked relevant documents
Steps:
1. Filter out “noisy” documents
–Publication listing pages: “in proceeding”, “journal”
–Forum discussion pages: “previous message”, “reply to”
–Pages that do not contain all query terms
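A minimal sketch of the noise filter in step 1, assuming each page is already available as plain text. The cue phrases are the ones quoted on the slide; treating a single case-insensitive occurrence as enough to discard a page is an assumption made here for brevity, and the function names are hypothetical.

```python
import re
from typing import List

# Cue phrases quoted on the slide; one case-insensitive hit discards the page (assumption).
PUBLICATION_CUES = ("in proceeding", "journal")
FORUM_CUES = ("previous message", "reply to")


def is_noisy(page_text: str, query: str) -> bool:
    """True for publication listings, forum pages, and pages missing a query term."""
    text = page_text.lower()
    if any(cue in text for cue in PUBLICATION_CUES + FORUM_CUES):
        return True
    words = set(re.findall(r"[a-z0-9]+", text))
    return not all(term in words for term in query.lower().split())


def filter_noisy(pages: List[str], query: str) -> List[str]:
    """Keep only the pages that pass all three checks."""
    return [page for page in pages if not is_noisy(page, query)]
```

A stricter variant might require several cue occurrences before discarding a page, since one mention of “journal” does not make a page a publication list.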
8
Mining subtopic/salient concept (2)
2. Identify important phrases in each page
Extract text segments in HTML emphasizing tags, …
Except those containing:
–Salutation titles (Mr., Dr., Professor)
–A URL or email address
–“conference”, “journal”, …
–Digits (e.g. KDD2004)
–Images
–Too many words (15 words as the limit)
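Step 2 might look something like the sketch below, built on Python's standard `html.parser`. The slide's own list of emphasizing tags did not survive, so `EMPHASIS_TAGS` is an assumed set; the class and function names are hypothetical, and the sketch expects well-formed HTML with closed tags.

```python
import re
from html.parser import HTMLParser
from typing import List

# Assumed tag set: the slide's list of emphasizing tags was lost.
EMPHASIS_TAGS = {"h1", "h2", "h3", "h4", "b", "strong", "em", "i", "u", "li", "dt"}
SALUTATIONS = re.compile(r"\b(mr|dr|professor)\b\.?", re.I)
URL_OR_EMAIL = re.compile(r"https?://|www\.|\S+@\S+", re.I)
MAX_WORDS = 15  # word limit given on the slide


class EmphasisExtractor(HTMLParser):
    """Collects the text of each outermost emphasized element that has no image."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.buffer = []
        self.has_image = False
        self.segments = []

    def handle_starttag(self, tag, attrs):
        if tag in EMPHASIS_TAGS:
            if self.depth == 0:
                self.buffer, self.has_image = [], False
            self.depth += 1
        elif tag == "img" and self.depth > 0:
            self.has_image = True

    def handle_endtag(self, tag):
        if tag in EMPHASIS_TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                text = " ".join("".join(self.buffer).split())
                if text and not self.has_image:
                    self.segments.append(text)

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def keep_segment(text: str) -> bool:
    """Apply the exclusion rules from the slide to one text segment."""
    lowered = text.lower()
    if SALUTATIONS.search(text) or URL_OR_EMAIL.search(text):
        return False
    if "conference" in lowered or "journal" in lowered:
        return False
    if any(ch.isdigit() for ch in text):
        return False
    return len(text.split()) <= MAX_WORDS


def important_phrases(html: str) -> List[str]:
    parser = EmphasisExtractor()
    parser.feed(html)
    parser.close()
    return [segment for segment in parser.segments if keep_segment(segment)]
```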
9
Mining subtopic/salient concept (3)
3. Mine frequent phrases
–Input: emphasized text segments
–Mine frequent word sets using an association rule mining technique
4. Eliminate word sets unlikely to be subtopics
–Heuristic: remove those that do not appear alone in an emphasizing tag in any page, e.g. “process”
–Remove generic words from the result set: “abstract”, “introduction”, “conclusion”, “research”, …
5. Rank the result sets
–According to the number of pages in which they occur
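Steps 3 to 5 can be illustrated with the sketch below. It is not the association rule miner the slide refers to: as a stand-in it enumerates small word sets by brute force and uses the number of pages as the support count. The function name, the `min_pages` and `max_size` parameters, and the choice to drop only word sets made up entirely of generic words are all assumptions.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

# Generic words listed on the slide.
GENERIC_WORDS = {"abstract", "introduction", "conclusion", "research"}


def mine_subtopics(
    pages: List[List[str]], min_pages: int = 2, max_size: int = 3
) -> List[Tuple[str, int]]:
    """`pages` holds, for each page, its emphasized text segments."""
    pages_with: Dict[FrozenSet[str], Set[int]] = defaultdict(set)
    alone_phrase: Dict[FrozenSet[str], str] = {}  # word sets that fill a whole segment
    for page_id, segments in enumerate(pages):
        for segment in segments:
            tokens = segment.lower().split()
            words = frozenset(tokens)
            alone_phrase.setdefault(words, " ".join(tokens))
            # Step 3 (brute force): every word subset up to max_size counts for this page.
            for size in range(1, min(max_size, len(words)) + 1):
                for combo in combinations(sorted(words), size):
                    pages_with[frozenset(combo)].add(page_id)

    ranked = []
    for word_set, page_ids in pages_with.items():
        if len(page_ids) < min_pages:      # not frequent enough
            continue
        if word_set not in alone_phrase:   # step 4: never appears alone in a tag
            continue
        if word_set <= GENERIC_WORDS:      # step 4: only generic words
            continue
        ranked.append((alone_phrase[word_set], len(page_ids)))
    # Step 5: rank by page count, breaking ties alphabetically.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked
```

In a toy input where “process” only ever appears inside longer segments such as “Clustering process”, it is frequent but never stands alone, so the step 4 heuristic removes it.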
10
Definition Finding
Definition identification patterns suitable for Web pages
–{concept} {-|:} {definition}
–{concept} {refer(s) to | satisfy(ies)} …
HTML structuring clues and hyperlinks
–If a page has only one header, …, or one big emphasized segment at the beginning => definition page
–Follow hyperlinks up to the second level to look for definition pages, and only hyperlinks whose anchor text matches the concept
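The two patterns shown on the slide could be approximated with regular expressions, as in the sketch below. Only these two patterns are covered (the slide elides the rest), the naive sentence splitter is an assumption, and the function names are hypothetical.

```python
import re
from typing import List


def definition_patterns(concept: str):
    """Regex versions of the two definition patterns shown on the slide."""
    c = re.escape(concept)
    return [
        # {concept} {-|:} {definition}
        re.compile(rf"\b{c}\s*[-:]\s+\S", re.I),
        # {concept} {refer(s) to | satisfy(ies)} ...
        re.compile(rf"\b{c}\s+(?:refers?\s+to|satisf(?:y|ies))\b", re.I),
    ]


def find_definition_sentences(text: str, concept: str) -> List[str]:
    """Return the sentences that match at least one definition pattern."""
    patterns = definition_patterns(concept)
    sentences = re.split(r"(?<=[.!?])\s+", text)  # naive splitter (assumption)
    return [s for s in sentences if any(p.search(s) for p in patterns)]
```

Requiring whitespace after the dash or colon keeps hyphenated compounds such as “data mining-based” from being mistaken for definitions.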
11
Subtopic disambiguation
By adding context terms – usually the parent topic or subtopics
– context terms tend to dominate the results
– cannot work for the first (root) topic
Heuristics to combat the domination of context terms
– only consider text segments containing the topic or subtopic
– identify pages with a topic hierarchy
  HTML list tag
  The hierarchy should also contain other subtopics of the parent topic
– shallow linguistic phenomena
  Topic + “approaches” / “techniques” + ( + “e.g.” / “such as” / “including” + subtopics )
Then, how does this help disambiguate?
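The shallow linguistic pattern can be sketched as a regular expression that pulls the listed subtopics out of a sentence. The function name is hypothetical, and the sketch assumes the list runs to the end of the sentence or to a closing parenthesis; any trailing clause would be captured as part of the last item.

```python
import re
from typing import List


def subtopics_from_sentence(sentence: str, topic: str) -> List[str]:
    """Match: topic + approaches/techniques + (e.g. | such as | including) + list."""
    pattern = re.compile(
        rf"\b{re.escape(topic)}\s+(?:approaches|techniques)\s*,?\s*\(?\s*"
        rf"(?:e\.g\.?,?|such\s+as|including)\s+([^.;)]+)",
        re.I,
    )
    match = pattern.search(sentence)
    if not match:
        return []
    # Split the captured list on commas and on the words "and" / "or".
    items = re.split(r",|\band\b|\bor\b", match.group(1))
    return [item.strip() for item in items if item.strip()]
```

One plausible reading of the slide is that a page where the ambiguous subtopic appears in such a list next to its parent topic is using the intended sense, so that page can be preferred.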
12
Evaluation
Use Google to get the initial set of relevant pages
Result 1: subtopics / salient concepts
–Looks pretty good; the terms are closely relevant
–More salient concepts than subtopics
Result 2: definition discovery comparison
–Precision: WebLearn vs Google vs AskJeeves
Result 3: disambiguation
–Seems to be useful
13
Analysis
Interesting topic
Potentially usable in practice
A complete system
Techniques
–Avoid NLP and machine learning
–Apply heuristics based on shallow text structures
14
Limitations
Research topics, not much ambiguity
Techniques:
–Heuristics are empirical, by no means flawless or exhaustive, and hard to apply to other domains
15
How to improve? -- discussion
Better research:
– do you think it is a good research topic?
Better techniques:
– what techniques would you like to try to solve the problem?
16
Thank you!