Published by Mercy Holland. Modified over 9 years ago.
1
A hybrid method for Mining Concepts from text CSCE 566 semester project
2
Concept Mining
A concept is a unit of information made up of words that carry some semantic meaning. Concept extraction is a key and nontrivial step in:
- Developing domain knowledge.
- Generating document annotations.
- Defining relatedness between corpus documents.
- Clustering a domain corpus.
3
Problem definition
We need to extract single-term and multi-term concepts from a domain corpus of related documents. We will adopt the method in the paper: Ventura, João, and Joaquim Silva. "Mining concepts from texts." Procedia Computer Science 9 (2012): 27-36.
4
Challenges
Concepts can be a single word or multiple words. Whether to treat a concept as single-word or multi-word depends on the domain and the level of generality. E.g., the concept “Network” versus the concept “Semantic network”: the former is a general concept that can be used in several domains, while the latter is more specific to the domain of knowledge systems. Within the same domain corpus, concepts can occur as single terms or multiple terms with different distributions; we should focus on the form with the higher co-occurrence frequency.
5
Previous approaches
Common solutions are either:
- Linguistic and knowledge-based approaches, which usually rely on knowledge of the structure of texts and on ontologies. Because of this built-in knowledge, these approaches depend to some degree on the language they work with or on its structure.
- Statistical approaches using Tf-Idf measures, which define the importance of a term in a corpus. However, this metric is not sensitive to how the word's frequency is distributed across the other documents in which it occurs, as long as the number of documents containing the word stays the same.
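The Tf-Idf limitation can be illustrated with a small sketch. It assumes one common variant of the measure (relative term frequency times log(N/df)); the slides do not say which variant is meant, and the function name `tf_idf` and the toy corpora are made up for illustration.

```python
import math

def tf_idf(term, doc, corpus):
    """Assumed variant: relative term frequency times log(N / df)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)  # documents containing the term
    return tf * math.log(len(corpus) / df)

doc = ["network", "network", "graph", "node"]

# Both toy corpora contain "network" in exactly two documents, but the
# second document uses it once in corpus_a and three times in corpus_b.
corpus_a = [doc, ["network", "x", "y", "z"], ["p", "q", "r", "s"]]
corpus_b = [doc, ["network", "network", "network", "z"], ["p", "q", "r", "s"]]

score_a = tf_idf("network", doc, corpus_a)
score_b = tf_idf("network", doc, corpus_b)
print(score_a == score_b)  # the score for doc ignores the other document's counts
```

Under this variant the score of "network" in `doc` is identical in both corpora, because only the number of documents containing the term enters the formula.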
6
Solution outline
- A domain corpus will be provided.
- We will use OpenNLP to split documents into single tokens.
- Only nouns are considered for single-term concepts.
- Compound concepts are made of single-word concepts that show some preference for occurring at fixed distances from each other.
- For each individual word w in the corpus, we obtain a list of neighbor words B = [b1, b2, ..., bm]. Each neighbor bi occurs at different positions relative to w. Positions of bi can be positive or negative and are measured relative to w, which sits at the center of the window.
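The neighbor-window step can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes the text is already tokenized (whitespace splitting stands in for OpenNLP here), it does not filter for nouns, and the function name `neighbor_positions` and the window size of 2 are hypothetical choices.

```python
from collections import defaultdict

def neighbor_positions(tokens, window=2):
    """Map each center word w to a list of (neighbor, relative_position) pairs.

    Negative positions are to the left of w, positive ones to the right.
    """
    neighbors = defaultdict(list)
    for i, w in enumerate(tokens):
        for j in range(-window, window + 1):
            k = i + j
            if j != 0 and 0 <= k < len(tokens):
                neighbors[w].append((tokens[k], j))
    return neighbors

# Whitespace tokenization is an assumption made to keep the sketch self-contained.
tokens = "a semantic network is a network of concepts".split()
result = neighbor_positions(tokens, window=2)
print(result["network"])
```

For the center word "network", the list contains the pair ("semantic", -1), meaning "semantic" occurs one position to the left of it.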
7
Solution outline
We will then compute the co-occurrence frequency of each word bi at position j relative to w. Pairs of words with a high co-occurrence frequency at a fixed distance are considered candidate compound concepts.
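A rough sketch of the counting step, under stated assumptions: documents are already tokenized, a raw count threshold (`min_count`) stands in for "high co-occurrence frequency" since the slides give no cutoff, and the names `cooccurrence_counts` and `candidate_compounds` are hypothetical.

```python
from collections import Counter

def cooccurrence_counts(documents, window=2):
    """Count how often neighbor b occurs at relative position j from center word w."""
    counts = Counter()
    for tokens in documents:
        for i, w in enumerate(tokens):
            for j in range(-window, window + 1):
                k = i + j
                if j != 0 and 0 <= k < len(tokens):
                    counts[(w, tokens[k], j)] += 1
    return counts

def candidate_compounds(counts, min_count=2):
    """Keep (w, b, j) triples seen at least min_count times.

    Only positive offsets are kept so each pair is reported once, in reading order.
    """
    return sorted(key for key, n in counts.items() if key[2] > 0 and n >= min_count)

# Toy corpus, made up for illustration.
docs = [
    "a semantic network links concepts".split(),
    "the semantic network stores knowledge".split(),
    "a network of semantic relations".split(),
]
counts = cooccurrence_counts(docs)
candidates = candidate_compounds(counts, min_count=2)
print(candidates)
```

In this toy corpus only "semantic" followed by "network" at distance 1 reaches the threshold; "network ... semantic" at distance 2 occurs once and is dropped, which reflects the idea that compounds prefer a fixed distance.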
8
Wikipedia-based verification
After extracting candidate concepts from the corpus, we need to validate each one against Wikipedia for the given domain. Wikipedia can be accessed through the JWPL Java library or by querying DBpedia. (This will be discussed in detail if you are interested in this project.)
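One possible shape for the DBpedia route is a SPARQL ASK query that checks whether any resource carries the candidate as its English label. The sketch below only builds the query text and does not contact any endpoint; the function name `build_label_ask_query` is hypothetical, capitalizing the first letter is an assumption about how labels follow Wikipedia titles, and a label match alone does not confirm that the concept belongs to the given domain.

```python
def build_label_ask_query(concept, lang="en"):
    """Build a SPARQL ASK query matching a resource by its rdfs:label.

    Assumption: the label follows the Wikipedia title convention of an
    upper-case first letter (e.g. "Semantic network").
    """
    title = concept.strip()
    title = title[:1].upper() + title[1:]
    # Escape backslashes and double quotes for a SPARQL string literal.
    escaped = title.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        f'ASK {{ ?resource rdfs:label "{escaped}"@{lang} }}'
    )

query = build_label_ask_query("semantic network")
print(query)
```

The resulting text could then be sent to a SPARQL endpoint with any HTTP or SPARQL client; a domain check (for example, on the resource's categories) would still be needed on top of it.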
9
Contacts Elshaimaa Ali eea7236@Louisiana.edu Elshaimaa.ali@Hotmail.com