1 Knowledge Representation & Acquisition for Large-Scale Semantic Memory Julian Szymański Dept. of Electronic, Telecommunication & Informatics, Gdańsk University of Technology, Poland, Włodzisław Duch Department of Informatics, Nicolaus Copernicus University, Toruń, Poland WCCI 2008

2 Plan & main points
Goal: reaching human-level competence in all aspects of NLP. Well... step by step.
Representation of semantic concepts is necessary for the understanding of natural language by cognitive systems.
Word games are an opportunity for semantic knowledge acquisition that may be used to construct semantic memory.
A task-dependent architecture of the knowledge base, inspired by psycholinguistic theories of the cognition process, is introduced.
A semantic search algorithm for simplified concept vector representations.
A 20 questions game based on semantic memory has been implemented; a good test of the linguistic competence of the system.
A web portal with a Haptek-based talking-head interface facilitates acquisition of new knowledge while playing the game and engaging in dialogs with users.

3 Humanized interface
Architecture diagram (components): applications (search, the 20 questions game) issue store and query operations on the semantic memory; a parser with part-of-speech tagger and phrase extractor feeds it from on-line dictionaries; further knowledge comes from active search and dialogues with users, followed by manual verification.
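A minimal sketch of how the pieces of this diagram could be wired together; all class and method names here are illustrative assumptions rather than the authors' implementation.

```python
# A toy wiring of the interface diagram: dictionaries feed a parser that
# stores concept-keyword relations in semantic memory; applications (search,
# the 20 questions game) query it; user dialogues produce candidate facts
# that wait for manual verification. All names are illustrative assumptions.
class SemanticMemory:
    def __init__(self):
        self.relations = []   # (concept, relation, keyword, weight)
        self.pending = []     # facts awaiting manual verification

    def store(self, concept, relation, keyword, weight=1.0):
        self.relations.append((concept, relation, keyword, weight))

    def query(self, concept):
        return [r for r in self.relations if r[0] == concept]

    def from_dialogue(self, concept, keyword):
        # Knowledge volunteered by users is queued for manual verification.
        self.pending.append((concept, "has", keyword, 1.0))

memory = SemanticMemory()
memory.store("giraffe", "has", "long neck")   # e.g. parsed from a dictionary
memory.from_dialogue("giraffe", "spots")      # e.g. volunteered by a player
print(memory.query("giraffe"), memory.pending)
```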

4 Ambitious approaches…
CYC, Douglas Lenat, started in 1984. Developed by Cycorp, with 2.5 million assertions linking over 150,000 concepts and using thousands of micro-theories (2004). Cyc-NL is still a "potential application"; knowledge representation in frames is quite complicated and thus difficult to use.
Open Mind Common Sense Project (MIT): a WWW collaboration with over 14,000 authors who contributed 710,000 sentences; used to generate ConceptNet, a very large semantic network.
Other such projects: HowNet (Chinese Academy of Sciences), FrameNet (Berkeley), various large-scale ontologies.
The focus of these projects is to understand all relations in text/dialogue. NLP is hard and messy! Many people have lost hope that good NLP systems can be created without deep embodiment. Go the brain way! How does the brain do it?

5 Semantic Memory Models
Endel Tulving, "Episodic and Semantic Memory", 1972.
Semantic memory refers to the memory of meanings and understandings. It stores concept-based, generic, context-free knowledge: a permanent container for general knowledge (facts, ideas, words, etc.).
Models: semantic network (Collins & Loftus, 1975); hierarchical model (Collins & Quillian, 1969).

6 Words in the brain
Psycholinguistic experiments show that most likely categorical, phonological representations are used, not the raw acoustic input. Acoustic signal => phonemes => words => semantic concepts. Phonological processing precedes semantic processing by about 90 ms (from N200 ERPs). F. Pulvermüller (2003), The Neuroscience of Language: On Brain Circuits of Words and Serial Order, Cambridge University Press.
Phonological neighborhood density = the number of words that are similar in sound to a target word. Similar = similar pattern of brain activations.
Semantic neighborhood density = the number of words that are similar in meaning to a target word.
Action-perception networks inferred from ERP and fMRI.

7 Semantic => vector reps
Word w in a context: φ(w,Cont), a distribution of brain activations. States φ(w,Cont) correspond to lexicographical meanings: clusterize φ(w,Cont) over all contexts and define prototypes φ(w_k,Cont) for the different meanings w_k. Simplification: use spreading activation in semantic networks to define φ. How does the activation flow? Try this algorithm on a collection of texts:
Perform text pre-processing steps: stemming, stop-list, spell-checking...
Use MetaMap with very restrictive settings to discover concepts, avoiding highly ambiguous results when mapping text to the UMLS ontology.
Use UMLS relations to create first-order cosets (terms + all new terms from the included relations); add only those types of relations that lead to improvement of classification results.
Reduce the dimensionality of the first-order coset space, keeping all original features; use a feature-ranking method for this reduction.
Repeat the last two steps iteratively to create second- and higher-order enhanced spaces, first expanding, then shrinking the space.
Create X vectors representing concepts.
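As a concrete illustration of the spreading-activation simplification, here is a minimal sketch over a toy semantic network; the graph fragment, decay factor and function name are assumptions for illustration, not the authors' implementation.

```python
# A minimal sketch of spreading activation over a toy semantic network,
# used here to turn a concept into a vector of activations over keywords.
from collections import defaultdict

def spread_activation(graph, seed, decay=0.5, steps=3):
    """Propagate activation from `seed` through weighted edges."""
    activation = defaultdict(float)
    activation[seed] = 1.0
    frontier = {seed: 1.0}
    for _ in range(steps):
        next_frontier = defaultdict(float)
        for node, act in frontier.items():
            for neighbour, weight in graph.get(node, []):
                next_frontier[neighbour] += act * weight * decay
        for node, act in next_frontier.items():
            activation[node] += act
        frontier = next_frontier
    return dict(activation)

# Toy fragment of a semantic network: lists of (neighbour, relation weight).
graph = {
    "cobra": [("snake", 1.0), ("venom", 0.8)],
    "snake": [("reptile", 1.0), ("animal", 0.9)],
    "reptile": [("animal", 1.0)],
}
print(spread_activation(graph, "cobra"))
```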

8 Semantic knowledge representation
vwCRK: certainty – truth – Concept Relation Keyword. Similar to RDF in the semantic web.
Example (concept: cobra): is_a – animal, beast, being, brute, creature, fauna, organism, reptile, serpent, snake, vertebrate; has – belly, body part, cell, chest, costa.
Simplest representation for massive evaluation/association: CDV – Concept Description Vectors, forming a Semantic Matrix.
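A minimal sketch of how vwCRK atoms and the resulting Concept Description Vectors could be represented; the field names and the dictionary-based matrix layout are assumptions made for this illustration.

```python
# vwCRK atoms (certainty, truth, Concept, Relation, Keyword) packed into
# Concept Description Vectors (CDV) that form a semantic matrix.
from dataclasses import dataclass

@dataclass
class Atom:
    concept: str
    relation: str   # e.g. "is_a", "has"
    keyword: str
    v: float        # certainty of the atom
    w: float        # truth / strength and sign of the relation

def semantic_matrix(atoms):
    """Build a concept x keyword matrix of w weights (one CDV per concept)."""
    concepts = sorted({a.concept for a in atoms})
    keywords = sorted({a.keyword for a in atoms})
    matrix = {c: {k: 0.0 for k in keywords} for c in concepts}
    for a in atoms:
        matrix[a.concept][a.keyword] = a.w
    return matrix

atoms = [
    Atom("cobra", "is_a", "reptile", v=1.0, w=1.0),
    Atom("cobra", "has", "belly", v=0.9, w=1.0),
    Atom("giraffe", "has", "long neck", v=1.0, w=1.0),
]
print(semantic_matrix(atoms)["cobra"])
```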

9 Relations
IS_A: specific objects inherit features from more general ones. Features are inherited with weight w from superior relations; v is decreased by 10% and corrected during interaction with the user (see the sketch after this slide).
Similar: defines objects which share features with each other; new knowledge is acquired from similar objects through swapping of unknown features with given certainty factors.
Excludes: exchange some unknown features, but reverse the sign of the w weights.
Entail: analogous to logical implication; one feature automatically entails a few more features (connected via the entail relation).
An atom of knowledge contains the strength and the direction of the relation between concepts and keywords, coming from 3 sources: directly entered into the knowledge base; deduced from stored information using the predefined relation types; obtained during the system's interaction with human users.
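A minimal sketch of IS_A inheritance with the 10% certainty decay mentioned above; the recursive dictionary layout is an assumption for illustration.

```python
# Feature inheritance along IS_A links: a concept inherits its parents'
# (keyword, w) pairs, while the certainty v of each inherited feature is
# reduced by 10% per inheritance step.
def inherit_features(concept, is_a, features, decay=0.9):
    """Return {keyword: (w, v)} for `concept`, including inherited features."""
    result = dict(features.get(concept, {}))
    for parent in is_a.get(concept, []):
        for kw, (w, v) in inherit_features(parent, is_a, features, decay).items():
            # Inherited certainty shrinks by 10% at each IS_A step;
            # directly stored features take precedence.
            result.setdefault(kw, (w, v * decay))
    return result

is_a = {"cobra": ["snake"], "snake": ["reptile"]}
features = {
    "reptile": {"cold-blooded": (1.0, 1.0)},
    "snake": {"no legs": (1.0, 1.0)},
    "cobra": {"venomous": (1.0, 1.0)},
}
print(inherit_features("cobra", is_a, features))
```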

10 20q for semantic data acquisition
Play 20 questions with the avatar! http://diodor.eti.pg.gda.pl
Think about an animal – the system tries to guess it, asking no more than 20 questions that should be answered only with Yes or No. The given answers narrow the subspace of the most probable objects. The system learns from the games – it obtains new knowledge from interaction with human users.
Example game: Is it vertebrate? Y. Is it mammal? Y. Does it have hoof? Y. Is it equine? N. Is it bovine? N. Does it have horn? N. Does it have long neck? Y. I guess it is giraffe.

11 Algorithm for 20 questions game
p(keyword = v_i) is the fraction of concepts for which the keyword has value v_i.
The subspace of candidate concepts O(A) is selected as O(A) = {i : d = |CDV_i − ANS| is minimal}, where CDV_i is the vector for the i-th concept and ANS is the partial vector of retrieved answers.
User mistakes can be handled by choosing d > minimal.
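A minimal sketch of the candidate-subspace selection above, together with a simple question-picking heuristic (the most balanced yes/no split). The heuristic and the data layout are my assumptions; only the minimal-distance subspace comes from the slide.

```python
# Keep the concepts whose CDV is closest to the partial answer vector ANS;
# a slack > 0 tolerates a few mistaken answers (d > minimal).
def candidate_subspace(cdv, answers, slack=0):
    def dist(concept):
        return sum(abs(cdv[concept].get(k, 0.0) - v) for k, v in answers.items())
    distances = {c: dist(c) for c in cdv}
    best = min(distances.values())
    return [c for c, d in distances.items() if d <= best + slack]

def pick_question(cdv, candidates, asked):
    """Pick the keyword whose 'yes' fraction is closest to 0.5 among candidates."""
    keywords = {k for c in candidates for k in cdv[c]} - asked
    def balance(k):
        p = sum(1 for c in candidates if cdv[c].get(k, 0.0) > 0) / len(candidates)
        return abs(p - 0.5)
    return min(keywords, key=balance) if keywords else None

cdv = {
    "giraffe": {"vertebrate": 1, "mammal": 1, "hoof": 1, "long neck": 1},
    "cow":     {"vertebrate": 1, "mammal": 1, "hoof": 1, "horn": 1},
    "cobra":   {"vertebrate": 1, "reptile": 1},
}
answers = {"vertebrate": 1, "mammal": 1}
cands = candidate_subspace(cdv, answers)
print(cands, pick_question(cdv, cands, asked=set(answers)))
```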

12 Automatic data acquisition
Basic semantic data obtained from aggregation of machine-readable dictionaries: WordNet, ConceptNet, SUMO ontology.
– Relations used for the semantic category: animal.
– Semantic space truncated using word popularity ranks:
IC – information content: the number of appearances of the particular word in WordNet descriptions.
GR – GoogleRank: the number of web pages returned by the Google search engine for a given word.
BNC – word statistics taken from the British National Corpus.
Initial semantic space reduced to 94 objects and 72 features.
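A minimal sketch of popularity-based truncation of the semantic space; how the three ranks are combined (here, an average of per-source ranks) is an assumption, since the slide only names the sources.

```python
# Truncate the keyword set by popularity: each keyword gets IC, GR and BNC
# scores, and only the top-ranked keywords are kept.
def truncate_by_popularity(scores, keep):
    """scores: {keyword: {"IC": ..., "GR": ..., "BNC": ...}}; keep top-`keep`."""
    def avg_rank(keyword):
        ranks = []
        for name in ("IC", "GR", "BNC"):
            ordered = sorted(scores, key=lambda k: scores[k][name], reverse=True)
            ranks.append(ordered.index(keyword))
        return sum(ranks) / len(ranks)
    return sorted(scores, key=avg_rank)[:keep]

scores = {
    "animal": {"IC": 900, "GR": 5_000_000, "BNC": 12000},
    "costa":  {"IC": 3,   "GR": 40_000,    "BNC": 15},
    "snake":  {"IC": 120, "GR": 2_000_000, "BNC": 3000},
}
print(truncate_by_popularity(scores, keep=2))
```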

13 Human interaction knowledge acquisition
Data obtained from machine-readable dictionaries is:
– not complete,
– not common sense,
– sometimes overly specialized,
– and contains some errors.
Knowledge correction in the semantic space:
W_0 – initial weight, initial knowledge (from dictionaries)
ANS – answer given by the user
N – number of answers
β – parameter indicating the importance of the initial knowledge
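The update formula itself did not survive the transcript; below is a hypothetical reconstruction assuming a β-weighted average of the initial dictionary weight and the accumulated user answers. The exact form is an assumption, not the authors' published equation.

```python
# Hypothetical reconstruction: the slide defines W0, ANS, N and beta but the
# formula is missing from the transcript. A natural reading is a beta-weighted
# average of initial knowledge and user answers; this exact form is assumed.
def corrected_weight(w0, answers, beta=2.0):
    """Blend dictionary weight w0 with user answers (e.g. +1 yes / -1 no)."""
    n = len(answers)
    return (beta * w0 + sum(answers)) / (beta + n) if (beta + n) else w0

# Dictionary said 'cobra has legs' with weight 0.3; five users said "no".
print(corrected_weight(0.3, [-1, -1, -1, -1, -1]))
```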

14 Active Dialogues
Dialogues with the user for obtaining new knowledge/features (trigger conditions sketched below):
When the system fails to guess the object: "I give up. Tell me what you thought of." The concept used in the game corrects the semantic space.
When two concepts have the same CDV: "Tell me what is characteristic for ...?" The new keywords for the specified concept are stored in semantic memory.
When the system needs more knowledge about a concept: "I don't have any particular knowledge about ... Tell me more about ..." The system obtains new keywords for the given concept.
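A minimal sketch of the three dialogue triggers; the function signature and the CDV layout are assumptions, while the prompts and trigger conditions come from the slide.

```python
# Return the dialogue prompt, if any, that the described triggers would fire.
def dialogue_prompt(guess_failed, concept, cdv, knowledge_threshold=5):
    if guess_failed:
        return "I give up. Tell me what you thought of?"
    duplicates = [c for c, feats in cdv.items() if c != concept and feats == cdv[concept]]
    if duplicates:
        return f"Tell me what is characteristic for {concept}?"
    if len(cdv.get(concept, {})) < knowledge_threshold:
        return f"I don't have any particular knowledge about {concept}. Tell me more about {concept}."
    return None

cdv = {"zebu": {"mammal": 1}, "bison": {"mammal": 1}}
print(dialogue_prompt(False, "zebu", cdv))
```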

15 Experiments in the animal domain
WordNet, ConceptNet, SUMO/MILO ontologies + the MindNet project as knowledge sources; a relation was added to the SM only if it appeared in at least 2 sources.
Basic space: 172 objects, 475 features, 5031 relations. # features/concept = CDV density. Initial CDV density = 29; adding IS_A relations = 41; adding similar, entails, excludes = 46.
Quality Q = N_S/N = # searches with success / # all searches. Error E = 1 − Q = 1 − N_S/N.
For 10 concepts selected with # features close to the average: Q ~ 0.8; after 5 repetitions E ~ 18%, so some learning is needed.

16 Quality measures
Initial semantic space: average # of games for correct recognition ~2.8; this depends on the number of semantic neighbors close to the concept.
Completeness of concept representation: is the CDV description sufficient to win the game? How far is it from the golden standard (manually created)?
4 measures of concept description quality:
S_d = N_f(GS) − N_f(O) = # golden standard features − # features in O: how many features are still missing compared to the golden standard.
S_GS = Σ_i [1 − δ(CDV_i(GS), CDV_i(O))]: similarity based on co-occurrence.
S_NO = # features in O but not found in GS (reverse of S_GS).
Dif_w = Σ_i |CDV_i(O) − CDV_i(GS)| / m: average difference of O and GS.
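A minimal sketch computing the four measures on CDVs stored as keyword-weight dictionaries; reading the δ in S_GS as agreement on a feature's presence, and the dictionary representation itself, are assumptions made for this illustration.

```python
# Compute the four concept-description quality measures listed above
# for a golden-standard CDV (gs) and a learned CDV (o).
def quality_measures(gs, o):
    keys = set(gs) | set(o)
    m = len(keys)
    s_d = len(gs) - len(o)                  # missing-feature count difference
    s_gs = sum(1 for k in keys
               if (k in gs) != (k in o))    # 1 - delta: features that disagree
    s_no = len(set(o) - set(gs))            # features in O absent from GS
    dif_w = sum(abs(o.get(k, 0.0) - gs.get(k, 0.0)) for k in keys) / m
    return {"S_d": s_d, "S_GS": s_gs, "S_NO": s_no, "Dif_w": dif_w}

golden = {"vertebrate": 1.0, "mammal": 1.0, "long neck": 1.0, "hoof": 1.0}
learned = {"vertebrate": 1.0, "mammal": 1.0, "spots": 1.0}
print(quality_measures(golden, learned))
```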

17 Learning from games
Select O randomly, with preference for a larger number of features: p ~ exp(−N(O)/N), where N(O) = # features in O and N = total number of features.
Learning procedure: the CDV(O) representation of the chosen concept O is inspected and, if necessary, corrected; CDV(O) is removed from the memory; the system then tries to learn the concept O by playing the 20 questions game.
Average results for 5 test objects are shown as a function of the number of games. S_NO + S_GS: the graph shows the average growth of the number of features as a function of the number of games played.
Randomization of questions helps to find different features in each game. Average number of games needed to learn the selected concepts: N_f = 2.7. After the first successful game, once a particular concept had been correctly recognized it was always found properly. After 4 games only a few new features are added.
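A minimal sketch of this learning loop; play_game is a stand-in for the real game engine, and the toy example is illustrative only.

```python
# Pick a test concept with probability proportional to exp(-N(O)/N), wipe its
# CDV, then replay the 20-questions game until the concept is recognized,
# counting the games needed.
import math
import random

def pick_concept(cdv):
    n_total = sum(len(f) for f in cdv.values())
    weights = [math.exp(-len(cdv[c]) / n_total) for c in cdv]
    return random.choices(list(cdv), weights=weights, k=1)[0]

def learn_concept(concept, cdv, play_game, max_games=20):
    cdv[concept] = {}                      # remove the stored representation
    for game in range(1, max_games + 1):
        recognized, new_features = play_game(concept, cdv)
        cdv[concept].update(new_features)  # knowledge gained from the dialogue
        if recognized:
            return game                    # number of games needed
    return None

def fake_play_game(concept, cdv):
    # Toy stand-in: the game succeeds once 3 features have been gathered.
    return len(cdv[concept]) >= 3, {f"feature_{len(cdv[concept])}": 1.0}

cdv = {"giraffe": {"long neck": 1.0}, "cobra": {"venomous": 1.0, "reptile": 1.0}}
print(learn_concept(pick_concept(cdv), cdv, fake_play_game))
```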

18 Few conclusions
Complex knowledge in frames is not very useful for large-scale search. Semantic search requires extensive knowledge. We do not have even the simplest common-sense knowledge description, and in many applications such simple representations are sufficient. It should be easier to generate this knowledge than to wait for embodied systems.
Semantic memory is built from parsing dictionaries, encyclopedias, ontologies, and the results of collaborative projects. Active search is used to assign features found for concepts that are not far apart in the ontology (for example, have the same parents).
A large-scale effort to create a numerical version of WordNet for general applications is necessary; specialized knowledge is also important. Word games may help to create and correct some knowledge. 20Q is easier than the Turing test, a good intermediate step. Time for a word games Olympics!

19 Thank you for lending your ear... Google: W. Duch => Papers, talks

