Automatic creation of concept map from unstructured text in a flective language Krunoslav Žubrinić, PhD University of Dubrovnik Prof. Damir Kalpić, PhD.

Presentation transcript:

Automatic creation of concept map from unstructured text in a flective language
Krunoslav Žubrinić, PhD, University of Dubrovnik
Prof. Damir Kalpić, PhD, University of Zagreb, Faculty of Electrical Engineering and Computing

Contents
Introduction
–Motivation
–Goal and hypothesis for research
Procedure
–Automatic creation of concept map
–Creation of thesaurus
–Method for creation of the concept map
Results
–"Gold standard"
–Assessment of quality for the selected terms
–Assessment of quality for the created concept maps
–Usability assessment for the created concept maps
Conclusion

Motivation
Visualisation gives a structured view of a larger quantity of data in a shorter time.
A concept map is a visualisation tool that is used successfully in education and business.
Creating a concept map is often a problem.
–How can the important concepts and relationships be recognised?
–A mentor or an initial map of the field can help.
–A concept map can be created automatically from the related documents.
–Past research has yielded promising results.

Goal and hypothesis for research
Research goal:
–Design and verify a new method for automatic creation of a concept map from an unstructured textual document in Croatian, construct a prototype and evaluate the achieved results.
Research hypothesis:
–From unstructured text in Croatian, an automatic procedure can create a concept map that represents the key elements of the original text.
–The hypothesis was formulated on the basis of the results of similar research on the creation of concept maps in other languages.

Automatic creation of concept map

Creation of thesaurus

Creation of the thesaurus skeleton
1.Selection of terms: term frequency–inverse document frequency (TF-IDF)
2.Determination of connections:
–Apriori algorithm (similarity links, RT (related term), from terms to the terms of the thesaurus seed)
–Links within the thesaurus seed (RT, hierarchical links BT/NT (broader term/narrower term), links USE/UF (use/use for))
–Links within WordNet (hierarchical links BT/NT, links USE/UF)
3.Determination of concepts:
–Links USE/UF (concept name=USE, terms included in the concept=UF)
–Concept weights are calculated using the CF-IDF (concept frequency–inverse document frequency) measure
Figure: Excerpt from the created thesaurus
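The term-selection step can be illustrated with a minimal TF-IDF sketch. This is only an illustration under stated assumptions: the slides do not say which TF-IDF variant was used, so the relative term frequency and the natural-logarithm IDF below are assumptions, and the function name `tf_idf` and the toy token lists are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one dict (term -> weight) per document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        # Assumed variant: relative term frequency times log(N / df).
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical, already lemmatised toy documents.
docs = [
    ["muzej", "zakon", "muzej", "ravnatelj"],
    ["zakon", "proračun", "vlada"],
    ["zakon", "vlada", "ustanova"],
]
weights = tf_idf(docs)
# Terms of the first document, ordered from highest to lowest weight.
top = sorted(weights[0], key=weights[0].get, reverse=True)
```

A term that occurs in every document ("zakon" here) gets weight zero, which is why it would not be selected. A CF-IDF weight could presumably be obtained the same way after each UF term has been replaced by its USE concept name, but that mapping step is an assumption and is not shown.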

Method for creation of the concept map

Method for creation of the concept map: step 4

Detection of non-hierarchical links
Based on syntagmatic relationships among words in key sentences.
–A key sentence contains at least two concepts.
–The subject (S), predicate (P) and object (O) within a sentence are in focus.
Normalisation of terms
–The most frequently occurring form is selected.
Complex sentences
–From each S-P-O set a proposition is created.
–Incomplete sets are not considered.
Figure: Rules for processing of simple sentences (the ideal case, and the case more frequent in practice)
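The rule that each complete S-P-O set yields one proposition, while incomplete sets are skipped, can be sketched as follows. This is a hedged illustration, not the authors' implementation: the `Triple` type, the `propositions` function and the requirement that both subject and object are known concepts are assumptions, and the triples are hypothetical parser output loosely based on the example document.

```python
from typing import NamedTuple, Optional

class Triple(NamedTuple):
    subject: Optional[str]
    predicate: Optional[str]
    obj: Optional[str]

def propositions(triples, concepts):
    """Turn complete S-P-O sets into (concept, link name, concept) propositions."""
    result = []
    for t in triples:
        if not (t.subject and t.predicate and t.obj):
            continue  # incomplete sets are not considered
        # Assumption: a proposition links two already recognised concepts.
        if t.subject in concepts and t.obj in concepts:
            result.append((t.subject, t.predicate, t.obj))
    return result

# Hypothetical, already normalised input.
concepts = {"ravnatelj", "hrvatski športski muzej", "vlada republike hrvatske"}
triples = [
    Triple("ravnatelj", "upravlja", "hrvatski športski muzej"),
    Triple("vlada republike hrvatske", "osniva", "hrvatski športski muzej"),
    Triple(None, "preuzima", "hrvatski športski muzej"),  # missing subject
]
links = propositions(triples, concepts)
```

In this toy input the third set has no subject, so only two propositions are produced; each one would become a named, non-hierarchical link in the map.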

Method for creation of the concept map: step 4

Detection of hierarchical links
1.Use of the taxonomy in the thesaurus
2.Lexical dispersion settings
3.Setting for distribution hypothesis
Figure: Conditional probabilities for mutual appearance of key concept pairs
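Conditional probabilities for the mutual appearance of concept pairs can be estimated from co-occurrence counts. The sketch below is an assumption about how such a table might be computed: the slides do not state the unit of co-occurrence (a sentence is assumed here) or how the probabilities are turned into hierarchy levels, and the function name and sample data are hypothetical.

```python
from collections import Counter
from itertools import permutations

def conditional_probabilities(units):
    """units: list of concept sets, one per text unit (assumed: one per sentence).

    Returns a dict mapping (a, b) to the estimate of P(b appears | a appears).
    """
    single = Counter()  # number of units containing each concept
    pair = Counter()    # number of units containing both a and b, per ordered pair
    for concepts in units:
        single.update(concepts)
        pair.update(permutations(sorted(concepts), 2))
    return {(a, b): n / single[a] for (a, b), n in pair.items()}

# Hypothetical concept sets for three sentences.
units = [
    {"muzej", "ravnatelj"},
    {"muzej", "zakon"},
    {"muzej"},
]
probs = conditional_probabilities(units)
```

Here "ravnatelj" never occurs without "muzej", while "muzej" usually occurs without "ravnatelj". An asymmetry of this kind is one common heuristic for treating the first concept as subordinate to the second; whether the method described in the slides uses it in this form is not stated.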

Method for creation of the concept map: step 5

Tree trimming to a given size
The given size of the map is 5 concepts.
Figure labels: T=3.7; T=1.04; T=0.4; T=0.44; weight=3.7+3.7+1.04=8.44; weight=3.7+0.4+0.44=4.54; 1 connection; 2 connections; unlinked concept

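Example of a created concept map
Source text (Croatian): Uredba o osnivanju Hrvatskoga športskog muzeja (sažetak)
Vlada Republike Hrvatske na temelju Zakona o muzejima i Zakona o ustanovama osniva Hrvatski športski muzej kao javnu ustanovu od interesa za Republiku Hrvatsku. Muzej obavlja muzejsku djelatnost vezano uz područje fizičke kulture, tjelovježbe, športa i srodnih područja ljudskoga djelovanja sukladno Zakonu o muzejima. Financijska sredstva za obavljanje djelatnosti muzeja se osiguravaju u državnom proračunu Republike Hrvatske, a muzej može stjecati i vlastita sredstva. Muzejom upravlja ravnatelj koji se imenuje na temelju natječaja na vrijeme od četiri godine i može biti ponovno imenovan na istu dužnost. Danom osnivanja Muzej preuzima zatečenu imovinu, stvari, prava, novac i radnike ustrojstvene jedinice Hrvatskoga športskog muzeja Kineziološkog fakulteta u Zagrebu.
English translation: Regulation on the establishment of the Croatian Sports Museum (abstract)
On the basis of the Museums Act and the Institutions Act, the Government of the Republic of Croatia establishes the Croatian Sports Museum as a public institution of interest to the Republic of Croatia. The Museum carries out museum activities related to physical culture, physical exercise, sport and related areas of human activity, in accordance with the Museums Act. Funds for the Museum's activities are provided in the state budget of the Republic of Croatia, and the Museum may also acquire its own funds. The Museum is managed by a director, who is appointed through a public competition for a term of four years and may be reappointed to the same post. On the day of its establishment, the Museum takes over the existing assets, items, rights, money and employees of the Croatian Sports Museum organisational unit of the Faculty of Kinesiology in Zagreb.
Key terms
hrvatski športski muzej (Croatian Sports Museum); vlada republike hrvatske (Government of the Republic of Croatia); javna ustanova (public institution); zakon o muzejima (Museums Act); ravnatelj (director)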

Evaluation of results
Using the described procedure, 121 concept maps were created.
The quality of all the created maps was evaluated.
A so-called "gold standard" was used for the evaluation.

Gold standard
Key words and an informative-indicative abstract for each of the 121 documents.
Twelve individuals from science and higher education took part in its preparation.
–Every document was processed by at least two persons.
The quality of a sample of the gold standard was evaluated.
–55 evaluators each evaluated 4 documents, giving 220 grades.
–The created abstracts and selected key words describe the source documents very well.

Evaluation of quality of selected key terms
Comparison of the key terms selected from the text with the gold standard key words.
Comparison of 3 algorithms (TF-IDF, Apriori and KEA (Keyphrase Extraction Algorithm)) with the reference one.
Comparison of precision (P), recall (R) and their harmonic mean, the F1 measure.
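Precision, recall and F1 for one document can be computed from the overlap between the selected key terms and the gold standard key words. The sketch below uses exact string matching, which is an assumption (the slides do not say how terms were matched). The selected terms are the key terms shown in the example slide; the gold standard set is hypothetical.

```python
def precision_recall_f1(selected, gold):
    """Compare selected key terms with gold standard key words (exact match)."""
    selected, gold = set(selected), set(gold)
    hits = len(selected & gold)
    p = hits / len(selected) if selected else 0.0
    r = hits / len(gold) if gold else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

selected = [
    "hrvatski športski muzej", "vlada republike hrvatske",
    "javna ustanova", "zakon o muzejima", "ravnatelj",
]
# Hypothetical gold standard key words for the same document.
gold = ["hrvatski športski muzej", "zakon o muzejima", "ravnatelj", "državni proračun"]

p, r, f1 = precision_recall_f1(selected, gold)
```

With three of the five selected terms present among four gold standard key words, precision is 3/5 and recall is 3/4.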

Evaluation of quality of the created concept maps
Comparison of the map with the document abstract.
–Questionnaire with 5 Likert-type assertions:
1.The concept map contains the most important terms from the document.
2.Concepts are connected correctly.
3.Links among concepts are properly named.
4.Hierarchy of concepts is correct.
5.Concept map is useful for learning the document contents.
Respondents evaluated each statement with grades 1-5.
–6,854 respondents from science and higher education
–538 (7.9%) respondents fully completed their questionnaires.
–Every respondent evaluated five different randomly chosen maps.
–Altogether, the individual grades were collected.
–A single map was evaluated by 7 to 42 respondents.
–Maps evaluated by fewer than 15 respondents were excluded from the analysis.
–115 maps and their individual grades remained.


Evaluation of quality of the created concept maps
Most frequent comments:
–The quality of the created concept maps varies.
–Some maps are incomplete, unclear or hierarchically wrong.
–Link names should be improved.
–Good-quality maps should first be constructed from one type of document before another type is attempted.
–Promising, but still a lot to be done.
Are the created concept maps good enough for practical application?
The relationship between the characteristics of the original document, the observed characteristics of the respondents (gender, age and job position) and the given grades was tested using the χ² test.
Main conclusions:
–The length of the original document does not affect the achieved results.
–The observed characteristics of the respondents (gender, age and job position) do not affect the achieved results.
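For readers unfamiliar with the χ² test of independence, the statistic can be sketched in a few lines. This shows only the standard Pearson statistic for a contingency table; the function name and both tables are hypothetical, and the slides do not report the actual tables or test statistics.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical counts: rows = two respondent groups, columns = low / high grade.
stat_dep, dof_dep = chi_square([[10, 20], [20, 10]])
stat_ind, dof_ind = chi_square([[10, 10], [10, 10]])
```

A statistic near zero, as in the second table, is what a conclusion of "does not affect the results" corresponds to; the statistic is then compared with the χ² distribution for the given degrees of freedom to obtain a P value.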

Assessment of applicability
The respondents had to find the answers to the posed questions using three kinds of materials:
–The created concept map, the abstract and the source document.
Three questions were selected to be answered using all three kinds of materials.
Two forms of questions: questions with YES/NO expected answers, and questions with multiple expected answers.
–Correct answer: 100% of credits; partly correct answer: 50% of credits; wrong answer: 0 credits.
Besides answering the questions, the respondents had to grade the difficulty of finding the answer on a 5-level scale from 0 to 1 (I could not find the answer=0; I could hardly find it=0.25; rather difficult to find=0.5; rather easy to find=0.75; very easy to find=1).
Number of credits = correctness * difficulty of finding the answer
Example: a partly correct answer which was found rather easily: 0.5 * 0.75 = 0.375
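The scoring rule can be written out directly. The values come from the slide; the dictionary keys and the function name `credits` are hypothetical labels chosen for this sketch.

```python
# Correctness of the answer, as a share of the full credit.
CORRECTNESS = {"correct": 1.0, "partly correct": 0.5, "wrong": 0.0}

# 5-level scale for how hard the answer was to find (higher = easier).
DIFFICULTY = {
    "could not find": 0.0,
    "could hardly find": 0.25,
    "rather difficult to find": 0.5,
    "rather easy to find": 0.75,
    "very easy to find": 1.0,
}

def credits(correctness, difficulty):
    """Number of credits = correctness * difficulty grade."""
    return CORRECTNESS[correctness] * DIFFICULTY[difficulty]

example = credits("partly correct", "rather easy to find")
```

The call reproduces the example from the slide, 0.5 * 0.75 = 0.375. Because the two factors are multiplied, a wrong answer earns no credits however easy it was to find, and an answer that could not be found earns none either.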

Assessment of applicability
The statistical significance of differences among the respondents' results with the different materials was tested using the t-test for paired samples.
The materials used and the respondents' characteristics have a statistically significant impact on the results.
–However, the difference is small and has little practical value.
The worst results were achieved with the source documents.
–Were the answers difficult to find because of the length of the source text?
–Lack of concentration by respondents?
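The paired-samples t statistic is computed from the per-respondent differences between two conditions. The sketch below shows the standard formula only; the scores are made-up illustrative numbers, not data from the study, and the function name is hypothetical.

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    """t statistic and degrees of freedom of the paired-samples t-test."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical credits of five respondents with two kinds of material.
with_map = [0.75, 0.5, 1.0, 0.375, 0.75]
with_source = [0.5, 0.5, 0.75, 0.25, 0.5]

t, dof = paired_t(with_map, with_source)
```

The test is paired because every respondent answered with each kind of material, so each respondent serves as their own control. With a large sample, even a small mean difference can be statistically significant, which is consistent with the remark that the difference found has little practical value.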

Assessment of applicability
Figure panels, each relating ease of information retrieval to collected grades: Concept maps; Abstracts; Source documents
Reported values: ρ=0.538, P<0.001; ρ=0.573, P<0.001; ρ=0.542, P<0.001
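The symbol ρ suggests a rank correlation such as Spearman's coefficient, although the slides do not name the coefficient, so that reading is an assumption. Under that assumption, the computation can be sketched as the Pearson correlation of the ranks, with tied values given their average rank. All names below are hypothetical.

```python
from statistics import mean

def ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

rho_up = spearman([0.25, 0.5, 0.75, 1.0], [0.1, 0.4, 0.5, 0.9])
rho_down = spearman([0.25, 0.5, 0.75, 1.0], [0.9, 0.5, 0.4, 0.1])
```

Tie handling matters here because the ease-of-retrieval scale has only five levels, so many respondents share the same value.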

Conclusion
The following contributions have been achieved:
–A new method to create concept maps from unstructured text in a flective (i.e. Croatian) language, combining statistical and machine learning methods with a dictionary of terms and with linguistic tools and resources specific to the Croatian language.
–A new method to determine the hierarchy level of concepts in a concept map, based on links to other concepts and on the positions in the document where the concept is present.
–A proposal for a semi-automatic procedure to create a dictionary of terms in a problem domain, to be used for recognition of concepts and concept map links. By applying this procedure, a dictionary of terms in a selected area was formed.
The results achieved using the prototype confirm the stated hypothesis.
Future work:
–Improvement of the algorithms, implementation of critical processes, improvement of the graphical presentation of results and shortening of the processing time.