COMP3740 CR32: Knowledge Management and Adaptive Systems Overview and example KM exam questions By Eric Atwell, School of Computing, University of Leeds
S1: Eric Atwell Office: 6.06a S2: Vania Dimitrova Office: 9.10p
Semester 1 Topics in KM
Knowledge in Knowledge Management
– the nature of knowledge, definitions and different types
– Knowledge used in Knowledge Based Systems, KM systems
Knowledge and Information Retrieval / Extraction
– Analysis of WWW data: Google tools, SketchEngine, BootCat
– IR: finding documents which match keywords / concepts
– IE: extracting key terms, facts (DB-fields) from documents
– Matching user requirements, advanced/intelligent matching
– Mining WWW as source of data and knowledge
Knowledge Discovery
– Collating data in data warehouse; transforming and cleaning
– Cross-industry standard process for data mining (CRISP-DM)
– OLAP, knowledge visualisation, machine learning in WEKA
– Analysis of WWW-sourced data
Past Exam Papers?
One way to see what you need to learn is to look at past exam papers: this gives a bird's-eye view.
COMP3740 CR32 is a new module … BUT developed from
– COMP3410 Technologies for Knowledge Management
– COMP3640 Personalisation and User-Adaptive Systems
For example, a past COMP3410 exam paper covers some topics in CR32.
Q1a: KM for bibliographic search
Serge Sharoff is a lecturer at Leeds University who has published many research papers relating to technologies for knowledge management, for example: …
(i) Imagine you are asked to assess the impact of Dr Sharoff's research by finding a list of papers by other researchers which cite these publications. Suggest three Information Retrieval tools you could use for this task. State an advantage and a disadvantage of each of these three IR tools for this search task, in comparison to the other tools.
A1a: KM for bibliographic search
(i) Name 3 appropriate tools, e.g. Google Scholar, CiteSeer, ISI Web of Knowledge, Google Books.
An appropriate pro and con of each, e.g.:
– Google Scholar: Pro: wider coverage, all publications on open WWW; Con: does not give full references, just URL and some details
– CiteSeer: Pro: stores papers in several formats plus BibTeX references; Con: not as good coverage, esp. interdisciplinary
– ISI Web of Knowledge or Web of Science: Pro: good coverage of top journals, including paid-for; Con: most papers in this field are not in top journals
Q1a (ii): KM doesn't always work
Q: Suggest three reasons why citations for some papers might not be found by any of your suggested IR tools.
A:
- Two of these papers are in Russian, and citations may also be; these tools focus on English-language papers
- Papers in this field are mainly in conference/workshop proceedings, not journals, hence less likely to be indexed by IR tools (esp. Web of Science)
- Older papers may not be online, so are less likely to be found and cited by others
Q1b: Info Retrieval v Info Extraction
What is the difference between Information Retrieval and Information Extraction? A Knowledge Management consultancy aims to build a database of all Data Mining tools available for download via the WWW, including name, cost, implementation language, input/output format(s), and Machine Learning algorithm(s) included; should they use IR or IE for this task, and why?
A1b: Info Retrieval v Info Extraction
IR: finding whole documents which match a query.
IE: extracting data/info from a given text to populate fields in database or knowledge-base records.
Both IR and IE are appropriate: this task requires IR to find DM tool description webpages from the whole WWW, but then finding the specific details in each webpage is an IE task, identifying fields in records for DB population.
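The two-stage answer above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the three page texts, the tool name ToolX, and the functions retrieve and extract are invented for this example, and real IR/IE systems use ranking and far more robust extraction than substring tests and two regular expressions.

```python
import re

# Hypothetical mini "web": page ids mapped to page text (invented for illustration).
pages = {
    "page1": "WEKA is a free data mining tool implemented in Java. Input format: ARFF.",
    "page2": "Our cafe serves coffee and cake every day.",
    "page3": "ToolX is a data mining tool implemented in Python. Cost: 99 USD.",
}

def retrieve(query, docs):
    """IR step: return ids of whole documents that contain every query term."""
    terms = query.lower().split()
    return [doc_id for doc_id, text in docs.items()
            if all(term in text.lower() for term in terms)]

def extract(text):
    """IE step: pull field values out of one document to fill a database record."""
    name = re.match(r"(\w+) is a", text)
    language = re.search(r"implemented in (\w+)", text)
    return {
        "name": name.group(1) if name else None,
        "language": language.group(1) if language else None,
    }

hits = retrieve("data mining tool", pages)      # IR: which pages are relevant?
records = [extract(pages[h]) for h in hits]     # IE: which fields do they contain?
print(hits)
print(records)
```

IR narrows the whole collection down to candidate pages; IE then turns each candidate into a structured record.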
Q1c: using relevance feedback to adapt a query
An IR query finds matching documents. The user may say some are not relevant. Relevance feedback can guide the system to adapt the initial query, so that the new query finds more of the documents the user marked as relevant. This may look complicated, but it's just putting the numbers into the equation…
Relevance feedback example
[4 marks: 1 for correct q vector, 1 for realising each sum covers a single d vector, 1 for 3 weighted vectors, 1 for answer]
$$q' = \alpha q + \beta \sum_{d_i \in HR} \frac{d_i}{|HR|} - \gamma \sum_{d_i \in HNR} \frac{d_i}{|HNR|}$$
$$= 0.5\,q + 0.5\,d_1 - 0.5\,d_4$$
$$= 0.5\,(1.0, 0.6, 0.0, 0.0, 0.0) + 0.5\,(0.8, 0.8, 0.0, 0.0, 0.4) - 0.5\,(0.6, 0.8, 0.4, 0.6, 0.0)$$
$$= (0.5, 0.3, 0.0, 0.0, 0.0) + (0.4, 0.4, 0.0, 0.0, 0.2) - (0.3, 0.4, 0.2, 0.3, 0.0)$$
$$= (0.6, 0.3, -0.2, -0.3, 0.2)$$
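The arithmetic can be checked with a short Python sketch. It assumes the reading of the slide in which all three weights are 0.5, d1 is the single relevant document and d4 the single non-relevant one; the function name rocchio and the parameter names alpha, beta, gamma are labels chosen for this sketch. Applying the subtraction component by component makes the third and fourth components negative.

```python
def rocchio(q, relevant, non_relevant, alpha=0.5, beta=0.5, gamma=0.5):
    """Return alpha*q + beta*centroid(relevant) - gamma*centroid(non_relevant)."""
    def centroid(docs):
        # Mean vector of a list of documents; all zeros if the list is empty.
        if not docs:
            return [0.0] * len(q)
        return [sum(column) / len(docs) for column in zip(*docs)]

    rel = centroid(relevant)
    non = centroid(non_relevant)
    return [alpha * qi + beta * ri - gamma * ni
            for qi, ri, ni in zip(q, rel, non)]

q = [1.0, 0.6, 0.0, 0.0, 0.0]    # initial query vector
d1 = [0.8, 0.8, 0.0, 0.0, 0.4]   # document judged relevant
d4 = [0.6, 0.8, 0.4, 0.6, 0.0]   # document judged not relevant

new_q = rocchio(q, [d1], [d4])
print([round(x, 2) for x in new_q])
```

Terms that occur only in the non-relevant document are pushed below zero, which is how the adapted query steers away from documents like d4.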
Q2: Knowledge processes
In 2008, Leeds University adopted the Blackboard Virtual Learning Environment (VLE) to be used in undergraduate taught modules in all schools and departments. In future, lectures and tutorials may become redundant at Leeds University: if we assume that student learning fits Coleman's model of Knowledge Management processes, then the Virtual Learning Environment provides technologies to deal with all stages in this model. All relevant explicit, implicit, tacit and cultural knowledge can be captured and stored in our Virtual Learning Environment, for students to access using Information Retrieval technologies.
Is this claim plausible? In your answer, explain what is meant by Coleman's model of Knowledge Management processes, citing examples relating to learning and teaching at Leeds University. Define and give relevant examples of the four types of knowledge; and state whether they could be captured and stored in our VLE, and searched for via an Information Retrieval system. [20 marks]
Even an essay has a marking scheme
Key points:
- Coleman process of knowledge gathering/acquisition: big problem would be data capture and preparation
- Coleman process of knowledge storage/organisation: KM/IR could be of great benefit
- Coleman process of knowledge refining/adding value: lectures aim at more than rote learning
- Coleman process of knowledge transfer/dissemination: students prefer human factors of lectures?
- Explicit Knowledge has been articulated
  - example: lecture notes, course handbooks
  - already captured, and already accessible via IR search
- Implicit Knowledge hasn't been articulated (but could be)
  - example: extra material known to lecturer but not on the handouts
  - could potentially be captured; accessible if in text form, e.g. transcripts
- Tacit Knowledge can't be articulated but is done without thinking
  - example: how to design and implement elegant programs
  - tacit knowledge cannot be captured, hence cannot be searched for via IR
- Cultural Knowledge is shared norms/beliefs to enable concerted action
  - example: students cooperate in groupwork
  - written guidelines can be captured and retrieved, but not group spirit
Q3: Data Mining with WEKA
Association rules link arbitrary features; e.g. (center = 0) => (color = 0) (100% - perfect predictor).
Classification rules predict the final feature (class), english = UK/US; e.g. (color …) => (english = UK) (100% - perfect predictor).
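The "100%" attached to a rule is its confidence: of the instances that match the left-hand side, the fraction that also match the right-hand side. A minimal Python sketch follows. The four instances and the condition color == 0 in the classification rule are invented for illustration and are not the exam's data set.

```python
def confidence(instances, antecedent, consequent):
    """Fraction of instances matching the antecedent that also match the consequent."""
    covered = [row for row in instances if antecedent(row)]
    if not covered:
        return None  # the rule covers nothing, so its confidence is undefined
    return sum(1 for row in covered if consequent(row)) / len(covered)

# Hypothetical word counts per document, with the class in the final feature.
data = [
    {"center": 0, "color": 0, "english": "UK"},
    {"center": 0, "color": 0, "english": "UK"},
    {"center": 3, "color": 5, "english": "US"},
    {"center": 1, "color": 2, "english": "US"},
]

# Association rule: links two arbitrary features.
assoc = confidence(data, lambda r: r["center"] == 0, lambda r: r["color"] == 0)

# Classification rule: the right-hand side is always the class feature.
classif = confidence(data, lambda r: r["color"] == 0, lambda r: r["english"] == "UK")

print(assoc, classif)
```

The only structural difference between the two rule types is the right-hand side: any feature for an association rule, the class feature for a classification rule.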
Simple decision tree

      (colorpercent <= 40)
         /          \
       Yes           No
       /              \
      UK               US
How to choose the root?
Aim to balance the decision tree: the best attribute is one which naturally splits instances into homogeneous subtrees with least errors. E.g. (colorpercent <= 40) splits the training set into perfectly-predictive subsets.
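One common way to score how homogeneous the subtrees are is information gain; the slide does not name a specific measure, so treat this as one possible choice. The sketch below compares two candidate root tests on an invented four-document training set. The attribute length and all values are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels (0 for a pure subset)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def info_gain(rows, test, label="english"):
    """Entropy reduction obtained by splitting rows on a yes/no test."""
    labels = [r[label] for r in rows]
    yes = [r[label] for r in rows if test(r)]
    no = [r[label] for r in rows if not test(r)]
    remainder = sum(len(part) / len(rows) * entropy(part)
                    for part in (yes, no) if part)
    return entropy(labels) - remainder

# Hypothetical training set: two UK and two US documents.
train = [
    {"colorpercent": 0,   "length": 500,  "english": "UK"},
    {"colorpercent": 10,  "length": 2000, "english": "UK"},
    {"colorpercent": 90,  "length": 600,  "english": "US"},
    {"colorpercent": 100, "length": 3000, "english": "US"},
]

gain_color = info_gain(train, lambda r: r["colorpercent"] <= 40)
gain_length = info_gain(train, lambda r: r["length"] <= 1000)
print(gain_color, gain_length)
```

On this toy data the colorpercent test separates the classes completely, while the length test leaves both subsets mixed, so colorpercent would be chosen as the root.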
Confusion matrix depends on the decision point given in (b); e.g. for (colorpercent <= 40) we get 2 wrong classifications:

=== Confusion Matrix ===
 a b   <-- classified as
 1 2 | a = UK
 0 0 | b = US
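Each row of the matrix is an actual class and each column a predicted class, so the wrong classifications are the off-diagonal cells. A small Python sketch of the counting follows. The three test instances are hypothetical, chosen only so that they reproduce the matrix above, and confusion_matrix is a helper written for this example rather than WEKA's own code.

```python
def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

classes = ["UK", "US"]
actual = ["UK", "UK", "UK"]      # hypothetical test set: three UK documents
predicted = ["UK", "US", "US"]   # the tree labels two of them as US

matrix = confusion_matrix(actual, predicted, classes)
errors = sum(matrix[i][j]
             for i in range(len(classes))
             for j in range(len(classes)) if i != j)
print(matrix, errors)
```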
Supervised v unsupervised ML
Supervised learning involves learning from example instances with the desired "answer" or classification, e.g. building a decision tree to predict the last attribute, English=UK/US, given the ARFF instances.
Unsupervised learning involves learning from example instances without being shown the desired "answer" for each, e.g. clustering instances into groups of similar documents on the basis of discriminative feature-values, not including English as the target class; this may yield another division of documents.
Reminder: bird's-eye overview of KM
– Knowledge in Knowledge Management
– Knowledge and Information Retrieval / Extraction
– Knowledge Discovery
January mock exam: Knowledge Management