ITCS 6010 Natural Language Understanding
Natural Language Processing
- What is it? Studies the problems inherent in the processing and manipulation of natural language; natural language understanding is devoted to making computers "understand" statements written in human languages
- Subset of AI and linguistics
- Has several categories:
  - Open domain question answering
  - Natural language interfaces to databases
  - Text-based natural language research
  - Dialogue-based natural language research
Text-based Research
- Performed with respect to text-based applications, e.g. magazines, newspapers, messages
- Information extraction and comprehension
- Document retrieval
- Translation
- Summarization
Dialogue-based Research
- Performed with respect to dialogue-based applications, e.g. question-answering systems, automated help centers
- Difficult because dialogue-based applications must manage a naturally flowing dialogue between the user and the interface
Natural Language Interface to Databases (NLIDB)
- Allows users to access information from a database using natural language queries
- Removes the user’s need to know the structure of the database and details about the data
- Examples: RENDEZVOUS, ASK, and LANGUAGEACCESS
Architecture of NLIDB
- General architecture:
  - Linguistic front-end
  - Database back-end
- Front-end:
  - Accepts a natural language question as input
  - Translates the question into a meaning representation language (MRL)
Architecture of NLIDB (cont’d)
- Back-end:
  - Accepts the MRL
  - Translates the MRL into the supported database language
  - Executes the query
- Results typically presented in a subset of natural language
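The front-end/back-end split above can be sketched as two translation steps. This is a toy illustration, not the design of RENDEZVOUS, ASK, or LANGUAGEACCESS: the question pattern, the dict-based MRL, and the `employees` table are all invented assumptions.

```python
# Minimal NLIDB sketch: NL question -> MRL -> SQL.
# Grammar, MRL shape, and table/column names are illustrative assumptions.
import re

def front_end(question: str) -> dict:
    """Front-end: translate a natural language question into a
    meaning representation language (MRL) term, here a simple dict."""
    m = re.match(r"who works in the (\w+) department", question.lower())
    if not m:
        raise ValueError("question not covered by this toy grammar")
    return {"select": "name", "from": "employees", "where": ("dept", m.group(1))}

def back_end(mrl: dict) -> str:
    """Back-end: translate the MRL term into the supported
    database language (SQL here)."""
    col, val = mrl["where"]
    return f"SELECT {mrl['select']} FROM {mrl['from']} WHERE {col} = '{val}'"

query = back_end(front_end("Who works in the sales department?"))
```

Keeping the MRL in between means the linguistic front-end can be reused over different database back-ends.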
Natural Language Question Answering (NLQA)
- Process of retrieving answers for questions
- Questions posed in natural language
- Precise answer presented in natural language
Question Answering Using Statistical Model (QASM)
- Converts natural language questions into search-engine-specific queries
- Premise: there exists a best possible operator to apply to a natural language question
QASM (cont’d)
- How it works:
  - A classifier determines the best operator to apply to a NL question
  - The operator produces a new query that improves upon the original
  - Operators are matched to question-answer pairs
  - An expectation maximization (EM) algorithm handles the missing data, i.e. paraphrased questions
  - Iteratively maximizes the likelihood estimate
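The operator-selection step above can be sketched as follows. The three operators and the scorer are invented stand-ins: in QASM the classifier is trained on question-answer pairs (with EM over the unlabeled paraphrases), whereas here a trivial hand-written scorer picks the rewrite.

```python
# Sketch of QASM-style operator selection: each operator rewrites the NL
# question into a search-engine query; a scorer (stand-in for the trained
# classifier) picks the operator expected to perform best.
STOPWORDS = {"what", "is", "the", "of", "a", "who", "where"}

def op_identity(q):
    """Leave the question unchanged."""
    return q

def op_delete_stopwords(q):
    """Drop function words, keeping content words."""
    return " ".join(w for w in q.split() if w.lower() not in STOPWORDS)

def op_quote_phrase(q):
    """Treat the content words as one exact phrase."""
    return '"' + op_delete_stopwords(q) + '"'

OPERATORS = [op_identity, op_delete_stopwords, op_quote_phrase]

def best_query(question, score):
    """Apply every operator and keep the rewrite the scorer likes most."""
    return max((op(question) for op in OPERATORS), key=score)

# Toy scorer: prefer shorter (more specific) queries.
q = best_query("what is the capital of France", lambda s: -len(s.split()))
```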
Probabilistic Phrase Reranking (PPR)
- PPR: a process that goes through a set of subtasks to retrieve the most relevant answer to a posed question
- Subtasks:
  - Query modulation: the question is converted to an appropriate query
  - Question type recognition: queries are organized according to the question type, i.e. location, definition, person, etc.
  - Document retrieval: the most relevant units of information, e.g. documents, are returned in this stage, i.e. the units with the highest probability of containing the answer
Probabilistic Phrase Reranking (PPR) (cont’d)
- Subtasks (cont’d):
  - Passage/sentence retrieval: sentences, phrases, or textual units that contain answers are identified from the information units returned in the previous task
  - Answer extraction: the chosen textual units are split into phrases; each is a potential answer
  - Phrase/answer reranking: the generated phrases are ranked; the phrase with the greatest probability of containing the correct answer is placed at the top of the list
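The retrieval/extraction/reranking chain can be sketched end-to-end on a toy corpus. Word-overlap counting stands in for PPR's probability model, the two documents are invented, and query modulation and question type recognition are omitted for brevity.

```python
# Toy PPR chain: document retrieval -> passage retrieval ->
# answer extraction -> phrase reranking. Overlap counting is a
# stand-in for the probabilistic scores; the corpus is invented.
import string

def words(text):
    """Lowercased, punctuation-stripped word set of a text."""
    return {w.strip(string.punctuation).lower() for w in text.split()}

def overlap(a, b):
    """Stand-in relevance score: number of shared words."""
    return len(words(a) & words(b))

def ppr(question, corpus):
    # Document retrieval: keep the document most likely to hold the answer.
    doc = max(corpus, key=lambda d: overlap(question, d))
    # Passage/sentence retrieval: best-matching sentence in that document.
    sent = max(doc.split(". "), key=lambda s: overlap(question, s))
    # Answer extraction: split the sentence into candidate phrases.
    phrases = sent.split(", ")
    # Phrase/answer reranking: order candidates, best answer first.
    return sorted(phrases, key=lambda p: overlap(question, p), reverse=True)

corpus = [
    "Paris is the capital of France, a city on the Seine. It is large.",
    "Berlin is the capital of Germany. It lies on the Spree.",
]
ranked = ppr("Where is the capital of France", corpus)
```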
Bayesian Approach
- Uses a probabilistic IR model and Bayes’ Rule
- Goal of the probabilistic model: estimate the probability that a document d_k is relevant (R) to a query q, i.e. P_q(R | d_k)
- Each document is represented by a set of words
  - Words are stemmed (suffixes and prefixes removed)
  - The stemmed words are known as index terms
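The stemming step above can be sketched with a toy suffix stripper. The suffix list is an illustrative assumption; a real system would use a full stemming algorithm (e.g. Porter's).

```python
# Toy stemmer to show how documents become sets of index terms.
# The suffix list is illustrative, not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    """Strip the first matching suffix, keeping a stem of >= 3 letters."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

def index_terms(document):
    """Represent a document by its set of stemmed index terms."""
    return {stem(w.lower().strip(".,")) for w in document.split()}

terms = index_terms("Parsers parsed the parsing rules")
```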
Bayesian Approach (cont’d)
- Each document is represented by a vector t = (t_1, t_2, ..., t_p), where p is the number of index terms
- Bayes’ Rule is applied to the model to express the probability that a document is relevant to a specific query q:
  P_q(R | t) ∝ P_q(t | R) P_q(R)
Bayesian Approach (cont’d)
- Assumption: each word is independent given the relevance (or non-relevance) of the document
- This results in an expression for the log odds of relevance
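Under the independence assumption, the log odds of relevance factor into a sum over index terms (the binary independence model). In the sketch below, the per-term probabilities p_i = P(t_i = 1 | R) and q_i = P(t_i = 1 | not R) are assumed known; in practice they are estimated from retrieved documents.

```python
# Log odds of relevance under word independence:
# log [P(t|R) / P(t|not R)] = sum over present terms of log(p_i/q_i)
#                           + sum over absent terms of log((1-p_i)/(1-q_i))
import math

def log_odds(t, p, q):
    """t: binary index-term vector for one document;
    p[i], q[i]: probability term i appears in relevant / non-relevant docs.
    Returns the log odds, which ranks documents by relevance."""
    total = 0.0
    for ti, pi, qi in zip(t, p, q):
        if ti:
            total += math.log(pi / qi)
        else:
            total += math.log((1 - pi) / (1 - qi))
    return total

# Example values are invented: term 2 is uninformative (p = q),
# so only terms 1 and 3 move the score.
score = log_odds(t=[1, 0, 1], p=[0.8, 0.3, 0.6], q=[0.2, 0.3, 0.4])
```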
Bayesian Approach (cont’d)
- A document is relevant if the user’s needs are satisfied
- The frequencies of terms in the retrieved relevant and non-relevant documents inform the model
- The initial relevance status of documents is unknown, so ad hoc estimation of the probabilistic model parameters is used to determine an initially-ranked list of documents
- Strengths of the approach:
  - An initial document ranking not based on ad hoc considerations is provided
  - An automatic mechanism for learning and incorporating relevance information from other queries is provided
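The learning step above (estimating term probabilities from judged documents) can be sketched as follows. The 0.5/(n+1) smoothing is one common convention, and the judged documents and vocabulary are invented for illustration.

```python
# Estimate P(term present | relevant) from user relevance feedback,
# smoothed so terms unseen in the judged set do not get probability 0.
# The smoothing constant and the example documents are assumptions.
def estimate_p(judged_relevant, vocab):
    """Fraction of judged-relevant documents containing each term,
    with add-0.5 smoothing over n+1."""
    n = len(judged_relevant)
    return {
        term: (sum(term in set(doc.split()) for doc in judged_relevant) + 0.5)
        / (n + 1.0)
        for term in vocab
    }

p = estimate_p(["capital france paris", "france paris seine"],
               vocab=["paris", "berlin"])
```

These estimates would replace the initial ad hoc parameters once relevance judgments accumulate, closing the learning loop described above.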