Name Extraction from Chinese Novels CS224n Spring 2008 Jing Chen and Raylene Yung.

Slides:



Advertisements
Similar presentations
Rationale for a multilingual corpus for machine translation evaluation Debbie Elliott Anthony Hartley Eric Atwell Corpus Linguistics 2003, Lancaster, England.
Advertisements

Evaluation of Decision Forests on Text Categorization
Political Party, Gender, and Age Classification Based on Political Blogs Michelle Hewlett and Elizabeth Lingg.
1 SELC:A Self-Supervised Model for Sentiment Classification Likun Qiu, Weishi Zhang, Chanjian Hu, Kai Zhao CIKM 2009 Speaker: Yu-Cheng, Hsieh.
Chinese Word Segmentation Method for Domain-Special Machine Translation Su Chen; Zhang Yujie; Guo Zhen; Xu Jin’an Beijing Jiaotong University.
Stock Volatility Prediction using Earnings Calls Transcripts and their Summaries Naveed Ahmad Aram Zinzalian.
Named Entity Recognition for Digitised Historical Texts by Claire Grover, Sharon Givon, Richard Tobin and Julian Ball (UK) presented by Thomas Packer 1.
Automatic Name Transliteration via OCR and NLP Yu Cao Tao Wang.
Page-level Template Detection via Isotonic Smoothing Deepayan ChakrabartiYahoo! Research Ravi KumarYahoo! Research Kunal PuneraUniv. of Texas at Austin.
Multi-Class Object Recognition Using Shared SIFT Features
Learning Object Identification Rules for Information Integration Sheila Tejada Craig A. Knobleock Steven University of Southern California.
Opening Computational Door on Knock Knock Jokes Julia M. Taylor & Lawrence J. Mazlack Applied Artificial Intelligence Laboratory University of Cincinnati.
Minimum Error Rate Training in Statistical Machine Translation By: Franz Och, 2003 Presented By: Anna Tinnemore, 2006.
MT Summit VIII, Language Technologies Institute School of Computer Science Carnegie Mellon University Pre-processing of Bilingual Corpora for Mandarin-English.
1 Incorporating N-gram Statistics in the Normalization of Clinical Notes By Bridget Thomson McInnes.
Finding Advertising Keywords on Web Pages Scott Wen-tau YihJoshua Goodman Microsoft Research Vitor R. Carvalho Carnegie Mellon University.
Face Detection CSE 576. Face detection State-of-the-art face detection demo (Courtesy Boris Babenko)Boris Babenko.
Search is not only about the Web An Overview on Printed Documents Search and Patent Search Walid Magdy Centre for Next Generation Localisation School of.
An Automatic Segmentation Method Combined with Length Descending and String Frequency Statistics for Chinese Shaohua Jiang, Yanzhong Dang Institute of.
Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews K. Dave et al, WWW 2003, citations Presented by Sarah.
Language Identification of Search Engine Queries Hakan Ceylan Yookyung Kim Department of Computer Science Yahoo! Inc. University of North Texas 2821 Mission.
Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1.
An Integrated Approach for Arabic-English Named Entity Translation Hany Hassan IBM Cairo Technology Development Center Jeffrey Sorensen IBM T.J. Watson.
A search-based Chinese Word Segmentation Method ——WWW 2007 Xin-Jing Wang: IBM China Wen Liu: Huazhong Univ. China Yong Qin: IBM China.
Exploiting Domain Structure for Named Entity Recognition Jing Jiang & ChengXiang Zhai Department of Computer Science University of Illinois at Urbana-Champaign.
Morpho Challenge competition Evaluations and results Authors Mikko Kurimo Sami Virpioja Ville Turunen Krista Lagus.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Chinese Word Segmentation and Statistical Machine Translation Presenter : Wu, Jia-Hao Authors : RUIQIANG.
A daptable A utomatic E valuation M etrics for M achine T ranslation L ucian V lad L ita joint work with A lon L avie and M onica R ogati.
1 Sentence-extractive automatic speech summarization and evaluation techniques Makoto Hirohata, Yosuke Shinnaka, Koji Iwano, Sadaoki Furui Presented by.
DISCRIMINATIVE TRAINING OF LANGUAGE MODELS FOR SPEECH RECOGNITION Hong-Kwang Jeff Kuo, Eric Fosler-Lussier, Hui Jiang, Chin-Hui Lee ICASSP 2002 Min-Hsuan.
1 Learning Sub-structures of Document Semantic Graphs for Document Summarization 1 Jure Leskovec, 1 Marko Grobelnik, 2 Natasa Milic-Frayling 1 Jozef Stefan.
Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.
A Systematic Exploration of the Feature Space for Relation Extraction Jing Jiang & ChengXiang Zhai Department of Computer Science University of Illinois,
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
Copyright © 2013 by Educational Testing Service. All rights reserved. 14-June-2013 Detecting Missing Hyphens in Learner Text Aoife Cahill *, Martin Chodorow.
Chapter 23: Probabilistic Language Models April 13, 2004.
An Iterative Approach to Extract Dictionaries from Wikipedia for Under-resourced Languages G. Rohit Bharadwaj Niket Tandon Vasudeva Varma Search and Information.
Presented By Lingzhou Lu & Ziliang Jiao. Domain ● Optical Character Recogntion (OCR) ● Upper-case letters only.
Improving Named Entity Translation Combining Phonetic and Semantic Similarities Fei Huang, Stephan Vogel, Alex Waibel Language Technologies Institute School.
From Text to Image: Generating Visual Query for Image Retrieval Wen-Cheng Lin, Yih-Chen Chang and Hsin-Hsi Chen Department of Computer Science and Information.
A Word Clustering Approach for Language Model-based Sentence Retrieval in Question Answering Systems Saeedeh Momtazi, Dietrich Klakow University of Saarland,Germany.
Lecture 8: Feature matching CS6670: Computer Vision Noah Snavely.
Latent Topic Modeling of Word Vicinity Information for Speech Recognition Kuan-Yu Chen, Hsuan-Sheng Chiu, Berlin Chen ICASSP 2010 Hao-Chin Chang Department.
A DYNAMIC APPROACH TO THE SELECTION OF HIGH ORDER N-GRAMS IN PHONOTACTIC LANGUAGE RECOGNITION Mikel Penagarikano, Amparo Varona, Luis Javier Rodriguez-
Multi-level Bootstrapping for Extracting Parallel Sentence from a Quasi-Comparable Corpus Pascale Fung and Percy Cheung Human Language Technology Center,
A New Approach for English- Chinese Named Entity Alignment Donghui Feng Yayuan Lv Ming Zhou USC MSR Asia EMNLP-04.
Digital Image Processing
Text Categorization by Boosting Automatically Extracted Concepts Lijuan Cai and Tommas Hofmann Department of Computer Science, Brown University SIGIR 2003.
Information Extraction Entity Extraction: Statistical Methods Sunita Sarawagi.
Large Vocabulary Data Driven MT: New Developments in the CMU SMT System Stephan Vogel, Alex Waibel Work done in collaboration with: Ying Zhang, Alicia.
HANGMAN OPTIMIZATION Kyle Anderson, Sean Barton and Brandyn Deffinbaugh.
1 ICASSP Paper Survey Presenter: Chen Yi-Ting. 2 Improved Spoken Document Retrieval With Dynamic Key Term Lexicon and Probabilistic Latent Semantic Analysis.
Maximum Entropy techniques for exploiting syntactic, semantic and collocational dependencies in Language Modeling Sanjeev Khudanpur, Jun Wu Center for.
Recent Paper of Md. Akmal Haidar Meeting before ICASSP 2013 報告者:郝柏翰 2013/05/23.
The University of Illinois System in the CoNLL-2013 Shared Task Alla RozovskayaKai-Wei ChangMark SammonsDan Roth Cognitive Computation Group University.
Sentiment Analysis CMPT 733. Outline What is sentiment analysis? Overview of approach Feature Representation Term Frequency – Inverse Document Frequency.
Language Identification and Part-of-Speech Tagging
A Simple Approach for Author Profiling in MapReduce
A Straightforward Author Profiling Approach in MapReduce
CRF &SVM in Medication Extraction
The Greedy Method and Text Compression
CLSciSumm-2018 What to submit Task Framework Task 1A Task 1B
Aline Martin ECE738 Project – Spring 2005
PAE-DIRT Paraphrase Acquisition Enhancing DIRT
CS224N: Query Focused Multi-Document Summarization
Statistical NLP Spring 2011
Clinically Significant Information Extraction from Radiology Reports
Introduction to Sentiment Analysis
Lecture 6: Feature matching
Presentation transcript:

Name Extraction from Chinese Novels CS224n Spring 2008 Jing Chen and Raylene Yung

Problem Given a Chinese novel, extract the names of people and locations Different from English NER: no whitespace within sentences, no capitalization Can use other characteristics since the domain is limited

System Outline Extract bigrams, trigrams, and quadrigrams from text Run logistic regression on extracted features to learn feature weights Use weights to compute a score for each n-gram Apply thresholding to limit the number of guessed names Use word lists from word segmenter and dictionary Compare output list to correct list for F1 score

Features N-gram and segmented word counts Ratio of count of n-gram to (n-1)-gram Transliterated characters Prefixes and suffixes Segmented words and dictionary Mutual information

Thresholding Otsu’s method: Often used in image processing Separates data into two classes, minimizing the variance within the classes Does not depend on training data F1 Maximization Find the threshold on training data that maximizes F1 score Use same threshold on test data

Results No validation set, so chose a baseline set Ablation tests show that the baseline chosen was non-optimal Best individual scores: F1 scorePrecisionRecall F1 scorePrecisionRecall Recall Precision

Conclusion Most useful features: N-gram counts / frequency ratios (0.46F1 alone) Varies depending on type of n-gram Thresholding Otsu’s method yielded better overall performance Both methods had drawbacks Future work More rigorous feature set testing Larger / cleaner data sets