1 Noisy Text Analytics: An Exercise in Futility? Hwee Tou Ng Department of Computer Science National University of Singapore 8 Jan 2007.

Presentation transcript:

1 Noisy Text Analytics: An Exercise in Futility? Hwee Tou Ng Department of Computer Science National University of Singapore 8 Jan 2007

2 Noisy Text Analytics: An Exercise in Futility?

3 Sources of Noisy Text
Traditional sources
–Automatically transcribed text from speech
–Automatically OCRed text from image

5 Sources of Noisy Text
More recent sources from the Web
–Blogs, wikis, message boards, online chats, SMS, etc.
–User generated content
–Informal text
  »Acronyms, abbreviations, specialized vocabulary
  »Sublanguage, sub-community

6 Importance
The rise of social media (Web 2.0)
–Commercial, economic interest

7 Importance
ACL SIGWAC (Special Interest Group on the Web as Corpus, Association for Computational Linguistics)
–CLEANEVAL (shared task and competition for web corpus cleaning)

8 Noisy Text Analytics: An Exercise in Futility?

9 An Exercise in Futility?
Necessity is the mother of invention!

10 Noisy Text Analytics: An Exercise in Futility?

11 What is Analytics?
American Heritage Dictionary
–The branch of logic dealing with analysis
Merriam-Webster's Online Dictionary
–The method of logical analysis

12 Analytics
Approach #1
–Eliminate the noise in noisy text (text normalization), then process the text as usual
  »Noise: misspelled words, wrongly cased words, wrong sentence and paragraph boundaries
–Examples:
  »Table recognition
    Learning to Recognize Tables in Free Text, H T Ng, C Y Lim, J L T Koo, ACL 1999
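
To make the normalize-then-process idea of Approach #1 concrete, here is a minimal Python sketch. It is not taken from the talk or from the cited paper: the abbreviation table, the function name `normalize`, and the choice to lower-case everything before restoring sentence-initial capitals are all assumptions made for illustration.

```python
import re

# Hypothetical lookup table of informal abbreviations (an assumption for this
# sketch); a real system would need a much larger, domain-specific lexicon.
ABBREVIATIONS = {
    "u": "you",
    "r": "are",
    "pls": "please",
    "thx": "thanks",
    "2moro": "tomorrow",
}


def normalize(text: str) -> str:
    """Squeeze whitespace, expand known abbreviations, and repair casing."""
    # Collapse runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()

    tokens = []
    for token in text.split(" "):
        # Split the token into a word and any trailing punctuation.
        match = re.match(r"^(.*?)([.,!?]*)$", token)
        word, punct = match.group(1), match.group(2)
        # Lower-case the word and expand it if it is a known abbreviation.
        lowered = word.lower()
        tokens.append(ABBREVIATIONS.get(lowered, lowered) + punct)

    sentence = " ".join(tokens)
    # Restore a capital letter at the start of the text and after ., ! or ?
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        sentence,
    )


print(normalize("PLS  call me 2moro. thx!"))
# Please call me tomorrow. Thanks!
```

Lower-casing every word also discards the capitalization of proper nouns, so a downstream step that relies on case (named entity recognition, for example) would need a smarter case-restoration step than this sketch provides.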

13 Table Recognition

14 Table Recognition

15 Table Recognition

16 Analytics
Approach #2
–Process the noisy text directly, as is
–Examples:
  »Upper case text (e.g., speech recognizer output)
    Teaching a Weaker Classifier: Named Entity Recognition on Upper Case Text, H L Chieu, H T Ng, ACL 2002
  »Semi-structured text (e.g., seminar announcements, job advertisements)
    A Maximum Entropy Approach to Information Extraction from Semi-Structured and Free Text, H L Chieu, H T Ng, AAAI 2002
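
One way to picture Approach #2 for upper case text is to describe each token with features that do not depend on capitalization, so a classifier can work on the noisy input directly. The Python sketch below is only an illustration under that assumption and is not the method of the cited papers: the word lists, the feature names, and the function `token_features` are all hypothetical.

```python
# Hypothetical, tiny word lists (assumptions for this sketch); a real system
# would load much larger lists from external resources.
FIRST_NAMES = {"john", "mary"}
TITLES = {"mr", "mrs", "dr"}
ORG_SUFFIXES = {"inc", "corp", "ltd"}


def clean(token: str) -> str:
    """Lower-case a token and drop surrounding periods."""
    return token.lower().strip(".")


def token_features(tokens: list[str], i: int) -> dict:
    """Build case-independent features for the token at position i."""
    word = clean(tokens[i])
    prev_word = clean(tokens[i - 1]) if i > 0 else "<s>"
    next_word = clean(tokens[i + 1]) if i + 1 < len(tokens) else "</s>"
    return {
        "word": word,
        "prev_word": prev_word,
        "next_word": next_word,
        # Lexicon and context cues stand in for the missing capitalization cue.
        "in_first_names": word in FIRST_NAMES,
        "prev_is_title": prev_word in TITLES,
        "next_is_org_suffix": next_word in ORG_SUFFIXES,
        "has_digit": any(ch.isdigit() for ch in word),
    }


tokens = "MR. JOHN SMITH JOINED ACME CORP.".split()
print(token_features(tokens, 1))
```

Feature dictionaries of this kind could then be passed to any statistical classifier; the point of the sketch is only that no feature inspects the case of the input.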