What Is Text Mining? Also known as Text Data Mining Process of examining large collections of unstructured textual resources in order to generate new.

Slides:



Advertisements
Similar presentations
Social Media.
Advertisements

Creating Collaborative Partnerships
1.Data categorization 2.Information 3.Knowledge 4.Wisdom 5.Social understanding Which of the following requires a firm to expend resources to organize.
Title Course opinion mining methodology for knowledge discovery, based on web social media Authors Sotirios Kontogiannis Ioannis Kazanidis Stavros Valsamidis.
Information Retrieval in Practice
April 22, Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Doerre, Peter Gerstl, Roland Seiffert IBM Germany, August 1999 Presenter:
Chapter 3 Database Management
TC2-Computer Literacy Mr. Sencer February 4, 2010.
Recommender systems Ram Akella February 23, 2011 Lecture 6b, i290 & 280I University of California at Berkeley Silicon Valley Center/SC.
Chapter 14 The Second Component: The Database.
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dörre, Peter Gerstl, and Roland Seiffert Presented By: Jake Happs,
Outline Chapter 1 Hardware, Software, Programming, Web surfing, … Chapter Goals –Describe the layers of a computer system –Describe the concept.
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Huimin Ye.
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Drew DeHaas.
DASHBOARDS Dashboard provides the managers with exactly the information they need in the correct format at the correct time. BI systems are the foundation.
Overview of Search Engines
Relevant words extraction method for recommender system Presentation slides.
Your Interactive Guide to the Digital World Discovering Computers 2012.
LÊ QU Ố C HUY ID: QLU OUTLINE  What is data mining ?  Major issues in data mining 2.
OLAM and Data Mining: Concepts and Techniques. Introduction Data explosion problem: –Automated data collection tools and mature database technology lead.
Data Mining Techniques
More than words: Social networks’ text mining for consumer brand sentiments A Case on Text Mining Key words: Sentiment analysis, SNS Mining Opinion Mining,
Chapter 3 Application Software.
Attention and Event Detection Identifying, attributing and describing spatial bursts Early online identification of attention items in social media Louis.
«Tag-based Social Interest Discovery» Proceedings of the 17th International World Wide Web Conference (WWW2008) Xin Li, Lei Guo, Yihong Zhao Yahoo! Inc.,
Module 3: Business Information Systems Chapter 11: Knowledge Management.
© Paradigm Publishing, Inc. 5-1 Chapter 5 Application Software Chapter 5 Application Software.
Copyright © 2003 by Prentice Hall Computers: Tools for an Information Age Chapter 13 Database Management Systems: Getting Data Together.
Research paper: Web Mining Research: A survey SIGKDD Explorations, June Volume 2, Issue 1 Author: R. Kosala and H. Blockeel.
16-1 The World Wide Web The Web An infrastructure of distributed information combined with software that uses networks as a vehicle to exchange that information.
Term 2, 2011 Week 1. CONTENTS Types and purposes of graphic representations Spreadsheet software – Producing graphs from numerical data Mathematical functions.
Chapter 7 Structuring System Process Requirements
AVI/Psych 358/IE 340: Human Factors Web 2.0 November
Chapter 7 Web Content Mining Xxxxxx. Introduction Web-content mining techniques are used to discover useful information from content on the web – textual.
 The ability to develop step by step procedures for solving problems  She uses algorithmic thinking by setting up her charts.
Knowing Your Facebook From Your Flickr Dan O’ Neill – -
Defining Text Mining Preprocessing Transforming unstructured data stored in document collections into a more explicitly structured intermediate format.
Introduction to Web Mining Spring What is data mining? Data mining is extraction of useful patterns from data sources, e.g., databases, texts, web,
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
Pete Bohman Adam Kunk. What is real-time search? What do you think as a class?
Thanks to Bill Arms, Marti Hearst Documents. Last time Size of information –Continues to grow IR an old field, goes back to the ‘40s IR iterative process.
10-1 Copyright © 2010 Pearson Education, Inc. publishing as Prentice Hall.
WebMining Web Mining By- Pawan Singh Piyush Arora Pooja Mansharamani Pramod Singh Praveen Kumar 1.
Pete Bohman Adam Kunk. Real-Time Search  Definition: A search mechanism capable of finding information in an online fashion as it is produced. Technology.
Module 5 A system where in its parts perform a unified job of receiving inputs, processes the information and transforms the information into a new kind.
Text Feature Extraction. Text Classification Text classification has many applications –Spam detection –Automated tagging of streams of news articles,
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
IS 325 Notes for Wednesday August 28, Data is the Core of the Enterprise.
DATABASE MANAGEMENT SYSTEMS CMAM301. Introduction to database management systems  What is Database?  What is Database Systems?  Types of Database.
Chapter 5: Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization DECISION SUPPORT SYSTEMS AND BUSINESS.
1 Technology in Action Chapter 11 Behind the Scenes: Databases and Information Systems Copyright © 2010 Pearson Education, Inc. Publishing as Prentice.
OCLC Online Computer Library Center 1 Social Media and Advocacy.
Discovering Computers Fundamentals, Third Edition CGS 1000 Introduction to Computers and Technology Summer 2007.
Digital Video Library Network Supervisor: Prof. Michael Lyu Student: Ma Chak Kei, Jacky.
Project Management May 30th, Team Members Name Project Role Gint of Communications Sai
Google search in general  Google Search, commonly referred to as Google Web Search or just Google, is a web search engine owned by Google Inc. It is.
Learning Objectives Understand the concepts of Information systems.
Social Media & Social Networking 101 Canadian Society of Safety Engineering (CSSE)
Chapter 8: Web Analytics, Web Mining, and Social Analytics
Chapter 1 Overview of Databases and Transaction Processing.
Multi-Class Sentiment Analysis with Clustering and Score Representation Yan Zhu.
Lecture-6 Bscshelp.com. Todays Lecture  Which Kinds of Applications Are Targeted?  Business intelligence  Search engines.
Data mining in web applications
PRIMARY DATA vs SECONDARY DATA RESEARCH Lesson 23 June 2016
Objectives Overview Identify the four categories of application software Describe characteristics of a user interface Identify the key features of widely.
What Is Cluster Analysis?
Application Software Chapter 6.
Mining the Data Charu C. Aggarwal, ChengXiang Zhai
Data Warehousing and Data Mining
CSE 635 Multimedia Information Retrieval
Presentation transcript:

What Is Text Mining? Also known as Text Data Mining Process of examining large collections of unstructured textual resources in order to generate new information, typically using specialized computer software

Why Do We Use Text Mining? Turn text into data for analysis Generate new information Populate a database with the information extracted

Where Do We See It Being Used?

Applications Enterprise Business Intelligence Healthcare/Medical Records National Security Scientific Discovery Sentiment Analysis Tools Natural Language Service Publishing Automated Ad Placement Information Access Social Media Monitoring

Text Mining Process

Text Collect large volume of textual data Text Characteristics: High dimensionality w/ tens of thousands of words Noisy data Erroneous data or misleading data Unstructured text Written resources, chat room conversations, or normal speech Ambiguity Word ambiguity or sentence ambiguity

Text Preprocessing Text Cleanup Normalize texts converted from binary formats (programs, media, images, and most compressed files) Deal with tables, figures, and formulas Tokenization Process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens

Attribute Generation Text document is represented by the words (features) it contains and their occurrences Two approaches to generate attributes/document representation: Bag of Words Model, used in methods of document classification, where the (frequency of) occurrence of each word is used as a feature Vector Space Model, used cosine similarity to calculate a number that describes the similarity among documents

Attribute Selection Further reduction of high dimensionality Analysts have difficulty addressing tasks with high dimensionality Features Selection Select just a subset of the features to represent a document Not all features help Remove stop words Can be viewed as creating an improved document representation

Data Mining Traditional Data Mining Techniques Classification Clustering Associations Sequential Patterns Extract information from the processed text data via data modeling and data visualization (visual maps) Data Visualization Purpose is to communicate information clearly and efficiently to users via the statistical graphics, plots, information graphics, tables, and charts selected makes complex data more accessible, understandable and usable

Interpretation/Evaluation Terminate Results satisfied Iterate Results not satisfactory but significant the results generated are used as part of the input for one or more earlier stages

Text Mining vs.

Data Mining: In Text Mining, patterns are extracted from natural language text rather than databases Web Mining: In Text Mining, the inputs are unstructured texts, while web sources’ inputs are structured

Text Mining vs. Information Retrieval New information vs. Web Search No genuinely new information is found The desired information merely coexists with other valid pieces of information Hearst’s Analogy: “Discovering new knowledge vs. merely finding patterns is like the difference between a detective following clues to find the criminal vs. analysts looking at crime statistics to assess overall trends in car theft”

Computational Linguistics (CPL) & Natural Language Processing (NLP): CPL computes statistics over large text collections in order to discover useful patterns which are used to inform algorithms for various sub- problems within natural language processing Text Mining vs.

Text Mining In Social Media People use social media to communicate Social media provides rich information of human interaction and collective behavior Traditional Media vs. Modern Social Media Information in most social media sites are stored in text format Text Mining can help deal with textual data in social media for research

Distinct Aspects of Text in Social Media Textual data provides insights into social networks Textual data also presents new challenges: Time Sensitivity Short Length Unstructured Phrases

Aspect #1: Time Sensitivity Social media’s real-time nature Example: some bloggers may update their blog once a week, while others may update several times a day Large number of real-time updates from Facebook and Twitter contain abundant information Information  detection and monitoring of an event Use data to track a user’s interest in an event A user is connected and influenced by his/her friends Example: People will not be interested in a movie after several months, while they may be interested in another movie released several years ago because of the recommendation from his friends

Aspect #2: Short Length Certain social media websites have restrictions on the length of user’s content Twitter’s 140 characters rule Windows Live Messenger’s 128 character personal status Short Messages  people become more efficient with their participation in social media applications Short Messages also bring new challenges to text mining

Aspect #3: Unstructured Phrases Variance in quality of content makes the tasks of filtering and ranking more complex Computer software have difficulties to accurately identify semantic meaning of new abbreviations or acronyms

Applying Text Mining in Social Media Certain aspects of textual data in social media presents great challenges to apply text mining techniques

Event Detection Event Detection aims to monitor a data source and detect the occurrence of an event that is captured within that source Monitor Real-Time Events via Social Media Example: Detecting earthquake when people are posting live-situation through microblogging like Twitter & Facebook Improve traditional news detection Large number of news are generated from various new channels, but only few receive attention from users Researchers proposed to utilize blogosphere to facilitate news detection

Collaborative Question Answering Collaborative question answering services bring together a network of self- declared “experts” to answer questions posted by other people Through text mining, a tremendous amount of historical QA pairs have built up their databases, and this transformation gives users an alternative place to look for information, as opposed to a web search The corresponding best solutions could be explicitly extracted and returned

Social Tagging A method for Internet users to organize, store, manage and search for tags / bookmarks (also as known as social bookmarking) of resources online Social Tagging vs. File Sharing Through text mining, it helps to improve the quality of tag recommendation Facebook’s tag recommendation of a photo Utilize social tagging resources to facilitate other applications Web object classification, document recommendation, web search quality

Concerns For Text Mining Text in unstructured documents is hard to process The information one needs is often not recorded in textual form We do not have programs that can fully interpret text. Many researchers think it will require a full simulation of how the mind works before we can write programs that read the way people do

Future Of Text Mining As most information (common estimates say over 80%) is currently stored as text This includes s, newspaper or web articles, internal reports, transcripts of phone calls, research papers, blog entries, and patent applications Thanks to the web and social media, More than 7 million web pages of text are being added to our collective repository, daily We can now begin to see the usefulness of software that can process between 15, ,000 pages an hour, compared to a mere 60 pages for humans Text mining is believed to have a high commercial potential value

Thanks!

Question??

Sources Textmining.pdf Textmining.pdf All Images Came from Google Images