Mining the Data Charu C. Aggarwal, ChengXiang Zhai

Presented by Mick Gu

Why financial professionals need data mining
We need specific information for day-to-day decision making. For example, a financial company may need to know about all the company takeovers that took place in a certain time span, along with their details.

Introduction
Large amounts of text data are created in a variety of social network, web, and other information-centric applications.
Text data is managed via a search engine due to its lack of structure.
Information access vs. analyzing information

Top 10 sites from Alexa.com
Rank  Website
1     Google
2     Facebook
3     YouTube
4     Yahoo
5     Windows Live
6     Blogger
7     Baidu
8     Wikipedia
9     Twitter
10    QQ.com
Social media contains rich information about human interaction and collective behavior. It draws much attention from disciplines including sociology, business, psychology, politics, computer science, and economics. So we need to manage and study this text data.

Questions on models, algorithms, and applications
1. Primary supervised and unsupervised models
2. Useful tools and techniques
3. Key application domains

Models in the book
Named Entity Recognition
Relation Extraction
Text Analytics
Distance-based Clustering Algorithms
Decision Tree Classifier

Named Entity Recognition and Relation Extraction
Steve Jobs – Apple Inc – California
Person – Organization – Location
Things to solve: ambiguous mentions such as JFK (a person or a location?)

Named Entity Recognition (continued)
Before: manually crafted patterns
Now: statistical machine learning methods, such as Hidden Markov Models, Maximum Entropy Models, and Support Vector Machines

Relation Extraction
Example: In 1998, Larry Page and Sergey Brin founded Google Inc.
FounderOf(Larry Page, Google Inc.), FounderOf(Sergey Brin, Google Inc.), FoundedIn(Google Inc., 1998)
Automatic Content Extraction (ACE) relation types: Personal/social, Employment/affiliation
Within sentence boundaries vs. across sentence boundaries
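As an illustration of the target output only, the founding example can be reproduced with a single hand-crafted pattern, the older approach that statistical methods replaced. This is a minimal sketch: the regular expression and the function name `extract_founding_relations` are hypothetical, and the pattern covers just this one sentence shape.

```python
import re

# Hypothetical hand-crafted pattern for one sentence shape:
# "In <year>, <person> and <person> founded <organization>"
FOUNDING_PATTERN = re.compile(r"^In (\d{4}), (.+?) and (.+?) founded (.+)$")


def extract_founding_relations(sentence):
    """Return (relation, arg1, arg2) tuples, or an empty list if no match."""
    match = FOUNDING_PATTERN.match(sentence)
    if match is None:
        return []
    year, founder_a, founder_b, organization = match.groups()
    return [
        ("FounderOf", founder_a, organization),
        ("FounderOf", founder_b, organization),
        ("FoundedIn", organization, year),
    ]


relations = extract_founding_relations(
    "In 1998, Larry Page and Sergey Brin founded Google Inc."
)
for relation in relations:
    print(relation)
```

A pattern like this breaks as soon as the wording changes, which is one reason feature-based and kernel-based classifiers are used instead.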

Relation Extraction (continued)
Feature-based classification
Kernel machines
Both methods require a large amount of training data

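Tree-based Kernels
$T_1$, $T_2$: two parse trees
$N_1$, $N_2$: the sets of all nodes in $T_1$ and $T_2$, respectively
$i$ denotes a subtree
$I_i(n)$ is 1 if subtree $i$ is seen rooted at node $n$, and 0 otherwise

Tree-based Kernels
Assume $C(n_1, n_2) = \sum_i I_i(n_1) I_i(n_2)$, so that $K(T_1, T_2) = \sum_{n_1 \in N_1} \sum_{n_2 \in N_2} C(n_1, n_2)$
If the grammar productions at $n_1$ and $n_2$ are different, $C(n_1, n_2) = 0$
If they are the same and $n_1$, $n_2$ are pre-terminals, $C(n_1, n_2) = 1$
Otherwise, $C(n_1, n_2) = \prod_j \left(1 + C(\mathrm{ch}(n_1, j), \mathrm{ch}(n_2, j))\right)$, where $\mathrm{ch}(n, j)$ is the $j$-th child of $n$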
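The counting rule can be sketched in a few lines of Python. This sketch follows the Collins and Duffy subtree kernel, including its recursive case for matching nodes that are not pre-terminals; reading the slide that way is an assumption. The nested-tuple tree encoding and all function names are hypothetical choices made for the example.

```python
# A parse tree is a nested tuple (label, child, ...); a word is a plain string.


def production(node):
    """Grammar production at a node: its label plus what each child is."""
    children = tuple(
        ("word", c) if isinstance(c, str) else ("label", c[0]) for c in node[1:]
    )
    return (node[0], children)


def is_preterminal(node):
    """True when every child of the node is a word."""
    return all(isinstance(c, str) for c in node[1:])


def all_nodes(tree):
    """List every non-word node of the tree."""
    if isinstance(tree, str):
        return []
    found = [tree]
    for child in tree[1:]:
        found.extend(all_nodes(child))
    return found


def common_subtrees(n1, n2):
    """C(n1, n2): number of common subtrees rooted at both n1 and n2."""
    if production(n1) != production(n2):
        return 0
    if is_preterminal(n1):
        return 1
    count = 1
    for c1, c2 in zip(n1[1:], n2[1:]):
        if isinstance(c1, str):
            continue  # identical words add no further choices
        count *= 1 + common_subtrees(c1, c2)
    return count


def tree_kernel(t1, t2):
    """Sum C(n1, n2) over all pairs of nodes from the two trees."""
    return sum(common_subtrees(a, b) for a in all_nodes(t1) for b in all_nodes(t2))


the_dog = ("NP", ("D", "the"), ("N", "dog"))
the_cat = ("NP", ("D", "the"), ("N", "cat"))
print(tree_kernel(the_dog, the_dog))
print(tree_kernel(the_dog, the_cat))
```

The two toy noun phrases share the determiner and the top-level production but not the noun, so the kernel value for the mixed pair is lower than for a tree compared with itself.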

Text Analytics
Given a text containing the three microblogging messages shown below:
Watching the King’s Speech
I like the King’s Speech
They decide to watch a movie

Text Analytics
Text Preprocessing
Text Representation
Knowledge Discovery

Text Preprocessing
Stop word removal
Stemming
Watching the King’s Speech → Watch King’s Speech
They decide to watch a movie → Decid watch movi
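The two preprocessing steps can be sketched as follows. This is a toy illustration, not the Porter stemmer or any library routine: the stop word list and the two suffix rules in `toy_stem` are assumptions chosen only to reproduce the stems shown on the slide, and the output is lowercased.

```python
# Tiny illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = {"a", "i", "the", "they", "to"}


def toy_stem(word):
    """Strip a trailing 'ing' or 'e'; a crude stand-in for a real stemmer."""
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    if word.endswith("e") and len(word) > 3:
        return word[:-1]
    return word


def preprocess(message):
    """Lowercase, split on whitespace, drop stop words, then stem."""
    tokens = message.lower().split()
    return [toy_stem(token) for token in tokens if token not in STOP_WORDS]


print(preprocess("Watching the King's Speech"))
print(preprocess("They decide to watch a movie"))
```

In practice a tested stemmer and a curated stop word list would replace both hypothetical pieces.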

Text Representation
Transform each message into a numeric vector
Bag of Words or Vector Space Model: each word has a numeric weight of importance

Weight of importance
Term Frequency / Inverse Document Frequency (TF-IDF)
$\mathrm{tfidf}(w) = \mathrm{tf}(w) \cdot \log \frac{N}{\mathrm{df}(w)}$
tf(w): term frequency (number of occurrences of w in a document)
df(w): document frequency (number of documents containing w)
N: the number of documents in the corpus
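A minimal sketch of this weighting, assuming the usual reading tf(w) · log(N / df(w)) with a natural logarithm (the slide does not state the base). The function name `tfidf` and the token lists are hypothetical; the second message's tokens are an assumed preprocessing result.

```python
import math


def tfidf(word, doc, corpus):
    """tf(w) * log(N / df(w)) for one word in one tokenized document."""
    tf = doc.count(word)                              # occurrences in this document
    df = sum(1 for other in corpus if word in other)  # documents containing the word
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


corpus = [
    ["watch", "king's", "speech"],
    ["like", "king's", "speech"],
    ["decid", "watch", "movi"],
]

# "movi" appears in one of three documents, "watch" in two of three.
print(tfidf("movi", corpus[2], corpus))
print(tfidf("watch", corpus[0], corpus))
```

A word that occurs in fewer documents gets a larger weight, which is why "movi" outweighs "watch" here.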

Bag of Words Matrix
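The matrix from this slide did not survive in the transcript. As a stand-in, here is a minimal sketch that builds a bag-of-words matrix of raw term counts for the three example messages, with one row per message and one column per vocabulary word. The token lists are assumed preprocessing results, and raw counts are used instead of TF-IDF weights to keep the example short.

```python
docs = [
    ["watch", "king's", "speech"],
    ["like", "king's", "speech"],
    ["decid", "watch", "movi"],
]

# Columns: every distinct word in the corpus, in sorted order.
vocabulary = sorted({word for doc in docs for word in doc})

# Rows: one count vector per message.
matrix = [[doc.count(word) for word in vocabulary] for doc in docs]

print(vocabulary)
for row in matrix:
    print(row)
```

Replacing each count with a TF-IDF weight gives the weighted matrix used for similarity computations.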

Knowledge Discovery
Why do we need the weight matrix?
To calculate similarity for machine learning methods.
$\mathrm{Similarity}(v_1, v_2) = \cos\theta = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert}$
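Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch, with hypothetical count vectors over the assumed vocabulary (decid, king's, like, movi, speech, watch) for the three example messages:

```python
import math


def cosine_similarity(v1, v2):
    """Dot product of v1 and v2 divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm1 * norm2)


message_1 = [0, 1, 0, 0, 1, 1]  # watch king's speech
message_2 = [0, 1, 1, 0, 1, 0]  # like king's speech
message_3 = [1, 0, 0, 1, 0, 1]  # decid watch movi

print(cosine_similarity(message_1, message_2))
print(cosine_similarity(message_1, message_3))
print(cosine_similarity(message_2, message_3))
```

The first two messages share two words and come out most similar; the second and third share none, so their similarity is zero.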

Conclusion and Future Studies
How to handle short textual data?

Future Studies
How to reduce noise in the representation of textual data?

Future Studies
How to analyze cross-media data (text, images, links, and even multilingual data)?

Thank You!
