Mining the Data Charu C. Aggarwal, ChengXiang Zhai


1 Mining the Data Charu C. Aggarwal, ChengXiang Zhai
Presented by Mick Gu

2 Why financial professionals need data mining
We need specific information for day-to-day decision making! For example, a financial company may need to know about all the company takeovers that take place in a certain time span, along with their details.

3 Introduction Large amounts of text data
Created in a variety of social network, web, and other information-centric applications. Text data is managed via a search engine due to its lack of structure. Information access vs. analyzing information.

4 Top 10 sites from Alexa.com
Rank  Website         Rank  Website
1     Google          6     Blogger
2     Facebook        7     Baidu
3     Youtube         8     Wikipedia
4     Yahoo           9     Twitter
5     Windows Live    10    QQ.com
Social media contains rich information about human interaction and collective behavior. It draws much attention from disciplines including sociology, business, psychology, politics, computer science, and economics. So we need to manage and study this text data!

5 Question on Models, algorithms and applications
1. Primary supervised and unsupervised models
2. Useful tools and techniques
3. Key application domains

6 Models in the book Named Entity Recognition Relation Extraction
Text Analytics Distance-based Clustering Algorithms Decision Tree Classifier

7 Named Entity Recognition and Relation Extraction
Steve Jobs – Apple Inc – California Person – Organization – Location Things to solve: JFK?

8 Named Entity Recognition(continued)
Before: manually crafted patterns
Now: statistical machine learning methods (Hidden Markov Models, Maximum Entropy Models, Support Vector Machines)

9 Relation Extraction Example: Automatic Content Extraction(ACE)
In 1998, Larry Page and Sergey Brin founded Google Inc.
FounderOf(Larry Page, Google Inc.), FounderOf(Sergey Brin, Google Inc.), FoundedIn(Google Inc., 1998)
Automatic Content Extraction (ACE) relation types: Personal/social, Employment/affiliated
Within sentence boundaries vs. across sentence boundaries

10 Relation Extraction(continued)
Feature-based classification and kernel machines. Both methods require large amounts of training data.

11 Tree-based Kernels T1, T2 two parse trees
n1, n2: nodes drawn from the sets of all nodes in T1 and T2 respectively. i denotes a subtree. I_i(n) is 1 if subtree i is seen rooted at node n and 0 otherwise.

12 Tree-based Kernels Assume C(n1,n2) = ∑_i I_i(n1) I_i(n2)
If the grammar productions at n1 and n2 are different, C(n1,n2) = 0. If they are the same and n1, n2 are pre-terminals, C(n1,n2) = 1.
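The counting rule on this slide can be sketched in Python as follows. This is a minimal illustration, not the presenter's code: trees are assumed to be nested tuples of the form (label, child, ...), words are plain strings that appear only under pre-terminal nodes, and the function names are hypothetical. The recursive case for matching nodes that are not pre-terminals is not on the slide; it is filled in here with the usual product over children, so treat it as an assumption.

```python
def label(node):
    """Label of a node; a bare string is a word and is its own label."""
    return node if isinstance(node, str) else node[0]

def production(node):
    """Grammar production at a node: (parent label, child labels)."""
    return (node[0], tuple(label(child) for child in node[1:]))

def is_preterminal(node):
    """True when every child is a word (a plain string)."""
    return all(isinstance(child, str) for child in node[1:])

def internal_nodes(tree):
    """All non-word nodes of the tree, in pre-order."""
    if isinstance(tree, str):
        return []
    nodes = [tree]
    for child in tree[1:]:
        nodes.extend(internal_nodes(child))
    return nodes

def common_subtrees(n1, n2):
    """C(n1, n2): number of common subtrees rooted at n1 and n2."""
    if production(n1) != production(n2):
        return 0
    if is_preterminal(n1):
        return 1
    # Assumed recursive case: multiply (1 + C) over aligned children.
    count = 1
    for c1, c2 in zip(n1[1:], n2[1:]):
        count *= 1 + common_subtrees(c1, c2)
    return count

def tree_kernel(t1, t2):
    """Sum C(n1, n2) over all pairs of nodes from the two trees."""
    return sum(common_subtrees(a, b)
               for a in internal_nodes(t1)
               for b in internal_nodes(t2))

t1 = ("NP", ("D", "the"), ("N", "dog"))
t2 = ("NP", ("D", "the"), ("N", "cat"))
shared = tree_kernel(t1, t2)
```

For these two toy trees the shared fragments are [D the], [NP D N], and [NP [D the] N]; the word "dog" versus "cat" blocks any fragment that includes the noun's word.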

13 Text Analytics
Given a text containing the three microblogging messages shown below:
Watching the King’s Speech
I like the King’s Speech
They decide to watch a movie

14 Text Analytics Text Preprocessing Text Representation
Knowledge Discovery

15 Text Preprocessing Stop word removal Stemming Watch King’s Speech
Decid watch movi
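The two preprocessing steps can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the stop word list is a hypothetical one chosen to cover the three example messages, and crude_stem is a toy suffix stripper standing in for a real stemmer such as Porter's, which uses far more careful rules.

```python
# Hypothetical stop word list, just large enough for the example messages.
STOP_WORDS = {"the", "i", "they", "to", "a"}

def crude_stem(word):
    """Toy stand-in for a stemmer: strip 'ing' or a trailing 'e'."""
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    if word.endswith("e") and len(word) > 4:
        return word[:-1]
    return word

def preprocess(message):
    """Lowercase, split on whitespace, drop stop words, then stem."""
    tokens = message.lower().split()
    return [crude_stem(token) for token in tokens if token not in STOP_WORDS]

messages = [
    "Watching the King's Speech",
    "I like the King's Speech",
    "They decide to watch a movie",
]
processed = [preprocess(message) for message in messages]
```

With these toy rules the first and third messages reduce to the same tokens as on the slide (in lowercase); a production pipeline would use an established stemmer and stop word list instead.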

16 Text Representation Transform the phrase into a numeric vector
Bag of Words or Vector Space Model: each word has a numeric weight of importance

17 Weight of importance Term Frequency / Inverse Document Frequency
tfidf(w) = tf(w) * log(N / df(w))
tf(w): term frequency (number of occurrences of w in a document)
df(w): document frequency (number of docs containing w)
N: the number of docs in the corpus
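As a rough sketch of this weighting, the Python below builds a TF-IDF matrix for the three example messages after preprocessing. It assumes the reading tf * log(N / df(w)) with the natural logarithm and raw counts as term frequency; the function name and the token lists are illustrative, not taken from the slides.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocabulary, rows); rows[d][j] is the weight of word j in doc d."""
    n_docs = len(docs)
    vocab = sorted({word for doc in docs for word in doc})
    # Document frequency: number of docs that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    rows = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        rows.append([tf[word] * math.log(n_docs / df[word]) for word in vocab])
    return vocab, rows

# Assumed preprocessed form of the three microblogging messages.
docs = [
    ["watch", "king's", "speech"],
    ["like", "king's", "speech"],
    ["decid", "watch", "movi"],
]
vocab, rows = tfidf_matrix(docs)
```

Words that occur in only one message, such as "movi", receive the largest weight log(3), while words shared by two messages receive log(1.5).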

18 Bag of Words Matrix

19 Knowledge Discovery Why we need the weight matrix?
To calculate similarity for machine learning methods! Similarity(v1, v2) = cos(θ) = (v1 · v2) / (||v1|| ||v2||)
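The cosine similarity between two weight vectors can be computed as in the sketch below. The function name is illustrative, and returning 0.0 for an all-zero vector is a convention assumed here, since the formula is undefined in that case.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm1 * norm2)

# Two toy bag-of-words vectors that share one of their two non-zero terms.
similarity = cosine_similarity([1, 1, 0], [0, 1, 1])
```

Vectors pointing in the same direction score 1 regardless of length, and vectors with no terms in common score 0.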

20 Conclusion and Future Studies
How to handle short textual data?

21 Future Studies How to reduce noise in the representation of textual data?

22 Future Studies How to analyze cross-media data (text, images, links, and even multilingual data)?

23 Thank You!

