1
Mining Text Data Charu C. Aggarwal, ChengXiang Zhai
Presented by Mick Gu
2
Why financial professionals need data mining
We need specific information for day-to-day decision making! For example, a financial company may need to know all the company takeovers that took place in a certain time span, along with their details.
3
Introduction: Large amounts of text data
Created in a variety of social networking, web, and other information-centric applications. Text data is typically managed via a search engine because it lacks structure. Key distinction: accessing information vs. analyzing it.
4
Top 10 sites from Alexa.com
Rank  Website
1     Google
2     Facebook
3     Youtube
4     Yahoo
5     Windows Live
6     Blogger
7     Baidu
8     Wikipedia
9     Twitter
10    QQ.com
Social media contains rich information about human interaction and collective behavior, and draws attention from disciplines including sociology, business, psychology, politics, computer science, economics, and more. So we need to manage and study this text data!
5
Questions on models, algorithms, and applications
1. primary supervised and unsupervised models
2. useful tools and techniques
3. key application domains
6
Models in the book
Named Entity Recognition
Relation Extraction
Text Analytics
Distance-based Clustering Algorithms
Decision Tree Classifier
7
Named Entity Recognition and Relation Extraction
Steve Jobs – Apple Inc – California
Person – Organization – Location
Things to solve: ambiguity, e.g., does "JFK" refer to a person or a location (the airport)?
8
Named Entity Recognition (continued)
Before: manually crafted patterns. Now: statistical machine learning methods such as Hidden Markov Models, Maximum Entropy Models, and Support Vector Machines.
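To make this concrete, here is a minimal NER sketch using spaCy, a library not mentioned in the slides and chosen only for illustration; its small English pipeline is assumed to be installed. It tags spans with labels such as PERSON, ORG, and GPE, matching the Person/Organization/Location example above.

```python
# Minimal NER sketch with spaCy (illustrative; not the book's own tooling).
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline, assumed installed
doc = nlp("Steve Jobs co-founded Apple Inc. in California.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# Expected output (approximately):
#   Steve Jobs  PERSON
#   Apple Inc.  ORG
#   California  GPE
```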
9
Relation Extraction Example: Automatic Content Extraction (ACE)
"In 1998, Larry Page and Sergey Brin founded Google Inc." → FounderOf(Larry Page, Google Inc.), FounderOf(Sergey Brin, Google Inc.), FoundedIn(Google Inc., 1998)
ACE relation types include Personal/Social and Employment/Affiliation. Relations may be expressed within a single sentence or across sentence boundaries.
10
Relation Extraction (continued)
Approaches: feature-based classification and kernel machines. Both methods require a large amount of training data.
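A toy sketch of the feature-based classification idea follows; scikit-learn, the feature names, and the tiny training set are all assumptions made for illustration, not the book's recipe. Each candidate entity pair is turned into a feature dictionary, and a classifier predicts the relation label.

```python
# Toy feature-based relation classification sketch (illustrative only).
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical hand-labeled examples: features of an entity pair -> relation.
train_feats = [
    {"e1_type": "PER", "e2_type": "ORG", "verb_between": "founded"},
    {"e1_type": "PER", "e2_type": "ORG", "verb_between": "joined"},
    {"e1_type": "ORG", "e2_type": "LOC", "verb_between": "based"},
]
train_labels = ["FounderOf", "EmployeeOf", "LocatedIn"]

vec = DictVectorizer()
X = vec.fit_transform(train_feats)
clf = LogisticRegression().fit(X, train_labels)

test = vec.transform([{"e1_type": "PER", "e2_type": "ORG",
                       "verb_between": "founded"}])
print(clf.predict(test))  # likely ['FounderOf'] on this toy data
```

In practice the feature set would be much richer (entity types, words between the entities, dependency paths), which is why such methods need substantial training data.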
11
Tree-based Kernels: T1, T2 are two parse trees
n1, n2 range over the sets of all nodes in T1 and T2, respectively; i indexes subtrees; I_i(n) is 1 if subtree i is rooted at node n and 0 otherwise.
12
Tree-based Kernels: define C(n1, n2) = ∑_i I_i(n1) · I_i(n2)
If the grammar productions at n1 and n2 are different, C(n1, n2) = 0; if they are the same and the nodes have no further expansion, C(n1, n2) = 1; otherwise C(n1, n2) is computed recursively over the aligned children (see the sketch below).
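Below is a small, simplified sketch of this subtree-matching count in the recursive style of Collins-Duffy tree kernels. The tree encoding (a node is a (label, children) pair, with leaves as the base case) is an assumption made for illustration, not taken from the book.

```python
# Simplified tree-kernel sketch: count common subtrees rooted at node pairs.

def production(node):
    # The "grammar production" at a node: its label plus its children's labels.
    label, children = node
    return (label, tuple(c[0] for c in children))

def num_common_subtrees(n1, n2):
    # C(n1, n2): 0 if productions differ; 1 at matching leaves;
    # otherwise combined recursively over the aligned children.
    if production(n1) != production(n2):
        return 0
    if not n1[1]:  # leaf node: only the node itself matches
        return 1
    result = 1
    for c1, c2 in zip(n1[1], n2[1]):
        result *= 1 + num_common_subtrees(c1, c2)
    return result

def all_nodes(node):
    # Flatten a tree into the list of its nodes (n1, n2 range over these sets).
    return [node] + [m for child in node[1] for m in all_nodes(child)]

def tree_kernel(t1, t2):
    # K(T1, T2) = sum of C(n1, n2) over all node pairs.
    return sum(num_common_subtrees(a, b)
               for a in all_nodes(t1) for b in all_nodes(t2))

# Tiny example: two parse fragments sharing the NP -> (DT, NN) production.
t1 = ("NP", [("DT", []), ("NN", [])])
t2 = ("NP", [("DT", []), ("NN", [])])
print(tree_kernel(t1, t2))  # prints 6 for this pair
```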
13
Text Analytics: Consider a text collection containing the three microblogging messages shown below:
Watching the King's Speech
I like the King's Speech
They decide to watch a movie
14
Text Analytics pipeline: Text Preprocessing → Text Representation → Knowledge Discovery
15
Text Preprocessing: stop word removal and stemming
"Watching the King's Speech" → Watch King's Speech
"They decide to watch a movie" → Decid watch movi
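A minimal preprocessing sketch is shown below; NLTK's Porter stemmer and the tiny stop-word list are assumptions for illustration, since the slides do not name a specific tool.

```python
# Minimal stop word removal + stemming sketch (NLTK is an assumed dependency).
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "i", "they", "to", "a"}   # tiny illustrative list
stemmer = PorterStemmer()

def preprocess(message):
    tokens = message.lower().replace("'s", "").split()
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

for msg in ["Watching the King's Speech",
            "I like the King's Speech",
            "They decide to watch a movie"]:
    print(preprocess(msg))
# e.g. "They decide to watch a movie" -> ['decid', 'watch', 'movi']
```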
16
Text Representation: transform each phrase into a numeric vector
Bag of Words (Vector Space) Model: each word has a numeric weight of importance.
17
Weight of importance: Term Frequency / Inverse Document Frequency
tfidf(w) = tf(w) · log(N / df(w))
tf(w): term frequency (number of occurrences of w in the document)
df(w): document frequency (number of documents containing w)
N: number of documents in the corpus
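The formula translates directly into code. The sketch below assumes the natural logarithm (the slide does not fix the base) and uses the three microblog messages as the corpus, so N = 3.

```python
# Direct sketch of the tf-idf weighting formula above.
import math

def tfidf(tf, df, n_docs):
    # tf: occurrences of the word in the document
    # df: number of documents containing the word
    # n_docs: total number of documents in the corpus
    return tf * math.log(n_docs / df)

# "king" appears once in a message, and in 2 of the 3 messages overall:
print(tfidf(tf=1, df=2, n_docs=3))   # ~0.405
# "movi" appears once and in only 1 of 3 messages, so it gets a higher weight:
print(tfidf(tf=1, df=1, n_docs=3))   # ~1.099
```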
18
Bag of Words Matrix
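The slide's matrix can be rebuilt from the preprocessed messages. The sketch below uses raw counts; the middle row assumes the second message reduces to like / king / speech under the preprocessing sketch above, and the vocabulary order is arbitrary.

```python
# Build a bag-of-words (term-document) count matrix for the three messages.
docs = [["watch", "king", "speech"],
        ["like", "king", "speech"],
        ["decid", "watch", "movi"]]

vocab = sorted({w for doc in docs for w in doc})
matrix = [[doc.count(w) for w in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
# vocab: ['decid', 'king', 'like', 'movi', 'speech', 'watch']
# rows:  [0, 1, 0, 0, 1, 1]
#        [0, 1, 1, 0, 1, 0]
#        [1, 0, 0, 1, 0, 1]
```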
19
Knowledge Discovery: Why do we need the weight matrix?
To calculate similarity between documents for machine learning methods!
similarity(v1, v2) = cos(θ) = (v1 · v2) / (‖v1‖ ‖v2‖)
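A short sketch of this cosine similarity, applied to two rows of the bag-of-words matrix above (raw counts here; tf-idf weights would be used the same way):

```python
# Cosine similarity between two document vectors.
import math

def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

watch_kings  = [0, 1, 0, 0, 1, 1]   # "Watching the King's Speech"
like_kings   = [0, 1, 1, 0, 1, 0]   # "I like the King's Speech"
decide_movie = [1, 0, 0, 1, 0, 1]   # "They decide to watch a movie"

print(cosine(watch_kings, like_kings))    # ~0.67: the two King's Speech messages
print(cosine(watch_kings, decide_movie))  # ~0.33: they share only "watch"
```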
20
Conclusion and Future Studies
How to handle textual data of short length?
21
Future Studies: How to reduce noise in the representation of textual data?
22
Future Studies: How to analyze cross-media data (text, images, links, and even multilingual data)?
23
Thank You!