Published by Lilian Dennis. Modified over 9 years ago.
1
Data Mining Status and Risks
Dr. Gregory Newby
UNC-Chapel Hill
http://ils.unc.edu/gbnewby
2
Overview
What are data mining and related concepts?
Fundamentals of the science and practice of data mining
What data sources are available?
Causality and correlation
Risks of data mining
Future moves
3
Data Mining
“An information extraction activity whose goal is to discover hidden facts contained in databases. …[D]ata mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results. Typical applications include market segmentation, customer profiling, fraud detection, evaluation of retail promotions, and credit risk analysis.”
(Via http://www.twocrows.com/glossary.htm)
4
Data Mining
Is: Seeking new information from relations among data, possibly from different sources
Is: An important area of academic, corporate and government research
Is: Important from a security standpoint, because data mining might yield emergent information that would otherwise remain unknown
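As a rough illustration of "relations among data, possibly from different sources", the sketch below links records from two sources on a shared key. The function name `link_records`, the `person_id` field, and the sample records are all made up for illustration; real sources rarely share such a clean identifier.

```python
from collections import defaultdict


def link_records(source_a, source_b, key):
    """Pair up records from two sources that share the same value for `key`."""
    # Index the second source by the key so each lookup is cheap.
    index = defaultdict(list)
    for record in source_b:
        index[record[key]].append(record)

    linked = []
    for record in source_a:
        for match in index.get(record[key], []):
            # Merge the two records into one combined view.
            linked.append({**record, **match})
    return linked


# Hypothetical sample data from two unrelated sources.
travel = [
    {"person_id": 1, "destination": "Anchorage"},
    {"person_id": 2, "destination": "Raleigh"},
]
purchases = [
    {"person_id": 1, "item": "snowshoes"},
    {"person_id": 3, "item": "sunscreen"},
]

linked = link_records(travel, purchases, "person_id")
print(linked)
```

Neither source alone says that the person who traveled to Anchorage also bought snowshoes; the combined record is the kind of emergent information the slide refers to.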
5
The Bigger Picture
[Diagram: information retrieval, data mining, data fusion]
6
The Data Universe
All data
All topics
All sources
Numeric, textual
Discrete, longitudinal
Lots and lots of data!
The data universe is growing constantly, and many new data sources are being created as a result of security concerns and technological progress
7
Challenges of the Data Universe
Scale: too much data to deal with
Format: many different formats which are difficult to merge or query
Access: most data (over 90%?) are not Web-accessible
  Databases
  Proprietary or internal data
  Formatting problems or issues
8
Solutions
Figure out how to get data from one format to another. Standards such as XML and EDI help.
Develop cooperative relationships among data holders for data exchange. This is happening much more in government.
Develop tools to identify relationships among data. This is the focus of data mining.
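Getting data from one format to another can be sketched with Python's standard library. The example below flattens a small XML document into CSV. The element names (`records`, `record`, `name`, `city`) and the sample values are assumptions chosen for illustration, not a real exchange schema.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical XML export from one data holder.
XML_TEXT = """
<records>
  <record><name>Ann</name><city>Fairbanks</city></record>
  <record><name>Bob</name><city>Chapel Hill</city></record>
</records>
"""


def xml_to_rows(xml_text):
    """Turn each <record> element into a dict of child tag -> text."""
    root = ET.fromstring(xml_text.strip())
    return [
        {child.tag: child.text for child in record}
        for record in root.findall("record")
    ]


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = xml_to_rows(XML_TEXT)
csv_text = rows_to_csv(rows, ["name", "city"])
print(csv_text)
```

The conversion itself is the easy part; agreeing on what the fields mean across data holders is where standards do most of their work.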
9
Data Mining != Web Searching
On the Web, we’re doing high precision information retrieval
We want the first ranked documents to be relevant
We don’t want to see irrelevant documents
The data universe for Web search engines is vast, making this a relatively straightforward problem (though a big engineering challenge!)
10
Data Mining != Web Searching
Data mining is all about recall, not precision
Recall means we find all the relevant documents, regardless of how many irrelevant documents come with them
This is a tougher problem, since the set of responses to a given inquiry can be huge
It’s also tougher because of data formats, data merging, access, etc.
The data miner’s goal is to set a threshold over which relationships are “interesting”
Data miners can also search for particular patterns, e.g. those related to an individual or group
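The difference between precision and recall can be made concrete with a small calculation. This is a minimal sketch using the standard definitions; the document IDs and the two result lists are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ground truth: four documents are actually relevant.
relevant = {"d1", "d2", "d3", "d4"}

# A Web-search-style result: short, and everything shown is relevant.
web_style = ["d1", "d2"]

# A data-mining-style result: everything relevant, plus a lot of noise.
mining_style = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]

print(precision_recall(web_style, relevant))     # (1.0, 0.5)
print(precision_recall(mining_style, relevant))  # (0.5, 1.0)
```

The first result list misses half of what matters but wastes no attention; the second misses nothing but leaves half of its output to be weeded out, which is where the "interesting" threshold comes in.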
11
Today
Law enforcement, industry and government are making their data sources more open to each other (these data sources are not generally publicly available)
Data integrity issues are a major concern
Data mining is still tough. “False positive” relationships are easy to come by
  Correlation vs. causality
  Seek and ye shall find
  Lots of data yields lots of matches
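"Lots of data yields lots of matches" can be put in rough numbers. The sketch below counts how many pairwise relationships exist among a set of variables and how many would pass a test purely by chance. The figures (1,000 variables, a 1% chance of a spurious match per pair) are assumptions for illustration, and the calculation treats every pair as an independent test, which real data will not satisfy exactly.

```python
from math import comb


def expected_chance_matches(n_variables, false_positive_rate):
    """Return (number of variable pairs, expected spurious matches).

    Assumes each pair is tested once and each test has the same
    probability of flagging a relationship that is not really there.
    """
    n_pairs = comb(n_variables, 2)
    return n_pairs, n_pairs * false_positive_rate


# Hypothetical scenario: 1,000 variables, 1% false positive rate per pair.
pairs, expected = expected_chance_matches(1000, 0.01)
print(f"{pairs} pairs, about {round(expected)} expected chance matches")
```

With these assumed numbers there are 499,500 pairs to test and roughly 4,995 "relationships" expected from noise alone, none of which says anything about causality.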
12
Today’s Data Sources
Credit and other financials
Law enforcement records
Travel history
Health data
Whatever you put on the Internet
If you are targeted:
  Wiretap data (’net, phone, etc.)
  Surveillance data
  HUMINT, etc.
13
Tomorrow
Decreased barriers among different data sources (this is a main impact of the PATRIOT Act, but more is coming)
Increased data collection (via the PATRIOT Act plus technological trends)
Better tools for data mining, and new technologies making data sharing and integration easier
14
Contact Info
Greg Newby is moving from UNC to UAF
New position: Research Faculty at the Arctic Region Supercomputing Center, University of Alaska, Fairbanks
newby@arsc.edu