Giga-Mining Corinna Cortes and Daryl Pregibon AT&T Labs-Research Presented by: Kevin R. Gee 28 October 1999.

Case Study
- Statistical modeling
- Processing of multi-GB databases
- Data warehousing
- Prediction and classification
- User interfaces

Three Goals
- Perform meaningful mining on multiple GB of data every day
- Classify telephone numbers (TNs) as business or residential (pattern deviation, etc.)
- Maintain operational data for each phone number

Quantity of data
- 1997: 275 million phone calls per weekday, for a total of 76 billion over the whole year
- 65M unique TNs per weekday
- 350M unique TNs over a 40-day period
- “Universe list”: the set of all TNs observed on the network, each with a 7-byte profile

Contents of each profile
- Inactivity: number of days since the TN was last used
- Minutes of use: average daily minutes the TN is observed on the network
- Frequency: estimated number of days between observations of the TN
- “Bizocity”: how business-like the TN's behavior is
- Stored for inbound/outbound, toll/toll-free

Calculation of each variable
- Inactivity: set to 0 if the TN is observed that day; incremented by 1 if it is not.
- Other variables are calculated via an exponentially weighted average:
- $X(\mathrm{TN})_{\text{new}} = \lambda\, X(\mathrm{TN})_{\text{today}} + (1-\lambda)\, X(\mathrm{TN})_{\text{old}}$, with $0 < \lambda < 1$
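The two update rules on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the `Profile` class, its field names, and the value λ = 0.15 are assumptions made for the example (the slides do not state the value of λ). Following the later slide that says inactive TNs are not updated, an unobserved TN only has its inactivity counter advanced.

```python
from dataclasses import dataclass

LAMBDA = 0.15  # assumed aging factor; the slides only require 0 < lambda < 1


@dataclass
class Profile:
    """Hypothetical two-field profile (the real profile has more variables)."""
    inactivity: int = 0    # days since the TN was last observed
    minutes: float = 0.0   # smoothed daily minutes of use


def ewma(today: float, old: float, lam: float = LAMBDA) -> float:
    """X_new = lam * X_today + (1 - lam) * X_old."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lambda must lie strictly between 0 and 1")
    return lam * today + (1.0 - lam) * old


def daily_update(p: Profile, observed: bool, minutes_today: float = 0.0) -> Profile:
    """Apply one day's update to a profile and return the new profile."""
    if observed:
        # Observed today: reset inactivity and blend in today's value.
        return Profile(inactivity=0, minutes=ewma(minutes_today, p.minutes))
    # Not observed: only the inactivity counter advances.
    return Profile(inactivity=p.inactivity + 1, minutes=p.minutes)
```

For example, a TN with a smoothed value of 10 minutes that is observed for 20 minutes today moves to 0.15 × 20 + 0.85 × 10 = 11.5 minutes under the assumed λ.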

Aging factor λ
- Makes each estimate a weighted sum of all previous daily values, with weights that decrease smoothly over time.
- The most recent day's activity is weighted more heavily than activity from 2 weeks ago.
- The weight of a call from k days ago is $w_k = (1-\lambda)^k \lambda$
- Old data is “aged out” as new data is “blended in”
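The weight formula follows from unrolling the daily update. A short derivation, writing $x_t$ for the value observed on day $t$ and $X_t$ for the smoothed value after day $t$ (this indexing is notation introduced here, not taken from the slides):

```latex
\begin{align*}
X_t &= \lambda x_t + (1-\lambda) X_{t-1} \\
    &= \lambda x_t + (1-\lambda)\bigl[\lambda x_{t-1} + (1-\lambda) X_{t-2}\bigr] \\
    &= \lambda \sum_{k=0}^{t-1} (1-\lambda)^k x_{t-k} + (1-\lambda)^t X_0 ,
\end{align*}
so the value from $k$ days ago carries weight $w_k = (1-\lambda)^k \lambda$, and
\[
\sum_{k=0}^{\infty} w_k = \lambda \cdot \frac{1}{1-(1-\lambda)} = 1 .
\]
```

As an illustration with an assumed λ = 0.15, the ratio of the weight from 14 days ago to today's weight is $w_{14}/w_0 = 0.85^{14} \approx 0.10$, so data from two weeks ago counts for roughly one tenth as much as today's.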

“Bizocity”
- Addresses whether a TN is residential or business.
- Residences and businesses are handled differently in customer care, billing, collections, fraud detection, etc.

“Bizocity” continued
- AT&T has confirmed residential/business status for 30% of the 350M TNs.
- The data is incomplete because of a lack of communication with local companies, additional lines, and out-of-date information.
- A behavioral estimate is generated by observing the behavior of all 350M TNs, generating a bizocity score, and combining it with previous days' totals.

Generating “Bizocity”
- When a call completes, data such as originating TN, dialed TN, connect time, and call duration is recorded (note that callers are not identified, just phone numbers).
- TNs with known biz/res status are flagged, and training sets are generated.
- Noise and outliers are usually washed out by the volume of data.

Generating “Bizocity” -- examples
- Example: Long calls originating at night are usually residential, not business.
- Example: Residential calls peak in the evening; business calls peak between 9am and 5pm.
- Example: Business calls are generally shorter, go to other businesses, or go to 800 services.

Processed every 24 hours
- Provides better aggregate data for each TN
- Reduces I/O by 75%
- All call details have to be stored and sorted.
- Each call is reduced to a 32-byte binary record, resulting in 8GB daily.
- Sorting takes 30 min. (3GB RAM, 1 processor)
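A fixed-width binary record of this kind can be sketched with Python's `struct` module. The field layout below is hypothetical: the slides state only that each record is 32 bytes and that call details include originating TN, dialed TN, connect time, and duration. Sorting by originating TN is likewise an assumption about the sort key; it is one way to bring all of a TN's calls together before its profile is updated.

```python
import struct

# Hypothetical layout (not from the slides): originating TN, dialed TN,
# connect time (epoch seconds), duration (seconds), 8 reserved bytes.
RECORD = struct.Struct("<QQII8x")
assert RECORD.size == 32


def pack_call(orig_tn: int, dialed_tn: int, connect: int, duration: int) -> bytes:
    """Encode one completed call as a 32-byte record."""
    return RECORD.pack(orig_tn, dialed_tn, connect, duration)


def sort_by_originating_tn(records: list) -> list:
    """Order records so that all calls from one TN are adjacent."""
    return sorted(records, key=lambda r: RECORD.unpack(r)[0])


def daily_bytes(calls_per_day: int) -> int:
    """Raw storage needed for one day's call records."""
    return calls_per_day * RECORD.size
```

At the 275 million calls per weekday quoted earlier, `daily_bytes(275_000_000)` gives 8.8 × 10^9 bytes, in line with the roughly 8GB per day on the slide.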

Processing -- continued
- A 4-D data cube is generated
- Dimensions are day-of-week, time-of-day, duration, and biz/res/800 status (7x6x5x3)
- Logistic regression models were developed previously to score TNs based on each profile (to estimate “Bizocity”)
- $\mathrm{Biz}(\mathrm{TN})_{\text{new}} = \lambda\, \mathrm{Biz}(\mathrm{TN})_{\text{today}} + (1-\lambda)\, \mathrm{Biz}(\mathrm{TN})_{\text{old}}$, with $0 < \lambda < 1$
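The cube and the scoring step can be sketched as follows. Only the cube shape (7x6x5x3) and the blending formula come from the slides; the bin edges, the `cell` and `flat_index` helpers, and the form of the model inputs are assumptions for illustration, and the logistic function shown is a generic one rather than the authors' fitted model.

```python
import math

# Cube shape from the slide: day-of-week x time-of-day x duration x status.
SHAPE = (7, 6, 5, 3)
# Assumed bin edges; the slides give only the number of bins per dimension.
DURATION_EDGES_SEC = (60, 300, 900, 3600)  # four edges -> five duration bins
STATUS = {"biz": 0, "res": 1, "800": 2}


def cell(day_of_week: int, hour: int, duration_sec: int, status: str) -> tuple:
    """Map one call to its cube cell (six 4-hour time-of-day blocks assumed)."""
    tod = hour // 4
    dur = sum(duration_sec >= edge for edge in DURATION_EDGES_SEC)
    return (day_of_week, tod, dur, STATUS[status])


def flat_index(c: tuple) -> int:
    """Row-major position of a cell in a flat array of 7*6*5*3 = 630 counts."""
    d, t, u, s = c
    return ((d * SHAPE[1] + t) * SHAPE[2] + u) * SHAPE[3] + s


def logistic_score(counts: list, weights: list, bias: float = 0.0) -> float:
    """Generic logistic model: sigmoid of a weighted sum of cell counts."""
    z = bias + sum(w * x for w, x in zip(weights, counts))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow for large negative z
    return e / (1.0 + e)


def blend(score_today: float, score_old: float, lam: float) -> float:
    """Biz_new = lam * Biz_today + (1 - lam) * Biz_old."""
    return lam * score_today + (1.0 - lam) * score_old
```

A day's calls for one TN would be counted into the 630 cells, scored with `logistic_score`, and the result blended into the stored bizocity with `blend`.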

Processing -- continued
- The training set is used to classify TNs with unknown status based on probabilities
- Inactive TNs are not updated
- “Bizocity” scores for unknown TNs are generated using probabilities

Accuracy
- Accuracy of status prediction is 75%
- Failures are due to incorrectly provided status or shifting status (e.g., home businesses, cell phones, etc.)

Data Structures
- Exploit the “exchange” concept (the first 6 digits of a TN form an exchange)
- Only about 150,000 of the 1M possible exchanges are in use
- All 10,000 TNs for each exchange are stored sequentially, whether used or not
- Each data structure is 2GB for each variable (lower bound is 1.5GB)
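The exchange-based layout amounts to direct indexing: a TN's position is its exchange's slot times 10,000 plus its last four digits. The sketch below assumes one byte per TN per variable, which is consistent with the stated lower bound (150,000 exchanges × 10,000 TNs = 1.5 × 10^9 entries) but is not stated on the slide; the `ExchangeTable` class and its method names are hypothetical.

```python
TNS_PER_EXCHANGE = 10_000


class ExchangeTable:
    """One byte per TN for a single profile variable, grouped by exchange."""

    def __init__(self, active_exchanges):
        # Map each in-use 6-digit exchange to a dense slot number.
        self.slot = {ex: i for i, ex in enumerate(sorted(active_exchanges))}
        # Every TN of an active exchange gets a byte, whether used or not.
        self.data = bytearray(len(self.slot) * TNS_PER_EXCHANGE)

    def offset(self, tn: int) -> int:
        """Position of a 10-digit TN; raises KeyError for an unused exchange."""
        exchange, line = divmod(tn, TNS_PER_EXCHANGE)
        return self.slot[exchange] * TNS_PER_EXCHANGE + line

    def get(self, tn: int) -> int:
        return self.data[self.offset(tn)]

    def set(self, tn: int, value: int) -> None:
        self.data[self.offset(tn)] = value  # value must be in 0..255
```

Storing unused TNs wastes some space, but lookups need no per-TN key and no search, only one small dictionary keyed by exchange.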

Interface
- Variety of visualization tools (start at the top, drill down)
- Web interface with password protection
- Images are computed on the fly
- C code directly computes images in GIF format

Toll Fraud Detection
- Same methodology, but event-driven
- Only about 15M TNs have to be tracked
- Profiles are about 512 bytes each (7.5GB in total)