Giga-Mining
Corinna Cortes and Daryl Pregibon, AT&T Labs-Research
Presented by: Kevin R. Gee, 28 October 1999
Case Study
- Statistical modeling
- Processing of multi-GB databases
- Data warehousing
- Prediction and classification
- User interfaces
Three Goals
- Perform meaningful mining on multiple gigabytes of data every day
- Classify telephone numbers as business or residential (pattern deviation, etc.)
- Maintain operational data for each phone number
Quantity of data
- 1997: 275 million phone calls per weekday, for a total of 76 billion over the whole year
- 65M unique telephone numbers (TNs) per weekday
- 350M unique TNs over a 40-day period
- “Universe list”: the set of all TNs observed on the network, each with a 7-byte profile
Contents of each profile
- Inactivity: number of days since the TN was last used
- Minutes of use: average daily minutes the TN is observed on the network
- Frequency: estimated number of days between observations of the TN
- “Bizocity”: how business-like the TN's behavior is
- Stored for inbound and outbound, toll and toll-free calls
Calculation of each variable
- Inactivity: set to 0 if the TN is observed that day, and incremented by 1 if it is not
- Other variables are calculated via an exponentially weighted average:
  X(TN)_new = λ X(TN)_today + (1 - λ) X(TN)_old, 0 < λ < 1
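The two update rules on this slide can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names and the value λ = 0.25 are assumptions chosen only to make the arithmetic easy to follow.

```python
def update_inactivity(inactivity: int, observed: bool) -> int:
    """Reset to 0 when the TN is seen today, otherwise count one more idle day."""
    return 0 if observed else inactivity + 1


def ewma_update(old: float, today: float, lam: float) -> float:
    """Blend today's value into the running estimate: lam*today + (1-lam)*old."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie strictly between 0 and 1")
    return lam * today + (1.0 - lam) * old


LAM = 0.25  # hypothetical aging factor, not a value given in the slides

# A TN with a running average of 10 minutes/day is observed for 30 minutes today.
minutes = ewma_update(10.0, today=30.0, lam=LAM)  # 0.25*30 + 0.75*10
days_idle = update_inactivity(3, observed=True)
```

Only the previous estimate and today's value are needed for each update, so no history has to be kept per TN.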
Aging factor λ
- Makes each estimate a weighted sum of all previous daily values, with weights that decrease smoothly over time
- The most recent day's activity is weighted more heavily than activity from 2 weeks ago
- Weight of a call k days ago is w_k = (1 - λ)^k λ
- Old data is “aged out” as new data is “blended in”
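As a quick check of the weight formula, the sketch below compares the day-by-day recursive update with the explicit weighted sum using w_k = (1 - λ)^k λ. The daily values, λ = 0.25, and the starting estimate of 0 are assumptions made for the example.

```python
def ewma_weights(lam: float, days: int) -> list:
    """Weights for today (k = 0), yesterday (k = 1), ... : lam * (1 - lam)**k."""
    return [lam * (1.0 - lam) ** k for k in range(days)]


def ewma_recursive(values_oldest_first, lam: float, start: float = 0.0) -> float:
    """Apply new = lam*today + (1-lam)*old once per day, oldest day first."""
    estimate = start
    for value in values_oldest_first:
        estimate = lam * value + (1.0 - lam) * estimate
    return estimate


LAM = 0.25                     # hypothetical aging factor
daily = [4.0, 0.0, 8.0, 2.0]   # made-up daily values, oldest first

weights = ewma_weights(LAM, len(daily))
# Pair weight k with the value observed k days ago (most recent first).
weighted_sum = sum(w * v for w, v in zip(weights, reversed(daily)))
recursive = ewma_recursive(daily, LAM)
```

With a starting estimate of 0, the two computations agree, which is why the recursive form can stand in for a sum over the full history.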
“Bizocity”
- Addresses whether a TN is residential or business
- Residences and businesses are handled differently in customer care, billing, collections, fraud detection, etc.
“Bizocity” (continued)
- AT&T has confirmed residential/business status for 30% of the 350M TNs
- The data is incomplete because of lack of communication with local companies, additional lines, and out-of-date information
- A behavioral estimate is produced by observing the behavior of all 350M TNs, generating a bizocity score, and combining it with previous days' totals
Generating “Bizocity”
- When a call completes, data such as the originating TN, dialed TN, connect time, and call duration are recorded (note that callers are not identified, just phone numbers)
- TNs with known business/residential status are flagged, and training sets are generated
- Noise and outliers are usually eliminated by the volume of data
Generating “Bizocity” (examples)
- Long calls originating at night are usually residential, not business
- Residential calls peak in the evening; business calls peak between 9am and 5pm
- Business calls are generally shorter, call other businesses, or call 800 services
Processed every 24 hours
- Provides better aggregate data for each TN
- Reduces I/O by 75%
- Requires storing and sorting all call details
- Each call is reduced to a 32-byte binary record, resulting in 8 GB daily
- Sorting takes 30 minutes (3 GB RAM, 1 processor)
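To make the 32-byte record concrete, here is a minimal sketch of packing and sorting call records with Python's `struct` module. The slides do not give the record layout or the sort key, so both are assumptions: the fields (originating TN, dialed TN, connect time, duration, 8 spare bytes) are hypothetical, and so is sorting by originating TN and then connect time. The phone numbers are fictional.

```python
import struct

# Hypothetical layout, little-endian, 32 bytes in total:
#   Q  originating TN as an integer   (8 bytes)
#   Q  dialed TN as an integer        (8 bytes)
#   I  connect time, epoch seconds    (4 bytes)
#   I  duration in seconds            (4 bytes)
#   8x spare bytes                    (8 bytes)
RECORD = struct.Struct("<QQII8x")


def pack_call(orig_tn: int, dialed_tn: int, connect: int, duration: int) -> bytes:
    """Serialize one call into a fixed-width binary record."""
    return RECORD.pack(orig_tn, dialed_tn, connect, duration)


def sort_calls(blob: bytes) -> list:
    """Decode a buffer of records and sort by (originating TN, connect time)."""
    return sorted(RECORD.iter_unpack(blob), key=lambda call: (call[0], call[2]))


blob = b"".join([
    pack_call(9735550123, 8005550199, 941100000, 45),
    pack_call(2125550147, 9735550123, 941103600, 600),
    pack_call(2125550147, 8005550199, 941090000, 30),
])
calls = sort_calls(blob)
```

Sorting by originating TN groups all of a number's calls together, so each profile needs to be read and written only once per day. An in-memory sort like this one would not scale to 8 GB of records; it only illustrates the fixed-width record idea.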
Processing (continued)
- A 4D data cube is generated
- Dimensions are day-of-week, time-of-day, duration, and biz/res/800 status (7 x 6 x 5 x 3)
- Previously developed logistic regression models score TNs based on each profile (to estimate “Bizocity”)
- Biz(TN)_new = λ Biz(TN)_today + (1 - λ) Biz(TN)_old, 0 < λ < 1
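The 7 x 6 x 5 x 3 cube has 630 cells, and a call can be mapped to its cell with simple integer arithmetic. The sketch below is illustrative only: the slides give the dimension sizes but not the bin boundaries, so the four-hour time-of-day bins, the duration edges, and the status codes are all assumptions.

```python
from bisect import bisect_right

DAYS, TOD_BINS, DUR_BINS, STATUSES = 7, 6, 5, 3

# Hypothetical duration bin edges in seconds: <1 min, 1-5 min, 5-15 min, 15-60 min, 1 h+
DURATION_EDGES = [60, 300, 900, 3600]

# Hypothetical status codes for the biz/res/800 dimension
BIZ, RES, TOLL_FREE = 0, 1, 2


def cell_index(day_of_week: int, hour: int, duration_s: int, status: int) -> int:
    """Flatten (day, time-of-day bin, duration bin, status) into one offset."""
    tod = hour // 4                                   # six 4-hour bins (assumed)
    dur = bisect_right(DURATION_EDGES, duration_s)    # 0..4
    return ((day_of_week * TOD_BINS + tod) * DUR_BINS + dur) * STATUSES + status


cube = [0] * (DAYS * TOD_BINS * DUR_BINS * STATUSES)

# A 30-minute call on day 2 at 21:00 with status RES increments one cell.
idx = cell_index(2, 21, 1800, RES)
cube[idx] += 1
```

The cell counts could then serve as inputs to a scoring model such as the logistic regression mentioned on the slide.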
Processing (continued)
- The training set is used to classify TNs with unknown status based on probabilities
- Inactive TNs are not updated
- “Bizocity” scores for unknown TNs are generated using probabilities
Accuracy
- Accuracy of status prediction is 75%
- Failures are due to incorrectly provided status or shifting status (e.g., home businesses, cell phones, etc.)
Data Structures
- Exploit the “exchange” concept (the first 6 digits of a TN form an exchange)
- Only about 150,000 of the 1M possible exchanges are in use
- All 10,000 TNs for each exchange are stored sequentially, whether used or not
- Each data structure is 2 GB for each variable (lower bound is 1.5 GB)
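The exchange-based layout amounts to direct addressing: look up the exchange's slot, then use the last four digits as an offset within it. The sketch below assumes one byte per TN per variable, which is an assumption that happens to match the stated lower bound (150,000 exchanges x 10,000 TNs x 1 byte = 1.5 GB). The class name, the method names, and the use of a dictionary for the exchange lookup are hypothetical.

```python
LINES_PER_EXCHANGE = 10_000


class VariableStore:
    """One profile variable for every TN in every active exchange (1 byte each, assumed)."""

    def __init__(self, exchanges):
        # Assign each active 6-digit exchange a contiguous block of 10,000 slots.
        self.slot = {ex: i for i, ex in enumerate(sorted(exchanges))}
        self.data = bytearray(len(self.slot) * LINES_PER_EXCHANGE)

    def _offset(self, tn: str) -> int:
        """Exchange block start plus the line number from the last 4 digits."""
        return self.slot[tn[:6]] * LINES_PER_EXCHANGE + int(tn[6:])

    def get(self, tn: str) -> int:
        return self.data[self._offset(tn)]

    def set(self, tn: str, value: int) -> None:
        self.data[self._offset(tn)] = value


store = VariableStore(["973555", "212555"])  # fictional exchanges
store.set("9735558000", 42)
value = store.get("9735558000")
```

Storing unused TNs wastes some space, but it removes the need for a per-TN index: a lookup is one dictionary access and one array access.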
Interface
- Variety of visualization tools (start at the top, drill down)
- Web interface with password protection
- Images are computed on the fly
- C code directly computes images in GIF format
Toll Fraud Detection
- Same methodology, but event-driven
- Only about 15M TNs have to be tracked
- Profiles are about 512 bytes each (7.5 GB)