A presentation by W H Inmon BRIDGING THE GAP BETWEEN UNSTRUCTURED DATA AND STRUCTURED DATA
The informal systems of the corporation (unstructured data):
- .doc files
- .txt files
- .xls files
- email
- transcribed telephone conversations
The formal systems of a corporation (structured systems, structured data):
- corporate transactions
- corporate reports
- corporate databases
- customer files
- audit reports
It is estimated that less than 20% of corporate systems are structured.
(Diagram: unstructured 80%, structured 20%.)
Imagine what would happen if the two worlds could be integrated: the world of dbms, analytics, and other processing opens up.
Unstructured world: search engines, legal discovery, archive, taxonomy, ontology, document mgmt, web content.
Structured world: dbms, business intelligence, applications, transactions, OLTP, ERP, compliance.
Tight integration between the two types of data.
There is a gulf between the two worlds:
- technology
- business practice
- organizational
- historical
Think of the possibilities!
Imagine this: reports and visualization show a lot. Have you ever wondered why you can’t hook up your Business Objects to email? Or to telephone conversations?
There is a fundamental disconnect between unstructured data (text) and business intelligence (numbers). So what would happen if we had powerful visualization for text?
Correlative information becomes very easy to spot.
(Visualization labels: liver cancer, skin cancer, thirst, diabetes, blood pressure.)
Doing analysis on subpopulations of women:
- for the general population
- for women
- for women who smoke over the age of 50
The contrast between the correlations for different populations (for example, the general population versus women who smoke over the age of 50) leads to great insight.
What about looking at customer feedback, such as complaints? Now you can see the broader picture of what is happening.
(Visualization labels: service, delivery late, broken, installation, salesman attitude, wait too long, did not fit.)
But there are plenty of other places where the technology applies:
- manufacturing warranties (what patterns of defects are there?)
- weblogs (marketing: who is saying what?)
- customer complaints (what are the problem products?)
- general (what’s the buzz? what is on people’s minds?)
- insurance claims (what are the circumstances of accidents?)
Another possibility is the monitoring of email and the transport of email to the structured environment.
Monitoring emails and other corporate conversations:
- compliance (Sarbanes Oxley, HIPAA, BASEL II)
- making sure that email is being used properly
- corporate standard for language
A bunch of emails and conversations: what do you do with them?
Jan 3 – vp to vp: “This is going to be a real barn burner of a quarter….”
Jan 5 – finance to vp: “It looks like we are going to do $9,000,000 this quarter…”
Jan 5 – president to analyst: “This quarter looks like we are going to break new records…”
Feb 1 – employee to employee: “Did you see the stock market? Everything is going down…”
Feb 3 – president to vp: “What is happening to sales in the midwest? We didn’t expect this…”
Feb 4 – sales manager to vp: “The sales cycle looks like it is extending. The economy is tanking…”
Feb 3 – vp to vp: “It looks like we are going to be a little short this quarter…”
Feb 6 – president to vp: “What are we going to do to get sales up? Do we need to do some discounting?”
Mar 2 – sales person to vp: “Demand has dried up. We aren’t going to close as many sales this quarter as we thought…”
Examining emails (“combing” them) for important corporate information.
External categories (Sarbanes Oxley): quarter, stock, sales, discount, demand, sales cycle.
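As a rough illustration of what “combing” could look like, the sketch below scans a few of the quoted messages for the external category terms and records where each term occurs. It is a minimal sketch only: the function name `comb`, the tuple layout of the messages, and the crude prefix matching (so that “discounting” counts as “discount”) are assumptions made for this example, not a description of any particular product.

```python
import re
from collections import defaultdict

# A few of the messages quoted on the slide, as (date, parties, text).
MESSAGES = [
    ("Jan 5", "finance to vp",
     "It looks like we are going to do $9,000,000 this quarter"),
    ("Feb 1", "employee to employee",
     "Did you see the stock market? Everything is going down"),
    ("Feb 6", "president to vp",
     "What are we going to do to get sales up? Do we need to do some discounting?"),
    ("Mar 2", "sales person to vp",
     "Demand has dried up. We aren't going to close as many sales this quarter as we thought"),
]

# External category terms listed on the slide.
CATEGORY_TERMS = ["quarter", "stock", "sales", "discount", "demand", "sales cycle"]


def comb(messages, terms):
    """Return {term: [(date, parties), ...]} for every term found in a message.

    Matching is deliberately crude (an assumption of this sketch): a term
    matches if it appears at the start of a word, ignoring case.
    """
    index = defaultdict(list)
    for date, parties, text in messages:
        lowered = text.lower()
        for term in terms:
            if re.search(r"\b" + re.escape(term), lowered):
                index[term].append((date, parties))
    return dict(index)


if __name__ == "__main__":
    for term, hits in comb(MESSAGES, CATEGORY_TERMS).items():
        print(term, hits)
```

The result is a small term-to-occurrence index, which is the kind of output that can then be moved to the structured environment.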
The “combed” information is brought over to the structured environment:
- sales: email – Feb 2; email – Mar 5; phone – Mar 8; ………………
- quarter: email – Jan 2; email – Jan 4; email – Feb 5; ………………
- discount: phone conversation – Jan 6; email – Jan 12; email – Jan 14; …………………………..
- sales cycle: email – Feb 24; phone conversation – Mar 14; meeting notes – Mar 18; …………………………….
Now you can use standard tools, such as Cognos, Business Objects, Crystal Reports, and MicroStrategy, to do analysis.
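To make the hand-off concrete, here is a minimal sketch that loads term occurrences into a relational table and runs an ordinary SQL query over them. SQLite stands in for whatever dbms is actually used; the table name `term_occurrence`, its columns, and the sample rows (including the "email" source label) are assumptions for illustration.

```python
import sqlite3

# Illustrative rows in the shape (term, source, date). The source labels and
# the choice of rows are assumptions made for this sketch.
ROWS = [
    ("sales", "email", "Feb 2"),
    ("sales", "phone", "Mar 8"),
    ("discount", "phone conversation", "Jan 6"),
    ("sales cycle", "phone conversation", "Mar 14"),
    ("sales cycle", "meeting notes", "Mar 18"),
]


def load(rows):
    """Create an in-memory database and load the combed occurrences."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE term_occurrence (term TEXT, source TEXT, occurred TEXT)"
    )
    conn.executemany("INSERT INTO term_occurrence VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def mentions_per_term(conn):
    """Count occurrences per term, the kind of query a BI tool would issue."""
    cursor = conn.execute(
        "SELECT term, COUNT(*) FROM term_occurrence GROUP BY term ORDER BY term"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    connection = load(ROWS)
    for term, count in mentions_per_term(connection):
        print(term, count)
```

Once the data is in rows and columns like this, any reporting tool that speaks SQL can work with it.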
But there are other ways that communications can be used. Emails and telephone conversations can be linked to CDI/CRM customer data by means of a probabilistic match.
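One simple way to picture a probabilistic match is a similarity score between what is known about the sender of a communication and each customer record, with a link made only above a threshold. The sketch below uses Python's standard `difflib.SequenceMatcher` for string similarity. The customer records, the 0.6/0.4 weighting, and the 0.8 threshold are all hypothetical values chosen for illustration; a real matching engine would be considerably more elaborate.

```python
from difflib import SequenceMatcher

# Hypothetical CDI/CRM records.
CUSTOMERS = [
    {"id": 1, "name": "Mary Jones", "email": "mjones@example.com"},
    {"id": 2, "name": "Mark Johnson", "email": "mark.johnson@example.com"},
]


def similarity(a, b):
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_customer(sender_name, sender_email, customers, threshold=0.8):
    """Return (customer_id, score) for the best match, or (None, score).

    The score is a weighted blend of name and address similarity; the
    weights and threshold are assumptions of this sketch.
    """
    best, best_score = None, 0.0
    for customer in customers:
        score = (0.6 * similarity(sender_name, customer["name"])
                 + 0.4 * similarity(sender_email, customer["email"]))
        if score > best_score:
            best, best_score = customer, score
    if best is not None and best_score >= threshold:
        return best["id"], best_score
    return None, best_score


if __name__ == "__main__":
    print(match_customer("Mrs Mary Jones", "mjones@example.com", CUSTOMERS))
    print(match_customer("Zed Qux", "zq@other.org", CUSTOMERS))
```

Returning the score alongside the id lets a downstream process treat weak links differently from strong ones.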
A true 360 degree view of the customer can be formed.
“I placed an order last week and when it arrived it was the wrong size. And then your company would not take it back. I’m mad.”
How easy is it going to be to engage Mrs Jones until she has satisfaction about her order?
Combining communications with demographics delivers on the promise of CDI.
Can’t I just use a search engine to link the two worlds? Search engines do not integrate textual information.
Text doesn’t need to be searched; it needs to be integrated.
Integration example: the abbreviation “ha” may stand for “head ache”, “heart attack”, or “Hepatitis A”.
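The slide does not say how the right meaning of “ha” is chosen. One plausible approach, assumed here purely for illustration, is to use context about the document, such as the department it came from. In the sketch below the lookup table, the department names, and the function `resolve_homographs` are all hypothetical.

```python
# Hypothetical sense table: abbreviation -> {document context: expansion}.
HOMOGRAPHS = {
    "ha": {
        "cardiology": "heart attack",
        "neurology": "head ache",
        "hepatology": "Hepatitis A",
    }
}


def resolve_homographs(text, context, table=HOMOGRAPHS):
    """Expand ambiguous tokens using the context the text came from.

    Tokens with no entry for the given context are left unchanged, so an
    unknown context never produces a guess.
    """
    resolved = []
    for token in text.split():
        senses = table.get(token.lower())
        if senses and context in senses:
            resolved.append(senses[context])
        else:
            resolved.append(token)
    return " ".join(resolved)


if __name__ == "__main__":
    print(resolve_homographs("patient reports ha", "cardiology"))
    print(resolve_homographs("patient reports ha", "neurology"))
```

Leaving the token untouched when the context is unknown keeps the ambiguity visible rather than hiding it behind a wrong expansion.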
Integration example: “oblique fractured ulna”, “oblique fractured tibia”, and “obliq fractured tarsi” are all instances of “broken bone”.
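A minimal sketch of this kind of integration follows: it first normalizes an alternate spelling (“obliq” to “oblique”) and then tags the phrase with a broader term so that all three phrases can be found under “broken bone”. The spelling table, the regular expression, and the function name `integrate` are assumptions for this example.

```python
import re

# Hypothetical alternate-spelling table.
SPELLING = {"obliq": "oblique"}

# Hypothetical rules mapping specific phrases to a broader term.
BROADER_TERMS = [
    (re.compile(r"\bfractured\s+(ulna|tibia|tarsi)\b"), "broken bone"),
]


def integrate(phrase):
    """Return (normalized phrase, list of broader terms that apply)."""
    words = [SPELLING.get(word, word) for word in phrase.lower().split()]
    normalized = " ".join(words)
    categories = [label for pattern, label in BROADER_TERMS
                  if pattern.search(normalized)]
    return normalized, categories


if __name__ == "__main__":
    for raw in ("oblique fractured ulna",
                "oblique fractured tibia",
                "obliq fractured tarsi"):
        print(integrate(raw))
```

The original wording is kept next to the broader term, so a query can work at either level of detail.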
What is meant by editing, integrating text?
1 – stop word editing
2 – stemming
3 – synonym replacement
4 – synonym concatenation
5 – homograph resolution
6 – alternate spelling resolution
7 – external category classification
8 – theming
9 – probabilistic matching
10 – negation exclusion
11 – concept clustering
12 – mid process editing
13 – change sensitivity
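Three of the steps above (stop word editing, stemming, and negation exclusion) can be sketched in a few lines. This is a toy under stated assumptions: the stop word and negation lists are tiny, the suffix stripping is a crude stand-in for a real stemmer such as Porter's, and negation exclusion is simplified to dropping the first content word that follows a negation within the same sentence.

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "of", "to", "and", "has", "was"}
NEGATIONS = {"no", "not", "without"}
SUFFIXES = ("ing", "ed", "s")


def stem(word):
    """Strip one common suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def edit_text(text):
    """Return the list of edited terms for a piece of raw text."""
    terms = []
    for sentence in re.split(r"[.!?]+", text.lower()):
        skip_next = False  # set when a negation has just been seen
        for token in re.findall(r"[a-z]+", sentence):
            if token in NEGATIONS:
                skip_next = True
                continue
            if token in STOP_WORDS:
                continue
            if skip_next:
                # Negation exclusion: the negated term is not indexed.
                skip_next = False
                continue
            terms.append(stem(token))
    return terms


if __name__ == "__main__":
    print(edit_text("The patient has no fever. Coughing started Monday."))
```

In this example “fever” is left out because it is negated, which is the point of negation exclusion: a document saying “no fever” should not be counted as a mention of fever.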
For a detailed description of how the unstructured environment should be linked to the structured environment, go to - and look for DW 2.0 TM or go to -
(Diagram labels: Unstructured Data; probabilistic match; Structured Environment; DB2; Query; visualization; Business Objects, Cognos, MicroStrategy, Crystal Reports.)