INFORMATION RETRIEVAL

1 INFORMATION RETRIEVAL
DATA WAREHOUSING & INFORMATION RETRIEVAL
Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University
POBox Dallas, Texas
The contents of this presentation draw extensively from slides for: Data Mining, Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003.
4/17/07, Tecnológico de Monterrey, SMU CSE 8337

2 DW&IR Outline
Introduction
Data Warehousing
Research
Summary

3 DW&IR Outline
Introduction
  Data Warehousing Overview
  Information Retrieval
Data Warehousing
Research
Summary

4 Data Warehousing
“Subject-oriented, integrated, time-variant, nonvolatile” (William Inmon)
Operational Data: Data used in the day-to-day needs of a company.
Informational Data: Supports other functions such as planning and forecasting.
Data mining tools often access data warehouses rather than operational data.

5 Data Warehouse Variations
Data Mart – Subset of complete data warehouse
Virtual Warehouse – Warehouse implemented as a view of operational data

6 Operational vs. Informational
                Operational Data      Data Warehouse
Application     OLTP                  OLAP
Use             Precise Queries       Ad Hoc
Temporal        Snapshot              Historical
Modification    Dynamic               Static
Orientation                           Business
Data            Operational Values    Integrated
Size            Gigabits              Terabits
Level           Detailed              Summarized
Access          Often                 Less Often
Response        Few Seconds           Minutes
Data Schema     Relational            Star/Snowflake

7 Information Retrieval
Information Retrieval (IR): retrieving desired information from textual data.
  Library Science
  Digital Libraries
  Web Search Engines
Traditionally keyword based
  Sample query: Find all documents about “data mining”
IR being applied to other unformatted data

8 DB vs IR
Records (tuples) vs. documents
Well defined results vs. fuzzy results
DB grew out of files and traditional business systems
IR grew out of library science and the need to categorize/group/access books/articles

9 DB vs IR (cont’d)
Data retrieval
  which docs contain a set of keywords?
  Well defined semantics
  a single erroneous object implies failure!
Information retrieval
  information about a subject or topic
  semantics is frequently loose
  small errors are tolerated
IR system:
  interpret contents of information items
  generate a ranking which reflects relevance
  notion of relevance is most important
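
The contrast between exact data retrieval and ranked information retrieval can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three documents, the whitespace tokenizer, and the scoring rule (fraction of query terms that appear in a document) are hypothetical and are not taken from the slides.

```python
# Hypothetical toy collection; ids and texts are made up for illustration.
docs = {
    "d1": "data mining finds patterns in large data sets",
    "d2": "a data warehouse stores historical data",
    "d3": "mining equipment for coal",
}

def tokenize(text):
    """Assumed tokenizer: lowercase and split on whitespace."""
    return set(text.lower().split())

def data_retrieval(docs, keywords):
    """Well defined semantics: ids of documents containing every keyword."""
    wanted = tokenize(keywords)
    return sorted(d for d, text in docs.items() if wanted <= tokenize(text))

def ranked_retrieval(docs, query):
    """Loose semantics: rank documents by the fraction of query terms matched."""
    terms = tokenize(query)
    if not terms:
        return []
    scored = [(len(terms & tokenize(text)) / len(terms), d) for d, text in docs.items()]
    # Highest score first; ties are broken by document id.
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [(d, score) for score, d in ordered if score > 0]

exact = data_retrieval(docs, "data mining")
ranked = ranked_retrieval(docs, "data mining")
```

Here `exact` holds only the document that contains both keywords, while `ranked` also lists partial matches with lower scores, which is the sense in which small errors are tolerated.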

10 Information Retrieval (cont’d)
Similarity: measure of how close a query is to a document.
Documents which are “close enough” are retrieved.
Metrics:
\[ \text{Precision} = \frac{|\text{Relevant and Retrieved}|}{|\text{Retrieved}|} \]
\[ \text{Recall} = \frac{|\text{Relevant and Retrieved}|}{|\text{Relevant}|} \]
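
As a minimal sketch of the two metrics, the function below computes precision and recall from sets of document ids. The document ids are hypothetical, and returning 0.0 when a denominator is empty is an assumption of this sketch, not something the slides specify.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant AND retrieved
    # Assumption: report 0.0 instead of dividing by zero for empty sets.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result: four documents retrieved, three are truly relevant.
retrieved_docs = {"d1", "d2", "d3", "d4"}
relevant_docs = {"d2", "d4", "d5"}
p, r = precision_recall(retrieved_docs, relevant_docs)
```

With these sets, two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were retrieved (recall 2/3).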

11 IR Query Result Measures and Classification

12 DW&IR Outline
Introduction
Data Warehousing
  Dimensional Modeling
  OLAP
  Decision Support Systems
Research
Summary

13 Data Transformation for Data Warehouse
ETL – Extract, Transform, Load
Unwanted data must be removed
Convert heterogeneous sources into one common schema
As the operational data is probably a snapshot of the data, multiple snapshots may need to be merged to create the historical view
Summarize data
New derived data
Handle missing and erroneous data
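
The transformation steps listed above can be sketched as a small Python pipeline. Everything concrete here is hypothetical: the two snapshots, their field names, the field mappings, the period labels, and the policy of skipping rows with missing values are assumptions chosen only to make the steps visible.

```python
from collections import defaultdict

# Two hypothetical operational snapshots with different field names.
snapshot_jan = [
    {"prod": "A", "city": "Dallas", "qty": 10, "unit_price": 2.0},
    {"prod": "B", "city": "Dallas", "qty": None, "unit_price": 5.0},  # missing quantity
]
snapshot_feb = [
    {"product": "A", "location": "Dallas", "quantity": 7, "price": 2.5, "debug_flag": True},
]

# Source field -> field in the common warehouse schema; unmapped fields are dropped.
MAP_JAN = {"prod": "product", "city": "location", "qty": "quantity", "unit_price": "price"}
MAP_FEB = {"product": "product", "location": "location", "quantity": "quantity", "price": "price"}

def transform(row, period, mapping):
    """Convert one row to the common schema and add derived data."""
    out = {target: row.get(source) for source, target in mapping.items()}
    if out["quantity"] is None or out["price"] is None:
        return None  # assumed policy for missing data: skip the row
    out["period"] = period  # the period label turns snapshots into history
    out["revenue"] = out["quantity"] * out["price"]  # new derived data
    return out

def etl(sources):
    """Merge several (period, rows, mapping) snapshots into one historical list."""
    warehouse = []
    for period, rows, mapping in sources:
        for row in rows:
            clean = transform(row, period, mapping)
            if clean is not None:
                warehouse.append(clean)
    return warehouse

def summarize(warehouse):
    """Summarize: total revenue per (product, period)."""
    totals = defaultdict(float)
    for row in warehouse:
        totals[(row["product"], row["period"])] += row["revenue"]
    return dict(totals)

warehouse = etl([("2007-01", snapshot_jan, MAP_JAN), ("2007-02", snapshot_feb, MAP_FEB)])
summary = summarize(warehouse)
```

A real warehouse load would usually repair or flag erroneous rows rather than silently skip them; skipping keeps the sketch short.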

14 Data Warehouse Creation
Fig 1 [1]

15 Dimensional Modeling
View data in a hierarchical manner, more as business executives might
Useful in decision support systems and mining
Dimension: collection of logically related attributes; axis for modeling data.
Facts: data stored
Ex:
  Dimensions – products, locations, date
  Facts – quantity, unit price

16 Multidimensional Model Example
Fig 2 [1]

17 Cube view of Data
Fig 4 [1]

18 Aggregation Hierarchies

19 Multidimensional Schemas
Star Schema shows facts and dimensions
  Center of the star has facts shown in fact tables
  Outside of the facts, each dimension is shown separately in dimension tables
  Access to fact table from dimension table via join
    SELECT Quantity, Price
    FROM Facts, Location
    WHERE (Facts.LocationID = Location.LocationID)
      AND (Location.City = 'Dallas')
View as relations; the problems are the volume of data and indexing
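
The join from the slide can be tried end to end with Python's built-in `sqlite3` module. This is a sketch under assumptions: the table names `Facts` and `Location` and the columns `LocationID`, `City`, `Quantity`, and `Price` come from the query above, while `ProductID`, `State`, and all row values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension table (State is a hypothetical extra attribute).
CREATE TABLE Location (LocationID INTEGER PRIMARY KEY, City TEXT, State TEXT);
-- Fact table at the center of the star (ProductID is hypothetical).
CREATE TABLE Facts (
    ProductID  INTEGER,
    LocationID INTEGER REFERENCES Location(LocationID),
    Quantity   INTEGER,
    Price      REAL
);
INSERT INTO Location VALUES (1, 'Dallas', 'TX'), (2, 'Houston', 'TX');
INSERT INTO Facts VALUES (100, 1, 5, 9.99), (101, 1, 2, 24.50), (100, 2, 7, 9.99);
""")

# The slide's query; ORDER BY is added only to make the result order deterministic.
rows = conn.execute(
    """
    SELECT Quantity, Price
    FROM Facts, Location
    WHERE Facts.LocationID = Location.LocationID
      AND Location.City = 'Dallas'
    ORDER BY Quantity
    """
).fetchall()
conn.close()
```

Only the two fact rows whose `LocationID` joins to the Dallas row of the dimension table are returned.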

20 Star Schema

21 Flattened Star

22 Normalized Star

23 Snowflake Schema

24 OLAP
Online Analytic Processing (OLAP): supports more complex queries than OLTP.
OnLine Transaction Processing (OLTP): traditional database/transaction processing.
Dimensional data; cube view
Support ad hoc querying
Require analysis of data
Can be thought of as an extension of some of the basic aggregation functions available in SQL
OLAP tools may be used in DSS systems
Multidimensional view is fundamental

25 OLAP Implementations
MOLAP (Multidimensional OLAP)
  Multidimensional Database (MDD)
  Specialized DBMS and software system capable of supporting the multidimensional data directly
  Data stored as an n-dimensional array (cube)
  Indexes used to speed up processing
ROLAP (Relational OLAP)
  Data stored in a relational database
  ROLAP server (middleware) creates the multidimensional view for the user
  Less complex; less efficient
HOLAP (Hybrid OLAP)
  Not updated frequently – MDD
  Updated frequently – RDB

26 OLAP Operations
[Figure: OLAP operations. Labels: Roll Up, Drill Down, Single Cell, Multiple Cells, Slice, Dice]

27 OLAP Operations
Simple query – single cell in the cube
Slice – Look at a subcube to get more specific information
Dice – Rotate cube to look at another dimension
Roll Up – Dimension Reduction; Aggregation
Drill Down
Visualization: These operations allow the OLAP users to actually “see” results of an operation.
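
A rough sketch of these operations on a tiny in-memory cube is shown below. The cube contents, the dimension names, and the dictionary representation are hypothetical. Dice is illustrated, following the slide's wording, as taking the view along a different dimension; drill down is simply the move from a rolled-up result back to the more detailed cube.

```python
from collections import defaultdict

# Hypothetical cube: (product, city, month) -> quantity sold.
DIMS = ("product", "city", "month")
cube = {
    ("pen", "Dallas", "Jan"): 5,
    ("pen", "Dallas", "Feb"): 3,
    ("pen", "Houston", "Jan"): 4,
    ("ink", "Dallas", "Jan"): 2,
    ("ink", "Houston", "Feb"): 6,
}

def cell(cube, product, city, month):
    """Simple query: a single cell of the cube (0 if nothing was recorded)."""
    return cube.get((product, city, month), 0)

def slice_cube(cube, dim, value):
    """Slice: the subcube where one dimension is fixed to a single value."""
    i = DIMS.index(dim)
    return {key: qty for key, qty in cube.items() if key[i] == value}

def roll_up(cube, dim):
    """Roll up: remove one dimension by summing over all of its values."""
    i = DIMS.index(dim)
    totals = defaultdict(int)
    for key, qty in cube.items():
        totals[key[:i] + key[i + 1:]] += qty
    return dict(totals)

one_cell = cell(cube, "pen", "Dallas", "Jan")
pens = slice_cube(cube, "product", "pen")    # slice on the product dimension
dallas = slice_cube(cube, "city", "Dallas")  # same cube viewed along the city dimension
by_product_city = roll_up(cube, "month")     # aggregate the month dimension away
```

Storing the cube as a dictionary keyed by dimension values is closer in spirit to the array-based MOLAP view; a ROLAP system would answer the same questions with SQL aggregation over relational tables.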

28 Relationship Between Topics

29 Decision Support Systems
Tools and computer systems that assist management in decision making
“What if” types of questions
High level decisions
Data warehouse – data which supports DSS

30 Data Warehouse Links
OLAP
General Data Warehousing
DW Products
Interesting Articles
  “Teaching Effective Methodologies to Design a Data Warehouse,” by Behrooz Seyed-Abbassi
  “An Oracle DBA’s Guide to the OLAP Option,” by Mark Rittman

31 DW&IR Outline
Introduction
Data Warehousing
Research
  Bibliomining
  Hierarchical Multimedia IR
  Ontology-based OLAP & IR
Summary

32 Bibliomining [2,3]
Data Warehousing + Data Mining + Libraries
Abstract, cleanse, summarize library data
  Documents
  Users (including demographics)
  Circulation Records (including Web server records)
Privacy of utmost importance [2] [3]

33 Hierarchical Multimedia IR [4]
DW Approach to Multimedia IR
  Allows easier integration of multiple data types
  Facilitates indexing
  Facilitates searching
  Allows data to be stored at many different granularities and dimensions
  Data aggregation
“data warehouses are not just large databases; they are large, complex environments that integrate many technologies” [p729]
Multimedia starflake schema
  Denormalized star dimension table
  Normalized snowflake tables

34 Starflake
Fig 2 [4]

35 Hierarchy of Data Cubes
Fig 4 [4]

36 Ontology-Based OLAP & IR [5]
Combine structured and document data obtained from Web
Global Ontology
  Includes OLAP dimensions
  Contains resource metadata
  RDF based
IR based on
  Both queries and resources represented as RDF metadata

37 Ontology OLAP&IR Architecture
Fig 1 [5]

38 OLAP Dimensions in RDF
Fig 2 [5]

39 RDF Query
Fig 6 [5]

40 DW&IR Outline
Introduction
Data Warehousing
Research
Summary

41 Summary
Information Retrieval is being extended to many different data types
  Multimedia
  Data warehouse
Data Warehousing is being extended beyond the basic business domain
Little research in combining DW and IR
  “Integrating Unstructured Text into the Structured Environment: The Value Proposition,” by Bill Inmon

42 Bibliography
[1] Anne-Muriel Arigon, Anne Tchounikine, and Maryvonne Miquel, “Handling Multiple Points of View in a Multimedia Data Warehouse,” ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 2, No. 3, August 2006, pp. 199–218.
[2] S. Nicholson, “The Bibliomining Process: Data Warehousing and Data Mining for Library Decision-Making,” Information Technology and Libraries, 22(4), 2003.
[3] S. Nicholson, “The Basis for Bibliomining: Frameworks for Bringing Together Usage-Based Data Mining and Bibliometrics through Data Warehousing in Digital Library Services,” Information Processing & Management, 42(3), May 2006, pp
[4] Jane You, Tharam Dillon, James Liu, and Edwige Pissaloux, “On Hierarchical Multimedia Information Retrieval,” Proceedings of the 2001 International Conference on Image Processing, 7-10 Oct 2001, pp. 729–732.
[5] Torsten Priebe and Gunther Pernul, “Ontology-based Integration of OLAP and Information Retrieval,” Proceedings of the 14th International Workshop on Database and Expert Systems Applications, 2003.

43 Thank You

