Multidimensional analysis model for a document warehouse that includes textual measures KIM JEONG RAE UOS.DML
Introduction
Author : Martha Mendoza, Erwin Alegria, Manuel Maca, Carlos Cobos, Elizabeth Leon
Location : Information Technology Research Group (GTI), etc., Colombia
Title : Multidimensional analysis model for a document warehouse that includes textual measures
Document Type : Decision Support Systems 72 (2015)
Date : February
Contents 3
Abstract
Analysis Model
  Proposed document warehouse model
  Multi-dimensional model
  Textual measures and aggregation function
  OLAP document visualization
Conclusion
  Evaluation results
Abstract (1/2) 4
Motivation : Business systems are increasingly required to handle substantial quantities of unstructured textual information.
Problem : Managing unstructured text data stored in data warehouses.
Approach : A new multidimensional analysis model is proposed that includes textual measures as well as a topic hierarchy. The textual measures that associate the topics with the text documents are generated by Probabilistic Latent Semantic Analysis (PLSA), while the hierarchy is created automatically using a clustering algorithm.
Abstract (2/2) 5
Result : The model gained increasing acceptance with use, and its visualization was also well received by users.
Contribution : This paper proposes a multidimensional model that incorporates textual measures. The model allows documents to be queried using OLAP operations.
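Proposed document warehouse model 6
Four main processes : ① Topic hierarchy building, ② Probabilistic measures calculation, ③ ETL (Extract-Transform-Load), ④ Multidimensional cube building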
Proposed document warehouse model 7
Topic Hierarchy Building ①
Two-algorithm process : Cosme (step 1), then the modified IGBHSK (Iterative Global-Best Harmony Search K-means) algorithm
Proposed document warehouse model 8
Topic Hierarchy Building ①
Modified IGBHSK (Iterative Global-Best Harmony Search K-means algorithm) : three levels
Proposed document warehouse model 9
Topic Hierarchy Building ①
IGBHSK algorithm [Ref. #2] for the topic hierarchy
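To make the idea of a clustering-built topic hierarchy concrete, here is a minimal sketch. It is not IGBHSK: it swaps in plain Lloyd's k-means (no harmony search) and applies it recursively, so that depth=2 yields three levels (root, topics, subtopics). The function names, the toy vectors, and the parameters k and depth are all assumptions made for illustration.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on lists of floats; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def build_hierarchy(points, ids, k=2, depth=2):
    """Recursively split a document set into a nested topic tree (hypothetical helper)."""
    node = {"docs": list(ids), "children": []}
    if depth == 0 or len(points) <= k:
        return node
    labels = kmeans(points, k)
    for c in range(k):
        sub = [i for i, lab in enumerate(labels) if lab == c]
        # Skip empty clusters and degenerate splits that keep every document together.
        if sub and len(sub) < len(points):
            node["children"].append(
                build_hierarchy([points[i] for i in sub], [ids[i] for i in sub], k, depth - 1)
            )
    return node

def leaves(node):
    """Collect the document-id lists stored at the leaf topics."""
    if not node["children"]:
        return [node["docs"]]
    return [docs for child in node["children"] for docs in leaves(child)]

# Toy document vectors (assumed); real input would be term-weight vectors.
TOY_POINTS = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0], [9.1, 0.0]]
TOY_TREE = build_hierarchy(TOY_POINTS, ids=list(range(6)), k=2, depth=2)
```

Each document ends up in exactly one leaf topic, and every inner node keeps the union of its children's documents, which is the property a topic dimension hierarchy needs for roll-up.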
Proposed document warehouse model 10
Probabilistic measures calculation ②
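The abstract states that the textual measures are generated by Probabilistic Latent Semantic Analysis. As a rough illustration of what that computation looks like, the sketch below fits PLSA with the standard EM updates on a small document-by-word count matrix. The function name plsa, the toy counts, and the choice of the asymmetric parameterization P(topic | document), P(word | topic) are assumptions; the paper's actual pipeline may differ.

```python
import random

def plsa(counts, n_topics, iters=50, seed=0):
    """Fit PLSA by EM on a document-by-word count matrix (list of lists).

    Returns (p_z_d, p_w_z): P(topic | document) and P(word | topic).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def normalize(row):
        total = sum(row)
        # Fall back to a uniform distribution if the row carries no mass.
        return [x / total for x in row] if total > 0 else [1.0 / len(row)] * len(row)

    # Random initialization of both distributions.
    p_z_d = [normalize([rng.random() + 1e-3 for _ in range(n_topics)]) for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 1e-3 for _ in range(n_words)]) for _ in range(n_topics)]

    for _ in range(iters):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior P(z | d, w) is proportional to P(z | d) * P(w | z).
                post = normalize([p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)])
                # Accumulate expected counts for the M-step.
                for z in range(n_topics):
                    new_z_d[d][z] += counts[d][w] * post[z]
                    new_w_z[z][w] += counts[d][w] * post[z]
        # M-step: renormalize the expected counts into probabilities.
        p_z_d = [normalize(row) for row in new_z_d]
        p_w_z = [normalize(row) for row in new_w_z]
    return p_z_d, p_w_z

# Toy counts (assumed): two documents use words 0-1, two use words 2-3.
TOY_COUNTS = [[4, 3, 0, 0], [5, 2, 0, 0], [0, 0, 3, 4], [0, 0, 2, 5]]
P_Z_D, P_W_Z = plsa(TOY_COUNTS, n_topics=2)
```

In this reading, the rows of P_Z_D would feed a document-within-topic measure and the rows of P_W_Z a word-within-topic measure.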
Proposed document warehouse model 11
Proposed document warehouse model 12
ETL (Extract-Transform-Load) ③
Multi-dimensional model 13 Relational DB Schema
Multi-dimensional model 14
Standard dimensions
  Document dimension : name, document type
  Author dimension : name
  Date dimension : publish date
  Location dimension : city, country
  Word dimension : all words from the stored document set
  Topic dimension : topic hierarchy
Many-to-many (M-M) relationships
  Author-Group Bridge, Topic-Document-Group Bridge, Topic-Word-Group Bridge
Measures of the fact table and the topic and word dimension bridge tables
  Topics_Probab_TM : average probability of topics
  Documents_TM : probabilities of a document within topics
  Word_Probab_TM : probabilities of a word within topics
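The bridge tables listed above resolve many-to-many links between the fact table and a dimension. The sketch below shows that pattern for the author case only, using an in-memory SQLite database. All table and column names, and the toy rows, are hypothetical; they are not the schema from the paper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_document (document_key INTEGER PRIMARY KEY, name TEXT, document_type TEXT);
CREATE TABLE dim_author (author_key INTEGER PRIMARY KEY, name TEXT);
-- Bridge table: one row per (group, author) resolves the many-to-many link.
CREATE TABLE bridge_author_group (
    author_group_key INTEGER,
    author_key INTEGER REFERENCES dim_author(author_key)
);
-- The fact row points at an author group rather than a single author.
CREATE TABLE fact_document (
    document_key INTEGER REFERENCES dim_document(document_key),
    author_group_key INTEGER
);
""")

# Toy rows (assumed): Doc A has two authors, Doc B has one.
conn.executemany("INSERT INTO dim_document VALUES (?, ?, ?)",
                 [(1, "Doc A", "Journal Article"), (2, "Doc B", "Thesis")])
conn.executemany("INSERT INTO dim_author VALUES (?, ?)",
                 [(1, "Author X"), (2, "Author Y")])
conn.executemany("INSERT INTO bridge_author_group VALUES (?, ?)",
                 [(10, 1), (10, 2), (20, 2)])
conn.executemany("INSERT INTO fact_document VALUES (?, ?)", [(1, 10), (2, 20)])

def documents_by_author(conn, author_name):
    """Return the names of documents written by the given author, via the bridge."""
    rows = conn.execute("""
        SELECT d.name
        FROM fact_document f
        JOIN bridge_author_group b ON b.author_group_key = f.author_group_key
        JOIN dim_author a ON a.author_key = b.author_key
        JOIN dim_document d ON d.document_key = f.document_key
        WHERE a.name = ?
        ORDER BY d.name
    """, (author_name,)).fetchall()
    return [r[0] for r in rows]
```

The same group-key pattern would extend to the topic-document and topic-word bridges, with the probability measures stored on the bridge rows.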
Proposed document warehouse model 15 Multidimensional cube building ④
Textual measures and aggregation function 16
Textual measures and aggregation function 17
Textual measures and aggregation function 18
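Topics_Probab_TM is described in this deck as an average probability of topics, so one plausible reading of its aggregation function is a plain mean of topic probabilities over the documents that fall in each cube cell. The sketch below assumes exactly that; the row layout, the function name, and the toy values are illustrative and are not taken from the paper.

```python
from collections import defaultdict

def average_topic_probability(rows, group_by):
    """Average 'prob' per (group value, topic).

    rows: dicts with 'topic', 'prob', and the attribute named by group_by.
    Returns {(group_value, topic): mean probability}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        key = (row[group_by], row["topic"])
        sums[key] += row["prob"]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Toy rows (assumed): one row per document and topic.
TOY_ROWS = [
    {"document_type": "Journal Article", "topic": "T1", "prob": 0.2},
    {"document_type": "Journal Article", "topic": "T1", "prob": 0.6},
    {"document_type": "Journal Article", "topic": "T2", "prob": 0.8},
    {"document_type": "Thesis", "topic": "T1", "prob": 0.5},
]
BY_TYPE = average_topic_probability(TOY_ROWS, group_by="document_type")
```

Grouping by a different attribute, such as year, would correspond to viewing the same measure along another dimension.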
OLAP document visualization 19 Topics_Probab_TM : Document dimension - Type of Document
OLAP document visualization 20 Topics_Probab_TM : Date Dimension - year
OLAP document visualization 21 Topics_Probab_TM : Document type(rows) and year attribute(columns)
OLAP document visualization 22 Topics_Probab_TM : Attribute of year and Document type Slice – “Journal Article”
OLAP document visualization 23 Topics_Probab_TM : Attribute of year and Document type and author name Dice operation
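The slice and dice operations used in these views can be sketched on a toy cube held as a list of cells. A slice fixes one dimension to a single value; a dice restricts several dimensions to sets of allowed values. The cell layout, the function names, and the data below are assumptions for illustration only.

```python
def slice_cube(cells, dimension, value):
    """Slice: fix one dimension to a single value."""
    return [c for c in cells if c[dimension] == value]

def dice_cube(cells, **allowed):
    """Dice: keep cells whose value on each named dimension is in the allowed set."""
    return [c for c in cells if all(c[dim] in vals for dim, vals in allowed.items())]

# Toy cube cells (assumed values, hypothetical author names).
CELLS = [
    {"year": 2012, "document_type": "Journal Article", "author": "Author X",
     "topic": "T1", "topics_probab_tm": 0.4},
    {"year": 2013, "document_type": "Journal Article", "author": "Author Y",
     "topic": "T1", "topics_probab_tm": 0.3},
    {"year": 2013, "document_type": "Thesis", "author": "Author X",
     "topic": "T2", "topics_probab_tm": 0.7},
]

# Slice on document type, as in the "Journal Article" view.
JOURNAL_ONLY = slice_cube(CELLS, "document_type", "Journal Article")

# Dice on year and author name together.
DICED = dice_cube(CELLS, year={2013}, author={"Author X"})
```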
OLAP document visualization 24 Document_TM : each Topic and Document
OLAP document visualization 25 Document_TM : each Topic and year and Document
Conclusion - Evaluation results 26 Execution time results
Conclusion - Evaluation results 27 Execution time results
Conclusion - Evaluation results 28 User satisfaction results Statistical frequency analysis
Conclusion - Evaluation results 29 User satisfaction results Multivariate analysis
Thank you 30
Proposed document warehouse model 31 Results Cosme : XML file(Metadata)
Proposed document warehouse model 32 Result IGBHSK