Projects

Text Mining
Build a library for text mining using extended Spark SQL.
Implement ReCital (https://github.com/madgik/recital) in Spark SQL.
Foufoulas, Y., Stamatogiannakis, L., Dimitropoulos, H., & Ioannidis, Y. (2017, September). High-Pass Text Filtering for Citation Matching. In International Conference on Theory and Practice of Digital Libraries (pp. 355-366). Springer, Cham.

Text mining Library

Text Mining Library
Textwindow: returns a rolling window over the text. Each window contains three variable-length parts: the prev part, the middle part, and the next part. The middle part of the window may be filtered using a pattern.
Regular expression functions: functions to support regular expression matches.
Text similarity functions: cosine distance, Jaccard distance, etc.
JSON tools: functions that process JSON lists (e.g. sort, set operations, merge, split, compare).
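The library's exact API is not shown here; as a rough illustration, the two similarity functions could look like this in plain Python (names and signatures are assumptions, not the library's real interface):

```python
import math
from collections import Counter

def jaccard_distance(a_tokens, b_tokens):
    """Jaccard distance between two token sets: 1 - |A & B| / |A | B|."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def cosine_distance(a_tokens, b_tokens):
    """Cosine distance between the term-frequency vectors of two token lists."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)
```

In Spark SQL such functions would be exposed as UDFs so they can be called directly inside queries.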

Text Mining Library
Normalization functions: stopword removal, tokenization, stemming, space removal, punctuation removal.
Keyword functions: select representative terms or n-grams from texts or corpora according to their frequency.
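A minimal sketch of such a normalization pipeline; the stopword list and the suffix stripper are toy stand-ins for the library's real functions:

```python
import re

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to"}  # tiny illustrative list

def stem(token):
    """A crude suffix stripper standing in for a real stemmer (e.g. Porter)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def normalize(text):
    """Lowercase, strip punctuation, tokenize, drop stopwords, then stem."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation removal
    tokens = text.split()                  # tokenization + space removal
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [stem(t) for t in tokens]
```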

Text mining library
Using this library, create a simple algorithm in Spark SQL to extract FP7 project identifiers from documents.
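As a hint of what the extraction could look like, here is a hedged sketch in plain Python. The regular expression is illustrative only (real FP7 acknowledgements vary widely); in Spark SQL the same idea could be expressed with the built-in regexp_extract function or a registered UDF:

```python
import re

# Illustrative pattern: FP7 grants are often acknowledged as "FP7/nnnnnn"
# or "grant agreement no nnnnnn"; real acknowledgement texts vary widely.
FP7_PATTERN = re.compile(
    r"(?:FP7[/-]|grant\s+agreement\s+(?:no\.?|number)?\s*)(\d{6})",
    re.IGNORECASE,
)

def extract_fp7_ids(text):
    """Return the distinct six-digit FP7 grant numbers found in a document."""
    return sorted(set(FP7_PATTERN.findall(text)))
```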

ReCital implementation

ReCital implementation in Spark SQL
A relational citation matching algorithm in three steps:
1. Text mine publication fulltexts to extract the references section, using high-pass text filtering.
2. Match the extracted references sections against a database of publication metadata (titles, authors, dates, publishers, etc.), building an inverted index on titles to accelerate the matching.
3. Disambiguate the results: calculate a confidence value for each matched title in order to filter out false positives.

References Section Extraction

References Section Extraction: High-Pass Text Filtering

[Figure: text signal density per line, smoothed with a window of size 5. Lines containing reference markers such as years (1999-2012) and URLs show density values between 0.2 and 1, while plain text lines score near zero; the filter returns the lines whose density exceeds the average density.]
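A minimal sketch of this high-pass filter, assuming the per-line signal is simply the fraction of digit characters (years, page numbers); the paper's actual feature set is richer:

```python
def high_pass_filter(lines, window=5):
    """Return the lines whose smoothed signal density exceeds the average."""
    # Per-line signal: fraction of digit characters (a toy stand-in).
    signal = [sum(c.isdigit() for c in ln) / max(len(ln), 1) for ln in lines]
    # Smooth the signal with a rolling window centered on each line.
    half = window // 2
    density = []
    for i in range(len(signal)):
        win = signal[max(0, i - half): i + half + 1]
        density.append(sum(win) / len(win))
    avg = sum(density) / len(density)
    return [ln for ln, d in zip(lines, density) if d > avg]
```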

Title Matching: Inverted Index on Trigrams

Title Matching
Split each title in the metadata database into its trigrams.
Match each trigram from the extracted references section against the trigrams of the metadata database. Obviously, this would be too slow.
Instead, create an inverted index based on identifying trigrams. Identifying trigrams ideally occur in just one title. Approximate the ideal inverted index with an iterative method.
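A sketch of the iterative construction, assuming trigram extraction happens upstream and each title arrives as a set of trigrams (single letters stand in for trigrams, as in the slides' toy example):

```python
from collections import defaultdict

def characteristic_index(title_grams, max_iters=10):
    """Approximate an 'identifying trigram' inverted index iteratively.

    title_grams maps title_id -> set of trigrams of that title. In each
    round, a trigram whose posting list holds exactly one still-unresolved
    title becomes an identifying trigram; resolved titles are then ignored
    in later rounds.
    """
    postings = defaultdict(set)
    for tid, grams in title_grams.items():
        for g in grams:
            postings[g].add(tid)

    index, unresolved = {}, set(title_grams)
    for _ in range(max_iters):
        resolved = set()
        for g, tids in postings.items():
            live = tids & unresolved
            if len(live) == 1:
                index[g] = set(tids)   # g identifies the one live title
                resolved |= live
        if not resolved:
            break
        unresolved -= resolved
    # Titles with no identifying trigram keep all their posting lists.
    for tid in unresolved:
        for g in title_grams[tid]:
            index.setdefault(g, set(postings[g]))
    return index
```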

Characteristic Inverted Index

Titles:
  1: ABD
  2: BCFEH
  3: EABCG
  4: ABCF
  5: F

Inverted index:
  A -> 1,3,4   B -> 1,2,3,4   C -> 2,3,4   D -> 1
  E -> 2,3     F -> 2,4,5     G -> 3       H -> 2

First iteration (singleton trigrams):  D -> 1, H -> 2, G -> 3
Second iteration (remaining titles):   F -> 4,5
Characteristic inverted index:         D -> 1, F -> 4,5, G -> 3, H -> 2

Disambiguation of the results
Use the text near the matched title.
Create a bag of words consisting of all the available publication metadata.
Match this bag of words against the text near the title.
Each matching word counts differently according to its weight and its distance from the title. The length of the matched title also counts.
The exact formula can be found in the paper.

Representation formats and Compression

Compression Schemes
Gzip: heavy compression based on LZ77 and Huffman coding.
Bzip2: heavy compression based on the Burrows-Wheeler Transform (BWT).
Snappy: lightweight compression targeting high speed.
Zstandard (ZSTD): targets real-time compression scenarios with high compression ratios.
Other compression schemes: LZ4, LZF, LZO, RLE.
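The heavy-vs-lightweight trade-off can be felt with the codecs in the Python standard library; Snappy and Zstandard need third-party packages, so lzma stands in for another heavy codec here:

```python
import bz2
import gzip
import lzma

# Repetitive tabular data, the typical input of these codecs in a warehouse.
data = b"order_key,cust_key\n" + b"12345,67890\n" * 10_000

for name, compress in [
    ("gzip", gzip.compress),    # LZ77 + Huffman coding
    ("bzip2", bz2.compress),    # Burrows-Wheeler Transform
    ("lzma", lzma.compress),    # stand-in for another heavy codec
]:
    out = compress(data)
    print(f"{name}: {len(data)} -> {len(out)} bytes")
```

The heavier codecs trade CPU time for smaller output; lightweight codecs such as Snappy make the opposite trade.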

Simple Data representation formats
Text: CSV, JSON.
Sequence file: row-based flat file consisting of binary key/value pairs.
Avro: binary data serialization format.

Sophisticated Data representation formats (1)
PAX format: a hybrid format (balance between row and column store), designed to optimize cache performance. Data is laid out column-wise, but all the different columns are stored in the same disk page.

Sophisticated Data representation formats (2)
RCFile: record columnar file, which organizes records in row groups. Compression with ZLIB, Snappy, or Bzip2 (uncompressed by default).

Sophisticated Data representation formats (3)
ORCFile (Optimized RCFile): lightweight indexes (min/max values, counts, and other statistics) stored within the file; local dictionary encoding; 250 MB default stripe size (parameterizable). Compression algorithms: Snappy, ZLIB.

Sophisticated Data representation formats (4): Parquet

Sophisticated Data representation formats (5)
Parquet: columnar storage format. Different columns are not stored in the same disk page. Ideal for compression and single-attribute queries, but multiple disk I/Os are needed to reconstruct a record consisting of many attributes. Compression algorithms: Snappy, Gzip, LZO.
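A toy illustration of why the columnar layout helps single-attribute queries: each column is compressed as its own page, so a query on one attribute decompresses only that page. JSON and gzip stand in here for a real columnar format and codec:

```python
import gzip
import json

rows = [{"id": i, "country": "GR" if i % 2 else "DE", "amount": float(i)}
        for i in range(1000)]

# Row layout: one compressed blob holding whole records.
row_page = gzip.compress(json.dumps(rows).encode())

# Columnar layout: one independently compressed blob per attribute,
# so a query on `country` touches only that column's page.
col_pages = {k: gzip.compress(json.dumps([r[k] for r in rows]).encode())
             for k in rows[0]}

countries = json.loads(gzip.decompress(col_pages["country"]))
```

The country page is also far smaller than the whole-table page: a column of few distinct, repeating values compresses much better than interleaved records.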

Adaptive dictionary encoding

Differential dictionaries
Each block contains only the values that are new compared to the previous block. There are full data blocks and differential blocks. When only differential blocks are used across the file, the encoding behaves like global dictionary compression. We are currently using okeanos for large-scale experiments.

Full data block

Differential block

Benefits
Eliminates the storage of repeating values.
More efficient compression: differential dictionaries may be empty or very small in many cases.
Reduces the time needed for dictionary lookups during a scan.

Drawbacks
Differential dictionaries may require more bytes to represent the values; in this case, they may lead to larger files than local encoding.
Whole blocks cannot be skipped as with local dictionaries.

Adaptive Dictionaries
A balance between local and global encoding: select between a full data block and a differential block so that the number of bytes needed to represent the values remains stable.
Ensures higher compression ratios in any case and improves the ability to skip blocks.
The same block may contain differential dictionaries for one column and full dictionaries for another.

Crucial details
Differential dictionaries can be applied to most well-known storage formats.
Each attribute is encoded and compressed independently of the others.
The frequency of full data blocks can also be defined as a parameter.
A block may be differential for one attribute and a full data block for another.
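A hedged sketch of the differential encoding for a single attribute; the block layout and code assignment are assumptions for illustration, not the actual file format:

```python
def encode_blocks(blocks):
    """Dictionary-encode each block; a block's dictionary stores only the
    values not seen in earlier blocks (a differential dictionary)."""
    seen, out = {}, []
    for values in blocks:
        # New distinct values in this block, in first-seen order.
        new = [v for v in dict.fromkeys(values) if v not in seen]
        base = len(seen)
        for i, v in enumerate(new):
            seen[v] = base + i         # extend the cumulative dictionary
        codes = [seen[v] for v in values]
        out.append({"dict": new, "codes": codes})
    return out

def decode_blocks(encoded):
    """Rebuild the cumulative dictionary block by block and decode."""
    mapping, result = [], []
    for block in encoded:
        mapping.extend(block["dict"])
        result.append([mapping[c] for c in block["codes"]])
    return result
```

When every value of a block was already seen, its differential dictionary is empty, which is exactly the savings the slides describe.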

Scheduling

Dataflow Scheduling onto Heterogeneous Resources
Applications: queries modelled as a directed acyclic graph of operators with data dependencies between them. The execution of an operator can only start after data transfer from all its predecessors has finished.

Dataflow Scheduling onto Heterogeneous Resources
Cloud computing: on-demand provisioning of resources; providers offer different VM types.
Resource characteristics: CPU, memory, disk, etc.
Pricing models: different ratios of price to computational speed.
Users can specify the number of VMs to use; the number of resources is effectively "unlimited".

Dataflow Scheduling onto Heterogeneous Resources
Challenges:
Determine the number of resources to provision. A larger number may lead to better performance, but at a higher cost.
Determine which VM type to use. Performance may vary for operators with different characteristics, e.g. CPU-bound vs I/O-bound.
Identify solutions with different trade-offs between execution time and monetary cost. Exhaustive search of all possible alternatives may be infeasible.

Dataflow Scheduling onto Heterogeneous Resources

Dataflow Scheduling onto Heterogeneous Resources: Sky algorithm and more
Produces execution plans with different time/money trade-offs.
An iterative algorithm that incrementally computes a skyline of plans: it assigns each operator to all possible slots (already-used VMs or newly added VMs) and keeps only the plans on the skyline of the set of partial solutions.
Provides solutions close to the optimal Pareto front.
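The skyline-pruning step at the core of such an algorithm can be sketched as follows, with each (partial) plan reduced to a (time, cost) pair; this is an illustration of the pruning rule, not the full Sky algorithm:

```python
def dominates(q, p):
    """q dominates p if it is no worse in time and cost, and better in one."""
    return q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])

def skyline(plans):
    """Keep only the non-dominated (time, cost) plans."""
    return [p for p in plans if not any(dominates(q, p) for q in plans)]
```

After every operator assignment, dominated partial plans are discarded, which keeps the search space tractable while preserving the interesting trade-offs.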

Partitioning

Partitioning
Parallel database systems horizontally partition large amounts of structured data in order to provide parallel data processing capabilities for analytical workloads in shared-nothing clusters. One major challenge when horizontally partitioning large amounts of data is to reduce the network costs for a given workload and database schema. A common technique to reduce the network costs in parallel database systems is to co-partition tables on their join key in order to avoid expensive remote join operations.

Partitioning

Tables:
  Orders (order_key, cust_key):    (1,1) (2,1) (3,3) (4,2) (5,2)
  Customers (cust_key, cust_name): (1,A) (2,B) (3,C)

Co-partitioned on cust_key:
  Partition 1: Orders (1,1) (2,1) | Customers (1,A)
  Partition 2: Orders (4,2) (5,2) | Customers (2,B)
  Partition 3: Orders (3,3)       | Customers (3,C)
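The co-partitioning idea behind this example can be sketched in plain Python; the modulo hash partitioner is a stand-in for whatever partitioning function the database uses:

```python
def co_partition(orders, customers, n_parts=3):
    """Hash both tables on the join key (cust_key) so matching rows land
    in the same partition and the join needs no network transfer."""
    parts = [([], []) for _ in range(n_parts)]
    for order_key, cust_key in orders:
        parts[cust_key % n_parts][0].append((order_key, cust_key))
    for cust_key, name in customers:
        parts[cust_key % n_parts][1].append((cust_key, name))
    return parts

def local_join(parts):
    """Each partition joins independently, with no cross-partition lookups."""
    out = []
    for orders_p, custs_p in parts:
        names = dict(custs_p)
        out += [(ok, ck, names[ck]) for ok, ck in orders_p]
    return out
```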

Reference: Erfan Zamanian, Carsten Binnig, Abdallah Salama. Locality-aware Partitioning in Parallel Database Systems. SIGMOD 2015.

Data Cleaning
Queries run directly on dirty data (e.g. with outliers or without deduplication).
Clean only the part of the data needed to answer the query, using appropriate partitioning.
Then answer the query.
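A minimal sketch of this query-driven cleaning, with partition pruning and a pluggable cleaning routine; all names are illustrative:

```python
def query_with_lazy_cleaning(partitions, predicate, clean):
    """Clean only the partitions a query touches, then answer on clean data.

    `partitions` maps partition_key -> list of (possibly dirty) rows;
    `predicate` decides which partitions the query needs; `clean` is the
    cleaning routine (deduplication, outlier removal, ...).
    """
    cleaned = {}
    for key, rows in partitions.items():
        if predicate(key):              # prune partitions the query skips
            cleaned[key] = clean(rows)  # pay the cleaning cost only here
    return [row for rows in cleaned.values() for row in rows]
```

Partitions never touched by the query stay dirty, so cleaning cost is amortized over the queries that actually need the data.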

Other thematic areas for a project
Recommendation systems
Privacy preservation
Graph analysis