Projects
Text Mining
- Build a library for text mining using extended Spark SQL
- Implement ReCital in Spark SQL

Reference: Foufoulas, Y., Stamatogiannakis, L., Dimitropoulos, H., & Ioannidis, Y. (2017). "High-Pass Text Filtering for Citation Matching." In International Conference on Theory and Practice of Digital Libraries. Springer, Cham.
Text Mining Library
Text Mining Library

- Textwindow: returns a rolling window over the text. Each window contains three variable-length parts: the previous part, the middle part, and the next part. The middle part of the window may be filtered using a pattern (see the sketch below).
- Regular expression functions: functions that support regular-expression matching.
- Text similarity functions: cosine distance, Jaccard distance, etc.
- JSON tools: functions that can process JSON lists (e.g. sort, set operations, merge, split, compare).
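A minimal Python sketch of what a textwindow-style function might look like; the function name, parameters, and token-based windowing are illustrative assumptions, not the library's actual API:

```python
import re
from typing import Iterator, Optional, Tuple

def textwindow(text: str, prev: int = 2, middle: int = 1, nxt: int = 2,
               pattern: Optional[str] = None) -> Iterator[Tuple[str, str, str]]:
    """Yield (prev_part, middle_part, next_part) rolling windows over the
    token stream; if `pattern` is given, keep only windows whose middle
    part matches it."""
    tokens = text.split()
    regex = re.compile(pattern) if pattern else None
    for i in range(len(tokens) - middle + 1):
        mid = " ".join(tokens[i:i + middle])
        if regex and not regex.search(mid):
            continue
        yield (" ".join(tokens[max(0, i - prev):i]),            # prev part
               mid,                                             # middle part
               " ".join(tokens[i + middle:i + middle + nxt]))   # next part

# Example: windows whose middle token looks like a year
for w in textwindow("published in 2004 by ACM press", pattern=r"^(19|20)\d{2}$"):
    print(w)   # ('published in', '2004', 'by ACM')
```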
Text Mining Library

- Normalization functions: stopword removal, tokenization, stemming, whitespace removal, punctuation removal (see the sketch below)
- Keywords functions: select representative terms or n-grams from texts or corpora according to their frequency
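A toy Python sketch of such a normalization-plus-keywords pipeline; the tiny stopword list and the suffix-stripping "stemmer" are deliberately simplistic placeholders for the library's real functions:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}  # tiny illustrative list

def normalize(text):
    """Punctuation removal -> tokenization -> stopword removal -> naive stemming."""
    text = re.sub(r"[^\w\s]", " ", text.lower())      # strip punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]  # toy suffix stemmer

def keywords(corpus, k=5):
    """Select the k most frequent normalized terms as representative keywords."""
    counts = Counter(tok for doc in corpus for tok in normalize(doc))
    return [term for term, _ in counts.most_common(k)]

print(keywords(["Mining of massive datasets", "Text mining and data mining"]))
```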
Text Mining Library

Using this library, create a simple algorithm in Spark SQL to extract FP7 project identifiers from documents.
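A hedged PySpark sketch of one way to do this; it assumes FP7 identifiers appear as six-digit grant-agreement numbers near an "FP7" or "grant agreement" marker, and the docs table with its doc_id/fulltext columns is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fp7-extract").getOrCreate()

# Hypothetical input: a table docs(doc_id, fulltext).
spark.read.parquet("docs.parquet").createOrReplaceTempView("docs")

# Assumption: FP7 identifiers are 6-digit grant-agreement numbers that
# appear within ~40 characters of an "FP7" / "grant agreement" marker.
matches = spark.sql("""
    SELECT doc_id,
           regexp_extract(fulltext,
               '(?i)(?:FP7|grant agreement)[^0-9]{0,40}([0-9]{6})', 1) AS fp7_id
    FROM docs
""").where("fp7_id != ''")

matches.show()
```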
ReCital implementation
ReCital implementation in Spark SQL
A relational citation-matching algorithm in three steps:
1. Text-mine publication full texts to extract the references section
   - Use high-pass text filtering to locate the references section
2. Match the extracted references sections against a database that contains publication metadata (titles, authors, dates, publishers, etc.)
   - Build an inverted index on titles to accelerate the matching
3. Disambiguate the results
   - Calculate a confidence value for each matched title in order to filter out false positives
References Section Extraction
High-Pass Text Filtering
[Figure: per-line text signal density with window size = 5. Lines containing citation signals such as years (1999, 2002, 2003, 2004, 2006, 2009, 2010, 2012) and URLs (e.g. http://google.com) score high; lines whose density exceeds the average density are returned.]
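A minimal Python sketch of the high-pass idea illustrated above (not the paper's exact filter): lines are scored by the density of citation-like signals (years, URLs) in a rolling window, and lines above the document-wide average density are kept:

```python
import re

SIGNAL = re.compile(r"\b(?:19|20)\d{2}\b|https?://\S+")   # years and URLs

def high_pass_filter(lines, window=5):
    """Keep lines whose smoothed citation-signal density exceeds the
    document-wide average; the references section is dense in such signals."""
    signal = [1.0 if SIGNAL.search(line) else 0.0 for line in lines]
    half = window // 2
    density = []
    for i in range(len(lines)):
        win = signal[max(0, i - half):i + half + 1]
        density.append(sum(win) / len(win))
    avg = sum(density) / len(density) if density else 0.0
    return [line for line, d in zip(lines, density) if d > avg]
```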
Title Matching: Inverted Index on Trigrams
Title Matching
- Split each title in the metadata database into its trigrams
- Match each trigram from the extracted references section against the trigrams from the metadata database
- Done naively, this would obviously be too slow
- Instead, create an inverted index based on identifying trigrams: an identifying trigram ideally occurs in just one title
- Approximate the ideal inverted index with an iterative method (see the example and the sketch below)
Characteristic Inverted Index

Titles
  id | title
  1  | ABD
  2  | BCFEH
  3  | EABCG
  4  | ABCF
  5  | F

Full inverted index
  trigram | title ids
  A       | 1, 3, 4
  B       | 1, 2, 3, 4
  C       | 2, 3, 4
  D       | 1
  E       | 2, 3
  F       | 2, 4, 5
  G       | 3
  H       | 2

First iteration (keep trigrams that already identify a single title)
  D | 1      H | 2      G | 3

Second iteration (recompute rarity after removing the identified titles; F remains the most selective entry for titles 4 and 5)
  D | 1      H | 2      G | 3      F | 4, 5

(In this toy example single characters play the role of trigrams.)
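One plausible reading of the iterative method, sketched in Python; the paper's exact procedure may differ. In each round, every still-unidentified title contributes its rarest remaining trigram, and rarity is recomputed after the covered titles are removed:

```python
from collections import defaultdict

def characteristic_index(titles, max_iters=10):
    """Iteratively approximate an index of 'identifying' trigrams."""
    # Full inverted index: trigram -> set of title ids
    inv = defaultdict(set)
    for tid, title in titles.items():
        for i in range(len(title) - 2):
            inv[title[i:i + 3]].add(tid)

    index, covered = {}, set()
    for _ in range(max_iters):
        progress = False
        for tid, title in titles.items():
            if tid in covered:
                continue
            grams = [title[i:i + 3] for i in range(len(title) - 2)] or [title]
            # rarest trigram once already-covered titles are ignored
            best = min(grams, key=lambda g: len(inv[g] - covered))
            index[best] = inv[best] - covered
            if len(index[best]) == 1:
                covered.add(tid)
                progress = True
        if not progress:
            break
    return index

titles = {1: "deep learning", 2: "deep sea mining", 3: "text mining"}
print(characteristic_index(titles))
```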
Disambiguation of the results
- Use the text near the matched title
- Create a bag of words consisting of all the available publication metadata
- Match this bag of words against the text near the title
- Each matching word counts differently according to its weight and its distance from the title
- The length of the matched title also counts
- The exact formula can be found in the paper
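A purely illustrative Python sketch of such a confidence score; the 1/(1 + distance) weighting and the title-length bonus are assumptions made for illustration, NOT the formula from the paper:

```python
def confidence(context_tokens, title_pos, metadata_bag, title_len):
    """Illustrative confidence score (not the paper's formula): each metadata
    word found near the matched title contributes more the closer it is,
    and longer matched titles get a bonus."""
    score = 0.0
    for pos, tok in enumerate(context_tokens):
        if tok in metadata_bag:
            # metadata_bag maps word -> weight (e.g. an idf-style weight)
            score += metadata_bag[tok] / (1 + abs(pos - title_pos))
    return score + 0.1 * title_len  # long titles rarely match by chance
```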
Representation formats and Compression
Compression Schemes

- Gzip: heavy compression based on LZ77 and Huffman coding
- Bzip2: heavy compression based on the Burrows-Wheeler Transform (BWT)
- Snappy: lightweight compression targeting high speed
- Zstandard (ZSTD): targets real-time compression scenarios with high compression ratios
- Other compression schemes: LZ4, LZF, LZO, RLE
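A quick way to compare the two heavier codecs using only the Python standard library, on a toy repetitive input; Snappy and Zstandard have no stdlib bindings and would need the third-party python-snappy and zstandard packages:

```python
import bz2
import gzip

data = b"the quick brown fox jumps over the lazy dog\n" * 1000  # toy input

for name, compress in [("gzip", gzip.compress), ("bzip2", bz2.compress)]:
    out = compress(data)
    print(f"{name}: {len(data)} -> {len(out)} bytes "
          f"(ratio {len(data) / len(out):.1f}x)")
```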
Simple Data Representation Formats

- TEXT: CSV, JSON
- SequenceFile: row-based flat file consisting of binary key/value pairs
- Avro: binary data serialization format
Sophisticated Data Representation Formats (1)

PAX format
- Hybrid format (a balance between row store and column store)
- Designed to optimize cache performance
- Column store within the page: all columns of a row group are stored in the same disk page
Sophisticated Data Representation Formats (2)

RCFile
- Record-columnar file that organizes records into row groups
- Compression with ZLIB, Snappy, or Bzip2 (uncompressed by default)
Sophisticated Data Representation Formats (3)

ORCFile (Optimized RCFile)
- Lightweight indexes (min/max values, counts, and other statistics) stored within the file
- Local dictionary encoding
- 250 MB default stripe size (parameterizable)
- Compression algorithms: Snappy, ZLIB
Sophisticated Data Representation Formats (4)

Parquet
- Columnar storage format: different columns are not stored in the same disk page
- Ideal for compression and single-attribute queries
- Multiple disk I/Os are needed to reconstruct a record consisting of many attributes
- Compression algorithms: Snappy, Gzip, LZO
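A short PySpark sketch of writing and reading Parquet with an explicit compression codec; the input file and the user_id column are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("formats").getOrCreate()
df = spark.read.json("events.json")          # hypothetical input

# Columnar Parquet with a per-file compression codec; a single-attribute
# query then only reads the column chunks of the referenced column.
df.write.option("compression", "snappy").parquet("events.parquet")

spark.read.parquet("events.parquet").select("user_id").distinct().show()
```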
Adaptive dictionary encoding
Differential Dictionaries
- Each block contains only the values that are new compared to the previous block
- There are full-data blocks and differential blocks
- When only differential blocks are used across the whole file, the encoding behaves like global dictionary compression
- We are currently using okeanos for large-scale experiments
Full data block
Differential block
Benefits
- Eliminates the storage of repeating values
- More efficient compression: differential dictionaries may be empty or very small in many cases
- Reduces the time needed for dictionary lookups during a scan
Drawbacks
- Differential dictionaries may require more bytes to represent the values; in that case they can lead to bigger files than local encoding
- Whole blocks cannot be skipped, as they can with local dictionaries
Adaptive Dictionaries
- Balance between local and global encoding
- Select between a full-data block and a differential block so that the number of bytes needed to represent the values remains stable
- Ensures higher compression ratios in every case
- Improves the ability to skip blocks
- The same block may contain differential dictionaries for one column and full dictionaries for another
Crucial Details
- Differential dictionaries can be applied to most well-known storage formats
- Each attribute is encoded and compressed independently of the others
- The frequency of full-data blocks can also be defined as a parameter
- A block may be differential for one attribute and full-data for another (see the sketch below)
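A minimal Python sketch of the adaptive per-block choice, under the simplifying assumptions that dictionary cost is the total string length of its entries and that a differential block reuses the previous block's dictionary plus the new values:

```python
def encode_block(values, prev_dict):
    """Adaptive choice per block: emit a differential dictionary (only the
    values unseen in the previous block's dictionary) when it is cheaper,
    otherwise emit a full dictionary."""
    full = sorted(set(values))
    diff = sorted(set(values) - set(prev_dict))
    cost = lambda entries: sum(len(str(v)) for v in entries)  # crude byte estimate
    if prev_dict and cost(diff) < cost(full):
        # Appending keeps the previous codes stable for already-seen values.
        kind, block_dict = "differential", list(prev_dict) + diff
    else:
        kind, block_dict = "full", full
    codes = {v: i for i, v in enumerate(block_dict)}
    return kind, block_dict, [codes[v] for v in values]

# The first block must be a full-data block; later blocks adapt per column.
kind1, dict1, enc1 = encode_block(["a", "b", "a"], [])
kind2, dict2, enc2 = encode_block(["a", "c"], dict1)   # only "c" is new
print(kind1, enc1, kind2, enc2)   # full [0, 1, 0] differential [0, 2]
```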
Scheduling
Dataflow Scheduling onto Heterogeneous Resources

Applications
- Queries are modelled as Directed Acyclic Graphs (DAGs): operators with data dependencies between them
- The execution of an operator can only start after the data transfers from all of its predecessors have finished
Cloud Computing
- On-demand provisioning of resources
- Providers offer different VM types with different resource characteristics (CPU, memory, disk, etc.)
- Pricing models: different ratios of price to computational speed
- Users can specify the number of VMs to use; the number of available resources is effectively "unlimited"
Challenges
- Determine the number of resources to provision: a larger number may lead to better performance, but at a higher cost
- Determine which VM type to use: performance may vary for operators with different characteristics, e.g. CPU-bound vs. I/O-bound
- Identify solutions with different trade-offs between execution time and monetary cost
- An exhaustive search of all possible alternatives may be infeasible
Sky algorithm and more…
- Produces execution plans with different time/money trade-offs
- Iterative algorithm: incrementally computes a skyline of plans
- Assigns each operator to all possible slots (already-used VMs or newly added VMs)
- From the set of partial solutions, keeps only the plans on the skyline (see the sketch below)
- Provides solutions close to the optimal Pareto front
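A small Python sketch of the skyline step that the algorithm applies to its set of partial plans in each iteration; plans are represented as hypothetical {time, cost} dicts:

```python
def skyline(plans):
    """Keep only Pareto-optimal (time, cost) plans: a plan survives if no
    other plan is at least as fast AND at least as cheap."""
    keep = []
    for p in plans:
        dominated = any(q["time"] <= p["time"] and q["cost"] <= p["cost"]
                        and q != p for q in plans)
        if not dominated:
            keep.append(p)
    return keep

plans = [{"time": 10, "cost": 5}, {"time": 12, "cost": 3},
         {"time": 11, "cost": 6}, {"time": 9,  "cost": 9}]
print(skyline(plans))   # the (11, 6) plan is dominated by (10, 5) and dropped
```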
Partitioning
Partitioning

Parallel database systems horizontally partition large amounts of structured data in order to provide parallel data-processing capabilities for analytical workloads on shared-nothing clusters. One major challenge when horizontally partitioning large amounts of data is to reduce the network cost for a given workload and database schema. A common technique for reducing network cost in parallel database systems is to co-partition tables on their join key in order to avoid expensive remote join operations.
Partitioning: Example

Tables
  Orders                    Customers
  order_key | cust_key      cust_key | cust_name
  1         | 1             1        | A
  2         | 1             2        | B
  3         | 3             3        | C
  4         | 2
  5         | 2

Partitions (co-partitioned on cust_key)
  Partition 1: Orders {(1,1), (2,1)}    Customers {(1, A)}
  Partition 2: Orders {(4,2), (5,2)}    Customers {(2, B)}
  Partition 3: Orders {(3,3)}           Customers {(3, C)}
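A PySpark sketch of co-partitioning on the join key, using the toy Orders/Customers tables above; repartition(3, "cust_key") stands in for the system's locality-aware partitioning:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("copartition").getOrCreate()
orders = spark.createDataFrame(
    [(1, 1), (2, 1), (3, 3), (4, 2), (5, 2)], ["order_key", "cust_key"])
customers = spark.createDataFrame(
    [(1, "A"), (2, "B"), (3, "C")], ["cust_key", "cust_name"])

# Hash-partition both tables on the join key: rows with the same cust_key
# land in the same partition, so the join needs no extra network shuffle.
orders_p = orders.repartition(3, "cust_key")
customers_p = customers.repartition(3, "cust_key")

orders_p.join(customers_p, "cust_key").show()
```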
Reference: Erfan Zamanian, Carsten Binnig, Abdallah Salama. "Locality-aware Partitioning in Parallel Database Systems." SIGMOD 2015.
Data Cleaning
- Run queries directly on dirty data (e.g. data with outliers, or without deduplication)
- Clean only the part of the data needed to answer the query, using appropriate partitioning
- Answer the query
Other Thematic Areas for a Project
- Recommendation systems
- Privacy preservation
- Graph analysis