Searching Similar Segments over Textual Event Sequences

Slides:

Similar presentations

Indexing DNA Sequences Using q-Grams

Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Discovering Lag Interval For Temporal Dependencies Larisa Shwartz Liang Tang, Tao Li, Larisa Shwartz1 Liang Tang, Tao Li

Efficiently searching for similar images (Kristen Grauman)

BLAST Sequence alignment, E-value & Extreme value distribution.

What about the trees of the Mississippi? Suffix Trees explained in an algorithm for indexing large biological sequences Jacob Kleerekoper & Marjolijn Elsinga.

1 ALAE: Accelerating Local Alignment with Affine Gap Exactly in Biosequence Databases Xiaochun Yang, Honglei Liu, Bin Wang Northeastern University, China.

Content Based Image Clustering and Image Retrieval Using Multiple Instance Learning Using Multiple Instance Learning Xin Chen Advisor: Chengcui Zhang Department.

Structural bioinformatics

Quantile-Based KNN over Multi- Valued Objects Wenjie Zhang Xuemin Lin, Muhammad Aamir Cheema, Ying Zhang, Wei Wang The University of New South Wales, Australia.

Space/Time Tradeoff and Heuristic Approaches in Pairwise Alignment.

Modern Information Retrieval

March 2006Vineet Bafna Designing Spaced Seeds March 2006Vineet Bafna Project/Exam deadlines May 2 – Send to me with a title of your project May.

. Class 4: Fast Sequence Alignment. Alignment in Real Life u One of the major uses of alignments is to find sequences in a “database” u Such collections.

Heuristic alignment algorithms and cost matrices

Design of Optimal Multiple Spaced Seeds for Homology Search Jinbo Xu School of Computer Science, University of Waterloo Joint work with D. Brown, M. Li.

1 Jun Wang, 2 Sanjiv Kumar, and 1 Shih-Fu Chang 1 Columbia University, New York, USA 2 Google Research, New York, USA Sequential Projection Learning for.

1 1. BLAST (Basic Local Alignment Search Tool) Heuristic Only parts of protein are frequently subject to mutations. For example, active sites (that one.

Exact and Approximate Pattern in the Streaming Model Presented by - Tanushree Mitra Benny Porat and Ely Porat 2009 FOCS.

Nearest Neighbor Retrieval Using Distance-Based Hashing Michalis Potamias and Panagiotis Papapetrou supervised by Prof George Kollios A method is proposed.

BNFO 235 Lecture 5 Usman Roshan. What we have done to date Basic Perl –Data types: numbers, strings, arrays, and hashes –Control structures: If-else,

The Effectiveness Study of Music Information Retrieval Arbee L.P. Chen National Tsing Hua University 2002 ACM International CIKM Conference.

Indexing and Searching

“Multiple indexes and multiple alignments” Presenting:Siddharth Jonathan Scribing:Susan Tang DFLW:Neda Nategh Upcoming: 10/24:“Evolution of Multidomain.

Sequence alignment, E-value & Extreme value distribution

Space-Efficient Sequence Alignment Space-Efficient Sequence Alignment Bioinformatics 202 University of California, San Diego Lecture Notes No. 7 Dr. Pavel.

CSCE350 Algorithms and Data Structure Lecture 17 Jianjun Hu Department of Computer Science and Engineering University of South Carolina

Fundamentals of Algorithms MCS - 2 Lecture # 7

Nearest Neighbor Paul Hsiung March 16, Quick Review of NN Set of points P Query point q Distance metric d Find p in P such that d(p,q) < d(p’,q)

Optimizing multi-pattern searches for compressed suffix arrays Kalle Karhu Department of Computer Science and Engineering Aalto University, School of Science,

Length Reduction in Binary Transforms Oren Kapah Ely Porat Amir Rothschild Amihood Amir Bar Ilan University and Johns Hopkins University.

Faster Algorithm for String Matching with k Mismatches (II) Amihood Amir, Moshe Lewenstin, Ely Porat Journal of Algorithms, Vol. 50, 2004, pp

Analysis of Algorithms CSCI Previous Evaluations of Programs Correctness – does the algorithm do what it is supposed to do? Generality – does it.

BLAST Anders Gorm Pedersen & Rasmus Wernersson. Database searching Using pairwise alignments to search databases for similar sequences Database Query.

Chapter 3 Computational Molecular Biology Michael Smith

PatternHunter II: Highly Sensitive and Fast Homology Search Bioinformatics and Computational Molecular Biology (Fall 2005): Representation R 林語君.

CSC 211 Data Structures Lecture 13

Identifying Patterns in Time Series Data Daniel Lewis 04/06/06.

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.

Information Retrieval Lecture 6 Introduction to Information Retrieval (Manning et al. 2007) Chapter 16 For the MSc Computer Science Programme Dell Zhang.

Sequence Comparison Algorithms Ellen Walker Bioinformatics Hiram College.

Biocomputation: Comparative Genomics Tanya Talkar Lolly Kruse Colleen O’Rourke.

Information Retrieval and Organisation Chapter 16 Flat Clustering Dell Zhang Birkbeck, University of London.

Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.

LogTree: A Framework for Generating System Events from Raw Textual Logs Liang Tang and Tao Li School of Computing and Information Sciences Florida International.

LogSig: Generating System Events from Raw Textual Logs Liang Tang 1, Tao Li 1, Chang-Shing Perng 2 1 Florida International University 2 IBM T.J. Watson.

Output Sensitive Algorithm for Finding Similar Objects Jul/2/2007 Combinatorial Algorithms Day Takeaki Uno Takeaki Uno National Institute of Informatics,

Doug Raiford Phage class: introduction to sequence databases.

Heuristic Alignment Algorithms Hongchao Li Jan

Algorithm Design Techniques, Greedy Method – Knapsack Problem, Job Sequencing, Divide and Conquer Method – Quick Sort, Finding Maximum and Minimum, Dynamic.

CMPT 438 Algorithms.

Chapter 7. Classification and Prediction

Subject Name: Design and Analysis of Algorithm Subject Code: 10CS43

BLAST Anders Gorm Pedersen & Rasmus Wernersson.

Sequence comparison: Local alignment

Divide and Conquer – and an Example QuickSort

Objective of This Course

Unit-2 Divide and Conquer

Fast Sequence Alignments

Locality Sensitive Hashing

Coding Concepts (Basics)

Searching: linear & binary

EE368 Soft Computing Genetic Algorithms.

Dynamic Programming II DP over Intervals

Minwise Hashing and Efficient Search

Guess a random word with n errors

Sequence alignment, E-value & Extreme value distribution

Donghui Zhang, Tian Xia Northeastern University

CSE 326: Data Structures Lecture #14

Presentation transcript:

Searching Similar Segments over Textual Event Sequences Liang Tang*, Tao Li*, Shu-Ching Chen* and Shunzhi Zhu+ *Florida International University +Xiamen University of Technology 12/28/2018 ACM CIKM 2013

What is a Textual Event Sequence?
An event sequence in which each event is textual, for instance a log sequence, where each event is a textual log message.

Why Search for Similar Segments?
In system diagnosis, analyzing logs is a common approach, but log files are usually huge. Comparing similar segments helps identify the abnormal (or “error”) operation.
2013-10-11 23:10:00 server process X starts with aa ….
2013-10-11 23:10:01 client process Y1 starts…
2013-10-11 23:10:20 client process Y1 started successfully…
2013-10-11 23:10:20 client process Y2 starts…
...
2013-10-23 05:59:00 server process X starts with bb ….
2013-10-11 05:59:01 client process Y1 starts…
2013-10-11 05:59:20 process Y1 is stopped by unknown exceptions…
2013-10-11 06:01:05 client process Y2 starts…
…
The second segment contains the “error” operation.

Problem Statement
Given a textual event sequence S and a query sequence Q, find all segments of length |Q| in S that are similar to Q.
Definition of dissimilarity: [formula not preserved in the transcript]
Definition of similar segments: [formula not preserved in the transcript], where l = |Q| and e1i, e2i are the i-th events of the two segments.
In other words, similar segments have at most k dissimilar events; such segments are also called k-dissimilar.
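The problem statement can be sketched as a brute-force search. The transcript does not preserve the dissimilarity formula, so this sketch assumes two events are dissimilar when the Jaccard similarity of their word sets falls below a threshold delta; `jaccard`, `n_dissimilar`, `brute_force_search`, and `delta=0.5` are illustrative names and values, not taken from the slides.

```python
def jaccard(a, b):
    """Jaccard similarity of the word sets of two textual events (assumed measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def n_dissimilar(seg1, seg2, delta=0.5):
    """Count positions i where the i-th events of the two segments are dissimilar."""
    return sum(1 for e1, e2 in zip(seg1, seg2) if jaccard(e1, e2) < delta)


def brute_force_search(S, Q, k, delta=0.5):
    """Start offsets of all length-|Q| segments of S with at most k dissimilar events."""
    l = len(Q)
    return [p for p in range(len(S) - l + 1)
            if n_dissimilar(S[p:p + l], Q, delta) <= k]


S = [
    "server process X starts",
    "client process Y1 starts",
    "client process Y1 started successfully",
    "server process X starts",
    "client process Y1 starts",
    "process Y1 is stopped by unknown exceptions",
]
Q = S[0:3]
print(brute_force_search(S, Q, k=0))  # [0]
print(brute_force_search(S, Q, k=1))  # [0, 3]
```

With k = 1 the segment starting at offset 3 is also returned, because only its third event differs from the query.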

Related Solutions
Text Similarity Search: Locality Sensitive Hash (A. Gionis et al., 1999), Min-Hash (A. Z. Broder et al., 1998). For unordered data sets.
Substring Match: Suffix Tree, Suffix Arrays (U. Manber, 1993). For code sequences or numeric sequences.

Potential Solutions Based on LSH
LSH-DOC: each segment is treated as a small document; the order of events is ignored.
LSH-SEP: each segment is treated as a small document, but different hash functions are used for different regions.
Both index segments of a fixed length l, while Q is given by users. If |Q| >= l, split Q into multiple segments of length l. If |Q| < l, the index does not work.
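The fixed-length limitation can be illustrated with a small sketch of the query-splitting step. `split_query` is a hypothetical helper, and dropping a trailing remainder shorter than l is an assumption of this sketch, not something the slide specifies.

```python
def split_query(Q, l):
    """Split Q into consecutive, non-overlapping pieces of length l.

    Returns an empty list when |Q| < l, the case a fixed-length index cannot serve.
    A trailing remainder shorter than l is dropped (assumption of this sketch).
    """
    return [Q[i:i + l] for i in range(0, len(Q) - l + 1, l)]


print(split_query(["e1", "e2", "e3", "e4", "e5", "e6", "e7"], 3))
# [['e1', 'e2', 'e3'], ['e4', 'e5', 'e6']]
print(split_query(["e1", "e2"], 3))
# []
```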

Suffix Matrix = LSH + Suffix Arrays
Suffix trees and suffix arrays handle variable-length queries (substring search) over code sequences, such as DNA sequences.
Our idea: combine LSH with suffix arrays. Suffix arrays are preferred over suffix trees because they consume less memory.

Example of Suffix Matrix
Offline indexing:
Step 1. Construct m random hash functions.
Step 2. For each hash function, compute the hash value of each event.
Step 3. For each hash-value sequence, build the suffix array as a row of the suffix matrix.
Online search:
Step 1. Use the m hash functions to hash the query Q and get m hashed query sequences.
Step 2. Use every hashed query sequence to do a binary search over the corresponding suffix array and get candidate segment positions.
Step 3. If a segment appears in many candidate sets, pick it as a final candidate.
Example: S = e1e2e3e4 is a textual event sequence, and h1, h2, and h3 are 3 independent hash functions. The i-th row of the suffix matrix is the suffix array of the i-th hashed sequence.
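The indexing and search steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes MinHash over word tokens as the LSH function, builds each suffix array by plain sorting, and materializes truncated suffixes at query time for readability (a real index would compare in place). The names `minhash`, `build_suffix_matrix`, `search`, and `min_votes` are hypothetical.

```python
import hashlib
from bisect import bisect_left, bisect_right
from collections import Counter


def minhash(event, seed):
    """One MinHash value of an event's word set (the LSH choice is an assumption)."""
    return min(
        (int(hashlib.md5(f"{seed}:{tok}".encode()).hexdigest(), 16)
         for tok in event.lower().split()),
        default=0,
    )


def build_suffix_matrix(S, m):
    """Offline indexing: one (hashed sequence, suffix array) row per hash function."""
    rows = []
    for seed in range(m):
        hashed = [minhash(e, seed) for e in S]
        # Suffix array: start positions sorted by the suffix that begins there.
        suffix_array = sorted(range(len(hashed)), key=lambda p: hashed[p:])
        rows.append((hashed, suffix_array))
    return rows


def search(rows, Q, min_votes):
    """Online search: positions found by at least min_votes of the m rows."""
    votes = Counter()
    for seed, (hashed, suffix_array) in enumerate(rows):
        hq = [minhash(e, seed) for e in Q]
        # Suffixes truncated to |Q| stay in sorted order, so binary search applies.
        prefixes = [hashed[p:p + len(hq)] for p in suffix_array]
        lo, hi = bisect_left(prefixes, hq), bisect_right(prefixes, hq)
        votes.update(suffix_array[lo:hi])
    return sorted(p for p, v in votes.items() if v >= min_votes)


S = [
    "server process X starts",
    "client process Y1 starts",
    "client process Y1 started successfully",
    "client process Y2 starts",
]
rows = build_suffix_matrix(S, m=8)
print(search(rows, S[1:3], min_votes=8))  # contains position 1
```

An exact copy of the query always collides under every hash function, so position 1 receives all 8 votes; lowering `min_votes` admits segments that only resemble the query.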

Reaching Probability & Collision Probability
Cumulative probability of the binomial distribution. Lower bound for the reaching probability. Upper bound for the collision probability.
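For reference, the cumulative probability of a binomial distribution has the standard form below. The slide's exact bounds are not preserved in the transcript; reading m as the number of hash functions and p as a per-function collision probability is an assumption of this note, and r is an illustrative symbol.

```latex
\Pr[X \le r] = \sum_{i=0}^{r} \binom{m}{i} p^{i} (1-p)^{m-i},
\qquad X \sim \mathrm{Binomial}(m, p)
```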

Problem of Dissimilar Events in Suffix Search
If the dissimilar event is in the middle of the segment, the binary search over suffixes will fail.
Example: for the dissimilar event, the hash value 9 is not equal to 1, so L and Q are not in the same partition of the suffix array and the binary search fails. Why? “1933” is not in the interval [“1133”, “1134”].
How to solve it? Ignore the second position of the segments. However, we do not know in advance which positions hold dissimilar events.
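The failure can be reproduced with Python's `bisect` on a small sorted list. The key list below is hypothetical; only "1133", "1134", and "1933" come from the slide.

```python
from bisect import bisect_left, bisect_right

# Hypothetical sorted hashed suffixes (truncated to 4 values); "1133" plays the role of L.
keys = ["0113", "1133", "1134", "2001", "3200"]


def find(sorted_keys, query):
    """Indices of sorted_keys equal to query, located by binary search."""
    return list(range(bisect_left(sorted_keys, query), bisect_right(sorted_keys, query)))


print(find(keys, "1133"))         # [1]  exact copy of L is found
print(find(keys, "1933"))         # []   one dissimilar event in the middle: nothing found
print(bisect_left(keys, "1933"))  # 3    the search lands past "1134", far from L
```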

Random Mask
Idea: create hash-value sequences in which some randomly chosen positions are ignored. This is done by a random mask.
[Figure: original hash-value sequence, random mask, masked hash-value sequence]
Using the masked sequence M1(h(S)) will NOT hurt the binary searches for suffixes.
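A minimal sketch of the masking idea, under stated assumptions: masked positions are replaced by a wildcard symbol so they no longer affect comparisons. How the mask is aligned with suffix offsets is not recoverable from the transcript, so this sketch only shows the effect on two segments of equal length; `random_mask`, `apply_mask`, `keep_prob`, and the wildcard `"*"` are hypothetical.

```python
import random


def random_mask(length, keep_prob, seed):
    """Random boolean mask; True keeps a position, False ignores it."""
    rng = random.Random(seed)
    return [rng.random() < keep_prob for _ in range(length)]


def apply_mask(hashed, mask, wildcard="*"):
    """Replace ignored positions by a wildcard so they compare as equal."""
    return [h if keep else wildcard for h, keep in zip(hashed, mask)]


# A mask that happens to ignore the second position, as in the slide's example.
mask = [True, False, True, True]
L_masked = apply_mask(list("1133"), mask)
Q_masked = apply_mask(list("1933"), mask)
print(L_masked, Q_masked, L_masked == Q_masked)
# ['1', '*', '3', '3'] ['1', '*', '3', '3'] True
```

Because the dissimilar position is not known in advance, several independent masks would be needed so that at least one of them hides it with reasonable probability.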

Reaching Probability for k-dissimilar Segments
Lower bound for the reaching probability. The upper bound for the collision probability can be obtained in an analogous way.

Experiments for Online Search
Compared with LSH-DOC and LSH-SEP (indexed segment length = |Q|/(k+1) = 3).
Datasets: Apache logs (236,055), ThunderBird logs (350,000).
Measures: all methods can achieve 100% precision, because they all have a validation step that checks every candidate by computing the actual dissimilarity score. The evaluation therefore focuses on recall and time cost. Ground truth is obtained by the brute-force algorithm.

Recall/Search Time
The higher the score, the better the performance. When the query sequence is short, LSH-DOC and LSH-SEP can beat SuffixMatrix, but when the query sequence is long, they perform poorly.

Number of Probed Segment Candidates
The smaller the number, the better the performance.

Using a “Stricter” Hash Function
Use n independent hash functions to construct a “stricter” hash function. SuffixMatrix(Strict) uses more hash functions and makes the search condition “stricter” (a technique from locality sensitive hashing). The collision probability becomes smaller.
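One common way to build such a function is to concatenate the n hash values, so two events collide only when all n values agree; if each function collides independently with probability p, the combined one collides with probability p^n. Whether the slides use exactly this construction is an assumption, and `strict_hash` and the toy hash functions below are hypothetical.

```python
def strict_hash(hash_functions):
    """Combine n hash functions; events collide only if all n hash values agree."""
    def h(event):
        return tuple(f(event) for f in hash_functions)
    return h


# Two toy hash functions: string length and first character.
h = strict_hash([len, lambda s: s[:1]])
print(h("abc") == h("abd"))  # True: same length, same first character
print(h("abc") == h("xbc"))  # False: first characters differ
```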

Time for Building the Index
Indexed segments in LSH-DOC and LSH-SEP overlap, so one event is indexed in multiple overlapping segments.

Summary
K-dissimilar segment search problem for textual event sequences.
Suffix Matrix = LSH + Suffix Arrays.
Random Mask for Suffix Matrix.

End & Questions
Thank you!

Suffix Array
A sequence S = 3200113$
Suffixes in position order:
Suffix    Position
3200113   0
200113    1
00113     2
0113      3
113       4
13        5
3         6
$         7
Suffix array (suffixes sorted by string comparison):
Suffix    Position
$         7
00113     2
0113      3
113       4
13        5
200113    1
3         6
3200113   0
Substring match is done by a binary search on the suffix array. From the suffix array and the sequence S, we can retrieve all suffixes without additional space cost.
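The slide's example can be reproduced with a short sketch. Sorting slices is a simple, non-optimized way to build the array, and materializing truncated suffixes at query time is for readability only; `build_suffix_array` and `find_substring` are illustrative names.

```python
from bisect import bisect_left


def build_suffix_array(s):
    """Start positions of all suffixes of s, sorted by string comparison."""
    return sorted(range(len(s)), key=lambda p: s[p:])


def find_substring(s, sa, pattern):
    """Start positions of pattern in s, located by binary search on the suffix array."""
    # Suffixes truncated to len(pattern) remain sorted, so bisect applies.
    prefixes = [s[p:p + len(pattern)] for p in sa]
    i = bisect_left(prefixes, pattern)
    hits = []
    while i < len(sa) and prefixes[i] == pattern:
        hits.append(sa[i])
        i += 1
    return sorted(hits)


S = "3200113$"
sa = build_suffix_array(S)
print(sa)                           # [7, 2, 3, 4, 5, 1, 6, 0]
print(find_substring(S, sa, "01"))  # [3]
print(find_substring(S, sa, "1"))   # [4, 5]
```

Only the integer positions are stored; each suffix is recovered as `S[p:]`, which is why no additional space is needed for the suffixes themselves.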

Locality Sensitive Hashing (LSH)
An LSH family is a family of hash functions whose collision behavior is tied to the similarity score:
If sim(p,q) > c, then h(p) = h(q) with probability at least P1.
If sim(p,q) < c/k, then h(p) = h(q) with probability at most P2.
P1 > P2.
Hash functions of this kind give an approximate representation of similarities.
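MinHash, listed earlier among the related solutions, is one such family for Jaccard similarity: the probability that two sets receive the same MinHash value equals their Jaccard similarity. The sketch below estimates that collision rate empirically; using salted MD5 as the random hash is an assumption made for determinism, and `minhash` and `collision_rate` are illustrative names.

```python
import hashlib


def minhash(items, seed):
    """MinHash of a set under the hash function selected by seed (salted MD5)."""
    return min(hashlib.md5(f"{seed}:{x}".encode()).hexdigest() for x in items)


def collision_rate(a, b, trials=200):
    """Fraction of hash functions under which a and b get the same MinHash value."""
    return sum(minhash(a, s) == minhash(b, s) for s in range(trials)) / trials


a = set("abcdef")
b = set("abcdxy")  # Jaccard(a, b) = 4 / 8 = 0.5
c = set("pq")      # Jaccard(a, c) = 0

print(collision_rate(a, a))  # 1.0
print(collision_rate(a, b))  # close to 0.5
print(collision_rate(a, c))  # 0.0
```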

Alignment Problem: Gap in Similar Events
Word methods (FASTA, BLAST) split the query sequence into a series of short, nonoverlapping subsequences (“words”) that are then matched to candidate database sequences. Our problem is the sub-problem with gap = 0.