1
Signature Files
Basic concepts: superimposed coding
Vertical partitioning
Horizontal partitioning
2
Signature Files
A signature file is a word-oriented index structure based on hashing.
The main idea is to produce bit patterns (descriptors, or signatures) and store them in a separate file that acts as a filter, eliminating data items that do not match the information request.
Signature-based methods are faster than a full-text scan.
3
Signature Files
Signature files usually use superimposed coding to generate the signatures.
Documents are stored sequentially in the "text file".
"Signatures" are stored in the "signature file".
Block signature: every word in a block is hashed to a bit pattern of fixed length F, with m bits set to 1 and all others set to 0.
The positions of the m bits are determined by a hash function, based on the Karp-Rabin algorithm.
4
Karp-Rabin Algorithm
"A string matching algorithm that compares string's hash values, rather than the strings themselves. For efficiency, the hash value of the next position in the text is easily computed from the hash value of the current position."
A hash function reduces the comparison of m characters to the comparison of a single integer.
Karp and Rabin proposed the hash function $h(x) = x \bmod q$, where q is a large prime number.
For a word w of length m:
$\mathrm{hash}(w[0..m-1]) = (w[0]\cdot 2^{m-1} + w[1]\cdot 2^{m-2} + \cdots + w[m-1]\cdot 2^{0}) \bmod q$
$\mathrm{rehash}(a, b, h) = ((h - a\cdot 2^{m-1})\cdot 2 + b) \bmod q$
Here a is the character leaving the window, b is the character entering it, and h is the current hash value.
5
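Example: string = abcde, m = 4, ascii(a) = 97, ascii(e) = 101
$\mathrm{hash}(\text{"abcd"}) = (97\cdot 2^{3} + 98\cdot 2^{2} + 99\cdot 2^{1} + 100\cdot 2^{0}) \bmod q = 1466 \bmod q$
Rolling update: $\mathrm{hash}(\text{"bcde"}) = [(1466 - 97\cdot 2^{3})\cdot 2 + 101] \bmod q$
Direct check: $\mathrm{hash}(\text{"bcde"}) = (98\cdot 2^{3} + 99\cdot 2^{2} + 100\cdot 2^{1} + 101\cdot 2^{0}) \bmod q = 1481 \bmod q$
Formulas used:
$\mathrm{hash}(w[0..m-1]) = (w[0]\cdot 2^{m-1} + w[1]\cdot 2^{m-2} + \cdots + w[m-1]\cdot 2^{0}) \bmod q$
$\mathrm{rehash}(a, b, h) = ((h - a\cdot 2^{m-1})\cdot 2 + b) \bmod q$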
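The hash and rehash formulas can be sketched in Python as follows. This is a minimal illustration, not the deck's own code: the prime `Q = 1_000_003` and the function names `kr_hash`, `kr_rehash` and `kr_find` are assumptions chosen for the example.

```python
Q = 1_000_003  # assumed "large prime" q; the slides do not fix a value


def kr_hash(w, q=Q):
    """hash(w[0..m-1]) = (w[0]*2^(m-1) + ... + w[m-1]*2^0) mod q."""
    m = len(w)
    return sum(ord(c) * 2 ** (m - 1 - i) for i, c in enumerate(w)) % q


def kr_rehash(a, b, h, m, q=Q):
    """rehash(a, b, h): drop outgoing char a, shift, add incoming char b."""
    return ((h - ord(a) * 2 ** (m - 1)) * 2 + ord(b)) % q


def kr_find(pattern, text, q=Q):
    """Return the first index of pattern in text, or -1 if absent."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return -1
    hp = kr_hash(pattern, q)
    h = kr_hash(text[:m], q)
    for i in range(len(text) - m + 1):
        # Compare strings only when the hash values agree.
        if h == hp and text[i:i + m] == pattern:
            return i
        if i + m < len(text):
            h = kr_rehash(text[i], text[i + m], h, m, q)
    return -1


print(kr_hash("abcd"))                 # 1466
print(kr_rehash("a", "e", 1466, 4))    # 1481
```

Because q here is larger than both sums, the printed values equal the unreduced numbers from the worked example.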
6
Signature Files: Structure
Using superimposed coding to generate signatures:
Each text is divided into logical blocks.
Each block contains n distinct words.
Each word is converted into a "word signature".
A word signature is a B-bit pattern with m bits set to 1.
Each word is split into a sequence of overlapping letter groups, e.g. free --> fr, fre, ree, ee
The word signatures are OR-ed together to form the block signature.
The block signatures are concatenated to form the document signature.
7
Word Signature
A word signature is a bit vector of F bits in which m bits are set to 1.
If 12 bits are used with 4 bits set to 1, the positions of the 1 bits are found with the following steps:
1. Break the word into overlapping triplets (3 letters each).
Example: free --> *fr, fre, ree, ee* (* stands for a blank space)
Text string: free text, F = 12, m = 4
Word    Word signature (hash)
free    001 000 110 010
text    000 010 101 001
8
Word Signature
Convert the triplets above to their ASCII codes (written in hexadecimal):
*fr  206672
fre  667265
ree  726565
ee*  656520
Reduce each value modulo 12 and keep the remainder. The positions listed are:
206672 --> 8
726565 --> 7
667265 --> B
656520 --> 0, rehashed to 3 with $\mathrm{rehash}(a, b, h) = ((h - a\cdot 2^{m-1})\cdot 2 + b) \bmod q$
The positions of the 1 bits are then determined by the steps above.
Position  1 2 3 4 5 6 7 8 9 10 11 12
free      0 0 1 0 0 0 1 1 0 0  1  0
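The triplet and encoding steps can be sketched as below. The function names are hypothetical, and the bit positions 8, 7, 11 (hex B) and 3 are taken as listed on the slide rather than recomputed, since the slide's exact hashing step is not fully specified.

```python
def overlapping_triplets(word):
    """Pad the word with one blank on each side and slide a 3-letter window."""
    padded = f" {word} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def hex_code(triplet):
    """Concatenate the two-digit hexadecimal ASCII code of each character."""
    return "".join(f"{ord(c):02x}" for c in triplet)


def signature_from_positions(positions, F=12):
    """Set the given 1-based bit positions in an F-bit signature."""
    bits = ["0"] * F
    for p in positions:
        bits[p - 1] = "1"
    s = "".join(bits)
    return " ".join(s[i:i + 3] for i in range(0, F, 3))


triplets = overlapping_triplets("free")
print(triplets)                          # [' fr', 'fre', 'ree', 'ee ']
print([hex_code(t) for t in triplets])   # ['206672', '667265', '726565', '656520']
# Positions as listed on the slide (assumed, not derived here).
print(signature_from_positions([8, 7, 11, 3]))  # 001 000 110 010
```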
10
Block Signature
The document is divided into blocks of a fixed size (a block might contain 2 adjacent words, for example).
Superimposed coding: the block signature is the bit-wise OR of all word signatures in the block.
Text string: free text information retrieval
Word          Word signature
free          001 000 110 010
text          000 010 101 001
information   000 100 011 110
retrieval     101 000 100 001
Block signatures:
Block 1 (free text)               001 010 111 011
Block 2 (information retrieval)   101 100 111 111
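Superimposing the word signatures can be sketched with plain integer OR. The helper names `to_int`, `to_str` and `block_signature` are illustrative assumptions; the bit strings are the ones from the table.

```python
def to_int(sig):
    """Parse a spaced bit string such as '001 000 110 010' into an integer."""
    return int(sig.replace(" ", ""), 2)


def to_str(value, F=12):
    """Format an integer as an F-bit string in groups of three bits."""
    bits = format(value, f"0{F}b")
    return " ".join(bits[i:i + 3] for i in range(0, F, 3))


def block_signature(word_sigs):
    """Bit-wise OR of all word signatures in one block."""
    result = 0
    for sig in word_sigs:
        result |= to_int(sig)
    return result


WORD_SIGS = {
    "free": "001 000 110 010",
    "text": "000 010 101 001",
    "information": "000 100 011 110",
    "retrieval": "101 000 100 001",
}

block1 = block_signature([WORD_SIGS["free"], WORD_SIGS["text"]])
block2 = block_signature([WORD_SIGS["information"], WORD_SIGS["retrieval"]])
print(to_str(block1))  # 001 010 111 011
print(to_str(block2))  # 101 100 111 111
```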
11
Retrieval Using Signature File
Given a query word q, generate the query signature $S_q$ (F bits, m of them set to 1).
A block with signature $S_b$ matches when $S_q \wedge S_b = S_q$.
Example: query signature 000 010 101 001
block 1   001 010 111 011
block 2   101 100 111 111
Block 1: 000 010 101 001 AND 001 010 111 011 = 000 010 101 001, equal to the query signature: document and query match.
Block 2: 000 010 101 001 AND 101 100 111 111 = 000 000 101 001, not equal to the query signature: document and query do not match.
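The matching rule can be checked with a few lines of Python. This is a sketch with assumed helper names; it reuses the query and block signatures from the example.

```python
def to_int(sig):
    """Parse a spaced bit string into an integer."""
    return int(sig.replace(" ", ""), 2)


def to_str(value, F=12):
    """Format an integer as an F-bit string in groups of three bits."""
    bits = format(value, f"0{F}b")
    return " ".join(bits[i:i + 3] for i in range(0, F, 3))


def matches(query_sig, block_sig):
    """A block qualifies when every 1 bit of the query is also set in the block."""
    q, b = to_int(query_sig), to_int(block_sig)
    return (q & b) == q


QUERY = "000 010 101 001"
BLOCK1 = "001 010 111 011"
BLOCK2 = "101 100 111 111"

print(matches(QUERY, BLOCK1))                   # True
print(matches(QUERY, BLOCK2))                   # False
print(to_str(to_int(QUERY) & to_int(BLOCK2)))   # 000 000 101 001
```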
12
Copyright Bruce Croft and/or James Allan
13
Signature File
14
Text: This is a text. A text has many words. Words are made from letters.
Text signature:
Block 1   000101
Block 2   110101
Block 3   100100
Block 4   101101
Word hashes:
h(text) = 000101
h(many) = 110000
h(words) = 100100
h(made) = 001100
h(letters) = 100001
15
False Drop Probability
BUT: a match between the query and a block does not guarantee a true match.
Example: the query is free
query signature   001 000 110 010
block 1           001 010 111 011
block 2           101 100 111 111
Both blocks match the query signature, yet the word free does not occur in block 2.
Block 2 is called a false drop.
What causes false drops?
– Main reason: superimposition merges the bits of several word signatures.
– Minor reason: hash collisions (2 different words having the same word signature), but if F is large enough the probability of a hash collision is very low.
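A false drop is found by checking each signature match against the actual text. The sketch below assumes the two example blocks hold the words "free text" and "information retrieval"; the names `candidates` and `split_hits` are hypothetical.

```python
def to_int(sig):
    """Parse a spaced bit string into an integer."""
    return int(sig.replace(" ", ""), 2)


# (block text, block signature) pairs from the running example.
BLOCKS = [
    ("free text", "001 010 111 011"),
    ("information retrieval", "101 100 111 111"),
]


def candidates(query_sig, blocks):
    """Indices of blocks whose signature contains every 1 bit of the query."""
    q = to_int(query_sig)
    return [i for i, (_, sig) in enumerate(blocks) if (q & to_int(sig)) == q]


def split_hits(word, query_sig, blocks):
    """Verify candidates against the text: (true matches, false drops)."""
    true_hits, false_drops = [], []
    for i in candidates(query_sig, blocks):
        if word in blocks[i][0].split():
            true_hits.append(i)
        else:
            false_drops.append(i)
    return true_hits, false_drops


FREE_SIG = "001 000 110 010"
print(candidates(FREE_SIG, BLOCKS))          # [0, 1]
print(split_hits("free", FREE_SIG, BLOCKS))  # ([0], [1])
```

The verification step reads the block text itself, which is why false drops are costly to eliminate.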
16
Advantages
The signature file is small and easy to control.
Maintenance cost (updates and deletions) is low because the organization is simple.
Signatures are easy to generate, and inserting new data is cheap.
Superimposed coding works well for multi-attribute retrieval.
17
Disadvantages
Searching is slow compared with inverted files.
False drops are expensive to eliminate, because every matching signature must be verified against the actual text pattern.
Frequency or weighting information is hard to represent in a signature.
Query features such as disjunctive conditions, synonyms, wildcards and proximity operations are hard to handle.
18
Sequential Signature File (SSF)
The most straightforward way is to store the block signatures sequentially.
[Figure: a signature file of N blocks, each F bits wide (e.g. 001 010 111 011, 101 100 111 101), linked through a pointer file to the document file.]
19
Sequential Signature File (SSF)
Although the method above can be implemented directly, searching becomes slow for large databases.
Methods that can be used to overcome this problem:
Compression, e.g. run-length coding
Vertical partitioning
Horizontal partitioning
20
Compression using run-length encoding
Word              Signature
data              0000 0000 0000 0010 0000
base              0000 0001 0000 0000 0000
management        0000 1000 0000 0000 0000
system            0000 0000 0000 0000 1000
block signature   0000 1001 0000 0010 1000
The intervals L1, L2, L3, L4, L5 marked in the block signature are stored as [L1] [L2] [L3] [L4] [L5], where [x] is the encoded value of x.
Search: decode the encoded lengths of all the preceding intervals.
Example: search "data"
(1) data ==> 0000 0000 0000 0010 0000
(2) decode [L1] = 0000, decode [L2] = 00, decode [L3] = 000000
Disadvantage: search becomes slow.
21
Run-length encoding
A repeated occurrence of the same character is called a run.
The number of repetitions is called the length of the run.
A run of any length is represented by three characters.
Example: eeeeeeetnnnnnnnn --> @e7t@n8
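A minimal sketch of this three-character scheme follows. Several details are assumptions not stated on the slide: `@` is treated as a reserved marker that never occurs in the input, runs shorter than `min_run=4` are copied unchanged, and runs longer than 9 are split so the count stays a single digit.

```python
from itertools import groupby

MARKER = "@"  # assumed reserved; must not appear in the input text


def rle_encode(text, min_run=4):
    """Replace each long run with marker + character + single-digit count."""
    out = []
    for ch, group in groupby(text):
        n = len(list(group))
        while n > 0:
            chunk = min(n, 9)  # keep the count to one digit
            if chunk >= min_run:
                out.append(f"{MARKER}{ch}{chunk}")
            else:
                out.append(ch * chunk)
            n -= chunk
    return "".join(out)


def rle_decode(encoded):
    """Expand every marker triple back into its run."""
    out = []
    i = 0
    while i < len(encoded):
        if encoded[i] == MARKER:
            out.append(encoded[i + 1] * int(encoded[i + 2]))
            i += 3
        else:
            out.append(encoded[i])
            i += 1
    return "".join(out)


print(rle_encode("eeeeeeetnnnnnnnn"))  # @e7t@n8
print(rle_decode("@e7t@n8"))           # eeeeeeetnnnnnnnn
```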
22
Vertical Partitioning
Goal: avoid bringing useless portions of the document signature into main memory.
Bit-Sliced Signature Files (BSSF)
Frame-Sliced Signature Files (FSSF)
23
Bit-Sliced Signature Files (BSSF)
Store the signatures slice by slice.
Each slice i has length N and corresponds to the i-th bit of the signatures.
[Figure: the signature file of N blocks by F bits is stored as F bit-slice files of N bits each, next to the pointer file and document file; query signature 001 000 110 010.]
24
Retrieval
If the i-th bit in the query signature is set to 1, retrieve the i-th bit slice.
If n bits are set to 1, only n slices need to be retrieved.
25
Search:
(1) Retrieve m bit-files.
E.g., the word signature of free is 001 000 110 010.
If a document contains "free", its 3rd, 7th, 8th and 11th bits are set, i.e., only the 3rd, 7th, 8th and 11th files are examined.
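Bit-sliced retrieval can be sketched by transposing the signature matrix and AND-ing only the slices selected by the query. The function names are hypothetical, and the two block signatures 001 010 111 011 and 101 100 111 111 from the earlier example are used as stored data.

```python
def parse(sig):
    """Turn a spaced bit string into a list of 0/1 integers."""
    return [int(b) for b in sig.replace(" ", "")]


def build_slices(block_sigs):
    """Transpose N signatures of F bits into F slices of N bits."""
    rows = [parse(s) for s in block_sigs]
    return [list(column) for column in zip(*rows)]


def bssf_search(slices, query_sig):
    """Return (1-based slices examined, indices of candidate blocks)."""
    n_blocks = len(slices[0])
    alive = [1] * n_blocks
    examined = []
    for i, bit in enumerate(parse(query_sig)):
        if bit:  # only slices for 1 bits of the query are read
            examined.append(i + 1)
            alive = [a & s for a, s in zip(alive, slices[i])]
    return examined, [j for j, a in enumerate(alive) if a]


SLICES = build_slices(["001 010 111 011", "101 100 111 111"])
print(bssf_search(SLICES, "001 000 110 010"))  # ([3, 7, 8, 11], [0, 1])
print(bssf_search(SLICES, "000 010 101 001"))  # ([5, 7, 9, 12], [0])
```

Both blocks survive the query for free, so the second one still has to be ruled out by checking the text.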
26
Frame-Sliced Signature File (FSSF)
Divide a signature into a number of frames; each word sets bits in one single frame only.
Signatures are stored frame-wise, i.e., 1 frame = 1 file.
Example text: free text
No. of frames: 2
Signature length: 12
No. of 1's: 3
[Figure: frame-sliced signatures for "free text": 000000 010110 110010]
27
Frame-Sliced Signature File (FSSF)
[Figure: k frame files of N blocks each, with a pointer file to documents D1, D2, D3, ..., Dn]
frame 1   110110 010001 101010 000111 ...
frame 2   010110 011001 111010 100101 ...
frame k   110110 010001 101010 110100 ...
Ex) Query = {free, text}
H1 determines the frame; H2 determines the bits within the frame.
H1(free) = 2, H2(free) = 100001
H1(text) = 1, H2(text) = 000011
28
FSSF: Performance
Two hash functions are required: one to select the frame and one to set the bits within the frame.
One keyword searches only one frame.
If each frame is stored as an individual sequential file, a query with one keyword searches one sequential file, which gives a faster search.
e.g., 'free' will search the second frame; 'text' will search the first frame.
Update cost is proportional to the number of frames in a signature (not the number of bits).
e.g., inserting a signature means generating the signature (main memory accesses only) and appending it to the frames.
If each frame is a separate file, an insertion invokes a number of disk accesses equal to the number of frames.
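A frame-sliced lookup can be sketched as follows. The frame contents and the H1/H2 values are those of the query example {free, text}; treating the four listed entries of each frame as four consecutive documents (indices 0 to 3) is an assumption, as are the function names.

```python
# Frame files: frame number -> one 6-bit entry per document (assumed order).
FRAMES = {
    1: ["110110", "010001", "101010", "000111"],
    2: ["010110", "011001", "111010", "100101"],
}

# (H1, H2) per query word: H1 selects the frame, H2 gives the bits in it.
HASHED = {
    "free": (2, "100001"),
    "text": (1, "000011"),
}


def search_word(frames, frame_no, pattern):
    """Scan a single frame file for entries covering the pattern bits."""
    p = int(pattern, 2)
    return [i for i, entry in enumerate(frames[frame_no])
            if (int(entry, 2) & p) == p]


def search_all(frames, hashed_words):
    """Conjunctive query: intersect the hits of each word's own frame."""
    result = None
    for frame_no, pattern in hashed_words:
        hits = set(search_word(frames, frame_no, pattern))
        result = hits if result is None else result & hits
    return sorted(result or [])


print(search_word(FRAMES, *HASHED["free"]))                 # [3]
print(search_word(FRAMES, *HASHED["text"]))                 # [3]
print(search_all(FRAMES, [HASHED["free"], HASHED["text"]])) # [3]
```

Each keyword touches exactly one frame file, which is the source of the speed-up over scanning whole signatures.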
29
Horizontal Partitioning
The idea is to group signatures into sets so that only a few sets are searched.
Two-level superimposed coding
Multi-level superimposed coding
30
Two-level signature files
Two levels of signature are used. Here records refer to documents, and blocks refer to groups of documents:
First level: "record signature", stored sequentially as in SSF.
Second level: "block signature", formed by superimposing the signatures of all words in the block and stored in bit-sliced form.
Each level has its own hashing function.
Searching: scan the block signatures first, then work only on the blocks that qualify.
31
Two Level Superimposed Coding
Two levels of signature:
The document signature is generated from the document.
The block signature is generated, using another hash function, from the documents in the block (without document boundaries).
[Figure: block signatures above document signatures above documents]
32
Two Level Superimposed Coding (2)
Searching:
Scan the block signatures first.
Scan the document signature files for the promising blocks only.
Comments:
Searching is faster.
The second-level signatures have to be longer.
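The two-level search can be sketched as below. Everything specific here is an assumption for illustration: the SHA-256 based `word_signature`, the salts "doc" and "block" standing in for the two hash functions, the lengths `F_DOC = 16` and `F_BLOCK = 32`, and m = 3. Block signatures are kept as plain integers rather than in bit-sliced form to keep the sketch short.

```python
import hashlib

F_DOC, F_BLOCK, M = 16, 32, 3  # assumed sizes; block signatures are longer


def word_signature(word, F, m, salt):
    """Set m distinct pseudo-random bit positions out of F for this word."""
    positions, i = set(), 0
    while len(positions) < m:
        digest = hashlib.sha256(f"{salt}:{i}:{word}".encode()).digest()
        positions.add(int.from_bytes(digest[:4], "big") % F)
        i += 1
    return sum(1 << p for p in positions)


def superimpose(words, F, m, salt):
    """Bit-wise OR of the signatures of all given words."""
    sig = 0
    for w in words:
        sig |= word_signature(w, F, m, salt)
    return sig


def build(docs, per_block):
    """Return (document signatures, block signatures)."""
    doc_sigs = [superimpose(d.split(), F_DOC, M, "doc") for d in docs]
    block_sigs = []
    for start in range(0, len(docs), per_block):
        words = [w for d in docs[start:start + per_block] for w in d.split()]
        block_sigs.append(superimpose(words, F_BLOCK, M, "block"))
    return doc_sigs, block_sigs


def search(word, doc_sigs, block_sigs, per_block):
    """Scan block signatures, then document signatures of qualifying blocks."""
    q_block = word_signature(word, F_BLOCK, M, "block")
    q_doc = word_signature(word, F_DOC, M, "doc")
    hits = []
    for b, bsig in enumerate(block_sigs):
        if (bsig & q_block) != q_block:
            continue  # whole block skipped
        start = b * per_block
        for d in range(start, min(start + per_block, len(doc_sigs))):
            if (doc_sigs[d] & q_doc) == q_doc:
                hits.append(d)
    return hits


DOCS = ["free text", "information retrieval", "signature file", "hash function"]
DOC_SIGS, BLOCK_SIGS = build(DOCS, per_block=2)
print(search("free", DOC_SIGS, BLOCK_SIGS, per_block=2))
```

The result always contains every document that holds the word; it may also contain false drops, which still need verification against the text.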
33
Multi-Level Superimposed Coding
A generalization of the two-level superimposed coding method (Lee et al., 1995).
Signatures at each level are generated directly from the text using a different hash function.
[Figure: document signatures above documents]