
1 Compression
A one-page Word document is about 2 to 4 kB; a raster image of one page at 600 dpi is about 35 MB.
Compression ratio: CR = n1 / n2, where n1 is the number of bits in the original representation and n2 is the number of bits in the compressed representation.
Compression techniques take advantage of:
- Sparse coverage
- Repetitive scan lines
- Large smooth gray areas
- ASCII code, which always uses 8 bits per character
- Long words that are used frequently
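As a quick check of these numbers, here is a minimal Python sketch comparing the two representations (the 3 kB figure is picked from the slide's 2-4 kB range, and this compares a symbolic to a raster encoding of the page rather than running an actual compressor):

```python
# Rough sizes quoted on the slide (assuming 1 kB = 1024 bytes, 1 MB = 1024 kB).
word_doc_bits = 3 * 1024 * 8            # ~3 kB Word document (symbolic text)
raster_bits = 35 * 1024 * 1024 * 8      # ~35 MB raster image of the page at 600 dpi

# Compression ratio CR = n1 / n2: bits in the original vs. the compact representation.
cr = raster_bits / word_doc_bits
print(f"CR is roughly {cr:,.0f} : 1")   # about 12,000 : 1
```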

2 Entropy
Entropy is a quantitative measure of the amount of information in a string.
For a binary source that emits a 1 with probability p, H(1) = -p log2 p, H(0) = -(1-p) log2 (1-p), and the total entropy is H(1) + H(0).
[Figure: H(0), H(1), and H(1)+H(0) plotted against p, with both axes running from 0.0 to 1.0]
For N clusters, where l_i is the length of the i-th cluster.
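A minimal sketch of how the entropy of a string is estimated from its symbol frequencies (the example strings below are made up for illustration):

```python
import math
from collections import Counter

def entropy(s: str) -> float:
    """Shannon entropy in bits per symbol, estimated from symbol frequencies."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A mostly-white scan line carries little information per pixel.
print(entropy("0" * 95 + "1" * 5))   # ~0.286 bits/symbol
print(entropy("01" * 50))            # 1.0 bit/symbol
```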

3 Binary Image Compression Techniques
- Packing: 8 pixels per byte.
- Run-length encoding: assume 100 dpi, 850 bits per line; encode only the white runs, since they are long. The top part of a page could be 0(200)111110(3)111110(3) ...
- Huffman coding: use short codes for frequent messages.

Message   Probability   Huffman code
A         0.30          00
B         0.25          01
C         0.20          11
D         0.10          101
E         0.15          100

Source reduction: combine the two least probable messages at each stage (0.10 + 0.15 = 0.25; 0.20 + 0.25 = 0.45; 0.25 + 0.30 = 0.55; 0.45 + 0.55 = 1.00), then assign codewords by walking back through the reductions. The same table is used to encode and decode.
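A minimal sketch of Huffman code construction for the probabilities above; heap tie-breaking may assign different codewords from the slide's table, but the code lengths (and the 2.25 bits/symbol average) come out the same:

```python
import heapq

def huffman_code(probs):
    """Build a prefix-free Huffman code from a {symbol: probability} dict."""
    # Heap entries: (probability, tie-breaker, {symbol: partial code}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # least probable group
        p2, _, c2 = heapq.heappop(heap)   # next least probable group
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"A": 0.30, "B": 0.25, "C": 0.20, "D": 0.10, "E": 0.15}
code = huffman_code(probs)
avg_len = sum(probs[s] * len(code[s]) for s in probs)
print(code, f"average length = {avg_len:.2f} bits/symbol")   # ~2.25
```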

4 Huffman Encoding
Run-length representation of an example bitmap:
0 (2,7) (13,2) 0 (2,2) (7,2) (13,2) 0 (2,7) (13,2) 0 (2,2) (7,2) (13,2) 0 0
Cost comparison:
- Raw bitmap: 160 bits
- The 50 numbers, each in the range 0-15, at 4 bits per number: 200 bits
- At 2 bits per symbol: 100 bits
- Huffman coding: 1.84 bits/symbol x 50 = 92 bits

5 Predictive Coding
Most pixels in adjacent scan lines s1 and s2 are the same; s2' is the predicted version of s2.
Two-dimensional prediction: the current pixel x is predicted from a context of neighboring, already-coded pixels x4 x3 x2 x1 x0.

Context x4 x3 x2 x1 x0   p(0)   p(1)   Predicted xp
00000                    0.99   0.01   0
00001                    0.40   0.60   1
00010                    0.60   0.40   0
00011                    0.25   0.75   1

Probabilities are gathered from document collections. There is a tradeoff between context size and table size; a context of 12 pixels is common, which uses a 4096-entry table.
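A minimal sketch of context-based prediction, assuming a made-up 5-pixel context (three pixels from the previous line and two already-coded pixels to the left) and a hypothetical majority-vote probability table; real systems use probabilities gathered from document collections, as the slide notes:

```python
from typing import Dict, List

def predict_and_residual(prev_line: List[int], cur_line: List[int],
                         table: Dict[str, int]) -> List[int]:
    """Predict each pixel of cur_line from a 5-pixel context and return the
    residual (actual XOR predicted), which is mostly zeros and compresses well."""
    n = len(cur_line)
    residual = []
    for i in range(n):
        above = "".join(str(prev_line[j]) if 0 <= j < n else "0" for j in (i - 1, i, i + 1))
        left = "".join(str(cur_line[j]) if j >= 0 else "0" for j in (i - 2, i - 1))
        predicted = table.get(above + left, 0)    # default: predict white (0)
        residual.append(cur_line[i] ^ predicted)
    return residual

# Hypothetical 32-entry table: predict the majority value of the context
# (a stand-in for probabilities measured on real document collections).
table = {f"{c:05b}": 1 if bin(c).count("1") >= 3 else 0 for c in range(32)}

prev_line = [0] * 6 + [1] * 6 + [0] * 4
cur_line = [0] * 6 + [1] * 7 + [0] * 3
print(predict_and_residual(prev_line, cur_line, table))
```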

6 Group III Fax
- White runs and black runs alternate; all lines begin with a white run (possibly of length zero).
- There are 1728 pixels in a scan line.
- Makeup codes encode a run length that is a multiple of 64 pixels; terminating codes encode the remainder (0 to 63).
- An EOL code marks the end of each line.
- Codewords come from the CCITT lookup tables.
Example: a white run of 500 pixels is encoded as 500 = 7 x 64 + 52. The makeup code for 7 x 64 = 448 is 0110 0100 and the terminating code for 52 is 0101 0101, so the complete code is 0110 0100 0101 0101.
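A minimal sketch of the makeup-plus-terminating split; only the two codewords from the slide's example are included, and the function name and structure are illustrative rather than part of any real CCITT implementation:

```python
# Split a white run into a makeup part (multiple of 64) and a terminating part
# (0-63), then look both up. The real CCITT tables define codewords for every
# value, separately for white and black runs; this toy table has only two entries.

WHITE_MAKEUP = {448: "01100100"}       # makeup code for 7 x 64 (from the slide)
WHITE_TERMINATING = {52: "01010101"}   # terminating code for 52 (from the slide)

def encode_white_run(length: int) -> str:
    makeup, remainder = (length // 64) * 64, length % 64
    bits = WHITE_MAKEUP[makeup] if makeup else ""   # KeyError = value not in this toy table
    return bits + WHITE_TERMINATING[remainder]

print(encode_white_run(500))   # 0110010001010101
```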

7

8 Group IV READ
Reference line: 111100001110000
Coding line:    110000111111000
a0 is the reference changing pixel on the coding line; a1 is the next changing pixel after a0; a2 is the next changing pixel after a1. b1 is the first changing pixel on the reference line after a0 that is of opposite color to a0; b2 is the next changing pixel after b1. To start, a0 is located at an imaginary white pixel immediately to the left of the coding line. Follow the READ algorithm chart.
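A minimal sketch of the READ mode decision (pass, vertical, or horizontal), assuming the usual ITU-T T.4/T.6 rules for a0, a1, b1, and b2; the codeword tables themselves are omitted:

```python
def changing_pixels(line: str):
    """Positions whose color differs from the pixel to their left
    (an imaginary white pixel sits before position 0)."""
    prev, changes = "0", []
    for i, px in enumerate(line):
        if px != prev:
            changes.append(i)
            prev = px
    return changes

def read_mode(reference: str, coding: str, a0: int) -> str:
    n = len(coding)
    a0_color = "0" if a0 < 0 else coding[a0]
    a1 = next((c for c in changing_pixels(coding) if c > a0), n)
    ref_changes = changing_pixels(reference)
    b1 = next((c for c in ref_changes if c > a0 and reference[c] != a0_color), n)
    b2 = next((c for c in ref_changes if c > b1), n)
    if b2 < a1:
        return "pass"
    if abs(a1 - b1) <= 3:
        return "vertical"
    return "horizontal"

# Lines from the slide; a0 starts at the imaginary white pixel left of the coding line.
print(read_mode("111100001110000", "110000111111000", a0=-1))   # vertical
```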

9 Group IV READ

10 Grayscale Compression - JPEG

11

12 Information Retrieval (Typed text documents)
The IR goal is to represent a collection of documents, where a single document is the smallest unit of information: typify document content and present information upon request.
[Diagram: requests are matched against documents through a similarity measure]
1. OCR translates images of text into computer-readable form, and IR extracts the text upon request.
2. Inverted index: transpose the document-term relationship into a term-document relationship (a sketch follows this list).
3. Remove stopwords: the, and, to, a, in, that, through, but, etc.
4. Word stemming: remove prefixes and suffixes and normalize.
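A minimal sketch of steps 2-4 (inverted index, stopword removal, and a crude stand-in for stemming); the three document strings are invented snippets, not the slide's actual documents:

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "and", "to", "a", "in", "that", "through", "but", "of", "is"}

def crude_stem(word: str) -> str:
    """Very rough stand-in for a real stemmer (e.g. Porter): strip a few suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_inverted_index(docs: dict) -> dict:
    """Map each term to {doc_id: term frequency} (the term-document relationship)."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in STOPWORDS:
                index[crude_stem(token)][doc_id] += 1
    return {t: dict(d) for t, d in index.items()}

docs = {
    1: "OCR translates printed text to computer readable form",
    2: "an information retrieval system extracts readable text",
    3: "the IR system retrieves text sequentially upon request",
}
index = build_inverted_index(docs)
print(index["text"])   # {1: 1, 2: 1, 3: 1}
```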

13 Example: inverted index and Boolean queries

Term           Doc:Frequency
character      1:1  3:1
computer       1:1  2:1  3:1
devices        1:2  3:1
extract        2:1  3:1
form           1:1  3:1
information    2:2  3:2
IR             2:1  3:1
OCR            1:1  3:1
optical        1:1  3:1
printed        1:1  3:1
readable       1:1  2:1  3:1
recognition    1:1  3:1
request        3:1
retrieval      2:1  3:2
sequentially   3:1
system         2:2  3:2
text           1:1  2:1  3:2
translate      1:1  3:1

Query 1: recognition or retrieval -> Response: 1 2 3
Query 2: sequentially and readable -> Response: 3
Query 3: not translate -> Response: 2
Query: character and recognition or retrieval
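A minimal sketch of Boolean query evaluation over posting lists taken from the table above (only the terms needed for the queries are included; for the last query, which the slide leaves unanswered, "and" is assumed to bind tighter than "or"):

```python
# Posting lists as sets of document IDs (frequencies are not needed for and/or/not).
INDEX = {
    "character": {1, 3}, "readable": {1, 2, 3}, "recognition": {1, 3},
    "retrieval": {2, 3}, "sequentially": {3}, "translate": {1, 3},
}
ALL_DOCS = {1, 2, 3}

def docs(term):
    return INDEX.get(term, set())

print(sorted(docs("recognition") | docs("retrieval")))    # [1, 2, 3]
print(sorted(docs("sequentially") & docs("readable")))    # [3]
print(sorted(ALL_DOCS - docs("translate")))               # [2]
# "character and recognition or retrieval", with and binding tighter than or:
print(sorted((docs("character") & docs("recognition")) | docs("retrieval")))  # [1, 2, 3]
```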

14 Vector Space Model
Each document is denoted by a vector of concepts (index terms); if a term is present in the document, a 1 is placed in the corresponding position of the vector.
Vector of document 1 from the table: (1 1 1 0 1 0 0 1 1 1 1 1 0 0 0 1 1)
Weighting: favor terms that occur with high frequency in only a few documents. With
N = total number of documents
df_i = number of documents containing term i
t_ij = frequency of term i in document j
the weights below correspond to w_ij = t_ij * log10(N / df_i):

Term i      Document j   df_i   t_ij   w_ij
character   1            2      1      0.17609
computer    1            3      1      0.00000
retrieval   3            2      2      0.35218

Document similarity is measured between D_j = (w_1j, w_2j, ..., w_mj) and a query Q_r = (q_1r, q_2r, ..., q_mr).
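A minimal sketch that reproduces the table's weights with w_ij = t_ij * log10(N / df_i) and shows one common document-query similarity measure (cosine); the slide's own similarity formula is not in the transcript, so the cosine choice is an assumption:

```python
import math

def weight(t_ij: int, df_i: int, N: int) -> float:
    """w_ij = t_ij * log10(N / df_i) -- matches the table's values."""
    return t_ij * math.log10(N / df_i)

N = 3
print(round(weight(1, 2, N), 5))   # 0.17609  (character in doc 1)
print(round(weight(1, 3, N), 5))   # 0.0      (computer in doc 1)
print(round(weight(2, 2, N), 5))   # 0.35218  (retrieval in doc 3)

def cosine(d, q):
    """One common similarity measure between a document and a query vector."""
    dot = sum(x * y for x, y in zip(d, q))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0

# Similarity between a 3-term document vector and a query vector (illustrative values).
print(cosine([0.176, 0.0, 0.352], [1, 0, 1]))
```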

15 Relevance Feedback
N = number of documents in the collection
R = number of documents relevant to query q
n = number of documents containing term t
r = number of relevant documents containing term t
F = proportion of relevant to non-relevant documents in which the term occurs
F' = the corresponding weight without relevance feedback
k = constant, adjusted with collection size
c = collection size
f_i = number of documents in which term i occurs
t_ij = frequency of term i in document j
maxtf_j = maximum term frequency in document j
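The weighting formulas themselves did not survive in the transcript; one standard weight that uses exactly these quantities is the Robertson-Sparck Jones relevance weight, sketched here purely as an assumption about what the slide may have shown:

```python
import math

def rsj_weight(N: int, R: int, n: int, r: int) -> float:
    """Robertson-Sparck Jones relevance weight: log odds that the term occurs in
    relevant vs. non-relevant documents; the 0.5 terms smooth away zero counts.
    This is an assumed reconstruction, not the slide's own (missing) formula."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

# Example: 1000-document collection, 20 relevant; the term occurs in 100 documents,
# 15 of them relevant, so it is a strong indicator of relevance.
print(round(rsj_weight(N=1000, R=20, n=100, r=15), 2))
```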

16 Precision and Recall
- Coverage: extent to which the system includes relevant documents
- Time lag: average time it takes to produce an answer to a search request
- Presentation: quality of the output
- Effort: energy the user must put forth to obtain the information sought
- Recall: proportion of the relevant material that is retrieved by a query
- Precision: proportion of the retrieved documents that are actually relevant

              Relevant   Not relevant
Retrieved     a          b
Not retrieved c          d

Recall = a / (a + c)
Precision = a / (a + b)
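A minimal sketch computing precision and recall from retrieved and relevant document sets (the document IDs below are made up for illustration):

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Precision = a/(a+b), Recall = a/(a+c), where a = |retrieved & relevant|."""
    a = len(retrieved & relevant)
    precision = a / len(retrieved) if retrieved else 0.0
    recall = a / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result: 4 documents retrieved, 5 documents actually relevant.
retrieved = {1, 2, 3, 4}
relevant = {2, 3, 4, 7, 9}
print(precision_recall(retrieved, relevant))   # (0.75, 0.6)
```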

