The Chinese University of Hong Kong Introduction to PAT-Tree and its variations Kenny Kwok Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin, N.T., Hong Kong SAR

A Novel PAT-Tree Approach to Chinese Document Clustering

Outline
– Definition of PAT Tree
– PAT Tree on Chinese Document
– Modified structure of PAT Tree
– Application examples
– Conclusion

PAT tree
Definition: a PATRICIA tree that stores every semi-infinite string (sistring) of a document.
Two things we have to know:
– PATRICIA TREE
– SISTRING

PATRICIA TREE
A particular type of “trie”.
Example: a trie and a PATRICIA tree holding ‘010’, ‘011’, and ‘101’.

PATRICIA TREE
A PATRICIA tree skips the bit positions on which all keys below a node agree. Therefore, its internal nodes have the following attributes:
– Index bit (check bit)
– Child pointers (each internal node has exactly 2 children)
Leaf nodes, on the other hand, must store the actual content for the final comparison.
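
The node attributes above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation behind the slides: the names `Leaf`, `Internal`, `insert`, and `contains` are made up here, keys are assumed to be bit strings of equal length, and bit positions are counted from 0 at the left.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Leaf:
    key: str  # the full key, kept for the final comparison


@dataclass
class Internal:
    bit: int        # index (check) bit, counted from 0 at the left
    left: "Node"    # keys with '0' at position `bit`
    right: "Node"   # keys with '1' at position `bit`


Node = Union[Leaf, Internal]


def find_leaf(node: Node, key: str) -> Leaf:
    """Follow the check bits only; the bits in between are skipped."""
    while isinstance(node, Internal):
        node = node.right if key[node.bit] == "1" else node.left
    return node


def contains(root: Optional[Node], key: str) -> bool:
    # The leaf reached may still differ in a skipped bit, so compare fully.
    return root is not None and find_leaf(root, key).key == key


def insert(root: Optional[Node], key: str) -> Node:
    if root is None:
        return Leaf(key)
    nearest = find_leaf(root, key).key
    diff = next((i for i, (a, b) in enumerate(zip(nearest, key)) if a != b), None)
    if diff is None:
        return root  # key already stored
    return _insert_at(root, key, diff)


def _insert_at(node: Node, key: str, diff: int) -> Node:
    # Split as soon as we reach a leaf or a node that checks a later bit.
    if isinstance(node, Leaf) or node.bit > diff:
        leaf = Leaf(key)
        return Internal(diff, node, leaf) if key[diff] == "1" else Internal(diff, leaf, node)
    if key[node.bit] == "1":
        node.right = _insert_at(node.right, key, diff)
    else:
        node.left = _insert_at(node.left, key, diff)
    return node


root = None
for k in ("010", "011", "101"):
    root = insert(root, k)
```

For the three example keys, the root checks bit 0 (separating ‘101’ from the other two) and its left child checks bit 2 (separating ‘010’ from ‘011’); bit 1 is never examined during descent, which is why the leaf comparison is needed.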

SISTRING
Sistring is the short form of ‘Semi-Infinite String’.
A string, whatever it represents, is ultimately a binary bit pattern (e.g. …).
One of the sistrings in the above example is …
There are 5 sistrings in total in this example.

SISTRING
Sistrings are theoretically of infinite length:
… … … … 10000…
In practice, we cannot store infinitely long strings. For the above example, we only need to store each sistring up to 5 bits long; that is enough to distinguish each one from the others.

SISTRING
The bit level is too abstract; depending on the application, we rarely work at the bit level.
The character level is a better idea!
– e.g. CUHK
– The corresponding sistrings would be CUHK000… UHK000… HK000… K000…
– We require each to be at least 4 characters long.
– (Why do we pad 0/NULL at the end of a sistring?)
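
As a small sketch of the character-level idea, the function below lists the sistrings of a string, each padded to the length of the whole string. The function name `sistrings` and the use of the character ‘0’ as the visible padding symbol are choices made for this illustration.

```python
from typing import List


def sistrings(text: str, pad: str = "0") -> List[str]:
    """Return every suffix of `text`, right-padded to len(text) characters."""
    n = len(text)
    return [text[i:].ljust(n, pad) for i in range(n)]


print(sistrings("CUHK"))  # ['CUHK', 'UHK0', 'HK00', 'K000']
```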

SISTRING (USAGE)
Sistrings are efficient for storing sub-string information.
A string with n characters has n(n+1)/2 sub-strings. Since the longest one has size n, the storage requirement for the sub-strings is O(n^3).
– e.g. ‘CUHK’ is 4 characters long and has 4(5)/2 = 10 different sub-strings: C, U, …, CU, UH, …, CUH, UHK, CUHK.
– Storage requirement: O(n^2) sub-strings × maximum length O(n) -> O(n^3)
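
The count can be checked directly. The sketch below enumerates sub-strings by start and end position (the helper name `substrings` is made up for this example) and confirms the n(n+1)/2 figure for ‘CUHK’.

```python
from typing import List


def substrings(text: str) -> List[str]:
    """All non-empty sub-strings, one per (start, end) pair."""
    n = len(text)
    return [text[i:j] for i in range(n) for j in range(i + 1, n + 1)]


subs = substrings("CUHK")
n = len("CUHK")
print(len(subs), n * (n + 1) // 2)   # 10 10
print(sum(len(s) for s in subs))     # 20 characters stored in total
```

Storing every sub-string explicitly costs the sum of their lengths, which is bounded by the number of sub-strings times the maximum length; this is the O(n^3) bound quoted above.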

SISTRING (USAGE)
We may instead store the sistrings of ‘CUHK’, which requires O(n^2) storage.
– CUHK <- represents C, CU, CUH, CUHK at the same time
– UHK0 <- represents U, UH, UHK at the same time
– HK00 <- represents H, HK at the same time
– K000 <- represents K only
Prefix matching on the sistrings is equivalent to exact matching on the sub-strings.
Conclusion: sistrings are a better representation for storing sub-string information.
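
The equivalence between prefix matching on sistrings and exact matching on sub-strings can be illustrated with the four padded sistrings from the slide. The function name `occurs` is hypothetical, and the sketch assumes the padding symbol ‘0’ never appears in a query.

```python
SISTRINGS = ["CUHK", "UHK0", "HK00", "K000"]


def occurs(pattern: str, sistrings=SISTRINGS) -> bool:
    """A pattern is a sub-string iff it is a prefix of some sistring."""
    return any(s.startswith(pattern) for s in sistrings)


print(occurs("UH"))   # True:  prefix of 'UHK0'
print(occurs("UK"))   # False: no sistring starts with 'UK'
```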

PAT Tree
Now it is time to return to the PAT tree.
– A PAT tree is a PATRICIA tree that stores every sistring of a document.
What if the document contains simply ‘CUHK’?
– We prefer to think in characters, but a PATRICIA tree works on bits, so we need the bit pattern of each sistring to see what the resulting PAT tree looks like.
– This looks tedious even for a small example, but it is how the PAT tree works!

PAT Tree (Example)
By converting the string to bits, we can work out by hand what the PAT tree looks like.
[Figure: the actual bit pattern of the four sistrings]
– Once we understand how the PAT tree works, we will not go into this detail in later examples.
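
To get a feel for such bit patterns, the sketch below prints the bits of the four sistrings of ‘CUHK’. It assumes 8-bit ASCII codes and NUL bytes as padding; the encoding used on the original slide is not stated, so these patterns may differ from the ones it showed. The name `bit_pattern` is made up here.

```python
def bit_pattern(s: str) -> str:
    """Bits of an ASCII string, one 8-bit group per character."""
    return " ".join(f"{byte:08b}" for byte in s.encode("ascii"))


for s in ("CUHK", "UHK\0", "HK\0\0", "K\0\0\0"):
    print(bit_pattern(s))
```

Under this assumption ‘C’ is 01000011 and ‘U’ is 01010101, so the first two sistrings already differ at the fourth bit, which is the kind of position a check bit records.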

PAT Tree
We do not view a document as a packed string of characters. A document consists of words.
e.g. “Hello. This is a simple document.”
In this case, sistrings can be applied at the ‘document level’: the document is treated as one big string, and we tokenize it word by word instead of character by character.
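
A word-level version can be sketched as follows. The tokenization rule (runs of word characters, punctuation dropped) and the name `word_sistrings` are assumptions made for this illustration; the slides do not specify a tokenizer.

```python
import re
from typing import List


def word_sistrings(text: str) -> List[str]:
    """One sistring per word position, each running to the end of the text."""
    words = re.findall(r"\w+", text)
    return [" ".join(words[i:]) for i in range(len(words))]


for s in word_sistrings("Hello. This is a simple document."):
    print(s)
```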

PAT Tree (Example)
This works! BUT…
– We still need O(n^2) memory to store those sistrings.
We can reduce the memory to O(n) by using pointers.

PAT Tree (Actual Structure)
We need to keep only the document itself; the PAT tree acts as an index structure.
Memory requirement:
– Document, O(n)
– PAT tree index, O(n)
– Leaf pointers, O(n)
Therefore, the PAT tree is a linear, O(n), data structure that captures the O(n^3) sub-string information.
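
The pointer idea can be sketched by keeping one integer offset per sistring instead of a copy of its text. In this illustration a plain list of offsets stands in for the leaves of the tree, and the lookup scans them linearly rather than descending a PATRICIA index; the names `leaf_offsets` and `find` are hypothetical.

```python
from typing import List

document = "CUHK"
# One pointer (start offset) per sistring: O(n) integers, no copied text.
leaf_offsets = list(range(len(document)))


def find(pattern: str, doc: str = document, offsets: List[int] = leaf_offsets) -> List[int]:
    """Offsets whose sistring starts with `pattern`, compared in place."""
    return [i for i in offsets if doc.startswith(pattern, i)]


print(find("HK"))  # [2]
print(find("UK"))  # []
```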

The Chinese PAT tree
We can build a PAT tree for English easily: sistrings are decomposed word by word.
In a Chinese document, however, the layout gives no indication of word boundaries; the characters are simply packed together.
– e.g. “香港體育館”
– We know there are 5 characters, but what else?
– In fact, there are 2 words, “香港” and “體育館”, but we cannot know this just by reading the text without other supporting knowledge.

Semi-Infinite String (Sistring)
Sistrings are null-padded strings. The sistrings become:
– 香港體育館
– 港體育館 00
– 體育館 0000
– 育館 000000
– 館 00000000
This makes the sistrings comparable with each other: we can examine any particular bit of a sistring, and no sistring will have a ‘missing bit’.
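
The padding can be illustrated at the byte level. The sketch below assumes UTF-16 (big-endian) as a stand-in for a double-byte encoding, since the encoding used in the original work is not stated; `padded_sistrings` is a name made up for this example.

```python
from typing import List


def padded_sistrings(sentence: str, encoding: str = "utf-16-be") -> List[bytes]:
    """Byte-level sistrings, null-padded to the full sentence width.

    Assumes every character occupies exactly 2 bytes in `encoding`.
    """
    raw = sentence.encode(encoding)
    width = len(raw)
    return [raw[i:].ljust(width, b"\x00") for i in range(0, width, 2)]


for s in padded_sistrings("香港體育館"):
    print(s.hex())
```

Every sistring comes out 10 bytes (80 bits) long, so any bit position from 0 to 79 can be examined in all of them.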

The Chinese PAT tree
In Chinese information processing research, researchers suggest forming the sistrings of a Chinese document at the “sentence level”:
– i.e. each document is decomposed into sentences at its punctuation marks.
– “各位同學，早安。” will be viewed as 2 sentences, “各位同學” and “早安”.
– For each sentence, the sistrings are obtained as “各位同學”, “位同學”, “同學”, etc.
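
A minimal sketch of sentence-level sistrings follows. The set of punctuation marks in `SEPARATORS` and the function names are assumptions made for illustration; a real system would need a fuller list of delimiters.

```python
import re
from typing import List

# Hypothetical delimiter set: common full-width and ASCII punctuation.
SEPARATORS = "，。！？；：、,.!?;:"


def sentences(text: str) -> List[str]:
    """Split at punctuation marks and drop empty pieces."""
    parts = re.split("[" + re.escape(SEPARATORS) + "]", text)
    return [p for p in parts if p]


def sentence_sistrings(text: str) -> List[str]:
    """Character-level sistrings, taken within each sentence separately."""
    return [s[i:] for s in sentences(text) for i in range(len(s))]


print(sentences("各位同學，早安。"))           # ['各位同學', '早安']
print(sentence_sistrings("各位同學，早安。"))
```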

The Chinese PAT tree
In this way, the Chinese PAT tree is built.
Since every Chinese word must be a sub-string of the document, all Chinese words can still be found in the Chinese PAT tree efficiently.
Therefore, Chinese word segmentation is one of the most important applications of the PAT tree.

The Chinese PAT Tree Structure
In the Chinese PAT tree, a document is decomposed into sentences.
It is possible for a sistring of one sentence to duplicate a sistring of another sentence.
– e.g. “中文大學，香港大學”. The sistring “大學” appears twice; one occurrence will be absorbed by the other.
– Therefore, we usually attach a frequency count to each leaf node of the tree.
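
The effect of duplicate sistrings can be seen by counting them. In this sketch a `Counter` plays the role of the per-leaf frequency counts, and the two sentences are passed in already split; `sistring_counts` is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable


def sistring_counts(sentences: Iterable[str]) -> Counter:
    """Count how often each sentence-level sistring occurs."""
    return Counter(s[i:] for s in sentences for i in range(len(s)))


counts = sistring_counts(["中文大學", "香港大學"])
print(counts["大學"])      # 2
print(counts["中文大學"])  # 1
```

Eight sistrings are generated but only six are distinct, so six leaves with frequency counts are enough to keep all the information.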

The Chinese PAT Tree Structure
– Internal nodes remain the same: they hold the check-bit information.
– Leaf nodes now have a frequency count attribute.
– The document is decomposed into a number of sentences.
– Storage complexity remains O(n).

Structure modification
The node structures of internal nodes and leaf nodes are not the same.
– The tree will be more flexible if its nodes are generic (share one universal node structure).
– Trade-off: a generic node structure enlarges each individual node.
– But memory is cheap now: even a low-end computer can support hundreds of MB of RAM.
The modified tree is still an O(n) structure.

Structure of the modified node
1. Check Bit
2. Frequency Count
3. Link to a sistring
4. Pointers to the child nodes
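
One possible rendering of this generic node as a Python dataclass is shown below. The field names and types are assumptions (for instance, the link to a sistring is modelled as an integer offset into the document); the slides list only the four attributes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PatNode:
    check_bit: int                      # 1. bit position tested at this node
    frequency: int = 0                  # 2. occurrences of the associated sistring
    sistring: Optional[int] = None      # 3. link to a sistring (offset into the document)
    left: Optional["PatNode"] = None    # 4. child followed when the check bit is 0
    right: Optional["PatNode"] = None   #    child followed when the check bit is 1


node = PatNode(check_bit=48, frequency=1, sistring=2)
```

Because every node carries all four fields, the same structure serves as both internal node and leaf, at the cost of a slightly larger node.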

Example of our Modified Version
Chinese Text: “香港體育館”
Sistrings      Bit Pattern
香港體育館     …
港體育館       …
體育館         …
育館           …
館             …

Essential Length
Essential length is the number of Chinese characters a tree node can represent, expressed in bits.
In general, a Chinese character is a double-byte (16-bit) character.
The essential length equals the check bit rounded down to a whole number of Chinese characters (a multiple of 16 bits).
– e.g. a node with check bit = 53
– It can represent only 3 Chinese characters (48 bits), not 4 Chinese characters (64 bits)
– Its essential length = 48
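
The rounding rule is a single integer division. The sketch below reproduces the example on this slide; the names `CHAR_BITS` and `essential_length` are chosen here for illustration.

```python
CHAR_BITS = 16  # one double-byte Chinese character


def essential_length(check_bit: int) -> int:
    """Check bit rounded down to a whole number of characters, in bits."""
    return (check_bit // CHAR_BITS) * CHAR_BITS


print(essential_length(53))  # 48, i.e. 3 whole characters
```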

Essential Node
We call a node an “Essential Node” (EN) if and only if:
– its essential length >= 32, and
– its essential length is at least 16 more than that of its nearest ancestral EN.
Each Essential Node uniquely represents a sub-string (phrase).
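
The two conditions translate into a short predicate. This is a sketch under one assumption not stated on the slide: when a node has no ancestral EN, the ancestor's essential length is taken to be 0. The function and parameter names are made up for illustration.

```python
CHAR_BITS = 16  # one double-byte Chinese character


def essential_length(check_bit: int) -> int:
    return (check_bit // CHAR_BITS) * CHAR_BITS


def is_essential_node(check_bit: int, ancestor_en_length: int = 0) -> bool:
    """Apply both EN conditions to a node's check bit.

    `ancestor_en_length` is the essential length of the nearest ancestral EN
    (assumed to be 0 when there is none).
    """
    length = essential_length(check_bit)
    long_enough = length >= 2 * CHAR_BITS                      # at least 2 characters
    longer_than_ancestor = length - ancestor_en_length >= CHAR_BITS
    return long_enough and longer_than_ancestor


print(is_essential_node(53))        # True:  essential length 48, no ancestral EN
print(is_essential_node(53, 48))    # False: no longer than its ancestral EN
print(is_essential_node(80, 48))    # True:  two characters longer than its ancestor
```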

Essential Node
With the definition of “Essential Node” (EN):
– Each essential node represents a possible Chinese sub-string, e.g. “香港體育館”, “體育館”.
With the generalized structure, each EN also has a frequency count, which reflects the number of occurrences of its associated sub-string.

Essential Node
The essential node with
– check bit 80 has essential length 80 and represents the phrase “香港體育館”
– check bit 48 has essential length 48 and represents the phrase “體育館”

Applications
A PAT tree may embed more information, depending on the application.
Well-known Chinese information processing applications include:
– Keyword extraction
– Sentence segmentation
– Document classification
– …
These show the importance of the PAT tree structure to those applications.

Conclusion
– The PAT tree is an O(n) data structure for document indexing.
– The PAT tree is good for solving the sub-string matching problem.
– The Chinese PAT tree takes sistrings at the sentence level. A frequency count is introduced to handle duplicate sistrings.
– By generalizing the node structure, the modified version increases the PAT tree's capability for various applications.