Hefei Normal University


A Discourse Analysis Based Approach to Automatic Information Tagging in Chinese Judicial Discourses
Bo SUN, Hefei Normal University

Contents
1. Introduction
2. Information Features of Judicial Discourses
3. Information Identification based on Information Rules
4. Evaluation of Automatic Information Identification
5. Limitations and Suggestions for Further Research

1-1 Research Background
Why information identification?
Information identification is becoming more and more important:
  Case guidance has been introduced into the Chinese legal system.
  It provides reference for parties in a lawsuit: litigants, attorneys, and judges.
  It facilitates legal writing education for law school students and teachers.

Why automatic information identification?
Information identification is a necessary nuisance in reading judicial discourse: not all the information is needed, and finding the needed part by hand is time-consuming and labour-intensive.

Search engines (Baidu, Bing, Google, …) usually present the whole discourse rather than the specific information needed. The same deficiency can be found in some well-known databases, such as 北大法宝 (PKULaw).

Some legal databases are so complicated that users need to be trained, such as Westlaw (Dai, 2013).

Discourse Information Theory
Research on mass communication generalizes information by five interrogative expressions, namely “who”, “says what”, “to whom”, “in what channel”, and “with what effect” (Lasswell, 1948). Braddock (1958) extended this division by adding “under what circumstances” and “for what purpose”. This treatment accords with our intuitive understanding of information, but it outlines only the linear structure of language information.

Discourse Information Theory pays more attention to the hierarchical structure of discourse and treats the clause as the basic unit, called the information unit (IU). IUs can be represented by 15 interrogative expressions: What Attitude (WA), What Basis (WB), What Condition (WC), What Effect (WE), What Fact (WF), What Change (WG), How (HW), What Inference (WI), What Judgement (WJ), When (WN), Who (WO), What Disposal (WP), Where (WR), What Thing (WT), and Why (WY) [7].

Example:
1WB|In accordance with Articles 348 and 357 of the Criminal Law of People’s Republic of China, the decision is made as follows.
2WJ||The defendant Liu X is guilty of the crime of illegally holding drugs,
3WP|||so he is sentenced to nine months’ imprisonment and
4WP|||fined 3000 Yuan.
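The annotation format in the example can be read mechanically. The following is a minimal sketch, not the author's tool: it assumes each unit is written as a running number, a two-letter interrogative code, one or more `|` bars, and then the clause text, and it assumes the number of bars encodes the depth of the unit in the hierarchy. The names `InfoUnit` and `parse_units` are hypothetical.

```python
import re
from typing import List, NamedTuple


class InfoUnit(NamedTuple):
    index: int   # running number of the unit in the discourse
    label: str   # two-letter interrogative code, e.g. "WB"
    level: int   # number of "|" bars (assumed here to be the hierarchy depth)
    text: str    # clause text of the unit


# One unit: digits, two capitals, one or more bars, then text up to the
# start of the next unit (or the end of the string).
IU_PATTERN = re.compile(r"(\d+)([A-Z]{2})(\|+)(.*?)(?=\d+[A-Z]{2}\||$)", re.S)


def parse_units(annotated: str) -> List[InfoUnit]:
    """Split an annotated passage into its information units."""
    return [
        InfoUnit(int(number), label, len(bars), text.strip())
        for number, label, bars, text in IU_PATTERN.findall(annotated)
    ]


sample = (
    "1WB|In accordance with Articles 348 and 357 of the Criminal Law, "
    "the decision is made as follows.2WJ||The defendant Liu X is guilty, "
    "3WP|||so he is sentenced to nine months' imprisonment and "
    "4WP|||fined 3000 Yuan."
)
units = parse_units(sample)
for unit in units:
    print(unit.index, unit.label, unit.level, unit.text)
```

Ordinary numbers in the clause text, such as "348" or "3000", are not mistaken for unit markers because a marker requires two capital letters and a bar directly after the digits.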

1-2 Research Objective and Research Questions
Objective: to develop an automatic approach to identifying discourse information in Chinese judicial texts.
(Diagram labels: Automatic Method, Legal Discourse, User)

Research questions:
  What are the information features of Chinese judicial discourse?
  What information processing rules can be constructed according to these features?
  How does a computer perform in information identification when equipped with the rules?

2. Information Features of Judicial Discourses

Inter-type constancy
Macro features: initiating with kernel information; ending with formulation date; setting formulator prior to formulation date.
Micro features: words that occur in more than 80% of the collected texts provide a clue for finding constant elements.
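The 80% criterion for constant elements is a document-frequency threshold. The sketch below is one way to compute it, assuming the texts have already been segmented into word lists; the function name `constant_words` and the toy English tokens are hypothetical and stand in for the segmented Chinese judgments.

```python
from collections import Counter
from typing import Iterable, List, Set


def constant_words(docs: Iterable[List[str]], threshold: float = 0.8) -> Set[str]:
    """Return the words that occur in more than `threshold` of the documents."""
    docs = list(docs)
    doc_freq = Counter()
    for tokens in docs:
        # Count each word once per document, however often it repeats.
        doc_freq.update(set(tokens))
    return {word for word, n in doc_freq.items() if n / len(docs) > threshold}


corpus = [
    ["court", "defendant", "theft"],
    ["court", "defendant", "drugs"],
    ["court", "defendant", "fraud"],
    ["court", "defendant", "robbery"],
    ["court", "plaintiff", "contract"],
]
print(constant_words(corpus))
```

Here "court" appears in all five texts and is kept, while "defendant" appears in exactly 80% of them and is dropped, because the criterion is "more than 80%".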

Intra-type constancy
Macro features: same keyword in kernel information; obligatory upper-level information units; obligatory lower-level information units.
Micro features: similar processes, participants and circumstances are likely to be shared by obligatory upper-level units.

Intra-type variation
Macro features: optional upper-level information units; optional lower-level information units.
Micro features (shared by optional units):
  Process elements in optional upper-level units tend to be implicit, while those in optional lower-level units may overlap.
  Participant elements can be shared by both upper-level and lower-level units.
  Few common circumstance elements can be found.

3. Information Processing Rules
Automatic processing:
  Automatic text classification
  Automatic identification of upper-level units
  Automatic identification of lower-level units

Automatic text classification
《刑事判决书的写作要领》 (Essentials of Writing Criminal Judgments); (Halliday & Matthiessen, 2004)

Automatic text classification
  Keyword in kernel information
  Process elements in obligatory upper-level units as the primary variable
  Participant as a substitute for process
  The order of these elements

Pretreatment of discourse information
Information elements: participant, process, circumstance

Identification Rules of Information Units
Three dimensions of the identification rules:
  Which elements should be selected?
  What is the logical relationship among the selected elements?
  How are they positioned relative to each other?

Identification rules (upper-level units)
1) Combination of process and participant as the main indicator
  a) The synonyms of process verbs or participant nouns should be included.
  b) Where one IU contains more than one common process verb, all these verbs probably need to be considered.
2) Absolute position as the complementary indicator

Identification rules (lower-level units)
  Obligatory lower-level information items as the default value
  Process plus participant as the main indicator
  Relevant position as the alternative indicator
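The rule structure described for upper-level units (process plus participant as the main indicator, position as a complementary one) can be sketched as a small matcher. This is an illustration under stated assumptions, not the system presented in the talk: the `Rule` class, the `identify_clause` function, and every word list below are hypothetical, and clauses are assumed to be pre-segmented into tokens.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Rule:
    label: str                      # information unit code, e.g. "WJ"
    processes: Set[str]             # process verbs, synonyms included
    participants: Set[str]          # participant nouns, synonyms included
    position: Optional[int] = None  # absolute clause index, if the unit has one


def identify_clause(tokens: List[str], index: int, rules: List[Rule]) -> Optional[str]:
    """Label one clause: main indicator first, absolute position second."""
    words = set(tokens)
    # Main indicator: the clause holds both a process and a participant word.
    for rule in rules:
        if words & rule.processes and words & rule.participants:
            return rule.label
    # Complementary indicator: the clause sits at the rule's fixed position.
    for rule in rules:
        if rule.position is not None and rule.position == index:
            return rule.label
    return None


# Hypothetical rules with toy English vocabulary.
rules = [
    Rule("WB", {"accordance", "pursuant"}, {"articles", "law"}, position=0),
    Rule("WJ", {"guilty", "convicted"}, {"defendant", "accused"}),
]
clauses = [
    ["in", "accordance", "with", "articles", "348", "and", "357"],
    ["the", "defendant", "is", "guilty"],
    ["fined", "3000", "yuan"],
]
labels = [identify_clause(tokens, i, rules) for i, tokens in enumerate(clauses)]
print(labels)
```

Checking the main indicator across all rules before falling back to position keeps a positional default from overriding a clause that matches another rule's vocabulary.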

Realization of automatic information identification

4. Evaluation of Automatic Information Processing
The information rule based text classification vs. an SVM classifier:
  Preprocessing includes word segmentation and stop-word removal, with only nouns and verbs kept.
  These words are represented with the Vector Space Model and their tf-idf values are calculated.
  The 1000 features with the highest χ² values are selected.
  Directed Acyclic Graph SVMs (DAGSVM) are adopted for this multiclass classification.
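The feature pipeline of the baseline can be illustrated in a few lines. The sketch below assumes the selection score on the slide is the chi-square statistic and uses one common tf-idf variant (relative term frequency times the log of inverse document frequency); the slides do not specify either formula. The function names and the toy corpus are hypothetical, and the DAGSVM classifier itself is left out.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Weight each word by relative frequency times log(N / document frequency)."""
    n = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return [
        {w: (c / len(doc)) * math.log(n / doc_freq[w]) for w, c in Counter(doc).items()}
        for doc in docs
    ]


def chi2(docs: List[List[str]], labels: List[str], term: str, cls: str) -> float:
    """Chi-square statistic of the 2x2 table (term present?) x (class is cls?)."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        present = term in tokens
        if present and label == cls:
            a += 1          # term present, in class
        elif present:
            b += 1          # term present, other class
        elif label == cls:
            c += 1          # term absent, in class
        else:
            d += 1          # term absent, other class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_features(docs: List[List[str]], labels: List[str], k: int) -> List[str]:
    """Keep the k words whose best per-class chi-square score is highest."""
    vocab = {word for doc in docs for word in doc}
    score = {w: max(chi2(docs, labels, w, c) for c in set(labels)) for w in vocab}
    return sorted(vocab, key=lambda w: (-score[w], w))[:k]


docs = [["theft", "sentence"], ["theft", "fine"], ["drug", "sentence"], ["drug", "fine"]]
labels = ["theft_case", "theft_case", "drug_case", "drug_case"]
print(select_features(docs, labels, k=2))
print(tf_idf(docs)[0])
```

In this toy corpus "theft" and "drug" separate the two classes perfectly and score highest, while "sentence" and "fine" are spread evenly and score zero, which is the behaviour the selection step relies on.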

The information rule based information identification vs. a Viterbi identifier
λ = (Q, O, A, B, π)
Q denotes the information category; O is the category of process verbs; A is the probability of information category transition; B is the likelihood that a category of process verb occurs given an information category; π is the initial probability of each information category.
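For reference, Viterbi decoding over such a model can be sketched as follows. This is the textbook algorithm, not the identifier evaluated in the talk: hidden states play the role of information categories, observations are process-verb categories, and `start_p`, `trans_p` and `emit_p` correspond to the initial, transition and emission probabilities. All probability values and verb categories below are invented for illustration.

```python
from typing import Dict, List


def viterbi(
    obs: List[str],
    states: List[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the most probable state sequence for the observations."""
    if not obs:
        return []
    # best[t][s]: probability of the best path that ends in state s at step t.
    best = [{s: start_p[s] * emit_p[s].get(obs[0], 0.0) for s in states}]
    back: List[Dict[str, str]] = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, p = max(
                ((r, best[t - 1][r] * trans_p[r].get(s, 0.0)) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = p * emit_p[s].get(obs[t], 0.0)
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Invented toy model: three information categories, three verb categories.
states = ["WB", "WJ", "WP"]
start_p = {"WB": 0.8, "WJ": 0.1, "WP": 0.1}
trans_p = {
    "WB": {"WB": 0.1, "WJ": 0.8, "WP": 0.1},
    "WJ": {"WJ": 0.2, "WP": 0.8},
    "WP": {"WP": 0.9, "WJ": 0.1},
}
emit_p = {
    "WB": {"cite": 0.9, "convict": 0.05, "punish": 0.05},
    "WJ": {"convict": 0.8, "cite": 0.1, "punish": 0.1},
    "WP": {"punish": 0.9, "convict": 0.1},
}
decoded = viterbi(["cite", "convict", "punish", "punish"], states, start_p, trans_p, emit_p)
print(decoded)
```

The sketch multiplies raw probabilities for readability; on long sequences a log-space version avoids numerical underflow. Note that the state sequence is flat, so the model has no way to represent which unit is nested under which.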

SVMs and the Viterbi algorithm are very successful in NLP. Their weak performance here might be attributed to feature (observation) selection. Both of them ignore the hierarchical structure of discourse.

5. Limitations and Suggestions for Further Research
Limitations:
  More types of legal texts should be explored.
  The rules require a very detailed discourse analysis.
Suggestions for further research:
  The rule-based approach can be combined with statistically based approaches.

Thank you!