A Discourse Analysis Based Approach to Automatic Information Tagging in Chinese Judicial Discourses
Bo SUN, Hefei Normal University
Contents
1. Introduction
2. Information Features of Judicial Discourses
3. Information Identification Based on Information Rules
4. Evaluation of Automatic Information Identification
5. Limitations and Suggestions for Further Research
1-1 Research Background
Why information identification? It has become more and more important: case guidance has been introduced into the Chinese legal system; it provides reference for parties in a lawsuit (litigants, attorneys, judges); and it facilitates legal writing education (law school students and teachers).
Why automatic information identification? Manual information identification is a necessary but burdensome part of reading judicial discourse: not all the information is needed, and the process is time-consuming and labour-intensive.
Search engines (百度, Bing, Google, …) usually present the whole discourse rather than the specific information needed. The same deficiency can be found in some well-known databases, such as "北大法宝" (PKULaw).
Some legal databases, such as Westlaw, are so complicated that users need to be trained (Dai, 2013).
Discourse Information Theory
Research on mass communication generalizes information with five interrogative expressions: "who", "says what", "to whom", "in what channel", and "with what effect" (Lasswell, 1948). This division was later extended by adding "under what circumstances" and "for what purpose" (Braddock, 1958). Such a treatment accords with our intuitive understanding of information, but it outlines only the linear structure of language information.
Discourse Information Theory pays more attention to the hierarchical structure of discourse and takes the clause as the basic unit of information, called the information unit (IU). IUs can be represented by 15 interrogative expressions: What Attitude (WA), What Basis (WB), What Condition (WC), What Effect (WE), What Fact (WF), What Change (WG), How (HW), What Inference (WI), What Judgement (WJ), When (WN), Who (WO), What Disposal (WP), Where (WR), What Thing (WT), and Why (WY) [7].
Example:
1 WB  |   In accordance with Articles 348 and 357 of the Criminal Law of the People's Republic of China, the decision is made as follows.
2 WJ  ||  The defendant Liu X is guilty of the crime of illegally holding drugs,
3 WP  ||| so he is sentenced to nine months' imprisonment and
4 WP  ||| fined 3,000 yuan.
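As a minimal sketch only, the tagged example above could be turned into a simple data structure, assuming the compact notation shown (sequence number, two-letter category code, and hierarchy depth encoded by the number of "|" marks); the class and field names are illustrative, not part of the original annotation scheme:

```python
import re
from dataclasses import dataclass

@dataclass
class InformationUnit:
    index: int       # sequence number of the clause in the discourse
    category: str    # one of the 15 interrogative codes, e.g. "WB", "WJ", "WP"
    level: int       # hierarchy depth, encoded by the number of "|" marks
    text: str        # the clause itself

# Assumed tag pattern: "<index> <CODE> <pipes> <clause text>"
IU_PATTERN = re.compile(r"(\d+)\s*([A-Z]{2})\s*(\|+)\s*")

def parse_tagged_discourse(tagged: str) -> list:
    """Split a tagged discourse string into structured information units."""
    units = []
    matches = list(IU_PATTERN.finditer(tagged))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(tagged)
        units.append(InformationUnit(
            index=int(m.group(1)),
            category=m.group(2),
            level=len(m.group(3)),
            text=tagged[m.end():end].strip(),
        ))
    return units
```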
1-2 Research Objective and Research Questions
Objective: to develop an automatic approach to identifying discourse information in Chinese judicial texts.
(Figure: the automatic method mediates between the legal discourse and the user.)
Research questions:
1) What are the information features of Chinese judicial discourse?
2) What information processing rules can be constructed according to these features?
3) How does the computer perform in information identification when equipped with the rules?
2. Information Features of Judicial Discourses
Inter-type constancy
Macro features: initiating with kernel information; ending with the formulation date; setting the formulator prior to the formulation date.
Micro features: words that occur in more than 80% of the collected texts provide a clue for finding constant elements.
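The 80% micro-feature is straightforward to compute; a minimal sketch, assuming the judgments have already been word-segmented into token lists (segmentation itself is not shown):

```python
from collections import Counter

def constant_words(tokenized_texts, threshold=0.8):
    """Return words that appear in more than `threshold` of the collected texts."""
    doc_freq = Counter()
    for tokens in tokenized_texts:
        doc_freq.update(set(tokens))          # count each word once per text
    n_docs = len(tokenized_texts)
    return {w for w, df in doc_freq.items() if df / n_docs > threshold}
```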
Intra-type constancy
Macro features: the same keyword in kernel information; obligatory upper-level information units; obligatory lower-level information units.
Micro features: similar processes, participants and circumstances are likely to be shared by obligatory upper-level units.
Intra-type variation
Macro features: optional upper-level information units; optional lower-level information units.
Micro features (shared by optional units): process elements in optional upper-level units tend to be implicit, while those in optional lower-level units may overlap; participant elements can be shared by both upper- and lower-level units; few common circumstance elements can be found.
3. Information Processing Rules
Automatic processing consists of: automatic text classification; automatic identification of upper-level units; automatic identification of lower-level units.
Automatic text classification (cf. 《刑事判决书的写作要领》 [Essentials of Writing Criminal Judgments]; Halliday & Matthiessen, 2004)
Automatic text classification draws on: the keyword in the kernel information; process elements in obligatory upper-level units as the primary variable; participant as a substitute for process; and the order of these elements.
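As a hedged illustration of how these cues might be combined (the keyword, process verbs, and participant nouns below are assumed placeholders, not the actual rule set):

```python
# Hypothetical rule table: each text type is keyed by the keyword expected in
# its kernel information and by the ordered process verbs of its obligatory
# upper-level units; participant nouns serve as substitutes when a process
# verb is implicit.
RULES = {
    "criminal_judgment": {
        "kernel_keyword": "刑事判决书",
        "process_order": ["指控", "查明", "认为", "判决"],
        "participant_substitute": ["被告人", "公诉机关"],
    },
    # ...rules for other text types would follow the same shape
}

def classify(text):
    """Assign a text type by matching the kernel keyword first, then checking
    that the obligatory process verbs (or participant substitutes) occur in order."""
    for text_type, rule in RULES.items():
        if rule["kernel_keyword"] not in text:
            continue
        pos, ordered = -1, True
        for verb in rule["process_order"]:
            hit = text.find(verb, pos + 1)
            if hit == -1:                      # verb implicit: try a participant substitute
                hits = [text.find(p, pos + 1) for p in rule["participant_substitute"]]
                hits = [h for h in hits if h != -1]
                hit = min(hits) if hits else -1
            if hit == -1:
                ordered = False
                break
            pos = hit
        if ordered:
            return text_type
    return None
```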
Pretreatment of discourse information: each information unit is decomposed into its elements, namely participant, process, and circumstance.
Identification Rules of Information Units
Three dimensions of the identification rules: Which elements should be selected? What is the logical relationship among the selected elements? How are they positioned relative to each other?
Identification rules (upper-level units)
1) Combination of process and participant as the main indicator:
   a) Synonyms of process verbs or participant nouns should be included.
   b) Where one IU contains more than one common process verb, all of these verbs probably need to be considered.
2) Absolute position as the complementary indicator.
Identification rules (lower-level units)
1) Obligatory lower-level information items as the default value.
2) Process plus participant as the main indicator.
3) Relative position as the alternative indicator.
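A minimal sketch of how such rules could be encoded; the rule entries (categories, synonym sets, positions, and the default "WF") are illustrative assumptions, not the actual rule base derived from the discourse analysis:

```python
# Illustrative rule entries only.
UPPER_RULES = [
    {
        "category": "WB",                      # What Basis
        "process": {"依照", "根据", "依据"},    # a process verb and its synonyms
        "participant": {"刑法", "刑事诉讼法"},  # participant nouns and synonyms
        "absolute_position": 0,                # complementary indicator: clause index
    },
]

def identify_upper(clauses, upper_rules):
    """Tag upper-level units: process + participant is the main indicator;
    absolute position is consulted as the complementary indicator."""
    tags = {}
    for i, clause in enumerate(clauses):
        for rule in upper_rules:
            combo = (any(v in clause for v in rule["process"])
                     and any(p in clause for p in rule["participant"]))
            if combo or i == rule["absolute_position"]:
                tags[i] = rule["category"]
                break
    return tags

def identify_lower(clauses, upper_tags, lower_rules, default="WF"):
    """Tag the remaining clauses as lower-level units: process + participant is
    the main indicator; the obligatory item (hypothetically "WF" here) acts as
    the default value; position relative to the governing upper-level unit
    could be added as the alternative indicator."""
    tags = {}
    for i, clause in enumerate(clauses):
        if i in upper_tags:
            continue
        matched = next((r["category"] for r in lower_rules
                        if any(v in clause for v in r["process"])
                        and any(p in clause for p in r["participant"])), None)
        tags[i] = matched or default
    return tags
```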
Realization of automatic information identification
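Putting the three steps together, the realization could look like the following end-to-end sketch; split_clauses, RULES_FOR_TYPE, and the classify/identify_* helpers are the hypothetical pieces sketched above, not the author's actual implementation:

```python
def tag_discourse(text):
    """End-to-end sketch: classify the text type, then identify upper- and
    lower-level information units with the rules selected for that type."""
    text_type = classify(text)                       # rule-based classification (above)
    clauses = split_clauses(text)                    # hypothetical clause splitter, e.g.
                                                     # splitting on Chinese punctuation
    rules = RULES_FOR_TYPE[text_type]                # hypothetical per-type rule base
    upper = identify_upper(clauses, rules["upper"])
    lower = identify_lower(clauses, upper, rules["lower"])
    return {**upper, **lower}                        # clause index -> information category
```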
4. Evaluation of Automatic Information Processing
The information-rule-based text classification vs. an SVM classifier:
Preprocessing includes word segmentation and stop-word removal, with only nouns and verbs kept.
The words are represented with the Vector Space Model and weighted by their tf-idf values.
The 1,000 features with the highest χ² values are selected.
Directed Acyclic Graph SVM (DAGSVM) is adopted for this multiclass classification.
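A minimal sketch of such an SVM baseline, assuming jieba for word segmentation and scikit-learn for the rest; note that scikit-learn's SVC uses one-vs-one voting rather than a true DAGSVM, so this is only an approximation of the setup described above:

```python
import jieba.posseg as pseg
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def nouns_and_verbs(text, stopwords=frozenset()):
    """Segment a judgment and keep only nouns and verbs, minus stop words."""
    return " ".join(w for w, flag in pseg.cut(text)
                    if flag[0] in ("n", "v") and w not in stopwords)

def build_classifier():
    return make_pipeline(
        TfidfVectorizer(),                    # Vector Space Model with tf-idf weights
        SelectKBest(chi2, k=1000),            # keep the 1,000 highest-scoring features
        SVC(kernel="linear"),                 # multiclass SVM (one-vs-one, not DAGSVM)
    )

# Usage sketch (train_texts/train_labels are the collected judgments and their types):
# clf = build_classifier()
# clf.fit([nouns_and_verbs(t) for t in train_texts], train_labels)
# pred = clf.predict([nouns_and_verbs(t) for t in test_texts])
```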
The information-rule-based information identification vs. a Viterbi identifier, defined by λ = (Q, O, A, B, π):
Q denotes the information categories; O is the set of process-verb categories (the observations); A is the probability of information-category transition; B is the likelihood that a category of process verb occurs given an information category; and π is the initial probability of the information categories.
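A minimal sketch of the Viterbi decoder under this formulation; A, B, and π would be estimated from the tagged corpus, which is not shown here:

```python
import numpy as np

def viterbi(obs, A, B, pi):
    """Most likely information-category sequence for a sequence of observed
    process-verb categories, given lambda = (A, B, pi).

    obs : list of observation indices (process-verb categories), length T
    A   : (N, N) transition probabilities between information categories
    B   : (N, M) emission probabilities of process-verb categories
    pi  : (N,)   initial information-category probabilities
    """
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))                  # best path probability so far
    psi = np.zeros((T, N), dtype=int)         # back-pointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A    # (N, N): previous category x next category
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]                         # information-category indices, in order
```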
The SVM and Viterbi algorithms are generally very successful in NLP. Their weak performance here might be attributed to feature (observation) selection; moreover, both of them ignore the hierarchical structure of discourse.
5. Limitations and Suggestions for Further Research
Limitations: more types of legal texts should be explored; the rules require a very detailed discourse analysis.
Suggestions for further research: the rule-based approach can be combined with statistically based approaches.
Thank You !