Published by Austin Goodwin. Modified over 8 years ago.
1
Reflex 2: A Look at the Internals of an Automated Legislative Citator
Marc-André Morissette (morissette@lexum.com)
Daniel Shane (shaned@lexum.com)
Valentin Bujanca (bujancav@lexum.com)
2
Reflex 2: A Look at the Internals of an Automated Legislative Citator at LVI 2012
Context
- First, a look at CanLII
  - The most popular resource for Canadian primary legal material
  - > 1M court and tribunal decisions
  - Federal statutes and regulations + statutes and regulations for 13 provinces and territories, most point in time
  - > 30,000,000 pages of legal text
- Citator
  - Automatically recognizes legal citations and adds hyperlinks
  - Convenience
  - Note-up
  - Future improvements to search
3
Traditional Architecture of a Citator
- Phase 1: Recognition of Citation Elements
  - A) Titles
  - B) Section Numbers
  - C) Chapter Number / Formal Citations
- Phase 2: Heuristics
  - Example: the words "of the" tie a section and a piece of legislation together
- Phase 3: Markup
  - Add hyperlinks
4
Recognition of Titles (1)
- More than 40,000 titles in our database
- The words of those titles are composed together to create a Nondeterministic Finite Automaton
- [Figure: automaton whose transitions are title words: start, Agreement, on, internal, trade, implementation, Agriculture, products, programs, cooperative, marketing, act]
5
Recognition of Titles (2)
- The text of the document is split into words
- For each word, a new path through the automaton is attempted
- If a path is completed, then we have found the title of a document
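The word-by-word matching described on these two slides can be sketched roughly as follows. This is a minimal illustration, not the Reflex implementation: it uses a word-level trie (a simple special case of an automaton), and the names `Node`, `build_automaton` and `find_titles`, the punctuation stripping and the two sample titles are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple


class Node:
    """One automaton state; transitions are labelled with whole words."""

    def __init__(self) -> None:
        self.next: Dict[str, "Node"] = {}
        self.title: Optional[str] = None  # set when a full title ends here


def build_automaton(titles: List[str]) -> Node:
    """Compose the words of every title into one shared automaton."""
    root = Node()
    for title in titles:
        node = root
        for word in title.lower().split():
            node = node.next.setdefault(word, Node())
        node.title = title
    return root


def find_titles(text: str, root: Node) -> List[Tuple[int, int, str]]:
    """Return (first_word, end_word_exclusive, title) for each completed path."""
    # Hypothetical normalisation: lower-case and strip surrounding punctuation.
    words = [w.strip(".,;:()").lower() for w in text.split()]
    found = []
    for start in range(len(words)):  # attempt a new path at every word
        node = root
        for pos in range(start, len(words)):
            node = node.next.get(words[pos])
            if node is None:
                break  # no transition for this word: abandon the path
            if node.title is not None:
                found.append((start, pos + 1, node.title))
    return found


TITLES = ["Criminal Code", "Agreement on Internal Trade Implementation Act"]
AUTOMATON = build_automaton(TITLES)
```

Calling `find_titles("See the Criminal Code, as amended.", AUTOMATON)` reports the title together with the word positions it spans. Titles that share leading words share states, which is what keeps a database of tens of thousands of titles manageable.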
6
Recognition of Sections
- Regular Expressions
  - Based on formal language and automata theory
  - Determine whether a given string of text matches a given pattern
- Examples:
  - \d matches any digit
  - \d+ matches one or more digits
  - (s\.|ss\.|section|subsection) (\d+) matches a reference to a section (hopefully)
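As an illustration, the section pattern from the slide can be tried with Python's standard `re` module. The leading `\b`, the case-insensitive flag and the helper name `find_sections` are additions made for this sketch, not part of the slide.

```python
import re

# Pattern from the slide, with a word boundary in front so that the "s."
# alternative cannot start in the middle of a longer word.
SECTION_RE = re.compile(r"\b(s\.|ss\.|section|subsection) (\d+)", re.IGNORECASE)


def find_sections(text):
    """Return (keyword, section_number) pairs found in the text."""
    return [(m.group(1), int(m.group(2))) for m in SECTION_RE.finditer(text)]
```

The "(hopefully)" on the slide is warranted: this pattern does not cover forms such as "sec.", "para" or decimal section numbers like 31.42.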
7
Recognition of Chapter Numbers (1)
- Typically done with Regular Expressions too
  - RSC (\d{4}), c ([A-Z]\-\d+) matches RSC 1985, c C-46
  - For every match, check legislative databases for matches
- …but this becomes a bother really quickly
  - Because citations vary greatly across jurisdictions
  - Because they even vary greatly within the same jurisdiction
    - RSC 1985, c C-46 (codified in 1985)
    - SRC 1970, c P-33 (codified in 1970)
    - SC 1997, c 36 (annual statute)
    - RSC 1985, c 32 (4th suppl.) (codified in the 1985-1988 period)
    - SC 1926-27, c 37 (rare)
    - SC 1992, c 46, Sch II (rare)
    - SC 2003, c 22, s. 6 (damn)
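To see why a single pattern becomes a bother, the slide's regular expression can be run against the slide's own list of citation forms. The sketch below uses Python's `re`; the names `RSC_RE`, `SAMPLES` and `recognised` are only illustrative.

```python
import re

# The pattern from the slide, unchanged apart from Python raw-string syntax.
RSC_RE = re.compile(r"RSC (\d{4}), c ([A-Z]\-\d+)")

# The citation forms listed on the slide.
SAMPLES = [
    "RSC 1985, c C-46",
    "SRC 1970, c P-33",
    "SC 1997, c 36",
    "RSC 1985, c 32 (4th suppl.)",
    "SC 1926-27, c 37",
    "SC 1992, c 46, Sch II",
    "SC 2003, c 22, s. 6",
]


def recognised(citations):
    """Return only the citations that the single pattern manages to match."""
    return [c for c in citations if RSC_RE.search(c)]
```

Only the first form is recognised; each of the others would need its own pattern or a considerably looser one.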
8
Recognition of Chapter Numbers (2)
- Complexity: lots of errors by judges in citing statutes and regulations
  - RSC 1985, c C-46
  - RS 1985, C-46
  - SC 1985, c46
  - RSC, 85, C,46
  - 1985 C46
- More complex regular expressions can deal with this… to some degree
9
Recognition of Chapter Numbers (3)
- Things start to break down when all these variations come together
  - # jurisdictions × # citation forms × # acceptable user errors = Massive headache
- Solution: invert the problem
  - Don't try to match every possible citation in the text
  - Instead, generate every possible acceptable variation
10
Recognition of Chapter Numbers (4)
- For every citation in our database, add rules
- For every "RSC" in citation, generate a variation with "RS" instead
  - RSC 1985, c C-46
  - RS 1985, c C-46
- For every "c" after the year, generate a variation with "chapter" or "ch"
  - RS 1985, c C-46 / RSC, 1985 c C-46
  - RS 1985, ch C-46 / RSC, 1985 ch C-46
  - RS 1985, chapter C-46 / RSC, 1985 chapter C-46
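The two rules on this slide can be sketched as simple substitutions applied to each known citation. This is a rough illustration under stated assumptions: the `RULES` table, the plain substring replacement and the function name `generate_variations` are invented for the example and do not describe how Reflex encodes its rules.

```python
# Each hypothetical rule pairs a token with the spellings allowed in its place.
RULES = [
    ("RSC", ["RSC", "RS"]),
    (", c ", [", c ", ", ch ", ", chapter "]),
]


def generate_variations(citation):
    """Apply every rule in turn, multiplying out the acceptable spellings."""
    variations = {citation}
    for token, alternatives in RULES:
        expanded = set()
        for variation in variations:
            if token in variation:
                for alternative in alternatives:
                    # Replace only the first occurrence of the token.
                    expanded.add(variation.replace(token, alternative, 1))
            else:
                expanded.add(variation)  # rule does not apply: keep as-is
        variations = expanded
    return sorted(variations)
```

For `"RSC 1985, c C-46"` the two rules multiply out to six variations, while an annual statute such as `"SC 1997, c 36"` is only touched by the second rule. The rules are written once and apply to whatever citations the database contains.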
11
Recognition of Chapter Numbers (5)
- Better than Regular Expressions because
  - We can limit what variations can combine together
    - RSC C-46 1985 c. C46 C-46
  - One variation rule can be written to cover all jurisdictions and every variation within each jurisdiction
  - No need to know there are rare forms such as SC 1992, c 46, Sch II
- The variations are fed into the Nondeterministic Finite Automaton
- [Figure: automaton over citation tokens: start, RSC, RS, 1985, c, ch, chapter, C, 46]
12
Phase 1: Recognition of Citation Elements
- A) Titles (NFA)
- B) Section Numbers (Regular Expressions, a form of Automata)
- C) Chapter Number / Formal Citations (NFA)
- Done using an implementation of Pike's VM
  - Creates a large virtual machine out of any DFA or Regular Expression
  - Created by Russ Cox (Bell Labs)
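For readers unfamiliar with the technique, here is a heavily simplified sketch of a Pike-style virtual machine. It is not the implementation the slide refers to: the instruction names (`char`, `split`, `jmp`, `match`), the tuple encoding and the sample program are assumptions for illustration, and submatch tracking is left out. The central idea is that all threads advance in lockstep over the input, so the text is read only once.

```python
def closure(program, pcs):
    """Follow jmp/split instructions until only char/match instructions remain."""
    stack, seen, result = list(pcs), set(), []
    while stack:
        pc = stack.pop()
        if pc in seen:
            continue
        seen.add(pc)
        op = program[pc]
        if op[0] == "jmp":
            stack.append(op[1])
        elif op[0] == "split":
            stack.extend([op[2], op[1]])
        else:
            result.append(pc)
    return result


def pike_match(program, text):
    """Return True when the whole text is accepted by the program."""
    threads = closure(program, [0])
    for ch in text:
        advanced = [pc + 1 for pc in threads
                    if program[pc][0] == "char" and program[pc][1] == ch]
        threads = closure(program, advanced)
        if not threads:
            return False  # every thread died: no match possible
    return any(program[pc][0] == "match" for pc in threads)


# Hand-compiled program for the regular expression a+b.
A_PLUS_B = [
    ("char", "a"),   # 0: consume an 'a'
    ("split", 0, 2), # 1: either loop back for another 'a' or go on
    ("char", "b"),   # 2: consume a 'b'
    ("match",),      # 3: accept
]
```

In a citator, the instructions would presumably operate on words or citation tokens rather than single characters, and thousands of titles and citation variations would be compiled into one program.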
13
Phase 2: Association Heuristics
- Our objective for these rules: be conservative and minimize the number of false positives
- A sample
  - If multiple overlapping citations are recognized, use the longest one
    - Criminal Code / Order Designating Saskatchewan for the Purposes of the Criminal Interest Rate Provisions of the Criminal Code
  - Learn shorthand aliases (the "Act"): Example
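The "use the longest one" rule can be sketched as a greedy selection over recognized spans. This is a minimal illustration; the `(start, end, label)` tuple layout with an exclusive end offset and the function name `keep_longest` are assumptions for the example.

```python
def keep_longest(matches):
    """Keep the longest match among overlapping ones.

    matches: list of (start, end, label) tuples, end exclusive.
    """
    kept = []
    # Consider the longest candidates first; ties go to the earliest start.
    for m in sorted(matches, key=lambda m: (-(m[1] - m[0]), m[0])):
        overlaps = any(m[0] < k[1] and k[0] < m[1] for k in kept)
        if not overlaps:
            kept.append(m)
    return sorted(kept)
```

With this rule, a "Criminal Code" match nested inside the long Order title is dropped, while a separate "Criminal Code" mention elsewhere in the text is kept.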
14
Phase 2: Association Heuristics (2)
- Section association
  - Some sections are strongly associated
    - Section 12 of the Criminal Code, RSC 1985, c C-46
  - Others are weakly associated
    1. If one section is strongly associated, then every other section with the same number has the same association (Example)
    2. If a section is followed by the words "of the" without a citation, then do not associate (Example)
    3. If a section number follows another citation close enough
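Rules 1 and 2 above might be sketched as a two-pass procedure over the section mentions of a document. Everything about the data layout is an assumption made for the example: each mention is a dict with a `section` number, an optional `statute` (present only for strong associations) and a hypothetical `dangling_of_the` flag for mentions followed by "of the" without a recognized citation. Rule 3 is left out because the slide does not spell it out.

```python
from typing import Dict, List, Optional


def associate_sections(mentions: List[dict]) -> List[Optional[str]]:
    """Return the statute associated with each mention, or None."""
    # Pass 1: remember which statute each strongly associated number points to.
    strong: Dict[str, str] = {}
    for m in mentions:
        if m.get("statute"):
            strong.setdefault(m["section"], m["statute"])

    # Pass 2: resolve the weak mentions conservatively.
    resolved: List[Optional[str]] = []
    for m in mentions:
        if m.get("statute"):
            resolved.append(m["statute"])             # already strong
        elif m.get("dangling_of_the"):
            resolved.append(None)                     # rule 2: do not associate
        else:
            resolved.append(strong.get(m["section"])) # rule 1: same number
    return resolved
```

Leaving a mention unassociated when in doubt follows the stated objective of minimizing false positives.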
15
Phase 2: Association Heuristics (3)
- Popular, alternative and previous legislation titles and citations added to our databases
  - PIPEDA – Personal Information Protection and Electronic Documents Act
  - Unemployment Insurance Act – Employment Insurance Act
  - Gazette numbers for certain regulation collections
- Resolve ambiguous citations using basic jurisdictional rules (Example)
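A small sketch of both ideas follows. The alias table mirrors the two examples on the slide. The jurisdictional rule shown (prefer the jurisdiction of the citing document, otherwise refuse to guess) is an assumption, since the slide does not state which rules are used, and the names `ALIASES`, `canonical_title` and `resolve_jurisdiction` are hypothetical.

```python
# Popular and previous titles mapped to the title stored in the database.
ALIASES = {
    "pipeda": "Personal Information Protection and Electronic Documents Act",
    "unemployment insurance act": "Employment Insurance Act",
}


def canonical_title(name):
    """Map a popular or previous title to its canonical title, if known."""
    cleaned = name.strip()
    return ALIASES.get(cleaned.lower(), cleaned)


def resolve_jurisdiction(candidates, document_jurisdiction):
    """Pick a jurisdiction for an ambiguous title, or None when unsure.

    candidates: jurisdiction codes in which a statute with this title exists.
    Assumed rule: prefer the jurisdiction of the citing document.
    """
    if len(candidates) == 1:
        return candidates[0]
    if document_jurisdiction in candidates:
        return document_jurisdiction
    return None  # stay conservative rather than risk a false positive
```

For a title enacted in several provinces, a decision from an Ontario court would resolve to the Ontario statute, and a federal decision would leave the citation unlinked.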
16
The End
17
Conclusion
- CanLII's rate of recognition for legislative citations massively improved
- Harder numbers forthcoming
18
Legislative Citations
- Simple example
  - Criminal Code, RSC 1985, c C-46, ss. 3.1-7
- Not so simple examples
  - Mangled chapter numbers
    - 1985 C-46
    - RSC C46
    - RS, (1985), chapter C 46
  - Section numbers
    - s., ss., sec., section, subsec, sub-sec, para, alinea, etc.
19
Legislative Citations (2)
- Ambiguous citations
  - Family Law Act (which jurisdiction?)
- Familiar names
  - PIPEDA - Personal Information Protection and Electronic Documents Act
  - Obamacare – PPACA – Patient Protection and Affordable Care Act
- Substitution Acronyms
  - […] pursuant to s. 31.42 of the Environment Quality Act ("EQA"), to […]
  - […] interpretation of s. 31.42 EQA, but […]
- The vagaries of human language
  - […] not apply to section 25 of the Criminal Code […] Section 37, however […]
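Substitution acronyms such as ("EQA") could be picked up with a pattern along the following lines. This is a rough sketch: the regular expression only handles titles made of capitalized words ending in "Act", and the names `ALIAS_RE` and `learn_aliases` are invented for the example.

```python
import re

# A run of capitalized words ending in "Act", followed by a quoted
# all-caps acronym in parentheses. Straight and curly quotes are accepted.
ALIAS_RE = re.compile(
    r'((?:[A-Z][\w-]*\s+)+Act)\s*\(\s*["\u201c]([A-Z]{2,})["\u201d]\s*\)'
)


def learn_aliases(text):
    """Return a dict mapping each declared acronym to its full title."""
    return {m.group(2): m.group(1) for m in ALIAS_RE.finditer(text)}
```

Once "EQA" has been learned from the first sentence, a later "s. 31.42 EQA" in the same document can be looked up in the returned dict and linked to the Environment Quality Act.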