1
Naïve Bayes Fact Extractor (NBFE) v.1
7th Meeting, Edinburgh, 05/03/03-06/03/03
2
FE as a Classification Task
BWI: classify the segment from i to j as an entity if F(i)A(j)H(j-i) > τ, where
–F and A are sets of detectors matching entity start and end boundaries,
–H(k) is a function reflecting the probability that a field has length k, and
–τ is an acceptance threshold.
NB: choose the label v_j that maximises P(v_j) ∏_i P(a_i | v_j), where
–the a_i are the active features.
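A minimal sketch of the naïve Bayes decision rule above, evaluated in log space; the labels, features, and probability tables are invented for illustration and are not the system's actual model.

```python
import math

# Toy model: class priors P(v) and per-class feature likelihoods P(a|v).
# All numbers below are made up for illustration only.
priors = {"processorName": 0.5, "processorSpeed": 0.5}
likelihoods = {
    "processorName":  {"PROCESSORFACT": 0.9, "SPEEDFACT": 0.1, "FIRSTPROCESSOR": 0.8},
    "processorSpeed": {"PROCESSORFACT": 0.2, "SPEEDFACT": 0.9, "FIRSTPROCESSOR": 0.1},
}

def classify(active_features):
    """Return argmax over labels v of log P(v) + sum_i log P(a_i | v)."""
    best_label, best_score = None, float("-inf")
    for label, prior in priors.items():
        score = math.log(prior)
        for feature in active_features:
            # Small floor for features unseen with this label (an assumption,
            # standing in for whatever smoothing the nb software uses).
            score += math.log(likelihoods[label].get(feature, 1e-6))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify(["PROCESSORFACT", "FIRSTPROCESSOR"]))  # -> processorName
```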
3
NBFE v.1 Specifics
The naïve Bayes software is a C binary
All shell scripts have been converted to Perl scripts!
Waiting on the demarcator to integrate this component into the system…
4
NBFE Pipeline
Predict fact labels from NERC output
–Create classifier feature file from XHTML
–Predict labels for examples in feature file
–Substitute predicted fact labels into XHTML output
Prepare training file for classifier
–Create classifier feature file
5
NBFE Modes
Training
–Transform handcrafted, fact-labelled XHTML training files into an nb-formatted training set
Labelling
–Transform NERC-labelled XHTML into an nb feature file
–Substitute fact labels into XHTML for output
Testing
–MUC7: align system output and gold standard
–xmlperl: add gold standard fact label into XML markup
6
Pipeline Diagram
[Diagram: in labelling mode, NERC-labelled XHTML passes through the Auxiliary File Pipeline to an nb feature file, through the nb classifier to a fact-labelled feature file, and through Fact Label Substitution to fact-labelled XHTML. In training file creation, handcrafted, fact-labelled XHTML passes through the Auxiliary File Pipeline to an nb-formatted training set. In evaluation, the MUC7 Scorer compares output against the handcrafted, fact-labelled training set to produce NBFE performance measures.]
7
Feature Extraction
The LT XML tool sggrep is used to extract entity elements from the NERC phase, e.g. "15.0inch"
Regular expressions are used to extract features from the resulting text, e.g. LENGTHFACT, INCHFACT, FIRSTLENGTHFACT, …
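A minimal sketch of the regex-based feature extraction step, assuming binary features that fire when a pattern matches the entity text; the patterns below and the FIRST* convention are guesses for illustration, not the project's actual rules.

```python
import re

# Illustrative patterns only; the real feature set and its regexes are not shown on the slide.
FEATURE_PATTERNS = {
    "INCHFACT":   re.compile(r"\binch(es)?\b", re.IGNORECASE),
    "LENGTHFACT": re.compile(r"\b\d+(\.\d+)?\s*(inch(es)?|cm|mm)\b", re.IGNORECASE),
    "MHZFACT":    re.compile(r"\b\d+(\.\d+)?\s*MHz\b", re.IGNORECASE),
}

def extract_features(entity_text, first_of_its_kind=False):
    """Return the active features for one entity extracted by sggrep."""
    active = [name for name, pattern in FEATURE_PATTERNS.items() if pattern.search(entity_text)]
    if first_of_its_kind:
        # Guess: FIRST* variants mark the first matching entity in the document.
        active += ["FIRST" + name for name in active]
    return active

print(extract_features("15.0 inch", first_of_its_kind=True))
# -> ['INCHFACT', 'LENGTHFACT', 'FIRSTINCHFACT', 'FIRSTLENGTHFACT']
```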
8
Example Feature File
Excerpt from input file to nb:
DUMMY, PROCESSORFACT, FIRSTPROCESSOR, PFACT
DUMMY, SPEEDFACT, FIRSTSPEED, PFACT
Output file has predicted fact labels in place of the PFACT label:
DUMMY, PROCESSORFACT, FIRSTPROCESSOR, processorName
DUMMY, SPEEDFACT, FIRSTSPEED, processorSpeed
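A small sketch of how the comma-separated format above could be read back and each example's features paired with its predicted fact label; the file names are placeholders, and any format detail beyond the lines shown on the slide is an assumption.

```python
def read_rows(path):
    """Parse one comma-separated example per line; the last field is the label slot."""
    with open(path) as f:
        return [[field.strip() for field in line.split(",")] for line in f if line.strip()]

# Placeholder file names: input rows end in the PFACT placeholder,
# output rows end in the label the nb classifier predicted.
input_rows = read_rows("test.features")
output_rows = read_rows("test.labelled")

for in_row, out_row in zip(input_rows, output_rows):
    features, predicted = in_row[:-1], out_row[-1]
    print(features, "->", predicted)
    # e.g. ['DUMMY', 'PROCESSORFACT', 'FIRSTPROCESSOR'] -> processorName
```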
9
Preliminary Evaluation
Evaluation on all languages uses an internal scoring mechanism of the nb software, which
–does not account for small differences in testing-set feature extraction, or for possible errors introduced in substituting the fact labels into the XHTML
–does evaluate the effectiveness of the naïve Bayes model
This produces accuracy scores defined as: accuracy = # correct / total predictions
10
Evaluation Results
Language   Accuracy   # Train   # Test
English    89.34      42        48
French     90.84      44        38
Greek      94.01      37        36
Italian    93.30      37        42
11
Evaluation Results
FACT                   PRECISION   RECALL   F(β=1)
manufacturerName       100         95       97.43
modelName              96          100      97.95
processorName          99          98       98.49
screenType             0           0        0
batteryType            12          71       20.53
batteryLife            0           0        0
warranty               64          100      78.04
price                  72          100      83.72
preinstalledSoftware   93          69       79.22
preinstalledOS         79          97       87.07
hdCapacity             95          96       95.49
ram                    67          96       78.92
12
Results (continued)
FACT               PRECISION   RECALL   F(β=1)
processorSpeed     87          97       91.72
modemSpeed         88          96       91.82
cdromSpeed         49          86       62.42
dvdSpeed           0           0        0
width              35          18       23.77
height             71          30       42.17
depth              0           0        0
screenSize         60          73       65.86
weight             93          100      96.37
screenResolution   80          97       87.68
AVERAGE            75          82       78.34
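For reference, a small sketch of how per-fact precision, recall, and F(β=1) can be computed from raw counts; the counts in the example are invented and only roughly echo the processorName row above.

```python
def precision_recall_f(true_positives, num_predicted, num_gold, beta=1.0):
    """Precision, recall, and F(beta) from raw counts for one fact type."""
    precision = true_positives / num_predicted if num_predicted else 0.0
    recall = true_positives / num_gold if num_gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f

# Invented counts: 95 correct processorName predictions out of 96 made,
# against 97 processorName facts in the gold standard.
print(precision_recall_f(95, 96, 97))  # -> roughly (0.99, 0.98, 0.98)
```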
13
Preliminary Analysis
1. screenType is never correctly classified: NO TRAINING EXAMPLES.
2. The batteryType label is assigned to far more examples than it should be: OVER-PRODUCTIVE FEATURE EXTRACTION.
3. Confusion amongst numeric fact types (e.g. cdromSpeed, dvdSpeed, modemSpeed, width, height): CURRENT FEATURES DO NOT CAPTURE ENOUGH INFORMATION TO DIFFERENTIATE THEM.
14
Analysis
No access to contextual features
–e.g. the order of l x w x h triples
Improvements?
–modify our feature extraction technique
–include complex relative size features (sketched below)
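One possible reading of the "complex relative size features" suggestion, sketched below: for a width x height x depth style triple, emit features that record each dimension's rank within the triple rather than just its unit pattern. This is purely illustrative and not a feature the system implements.

```python
import re

# Matches an "a x b x c" dimension triple, e.g. "12.1 x 9.8 x 1.4".
TRIPLE = re.compile(r"(\d+(?:\.\d+)?)\s*x\s*(\d+(?:\.\d+)?)\s*x\s*(\d+(?:\.\d+)?)", re.IGNORECASE)

def relative_size_features(text):
    """Emit one rank feature per dimension: rank 1 is the largest value in the triple."""
    match = TRIPLE.search(text)
    if not match:
        return []
    values = [float(v) for v in match.groups()]
    largest_first = sorted(range(3), key=lambda i: -values[i])
    return [f"DIM{idx + 1}_RANK{rank}OF3" for rank, idx in enumerate(largest_first, start=1)]

print(relative_size_features("12.1 x 9.8 x 1.4 inches"))
# -> ['DIM1_RANK1OF3', 'DIM2_RANK2OF3', 'DIM3_RANK3OF3']
```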
15
File Problems
HTML
–No automatic way to deal with ill-formed HTML
MERGE
–Current version cannot deal with entities belonging to multiple product/job descriptions
TIDY
–Sometimes makes incorrect fixes to crossed brackets
16
Bad Files II
Language   TRAIN   TEST
English    42      48
French     44      38
Greek      37      36
Italian    37      42
17
Conclusion
Disadvantages:
–no access to contextual features
–more supervision is required for new domains than is required for wrapper induction
Advantages:
–not repeating work already done in the NERC modules
–scores are very promising
18
To do…
Complete evaluation
–dependent on the merge tool
Integration
–dependent on the demarcator
–need to test on Windows