Boosting Textual Source Attribution
Foaad Khosmood
Department of Computer Science, University of California, Santa Cruz
Winter 2006
HEY??? What's So Funny?
► What makes something funny?
► Can we tell just by reading? Can a computer?
► Shakespeare's Comedies and Tragedies. (Strictly speaking: Comedies, Tragedies, Historical Plays, and Sonnets.)
High-Level Source Attribution Process
Experimenting with Boosting
► Most prior work addresses binary classification.
► Needs lots of "weak" learners.
► Some variants work well with a limited data set.
► Provides insight into the relative importance of features.
Data Set (Training)
► Comedies: Measure for Measure; Much Ado About Nothing; The Merchant of Venice; A Midsummer Night's Dream; The Taming of the Shrew; Twelfth Night
► Tragedies: Antony and Cleopatra; Titus Andronicus; Hamlet; Julius Caesar; Romeo and Juliet
Data Set (Test)
► All's Well That Ends Well [c]
► The Comedy of Errors [c]
► As You Like It [c]
► The Tempest [c]
► The Merry Wives of Windsor [c]
► King Lear [t]
► Macbeth [t]
► Coriolanus [t]
► Othello [t]
Feature Selection
► Features: words
► Selection method: picked the 2,500 most common words in the training set
► Preprocessing: removed 300 common English words and grammar operators; removed HTML markup and stage directions
► 429 of the 2,500 words were not common to all plays; those 429 were chosen for the weak-learner functions (in this particular run). A sketch of this step follows.
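A minimal sketch of the selection step in Python, assuming HTML play files and a caller-supplied stopword list standing in for the 300 removed common words (the exact stopword list and the HTML/stage-direction stripping rules are not given in the slides):

    import re
    from collections import Counter

    def tokenize(html_text, stopwords):
        """Strip HTML tags and bracketed stage directions, lowercase, split into words."""
        text = re.sub(r"<[^>]+>", " ", html_text)   # crude HTML removal
        text = re.sub(r"\[[^\]]*\]", " ", text)     # crude stage-direction removal
        words = re.findall(r"[a-z']+", text.lower())
        return [w for w in words if w not in stopwords]

    def select_features(plays, stopwords, top_n=2500):
        """plays: dict of play name -> raw text. Returns the weak-learner words."""
        tokens = {name: tokenize(text, stopwords) for name, text in plays.items()}
        overall = Counter()
        for toks in tokens.values():
            overall.update(toks)
        top_words = [w for w, _ in overall.most_common(top_n)]
        # Keep only the top words that do NOT occur in every play
        # (429 of them in the run reported here).
        common_to_all = set.intersection(*(set(t) for t in tokens.values()))
        return [w for w in top_words if w not in common_to_all]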
Comedy Words and Tragedy Words
► 429 words in total: 225 comedy words, 204 tragedy words.
► Data: each play is a vector of 2,500 word counts, X = [X1, X2, …, X2500].
► Weak learners F1(X) … F429(X), each returning 1 for a positive hit. One possible realization is sketched below.
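The slide only says each Fi returns 1 on a positive hit; mapping comedy-word hits to +1, tragedy-word hits to -1, and absence to 0 is an assumption made here so that a hit carries a class vote:

    def make_weak_learner(word, is_comedy_word):
        """Weak learner F_i: fires when its word is present in the play.

        Returns +1 for a comedy-word hit, -1 for a tragedy-word hit, 0 when
        the word is absent. The +1/-1 mapping is an assumption; the slide
        only specifies a return of 1 on a positive hit.
        """
        sign = 1 if is_comedy_word else -1
        def f(word_counts):
            return sign if word_counts.get(word, 0) > 0 else 0
        return f

    # e.g.: learners = [make_weak_learner(w, w in comedy_words) for w in chosen_words]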
One Boosting Iteration over the Training Plays

Play      | Input S[n]   | y[n] | d[t] | h1[t] | u[n] | e1[t] | d[t+1]
Measure   | [2500 words] | +1   | 1/8  | 1     | 1    | 3/8   | 1/8 × 8/10
Much Ado  | [2500 words] | +1   | 1/8  |       |      | 3/8   | 1/8 × 8/6
Merchant  | [2500 words] | +1   | 1/8  |       |      | 3/8   | 1/8 × 8/6
Midsummer | [2500 words] | +1   | 1/8  |       |      | 3/8   | 1/8 × 8/6
Romeo     | [2500 words] | -1   | 1/8  | 1     |      | 3/8   | 1/8 × 8/10
Antony    | [2500 words] | -1   | 1/8  | 1     |      | 3/8   | 1/8 × 8/10
Titus     | [2500 words] | -1   | 1/8  | 1     |      | 3/8   | 1/8 × 8/10
Hamlet    | [2500 words] | -1   | 1/8  | 1     |      | 3/8   | 1/8 × 8/10
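The 8/10 and 8/6 factors in the last column are exactly what a standard AdaBoost-style multiplicative update produces at error 3/8. Reading the table that way is an assumption (the next slide describes the method as an LPBoost/TotalBoost mix), but the arithmetic is consistent:

    \epsilon_1 = \tfrac{3}{8}, \qquad d[t] = \tfrac{1}{8} \ \text{(uniform over 8 plays)}
    \alpha_1 = \tfrac{1}{2}\ln\tfrac{1-\epsilon_1}{\epsilon_1} = \tfrac{1}{2}\ln\tfrac{5}{3},
    \qquad Z_1 = 2\sqrt{\epsilon_1(1-\epsilon_1)} = \tfrac{\sqrt{15}}{4}
    \text{correct:}\quad d[t+1] = \tfrac{1}{8}\cdot\tfrac{e^{-\alpha_1}}{Z_1} = \tfrac{1}{8}\cdot\tfrac{8}{10} = \tfrac{1}{10}
    \text{misclassified:}\quad d[t+1] = \tfrac{1}{8}\cdot\tfrac{e^{+\alpha_1}}{Z_1} = \tfrac{1}{8}\cdot\tfrac{8}{6} = \tfrac{1}{6}
    \text{check:}\quad 5\cdot\tfrac{1}{10} + 3\cdot\tfrac{1}{6} = 1

Under this reading, the three misclassified plays gain weight (1/8 to 1/6), the five correctly classified plays lose weight (1/8 to 1/10), and the weights still sum to 1.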
Boosting
► A mix of LPBoost and TotalBoost.
► No termination condition (the pool of weak learners is finite).
► No gamma (edge) function; eta (the error) was used instead.
► No zero-sum constraint on the normalization of the weight updates. A sketch of the loop follows.
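The slides do not pin down the exact update rule, so the loop below is a sketch under stated assumptions: the finite pool of weak learners is visited in a fixed cycle (no termination test), eta is the weighted error of the current learner, and the reweighting is a plain exponential step followed by simple renormalization (no zero-sum constraint). All names are illustrative:

    import math

    def boost(samples, labels, learners, rounds):
        """samples: word-count dicts per play; labels: +1 (comedy) / -1 (tragedy).

        Returns one accumulated weight W per weak learner.
        """
        n = len(samples)
        d = [1.0 / n] * n                        # uniform initial play weights
        weights = [0.0] * len(learners)
        for t in range(rounds):
            i = t % len(learners)                # no termination: cycle the pool
            preds = [learners[i](x) for x in samples]
            eta = sum(dj for dj, p, y in zip(d, preds, labels) if p != y)
            eta = min(max(eta, 1e-9), 1 - 1e-9)  # keep the logarithm finite
            alpha = 0.5 * math.log((1 - eta) / eta)
            weights[i] += alpha
            d = [dj * math.exp(-alpha * y * p) for dj, p, y in zip(d, preds, labels)]
            total = sum(d)
            d = [dj / total for dj in d]         # renormalize; no zero-sum constraint
        return weights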
Classification
► Used the accumulated weights at the very end.
► Every presence of a learner's word in the test corpus adds 1 × W to totalW; some W's are negative.
► At the end it was a simple matter of observing whether the result was positive or negative, and by how much.
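That scoring rule, as a short sketch reusing the hypothetical learners and boosted weights from above (the sign gives the class, the magnitude the margin):

    def classify(test_counts, learners, weights):
        """Accumulate 1 * W for every weak learner that fires on the test text."""
        total_w = sum(w * f(test_counts) for f, w in zip(learners, weights))
        return total_w, ("comedy" if total_w > 0 else "tragedy")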
Program Output
[root@localhost output]# ./classify.sh
00_allswell.html-ratio.txt: 14.6807
01_comedyErrors.html-ratio.txt: 13.2634
02_measure.html-ratio.txt: 34.2748
03_muchAdo.html-ratio.txt: -6.43018
04_asyoulikeit.html-ratio.txt: 18.8413
05_cleopatra.html-ratio.txt: 14.1148
06_lear.html-ratio.txt: 32.2858
07_macbeth.html-ratio.txt: -21.095
08_coriolanus.html-ratio.txt: 43.5599
09_titus.html-ratio.txt: -3.31167
10_cleopatraFull.html-ratio.txt: -300.179
11_learFull.html-ratio.txt: 356.504
13_tempestFull.html-ratio.txt: 454.171
14_marryWivesFull.html-ratio.txt: 147.738
15_measure2.html-ratio.txt: 39.0357
16_measureFull.html-ratio.txt: 112.527
17_muchAdoFull.html-ratio.txt: 256.078
18_veronaFull.html-ratio.txt: -222.444
19_othelloFull.html-ratio.txt: -433.769
20_titusFull.html-ratio.txt: -564.977
Results
► All's Well That Ends Well [c] [1]
► The Comedy of Errors [c] [1]
► As You Like It [c] [1]
► The Tempest [c] [1]
► The Merry Wives of Windsor [c] [1]
► King Lear [t] [0]
► Macbeth [t] [1]
► Coriolanus [t] [0]
► Othello [t] [1]
2/9 mistakes: 7/9 correct, about 77% (also 66% and 69% in other runs). Previous run on a neural net (different setup: 5/13, 61%). And with no proportionals!
Challenges
► Natural language has a lot of nuances that can make a difference (preprocessing methods, "common word" sets, adaptations).
► Boosting has great potential in this area.
► Words provide an easy method for generating (many) weak learners.