
1 Zhuode Liu, University of Texas at Austin. CS 381V: Visual Recognition, Experiment Presentation. Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition, Max Jaderberg, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman. NIPS Deep Learning Workshop, 2014.

2 Outline: Model architecture (and a drawback: the 'word classifier' model is too large). What the SVT dataset and the synthetic dataset look like. Failure cases and special cases of the 'word classifier' model. Analysis of the inferior 'char classifier' and 'n-gram classifier' models.

3 Note! The CNN is trained on completely synthetic data, with no real data at all. The three different models use the same CNN as the feature extractor. That CNN is trained using the 1st model's architecture, then fine-tuned when training the 2nd and 3rd models. The three models are the "word classifier", the "char classifier", and the "n-gram classifier".
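As a rough sketch of the weight-sharing described above. The layer names and shapes below are illustrative assumptions for this presentation, not the paper's exact sizes; only the word head's ~90k-way output is taken from the slides.

```python
# Minimal sketch (not the authors' code): three heads sharing one CNN trunk.
# Shapes are illustrative assumptions, except the ~90k-way word softmax.

def build_model(head):
    """Return a dict of layer-name -> (out, in) weight shapes."""
    trunk = {                      # shared feature extractor (assumed shapes)
        "conv1": (64, 5 * 5),
        "conv2": (128, 5 * 5 * 64),
        "fc1":   (4096, 4096),
    }
    heads = {
        "word":  {"softmax": (90_000, 4096)},   # one class per lexicon word
        "char":  {"softmax": (23 * 37, 4096)},  # assumed: positions x chars
        "ngram": {"softmax": (10_000, 4096)},   # assumed: one output per n-gram
    }
    return {**trunk, **heads[head]}

word = build_model("word")
char = build_model("char")
# All layers except the last are identical across models.
shared = {k for k in word if k in char and word[k] == char[k]}
```

The point of the sketch is only that every model differs solely in its final softmax layer, which is what makes fine-tuning the 2nd and 3rd models cheap.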

4 Model Architecture and Size. Green: word classifier. Blue: char classifier. Black: n-gram classifier. Every model shares weights except the last layer. (credit: Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition, Max Jaderberg et al., Deep Learning Workshop, 2014)

5 Model Architecture and Size. Green: word classifier. Blue: char classifier. Black: n-gram classifier. Sizes: 485MB; 1918MB, of which 1340MB is the last softmax layer; 667MB. Every model shares weights except the last layer. Although the word classifier performs the best, its last layer is too big. (credit: Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition, Max Jaderberg et al., Deep Learning Workshop, 2014)
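The 1340MB softmax figure is plausible from a back-of-the-envelope count. Assuming float32 weights, a 4096-d penultimate feature, and a ~90k-way word softmax (the exact dictionary size is not given on this slide, so the result is only of the right order):

```python
# Back-of-the-envelope size of the word classifier's last layer.
# Assumptions: float32 weights, 4096-d feature, ~90k output classes.
classes, feat_dim, bytes_per_weight = 90_000, 4096, 4  # float32

last_layer_bytes = classes * feat_dim * bytes_per_weight
last_layer_mb = last_layer_bytes / 2**20
print(f"{last_layer_mb:.0f} MB")  # ~1406 MB, same order as the quoted 1340 MB
```

This is why the word classifier dominates the 1918MB total: one fully connected softmax layer accounts for roughly 70% of the model.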

6 What does the SVT dataset look like?

7 (more SVT example images)

8 What does the synthetic training set look like? Generating process: (credit: Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition, Max Jaderberg et al., Deep Learning Workshop, 2014)

9 Successful Cases. Reminder: in the experiments, this model (the "word classifier") performs well: 93.1% on the IC03 dataset, 80.7% on the SVT dataset. The successful cases are those that look like images in the synthetic training set.

10 Successful Cases. Reminder: in the experiments, this model (the "word classifier") performs well: 93.1% on the IC03 dataset, 80.7% on the SVT dataset. Slightly harder examples: Tabu: 0.89 vs. Tdd: 0.045; Burbank: 0.99 vs. Bureaus: 0.0016.

11 Failure Cases: context information is needed.

12 Failure Cases: context information is needed. Context: (Predicted: AIR) (True: )

13 Failure Cases: context information is needed. (Predicted: AIR with score 0.1003) (True: BAR with score 0.0046) I don't think the prediction is correct, and the score is used to derive the formula below.
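One plausible reading of "the formula below" is a Bayesian rescoring that multiplies the image-only classifier score by a prior from the surrounding context, P(word | image, context) ∝ P(word | image) P(word | context). The context priors here are invented for illustration and are not from the paper; only the two classifier scores come from the slide.

```python
# Hedged sketch of context rescoring. The image scores are from the slide;
# the context priors are invented (assume the scene clearly shows a bar).
image_scores = {"AIR": 0.1003, "BAR": 0.0046}
context_prior = {"AIR": 0.01, "BAR": 0.90}   # assumption, for illustration

# P(word | image, context) up to a constant factor
rescored = {w: image_scores[w] * context_prior[w] for w in image_scores}
best = max(rescored, key=rescored.get)       # context flips AIR -> BAR
```

With a strong enough prior, the true word "BAR" overtakes "AIR" even though its image-only score is 20x lower, which is the sense in which these failures "need context".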

14 Failure Cases: context information is needed. (Predicted: Contort with score 0.5286) (True: Comfort with score 0.4690) I don't think the prediction is correct, and the score is used to derive the formula below.

15 Accuracy on the SVT Dataset. Trained lexicon: 90k. Top-1: 83.59%; Top-5: 89.47%; Top-10: 91.18%. Quick reminder: what do "90k" and "50" mean? Trained lexicon: 50. Top-1: 95.4%, much higher than the 83.59% above, which shows the power of "context". Other methods (copied from the paper): PhotoOCR: 90.4%; Almazan: 89.2%; Gordo: 90.7%.
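A minimal sketch of why the 50-word lexicon helps so much, assuming constrained decoding simply restricts the argmax to the image's small lexicon (the toy scores reuse the Contort/Comfort failure case from an earlier slide; the third score is made up):

```python
# Sketch: lexicon-constrained prediction. Scores for "contort"/"comfort"
# are from the slides; "airport" and the 50-word lexicon are assumptions.

def constrained_argmax(scores, lexicon):
    """Pick the highest-scoring word among only the lexicon entries."""
    allowed = {w: s for w, s in scores.items() if w in lexicon}
    return max(allowed, key=allowed.get)

scores = {"contort": 0.5286, "comfort": 0.4690, "airport": 0.0012}
full_vocab_pred = max(scores, key=scores.get)    # wrong over the 90k vocab
small_lex_pred = constrained_argmax(scores, {"comfort", "motel", "exit"})
```

Distractors like "contort" are simply absent from a 50-word lexicon, so the true word wins by default; this is the jump from 83.59% to 95.4%.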

16 Special Cases: slanted words. These three images look hard because the text is slanted. Top-5 predictions for the "city" image: '19' 'iq' 'nsi' '12' 'isl'. For the "lights" image: 'sights' 'insults' 'pews' 'resits' 'jesuits'. For the "bookstore" image: 'predispositions' 'prolongations' 'pseudonymous' 'buys' 'preconceptions'. None of them are correct! The same thing happens for these images:

17 Special Cases: slanted words. Test images: What do these words look like in the synthetic training set? ↑ The synthetic images don't have much variation in the orientation of the text.

18 Special Cases: vertical words. Can the model handle vertical words? No, because there are no vertical words in the training set, and the CNN input is a fixed 32×100: a vertical word gets distorted and spans only 32 pixels. Example (32×100): Top-5 prediction: 'e' 'b' 'is' '8' '19' (totally wrong).
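A toy calculation of the 32×100 problem for vertical words. The crop dimensions below are assumptions for illustration (a 6-letter vertical word, roughly 40px per character); only the 32×100 network input comes from the slides.

```python
# Why vertical text breaks a fixed 32x100 input.
# Assumed vertical crop: 40px wide, 240px tall (~40px per character).
crop_w, crop_h = 40, 240
net_w, net_h = 100, 32          # the CNN's fixed input (width x height)

scale_h = net_h / crop_h        # vertical squash factor when resizing
char_px = 40 * scale_h          # character height after resizing
print(f"each character shrinks to ~{char_px:.1f} px")
```

Each character ends up around 5px tall after resizing, far too small to recognize, which matches the garbage top-5 prediction on the slide.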

19 "Character classifier" on the SVT dataset. Prediction examples (accuracy: 72.91%): Pred: mountann, True: mountain. Pred: apartments, True: amartments. Pred: tie, True: the. (top image credit: Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition, Max Jaderberg et al., Deep Learning Workshop, 2014)

20 Accuracy of the "character classifier" on the SVT dataset (find the closest word in the 90k lexicon)

21 Accuracy of the "character classifier" on the SVT dataset (find the closest word in the 90k lexicon). For the "without" model: average edit distance to the true word: 0.608. Average edit distance for the wrongly predicted words only: 2.24 (quite high). On the IC13 dataset: 1.9 (also high). Examples: Pred: motomborts (×:2), True: motorsports. Pred: araaery (×:3), True: brewery. Pred: anngll (×:3), True: angelo.

22 Accuracy of the "character classifier" on the SVT dataset (find the closest word in the 90k lexicon). For the "without" model: average edit distance to the true word: 0.608. Average edit distance for the wrongly predicted words only: 2.24 (quite high). On the IC13 dataset: 1.9 (also high). Conclusion: it doesn't perform well compared to the "word classifier" model (83.59%), and computing the edit-distance correction over the 90k lexicon is very slow (9.5s per word).
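A minimal sketch of that correction step, assuming plain Levenshtein distance and a linear scan over the lexicon. The toy lexicon is a tiny stand-in for the 90k one; the O(len²) DP per word times 90k words is exactly why correction takes seconds per word.

```python
# Sketch: snap a character-level prediction to the closest lexicon word.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def correct(pred, lexicon):
    """Linear scan: return the lexicon word closest to the prediction."""
    return min(lexicon, key=lambda w: edit_distance(pred, w))

lexicon = ["mountain", "motorsports", "brewery", "angelo"]  # toy stand-in
fixed = correct("mountann", lexicon)   # -> "mountain"
```

Predictions with small errors ("mountann") snap to the right word, but badly garbled ones ("araaery") land 2 to 3 edits away and often snap to the wrong entry, which matches the high average distances on the slide.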

23 Accuracy of the "N-gram classifier" on the SVT dataset. Input: an image. Output: an N-gram encoding. To make a prediction: find the lexicon word that has the closest N-gram encoding. Example encodings: e, h, l, o, t, el, ho, ot, te, hel, hol, hot, lof, lte, oel, oet, ohe, olt, oot, ote, otl, ott, tel, tet, tte, tely, otte, oote, oter, oted, hoot, rott, toot, lott, mote, otle, ottl. f, m, o, r, u, fm, fo, fu, or, rm, ru, um, uo, for, fou, fum, fur, ium, miu, orm, oru, otu, our, rmu, rum, uor, uru, form, ourn, rium, rums. (top image credit: Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition, Max Jaderberg et al., Deep Learning Workshop, 2014)
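A hedged sketch of that matching step. The assumption here is that a word is encoded as the set of its contiguous 1- to 4-grams and compared by Jaccard overlap; the paper's exact encoding and distance are not reproduced, and the real classifier outputs noisy n-gram sets like those listed above rather than a clean one.

```python
# Sketch: n-gram set encoding and nearest-lexicon-word lookup.

def ngram_encoding(word, max_n=4):
    """Set of all contiguous 1..max_n character n-grams in the word."""
    return {word[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(word) - n + 1)}

def closest_word(encoding, lexicon):
    """Lexicon word whose n-gram set overlaps the detected set most."""
    def jaccard(w):
        s = ngram_encoding(w)
        return len(s & encoding) / len(s | encoding)
    return max(lexicon, key=jaccard)

detected = ngram_encoding("forum")          # clean stand-in for CNN output
match = closest_word(detected, ["forum", "form", "hotel"])
```

Scoring every one of 90k lexicon words against the detected set is another linear scan, which is why this lookup also takes seconds per word.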

24 Accuracy of the "N-gram classifier" on the SVT dataset. Input: an image. Output: an N-gram encoding. To make a prediction: find the lexicon word that has the closest N-gram encoding. Accuracy with the 90k lexicon: 60.22%. Conclusion: worse than the "word classifier" (83.59%) and the "char classifier" (72.91%), and finding the closest N-gram encoding over the 90k lexicon is very slow (3.5s per word). (top image credit: Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition, Max Jaderberg et al., Deep Learning Workshop, 2014)

25 Summary

26 Summary. The N-gram classifier is not useful with the more general 90k lexicon, and the paper only reports it on the smaller 50-word lexicons. Neither is the char classifier, though it may be useful for recognizing arbitrary character sequences. The word classifier is both fast and accurate, but it's big.

