1
Entity Extraction by Deep Learning
From Research to Production
Hila Zarosim and Noam Rotem
2
Agenda
A tale of taking deep learning to production:
The requirements
The debates and decisions
The challenges
The selected technology
The implementation
The results
The next steps
Session focus: productionization!
The audience today is mixed in experience and interest. We'll try to talk to everybody, and not to lose anybody…
3
The Mission
Extracting people names from news documents in French.
Replacing a legacy rule-based solution (reasonable quality, old methods).
Should be language agnostic (one solution per language, but one code base).
Runtime is in Scala.
RAM consumption must be low.
Tagging latency must be low.
Extraction quality should be good enough.
Readiness for production: within 4 months from research start time.
The meaning of all that: decisions should be taken fast; not much time for experimenting; we go with "good enough", and quality improvement cycles will come later.
Language agnostic – specific
4
The Mission – Quick Context – TRIT
5
The Mission – Quick Context – Eikon
6
The Mission – Quick Context – World-Check One
7
Input Example - Text
8
Output Example – Refinitiv RDF
9
Debate: Candidate Elimination or Sequence Labeling?
Candidate Elimination – first, make a list of candidate names: by lexicon, or by appearance (e.g. capitalized). Then eliminate by algorithm.
Sequence Labeling – run a whole text segment through an algorithm that classifies each word / token. The classification could be: is it part of a name? Or: what part of a name is it?
10
Debate: Candidate Elimination or Sequence Labeling?
In favor of Candidate Elimination:
It works well for us with company name extraction
It is usually easier than doing sequence labeling
It is faster than sequence labeling (tagging only candidates, not every word)
It is less likely to embarrass us with awfully wrong tagging
But we decided to use Sequence Labeling:
Highly recommended by academia for named entity recognition
Not restricted to name lists or name detection rules
11
Debate: Which Machine Learning Algorithm to Use?
Classic machine learning? We specify features: hints in the text that may lead to the correct labeling. The algorithm finds the optimal way to use the features for the decision.
Deep Learning? We build a neural network that, like the brain, can hold the correct wiring for finding the right sequence labeling. We give the network a lot of examples of texts and their correct labeling. The network learns by itself what to look for, and how to find the correct labeling.
12
Debate: Which Machine Learning Algorithm to Use?
In favor of classic machine learning:
It is our comfort zone
It needs low volumes of training data
Training is quick
Runtime is very quick
We control the features, so we can tune and debug and improve…
But we decided to use Deep Learning:
State of the art
Corresponds well with sequence labeling
Easier to be language agnostic
Cool ;)
13
Debates: Which Neural Network Technology?
We looked into many Deep Learning technologies:
PyTorch by Facebook – convenient, mature, supported by a large community. No runtime in Java/Scala.
DL4J by Eclipse – built for Java/Scala. Still quite difficult to use. Unclear maturity. Supported by a small community.
DyNet by Carnegie Mellon University – written in C++, with Python bindings. Looks promising, but not mature yet, and supported by only a small community.
MXNet by Apache – C++ with a Python wrapper. Also looks promising, but very new.
Tensorflow by Google – has wrappers for Java/Scala and Python. Supported by a large community, and mature. When used in Python, it may be combined with Keras for high-level, convenient programming.
14
Debate: Which Neural Network Technology?
We decided to go with Tensorflow, because of:
Its maturity
Its large community
Its wrapper for Java/Scala
Keras – a high-level API on top of Tensorflow that makes it very easy to program
15
Tensorflow – ID
Open-source software
Released by Google in 2015
Mostly for Deep Learning
Written in C++
Wrappers in Python, Java, JavaScript(!)
Runs on GPU or CPU
Caveat: the training API is not accessible from Java, only from Python and C++.
16
So Python AND Scala?? Yes. We have to. Tensorflow's training can be done only in Python (or C++, perish the thought…). Our runtime is in Scala.
Why not take Python to production (my personal opinion)?
I did that in the past. I took Tensorflow in Python to production (an image-processing deep learning solution). There were memory management issues, multithreading problems, latency troubles.
Python is not type safe. It is a duck-typed language. Many errors that could have been prevented at compile time will fail a customer at runtime.
Python is wonderful for research and training. I wouldn't take Python 3.7 to production again. That may change in the future.
17
Project specifics: Neural Network Input
Rolling windows of 1000 tokens (1000 is arbitrary; we will reduce it to ~100):
Wide enough for context
Low enough to keep model training time reasonable
Good enough to cover typical documents without sliding
Shorter documents are padded. Rolling is with overlapping.
Word embedding:
Tokens are converted into vectors that capture the meaning of the word, in many aspects (= dimensions).
We briefly tried Google's word2vec and Facebook's fastText off the shelf. fastText gave us better results.
Vectors of 300 dimensions.
Unknown words (including names) are mapped to a constant vector.
We could use other techniques as well.
So: the model input is 1000 vectors of 300 numbers.
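As an illustration, here is a minimal sketch of how such input could be assembled: overlapping 1000-token windows, each token mapped to its 300-dimensional vector, with padding and a constant vector for unknown words. The helper names, the overlap size and the dict-like `vectors` lookup are assumptions for the example, not the project's actual code.

```python
import numpy as np

WINDOW = 1000   # tokens per window (arbitrary, per the slide)
OVERLAP = 100   # assumed overlap between consecutive windows
DIM = 300       # fastText vector size
UNKNOWN = np.zeros(DIM, dtype=np.float32)   # constant vector for unseen words / names

def embed_tokens(tokens, vectors):
    """Map tokens to their 300-dim vectors; vectors is any dict-like lookup
    (e.g. loaded from a fastText model). Unknown tokens get the constant vector."""
    return np.stack([vectors[t] if t in vectors else UNKNOWN for t in tokens])

def rolling_windows(tokens):
    """Split a tokenized document into overlapping WINDOW-sized chunks, padding the last one."""
    step = WINDOW - OVERLAP
    windows = []
    for start in range(0, max(len(tokens), 1), step):
        chunk = tokens[start:start + WINDOW]
        chunk = chunk + ["<PAD>"] * (WINDOW - len(chunk))   # pad short windows
        windows.append(chunk)
        if start + WINDOW >= len(tokens):
            break
    return windows
```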
18
Project specifics: Neural Network Output
BILOU Markup:
B – Beginning
I – Inside (Internal)
L – Last
O – Outside (Other)
U – Unit
(plus a markup class for padded tokens)
So the model output is 1000 times 6 probabilities that sum to 1.0.
19
BILOU Markup Example
President(O) Donald(B) Trump(L) tweeted(O) today(O) .(O)
John(B) Trump(L) tweeted(O) today(O) .(O)
President(O) Trump(U) tweeted(O) today(O) .(O)
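For illustration, a small sketch of how name spans could be turned into per-token BILOU tags; the function name and the span representation (token-index pairs, end exclusive) are assumptions for the example.

```python
def bilou_tags(tokens, name_spans):
    """Convert name spans (given as (start, end) token indices, end exclusive)
    into one BILOU tag per token. Tokens outside any name get 'O'."""
    tags = ["O"] * len(tokens)
    for start, end in name_spans:
        if end - start == 1:
            tags[start] = "U"                       # single-token name
        else:
            tags[start] = "B"                       # first token of the name
            for i in range(start + 1, end - 1):
                tags[i] = "I"                       # inside tokens
            tags[end - 1] = "L"                     # last token of the name
    return tags

# Example from the slide: "President Donald Trump tweeted today ."
print(bilou_tags(["President", "Donald", "Trump", "tweeted", "today", "."], [(1, 3)]))
# -> ['O', 'B', 'L', 'O', 'O', 'O']
```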
20
The Neural Network - Intuition
Intuitively, we would like to go over all tokens in the text in their given order
We would like to learn the connections between the words
We would like to learn the context in which a name appears
We would like to learn morphological characteristics of person names
21
RNNs
RNNs allow working with arbitrarily sized sequences
A good choice when the input is text
They perform the same task for every element of the sequence, with the output depending on the previous computations
In the case of text, every element of the sequence is a token
We can think of them as having a "memory" which captures information about what has been calculated so far
They can remember what we have seen in the sentence so far
22
RNNs and bi-directional RNNs
23
The Basic Neural Network - Structure
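The slide shows the network structure as a diagram. As a rough approximation, a Keras model of this general shape (a bi-directional LSTM over the pre-embedded token windows, with a per-token softmax over the six tags) could look like the sketch below; the hidden size of 100 is an assumption, not necessarily the network actually used.

```python
from keras.models import Sequential
from keras.layers import Bidirectional, LSTM, TimeDistributed, Dense

WINDOW, DIM, N_TAGS = 1000, 300, 6   # tokens per window, embedding size, BILOU tags + padding tag

model = Sequential()
# A bi-directional LSTM reads the pre-embedded token vectors left-to-right and right-to-left.
model.add(Bidirectional(LSTM(100, return_sequences=True), input_shape=(WINDOW, DIM)))
# One softmax over the 6 tags per token.
model.add(TimeDistributed(Dense(N_TAGS, activation="softmax")))
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```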
24
Adding Character Embeddings to the Network
Extend the network to include morphological information
Many times, a prefix or a suffix of a word contains information about its meaning
The idea is to add another layer that sees the characters
Since a word is a sequence of characters, an RNN is suitable for this layer
Also helpful with unseen words and misspellings
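Before the characters can be fed to such a layer, each token has to be turned into a fixed-length sequence of character ids. A minimal sketch, assuming a fixed character inventory and a maximum word length of 20 (both arbitrary choices for the example):

```python
import numpy as np

MAX_WORD_LEN = 20   # assumption: truncate/pad each token to 20 characters
CHARS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ-'éèêàçâîôûëïü"
CHAR_INDEX = {c: i + 1 for i, c in enumerate(CHARS)}   # 0 is reserved for padding

def encode_chars(tokens, max_len=MAX_WORD_LEN):
    """Turn each token into a fixed-length sequence of character ids (unknown chars -> 0)."""
    out = np.zeros((len(tokens), max_len), dtype="int32")
    for i, tok in enumerate(tokens):
        for j, ch in enumerate(tok[:max_len]):
            out[i, j] = CHAR_INDEX.get(ch, 0)
    return out
```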
25
Adding Character Embeddings to the Network
26
Adding CRF (Conditional Random Field)
A CRF can take context into account
It uses contextual information from previous labels, thus increasing the amount of information the model has to make a good prediction
State-of-the-art NER systems combine CRFs with biLSTMs
Keras itself does not implement a CRF
An implementation is given by the keras-contrib package (Python)
How to take it to production?
It took some time to understand how to use it properly
Surprisingly, results did not improve
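A hedged sketch of how the keras-contrib CRF layer is typically wired in as the output layer of a biLSTM tagger; exact API details vary between keras-contrib versions, the layer sizes are assumptions, and, as noted above, it did not improve the results here.

```python
from keras.models import Model
from keras.layers import Input, Bidirectional, LSTM
from keras_contrib.layers import CRF   # CRF layer from the keras-contrib package

WINDOW, DIM, N_TAGS = 1000, 300, 6

tokens_in = Input(shape=(WINDOW, DIM))
x = Bidirectional(LSTM(100, return_sequences=True))(tokens_in)
crf = CRF(N_TAGS)                      # the CRF replaces the per-token softmax layer
out = crf(x)

model = Model(tokens_in, out)
# keras-contrib exposes the matching loss and accuracy on the layer object.
model.compile(optimizer="adam", loss=crf.loss_function, metrics=[crf.accuracy])
```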
27
The Neural Network – Loss Function
Roughly speaking, the loss measures how much we lose by outputting the network's output instead of the correct label
The goal of the network is to minimize the loss
Should all mistakes be weighted the same?
The "O" tag is very frequent, the other tags are rare
Always outputting "O" gives a classifier with 97% accuracy
Outputting "O" instead of B, I, L or U tags should be more expensive
We give different weights to each tag, based on its relative frequency
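One common way to express this in Keras is per-token sample weights with sample_weight_mode="temporal"; the weighting formula below (inverse tag frequency) is an illustrative assumption, not necessarily the exact scheme used.

```python
import numpy as np

def tag_weights(y, n_tags=6):
    """y: one-hot labels of shape (docs, window, n_tags).
    Returns per-token weights inversely proportional to each tag's frequency."""
    counts = y.reshape(-1, n_tags).sum(axis=0)      # how often each tag occurs
    inv = counts.sum() / np.maximum(counts, 1.0)    # rare tags get large weights
    inv = inv / inv.mean()                          # scale so the average weight is ~1 (free choice)
    return (y * inv).sum(axis=-1)                   # shape (docs, window): weight of each token's true tag

# Compile with per-timestep weights, then pass them to fit():
# model.compile(optimizer="adam", loss="categorical_crossentropy", sample_weight_mode="temporal")
# model.fit(x_train, y_train, sample_weight=tag_weights(y_train))
```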
28
Technicalities with Keras
Generators
The simple way to use Keras on text: the first layer is an embedding layer, and tokens are converted into their vectors by this layer
The embeddings are then part of the model parameters
This results in models of 1.5 – 3 GB
We removed the embedding layer – now we have to generate the input vectors ourselves
The input is now huge (each token is a 300-dim vector) and we cannot hold it all in memory
We have to use generators (see the sketch on the next slide)
29
Technicalities with Keras
Generators
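The original slide shows the generator code itself; as a stand-in, here is a minimal sketch of such a generator, which embeds tokens batch by batch so the full 300-dimensional input never sits in memory. It reuses the hypothetical embed_tokens helper from the input sketch above; the batch size and shuffling are assumptions.

```python
import numpy as np

def batch_generator(token_windows, label_windows, vectors, batch_size=8):
    """Yield (x, y) batches forever; token_windows is a list of 1000-token windows,
    label_windows the matching one-hot BILOU arrays, vectors a dict-like embedding lookup."""
    while True:                                    # Keras generators loop indefinitely
        order = np.random.permutation(len(token_windows))
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]
            x = np.stack([embed_tokens(token_windows[i], vectors) for i in idx])  # embed on the fly
            y = np.stack([label_windows[i] for i in idx])
            yield x, y

# Hypothetical usage with the model sketched earlier:
# model.fit_generator(batch_generator(train_windows, train_labels, vectors),
#                     steps_per_epoch=len(train_windows) // 8, epochs=5)
```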
30
Technicalities with Keras
Adding Character Embeddings
We should use the Keras functional API instead of sequential models
This is the way to implement complex networks
In our case, we have multiple inputs: char-level and token-level
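A hedged sketch of what such a two-input model could look like with the functional API: pre-embedded token vectors on one branch, character ids passed through a small char-level biLSTM on the other, concatenated per token before the tagging layers. All layer sizes and the character parameters are assumptions, not the project's actual configuration.

```python
from keras.models import Model
from keras.layers import (Input, Embedding, LSTM, Bidirectional,
                          TimeDistributed, Dense, concatenate)

WINDOW, DIM, N_TAGS = 1000, 300, 6
MAX_WORD_LEN, N_CHARS, CHAR_DIM = 20, 80, 25          # assumed character parameters

# Token-level input: pre-embedded 300-dim fastText vectors (no Embedding layer in the model).
tokens_in = Input(shape=(WINDOW, DIM), name="token_vectors")

# Character-level input: character ids per token, embedded and summarized by a small biLSTM.
chars_in = Input(shape=(WINDOW, MAX_WORD_LEN), dtype="int32", name="char_ids")
char_emb = TimeDistributed(Embedding(N_CHARS, CHAR_DIM))(chars_in)
char_repr = TimeDistributed(Bidirectional(LSTM(25)))(char_emb)

# Concatenate both representations per token and tag the sequence.
x = concatenate([tokens_in, char_repr])
x = Bidirectional(LSTM(100, return_sequences=True))(x)
out = TimeDistributed(Dense(N_TAGS, activation="softmax"))(x)

model = Model(inputs=[tokens_in, chars_in], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```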
31
Training Challenges: We Need 100k Documents
Why 100k? It's an arbitrary number… Deep learning is learning by example…
280k documents – no major improvement.
50k documents – low results.
Magic number?
How to get 100k news documents? Hey, we are Refinitiv! And we happen to know people in Thomson Reuters! ;)
32
Training Challenges: We Need the Documents Labeled
Labeling: tokenizing, then marking each token with B, I, L, O or U. Our legacy component has tagged people names in French documents for us. Labeling quality is only ‘ok’. But ‘good enough’.
33
What if We did not have a Previous Tagger?
Manual labeling of 100k documents??
Manually tagging only the contradictions between two 3rd-party NLP tools
Using the phonebook of Paris for tagging ("weak supervision")
34
Training Challenges: Training in Python, Runtime in Scala
Why is this a challenge? Tensorflow runs from Scala!
Yes, but: text pre-processing! Tokenization, sentence splitting, …
Solution: pre-process in Scala, train/test in Python, use in Scala.
35
Results? After training, the model is tested against various test sets, validation sets and gold sets. We test success per class (B, I, L, O and U), and per full name. Quality: good enough! Let's go to production!
36
Productionization Challenges: Tensorflow – Model Compatibility
Train Tensorflow in Python, run the model in Scala. Is it that simple?
Keras generates its own model formats (h5, pb).
Model input serialization and output deserialization are challenging.
It's all a SMOP (a simple matter of programming). But it takes time and effort…
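For context, one way (under the TF 1.x / Python 3.7 setup described here) to bridge the gap is to re-export the Keras .h5 model as a TensorFlow SavedModel, which the Java/Scala bindings can then load; the file names, signature names and single-input assumption below are illustrative only, not the project's actual export code.

```python
import tensorflow as tf
from keras import backend as K
from keras.models import load_model

model = load_model("people_tagger.h5")             # hypothetical model file
tf.saved_model.simple_save(
    K.get_session(),
    "export/people_tagger/1",                      # hypothetical export directory
    inputs={"token_vectors": model.input},
    outputs={"tags": model.output})
# On the Scala side, the model can then be loaded with the TensorFlow Java API
# (SavedModelBundle), and input tensors serialized to match these signature names.
```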
37
Productionization Challenges: Tensorflow – Memory Management
And then – there was this memory leak… (close())
And then – there was only this stable memory over-allocation… (CPU/GPU core formula, Protobuf)
And nothing is visible in Java monitors
38
Productionization Challenges: Tensorflow – Additional Aspects
Multithreading and thread safety:
Mostly thread safe
Resources are allocated on both sides. Maybe they can be pooled.
Latency:
We currently don't have GPUs in production environments
Transactions take about 200 ms
Can be significantly improved by reducing input size, adding memory allocation and tuning Tensorflow's behavior
Stability:
No convenient indication when the C++ library "freezes"
39
Productionization Challenges: Word Embedding Size
Tensorflow model size:
With the embedding layer: 2.5 GB
Without the embedding layer: 4 MB
Scaling roadmap: a Tensorflow model per language per concept, plus an embedding lookup per language
Production team says: nay!
Solution:
Lucene index, on disk
Client cache
Bloom filter
40
Become Language Agnostic
We immediately reused the training / testing / running code for English and Spanish
It required:
Training data in the specific language
Separation of the logic ("The NLP Story") from the language specifics when preparing the texts for the neural network
We are developing LATI – the Language Agnostic Tagging Infrastructure.
41
Summary
Taking Deep Learning to production is challenging but doable
"Good enough" is defined differently in academia and in the market
Much further tuning is possible, for both quality and latency
With Deep Learning, language could become a non-issue
42
Thank you!