Scaling the Knowledge Base for the Never-Ending Language Learner (NELL): A Step Toward Large-Scale Computing for Automated Learning
Joel Welling, PSC, 4/10/2012
Automated Learning from Text Is Not a Supercomputing Problem (But It Really Should Be)
- Simultaneously infer patterns and identify data matching those patterns, using large collections of text.
- Corpus size is supercomputing-scale (~7 TB compressed for the ClueWeb corpus).
- Learning algorithms (like clustering and SVD) are well suited to supercomputers.
- Yet the largest-scale projects typically use Hadoop.
NELL: The Never-Ending Language Learner
- NELL is a project of the ReadTheWeb group at CMU, led by Tom Mitchell and William Cohen.
- Based on 500M web pages (ClueWeb09) and access to the rest of the web via search-engine APIs.
- You can follow NELL on Twitter as “cmunell”.
NELL's Tasks
1. Extract new instances of categories and relations. In other words, find noun phrases that represent new examples of the input categories (e.g., "Barack Obama" is a person and a politician), and find pairs of noun phrases that correspond to instances of the input relations (e.g., the pair "Jason Giambi" and "Yankees" is an instance of the playsOnTeam relation). These new instances are added to the growing knowledge base of structured beliefs.
2. Learn to read better than yesterday. NELL uses a variety of methods to extract beliefs from the web. These are retrained, using the growing knowledge base as a self-supervised collection of training examples. The result is a semi-supervised learning method that couples the training of hundreds of different extraction methods for a wide range of categories and relations. Much of NELL's current success is due to its algorithm for coupling the simultaneous training of many extraction methods (a toy sketch of the idea follows).
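To make the coupling idea concrete, here is a deliberately tiny bootstrapping loop. Everything in it (the corpus, the prefix-pattern learner, the two-vote promotion threshold) is an invented stand-in for NELL's far more sophisticated machinery; it only illustrates how trusted beliefs train extractors and how agreement among extractors gates promotion into the KB:

```java
import java.util.*;

// Toy sketch (not NELL's code) of coupled, self-supervised bootstrapping:
// known instances train textual patterns, and a candidate is promoted to
// the knowledge base only when several independent patterns agree.
public class BootstrapSketch {
    public static void main(String[] args) {
        Set<String> kb = new HashSet<>(Arrays.asList("Pittsburgh", "Boston"));
        List<String> corpus = Arrays.asList(
            "mayor of Pittsburgh", "mayor of Chicago",
            "flights to Boston", "flights to Chicago");

        for (int cycle = 0; cycle < 3; cycle++) {
            // 1. Learn textual patterns from contexts of already-known instances.
            Set<String> patterns = new HashSet<>();
            for (String sentence : corpus)
                for (String known : kb)
                    if (sentence.endsWith(known))
                        patterns.add(sentence.substring(0, sentence.length() - known.length()));

            // 2. Apply the patterns; promote a candidate only when multiple
            //    independent patterns agree (the "coupling" constraint).
            Map<String, Integer> votes = new HashMap<>();
            for (String sentence : corpus)
                for (String p : patterns)
                    if (sentence.startsWith(p))
                        votes.merge(sentence.substring(p.length()), 1, Integer::sum);
            for (Map.Entry<String, Integer> e : votes.entrySet())
                if (e.getValue() >= 2)
                    kb.add(e.getKey());
        }
        System.out.println(kb); // "Chicago" is promoted: two patterns agree on it
    }
}
```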
NELL Learns in Cycles
Each cycle requires pouring the corpus disk-to-disk, applying templates and checking candidates against the currently known KB. Very slow. We'd like to hold everything in memory and learn templates and facts simultaneously, which should speed things up by a factor of 100 or more.
Typical Scaling Work-around
Currently the text corpus gets streamed disk-to-disk and a co-occurrence matrix is built. Statistics are taken from this, but such a matrix can't distinguish between "Bill ate the cow" and "The cow ate Bill": this is a "bag of words" model (see the sketch below). We want to do "deep reading" instead, actually parsing and understanding individual sentences.
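A minimal sketch of why word order matters: once counts replace order, the two opposite sentences become indistinguishable. (The naive whitespace tokenization here is for illustration only.)

```java
import java.util.*;

// Sketch: two sentences with opposite meanings produce identical
// word-count vectors once word order is discarded.
public class BagOfWords {
    static Map<String, Integer> counts(String sentence) {
        Map<String, Integer> m = new TreeMap<>();
        for (String w : sentence.toLowerCase().split("\\s+"))
            m.merge(w, 1, Integer::sum);
        return m;
    }

    public static void main(String[] args) {
        Map<String, Integer> a = counts("Bill ate the cow");
        Map<String, Integer> b = counts("The cow ate Bill");
        System.out.println(a);           // {ate=1, bill=1, cow=1, the=1}
        System.out.println(a.equals(b)); // true: the models cannot tell them apart
    }
}
```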
Barriers to Scaling NELL
- Java
- Use of web services
- The database implementation of the NELL Knowledge Base (KB)
- A Hadoop-based mindset, if scaling is thought of at all
- Single threads, with many JVMs working in parallel
Steps to a Solution
- Learn to run Java stably: there's a trick.
- Move to a scalable DB: the current project.
- Minimize real-time web-service interaction.
- More parallelism. This is most easily achieved with many threads, so Blacklight is a good fit.
Java GC Threads
On Blacklight, the JVM tries to spawn one garbage-collection thread per core, across the whole machine. To control this, pass -XX:ParallelGCThreads=16 (for the IBM JVM, -Xgcthreads16). Beware of Lucene, a common search/database server written in Java; you must arrange to start it up with the same flag.
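You can verify that the flag took effect on a HotSpot JVM with the standard HotSpotDiagnosticMXBean. A minimal sketch, launched for example as java -XX:ParallelGCThreads=16 GcThreadCheck:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Prints the effective ParallelGCThreads setting so you can confirm the JVM
// is not spawning one GC thread per core across the whole machine.
public class GcThreadCheck {
    public static void main(String[] args) {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        System.out.println("ParallelGCThreads = "
            + bean.getVMOption("ParallelGCThreads").getValue());
        System.out.println("Cores visible to the JVM = "
            + Runtime.getRuntime().availableProcessors());
    }
}
```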
Knowledge Base Representations
- The KB knows the ontology: what is what. It is referenced continuously as the program runs, possibly across multiple threads.
- NELL's KB is described in three ways: Theo (syntactic, and evolving), a TokyoCabinet representation, and a graph database (Neo4j).
- Why change? Scaling, explicit transactions, and a natural fit to the structure of the data.
Theo
Theo is the abstract syntax of the knowledge base. It's still evolving.
- Value : { string | number | list | pointer to entity }
- Entity : { PrimitiveEntity | Query | Belief }
- PrimitiveEntity : string
- Slot : one of a subset of the set of PrimitiveEntities
- Query : ( entity, slot )
- Belief : query = value
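This grammar maps naturally onto plain Java types. The following is a minimal sketch (the class names are illustrative, not the ReadTheWeb group's actual implementation); it also constructs the nested belief from the examples that follow:

```java
// Sketch of Theo's abstract syntax as plain Java types.
interface Value {}                        // string | number | list | entity
interface Entity extends Value {}         // PrimitiveEntity | Query | Belief

final class StringValue implements Value {
    final String s;
    StringValue(String s) { this.s = s; }
}

final class PrimitiveEntity implements Entity {
    final String name;                    // e.g. "Tom"; slots are a subset of these
    PrimitiveEntity(String name) { this.name = name; }
}

final class Query implements Entity {     // (entity, slot)
    final Entity entity;
    final PrimitiveEntity slot;
    Query(Entity entity, PrimitiveEntity slot) { this.entity = entity; this.slot = slot; }
}

final class Belief implements Entity {    // query = value; beliefs are themselves entities
    final Query query;
    final Value value;
    Belief(Query query, Value value) { this.query = query; this.value = value; }
}

class TheoDemo {
    public static void main(String[] args) {
        PrimitiveEntity tom = new PrimitiveEntity("Tom");
        PrimitiveEntity wife = new PrimitiveEntity("wife");
        PrimitiveEntity haircolor = new PrimitiveEntity("haircolor");
        // (Tom wife) = Sally
        Belief tomWife = new Belief(new Query(tom, wife), new StringValue("Sally"));
        // (Tom wife =Sally haircolor) = blonde : a query whose entity is a belief
        Belief nested = new Belief(new Query(tomWife, haircolor), new StringValue("blonde"));
    }
}
```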
Theo Examples
- (Tom wife) = Sally
- (Tom wife =Sally haircolor) = blonde
- (Tom wife haircolor) = blonde
Note the difference between the last two! Some beliefs are currently valid but being dropped, e.g. Tom = blonde (a primitive entity treated like a slot).
TokyoCabinet
TokyoCabinet is a fast, light key-value-pair database. It was used for the first KB representation because it was handy. The Theo examples become DB key-value pairs:

Key                         Value
Tom wife                    Sally
Tom wife =Sally haircolor   blonde
Tom wife haircolor          blonde
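A minimal sketch of this mapping, using the tokyocabinet.HDB hash-database class from TokyoCabinet's Java binding (the file name kb.tch is arbitrary):

```java
import tokyocabinet.HDB;

// Sketch: store the Theo example beliefs as flat key-value pairs
// in a TokyoCabinet hash database.
public class KbStore {
    public static void main(String[] args) {
        HDB hdb = new HDB();
        if (!hdb.open("kb.tch", HDB.OWRITER | HDB.OCREAT)) {
            System.err.println("open error: " + hdb.errmsg(hdb.ecode()));
            return;
        }
        hdb.put("Tom wife", "Sally");
        hdb.put("Tom wife =Sally haircolor", "blonde");
        hdb.put("Tom wife haircolor", "blonde");

        System.out.println(hdb.get("Tom wife haircolor")); // blonde
        hdb.close();
    }
}
```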
Graph Databases
A number of graph databases exist. Neo4j was chosen because it is:
- Compact
- Flexible
- Not specialized for web apps
- A match for common graph structure, like the TinkerPop Blueprints model
- Popular
There are other options; substitution would be easy.
Neo4j Elements
The elements are nodes, relationships, and properties:
- Given a node, get its relationships and properties by name.
- Given a relationship, get either node.
- Nodes and relationships can be inserted into indices.
- Properties are just scalars or arrays, accessed by name.
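In the embedded Java API of that era (Neo4j 1.x), these operations look roughly as follows. This is a sketch; DynamicRelationshipType, the legacy index, and the explicit try/finally transaction reflect the 1.x idiom:

```java
import org.neo4j.graphdb.*;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

// Sketch: create nodes, a named relationship, and properties, then
// navigate back from the relationship to either endpoint.
public class Neo4jElements {
    public static void main(String[] args) {
        GraphDatabaseService db =
            new GraphDatabaseFactory().newEmbeddedDatabase("kb-graph");
        Transaction tx = db.beginTx();
        try {
            Node tom = db.createNode();
            tom.setProperty("name", "Tom");     // properties: scalars or arrays, by name
            Node sally = db.createNode();
            sally.setProperty("name", "Sally");

            Relationship r = tom.createRelationshipTo(
                sally, DynamicRelationshipType.withName("wife"));

            Node start = r.getStartNode();      // given a relationship, get either node
            Node end = r.getEndNode();
            for (Relationship rel : tom.getRelationships()) {
                // given a node, enumerate its relationships
            }

            db.index().forNodes("entities").add(tom, "name", "Tom"); // legacy index
            tx.success();
        } finally {
            tx.finish();                        // 1.x idiom; tx.close() in Neo4j 2.x
        }
    }
}
```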
Problems
- Ambiguity: the earlier version doesn't make a clear distinction between strings and entities.
- Ambiguity: should a value consisting of a list in TokyoCabinet get mapped to one graph element or many?
- Meanwhile, the ReadTheWeb group is adding 'contexts', the time intervals in which things happen. The Neo4j representation will need to be updated to support this.
Neo4j Representation
[Diagram: the Neo4j reference node connects to an 'Entity' node carrying the value "hillary", which in turn connects to a 'Slot' node; relationships are identified by name.]
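A sketch of how that structure might be built with the same embedded API. The relationship names 'entity' and 'slot' are illustrative assumptions, and getReferenceNode() is the Neo4j 1.x reference-node accessor (removed in later versions):

```java
import org.neo4j.graphdb.*;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

// Sketch of the structure above: reference node -> 'Entity' node -> 'Slot' node,
// with each relationship identified by name.
public class EntitySlotSketch {
    public static void main(String[] args) {
        GraphDatabaseService db =
            new GraphDatabaseFactory().newEmbeddedDatabase("kb-graph");
        Transaction tx = db.beginTx();
        try {
            Node reference = db.getReferenceNode(); // 1.x API; removed in later versions
            Node hillary = db.createNode();
            hillary.setProperty("value", "hillary"); // the 'Entity' node
            Node slot = db.createNode();             // the 'Slot' node

            reference.createRelationshipTo(hillary,
                DynamicRelationshipType.withName("entity"));
            hillary.createRelationshipTo(slot,
                DynamicRelationshipType.withName("slot"));
            tx.success();
        } finally {
            tx.finish();
        }
    }
}
```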
[Diagram: graph representations of two complex beliefs:
“billclinton dog =rover haircolor =blonde according to =tom”
“billclinton dog haircolor =gray according to =sam”]
[Diagram annotations: one arc can only exist because of the ambiguity between entities and slots; the syntax needs to indicate when two such nodes need to merge.]
Status
The graph database project is feeding back into the rest of the software base, helping to inform new design changes. But we're about a month behind:
- The target is changing.
- The effect of the ambiguities was not anticipated.
An implementation exists. The next step is timing tests.
Questions?