Published by Dustin Cameron. Modified over 9 years ago.
1
Languages at Inxight
Ian Hersey, Co-Founder and SVP, Corporate Development and Strategy
2
Inxight Confidential
Inxight at a Glance
- 20+ years of Xerox PARC research - 70 patents
- Content & linguistic analysis (27 languages today)
- Information visualization and discovery
- Silicon Valley HQ; offices in US, Europe
- 250 major customers
- Seasoned management team
- Solid investor backing: Vantage Point, Reed Elsevier, Deutsche Bank, Dresdner Bank, Xerox, In-Q-Tel
Inxight provides the only complete solution for organizing and accessing unstructured data to increase the speed and accuracy of information discovery.
3
What we mean by language support
- Not pure statistics
  - “Language independence” is a fallacy when it comes to text
- Whitespace parsing + algorithmic stemming is a cheap hack
  - Stem-internal changes
  - Compounding
  - Agglutination
  - Vocalization or lack thereof
  - Non-breaking languages
- Phrases, terms and named entities can’t be extracted effectively by n-gram indexing or pure machine learning
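To make the "cheap hack" point concrete, here is a minimal sketch of whitespace parsing plus suffix stripping. The function names and the suffix list are assumptions chosen for illustration; they are not Inxight's code or any particular stemmer.

```python
def naive_stem(token):
    """Strip a few English-style suffixes (a deliberately crude stand-in)."""
    for suffix in ("ing", "es", "s"):
        # Only strip when a stem of at least three characters remains.
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def naive_analyze(text):
    """Whitespace tokenization followed by suffix stripping."""
    return [naive_stem(token) for token in text.lower().split()]


# Stem-internal change: the two forms never map to a common stem.
print(naive_analyze("mice mouse"))  # ['mice', 'mouse']

# Compounding: a German compound stays one opaque token.
print(naive_analyze("Lebensversicherungsgesellschaft"))

# Non-breaking language: no whitespace, so the sentence is a single "word".
print(len(naive_analyze("我喜欢自然语言处理")))  # 1
```

A lexicon-based morphological analyzer would be expected to relate "mice" to "mouse" and to segment the compound and the Chinese sentence, which this approach cannot do.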
4
Text analysis fundamentals
- Base layer
  - Language and character set identification
  - Document analysis
  - Tokenization
  - Stemming/normalization
- Contextual analysis
  - Part-of-speech tagging
  - “Grouping”
- Find the interesting stuff
  - Named entity extraction
  - Syntactic analysis (clause boundary identification, subject/object identification, etc.)
- Relate the interesting stuff; analyze meaning
  - Semantic analysis (fact extraction, etc.)
5
Don’t ignore statistics
- Feed linguistic markup into probabilistic processing
  - Categorization (choose your algorithm)
  - Search/relevance ranking
  - Summarization
  - Co-occurrence analysis/entity resolution
  - Link analysis
  - Predictive analysis/data mining
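As one illustration of feeding linguistic markup into statistical processing, the sketch below counts how often pairs of already-extracted entities appear in the same document. The input shape (one list of entity strings per document) and the function name are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs_entities):
    """Count the documents in which each unordered pair of entities co-occurs."""
    counts = Counter()
    for entities in docs_entities:
        # Deduplicate within a document and sort so each pair has one canonical order.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


docs = [
    ["Xerox", "Inxight", "Xerox"],
    ["Inxight", "Xerox", "PARC"],
    ["PARC"],
]
counts = cooccurrence_counts(docs)
print(counts[("Inxight", "Xerox")])  # 2
```

The counts are only as good as the entity extraction upstream, which is the slide's argument for running linguistic analysis before the statistics.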
6
Base layer (LinguistX Platform)
- Morphological analyzer
  - Lexicon + rules
  - Compiled as a finite-state machine
  - Resource efficient, very fast
  - French lexicon recognizes 5M words; takes up 300K on disk/RAM, and runs at over 2 GB/hr on a low-end machine
  - Xerox finite-state tools tested on many languages (Inxight’s 27 + others in research)
- Corpora to produce statistical models
  - Language and character set detection
  - Tagged corpus to produce Hidden Markov Model for POS tagger
- Groupers
  - Finite-state “chunkers”: compiled regex
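The "grouper" idea, a chunker expressed as a regular expression compiled over part-of-speech tags, can be sketched as follows. The tag-to-code table, the noun-phrase pattern, and the function name are illustrative assumptions; the deck does not show LinguistX's actual grouper syntax.

```python
import re

# Map Penn-style POS tags to one-character codes so a tag sequence becomes a string.
TAG_CODES = {"DT": "D", "JJ": "J", "NN": "N", "NNS": "N", "NNP": "N"}

# Noun phrase: optional determiner, any number of adjectives, one or more nouns.
NP_PATTERN = re.compile(r"D?J*N+")


def chunk_noun_phrases(tagged_tokens):
    """Return noun-phrase strings found in a list of (word, tag) pairs."""
    # Tags outside the table become "O", which the pattern never matches.
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged_tokens)
    return [
        " ".join(word for word, _ in tagged_tokens[m.start():m.end()])
        for m in NP_PATTERN.finditer(codes)
    ]


tagged = [
    ("the", "DT"), ("French", "JJ"), ("lexicon", "NN"),
    ("recognizes", "VBZ"), ("five", "CD"), ("million", "CD"), ("words", "NNS"),
]
print(chunk_noun_phrases(tagged))  # ['the French lexicon', 'words']
```

Because each token maps to exactly one character, match offsets in the code string are also token offsets, which keeps the chunker a single linear pass.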
7
Named entity extraction (ThingFinder)
- Builds on base platform
- Requires additional resources
  - Enhanced lexicon (POS tagset insufficient for high quality extraction)
  - Entity-specific groupers
  - Tagged corpus for accuracy testing
- Sometimes you need more
  - Genre-specific document analysis
  - Specialized tokenization, tagging
  - Knowledge base (“Name Catalog”)
  - Custom groupers
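A knowledge base of known names can be approximated by a longest-match dictionary lookup over tokens. The sketch below is a generic gazetteer, not ThingFinder's Name Catalog; the catalog entries, type labels, and function name are assumptions for illustration.

```python
# Hypothetical catalog: lowercase token tuples mapped to entity types.
NAME_CATALOG = {
    ("xerox", "parc"): "ORGANIZATION",
    ("xerox",): "ORGANIZATION",
    ("silicon", "valley"): "PLACE",
}
MAX_NAME_LENGTH = max(len(key) for key in NAME_CATALOG)


def find_entities(tokens):
    """Greedy longest-match lookup of catalog names in a token list."""
    entities = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first so "Xerox PARC" beats "Xerox".
        for n in range(min(MAX_NAME_LENGTH, len(tokens) - i), 0, -1):
            key = tuple(token.lower() for token in tokens[i:i + n])
            if key in NAME_CATALOG:
                entities.append((" ".join(tokens[i:i + n]), NAME_CATALOG[key]))
                i += n
                break
        else:
            i += 1  # No catalog entry starts at this token.
    return entities


print(find_entities("Xerox PARC is in Silicon Valley".split()))
# [('Xerox PARC', 'ORGANIZATION'), ('Silicon Valley', 'PLACE')]
```

A catalog alone misses unseen names, which is presumably why the slide lists it alongside groupers and an enhanced lexicon rather than instead of them.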
8
Statistical models
- Summarization
  - Base layer + feature model (feature weights, stop words, cue phrases)
- Categorization
  - Labeled training data
- …and lots of interactive tools
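A feature model of the kind named on this slide (feature weights, stop words, cue phrases) might look like the extractive sketch below. The specific weights, word lists, features, and regex-based sentence splitter are all assumptions; a real system would take tokens and sentence boundaries from the base layer.

```python
import re
from collections import Counter

# Hypothetical feature model; every value here is illustrative.
STOP_WORDS = {"the", "a", "of", "and", "is", "are", "in", "to", "on", "it"}
CUE_PHRASES = ("in summary", "in conclusion")
WEIGHTS = {"term_frequency": 1.0, "cue_phrase": 2.0, "first_sentence": 0.5}


def split_sentences(text):
    """Crude splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]


def summarize(text, n=1):
    """Return the n highest-scoring sentences in document order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))
    scored = []
    for index, sentence in enumerate(sentences):
        words = content_words(sentence)
        # Average document frequency of the sentence's content words.
        tf = sum(freq[w] for w in words) / len(words) if words else 0.0
        cue = 1.0 if any(c in sentence.lower() for c in CUE_PHRASES) else 0.0
        first = 1.0 if index == 0 else 0.0
        score = (WEIGHTS["term_frequency"] * tf
                 + WEIGHTS["cue_phrase"] * cue
                 + WEIGHTS["first_sentence"] * first)
        scored.append((score, index))
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:n]
    return [sentences[index] for _, index in sorted(top, key=lambda pair: pair[1])]


text = ("Finite-state lexicons are compact. They also run fast. "
        "In summary, finite-state lexicons are compact and fast.")
print(summarize(text, n=1))
# ['In summary, finite-state lexicons are compact and fast.']
```

Swapping the stop-word and cue-phrase lists per language is the only language-specific part of this sketch, which matches the slide's "base layer + feature model" split.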
9
Fact extraction
- Builds on base of linguistic markup + named entities
- Modeled on specific templates
  - Rules populate the templates
- Additional linguistic resources
  - Intra-document
    - Document analysis/genre identification
    - Subject/object identification
    - Anaphora resolution
  - Inter-document
    - Entity resolution
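The "rules populate the templates" step can be sketched with a template as a small record type and a rule as a pattern over entity-annotated text. The inline `[ORG:...]` markup, the `AcquisitionFact` template, the rule itself, and the company names are invented for this example and are not Inxight's formats.

```python
import re
from dataclasses import dataclass


@dataclass
class AcquisitionFact:
    """Hypothetical template: who acquired whom."""
    acquirer: str
    target: str


# Hypothetical rule over text whose organizations are already marked as [ORG:Name].
ACQUISITION_RULE = re.compile(
    r"\[ORG:(?P<acquirer>[^\]]+)\] (?:acquired|bought) \[ORG:(?P<target>[^\]]+)\]"
)


def extract_acquisitions(marked_up_text):
    """Fill one AcquisitionFact template per rule match."""
    return [
        AcquisitionFact(match["acquirer"], match["target"])
        for match in ACQUISITION_RULE.finditer(marked_up_text)
    ]


text = "[ORG:Acme Corp] acquired [ORG:Globex] last year. [ORG:Initech] bought [ORG:Hooli]."
for fact in extract_acquisitions(text):
    print(fact)
```

A rule like this only fires when subject and object are adjacent to the verb; handling pronouns or passive constructions is where the anaphora resolution and subject/object identification listed on the slide would come in.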
10
Developing a new language
- Resource acquisition
  - Corpora
  - Lexicon
- Team
  - Computational linguist familiar with tools
  - Native speaker
- Resource enhancement
  - Label tagged truth sets
  - Build out morphological classes
  - Fill lexical gaps
- Build, test and refine
- Soup to nuts: $500K to $1M for V1.0
11
Challenge of low-density languages
- Commercial non-viability
- Lack of lexical resources and corpora
- Lack of native speakers, or even proficient speakers
- Greed
12
Future developments on the language frontier
- New languages
- Increased depth in existing languages
  - Named entity extraction
    - Added Arabic, Farsi and Chinese this year
    - Enhanced English for DoD and DOJ
  - Fact extraction
- Other challenges
  - Name transliteration
  - Translation/glossing
  - Question answering