Languages at Inxight Ian Hersey Co-Founder and SVP, Corporate Development and Strategy.


1 Languages at Inxight
Ian Hersey, Co-Founder and SVP, Corporate Development and Strategy

2 Inxight at a Glance
- 20+ years of Xerox PARC research - 70 patents
- Content & linguistic analysis (27 languages today)
- Information visualization and discovery
- Silicon Valley HQ; offices in US, Europe
- 250 major customers
- Seasoned management team
- Solid investor backing: Vantage Point, Reed Elsevier, Deutsche Bank, Dresdner Bank, Xerox, In-Q-Tel
Inxight provides the only complete solution for organizing and accessing unstructured data to increase the speed and accuracy of information discovery.
Inxight Confidential

3 What we mean by language support
- Not pure statistics
  - “Language independence” is a fallacy when it comes to text
  - Whitespace parsing + algorithmic stemming is a cheap hack
    - Stem-internal changes
    - Compounding
    - Agglutination
    - Vocalization or lack thereof
    - Non-breaking languages
  - Phrases, terms and named entities can’t be extracted effectively by n-gram indexing or pure machine learning
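To make the "cheap hack" point concrete, here is a minimal Python sketch of whitespace tokenization plus suffix stripping. The suffix list, the toy lexicon, and the function names are illustrative assumptions, not anything from Inxight's products. It shows the approach working on a regular plural and failing on a stem-internal change, a compound, and a language written without spaces.

```python
# Illustrative only: a deliberately naive analyzer (hypothetical names throughout).

def naive_stem(token: str) -> str:
    """Strip a few common English suffixes; no lexicon, no morphology."""
    for suffix in ("ing", "es", "s", "ed"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def naive_analyze(text: str) -> list[str]:
    """Whitespace tokenization followed by suffix stripping."""
    return [naive_stem(tok.lower()) for tok in text.split()]

# A toy lexicon (assumed data) that knows irregular forms.
TOY_LEXICON = {"mice": "mouse", "geese": "goose", "went": "go"}

def lexicon_stem(token: str) -> str:
    """Look the token up first; fall back to suffix stripping."""
    return TOY_LEXICON.get(token, naive_stem(token))

print(naive_analyze("cats"))                   # ['cat']   regular plural: fine
print(naive_analyze("mice"))                   # ['mice']  stem-internal change: missed
print(lexicon_stem("mice"))                    # 'mouse'   lexicon lookup recovers it
print(naive_analyze("Donaudampfschifffahrt"))  # compound stays one opaque token
print(naive_analyze("東京都に住む"))             # no whitespace, so one "token"
```

A real lexicon-driven analyzer would also have to segment the compound and the unspaced text; the sketch only shows where the naive approach stops working.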

4 Text analysis fundamentals
- Base layer
  - Language and character set identification
  - Document analysis
  - Tokenization
  - Stemming/normalization
- Contextual analysis
  - Part-of-speech tagging
  - “Grouping”
- Find the interesting stuff
  - Named entity extraction
  - Syntactic analysis (clause boundary identification, subject/object identification, etc.)
- Relate the interesting stuff; analyze meaning
  - Semantic analysis (fact extraction, etc.)

5 Don’t ignore statistics
- Feed linguistic markup into probabilistic processing
  - Categorization (choose your algorithm)
  - Search/relevance ranking
  - Summarization
  - Co-occurrence analysis/entity resolution
  - Link analysis
  - Predictive analysis/data mining
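As one illustration of feeding linguistic markup into statistical processing, the sketch below counts co-occurrences over extracted entities rather than raw tokens. The entity lists are made-up sample data standing in for the output of an entity extractor, and the function name is an assumption.

```python
from collections import Counter
from itertools import combinations

# Assumed input: one list of normalized entity names per document,
# as an entity extractor might produce. The names are invented.
docs_entities = [
    ["Acme Corp", "Jane Doe", "Paris"],
    ["Jane Doe", "Acme Corp"],
    ["Paris", "Jane Doe"],
]

def cooccurrence(docs: list[list[str]]) -> Counter:
    """Count how many documents mention each unordered pair of entities."""
    counts: Counter = Counter()
    for entities in docs:
        # Deduplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts

counts = cooccurrence(docs_entities)
print(counts[("Acme Corp", "Jane Doe")])  # 2
print(counts[("Jane Doe", "Paris")])      # 2
print(counts[("Acme Corp", "Paris")])     # 1
```

Because the units are whole entities, "Jane Doe" is counted once per document instead of being split into unrelated word n-grams.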

6 Base layer (LinguistX Platform)
- Morphological analyzer
  - Lexicon + rules
  - Compiled as a finite-state machine
  - Resource efficient, very fast
    - French lexicon recognizes 5M words; takes up 300K on disk/RAM, and runs at over 2 GB/hr on a low-end machine
  - Xerox finite-state tools tested on many languages (Inxight’s 27 + others in research)
- Corpora to produce statistical models
  - Language and character set detection
  - Tagged corpus to produce Hidden Markov Model for POS tagger
- Groupers
  - Finite-state “chunkers” – compiled regex
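The "chunker as compiled regex" idea can be sketched in a few lines of Python: run a regular expression over the sequence of part-of-speech tags rather than over the words. The tag names, the noun-phrase pattern, and the sample sentence are assumptions for illustration; this is not the LinguistX grouper format.

```python
import re

# Assumed input: (token, POS tag) pairs as a tagger might emit them.
tagged = [
    ("the", "DET"), ("quick", "ADJ"), ("fox", "NOUN"), ("jumps", "VERB"),
    ("over", "ADP"), ("the", "DET"), ("lazy", "ADJ"), ("dog", "NOUN"),
]

# Noun phrase over tags: optional determiner, any adjectives, one or more nouns.
# Each tag is followed by a single space in the encoded tag string.
NP_PATTERN = re.compile(r"(?:\bDET )?(?:\bADJ )*(?:\bNOUN )+")

def chunk(tagged_tokens: list[tuple[str, str]]) -> list[str]:
    """Return the surface text of every noun-phrase chunk."""
    tag_string = "".join(tag + " " for _, tag in tagged_tokens)
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        # Count separators to map character offsets back to token indices.
        start = tag_string[: match.start()].count(" ")
        end = start + match.group().count(" ")
        chunks.append(" ".join(tok for tok, _ in tagged_tokens[start:end]))
    return chunks

print(chunk(tagged))  # ['the quick fox', 'the lazy dog']
```

Python's `re` engine is a backtracking matcher rather than a compiled finite-state machine, so this shows only the pattern-over-tags idea, not the performance characteristics the slide describes.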

7 Named entity extraction (ThingFinder)
- Builds on base platform
- Requires additional resources
  - Enhanced lexicon (POS tagset insufficient for high quality extraction)
  - Entity-specific groupers
  - Tagged corpus for accuracy testing
- Sometimes you need more
  - Genre-specific document analysis
  - Specialized tokenization, tagging
  - Knowledge base (“Name Catalog”)
  - Custom groupers

8 Statistical models
- Summarization
  - Base layer + feature model (feature weights, stop words, cue phrases)
- Categorization
  - Labeled training data
- …and lots of interactive tools
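A feature model for extractive summarization might look roughly like the sketch below: each sentence gets a weighted score from content-term frequency, cue phrases, and position. The weights, stop-word list, cue phrases, and the regex tokenizer (a stand-in for a real base layer) are all assumed values, not Inxight's model.

```python
import re
from collections import Counter

# All values below are illustrative assumptions.
STOP_WORDS = {"the", "a", "of", "and", "is", "are", "in", "to", "this", "it"}
CUE_PHRASES = ("in summary", "in conclusion", "importantly")
WEIGHTS = {"term": 1.0, "cue": 2.0, "lead": 0.5}

def content_words(text: str) -> list[str]:
    """Crude stand-in for tokenization + normalization from a base layer."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

def summarize(text: str, n: int = 1) -> list[str]:
    """Return the n highest-scoring sentences, in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(index: int) -> float:
        words = content_words(sentences[index])
        term = sum(freq[w] for w in words) / max(len(words), 1)
        cue = any(p in sentences[index].lower() for p in CUE_PHRASES)
        lead = index == 0
        return WEIGHTS["term"] * term + WEIGHTS["cue"] * cue + WEIGHTS["lead"] * lead

    ranked = sorted(range(len(sentences)), key=score, reverse=True)
    return [sentences[i] for i in sorted(ranked[:n])]

text = ("Finite-state lexicons are compact. Cats are nice. "
        "In summary, finite-state lexicons are compact and fast.")
print(summarize(text, 1))
# ['In summary, finite-state lexicons are compact and fast.']
```

With linguistic markup available, the term feature would count stems rather than raw strings, so inflected variants of the same word reinforce each other.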

9 Fact extraction
- Builds on base of linguistic markup + named entities
- Modeled on specific templates
- Rules populate the templates
- Additional linguistic resources
- Intra-document
  - Document analysis/genre identification
  - Subject/object identification
  - Anaphora resolution
- Inter-document
  - Entity resolution
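"Rules populate the templates" can be pictured with a small Python sketch: a template is a typed record, and a rule is a pattern over entity-annotated spans that fills it. The `EmploymentFact` template, the trigger verbs, and the span format are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

@dataclass
class EmploymentFact:
    """Hypothetical template: who joined which organization."""
    person: str
    organization: str

# Assumed trigger words for this one rule.
JOIN_VERBS = {"joined", "joins"}

def extract_employment(spans: list[tuple[str, str]]) -> list[EmploymentFact]:
    """Rule: PERSON + join verb + ORG in adjacent spans fills one template."""
    facts = []
    for (a, a_type), (b, _), (c, c_type) in zip(spans, spans[1:], spans[2:]):
        if a_type == "PERSON" and b.lower() in JOIN_VERBS and c_type == "ORG":
            facts.append(EmploymentFact(person=a, organization=c))
    return facts

# Assumed input: text already segmented into spans with entity types
# ("O" marks a span that is not an entity). The sentence is invented.
spans = [
    ("Jane Doe", "PERSON"), ("joined", "O"), ("Acme Corp", "ORG"),
    ("in", "O"), ("2004", "DATE"),
]
print(extract_employment(spans))
# [EmploymentFact(person='Jane Doe', organization='Acme Corp')]
```

The rule only works because named entities arrive as single typed spans; anaphora resolution and subject/object identification would be needed to handle sentences such as "She joined it in 2004."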

10 Developing a new language
- Resource acquisition
  - Corpora
  - Lexicon
  - Team
    - Computational linguist familiar with tools
    - Native speaker
- Resource enhancement
  - Label tagged truth sets
  - Build out morphological classes
  - Fill lexical gaps
- Build, test and refine
- Soup to nuts: $500K to $1M for V1.0

11 Challenge of low-density languages
- Commercial non-viability
- Lack of lexical resources and corpora
- Lack of native speakers, or even proficient speakers
- Greed

12 Future developments on the language frontier
- New languages
- Increased depth in existing languages
  - Named entity extraction
    - Added Arabic, Farsi and Chinese this year
    - Enhanced English for DoD and DOJ
  - Fact extraction
- Other challenges
  - Name transliteration
  - Translation/glossing
  - Question answering

