Introduction to the Language Technologies Institute Fall, 2008 Jaime Carbonell

1 Introduction to the Language Technologies Institute Fall, 2008 Jaime Carbonell jgc@cs.cmu.edu

2 School of Computer Science at Carnegie Mellon University Computer Science Department (theory, systems) Robotics Institute (space, industry, medical) Language Technologies Institute (MT, speech, IR) Human-Computer Interaction Inst. (Ergonomics) Institute for Software Research Int. (SE) Machine Learning Department (ML theory) Entertainment Technologies (Animation, graphics)

3 Language Technologies Institute Founded in 1986 as the Center for Machine Translation (CMT). Became Language Technologies Institute in 1996, unifying CMT, Comp Ling program. Current Size: 197 FTEs 27 Faculty (including joint appointments) 25 Staff 125 Graduate Students (90 PhD, 40 MLT) 10 Visiting Scholars

4 LTI Bill of Rights: Get the right information, to the right people, at the right time, on the right medium, in the right language, at the right level of detail.

5 Slogan Challenges: … right information: IR, filtering, TC, …; … right people: routing, personalization, …; … right time: anticipatory analysis, …; … right medium: text, speech, video, …; … right language: translation, bio, …; … right detail: summarization, expansion

6 “… on the Right Medium” Speech Recognition: SPHINX (Reddy, Rudnicky, Rosenfeld, …), JANUS (Waibel, Schultz, …). Speech Synthesis: Festival (Black, Lenzo). Handwriting & Gesture Recognition: ISL (Waibel, J. Yang). Multimedia Integration (CSD): Informedia (Wactlar, Hauptmann, …)

7 “… in the Right Language” High-Accuracy Interlingual MT: KANT (Nyberg, Mitamura). Parallel Corpus-Trainable MT: Statistical MT (Lafferty, Vogel), Example-Based MT (Brown, Carbonell). AVENUE Instructible MT (Levin, Lavie, Carbonell). Multi-Engine MT (Lavie, Frederking). Speech-to-speech MT: JANUS/DIPLOMAT/AVENUE (Waibel, Frederking, Levin, Schultz, Vogel, Lafferty, Black, …)

8 We also Engage in: Tutoring Systems (Eskenazi, Callan) Linguistic Analysis (Levin, Mitamura … ) Dialog Systems (Rudnicky, Waibel, … ) Computational Biology Protein structure/function (Carbonell, Langmead) DNA seq/motifs (Yang, Xing, Rosenfeld) Complex System Design (Nyberg, Callan) Machine Learning (Carbonell, Lafferty, Yang, Rosenfeld, Xing, Cohen, … ) Question Answering (Nyberg, Mitamura, … )

9 How we do it at LTI. Data-driven methods: Statistical learning, Corpora-based. Examples: Statistical MT, Example-based MT, Text categorization, Novelty detection, Translingual IR. Knowledge-based: Symbolic learning, Linguistic analysis, Knowledge represent. Examples: Interlingual MT, Parsing & generation, Discourse modeling, Language tutoring.

10 MMR Ranking vs Standard IR [Figure: a query and its documents, ranked by MMR and by standard IR; λ controls the spiral curl]
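
The slide gives only the figure, so the following is a minimal sketch of Maximal Marginal Relevance re-ranking in its commonly cited form: at each step, pick the document that maximizes λ · sim(doc, query) − (1 − λ) · max sim(doc, already selected). The cosine similarity over term-weight dictionaries, the function names `cosine` and `mmr_rank`, and the toy documents are assumptions made for illustration, not details taken from the presentation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

def mmr_rank(query, docs, lam=0.7):
    """Greedy MMR ordering of docs (dict: id -> term-weight dict).

    lam = 1.0 reduces to plain relevance ranking; smaller values
    penalize documents similar to ones already selected.
    """
    remaining = dict(docs)
    selected = []
    while remaining:
        def score(doc_id):
            relevance = cosine(query, remaining[doc_id])
            redundancy = max((cosine(remaining[doc_id], docs[s]) for s in selected), default=0.0)
            return lam * relevance - (1.0 - lam) * redundancy
        # sorted() makes tie-breaking deterministic (lowest id wins).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        del remaining[best]
    return selected

# Hypothetical toy collection: d2 duplicates d1, d3 is less relevant but novel.
query = {"mt": 1.0, "speech": 1.0}
docs = {
    "d1": {"mt": 1.0, "speech": 1.0},
    "d2": {"mt": 1.0, "speech": 1.0},
    "d3": {"mt": 1.0, "ir": 1.0},
}
print(mmr_rank(query, docs, lam=1.0))  # relevance only
print(mmr_rank(query, docs, lam=0.3))  # novelty weighted more heavily
```

With λ = 1.0 the duplicate d2 is ranked directly after d1; lowering λ pushes the novel d3 ahead of it, which is the trade-off the slide's λ parameter controls.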

11 Adaptive Filtering over a Document Stream [Figure: a document stream over time. Training documents (past) are on-topic, off-topic, or unlabeled; for each test document the question is whether the current document is on-topic. Labels: RF; Topic 1, Topic 2, Topic 3, …]
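
The slide shows the filtering setup but not a method. As one possible illustration, here is a sketch of a single-topic filter that scores each incoming document against a term profile and adjusts the profile from feedback with a Rocchio-style update. The Rocchio update, the class name `TopicFilter`, and the parameters `threshold`, `beta` and `gamma` are all assumptions chosen for this sketch, not the technique used at LTI.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts for a whitespace-tokenized text."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

class TopicFilter:
    """Hypothetical adaptive filter for one topic over a document stream."""

    def __init__(self, seed_text, threshold=0.2, beta=0.5, gamma=0.25):
        self.profile = dict(vectorize(seed_text))
        self.threshold = threshold  # minimum similarity to accept a document
        self.beta = beta            # weight for on-topic feedback
        self.gamma = gamma          # weight for off-topic feedback

    def score(self, text):
        return cosine(self.profile, vectorize(text))

    def is_on_topic(self, text):
        return self.score(text) >= self.threshold

    def feedback(self, text, on_topic):
        """Move the profile toward on-topic and away from off-topic documents."""
        step = self.beta if on_topic else -self.gamma
        for term, count in vectorize(text).items():
            weight = self.profile.get(term, 0.0) + step * count
            if weight > 0:
                self.profile[term] = weight
            else:
                self.profile.pop(term, None)  # drop terms with non-positive weight

topic = TopicFilter("machine translation")
for doc in ["statistical machine translation systems", "protein folding structure"]:
    print(doc, "->", topic.is_on_topic(doc))
```

Documents that arrive without a label simply leave the profile unchanged; only labeled documents trigger `feedback`.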

13 Types of Machine Translation Interlingua Syntactic Parsing Semantic Analysis Sentence Planning Text Generation Source (Arabic) Target (English) Transfer Rules Direct: SMT, EBMT

14 EBMT Example English: I would like to meet her. Mapudungun: Ayükefun trawüael fey engu. English: The tallest man is my father. Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw. English: I would like to meet the tallest man. Mapudungun (new): Ayükefun trawüael Chi doy fütra chi wentru. Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentru engu.
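
The example above can be read as fragment recombination: pieces of stored translation pairs are matched against the new input and concatenated. Below is a toy sketch of that idea using greedy longest-match lookup. The fragment alignments in `fragments` are inferred from the slide's "new" output and are assumptions, as are the function name `translate` and the matching strategy; a real EBMT system would derive alignments from the corpus.

```python
def translate(sentence, fragments):
    """Translate by greedy longest-match over known source fragments.

    Unmatched words are passed through in angle brackets.
    """
    words = sentence.lower().rstrip(".").split()
    output = []
    i = 0
    while i < len(words):
        # Try the longest span starting at position i first.
        for j in range(len(words), i, -1):
            chunk = " ".join(words[i:j])
            if chunk in fragments:
                output.append(fragments[chunk])
                i = j
                break
        else:
            output.append(f"<{words[i]}>")
            i += 1
    return " ".join(output)

# Assumed alignments, read off the two example pairs on the slide.
fragments = {
    "i would like to meet": "Ayükefun trawüael",
    "her": "fey engu",
    "the tallest man": "Chi doy fütra chi wentru",
    "is my father": "fey ta inche ñi chaw",
}

print(translate("I would like to meet the tallest man", fragments))
```

The sketch reproduces the slide's "new" output rather than the correct one: plain concatenation cannot adjust the verb form or attach "engu" to the noun phrase, which is why the recombined sentence differs from the correct translation.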

15 Ambiguity Makes MT Hard Word Senses for “line” (52 senses in Random House English-Japanese Dictionary) Power line – densen (電線) Subway line – chikatetsu (地下鉄) (Be) on line – onrain (オンライン) (Be) on the line – denwachuu (電話中) Line up – narabu (並ぶ) Line one’s pockets – kanemochi ni naru (金持ちになる) Line one’s jacket – uwagi o nijuu ni suru (上着を二重にする) Actor’s line – serifu (セリフ) Get a line on – joho o eru (情報を得る)

16 CONTEXT: More is Better “The line for the new play extended for 3 blocks.” “The line for the new play was changed by the scriptwriter.” “The line for the new play got tangled with the other props.” “The line for the new play better protected the quarterback.”

17 Primary Sequence MNGTEGPNFY VPFSNKTGVV RSPFEAPQYY LAEPWQFSML AAYMFLLIML GFPINFLTLY VTVQHKKLRT PLNYILLNLA VADLFMVFGG FTTTLYTSLH GYFVFGPTGC NLEGFFATLG GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT WVMALACAAP PLVGWSRYIP EGMQCSCGID YYTPHEETNN ESFVIYMFVV HFIIPLIVIF FCYGQLVFTV KEAAAQQQES ATTQKAEKEV TRMVIIMVIA FLICWLPYAG VAFYIFTHQG SDFGPIFMTI PAFFAKTSAV YNPVIYIMMN KQFRNCMVTT LCCGKNPLGD DEASTTVSKT ETSQVAPA 3D Structure Folding Complex function within network of proteins Normal PROTEINS Sequence → Structure → Function (Borrowed from: Judith Klein-Seetharaman)

18 Primary Sequence MNGTEGPNFY VPFSNKTGVV RSPFEAPQYY LAEPWQFSML AAYMFLLIML GFPINFLTLY VTVQHKKLRT PLNYILLNLA VADLFMVFGG FTTTLYTSLH GYFVFGPTGC NLEGFFATLG GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT WVMALACAAP PLVGWSRYIP EGMQCSCGID YYTPHEETNN ESFVIYMFVV HFIIPLIVIF FCYGQLVFTV KEAAAQQQES ATTQKAEKEV TRMVIIMVIA FLICWLPYAG VAFYIFTHQG SDFGPIFMTI PAFFAKTSAV YNPVIYIMMN KQFRNCMVTT LCCGKNPLGD DEASTTVSKT ETSQVAPA 3D Structure Folding Complex function within network of proteins Disease PROTEINS Sequence → Structure → Function

19 Predicting Protein Structures Protein structure is a key determinant of protein function. Crystallography to resolve protein structures experimentally in vitro is very expensive; NMR can only resolve very small proteins. The gap between the known protein sequences and structures: 3,023,461 sequences vs. 36,247 resolved structures (1.2%). Therefore we need to predict structures in silico.
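
As a quick check of the percentage quoted on this slide, the ratio of resolved structures to known sequences can be computed directly. The variable names are illustrative; the two counts are the ones given above.

```python
known_sequences = 3_023_461      # protein sequences, as quoted on the slide
resolved_structures = 36_247     # experimentally resolved structures

coverage = resolved_structures / known_sequences
print(f"{coverage:.1%} of known sequences have a resolved structure")
```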

20 Linked Segmentation CRF Node: secondary structure elements and/or simple fold. Edges: local interactions and long-range inter-chain and intra-chain interactions. L-SCRF: conditional probability of y given x is defined as [equation not preserved in the transcript; figure label: Joint Labels]

21 Discriminative Semi-Markov Model for Parallel Right-handed β-Helix Prediction. Structures: a regular super-secondary structure with an elongated helix whose successive rungs are composed of beta-strands; conserved T2 turn. Computational importance: long-range interactions. Biological importance: functions such as the bacterial infection of plants, binding the O-antigen, antifreeze, ...

22 Some LTI Accomplishments First large-scale web-spider (LYCOS) First speech-speech MT (JANUS) First high-accuracy text MT (KANT) First minority-language MT (DIPLOMAT) First high-accuracy translingual IR First multidocument summarizer (MMR)

