CONTROLLING A HIFI WITH A CONTINUOUS SPEECH UNDERSTANDING SYSTEM
ICSLP’98
J. Ferreiros, J. Colás, J. Macías-Guarasa, A. Ruiz, J. M. Pardo
Grupo de Tecnología del Habla - Departamento de Ingeniería Electrónica
E.T.S.I. Telecomunicación - Universidad Politécnica de Madrid
Ciudad Universitaria s/n, Madrid, Spain

General Architecture
[Figure: block diagram of the system]
Speech Recogniser (SCHMM + word-pair grammar) → Tagger (tagged dictionary) → Tags Refiner (context-dependent rules) → Understanding (context-dependent rules) → Actuator (HIFI status) → IR LED
Speech Generation Module (alternative expressions) → Text to Speech
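
The chain of modules in the diagram can be sketched as a function pipeline in which each stage consumes the previous stage's output. This is a minimal sketch under stated assumptions: the stage functions, the tag names (SUBSYSTEM, ACTION) and the toy lexicon are hypothetical stand-ins, and the "recogniser" simply splits text instead of decoding audio.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical interface: each stage takes the previous stage's output.
Stage = Callable[[Any], Any]

def run_pipeline(stages: List[Stage], data: Any) -> Any:
    """Feed data through the modules in order."""
    for stage in stages:
        data = stage(data)
    return data

# Toy stand-ins for the real modules (all hypothetical).
def recogniser(utterance: str) -> List[str]:
    return utterance.lower().split()  # pretend decoding already produced text

def tagger(words: List[str]) -> List[Tuple[str, str]]:
    lexicon = {"radio": "SUBSYSTEM", "on": "ACTION"}
    return [(w, lexicon.get(w, "garbage")) for w in words]

def refiner(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    return [(w, t) for w, t in tagged if t != "garbage"]  # garbage removal only

def understanding(tagged: List[Tuple[str, str]]) -> Dict[str, str]:
    return {t: w for w, t in tagged}  # tag -> word "frame"

def actuator(frame: Dict[str, str]) -> str:
    return f"IR: {frame['SUBSYSTEM']} {frame['ACTION']}"  # stand-in for the IR LED

result = run_pipeline([recogniser, tagger, refiner, understanding, actuator],
                      "Switch the radio on")
print(result)  # IR: radio on
```

Keeping every module behind the same one-argument interface makes it easy to swap a single stage without touching the others.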

Speech recogniser
• Characteristics:
  – Continuous speech commands
  – One-pass search with a word-pair grammar
  – 163-word vocabulary
  – SCHMM (semi-continuous HMM) phone models
• Implementation:
  – Front end: DSP LSI board
  – Rest of processing: PC
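
A word-pair grammar lists, for each word, the words allowed to follow it. The sketch below shows only how such a grammar prunes word transitions in a best-path search; it is a deliberate simplification, assuming a word lattice with per-slot log scores, whereas a real one-pass search runs frame-synchronously over SCHMM states. The grammar, vocabulary and scores are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical word-pair grammar: for each word, the words allowed to follow it.
# "<s>" and "</s>" mark the start and end of a sentence.
WORD_PAIRS: Dict[str, Set[str]] = {
    "<s>": {"switch", "volume"},
    "switch": {"the"},
    "the": {"radio", "cassette"},
    "radio": {"on", "off"},
    "cassette": {"on", "off"},
    "volume": {"up"},
    "on": {"</s>"},
    "off": {"</s>"},
    "up": {"</s>"},
}

def best_sentence(slots: List[Dict[str, float]]) -> Tuple[List[str], float]:
    """Return the best-scoring word sequence (sum of log scores) allowed by
    the word-pair grammar. Each slot maps candidate words to log scores."""
    # best[w] = (score, path) of the best partial hypothesis ending in word w
    best: Dict[str, Tuple[float, List[str]]] = {"<s>": (0.0, [])}
    for slot in slots:
        new_best: Dict[str, Tuple[float, List[str]]] = {}
        for word, score in slot.items():
            for prev, (prev_score, path) in best.items():
                if word not in WORD_PAIRS.get(prev, set()):
                    continue  # transition forbidden by the grammar
                cand = prev_score + score
                if word not in new_best or cand > new_best[word][0]:
                    new_best[word] = (cand, path + [word])
        best = new_best
    # Only hypotheses whose last word may end a sentence survive.
    finals = [v for w, v in best.items() if "</s>" in WORD_PAIRS.get(w, set())]
    if not finals:
        return [], float("-inf")
    score, path = max(finals, key=lambda v: v[0])
    return path, score

slots = [{"switch": -1.0, "volume": -1.5}, {"the": -0.5, "up": -0.1},
         {"radio": -1.0, "cassette": -1.2}, {"on": -0.3, "up": -0.2}]
words, score = best_sentence(slots)
print(words)  # ['switch', 'the', 'radio', 'on']
```

In the last slot "up" has the better score, but the grammar has no "radio → up" pair, so it is rejected.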

Speech understanding (I)
• TAGGER:
  – 78 semantic tags
  – Several tags may be applied to each word
  – A "garbage" tag is used for words with no meaning for the task:
    • gives robustness against speech recogniser errors
    • will allow out-of-vocabulary (OOV) words in the recognised string, e.g. "Please, set the volume higher"
  – Tagging is specified directly in the lexicon
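
Since tagging is specified in the lexicon, the tagger can be little more than a dictionary lookup that attaches every tag of a word. A minimal sketch, assuming hypothetical tag names (the real system has 78 tags) and assuming that unknown words fall back to the "garbage" tag, as the OOV bullet suggests:

```python
from typing import Dict, List, Tuple

# Hypothetical tagged lexicon: each word maps to one or more semantic tags.
LEXICON: Dict[str, List[str]] = {
    "set": ["ACTION_SET"],
    "volume": ["PARAM_VOLUME"],
    "higher": ["INCREMENT"],
    "right": ["POSITION", "INCREMENT"],  # ambiguous word: two tags
    "please": ["garbage"],
    "the": ["garbage"],
}

def tag(words: List[str]) -> List[Tuple[str, List[str]]]:
    """Attach every lexicon tag to each recognised word; words missing from
    the lexicon (assumed OOV handling) get the "garbage" tag."""
    return [(w, LEXICON.get(w.lower(), ["garbage"])) for w in words]

print(tag(["Please", "set", "the", "volume", "higher"]))
```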

Speech understanding (II)
• TAGS REFINER:
  – Aims:
    • number processing
    • disambiguation of words with several tags
    • "garbage" removal
  – May change the literal of the words: "two five" → "25"
  – May introduce new, refined semantic tags
  – Context-dependent rules, e.g.:
    word: "right"
    tags: "position", "increment"
    rule: "if any other word is tagged as a tape parameter, then the word 'right' is the position of this tape; otherwise it is an increment indicator"
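
The three aims of the refiner can be sketched as successive passes over the tagged words: drop "garbage", merge runs of digit words into one NUMBER literal, and resolve "right" with the context rule above. This is a minimal sketch, assuming hypothetical tag names (TAPE_PARAM, NUMBER, DIGIT) and digit-by-digit number reading; it resolves only "right" and, for any other multi-tag word, simply keeps the first tag.

```python
from typing import List, Tuple

Tagged = Tuple[str, List[str]]

DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}

def refine(tagged: List[Tagged]) -> List[Tuple[str, str]]:
    # 1) "garbage" removal
    words = [(w, tags) for w, tags in tagged if tags != ["garbage"]]
    # 2) number processing: "two five" -> "25" (changes the literal)
    merged: List[Tagged] = []
    for w, tags in words:
        d = DIGITS.get(w)
        if d is not None and merged and merged[-1][1] == ["NUMBER"]:
            merged[-1] = (merged[-1][0] + d, ["NUMBER"])
        elif d is not None:
            merged.append((d, ["NUMBER"]))
        else:
            merged.append((w, tags))
    # 3) context-dependent disambiguation of "right"
    has_tape = any("TAPE_PARAM" in tags for _, tags in merged)
    out = []
    for w, tags in merged:
        if w == "right" and len(tags) > 1:
            out.append((w, "POSITION" if has_tape else "INCREMENT"))
        else:
            out.append((w, tags[0]))  # simplification: keep the first tag
    return out

print(refine([("station", ["PARAM_STATION"]), ("two", ["DIGIT"]), ("five", ["DIGIT"])]))
# [('station', 'PARAM_STATION'), ('25', 'NUMBER')]
```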

Speech understanding (III)
• UNDERSTANDING STAGE:
  – Context-dependent rules
    • give independence from the order of the concepts
  – Tries to fill in frames:
    SUBSYSTEM = (radio, cd-player, cassette, ...)
    PARAMETER = (volume, tone, broadcast station, song, ...)
    VALUE = (higher, number, ...)
  – One or several frames for each command
  – More specific rules are executed first
  – Message strings are also filled in:
    • with the "reasoning"
    • with the problems found in the understanding stage
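
Rule-based frame filling can be sketched by making each rule test only for the presence of tags (so concept order does not matter) and by sorting rules so that those with more conditions run first. A minimal sketch under assumptions: the Rule structure, tag names and the two example rules are hypothetical, only one frame is produced per command, and the message strings are simplified versions of the "reasoning" and problem messages.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

Frame = Dict[str, str]

@dataclass
class Rule:
    name: str
    needs: Set[str]                          # refined tags that must all be present
    fill: Callable[[Dict[str, str]], Frame]  # builds slot values from a tag -> word map

RULES = [
    Rule("volume_up", {"PARAM_VOLUME", "INCREMENT"},
         lambda t: {"PARAMETER": "volume", "VALUE": "higher"}),
    Rule("subsystem", {"SUBSYSTEM"},
         lambda t: {"SUBSYSTEM": t["SUBSYSTEM"]}),
]

def understand(tagged: List[Tuple[str, str]], rules: List[Rule]) -> Tuple[Frame, List[str]]:
    by_tag = {tag: word for word, tag in tagged}  # word order is discarded here
    frame: Frame = {}
    messages: List[str] = []
    # More specific rules (more required tags) are tried first.
    for rule in sorted(rules, key=lambda r: len(r.needs), reverse=True):
        if rule.needs.issubset(by_tag):
            for slot, value in rule.fill(by_tag).items():
                frame.setdefault(slot, value)  # a more specific rule is never overridden
            messages.append(f"applied rule {rule.name}")
    missing = [s for s in ("SUBSYSTEM", "PARAMETER", "VALUE") if s not in frame]
    if missing:
        messages.append("could not determine " + ", ".join(missing))
    return frame, messages

frame, msgs = understand([("higher", "INCREMENT"), ("radio", "SUBSYSTEM"),
                          ("volume", "PARAM_VOLUME")], RULES)
print(frame)  # {'PARAMETER': 'volume', 'VALUE': 'higher', 'SUBSYSTEM': 'radio'}
```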

Speech understanding (IV)
• ACTUATOR:
  – Sends IR commands to the HIFI set
  – Keeps track of the set status
  – Informs the user of the actions performed or the problems found
    USER: "switch the radio on"
    ACTUATOR: "The radio was already on"
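
Tracking the set status lets the actuator skip redundant IR commands and explain why, as in the "already on" example. A minimal sketch, assuming a hypothetical IR code table and a list that stands in for the IR LED driver; the real codes depend on the HIFI set's remote-control protocol.

```python
from typing import Dict, List, Tuple

# Hypothetical IR code table (real codes depend on the set's remote protocol).
IR_CODES: Dict[Tuple[str, str], int] = {("radio", "on"): 0x10, ("radio", "off"): 0x11}

class Actuator:
    def __init__(self) -> None:
        self.status: Dict[str, str] = {"radio": "off"}  # tracked HIFI status
        self.sent: List[int] = []                        # stand-in for the IR LED

    def execute(self, subsystem: str, action: str) -> str:
        """Send an IR command only when it changes the set status; return a
        message for the user describing the action or the problem found."""
        if self.status.get(subsystem) == action:
            return f"The {subsystem} was already {action}"
        code = IR_CODES.get((subsystem, action))
        if code is None:
            return f"I do not know how to switch the {subsystem} {action}"
        self.sent.append(code)
        self.status[subsystem] = action
        return f"The {subsystem} is now {action}"

a = Actuator()
print(a.execute("radio", "on"))  # The radio is now on
print(a.execute("radio", "on"))  # The radio was already on
```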

Speech generation
  – Input: a pattern string of both literals and concepts coming from the rest of the architecture
  – Performs random substitution of concepts by text to achieve a certain degree of naturalness/variety
    Input: "C_SEEING the word higher with an increment meaning, C_THINK that put means an increasing action"
    C_SEEING → "As I can see", "As I have discovered", "As it appears", ...
    C_THINK → "I think", "I imagine", "I suppose", ...
  – Output through a text-to-speech subsystem
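
Random concept substitution amounts to replacing each concept token in the pattern with a randomly chosen alternative expression. A minimal sketch, assuming concepts are written as C_ followed by capital letters (as in C_SEEING and C_THINK); unknown concepts are left unchanged, and a seeded random.Random can be passed in for reproducible output.

```python
import random
import re
from typing import Dict, List, Optional

# Alternative expressions for each concept (taken from the example above).
ALTERNATIVES: Dict[str, List[str]] = {
    "C_SEEING": ["As I can see", "As I have discovered", "As it appears"],
    "C_THINK": ["I think", "I imagine", "I suppose"],
}

def generate(pattern: str, rng: Optional[random.Random] = None) -> str:
    """Replace every concept token with a random alternative expression."""
    rng = rng or random.Random()
    return re.sub(r"\bC_[A-Z]+\b",
                  lambda m: rng.choice(ALTERNATIVES.get(m.group(0), [m.group(0)])),
                  pattern)

pattern = ("C_SEEING the word higher with an increment meaning, "
           "C_THINK that put means an increasing action")
print(generate(pattern, random.Random(0)))
```

The resulting string is what would be handed to the text-to-speech subsystem; a different seed (or none) yields a different phrasing of the same message.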

CONCLUSIONS & FUTURE WORK
  – Supporting ideas of the system:
    • semantic-like tagging
    • context-dependent rules
    • "garbage" tag
    • pattern-based generation
    • random concept substitution for generation
  – Desirable new aspects:
    • use more of the information in the recognised sentences
    • handle more complex commands by introducing semantic-syntactic parsing of the sentence structure
    • introduce dialogue to complete information that was not understood or not given, and as a confirmation strategy