Manolis Perakakis world

News, diary, journal, whatever

Multimodal mobile interaction – blending speech and GUI input (iPhone demo) October 15, 2010

Update: Since Apple yesterday (Oct 5, 2011) announced full integration of the Siri personal assistant in iOS 5, I think the title of this post could become: A Siri-like (personal assistant) interface developed as part of my PhD research (focused on multimodal interaction), circa 2009 🙂

Well, it was about time for a new blog post after errrr…. almost 2 years!

These recent years have been so exciting for mobile interaction… I wonder how much cooler the coming years may be.

A few years ago, while working on distributed speech recognition, I envisioned how the speech modality would enrich (or almost supersede) the poor mobile interaction experience of that time. Look ma(!), touch just won the game; it was so much simpler as a technology (well, by today’s standards), error-free and intuitive. The iPhone truly revolutionized the mobile interface by exploiting multi-touch input, but speech as a modality still has a bright future, not by replacing mobile interaction but by enriching it.

So the question is: how do we build interfaces that combine more than one modality? Generally speaking, to successfully combine multiple modalities, one has to exploit the synergies that emerge when mixing them. For example, in blending the speech and GUI (touch) modalities the following synergies arise:

  • visual output (GUI) is much faster (and more informative) than speech output, which is sequential; this is due to the different information bandwidths of the visual and audio channels of the human brain
  • speech input is usually much faster than GUI input (and is also the more natural form of communication). A single spoken sentence can convey information that would require many GUI actions to enter, e.g. “I want to fly from Athens to London”
  • speech input is inconsistent due to recognition errors! The same utterance spoken twice can yield different recognition results, and fixing errors solely through speech may be difficult. Instead, allow easy error correction through an extra modality (e.g. GUI input)

Multimodal interfaces (interfaces that support more than one interaction modality) may thus offer a richer user experience; they are more flexible and robust, at the cost of greater design and implementation complexity.

The video shows a multimodal mobile interaction application demonstrating how to exploit the speech and GUI (touch) modalities to enrich the user experience. The application scenario is a travel reservation service. The user can use either GUI or speech input at each interaction turn, that is, selecting values from a list by touch or directly speaking, e.g. “I want to fly from Orlando to Chicago next Friday evening“.
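
(Aside, for the technically curious: the sketch below is a toy Python illustration of such a form-filling turn, not the actual demo code — parse_speech, gui_select and the confidence threshold are all invented placeholders. The point it illustrates is the synergy above: one spoken utterance may fill several slots at once, a touch action fills exactly one, and a low-confidence recognition result falls back to GUI correction.)

```python
# Illustrative sketch of one multimodal form-filling turn.
# All names here are hypothetical, not taken from the actual demo.

SLOTS = ["departure_city", "arrival_city", "date", "time"]

def parse_speech(utterance):
    """Toy 'recognizer': returns (slot -> value) pairs plus a confidence score.
    A real system would run ASR + natural language understanding here."""
    # Hard-coded example result for the sample utterance.
    return {"departure_city": "Orlando", "arrival_city": "Chicago",
            "date": "next Friday", "time": "evening"}, 0.82

def gui_select(slot, value):
    """Toy GUI action: the user picks a single value from a list by touch."""
    return {slot: value}, 1.0  # GUI input is assumed error-free

def interaction_turn(form, modality, speech_utterance=None,
                     gui_slot=None, gui_value=None, conf_threshold=0.6):
    """Fill form slots from whichever modality the user chose this turn."""
    if modality == "speech":
        values, confidence = parse_speech(speech_utterance)
        if confidence < conf_threshold:
            # Low-confidence recognition: ask the user to confirm/correct via
            # the GUI instead of re-speaking (error correction through the
            # extra modality).
            return form, "please confirm via the list"
        form.update(values)          # one utterance may fill several slots
    else:  # "gui"
        values, _ = gui_select(gui_slot, gui_value)
        form.update(values)          # one touch action fills exactly one slot
    return form, "ok"

form = dict.fromkeys(SLOTS)
form, status = interaction_turn(form, "speech",
                                speech_utterance="I want to fly from Orlando "
                                                 "to Chicago next Friday evening")
print(status, form)
```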

This specific demonstration showcases four different interaction modes, one unimodal (GUI-only input) and three multimodal ones:

  • “Click-to-Talk”: the user clicks the speech button to talk
  • “Open-Mike”: speech input using voice activity detection
  • “Modality-selection”: the default input modality is chosen based on modality efficiency; the system switches between “Click-to-Talk” and “Open-Mike” depending on the current context to favor GUI or speech input respectively, e.g. GUI input might be faster for short lists like the date (a rough sketch of such a selection policy follows this list).
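
(To give an idea of how the Modality-selection mode could work, here is a rough Python sketch of such a selection policy. It is purely illustrative — the time estimates, thresholds and function names are invented placeholders, not the actual implementation.)

```python
# Hypothetical sketch of a modality-selection policy; the thresholds and
# time estimates are illustrative, not the ones used in the real system.

def estimate_gui_time(num_options, seconds_per_item=0.2):
    """Rough cost of scanning a list and tapping one entry."""
    return 1.0 + seconds_per_item * num_options

def estimate_speech_time(asr_error_rate=0.15, utterance_seconds=2.5):
    """Rough cost of speaking once, including expected re-tries after errors."""
    expected_attempts = 1.0 / (1.0 - asr_error_rate)
    return utterance_seconds * expected_attempts

def select_default_modality(num_options):
    """Favor GUI (Click-to-Talk) when it is expected to be faster, else speech (Open-Mike)."""
    if estimate_gui_time(num_options) <= estimate_speech_time():
        return "Click-to-Talk"   # GUI is the default; speech only on button press
    return "Open-Mike"           # speech is the default, with voice activity detection

for slot, options in [("date", 7), ("departure_city", 200)]:
    print(slot, "->", select_default_modality(options))
```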

Note that the same (and also the simplest possible, e.g. a one-way trip without car/hotel reservation) scenario (New York to Chicago, etc.) is demonstrated for all the different interaction modes (of course, everything you can do with the GUI you can also do with speech). This video was shot to showcase the port to the iPhone platform (with the help of V. Kouloumenta); the platform has also been running on PCs and various PDAs (e.g. the Zaurus) since 2006.

This demo is part of my PhD work at the Electronics & Computer Engineering Dept, Technical University of Crete, under the supervision of A. Potamianos. For more info you may refer to:
M. Perakakis and A. Potamianos. A study in efficiency and modality usage in multimodal form filling systems. IEEE Transactions on Audio, Speech and Language Processing, 2008.

 

Prime time for Distributed Speech Recognition? February 23, 2009

While an undergraduate student a few years ago, I worked on Distributed Speech Recognition (DSR). The main idea of DSR is to compress the acoustic features used by a speech recognizer and send them over a data (instead of voice) network, thus saving bandwidth (cost-effective) and allowing the use of full speech recognition on mobile terminals. Since it compresses acoustic features for speech recognition purposes (not for speech signal transmission/reproduction), it can achieve very low bit rates. You can think of it as the analogue of what MP3 is for music transmission and storage.
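
(To give a flavor of the idea, here is a toy Python sketch of a DSR-style front end: extract simple spectral features frame by frame and coarsely quantize them before transmission, so that only a few tens of bits per frame go over the network. This is just an illustration — real DSR front ends, such as the ETSI standard one, use MFCCs and trained split-VQ codebooks rather than the crude band energies and uniform quantizer below.)

```python
import numpy as np

# Minimal DSR-style client sketch (illustrative only): extract simple
# spectral features per frame and quantize them to a small bit budget
# before sending them over the data network.

SAMPLE_RATE = 8000        # telephone-band speech
FRAME_LEN = 200           # 25 ms frames
FRAME_SHIFT = 80          # 10 ms shift -> 100 frames per second
NUM_BANDS = 10            # crude spectral bands instead of real MFCCs
BITS_PER_COEFF = 2        # 10 bands * 2 bits = 20 bits per frame ~= 2 kbps

def extract_features(signal):
    """Log band energies per frame (a stand-in for cepstral features)."""
    frames = []
    for start in range(0, len(signal) - FRAME_LEN + 1, FRAME_SHIFT):
        frame = signal[start:start + FRAME_LEN] * np.hamming(FRAME_LEN)
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        bands = np.array_split(spectrum, NUM_BANDS)
        frames.append(np.log(np.array([b.sum() for b in bands]) + 1e-10))
    return np.array(frames)

def quantize(features, bits=BITS_PER_COEFF):
    """Uniform scalar quantization of each coefficient to 'bits' bits."""
    levels = 2 ** bits
    lo, hi = features.min(), features.max()
    indices = np.clip(((features - lo) / (hi - lo) * (levels - 1)).round(),
                      0, levels - 1).astype(np.uint8)
    return indices, (lo, hi)   # the indices are what actually gets transmitted

signal = np.random.randn(SAMPLE_RATE)                   # 1 second of fake 'speech'
feats = extract_features(signal)
indices, _ = quantize(feats)
bits_used = feats.shape[0] * NUM_BANDS * BITS_PER_COEFF  # bits for ~1 s of audio
print(feats.shape, "-> approx", bits_used, "bits per second of speech")
```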

Depicted next is a simple overview of a DSR architecture (model 2). Note that the mobile terminals depicted are Symbian’s reference devices, corresponding to a smartphone, a handheld and a PDA respectively (oops, these images are too old – they must be from around 2001; I should upgrade to something like an iPhone or Android…).

My work with Prof. V. Digalakis concluded that one can successfully take advantage of DSR with a coding rate of only 2 kbps, which is an extremely low data rate. After that I ported the DSR engine to a Zaurus Linux PDA (16 MB of RAM, 200 MHz StrongARM processor) and made it work in real time.
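
(Back-of-the-envelope, to show how tight that budget is: at the usual 100 feature frames per second, 2 kbps leaves only about 20 bits per frame for the whole feature vector. The exact bit allocation depends on the codec; this split is just illustrative.)

```python
# Bit budget at 2 kbps (illustrative split, not the exact allocation used):
bits_per_second = 2000
frames_per_second = 100          # standard 10 ms feature frame shift
bits_per_frame = bits_per_second / frames_per_second
print(bits_per_frame, "bits available per feature frame")   # -> 20.0
```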

Although my work now focuses on multimodal (speech) interfaces, I still keep an eye on DSR. It seems that with the emergence of powerful mobile terminals and Google’s announcement of speech recognition support for Android and the iPhone, DSR might soon become a hot topic!

P.S. I just found out my DSR page is ranked 3rd by Google after W3C and ETSI. Holy moly!

Coolness factor: ?

 

The year of Augmented Reality

Filed under: android,augmented reality,mobile — perak @ 5:24 am

Wikitude AR Travel Guide

Until now there has been too much hype around augmented reality, since apart from some really cool demos and research prototypes no real end-user apps existed. Well, it seems that with the emergence of powerful mobile devices, augmented reality will find its way to the public, with mobile users being the first. The Wikitude Android app is one of the first, with many more to follow this year.

Coolness factor 5/5!

 

My 15 minutes of fame! March 11, 2008

Filed under: HCI,interfaces,Multimodal,Speech,technology — perak @ 8:41 pm

Our work at the Telecommunications Lab of the Technical University of Crete (TUC) was featured in the “Orizontes” documentary series of the Kydon TV channel. We presented some of our demos:

  • My work on multimodal interfaces (part of my PhD), including a travel reservation multimodal (GUI + speech) application running on a Zaurus Linux PDA
  • The automatic video summarization system (part of the MUSCLE NoE European research project showcases).
  • An audio-visual (AV) recognition system (also part of the MUSCLE NoE European research project showcases).
  • The multi-mic robust speech recognition demo (part of the Hiwire European research project showcases).

We could not showcase the augmented reality demo we developed in cooperation with VTT (speech recognition integration), since we currently lack the appropriate hardware; I hope we get it soon.

Some of these demos will go public, either as videos posted on YouTube or as open-source releases on SourceForge/Google Code.

More on this, as well as a more detailed description of the demos, in future posts!

Stay tuned!

 

Aibo, Lego Mindstorms, Wii Remote (Wiimote), iPhone & Google’s Android!

Filed under: HCI,interfaces,Multimodal,programming,robotics,Speech — perak @ 8:12 pm

What do all these have in common? They will be my playground for a while…

I will have the chance to play with all of them during this semester!

As far as the Aibo and Mindstorms are concerned, I will use them for the two robotics-related courses I have enrolled in. Some possible projects I am thinking of:

  • Distributed speech recognition (DSR): enhance the limited speech recognition capabilities of the Aibo by exploiting the wireless link and a speech recognition server (a rough sketch of this idea follows the list).
  • Distributed image processing: enhance the Aibo’s limited machine vision capabilities by exploiting the wireless link and a machine vision server (similarly to DSR)
  • Robot localization using multiple input modalities: machine vision + audio
  • Enhanced gesture-based interfaces or multimodal (speech + gesture) interfaces
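
For the Aibo DSR idea above, a minimal offloading skeleton could look like the following Python sketch: the robot streams feature frames over its wireless link to a workstation that runs the recognizer. This is a hypothetical illustration — the framing protocol, port and stub recognizer are invented, and it is not code for the actual Aibo SDK.

```python
import socket
import struct
import threading
import time

# Hypothetical offloading skeleton: a resource-limited client (the robot)
# streams feature frames to a server that runs the actual recognizer.
# The framing protocol, port and stub recognizer below are made up,
# and there is no partial-read handling; this is only a sketch.

HOST, PORT = "127.0.0.1", 9000
FRAME_BYTES = 20   # e.g. 10 quantized coefficients at 2 bytes each

def recognize(frames):
    """Stub for the server-side recognizer (a real system would run ASR here)."""
    return "recognized %d frames" % len(frames)

def server():
    with socket.create_server((HOST, PORT)) as srv:
        conn, _ = srv.accept()
        with conn:
            # First 4 bytes: number of frames, followed by the frames themselves.
            (num_frames,) = struct.unpack("!I", conn.recv(4))
            frames = [conn.recv(FRAME_BYTES) for _ in range(num_frames)]
            conn.sendall(recognize(frames).encode())

def client(frames):
    with socket.create_connection((HOST, PORT)) as sock:
        sock.sendall(struct.pack("!I", len(frames)))
        for frame in frames:
            sock.sendall(frame)
        return sock.recv(1024).decode()

threading.Thread(target=server, daemon=True).start()
time.sleep(0.5)                                          # give the server time to start
fake_frames = [bytes(FRAME_BYTES) for _ in range(50)]    # stand-in feature frames
print(client(fake_frames))
```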

Wiimote hacks for enhanced HCI, similar to these demos from CMU.

The iPhone will be used to augment my speech & GUI multimodal interface prototype, already running on the Zaurus PDA, with the gesture modality.

Finally, I can’t resist playing with Google’s new Android platform, porting various apps I have in mind.

Whoa, my hacker alter ego will definitely be back for good!!!

 

gaming tech update January 19, 2008

Filed under: HCI,interfaces,technology — perak @ 10:50 am

Sure, I was an addicted gamer during the glory days of the Amstrad 6128 era, but that was almost two decades ago! Doom and Quake were the hits of the mid-90s, which is more than a decade ago. Recently, a friend of mine got a Wii to play with his dad (OK, he needed some fun after finishing his PhD). I was sure I would never spend my time gaming again, but now I might reconsider…

I came across this impressive PS3 EyeToy demo. The realism of the graphics left me wondering… these ducks are absolutely cool!

Apart from the graphics, the interesting part is the interaction techniques. The Wii, I guess, is the leader in this area with Wii Fit and the Wii Remote (Wiimote). The Wii Remote uses accelerometers, motion and optical sensing! Both are really impressive; gamepads are over!

Coolness factor: 5!

 

Augmented reality rocks!

Filed under: HCI,interfaces — perak @ 8:18 am

During the MMSP 2007 conference, I came across an augmented reality demo, which I tried out, and it was really fun. I tell you, augmented reality will soon become a very hot topic, especially for mobiles.

Two impressive augmented reality demos on YouTube:

An introductory demo about the technology
The Total Immersion demo, really cool!

Coolness factor: 5!