Manolis Perakakis world

News, diary, journal, whatever

Prime time for Distributed Speech Recognition? February 23, 2009

While an undergraduate student a few years ago, I worked on Distributed Speech Recognition (DSR). The main purpose of DSR is to compress the acoustic features used by a speech recognizer and transmit them over a data (instead of voice) network, thus saving bandwidth (which is cost-effective) and allowing the use of full speech recognition on mobile terminals. Because it compresses acoustic features for speech recognition purposes (not for speech signal transmission or reproduction), it can achieve very low bit rates. You can think of it as analogous to what MP3 is for music transmission and storage.

Depicted next is a simple overview of a DSR architecture (model 2). Note that the mobile terminals depicted are Symbian’s reference devices corresponding to smartphone, handheld and PDA respectively. (Oops, the images are too old, they date back to 2001; I should upgrade to something like an iPhone or Android …)

My work with Prof. V. Digalakis concluded that one can successfully take advantage of DSR with a coding rate of only 2 kbps, which is an extremely low data rate. After that I ported the DSR engine to a Zaurus Linux PDA (a 16 MB device with a 200 MHz StrongARM processor) and made it work in real time.
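As a rough illustration of what a 2 kbps budget means, here is a minimal Python sketch. It assumes 100 feature frames per second and 13 coefficients per frame (typical front-end values, not figures from the work described above), and it uses a toy uniform scalar quantizer with a made-up bit allocation. Real DSR coders generally rely on more elaborate schemes such as vector quantization, so every name and number below is hypothetical.

```python
# Illustrative only: the frame rate, feature count, value range and bit
# allocation are assumptions, not the parameters of the DSR coder above.

def bits_per_frame(bit_rate_bps, frames_per_second):
    """Bits available to encode one feature frame."""
    return bit_rate_bps // frames_per_second

def quantize(value, lo, hi, bits):
    """Uniform scalar quantizer: map a value in [lo, hi] to an integer index."""
    levels = (1 << bits) - 1          # highest index for this bit width
    clipped = min(max(value, lo), hi)  # clamp out-of-range values
    return round((clipped - lo) / (hi - lo) * levels)

def dequantize(index, lo, hi, bits):
    """Inverse of quantize: map an index back to a value in [lo, hi]."""
    levels = (1 << bits) - 1
    return lo + index / levels * (hi - lo)

# 2 kbps at an assumed 100 frames per second leaves 20 bits per frame.
budget = bits_per_frame(2000, 100)

# Hypothetical allocation for 13 coefficients: more bits for the first ones.
allocation = [3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1]

# A made-up feature frame, assumed to be normalized to [-1, 1].
frame = [0.5, -0.25, 0.1, 0.9, -0.7, 0.0, 0.3, -0.3, 0.6, -0.6, 0.2, -0.1, 0.05]

indices = [quantize(f, -1.0, 1.0, b) for f, b in zip(frame, allocation)]
decoded = [dequantize(i, -1.0, 1.0, b) for i, b in zip(indices, allocation)]

print("bits per frame:", budget)
print("bits used:", sum(allocation))
print("indices:", indices)
```

With only 20 bits per frame, most coefficients get one or two bits each, which is why practical coders quantize groups of coefficients jointly instead of one at a time.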

Although my recent work focuses on multimodal (speech) interfaces, I still keep an eye on DSR. It seems that with the emergence of powerful mobile terminals and Google's announcement of speech recognition support for Android and iPhone, DSR might soon become a hot topic!

P.S. I just found out my DSR page is ranked 3rd by Google after W3C and ETSI. Holy moly!

Coolness factor: ?

 

My 15 minutes of fame! March 11, 2008

Filed under: HCI,interfaces, Multimodal, Speech, technology — perak @ 8:41 pm

Our work at the Telecommunications Lab of the Technical University of Crete (TUC) was featured in the “Orizontes” documentary series of the Kydon TV channel. We presented some of our demos:

  • My work on multimodal interfaces (part of my PhD), including a multimodal (GUI + speech) travel reservation application running on a Zaurus Linux PDA.
  • The automatic video summarizer system (part of the MUSCLE NoE European research project showcases).
  • An audio-visual (AV) recognition system (also part of the MUSCLE NoE European research project showcases).
  • The multi-mic robust speech recognition demo (part of the Hiwire European research project showcases).

We could not showcase the augmented-reality demo we developed in cooperation with VTT (speech recognition integration), since we currently lack the appropriate hardware. I hope we get it soon.

Some of these demos will go public, either as videos posted on YouTube or as open source releases on SourceForge/Google Code.

More on this, as well as a more detailed description of the demos, in future posts!

Stay tuned!

 

Aibo, Lego mindstorms, Wii remote (wiimote), iPhone & Google’s Android! March 11, 2008

Filed under: HCI,interfaces, Multimodal, Speech, programming, robotics — perak @ 8:12 pm

What do all these have in common? They will be my playground for a while …

I will have the chance to play with all of them during this semester!

As far as Aibo and Mindstorms are concerned, I will use them for the two robotics-related courses I have enrolled in. Some possible projects I am thinking of:

  • Distributed speech recognition (DSR): enhance the limited speech recognition capabilities of the Aibo by exploiting the wireless link and a speech recognition server.
  • Distributed image processing: enhance Aibo’s limited machine vision capabilities by exploiting the wireless link and a machine vision server (similarly to DSR).
  • Robot localization using multiple input modalities: machine vision + audio.
  • Enhanced gesture-based or multimodal (speech + gesture) interfaces.

Wiimote hacks for enhanced HCI, similar to these demos from CMU.

The iPhone will be used to add the gesture modality to my speech & GUI multimodal interface prototype, which is already running on the Zaurus PDA.

Finally, I can’t resist playing with Google’s new Android platform and porting various apps I have in mind.

Whoa, my hacker alter ego will definitely be back for good!!!

 

 

Opera prepares for version 10? July 29, 2006

Filed under: HCI,interfaces, Multimodal, technology, web — perak @ 8:42 am

According to this C|net article, Opera is preparing version 10 of its successful browser.

Wow, it hasn’t been long since I updated to version 9. Opera has been the ONLY non-open-source program I use on my Linux boxes for many years now! It combines rock-solid functionality with an excellent interface.

I really like its simple yet intuitive and extremely configurable interface.
(Well, in terms of design, it reminds me of Google’s simple interface – simple is beautiful!)
Opera was one of the first apps to use mouse gestures, and they have also built a multimodal-enabled version in cooperation with IBM (I should test this sometime on my Zaurus).

It’s standards-compliant, blazing fast (compared to the pre-Firefox Mozilla days) and secure (vs IE). I have 5000 Opera bookmarks and a big mailbox, and I can find anything in milliseconds!

In terms of functionality, a Jabber IM plug-in would make it almost complete!

<The company expects version 10 to work on and across any platform>
I already have Opera running on my Zaurus PDA and my K700 mobile! – Opera Mini is really cool! This gives Opera a strategic advantage. I think the desktop browser wars don’t matter any more; mobile browsing is the next frontier.

<Opera is aiming for a day when people needn’t use a full desktop operating system, instead using a browser and Web applications for most tasks>
This is another cool idea, especially for the mobile space: the browser is the computer! And this widget idea is really promising there too.

<There is also a big push in the company toward creating developer tools>
Attracting developers to its small but dedicated community would be a huge plus. Go for it Opera!

These Norwegian trolls are really cool!

Coolness factor : 4.5