relatively easy to manage. For more information see [Vogler, 2000].
Thad Starner and his group at the Georgia Institute of Technology, Atlanta, USA,
are working on several projects in American Sign Language recognition. They use
multiple sensors for the recognition, among others a hat-mounted video camera
and accelerometers with three degrees of freedom mounted on the wrist and torso
to complement the information from the video camera. For monitoring purposes, the deaf user
wears a head-mounted display which shows what the camera captures [Brasher,
2003]. The aim of the activities is a flexible mobile system for the output of text or
speech, depending on the application. Figure 2.8 shows the head-mounted camera
and a recorded gesture.
Figure 2.8 Baseball-cap-mounted camera and a recorded gesture
(with kind permission of Thad Starner, Media Lab, MIT).
Visual and audio-visual speech recognition based on
face or lip reading
A methodology quite similar to the gesture recognition mentioned before is
automatic face reading or lip reading. The result is a text sequence which
represents the content of the utterance. Figure 2.9 shows the region that is
investigated for lip reading.
Figure 2.9 The region of interest of the video facial image.
The automatic recognition of facial images has been used for a number of years
to improve (spoken) speech recognition under noisy conditions, and it has proved
very successful [Kraiss, 2006], [Moura, 2006], although the accuracy obtained with
purely visual speech recognition is not as high as with audio speech recognition.
There are a number of reasons for this; one is that visual speech is partially
phonetically ambiguous.
Nevertheless, for the communication between deaf and normally hearing persons,
face or lip reading is a very valuable help and, as previously mentioned, the
human face can optimally express emotions, and this information is detectable by
the visual recognizer.
Preliminary small-vocabulary trials [Moura, 2006] have reported word recognition
rates of about 65% for a single-speaker lip-reading task with grammar correction.
Interestingly, the performance of a professional observer was in the range of
70%-80% on the same corpus. Figure 2.10 shows the situation under severe noise
conditions and demonstrates the advantage (in terms of word error rate – WER) of
a simple combination in a multi-stream recognition approach [Moura, 2006].
Figure 2.10 Variation of the total word error rate as a function of the signal-to-noise ratio (WER of the audio-only, video-only and combined audio-visual recognizers, from the original recording down to SNRs of -5 dB, -10 dB, -15 dB and -20 dB; WER axis from 0% to 60%).
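The combination itself can be illustrated, in a strongly simplified way, by a weighted late fusion of the scores of the audio and the video recognizer. The following Python sketch is only an illustration of the principle, with invented numbers; it is not the actual combination scheme of [Moura, 2006], where the stream weight would in practice be tuned to the noise level.

    import numpy as np

    def fuse_streams(logp_audio, logp_video, audio_weight=0.5):
        """Weighted late fusion of per-hypothesis log-likelihoods from two streams."""
        return audio_weight * logp_audio + (1.0 - audio_weight) * logp_video

    # Toy example: scores for three competing word hypotheses.
    logp_audio = np.log([0.50, 0.30, 0.20])   # audio stream, degraded by noise
    logp_video = np.log([0.15, 0.70, 0.15])   # video (lip-reading) stream
    fused = fuse_streams(logp_audio, logp_video, audio_weight=0.3)  # trust video more in noise
    print("best hypothesis:", int(np.argmax(fused)))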
Correction of speech defects, unintelligible speech
If a person is unable to speak ‘normally’, resulting in unsatisfactory intelligibility, a
combined speech recognition and synthesis system can be a valuable aid. The impaired
speech is the input for the recognizer, which converts it into text, and the text is
then converted into clean synthetic speech.
It is very important to state that even totally unintelligible speech, or indeed any
acoustic utterance, can be recognized; the only prerequisite is the ability of the
‘speaker’ to reproduce utterances with sufficient consistency and to train the
recognizer with this kind of ‘vocabulary’. As a matter of fact, even emotions can be
expressed, using emotional speech synthesis. Finally, visual speech recognition, as
mentioned before, can contribute significantly to better speech recognition.
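As a rough illustration of this recognize-then-resynthesize idea, the sketch below chains a generic recognizer and a text-to-speech engine; the Python packages speech_recognition and pyttsx3 are assumptions made for the example, and a real aid would use a recognizer trained on the individual speaker's own utterances, as described above.

    import speech_recognition as sr   # assumed third-party package
    import pyttsx3                    # assumed third-party package

    recognizer = sr.Recognizer()
    tts = pyttsx3.init()

    with sr.Microphone() as source:          # capture the impaired speech
        audio = recognizer.listen(source)

    # A speaker-dependent model trained on the user's own 'vocabulary' would be
    # needed in practice; a generic recognizer is used here only for illustration.
    text = recognizer.recognize_google(audio)

    tts.say(text)                            # re-speak the message as clean synthetic speech
    tts.runAndWait()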
A system for speech therapy
It is well known that many deaf persons have fully functioning speech organs but
the problem is that they cannot control articulation because they do not have
acoustic feedback through the ears.
When deafness occurs after complete language/speech acquisition, the deaf person
can maintain (with restrictions) his or her speaking ability with the help of a speech
therapist. But this requires permanent training with a therapist, which is obviously
not always possible.
Many attempts have been made to develop systems which provide visual feedback
on a spoken utterance. The time signal or the spectrum of the speech is not very
suitable for this because the relation between the sound production and the
resulting signal is rather complicated and abstract.
A better solution is obviously a face animation showing two speaking faces: the
‘reference’ face and the (deaf) speaker’s face. Thus the deaf person can directly see
deviations between the two faces and try to adapt. Since some sounds are
produced invisibly inside the mouth, as mentioned earlier, a useful aid is a
transparent mouth region (Figure 2.11).
Figure 2.11 Face animation with a transparent area of the mouth region [Pritsch, 2005].
Screen readers for blind or partially sighted persons
The usual computer desktop metaphor practically leaves blind persons out because
it is a Graphical User Interface (GUI), based on a more or less rich graphical display
of icons, windows, pointers and text. Since blind persons require non-visual media,
the alternative is, besides tactile information (Braille), primarily an aural interface
which, by analogy with the GUI, can be called an Aural User Interface (AUI),
following the terminology supported by many authors including T. V. Raman
[Raman, 1997].
Since the early 1980s, after some trials with special versions of self-voicing
software capable of driving a speech synthesizer and so providing access for blind
persons, a more general concept appeared and a family of applications, called
screen readers, emerged with the purpose of creating a vocal rendering of the
contents of the screen, under user control through the keyboard, using a text-to-
speech converter [Wikipedia]. Properly installed screen reader software stays
active in the operating system and operates in the background, analysing the
current contents of the screen. From the initial command-line interface (CLI) to the
now ubiquitous graphical user interface (GUI), screen reader software has evolved
considerably over two and a half decades.
Screen readers can also analyse many visual constructs like menus and alert or
dialogue boxes and transform them into speech to allow interaction with a blind
user.
Navigation on the screen is also possible, allowing a non-linear or even random
exploration and acquisition of the displayed information. Control of the produced
speech is normally given to the user, so that quite fast navigation becomes possible
when the user works with shortcuts. A simulation of a screen reader is available at
the WebAIM website [WebAIM].
Although many screen reader applications exist, there are many limitations that
current screen readers cannot overcome by themselves, for instance those related
to images and structured text (tables etc.). Screen readers cannot describe images;
they can only read out a textual description of them, and the user has difficulty
working out how the page is organized.
The basic requirement in terms of speech processing for screen reader applications
is a robust text-to-speech converter capable of spelling and reading individual
characters as well as all kinds of text elements that may appear, such as numeric
expressions, abbreviations, acronyms and other coded elements. Punctuation is
also generally spoken, besides being decisive in introducing some prosodic
manipulation into the synthetic voice.
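A small part of this front-end processing, the expansion of abbreviations and digits into speakable text and an optional character-by-character spelling mode, can be sketched as follows; the lexicon and the function names are hypothetical and only illustrate the kind of normalization a screen reader's text-to-speech converter has to perform.

    import re

    ABBREVIATIONS = {"etc.": "et cetera", "e.g.": "for example", "Dr.": "doctor"}
    DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
              "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

    def normalize_for_tts(text, spell_out=False):
        """Expand abbreviations and digits so a TTS engine can speak them."""
        if spell_out:                       # character-by-character reading mode
            return " ".join(text)
        for abbr, expansion in ABBREVIATIONS.items():
            text = text.replace(abbr, expansion)
        # Read digits individually (a simplification; real systems distinguish
        # dates, ordinals, currency and other numeric expressions).
        return re.sub(r"\d", lambda m: " " + DIGITS[m.group()] + " ", text)

    print(normalize_for_tts("Dr. Silva, room 42, etc."))
    print(normalize_for_tts("WCAG", spell_out=True))   # -> "W C A G"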
Following this idea, the World Wide Web Consortium (W3C), with the issue of the
Cascading Style Sheets 2 (CSS2) recommendation in 1998, introduced Aural
Cascading Style Sheets (ACSS); a chapter on the acoustical rendering of a web page
is presented in [WDAC].
Auditory icons, sometimes also called earcons, are made audible to the user by
means of a loudspeaker or earphone system that should have advanced acoustic
features (high quality, stereo etc.). The acoustic elements contain voice properties
like speech-rate, voice-family, pitch, pitch-range, stress, and others that are used as
command parameters to the speech synthesizer.
An extended investigation of spatial acoustic features as a component of a screen
reader was performed in the GUIB (Graphical User Interfaces for the Blind) project
within the framework of the European TIDE initiative [Crispien, 1995]. The idea was
to generate an acoustic screen in front of the user on which windows, icons and
other graphic elements are audible at different places, and the mouse position is
also made audible while the mouse is moving.
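As a very rough illustration of how a graphic element can be placed on such an ‘acoustic screen’, the sketch below maps a horizontal screen coordinate to constant-power stereo gains; the GUIB project used considerably more advanced spatial audio, so this is only a conceptual simplification with invented names.

    import math

    def stereo_gains_for_position(x, screen_width):
        """Map a horizontal screen position to left/right loudspeaker gains."""
        pos = min(max(x / screen_width, 0.0), 1.0)   # 0.0 = left edge, 1.0 = right edge
        angle = pos * math.pi / 2                    # constant-power pan law
        return math.cos(angle), math.sin(angle)      # (left gain, right gain)

    # An icon near the right-hand edge of a 1920-pixel-wide screen:
    left, right = stereo_gains_for_position(x=1600, screen_width=1920)
    print(f"left gain {left:.2f}, right gain {right:.2f}")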
In an earlier project (AudioBrowser, 2003-2005, see [Repositorium]), developed for
Portuguese but applicable to most other languages, the structure or outline of a
web page is discovered and used as a table of contents; this was implemented
successfully. In this application the user can freely navigate inside the contents of
each window or jump between windows, from contents to tables of contents or
vice versa, in order to scan or navigate through the page in a more structured and
friendly way. The blind or low-vision user is constantly helped by the text-to-speech
device, which follows the navigation accurately.
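The underlying idea of extracting the outline of a page and offering it as a spoken table of contents can be sketched with Python's standard HTML parser; the class below is a hypothetical illustration, not code from the AudioBrowser project.

    from html.parser import HTMLParser

    class OutlineExtractor(HTMLParser):
        """Collect heading texts so they can be voiced as a table of contents."""
        HEADINGS = {"h1", "h2", "h3"}

        def __init__(self):
            super().__init__()
            self.outline = []
            self._current = None

        def handle_starttag(self, tag, attrs):
            if tag in self.HEADINGS:
                self._current = (tag, [])

        def handle_data(self, data):
            if self._current is not None:
                self._current[1].append(data)

        def handle_endtag(self, tag):
            if self._current is not None and tag == self._current[0]:
                level = int(tag[1])
                text = "".join(self._current[1]).strip()
                self.outline.append((level, text))
                self._current = None

    page = "<h1>News</h1><p>...</p><h2>Weather</h2><p>...</p><h2>Sports</h2>"
    parser = OutlineExtractor()
    parser.feed(page)
    for level, text in parser.outline:
        print("  " * (level - 1) + text)   # indented outline, ready to be spoken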
The W3C, through its Web Accessibility Initiative (WAI), has been issuing a relevant
set of Web Content Accessibility Guidelines (WCAG), now in version 2. These
guidelines are greatly helpful in orienting web page design towards accessibility
[WAI]. The Authoring Tool Accessibility Guidelines (ATAG), nowadays in version 2.0,
are also important for developers of authoring tools.
Reproduction of complex documents for blind persons
Complex documents such as mathematical and other scientific, technical or even
didactic documents usually contain graphical representations. Above all, equations
and other mathematical expressions have posed a substantial barrier to access by
visually impaired persons. Most graphical representations and charts may also be
included in this group.
Representing complex mathematical elements in special Braille codes can almost
totally solve the problem for blind persons. The LAMBDA project [LAMBDA, 2005]
has produced a mathematical rendering package using such a system.
In the case of lengthier mathematical objects, more refined solutions may be
preferable, using audio rendering of the mathematical expressions through
synthetic speech. Starting from the codification of the expression in MathML, a
browsable textual description of the expression can be automatically derived from
the MathML code by means of a special lexicon and a grammar. Both must be
specially designed for the purpose, according to mathematical conventions and the
need for a non-ambiguous textual description. This work was carried out in the
AUDIOMATH project [Ferreira, 2005] at the Faculdade de Engenharia da
Universidade do Porto. A demonstration page is available at [Ferreira].
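How a textual description can be derived from MathML may be illustrated by the toy sketch below; the small lexicon and the English phrasing are invented for the example and are far simpler than the lexicon and grammar developed in AUDIOMATH.

    import xml.etree.ElementTree as ET

    # A toy lexicon; a real system uses a much richer, language-specific lexicon.
    OPERATORS = {"+": "plus", "-": "minus", "=": "equals"}

    def speak(node):
        """Recursively turn a (simplified) MathML element into a spoken string."""
        tag = node.tag.split("}")[-1]          # drop a namespace prefix, if any
        if tag in ("mi", "mn"):                # identifiers and numbers are read as-is
            return node.text.strip()
        if tag == "mo":                        # operators go through the lexicon
            return OPERATORS.get(node.text.strip(), node.text.strip())
        if tag == "msup":                      # superscript: base then exponent
            base, exp = list(node)
            return f"{speak(base)} to the power of {speak(exp)}"
        # generic rows/containers: read children left to right
        return " ".join(speak(child) for child in node)

    mathml = "<mrow><msup><mi>x</mi><mn>2</mn></msup><mo>+</mo><mn>1</mn></mrow>"
    print(speak(ET.fromstring(mathml)))        # -> "x to the power of 2 plus 1"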
Acoustical cues, contributing to the clarity of the speech rendering, are also
important. Previous authors have used, for instance, prosodic modifications such
as raising or lowering the pitch of the synthetic voice to signal upper or lower parts
of the expression, respectively. In AUDIOMATH the influence of pitch movements
as well as of pauses during the description of expressions was studied and rules
were extracted. An intra-formula navigation mechanism was designed to allow
users to explore a formula at their own will, thereby not putting too much stress
on auditory memory in the case of longer formulas.
2.2.2.3 Conclusions and future developments
The aim of this chapter was to show how electronic speech processing works and
how persons with disabilities can benefit from it.
Since speech is man’s most important form of communication, all efforts must be
made to make speech communication possible, and if the speech channel is
disturbed, technical solutions have to be found to overcome the obstacles.
The accuracy and quality of modern speech recognition and synthesis systems
have reached a state of maturity which allows the development of very powerful
support systems for persons with disabilities, bridging the gap between these
persons and those without disabilities, as was shown, for example, for
communication between deaf persons and the rest of the world.
Looking into the future of speech technology, some important research areas can
be identified as follows:
• Improving the robustness of speech recognition systems. Although robustness
has improved remarkably over recent years, the systems still fall far behind
human capabilities. Noise, especially non-stationary noise, background speakers
or music can still push the recognition error rate well above an acceptable level.
Improvement is expected (and, as has been seen, partly proven) from multimodal
recognition which also includes visual information (above all facial expressions
and facial and hand gestures)
• A more extended use of semantic and pragmatic information. When the
system (recognizer or synthesizer) ‘knows’ what the speaker wants to
express, covering both the content and the emotion, the recognizer can
usefully complete a spoken message which contains recognition errors. A
synthesizer could automatically generate the right accentuation and
emotional ‘colouring’ of the speech. For the sake of completeness it has
to be mentioned here that the permanent improvement of the quality of
synthetic speech also includes multilinguality as well as speaker-specific
synthesis and will remain within the scope of research. Audio rendering of
complex documents through synthetic speech is also a very important
development area, where document description strategies, their conversion
into full text form and intra-document navigation or browsing are the crucial
steps
• A challenging and wide field of research is sign language recognition. As
mentioned earlier, there are several research activities but much more work
has to be done. More needs to be known about the structures of sign
languages (and there are very many, and all are different!) and their relations
to spoken and written languages. Automatic translation should be possible
in different directions (sign language into speech and vice versa, sign
language into another sign language, speech into a foreign sign language
and vice versa, for example German speech into American Sign Language).
The technical part of the problem is also challenging. Using the Ambient
Intelligence (AmI) approach, we can expect micro cameras in the clothes or
in a pendant as well as position sensors in finger rings etc., and the
environment will have enough intelligence to take on most of the processing
activities needed for recognition and translation
• For blind persons, screen readers and the automatic recognition of graphics,
pictures and the environment are a never-ending research area. As a matter
of fact, for blind persons a verbal (spoken) description of the recognition
result is, in many cases, the best solution. As before, AmI will be of crucial
importance here.
It should be mentioned here that the enumeration given in this chapter is far from
complete. Further examples will be given in other chapters, showing that speech
technology and speech applications will play a dominant role whenever
communication is discussed.
2.2.2.4 References
BARROS, M.J., MAIA, R., TOKUDA, K., RESENDE, F.G., FREITAS, D., (2005). HMM-
based European Portuguese TTS System. Paper presented and published in the
proceedings of Interspeech'2005 - Eurospeech, 9th European Conference on Speech
Communication and Technology, Lisbon.
BOTINIS (ed.) et al., (1997). Intonation: Theory, Models and Applications.
Proceedings of the ESCA Workshop, Sept. 18-20, Athens, Greece.
BRASHER, H., STARNER, T. et al., (2003). Using Multiple Sensors for Mobile Sign
Language Recognition. ISWC 2003, White Plains, NY.
Also: http://www-static.cc.gatech.edu/~thad/031_research.htm
BURGHARDT, F. et al., (2006). Examples of synthesized emotional speech.
http://emosamples.syntheticspeech.de/
CRISPIEN, K., FELLBAUM, K., (1995). Use of Acoustic Information in Screen
Reader Programs for Blind Computer Users: Results from the TIDE Project GUIB.
In: Placencia Porrero, I., & de la Bellacasa, R. P. (Eds.): The European Context for
Assistive Technology - Proceedings of the 2nd TIDE Congress, Paris. IOS Press,
Amsterdam.
DELLER, J.R., (2000). Discrete-time Processing of Speech Signals.
New York: Institute of Electrical and Electronics Engineers.
DRAGON Naturally Speaking Professional Engine, (2006). NUANCE
communications http://www.nuance.com/naturallyspeaking/.
FERREIRA, H., FREITAS, D., (2005). AudioMath - Towards Automatic Readings of
Mathematical Expressions. 11th International Conference on Human Computer
Interaction, Las Vegas, USA.
FERREIRA. http://lpf-esi.fe.up.pt/~audiomath
FURUI, S., (2001). Digital Speech Processing, Synthesis, and Recognition.
2nd ed., revised and expanded. New York: Marcel Dekker.
GARDNER-BONNEAU, D., (1999). Human Factors and Voice Interactive Systems.
Kluwer Academic Publishers, Boston.
HUMAINE, Network of Excellence. http://emotion-research.net/aboutHUMAINE.
iCommunicator homepage. http://www.myicommunicator.com/.
IIDA, A., CAMPBELL, N., YASUMURA, M., (1998). Emotional Speech as an
Effective Interface for People with Special Needs. Proceedings of the Third Asian
Pacific Computer and Human Interaction (APCHI), p. 266.
JEKOSCH, U., (2005). Voice and Speech Quality Perception. Springer-Verlag
Berlin, Heidelberg.
KRAISS, K.F., (ed.), (2006). Advanced Man-Machine Interaction. Springer Berlin
Heidelberg, New York.
SYNFACE project research page http://www.speech.kth.se/synface/.
LEE, C.M., PIERACCINI, R., (2002). Combining Acoustic and Language
Information for Emotion Recognition. Proc. of the International Conference on
Spoken Language Processing (ICSLP 2002), Denver, CO.
LAMBDA (2005). http://www.lambdaproject.org/.
MBROLA website http://tcts.fpms.ac.be/synthesis/.
MOESLUND, T., NORGAARD, L., (2003). A Brief Overview of Hand Gestures used
in Wearable Human Computer Interfaces. Technical Report CVMT 03-02,
Computer Vision and Media Technology Lab., Aalborg University, DK.
MOURA, A., PÊRA, V., FREITAS, D., (2006). Um Sistema de Reconhecimento
Automático de Fala para Pessoas Portadoras de Deficiência (in Portuguese).
Paper published in the proceedings of the IBERDISCAP'06 conference, Vitória-ES,
Brazil.
MUSSLAP. University of West Bohemia, MUSSLAP website
http://www.musslap.zcu.cz/en/audio-visual-speech-recognition/.
PRITSCH, M., (2005). Visual speech training system for deaf persons. Proceedings
of the 16th Conference Joined with the 15th Czech-German Workshop "Speech
Processing", Prague, Sept. 26-28, 2005. TUDpress, Dresden, Germany.
RAMAN, T.V., (1997). Auditory User Interfaces, Kluwer Academic Publishers,
August.
RAMAN, T.V., (1998). Conversational gestures for direct manipulation on the
audio desktop. Proceedings of the Third International ACM Conference on
Assistive Technologies, Marina del Rey, California, USA, pp. 51-58.
ISBN 1-58113-020-1.
REPOSITORIUM. https://repositorium.sdum.uminho.pt/bitstream/
1822/761/4/iceis04.pdf#search=%22audiobrowser%22
SPROAT, R. (ed.), (1998). Multilingual Text-to-Speech Synthesis. Kluwer Academic
Publishers, Dordrecht, Boston, London.
SYNFACE - Synthesised talking face derived from speech for hard of hearing
users of voice channels
http://www.speech.kth.se/synface/ and http://www.synface.net/.
SYNTHESIS TESTSITE, AT&T. http://www.research.att.com/~ttsweb/tts/demo.php.
VARY, P., MARTIN, R., (2006). Digital Speech Transmission. Enhancement, Coding
and Error Concealment. J. Wiley&Sons.
VOGLER, C. et al. A Framework for Motor Recognition with Applications to
American Sign Language and Gait Recognition.
http://www.cis.upenn.edu/~hms/2000/humo00.pdf
see also Vogler’s homepage http://gri.gallaudet.edu/~cvogler/research/.
WAI. Web accessibility homepage. http://www.w3.org/WAI/
WDAC (1999). Aural Cascading Style Sheets (ACSS), W3C Working Draft
http://www.w3.org/TR/WD-acss.
WebAIM Screen Reader Simulation.
http://www.webaim.org/simulations/screenreader.php
WIKIPEDIA. Screen reader. http://en.wikipedia.org/wiki/Screen_reader
WISDOM project page. http://www.bris.ac.uk/news/2001/wisdom.htm.
2.3 New remote services
2.3.1 Novel broadband-based services: new opportunities
for people with disabilities
Broadband trials by the National Post and Telecom Agency
(Post- och telestyrelsen PTS), in Sweden
Patrik Bystedt
PTS's seven broadband trials
Broadband technology has become accessible for a steadily increasing proportion
of the population in Sweden. With the aid of more rapid data transmission it has
become possible to send and receive large quantities of information via computer
ne