
relatively easy to manage. For more information see [Vogler, 2000].

Thad Starner and his group at the Georgia Institute of Technology, Atlanta, USA, are working on several projects in American Sign Language recognition. They use multiple sensors for recognition, among them a hat-mounted video camera and accelerometers with three degrees of freedom mounted on the wrist and torso to complement the information from the video camera. For control purposes, the deaf user wears a head-mounted display which shows what the camera captures [Brasher, 2003]. The aim of these activities is a flexible mobile system for the output of text or speech, depending on the application. Figure 2.8 shows the head-mounted camera and a recorded gesture.

Figure 2.8 Cap-mounted camera and a recorded gesture (with kind permission of Thad Starner, Media Lab, MIT).

Visual and audio-visual speech recognition based on

face or lip reading

A methodology quite similar to the gesture recognition mentioned before is automatic face reading or lip reading. The result is a text sequence which represents the content of the utterance. Figure 2.9 shows the region that is investigated for lip reading.


Figure 2.9 The region of interest of the video facial image.
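To make the processing concrete, the following sketch (not taken from the cited work) shows one simple way to isolate such a region of interest, assuming OpenCV with its bundled Haar cascade for frontal faces; treating the lower third of the detected face box as the mouth region is an illustrative simplification.

```python
# Minimal sketch: locate a face and crop a rough mouth region of interest.
# Assumes the opencv-python package; the "lower third of the face box"
# heuristic is an illustrative assumption, not the cited method.
import cv2

def mouth_roi(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                      # no face found in this frame
    x, y, w, h = faces[0]
    return image_bgr[y + 2 * h // 3 : y + h, x : x + w]
```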

The automatic recognition of facial images has been used for a number of years to improve (spoken) speech recognition under noisy conditions, and it has proved very successful [Kraiss, 2006], [Moura, 2006], although the accuracy obtained with purely visual speech recognition is not as high as in audio speech recognition. There are a number of reasons for this; one is that visual speech is partially phonetically ambiguous.

Nevertheless, for communication between deaf and normal-hearing persons, facial or lip reading is a very valuable help and, as previously mentioned, the human face can optimally express emotions, and this information is detectable by the visual recognizer.

Preliminary small-vocabulary trials [Moura, 2006] have reported word recognition rates of about 65% for a single-speaker lip-reading task with grammar correction. Interestingly, the performance of a professional observer was in the range of 70%-80% for the same corpus. Figure 2.10 shows the situation under strong noise conditions and demonstrates the advantage (in terms of word error rate, WER) of a simple combination in a multi-stream recognition approach [Moura, 2006].

[Bar chart: word error rate (0% to 60%) of the audio-only (WER-Audio), video-only (WER-Video) and combined audio-visual (WER-AV) recognizers for the original recording and at -5 dB, -10 dB, -15 dB and -20 dB signal-to-noise ratio.]

Figure 2.10 Variation of the total word error rate as a function of the signal-to-noise ratio.
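The benefit of the multi-stream approach can be illustrated with a small sketch of score fusion: each stream scores the candidate words, and the audio stream is weighted according to how reliable it is at the current signal-to-noise ratio. The weighting scheme and the numbers below are invented for illustration and are not those of [Moura, 2006].

```python
# Illustrative sketch of a simple multi-stream combination: per-word
# log-likelihoods from the audio and video recognizers are fused with a
# weight reflecting how much the audio channel can currently be trusted.

def fuse_scores(audio_scores, video_scores, audio_weight):
    """Combine per-word log-likelihoods from the two streams."""
    fused = {}
    for word in audio_scores:
        fused[word] = (audio_weight * audio_scores[word]
                       + (1.0 - audio_weight) * video_scores[word])
    return max(fused, key=fused.get)

# Example: at a very low SNR the audio stream is given little weight.
audio = {"yes": -4.2, "no": -3.9}
video = {"yes": -2.1, "no": -5.0}
print(fuse_scores(audio, video, audio_weight=0.2))  # -> "yes"
```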


Correction of speech defects, unintelligible speech

If a person is unable to speak 'normally', resulting in unsatisfactory intelligibility, a speech recognition and synthesis system can be a valuable aid. The impaired speech is the input to the recognizer, which converts it into text, and the text is then converted into clean synthetic speech.

It is very important to state that even totally unintelligible speech, or indeed any acoustic utterance, can be recognized; the only prerequisite is that the 'speaker' is able to reproduce utterances with sufficient similarity and to train the recognizer with this kind of 'vocabulary'. In fact, even emotions can be expressed, using emotional speech synthesis. Finally, visual speech recognition, as mentioned before, can contribute significantly to better speech recognition.
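The structure of such an aid is essentially a recognize-then-resynthesize pipeline. The sketch below only illustrates this flow; the recognizer and synthesizer objects are hypothetical placeholders for a speaker-dependent recognizer trained on the user's own utterances and for any text-to-speech engine.

```python
# Minimal sketch of the recognize-then-resynthesize idea described above.
# The recognizer and synthesizer are hypothetical placeholder objects.

class SpeechCorrectionAid:
    def __init__(self, recognizer, synthesizer):
        self.recognizer = recognizer      # trained on the impaired speech
        self.synthesizer = synthesizer    # produces clean synthetic speech

    def relay(self, audio_frames):
        text = self.recognizer.transcribe(audio_frames)
        if not text:
            return None                   # unrecognized utterance: stay silent
        return self.synthesizer.speak(text)
```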

A system for speech therapy

It is well known that many deaf persons have fully functioning speech organs; the problem is that they cannot control articulation because they lack acoustic feedback through the ears. When deafness occurs after language and speech acquisition is complete, the deaf person can maintain (with restrictions) his or her speaking ability with the help of a speech therapist. However, this requires permanent training with a therapist, which is obviously not always possible.

Many attempts have been made to develop systems which provide visual feedback on a spoken utterance. The time signal or the spectrum of the speech is not very suitable for this purpose, because the relation between sound production and the resulting signal is rather complicated and abstract.

A better solution is obviously a face animation showing two speaking faces: the 'reference' face and the (deaf) speaker's face. Thus the deaf person can directly see deviations between the two faces and try to adapt. Since some sounds are produced invisibly inside the mouth, as mentioned earlier, a transparent mouth region is a useful aid (figure 2.11).

Figure 2.11 Face animation with a transparent area of the mouth region [Pritsch, 2005].
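A feedback signal for such a system can be as simple as the deviation between the learner's articulation and that of the reference animation. The sketch below assumes both are available as arrays of mouth landmark coordinates; this representation is an assumption for illustration, not the method of [Pritsch, 2005].

```python
# Sketch (under assumed data structures) of a possible feedback signal:
# the deviation between the learner's mouth landmarks and those of the
# reference animation for the same sound.
import numpy as np

def articulation_error(reference_landmarks, learner_landmarks):
    """Mean Euclidean distance between corresponding mouth landmarks.

    Both arguments are arrays of shape (N, 2) holding N landmark
    coordinates; a large value tells the learner to adapt."""
    ref = np.asarray(reference_landmarks, dtype=float)
    obs = np.asarray(learner_landmarks, dtype=float)
    return float(np.mean(np.linalg.norm(ref - obs, axis=1)))
```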


Screen readers for blind or partially sighted persons

The usual computer desktop metaphor practically leaves blind persons out because it is a Graphical User Interface (GUI), based on a more or less rich graphic display of icons, windows, pointers and text. Since blind persons require non-visual media, the alternative is, besides tactile information (Braille), primarily an aural interface, which can be called, analogous to the GUI, an Aural User Interface (AUI), following the terminology supported by many authors including T.V. Raman [Raman, 1997].

Since the early 1980s, after some trials with special versions of self-voicing software capable of driving a speech synthesizer and so providing access for blind persons, a more general concept appeared: a family of applications called screen readers, whose purpose is to create a vocal rendering of the contents of the screen, under user control through the keyboard, using a text-to-speech converter [Wikipedia]. Properly installed screen reader software stays active in the operating system, operates in the background and analyses the actual contents of the screen. Screen reader software has evolved considerably over two and a half decades, from the initial command-line interface (CLI) to today's ubiquitous graphical user interface (GUI).

Screen readers can also analyse many visual constructs like menus and alert or

dialogue boxes and transform them into speech to allow interaction with a blind

user.

Navigation on the screen is possible as well, allowing a non-linear or even random exploration and acquisition of the displayed information. Control of the produced speech is normally given to the user, so that quite fast navigation becomes possible when the user works with shortcuts. A simulation of a screen reader is available on the WebAIM website [WebAIM].

Although many screen reader applications exist, there are many limitations that current screen readers cannot overcome by themselves, for instance those related to images and structured text (tables etc.). Screen readers cannot describe images; they can only read out a textual description of them, and the user has difficulty understanding how the page is organized.

The basic requirement in terms of speech processing for screen reader applications is a robust text-to-speech converter capable of spelling and reading individual characters as well as all kinds of text elements that may appear, such as numeric expressions, abbreviations, acronyms and other coded elements. Punctuation is also generally spoken, besides being decisive in introducing some prosodic manipulation into the synthetic voice.
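The following sketch illustrates this kind of text normalization in a screen reader front end: spelling out individual characters on request and expanding abbreviations and punctuation into speakable words. The small lexicons are illustrative only and do not reflect any particular screen reader.

```python
# Minimal sketch of screen reader text normalization: expand abbreviations
# and, in spelling mode, read characters (including punctuation) one by one.
# The lexicons below are toy examples.

ABBREVIATIONS = {"e.g.": "for example", "Dr.": "doctor"}
PUNCTUATION = {".": "period", ",": "comma", "?": "question mark"}

def speakable(text, spell_mode=False):
    if spell_mode:                       # read character by character
        return " ".join(PUNCTUATION.get(ch, ch) for ch in text)
    words = []
    for token in text.split():
        words.append(ABBREVIATIONS.get(token, token))
    return " ".join(words)

print(speakable("Dr. Smith arrived."))    # "doctor Smith arrived."
print(speakable("a?b", spell_mode=True))  # "a question mark b"
```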


Following this idea, the World Wide Web Consortium (W3C) introduced Aural Cascading Style Sheets (ACSS) in 1998 with the issue of the Cascading Style Sheets 2 (CSS2) recommendation; a chapter on the acoustical rendering of a web page is presented in [WDAC].

Auditory icons, sometimes also called earcons, are made audible to the user by means of a loudspeaker or earphone system, which should have advanced acoustic features (high quality, stereo etc.). The acoustic elements carry voice properties such as speech-rate, voice-family, pitch, pitch-range and stress, which are used as command parameters to the speech synthesizer.
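As a sketch of how such properties might be turned into synthesizer commands, the snippet below maps an ACSS-style rule onto the parameters of a hypothetical synthesizer interface; the property names follow the CSS2 aural properties mentioned above, while the parameter names and numeric mappings are assumptions made for illustration.

```python
# Illustrative mapping of ACSS-style voice properties to hypothetical
# synthesizer command parameters. Numeric values are placeholders.

ACSS_RULE = {            # e.g. taken from an aural style sheet rule for <h1>
    "voice-family": "female",
    "speech-rate": "slow",
    "pitch": "high",
    "stress": 60,
}

RATE_WPM = {"slow": 120, "medium": 180, "fast": 250}
PITCH_HZ = {"low": 90, "medium": 120, "high": 210}

def to_synth_params(rule):
    return {
        "voice": rule.get("voice-family", "neutral"),
        "words_per_minute": RATE_WPM.get(rule.get("speech-rate"), 180),
        "base_pitch_hz": PITCH_HZ.get(rule.get("pitch"), 120),
        "stress_level": rule.get("stress", 50),
    }

print(to_synth_params(ACSS_RULE))
```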

An extended investigation of spatial acoustic features as a component of a screen reader was performed in the GUIB (Graphical User Interfaces for the Blind) project within the framework of the European TIDE initiative [Crispien, 1995]. The idea was to generate an acoustic screen in front of the user on which windows, icons and other graphic elements are audible at different places, and on which the position of the mouse is also audible while it is moving.
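The underlying mapping can be sketched very simply: screen coordinates become acoustic parameters so that the pointer can be heard while it moves. Mapping horizontal position to stereo panning and vertical position to pitch is an assumption made for illustration, not the exact scheme of the GUIB project.

```python
# Sketch: map pointer coordinates to stereo pan and pitch so that the
# moving mouse becomes audible. The mapping is an illustrative assumption.

def pointer_sound(x, y, screen_w=1920, screen_h=1080):
    pan = 2.0 * x / screen_w - 1.0               # -1.0 = far left, +1.0 = far right
    pitch_hz = 200 + 600 * (1.0 - y / screen_h)  # higher pitch near the top
    return {"pan": round(pan, 2), "pitch_hz": round(pitch_hz)}

print(pointer_sound(960, 0))    # centre top:  {'pan': 0.0, 'pitch_hz': 800}
print(pointer_sound(0, 1080))   # bottom left: {'pan': -1.0, 'pitch_hz': 200}
```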

In an earlier project (AudioBrowser, 2003-2005, see [Repositorium]), developed for Portuguese but applicable to most other languages, the structure or outline of a web page is discovered and used as a table of contents; this was implemented successfully. In this application the user can freely navigate inside the contents of each window or jump between windows, from contents to table of contents or vice versa, in order to scan or navigate through the page in a more structured and friendly way. The blind or low-vision user is constantly assisted by the text-to-speech device, which follows the navigation accurately.
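The core idea of deriving a navigable outline from a page can be sketched with the Python standard library alone: the heading elements of the HTML become a table of contents that the user can jump through. This is only a simplified illustration of the approach, not the AudioBrowser implementation.

```python
# Sketch: collect the heading elements of an HTML page as an outline
# (table of contents) that a speech interface could let the user browse.
from html.parser import HTMLParser

class OutlineParser(HTMLParser):
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.outline = []          # list of (level, heading text)
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._current = int(tag[1])

    def handle_data(self, data):
        if self._current is not None and data.strip():
            self.outline.append((self._current, data.strip()))
            self._current = None

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._current = None

parser = OutlineParser()
parser.feed("<h1>News</h1><p>...</p><h2>Sports</h2>")
print(parser.outline)   # [(1, 'News'), (2, 'Sports')]
```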

The W3C, through its Web Accessibility Initiative (WAI), has been issuing a relevant set of Web Content Accessibility Guidelines (WCAG), now in version 2. These guidelines are greatly helpful in orienting web page design towards accessibility [WAI]. The Authoring Tool Accessibility Guidelines (ATAG), nowadays in version 2.0, are also important for developers of authoring tools.

Reproduction of complex documents for blind persons

Complex documents, such as mathematical and other scientific, technical or even didactic documents, usually contain graphical representations. Above all, equations and other mathematical expressions have posed a substantial barrier to access by visually impaired persons. Most graphical representations and charts may also be included in this group.


Representation of complex mathematical elements in special Braille codes can almost totally solve the problem for blind persons. The LAMBDA project [LAMBDA, 2005] has produced a mathematical rendering package using such a system.

In the case of lengthier mathematical objects, more refined solutions may be preferable, using audio rendering of the mathematical expressions through synthetic speech. Using the codification of the expression in MathML, a browsable textual description of the expression can be automatically derived from the MathML code by means of a special lexicon and a grammar. Both must be specially designed for the purpose, according to mathematical conventions and with concern for the non-ambiguity of the textual description. This work has been carried out in the AUDIOMATH project [Ferreira, 2005] at the Faculdade de Engenharia da Universidade do Porto. A demonstration page is available at [Ferreira].
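The principle can be sketched as a small recursive walk over the MathML tree that renders each element through a lexicon of spoken templates. The toy lexicon below is an assumption for illustration and is far simpler than the lexicon and grammar developed in AUDIOMATH.

```python
# Sketch: derive a textual description from a MathML fragment using a toy
# lexicon of spoken templates. Not the AUDIOMATH lexicon or grammar.
import xml.etree.ElementTree as ET

LEXICON = {"mfrac": ("the fraction", "over", "end fraction"),
           "msup": ("", "to the power of", "")}

def describe(node):
    tag = node.tag.split("}")[-1]          # strip a MathML namespace, if any
    children = [describe(child) for child in node]
    if tag in ("mi", "mn", "mo"):
        return node.text or ""
    if tag in LEXICON and len(children) == 2:
        before, middle, after = LEXICON[tag]
        return " ".join(filter(None, [before, children[0], middle, children[1], after]))
    return " ".join(children)

mathml = "<math><mfrac><mi>a</mi><msup><mi>x</mi><mn>2</mn></msup></mfrac></math>"
print(describe(ET.fromstring(mathml)))
# -> "the fraction a over x to the power of 2 end fraction"
```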

Acoustical cues that contribute to the clarity of the speech rendering are also important. Previous authors have used, for instance, prosodic modifications such as raising or lowering the pitch of the synthetic voice to signal upper or lower parts of the expression, respectively. In AUDIOMATH the influence of pitch movements as well as of pauses during the description of expressions was studied and rules were extracted. An intra-formula navigation mechanism was also designed to allow users to explore a formula at their own will, thereby not putting too much stress on auditory memory in the case of longer formulas.
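Such rules can be pictured as a simple mapping from the role of a token in the expression to pitch and pause parameters, as in the sketch below; the concrete numbers are placeholders, not the values derived in AUDIOMATH.

```python
# Illustrative prosody rule: raise the pitch for superscripts, lower it for
# subscripts, and insert a short pause at sub-expression boundaries.
# The numeric values are placeholders for illustration only.

def prosody_for(token_role, base_pitch_hz=120):
    pitch = {"superscript": base_pitch_hz * 1.2,   # raised pitch
             "subscript":   base_pitch_hz * 0.85,  # lowered pitch
             "baseline":    base_pitch_hz}[token_role]
    pause_ms = 150 if token_role != "baseline" else 0
    return {"pitch_hz": round(pitch), "pause_before_ms": pause_ms}

print(prosody_for("superscript"))  # {'pitch_hz': 144, 'pause_before_ms': 150}
```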

2.2.2.3 Conclusions and future developments

The aim of this chapter was to show how electronic speech processing works and

how persons with disabilities can benefit from it.

Since speech is humankind's most important form of communication, every effort must be made to make speech communication possible, and if the speech channel is disturbed, technical solutions have to be found to overcome the obstacles.

The accuracy and quality of modern speech recognition and synthesis systems have reached a state of maturity which allows the development of very powerful support systems for persons with disabilities and helps to bridge the gap between these persons and those without disabilities, as was shown, for example, between deaf persons and the rest of the world.

Looking into the future of speech technology, some important research areas can

be identified as follows:


• Improving the robustness of speech recognition systems. Although robustness has been remarkably improved over the last years, the systems are still far behind human capabilities. Noise, especially non-stationary noise, background speakers or music can still reduce recognition reliability well below an acceptable error rate. Improvement is expected (and partly proven, as has been seen) from multimodal recognition which also includes visual information (above all, facial expressions and facial and hand gestures)

• A more extended use of semantic and pragmatic information. When the system (recognizer or synthesizer) 'knows' what the speaker wants to express, covering both content and emotion, the recognizer can usefully complete a spoken message that contains recognition errors, and a synthesizer can automatically generate the right accentuation and emotional 'colouring' of the speech. For the sake of completeness it has to be mentioned here that the continuous improvement of the quality of synthetic speech also includes multilinguality as well as speaker-specific synthesis, and will remain within the scope of research. Audio rendering of complex documents through synthetic speech is also a very important development area, where document description strategies, their conversion into full text form and intra-document navigation or browsing are the crucial steps

• A challenge and a wide field of research is sign language recognition. As mentioned earlier, there are several research activities, but much more work has to be done. More needs to be known about the structures of sign languages (and there are very many, all different!) and their relations to spoken and written languages. Automatic translation should be possible in different directions (sign language into speech and vice versa, sign language into another sign language, speech into a foreign sign language and vice versa, for example German speech into American Sign Language). The technical part of the problem is also challenging. Using the Ambient Intelligence (AmI) approach, we can expect micro cameras in clothes or in a pendant as well as position sensors in finger rings etc., and the environment will have enough intelligence to take on most of the processing activities needed for recognition and translation

• For blind persons, screen readers and the automatic recognition of graphics, pictures and the environment are a never-ending research area. As a matter of fact, for blind persons a verbal (spoken) description of the recognition result is, in many cases, the best solution. As before, AmI will be of crucial importance here.


It should be mentioned here that the enumeration given in this chapter is far from being complete. Further examples will be given in other chapters, showing that speech technology and speech applications will play a dominant role whenever communication is discussed.

2.2.2.4 References

BARROS, M.J., MAIA, R., TOKUDA, K., RESENDE, F.G., FREITAS, D., (2005). HMM-based European Portuguese TTS System. Paper presented and published in the proceedings of Interspeech'2005 - Eurospeech, 9th European Conference on Speech Communication and Technology, Lisbon.

BOTINIS (ed.) et al., (1997). Intonation: Theory, Models and Applications. Proceedings of the ESCA Workshop, Sept. 18-20, Athens, Greece.

BRASHER, H., STARNER, T. et al., (2003). Using Multiple Sensors for Mobile Sign Language Recognition. ISWC, White Plains, NY.
Also: http://www-static.cc.gatech.edu/~thad/031_research.htm

BURGHARDT, F. et al., (2006). Examples of synthesized emotional speech. http://emosamples.syntheticspeech.de/

CRISPIEN, K., FELLBAUM, K. (1995). Use of Acoustic Information in Screen Reader Programs for Blind Computer Users: Results from the TIDE Project GUIB. In: Placencia Porrerro, I., & de la Bellacasa, R. P., (Eds.): The European Context for Assistive Technology - Proceedings of the 2nd TIDE Congress, Paris. IOS Press, Amsterdam.

DELLER, J.R., (2000). Discrete-time processing of speech signals.

New York : Institute of Electrical and Electronics Engineers.

DRAGON Naturally Speaking Professional Engine, (2006). NUANCE

communications http://www.nuance.com/naturallyspeaking/.

FERREIRA, H., FREITAS, D., (2005). AudioMath - Towards Automatic Readings of Mathematical Expressions. 11th International Conference on Human Computer Interaction, Las Vegas, USA.

FERREIRA. http://lpf-esi.fe.up.pt/~audiomath

FURUI, S., (2001). Digital speech processing, synthesis, and recognition

2nd ed., rev. and expanded. New York : Marcel Dekker.


GARDNER-BONNEAU, D., (1999). Human Factors and Voice Interactive Systems.

Kluwer Academic Publishers, Boston.

HUMAINE, Network of Excellence. http://emotion-research.net/aboutHUMAINE.

iCommunicator homepage. http://www.myicommunicator.com/.

IIDA, A., CAMPBELL, N., YASUMURA, M., (1998). Emotional Speech as an Effective Interface for People with Special Needs. APCHI, p. 266, Third Asian Pacific Computer and Human Interaction Conference.

JEKOSCH, U., (2005). Voice and Speech Quality Perception. Springer-Verlag

Berlin, Heidelberg.

KRAISS, K.F., (ed.), (2006). Advanced Man-Machine Interaction. Springer Berlin

Heidelberg, New York.

SYNFACE project research page http://www.speech.kth.se/synface/.

LEE, C.M., PIERACCINI, R., (2002). Combining Acoustic and Language

Information for Emotion Recognition. Proc. of the International Conference on

Speech and Language Processing (ICSLP 2002). Denver, Co.

LAMBDA (2005). http://www.lambdaproject.org/.

MBROLA website http://tcts.fpms.ac.be/synthesis/.

MOESLUND, T., NORGAARD, L., (2003). A Brief Overview of Hand Gestures used

in Wearable Human Computer Interfaces. Technical Report CVMT 03-02,

Computer Vision and Media Technology Lab., Aalborg University, DK.

MOURA, A., PÊRA, V., FREITAS, D., (2006). Um Sistema de Reconhecimento Automático de Fala para Pessoas Portadoras de Deficiência (in Portuguese). Paper published in the proceedings of the IBERDISCAP'06 conference, held in Vitória-ES, Brazil.

MUSSLAP. University of West Bohemia, MUSSLAP website

http://www.musslap.zcu.cz/en/audio-visual-speech-recognition/.

PRITSCH, M., (2005). Visual speech training system for deaf persons. Proceedings of the 16th Conference Joined with the 15th Czech-German Workshop "Speech Processing", Prague, Sept. 26-28, 2005. TUD press, Dresden, Germany.

RAMAN, T.V., (1997). Auditory User Interfaces, Kluwer Academic Publishers,

August.


RAMAN, T.V., (1998). Conversational gestures for direct manipulation on the audio desktop. Proceedings of the Third International ACM SIGACCESS Conference on Assistive Technologies, Marina del Rey, California, United States, pp. 51-58. ISBN 1-58113-020-1.

REPOSITORIUM. https://repositorium.sdum.uminho.pt/bitstream/1822/761/4/iceis04.pdf#search=%22audiobrowser%22

SPROAT, R. (ed.), (1998). Multilingual Text-to-Speech Synthesis. Kluwer Academic Publishers, Dordrecht, Boston, London.

SYNFACE - Synthesised talking face derived from speech for hard of hearing

users of voice channels

http://www.speech.kth.se/synface/ and http://www.synface.net/.

SYNTHESIS TESTSITE, AT&T. http://www.research.att.com/~ttsweb/tts/demo.php.

VARY, P., MARTIN, R., (2006). Digital Speech Transmission. Enhancement, Coding

and Error Concealment. J. Wiley&Sons.

VOGLER, C. et al., (2000). A Framework for Motion Recognition with Applications to American Sign Language and Gait Recognition.
http://www.cis.upenn.edu/~hms/2000/humo00.pdf
See also Vogler's homepage: http://gri.gallaudet.edu/~cvogler/research/.

WAI. Web accessibility homepage. http://www.w3.org/WAI/

WDAC (1999). Aural Cascading Style Sheets (ACSS), W3C Working Draft

http://www.w3.org/TR/WD-acss.

WebAIM Screen Reader Simulation.

http://www.webaim.org/simulations/screenreader.php

WIKIPEDIA. Screen reader. http://en.wikipedia.org/wiki/Screen_reader

WISDOM project page. http://www.bris.ac.uk/news/2001/wisdom.htm.


2.3 New remote services

2.3.1 Novel broadband-based services: new opportunities

for people with disabilities

Broadband trials by the National Post and Telecom Agency

(Post- och telestyrelsen PTS), in Sweden

Patrik Bystedt

PTS seven broadband trials

Broadband technology has become accessible for a steadily increasing proportion

of the population in Sweden. With the aid of more rapid data transmission it has

become possible to send and receive large quantities of information via computer

ne