There's been an industry shift from proprietary approaches for developing speech-enabled applications to strategies and architectures based on industry standards. The latter offer developers of speech software a number of advantages: application portability, the ability to leverage existing Web infrastructure, speech vendor interoperability, increased developer productivity (knowledge of a speech vendor's low-level API and resource management is not required), and easier support for multimodal applications. Multimodal applications can overcome some of the limitations of a single-mode application (GUI or voice), enhancing the user's experience by allowing interaction through multiple modes (speech, pen, keyboard, etc.) in a session, depending on the user's context.

VoiceXML, Call Control eXtensible Markup Language (CCXML), and Speech Application Language Tags (SALT) are emerging XML specifications from standards bodies and industry consortia that are directed at supporting telephony and speech-enabled applications. The purpose of this article is to present an overview of VoiceXML, CCXML, and SALT and their architectural roles in developing telephony as well as speech-enabled and multimodal applications.

Before I discuss VoiceXML, CCXML, and SALT in detail, let's consider a possible architectural deployment that employs these specifications. At a high level there are two main architectural components: a document server and a speech/telephony platform. Each interfaces with a number of secondary servers (Automated Speech Recognition (ASR) servers, Text-to-Speech (TTS) servers, data stores).

In this architecture a document server generates the documents in response to requests from the speech/telephony platform. The document server leverages a Web application infrastructure to interface with back-end data stores (message stores, user profile databases, content servers) to generate VoiceXML, CCXML, and SALT documents. Typically, the overall Web application infrastructure separates the core service logic (the business logic) from the presentation details (VoiceXML, CCXML, SALT, HTML, WML) to provide a more extensible application architecture. The application infrastructure is also responsible for maintaining application dialog state in a form that's separate from a particular presentation language mechanism.

To process incoming calls, the speech/telephony platform requests documents from the document server using HTTP. A VoiceXML or CCXML browser that resides on the platform interprets the VoiceXML and CCXML documents to interact with users on a phone. Typically, the platform interfaces with the PSTN (Public Switched Telephone Network) and media servers (ASR, TTS) and provides VoIP (SIP, H.323) support. An ASR server accepts speech input from the user, uses a grammar to recognize words from the user's speech, and generates a textual equivalent that is used by the platform to decide the next action to take, depending on the script. A TTS server accepts markup text and generates synthesized speech for presentation to a user. In this deployment a SALT browser on a mobile device interprets SALT documents. Figure 1 is a diagram illustrating such an architecture.

Now that you have an overall understanding of the architecture in which these specifications can be used, let's begin by discussing VoiceXML. VoiceXML can be viewed as another presentation language (like HTML or WML) in your architecture. It's a dialog-based XML language that leverages the Web development paradigm for developing interactive voice applications for devices such as phones and cell phones. It's a self-contained presentation language designed to accept user input in the form of DTMF (touch tones produced by a phone) and speech, and to generate user output in the form of synthesized speech and prerecorded audio. It isn't designed to be embedded in an existing Web language (e.g., HTML, WML) or to leverage HTML's event mechanism. At this writing, VoiceXML 2.0 is a W3C working draft. VoiceXML is currently used to support a number of different types of solutions - for example, to automate the customer care process for call centers and to support sales force automation, enabling a sales agent to access appointments, customer information, and address books by phone. It's also used in unified communications solutions to enable users to manage their messages (e-mail, voice, fax) and personal information (personal address books).

To appreciate how VoiceXML can be used, it's necessary to understand its structure, elements, and mechanisms. A VoiceXML application comprises a set of documents sharing the same application root document. When any document in the application is loaded, the root document is also loaded. Users are always in a dialog defined in a VoiceXML document (i.e., the interaction between a user and voice/telephony platform is represented in a dialog). Users provide input (DTMF or speech) and then, based on application logic in the document, are shown a new dialog that presents output (audio files or synthesized speech) and accepts further user actions that can result in a new dialog. Execution ends when no further dialog is defined. Transitions between dialogs use URIs.

There are two main types of dialogs: forms and menus. Forms present output and collect input, while menus present a user with choices. Fields are the building blocks of forms and comprise prompts, grammars (describing allowable inputs), and event handlers. VoiceXML's built-in Form Interpretation Algorithm (FIA) determines the control flow of a form. For each form item, the main FIA loop involves selecting the item, collecting input (playing a prompt, activating grammars, waiting for input or an event), and then processing the input or event. Examples of actions defined in a form include collecting user input (<field>), playing a prompt (<prompt>), and executing an action when an input variable is filled (<filled>). The FIA's interpretation of a form ends when an <exit> element or a transition to another dialog (<submit>, <goto>) is encountered.
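
To make the form mechanics concrete, here is a minimal sketch of a VoiceXML form; the grammar URI and submit target are hypothetical:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="getQuote">
    <field name="company">
      <!-- The FIA selects this field, plays the prompt, and activates the grammar -->
      <prompt>Which company's stock quote would you like?</prompt>
      <grammar src="companies.grxml" type="application/srgs+xml"/>
      <filled>
        <!-- Runs once the field variable is filled by a recognized utterance -->
        <submit next="quote.jsp" namelist="company"/>
      </filled>
    </field>
  </form>
</vxml>
```

The FIA would select the company field, play the prompt, wait for input, and, once the field is filled, submit the recognized value to the document server.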

Menus (<menu>) can be loosely described as a special case of a form with a single, implicitly defined field composed of a set of choices (<choice>). A choice element (<choice>) defines a grammar that determines its selection and a URI to transition to. Menus can be speech, DTMF, or both.
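
A menu for a content services scenario might be sketched as follows; the target document names are hypothetical:

```xml
<menu>
  <prompt>Say news, weather, or stock quotes.</prompt>
  <!-- Each choice implicitly defines a grammar and a transition URI -->
  <choice next="news.vxml">news</choice>
  <choice next="weather.vxml">weather</choice>
  <choice next="quotes.vxml">stock quotes</choice>
</menu>
```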

VoiceXML supports both system-directed conversations (for less experienced users) and mixed-initiative conversations (for more experienced users); <link> is used to support mixed-initiative dialogs. VoiceXML defines specific elements to control dialog flow and handle user interface events. Elements such as <var>, <if>, and <goto> are provided to define application logic, and ECMAScript code can be defined in <script> elements. Through its event mechanism (<throw>, <catch>, <noinput>, <error>, <help>, <nomatch>), VoiceXML offers elements to handle situations where there's no user input or the input isn't understood. These and other VoiceXML elements are used in Listing 1 to illustrate how VoiceXML could support a simple content services application (a voice portal for getting stock quotes, news, weather, etc.).
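
As an illustrative sketch, a field can attach handlers for silence, unrecognized input, and help requests (the PIN field and its built-in grammar are hypothetical):

```xml
<field name="pin">
  <prompt>Please say your four-digit PIN.</prompt>
  <!-- Built-in digits grammar constrained to four digits -->
  <grammar src="builtin:grammar/digits?length=4"/>
  <noinput>I didn't hear anything. <reprompt/></noinput>
  <nomatch>I didn't understand that. <reprompt/></nomatch>
  <help>Say the four digits of your PIN.</help>
</field>
```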

VoiceXML's grammar activation is based on the scope in which the grammar was declared and the current scope of the VoiceXML interpreter. For example, a grammar declared in the root document is active throughout the execution of the VoiceXML application. Grammars can be active within a particular document, form, field, or menu, and can be inline, external, or built in (e.g., boolean, date, phone, time, currency, digits). Unlike VoiceXML 1.0, version 2.0 requires support for the XML form of the Speech Recognition Grammar Specification (SRGS), although it doesn't preclude other grammar formats. VoiceXML 2.0 interpreters should also support the Speech Synthesis Markup Language (SSML) specification for synthesized speech. As I write, both SSML and SRGS are W3C working drafts.
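
For example, an inline grammar in the XML form of SRGS might look like this sketch (the rule name and word list are illustrative):

```xml
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         root="sport" mode="voice">
  <rule id="sport" scope="public">
    <!-- The user may say exactly one of these alternatives -->
    <one-of>
      <item>soccer</item>
      <item>basketball</item>
      <item>football</item>
      <item>tennis</item>
    </one-of>
  </rule>
</grammar>
```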

VoiceXML has some other features to promote architectural extensibility and reusability. For example, the <object> element offers facilities for leveraging platform-specific functionality while still maintaining interoperability, and the <subdialog> element can be used to develop reusable speech components.
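
As a sketch of reuse via <subdialog>, a hypothetical authentication dialog (the document name and the returned auth.valid variable are assumptions) could be invoked like this:

```xml
<subdialog name="auth" src="login.vxml">
  <filled>
    <!-- The subdialog returns its variables via <return>; auth.valid is hypothetical -->
    <if cond="auth.valid">
      <goto next="#mainMenu"/>
    <else/>
      <exit/>
    </if>
  </filled>
</subdialog>
```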

While it provides a suitable language for developing interactive speech applications, VoiceXML lacks support for features such as conferencing, call control, and the ability to route and accept or reject incoming calls. <transfer>, its tag for transferring calls, is inadequate for these types of features. Further, VoiceXML's execution model isn't well suited to an environment that needs to handle asynchronous events external to the VoiceXML application. VoiceXML can handle synchronous events only - those that occur only when the application is in a certain state.

CCXML addresses some of the telephony/call control limitations of VoiceXML. It enables processing of asynchronous events (events generated from outside the user interface), filtering and routing of incoming calls, and placing outbound calls. It supports multiparty conferencing as well as the creation/termination of currently executing VoiceXML instances and the creation of a VoiceXML instance for each call leg. A CCXML browser on a voice/telephony platform interprets CCXML documents. CCXML is currently a W3C working draft.

Since CCXML is still an emerging specification, few deployed solutions are in the market today. However, it can be used to support a number of different types of applications. Conferencing applications can be developed using it. Voice messaging applications can use it to enable a user to filter incoming calls and route them to a particular application. CCXML can also be used to support notification functions (e.g., place outbound calls to notify a user of a new appointment or a stock alert).

The structure of a CCXML program reflects its fundamental value: the ability to handle asynchronous events. Processing of events from external components, VoiceXML instances, and other CCXML instances is central to CCXML. A CCXML program basically consists of a set of event handlers (<eventhandler>) for processing events in the CCXML event queue. Each <eventhandler> element comprises <transition> elements. An implicit Event Handler Interpretation Algorithm (EHIA) interprets <eventhandler> elements. A CCXML interpreter essentially removes an event from its event queue, processes it (selects the <transition> elements in an <eventhandler> that match the event and performs the actions within each matching <transition> element), and then removes the next item from the event queue.

Within a <transition> element a new VoiceXML dialog can be started (<dialogstart>) and associated with a call leg. Note that the dialog launch is nonblocking, and control is immediately returned to the CCXML script. A CCXML script can also end a VoiceXML dialog instance (<dialogterminate>). Conditional logic (<if>, <else>) can be used in a <transition> element to accept or reject incoming calls or modify control flow. Incoming calls can be accepted or rejected (<accept>, <reject>), and outbound calls can be placed (<createcall>). A CCXML application can contain multiple documents, and a CCXML execution flow can transition from one document to another (<goto>, <fetch>, <submit>) and end its execution (<exit>). A CCXML instance can also create another CCXML instance (<createccxml>) with a separate execution context and send events (<send>) to other CCXML instances. Multiparty conferences can be created and terminated (<createconference>, <destroyconference>), and call legs can be added to and removed from a conference (<join>, <unjoin>).
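
These mechanisms can be sketched in a minimal CCXML event handler. Element and event names follow the working draft described in this article, and the dialog document name is hypothetical:

```xml
<ccxml version="1.0">
  <eventhandler>
    <!-- An incoming call is alerting: accept it -->
    <transition event="connection.CONNECTION_ALERTING">
      <accept/>
    </transition>
    <!-- The call is connected: launch a VoiceXML dialog on this leg -->
    <transition event="connection.CONNECTION_CONNECTED">
      <dialogstart src="'login.vxml'"/>
    </transition>
    <!-- The VoiceXML dialog has ended: end this CCXML session -->
    <transition event="dialog.exit">
      <exit/>
    </transition>
  </eventhandler>
</ccxml>
```

Each <transition> is selected by the EHIA when the matching event reaches the front of the event queue, so the script as a whole reads as a state machine driven by telephony and dialog events.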

For telephony events the JavaSoft call model is used to provide the abstractions (Address, Call, Connection, Provider). A Call object is a representation of a telephone call. A call comprises zero or more Connections (e.g., a conference call typically has three or more connections). A Connection describes the relationship between a Call and an Address (e.g., a telephone number), and is in one of a defined set of states at any particular time. For example, when a Connection and an Address are an active part of a call, a connection.CONNECTION_CONNECTED event is emitted, whereas when an address is being notified of an incoming call, a connection.CONNECTION_ALERTING event is emitted.

The code snippet in Listing 2 illustrates the main structure of a CCXML script for processing an incoming call notification event and then a subsequent connected event. Based on a connection.CONNECTION_ALERTING event, the call could be accepted or rejected according to the caller ID using conditional logic (<if>, <else>). When a connection enters the connected state, a VoiceXML dialog (login.vxml) is launched to perform authentication (e.g., entering a PIN).

Other types of events processed by a CCXML script are sent from VoiceXML instances or from other CCXML instances. For example, a CCXML script can capture status from a terminating VoiceXML dialog because VoiceXML's <exit> element allows variables to be returned to the CCXML script. When a VoiceXML interpreter ends execution, a dialog.exit event is received by the CCXML instance. Using a <transition> element, the CCXML script can process the dialog.exit event and access the variables returned by the VoiceXML dialog. For example, if a VoiceXML dialog presents a set of menu options, upon termination of the dialog the CCXML script can obtain the selected menu choice and act on it (e.g., perform an outdial, join a conference). Events are also used to communicate between CCXML instances. However, CCXML currently doesn't define a specific transport protocol for this communication; SIP and HTTP are possibilities for the underlying transport.

A number of factors currently inhibit the adoption of CCXML.

  • Since it's a new specification still under review, there are few CCXML browser implementations compared to VoiceXML browser implementations. Note that although it's designed to complement a dialog-based language such as VoiceXML, a CCXML system isn't required to support a VoiceXML implementation. If CCXML and VoiceXML are used together, the instances would run separately.
  • While CCXML is a promising start in addressing some of the limitations of VoiceXML, a number of areas remain to be specified in order to meet the needs of call control applications. For example, while CCXML is and should remain uncoupled from a specific underlying protocol, protocol-agnostic mechanisms could be specified to allow passing in protocol-specific parameters (VoIP, SS7) when performing certain functions (e.g., making an outbound call).
  • Other items for future discussion are documented in the current CCXML specification (e.g., communication between different CCXML instances).

SALT enables speech interfaces to be added to existing presentation languages (e.g., HTML, XHTML, WML) and supports multimodal applications. PDAs, PCs, and phones are examples of devices that can support SALT applications. At the time of this writing, the SALT Forum, an industry consortium, has published a 0.9 version of the SALT specification.

Since SALT is an emerging specification, at the time of this writing there are no known deployed solutions. However, SALT can enhance a user's experience on PDAs and other mobile devices that have inherent limitations, such as keyboards that are difficult to use and small visual displays. Multimodal solutions allow a user to use multiple I/O modes concurrently in a single session. Input can be via speech, keyboard, pen, or mouse, and output can be via speech, graphical display, or text. For example, a multimodal application can enable a user to enter input via speech and receive output in the form of a graphic image. Thus a user can select the appropriate interaction mode depending on the type of activity and context. SALT can enable solutions that enhance a user's experience when using applications such as content services (news, stock quotes, weather, horoscope), unified communications, sales force automation, and call center support.

The core philosophy behind the SALT specification is to use a lightweight approach to speech-enabling applications. This is evident in the small set of tags in the SALT specification that can be embedded in a markup language such as HTML for playing and recording audio, specifying speech synthesis configuration, and recognizing speech. The <listen> element is used to recognize input and define grammars (<grammar>). SALT browsers must support the XML form of the W3C's SRGS.

This means that the XML form of SRGS can be used in SALT's <grammar> element. The <listen> element also has an element for processing the input (<bind>). The <bind> element processes the input result, which takes the form of a semantic markup language document; a SALT browser must support the W3C's Natural Language Semantics Markup Language (NLSML) for specifying recognition results. XPath is used to reference specific values in the returned NLSML document, which are then assigned to a target defined in the <bind> element. Thus input text can be obtained either from a graphical display or via a speech interface.

The <listen> element can also contain a single <record> element for capturing audio input. SALT also supports DTMF input using the <dtmf> element; like <listen>, its main elements are <grammar> and <bind>. For audio output the <prompt> element can represent a prompt using a speech synthesis language format (SALT browsers must support SSML). This means that SSML can be embedded within SALT's <prompt> elements.
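
A sketch combining these elements follows; the grammar URI, target field, and account rule are hypothetical, and the SSML <emphasis> element is embedded directly in the prompt:

```xml
<!-- Prompt with embedded SSML markup -->
<salt:prompt id="promptAccount">
  Please <emphasis>say or key in</emphasis> your account number.
</salt:prompt>

<!-- DTMF input bound to a visual input field -->
<salt:dtmf id="dtmfAccount">
  <salt:grammar src="digits.xml"/>
  <salt:bind targetElement="txtAccount" value="//account"/>
</salt:dtmf>
```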

SALT's call control functionality is provided in the form of a call control object model based on the JCP Call Model and portions of the Java Telephony API (JTAPI). The callControl object provides an entry point to the call control abstractions and consists of one or more provider objects. Providers represent an abstraction of a telephony protocol stack (SIP, H.323, SS7). Providers create and manage a conference. An address is a communication endpoint (e.g., a SIP URL for VoIP), and providers define the available addresses that can be used. A conference is composed of one or more calls that can share media streams. SALT's call control objects define states, transitions, events, and methods for supporting call control. For example, there are methods for beginning a new thread of execution for processing an incoming call (the spawn() method of the callControl object), answering an incoming call (the accept() method of the call object) based on an alerting event (call.alerting), or transferring a call (the transfer() method of the call object).

Also, SALT has a <smex> element that can be used to exchange messages with an external component of the SALT platform. Functionality such as Call Detail Record (CDR) generation, logging, or proprietary features can be leveraged through this extensible mechanism while maintaining interoperability. For example, <smex> can be used to leverage call control functionality on a different platform (e.g., send call control requests to and receive events from a CCXML interpreter on a remote host). In this situation SALT handles dialog management while call control is handled by a separate architectural entity (e.g., a CCXML interpreter) on a remote host.

Let's consider how SALT elements can be embedded in an existing Web page to also offer a speech interface. Assume a Web page with SALT elements allows a user to select the latest news for a particular sport (soccer, basketball, football, tennis) by selecting the sport from a drop-down box. To enable speaking the name of the sport, the Start() method of the listen element's DOM object is called to invoke the recognition process. The value returned from the recognition is assigned to an input field (e.g., txtBoxSport). As with other SALT DOM elements, the listen element also defines a number of events (e.g., onreco, onsilence, onspeechdetected) whose handlers may be specified as attributes of the listen object (i.e., event-driven activation is an inherent feature of SALT). The following code snippet illustrates some of the SALT elements involved in this scenario:

... <input name="txtBoxSport" type="text" onpendown="listenSport.Start()"/> ...
<salt:listen id="listenSport">
  <salt:grammar name="gramSport" src="/sport.xml"/>
  <salt:bind targetElement="txtBoxSport" value="//sport"/>
</salt:listen>

You've probably noted that both SALT and VoiceXML can be used to develop dialog-based speech applications, but the two specifications differ significantly in how they deliver speech interfaces. Whereas VoiceXML has a built-in control flow algorithm, SALT doesn't. Further, SALT defines a smaller set of elements than VoiceXML. While developing and maintaining speech applications in two languages may be feasible, it's preferable for the industry to work toward a single language for developing speech-enabled interfaces as well as multimodal applications.

Summary and Conclusion
This discussion has provided a brief introduction to VoiceXML, CCXML, and SALT - their support for speech-enabled interactive applications, call control, and multimodal applications, and their important role in developing flexible, extensible, standards-compliant architectures. This presentation of their main capabilities and limitations should help you determine the types of applications for which they could be used.

The various languages expose speech application technology to a broader range of developers and foster more rapid development because they allow applications to be created without expertise in a specific speech/telephony platform or media server. The three XML specifications offer application developers document portability in the sense that a VoiceXML, CCXML, or SALT document can run on a different platform as long as that platform supports a compliant browser.

These XML specifications pose an exciting challenge for developers: to create useful, usable, and portable speech-enabled applications that leverage the ubiquitous Web infrastructure.

For more information on using these XML specifications in your speech-based system architectures, see the following:

  • Voice Browser Call Control (CCXML), Version 1.0
  • Speech Application Language Tags (SALT) 0.9 Specification (draft)
  • Voice Extensible Markup Language (VoiceXML), Version 2.0
  • Speech Synthesis Markup Language Specification
  • Speech Recognition Grammar Specification for the W3C Speech Interface Framework
  • Natural Language Semantics Markup Language for the W3C Speech Interface Framework

Ian Moraes, Ph.D., is a principal engineer at Glenayre Electronics, where he works on the architecture and design of a unified messaging system. Ian has developed client/server systems for the telecommunications and financial services industries.
