Humans communicate their thoughts, ideas, and feelings verbally through complex muscle movements that produce what we call "voice". Over the past year, I've been experimenting with how to bring that same voice to the technologies we interact with, such as Amazon Alexa and other voice-based virtual agents.

Working on configuring Text-to-Speech services in the Accenture Conversational AI platform for clients has introduced me to Speech Synthesis Markup Language (SSML). SSML helps us construct the voice of our virtual assistant. It applies to almost any voice-enabled technology and offers many customizable features, making it a powerful tool.

So, what is SSML?

SSML is an XML-based markup language used for speech synthesis. It gives authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch, and rate across different synthesis-capable or voice-enabled platforms.

Why should I use it?

  • SSML gives a standard way to control aspects of the speech in different systems.
  • It can improve the quality of the synthesized content.
  • Flexibility: it can be generated automatically (e.g., via CSS3 from an XHTML document) or authored by hand. It can also stand alone as a complete SSML document or be embedded in another language.

What features can the voice be configured with?

  1. Pitch: The baseline pitch for the contained text.

Example:

  • A number followed by "Hz"
  • "x-low", "low", "medium", "high", "x-high", or "default"

  2. Contour: Sets the actual pitch contour for the contained text.

  3. Rate: A change in the speaking rate for the contained text.

Example:

  • A non-negative percentage
  • "x-slow", "slow", "medium", "fast", "x-fast", or "default"

  4. Duration: A value in seconds or milliseconds for the desired time it takes to read the contained text.

Example:

  • A number followed by ms (milliseconds)
  • A number followed by s (seconds)

  5. Volume: The volume for the contained text.

Example:

  • A number preceded by "+" or "-" and followed by dB, or
  • "silent", "x-soft", "soft", "medium", "loud", "x-loud"

And what elements can I use with it?

<voice>

This is the main element wrapping all text read to the user. It's used to select a voice by specifying attributes such as the language/accent and gender offered by the vendor (e.g. MS Bing).

An example of reading with a Saudi Arabic male voice: <voice xml:lang='ar-SA' gender='male'></voice>
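For context, a standalone SSML document wraps everything in a root <speak> element. A minimal sketch (the greeting text is just an illustration):

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="ar-SA">
  <voice xml:lang="ar-SA" gender="male">مرحبا بك</voice>
</speak>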

<say-as>

<say-as> is used to read text or numbers in a specific way (for example, as a date, a time, or digit by digit), controlled by the element's attributes such as interpret-as.
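For example, a sketch asking the engine to read a number sequence as a date (the exact interpret-as values supported vary by vendor):

<say-as interpret-as="date" format="mdy">1/2/2023</say-as>

This would typically be read as "January second, twenty twenty-three" rather than digit by digit.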

 <sub>

This is used to substitute the contained text with the text given in the element's alias attribute. It can be used to make acronyms and other abbreviations read differently.
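A sketch using the alias attribute:

<sub alias="World Wide Web Consortium">W3C</sub>

Here the engine reads "World Wide Web Consortium" instead of "W3C".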

 <emphasis>

<emphasis> specifies how strongly words and letters are emphasized when reading the text.

Example: <p> Come here, <emphasis level="strong"> NOW </emphasis> </p>

 <break />

This is used to add a pause when reading the text.
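Example (a sketch; the time attribute accepts the same s/ms values described earlier, and a strength attribute is also available):

Take a deep breath <break time="500ms"/> and relax.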

 <audio>

You can use this element to insert specific audio files within the text being read. The tag itself can also contain fallback text that is read if the audio file fails to play, as well as trimming attributes to control the audio output.
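A minimal sketch (the URL here is hypothetical):

<audio src="https://example.com/welcome.mp3">Welcome!</audio>

If welcome.mp3 cannot be fetched or played, the fallback text "Welcome!" inside the tag is read instead.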

<prosody>

This element can contain the pitch, speaking rate/speed, and volume of the text being read. The attributes below can be used together and are all optional; a combined example follows the list. Attributes:

Rate:

  • Controls the speed at which the text is read. We are currently increasing the overall speech rate by 15%, but rate changes can also target specific words or sentences.
  • The value should be a non-negative number followed by a % sign.

Pitch

  • Raises or lowers the pitch of the voice (which can, for example, convey excitement).
  • It accepts a positive number followed by 'Hz'.

Range

  • Controls the pitch range (variability) for the contained text.
  • Increases/decreases the dynamic range of the output pitch.
  • Uses a positive number followed by 'Hz'.

Duration

  • Specifies a desired duration for speaking the text in the element; the speaking speed is adjusted to fit the time given.
  • Takes a positive number followed by s for seconds.

Volume

  • Controls the volume of the text being read.
  • Based on testing with Arabic, the value can range from 0 to 100, where 100 is the default volume and anything lower is quieter.
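Putting these attributes together, a hedged sketch (the values and wording are illustrative, and the exact value formats accepted vary by engine):

<prosody rate="115%" pitch="200Hz" volume="80">Thank you for calling. How can I help you today?</prosody>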

But how does it actually work?

There are multiple steps happening in the background when we use SSML to convert a text to synthesized speech with customized voice features. Below is a summary of these steps:

  1. XML parse

An XML parser is used to extract the document tree and content. The structure, tags and attributes it obtains influence the steps below.

  2. Structure analysis

The structure influences the way in which a document should be read. For example, there are common speaking patterns associated with paragraphs and sentences.
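SSML also lets the author mark this structure explicitly with the <p> (paragraph) and <s> (sentence) elements. A sketch:

<p><s>This is the first sentence.</s><s>This is the second.</s></p>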

  3. Text normalization

All written languages have special constructs that require a conversion of the written (orthographic) form into the spoken form. Text normalization is an automated process that performs this conversion.

For example, in English, when $300 appears, it would be read as “three hundred dollars”. Also, the orthographic form 1/2 may be pronounced as “one half”, or “January second”, or “February first”.

By the end of this step, the text to be spoken has been fully converted into tokens, based on language-specific rules.

A special element <say-as/> can be used in the input document to explicitly indicate the presence and type of such constructs.
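For instance, a sketch that steers the engine toward reading 1/2 as a fraction rather than a date (on engines that support this interpret-as value):

<say-as interpret-as="fraction">1/2</say-as>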

  4. Text-to-phoneme conversion

After the processor has determined the set of words, it must derive the pronunciation of each word. Word pronunciations can be described as sequences of phonemes, which are units of sound in a language. Each language has its own specific set of phonemes.

This step can be very complex because of differences between the written and spoken forms of a language. For example, in English, the word "read" can be pronounced as "red" or "reed".

A human can choose the correct pronunciation from context, but a synthesis processor may find this difficult without that context.

The <phoneme> element gives the author direct control over the phonemic sequence to be pronounced.
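A sketch using the IPA alphabet (the transcription here is illustrative):

<phoneme alphabet="ipa" ph="ˈtəˌmɑːtoʊ">tomato</phoneme>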

  5. Prosody analysis

Prosody is a set of features of speech output that includes pitch, timing, pausing, speech rate, and more.

Producing human-like prosody is important for making speech sound natural.

SSML elements for that include <break/>, <emphasis/> and <prosody/>.

  6. Waveform production

The final step is producing audio waveform output from the phonemes and prosodic information.

The <voice/> element in SSML allows the document creator to request a particular voice to be used (e.g. young/old male/female voice).

These six steps provide a great path for converting text into synthesized speech with customized voice features. I hope you now have a better understanding of SSML and how to apply it in your organization.

Sara Alamoodi

Advanced App Engineering Specialist
