Audio and Voices

Custom audio files

Custom audio files should preferably meet the following specification:


Bit rate	128 kbps
Sample size	16 bit
Channels	1 (Mono)
Audio sample rate	8 kHz (8000 Hz)
Audio codec	PCM
Filename	*.wav

Other formats will be transcoded to the above.
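To check a file against this specification before uploading it, a small local script may help. The following is a minimal sketch using Python's standard `wave` module; the function name `check_wav` is hypothetical and is not part of the Voice API.

```python
import wave

# Target values taken from the specification table above.
EXPECTED_CHANNELS = 1        # mono
EXPECTED_SAMPLE_WIDTH = 2    # bytes per sample, i.e. 16 bit
EXPECTED_SAMPLE_RATE = 8000  # Hz


def check_wav(path):
    """Return a list of mismatches; an empty list means the file matches.

    Hypothetical local helper, not part of the Voice API.
    """
    problems = []
    try:
        with wave.open(path, "rb") as wav:
            channels = wav.getnchannels()
            sample_width = wav.getsampwidth()
            sample_rate = wav.getframerate()
    except wave.Error as exc:
        # The wave module raises wave.Error for files it cannot read,
        # for example WAV files that do not contain PCM audio.
        return [f"not a readable PCM WAV file: {exc}"]

    if channels != EXPECTED_CHANNELS:
        problems.append(f"channels: {channels} (expected {EXPECTED_CHANNELS})")
    if sample_width != EXPECTED_SAMPLE_WIDTH:
        problems.append(f"sample size: {sample_width * 8} bit (expected 16 bit)")
    if sample_rate != EXPECTED_SAMPLE_RATE:
        problems.append(f"sample rate: {sample_rate} Hz (expected 8000 Hz)")
    return problems
```

The bit rate does not need a separate check: 8000 samples per second × 16 bits × 1 channel gives the 128 kbps listed in the table.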

To use custom audio files, they need to be available to the Voice API. To make them available, go to Audio Files within the Voice Management App, which allows you to upload your custom audio files to our servers.

You are free to create any directory structure on this server. Just make sure you supply the whole path from the root of your account when sending an instruction that requires an audio file.

If you want to use custom spelling audio, these files must be placed in the correct structure, as explained in the next section.

Custom audio files for the OTP app

In order to use custom audio files with the OTP app, you need to upload a set of audio files in the following directory structure using the Voice Management App:

/spelling/en-GB/*.wav

Here, ‘en-GB’ is the language (including locale) of the set. Inside this folder, upload a .wav file for every digit or letter that should be read aloud, for example:

0.wav
1.wav
2.wav
a.wav
b.wav
c.wav

Note: For these custom audio files to work properly, make sure that their filenames are in lowercase.
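To see which files a given code needs, the expected paths can be derived from the characters in the code. This is a minimal sketch; the helper name `spelling_files` is hypothetical, and it assumes that only letters and digits have audio files and that other characters (spaces, dashes) are skipped.

```python
def spelling_files(text, locale="en-GB"):
    """Return the sorted, unique audio file paths needed to spell `text`.

    Hypothetical helper, not part of the Voice API. Assumes one lowercase
    .wav file per letter or digit under /spelling/<locale>/.
    """
    paths = set()
    for ch in text:
        # Assumption: characters other than letters and digits have no file.
        if ch.isalnum():
            # Filenames must be lowercase, so "A" maps to a.wav.
            paths.add(f"/spelling/{locale}/{ch.lower()}.wav")
    return sorted(paths)


# Files needed to spell the code "A1b1" in the en-GB set.
print(spelling_files("A1b1"))
```

Comparing this list with the files uploaded in the Voice Management App shows whether any character in a code is missing its audio file.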

Text-To-Speech

The Voice API supports Text-To-Speech (or TTS) in all instructions where you can provide a prompt to the caller/callee. When using TTS, you can provide the voice you want to use. Currently we support the following voices:

Language / Locale	Code	Gender	Number of voices available
German	de-DE	Male	12
German	de-DE	Female	12
English (United States)	en-US	Male	19
English (United States)	en-US	Female	23
Dutch	nl-NL	Male	4
Dutch	nl-NL	Female	6

For more supported languages, see https://developers.cm.com/voice/docs/supported-languages

When using TTS (or the Spelling Instruction), you can provide the voice to use in the JSON body. The voice part of the JSON body has the following variables:

C#

instruction.Voice.Language = "nl-NL";
instruction.Voice.Gender = Gender.Male;
instruction.Voice.Number = 1;
instruction.Voice.Volume = 2;

JSON

"voice": {
    "language": "nl-NL",
    "gender": "Male",
    "number": 1,
    "volume": 2,
    "premium": false
}

Variable definition

Variable	Data type	Length	Required	Description
language	Alphanumeric	5	No (Default: en-GB)	The language of the voice to use.
gender	Alphanumeric	6	No (Default: Female)	The gender of the voice, either 'Male' or 'Female'.
number	Numeric	3	No (Default: 1)	The number of the voice to use, if the given combination of language and gender provides multiple voices.
volume	Numeric	1	No (Default: 0)	The volume level of the voice. Must be between -4 and 4.
premium	Boolean		No (Default: false)	Whether to use premium voices. Please be aware that additional costs will be charged for the use of premium voices.
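The defaults and constraints in the table above can be applied on the client before a request is sent. The following is a minimal sketch; `build_voice` is a hypothetical helper, and only the rules stated in the table (gender values, volume range) are checked.

```python
import json


def build_voice(language="en-GB", gender="Female", number=1, volume=0, premium=False):
    """Build the "voice" part of the JSON body with the documented defaults.

    Hypothetical client-side helper, not part of the Voice API.
    """
    if gender not in ("Male", "Female"):
        raise ValueError("gender must be 'Male' or 'Female'")
    if not -4 <= volume <= 4:
        raise ValueError("volume must be between -4 and 4")
    return {
        "language": language,
        "gender": gender,
        "number": number,
        "volume": volume,
        "premium": premium,
    }


# Serialize the same voice as in the JSON example above.
body = {"voice": build_voice(language="nl-NL", gender="Male", volume=2)}
print(json.dumps(body, indent=4))
```

Validating locally gives an immediate error for an out-of-range volume instead of waiting for the API to respond.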

SSML

With Speech Synthesis Markup Language (SSML) you can enhance your TTS prompts. For example, you can include a pause within a prompt or change the speech rate or pitch.

Example

{
    "prompt": "<speak>Hi! I will wait for 2 seconds. <break time='2s'/> Now I will spell the word hello: <break time='1s'/> <say-as interpret-as='characters'>hello</say-as>. <prosody rate='-50%' pitch='-25%' volume='+2dB'>I am speaking a bit loudly, slowly and with a lower pitch.</prosody></speak>"
}

Attributes

Tag	Description	Parameters	Values
<speak>	The root element where the SSML starts. Required in order to use SSML.	--	--
<break>	Adds a pause between words.	time: the pause duration in milliseconds (ms) or seconds (s)	Ns or Nms. The maximum total duration is 10 seconds per API call.
<say-as>	Controls how special types of words are spoken.	interpret-as: determines how to say certain characters, words, and numbers	cardinal, characters, ordinal, fraction, unit, date, time, expletive
<prosody>	Controls the volume, speaking rate, and pitch.	rate: how fast or slow the text is spoken; pitch: the pitch of the voice; volume: the volume for the text	rate: +/- N%; pitch: +/- N%; volume: +/- NdB
<emphasis>	Emphasizes words.	level: the degree of emphasis	strong, moderate, reduced
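Because the total pause duration is limited to 10 seconds per API call, it can be useful to add up the pauses in a prompt before sending it. The following is a minimal sketch using Python's standard XML parser; `total_break_seconds` is a hypothetical helper, and it assumes `time` values use the plain `Ns` or `Nms` forms.

```python
import re
import xml.etree.ElementTree as ET

MAX_TOTAL_BREAK_SECONDS = 10.0  # limit stated in the table above

# Accepts values such as "2s", "1.5s" or "500ms".
_TIME_PATTERN = re.compile(r"^(\d+(?:\.\d+)?)(ms|s)$")


def total_break_seconds(ssml):
    """Sum the `time` attributes of all pause elements in an SSML prompt.

    Hypothetical client-side helper, not part of the Voice API.
    """
    root = ET.fromstring(ssml)
    if root.tag != "speak":
        raise ValueError("SSML must start with the <speak> root element")
    total = 0.0
    for node in root.iter("break"):
        time_value = node.get("time")
        if time_value is None:
            continue  # no explicit duration to count
        match = _TIME_PATTERN.match(time_value)
        if match is None:
            raise ValueError(f"unsupported time value: {time_value!r}")
        amount, unit = match.groups()
        total += float(amount) / 1000 if unit == "ms" else float(amount)
    return total


prompt = "<speak>One moment. <break time='2s'/> Thank you. <break time='500ms'/></speak>"
seconds = total_break_seconds(prompt)
if seconds > MAX_TOTAL_BREAK_SECONDS:
    raise ValueError("total pause duration exceeds 10 seconds")
```

Parsing the prompt as XML also catches unclosed or mismatched tags early, since `ET.fromstring` raises `ParseError` on malformed markup.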

More advanced information about SSML and the attributes can be found in the W3C SSML Specification.