Audio and Voices
Custom audio files
If custom audio files are to be used, they should preferably meet the following specification:
Property | Value |
---|---|
Bit rate | 128 kbps |
Sample size | 16 bit |
Channels | 1 (Mono) |
Audio sample rate | 8 kHz (8000 Hz) |
Audio codec | PCM |
Filename | *.wav |
Other formats will be transcoded to the above.
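The specification above can be checked locally before uploading. The following is a minimal sketch using Python's standard `wave` module; the function name `check_wav` is hypothetical and not part of the Voice API. Note that 1 channel x 16 bit x 8000 Hz works out to the 128 kbps bit rate listed in the table.

```python
import wave

# Target values taken from the specification table above.
EXPECTED_CHANNELS = 1        # mono
EXPECTED_SAMPLE_WIDTH = 2    # bytes per sample, i.e. 16 bit
EXPECTED_SAMPLE_RATE = 8000  # Hz


def check_wav(path):
    """Return a list of mismatches; the list is empty if the file matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"channels: {wav.getnchannels()} (expected 1)")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH:
            problems.append(f"sample size: {wav.getsampwidth() * 8} bit (expected 16)")
        if wav.getframerate() != EXPECTED_SAMPLE_RATE:
            problems.append(f"sample rate: {wav.getframerate()} Hz (expected 8000)")
    return problems
```

The `wave` module only reads uncompressed PCM, so a `wave.Error` raised while opening the file usually means the file uses another encoding and would be transcoded after upload.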
To use custom audio files, you must first make them available to the Voice API. You do this with the Audio Manager, which lets you upload your custom audio files to our servers.
You are free to create any directory structure on this server. Just make sure you supply the full path from the root of your account when sending an instruction that requires an audio file.
If you want to use custom spelling audio, these files must be placed in the correct structure, as explained in the next chapter.
Custom audio files for the OTP app
In order to use custom audio files with the OTP app, you need to upload a set of audio files in the following directory structure using the Audio Manager:
/spelling/en-GB/*.wav
Where "en-GB" is the language (including locale) of the set. Inside this folder, you need to upload a .wav file for every number or letter you want to be able to read aloud, like:
0.wav
1.wav
2.wav
a.wav
b.wav
c.wav
For these custom audio files to work properly, please make sure that their filenames are in lowercase.
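To see which files a given code needs, the directory layout above can be sketched as a small helper. This is an illustration only: `spelling_files` is a hypothetical name, and skipping characters other than letters and digits is an assumption, since the text only mentions numbers and letters.

```python
def spelling_files(text, language="en-GB"):
    """List the audio paths needed to spell `text`, one per distinct character."""
    paths = []
    for ch in text.lower():  # filenames must be lowercase
        if not ch.isalnum():
            continue  # assumption: only letters and digits have audio files
        path = f"/spelling/{language}/{ch}.wav"
        if path not in paths:
            paths.append(path)
    return paths
```

For example, `spelling_files("A1b1")` lists `a.wav`, `1.wav` and `b.wav` under `/spelling/en-GB/`, which can then be compared with what has been uploaded in the Audio Manager.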
Text-To-Speech
The Voice API supports Text-To-Speech (or TTS) in all instructions where you can provide a prompt to the caller/callee. When using TTS, you can provide the voice you want to use. Currently we support the following voices:
Language / Locale | Code | Gender | Number of voices available |
---|---|---|---|
German | de-DE | Male | 12 |
German | de-DE | Female | 12 |
English (United States) | en-US | Male | 19 |
English (United States) | en-US | Female | 23 |
Dutch | nl-NL | Male | 4 |
Dutch | nl-NL | Female | 6 |
For more supported languages, see https://developers.cm.com/voice/docs/supported-languages
When using TTS (or the Spelling Instruction), you can provide the voice to use in the JSON body. The voice part of the JSON body has the following variables:
C#
instruction.Voice.Language = "nl-NL";
instruction.Voice.Gender = Gender.Male;
instruction.Voice.Number = 1;
instruction.Voice.Volume = 2;
JSON
"voice": {
"language": "nl-NL",
"gender": "Male",
"number": 1,
"volume": 2,
"premium": false
}
Variable definition
Variable | Data type | Length | Required | Description |
---|---|---|---|---|
language | Alphanumeric | 5 | No (Default: en-GB) | The language of the voice to use. |
gender | Alphanumeric | 6 | No (Default: Female) | The gender of the voice, either 'Male' or 'Female'. |
number | Numeric | 3 | No (Default: 1) | The number of the voice to use, if the given combination of language and gender provides multiple voices. |
volume | Numeric | 1 | No (Default: 0) | The volume level of the voice. Must be between -4 and 4. |
premium | Boolean | -- | No (Default: false) | Whether to use premium voices. Please be aware that additional costs will be charged for the use of premium voices. |
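The variable definitions above can be turned into a small client-side check before a request is sent. The sketch below is not part of any official SDK: `build_voice` is a hypothetical helper, its defaults are copied from the table, and the default of `false` for `premium` is read from the table row rather than confirmed elsewhere.

```python
def build_voice(language="en-GB", gender="Female", number=1, volume=0, premium=False):
    """Build the "voice" part of the JSON body, validating the documented limits."""
    if gender not in ("Male", "Female"):
        raise ValueError("gender must be 'Male' or 'Female'")
    if not -4 <= volume <= 4:
        raise ValueError("volume must be between -4 and 4")
    return {
        "voice": {
            "language": language,
            "gender": gender,
            "number": number,
            "volume": volume,
            "premium": premium,
        }
    }
```

Calling `build_voice(language="nl-NL", gender="Male", number=1, volume=2)` yields the same structure as the JSON example above, and the result can be merged into the instruction body before serializing it with `json.dumps`.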
SSML
With Speech Synthesis Markup Language (SSML) you can enhance your TTS prompts. For example, you can include a pause within a prompt or change the speech rate or pitch.
Example
{
"prompt": "<speak>Hi! I will wait for 2 seconds. <break time='2s'/> Now I will spell the word hello: <break time='1s'/> <say-as interpret-as='characters'>hello</say-as>. <prosody rate='-50%' pitch='-25%' volume='+2dB'>I am speaking a bit loudly, slowly and with a lower pitch.</prosody></speak>"
}
Attributes
Element | Description | Parameters | Values |
---|---|---|---|
<speak> | The root element where the SSML starts. Required in order to use SSML. | -- | -- |
<break> | Adds a pause between words. | time: duration in milliseconds (ms) or seconds (s) | Ns or Nms. The maximum total duration is 10 seconds per API call. |
<say-as> | Controls how special types of words are spoken. | interpret-as: determines how to say certain characters, words, and numbers | cardinal, characters, ordinal, fraction, unit, date, time, expletive |
<prosody> | Controls the volume, speaking rate, and pitch. | rate: how fast or slow the text is spoken. pitch: the pitch of the voice. volume: the volume of the text. | rate: +/-N%. pitch: +/-N%. volume: +/-NdB |
<emphasis> | Emphasizes words. | level: the degree of emphasis | strong, moderate, reduced |
More advanced information about SSML and the attributes can be found in the W3C SSML Specification.
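Because the total pause duration is limited to 10 seconds per API call, it can be useful to add up the pauses in a prompt before sending it. The following is a rough sketch, not an SSML parser: `total_break_seconds` is a hypothetical helper that only recognizes self-closing break tags with a `time` attribute written as `Ns` or `Nms`.

```python
import re

# Matches e.g. <break time='2s'/> or <break time="500ms"/>
BREAK_RE = re.compile(r"<break\s+time=['\"](\d+(?:\.\d+)?)(ms|s)['\"]\s*/>")


def total_break_seconds(ssml):
    """Sum the durations of all break tags in an SSML prompt, in seconds."""
    total = 0.0
    for value, unit in BREAK_RE.findall(ssml):
        total += float(value) / 1000 if unit == "ms" else float(value)
    return total
```

For the example prompt above, which contains a 2 second and a 1 second pause, the function returns 3.0, well within the 10 second limit.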