Audio and Voices

Custom audio files

If custom audio files are to be used, they should preferably meet the following specification:

Bit rate: 128 kbps

Sample size: 16 bit

Channels: Mono

Audio sample rate: 8 kHz (8000 Hz)

Audio codec: PCM

Filename: *.wav

Other formats will be transcoded to the above.
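
As a quick local check before uploading, a file can be compared against the preferred format with Python's standard `wave` module. This is a minimal sketch: the helper names `check_wav_spec` and `bit_rate_kbps` are illustrative and not part of the Voice API, and the check only inspects the WAV header.

```python
import wave

# Preferred format taken from the specification above.
EXPECTED_CHANNELS = 1        # mono
EXPECTED_SAMPLE_WIDTH = 2    # bytes per sample, i.e. 16 bit
EXPECTED_FRAME_RATE = 8000   # Hz


def check_wav_spec(path):
    """Return a list of deviations from the preferred format (empty if none).

    Hypothetical helper for illustration; not part of the Voice API.
    """
    try:
        # The wave module only opens uncompressed PCM data and raises
        # wave.Error for other codecs or malformed headers.
        with wave.open(path, "rb") as wav:
            channels = wav.getnchannels()
            sample_width = wav.getsampwidth()
            frame_rate = wav.getframerate()
    except (wave.Error, EOFError) as exc:
        return [f"not a readable PCM WAV file: {exc}"]

    problems = []
    if channels != EXPECTED_CHANNELS:
        problems.append(f"channels: {channels} (expected 1, mono)")
    if sample_width != EXPECTED_SAMPLE_WIDTH:
        problems.append(f"sample size: {sample_width * 8} bit (expected 16 bit)")
    if frame_rate != EXPECTED_FRAME_RATE:
        problems.append(f"sample rate: {frame_rate} Hz (expected 8000 Hz)")
    return problems


def bit_rate_kbps(channels, sample_width, frame_rate):
    """Bit rate of uncompressed PCM audio in kbps."""
    return channels * sample_width * 8 * frame_rate / 1000
```

For the preferred format, `bit_rate_kbps(1, 2, 8000)` gives the 128 kbps listed above, so the bit rate follows from the other three values.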

Custom audio files must be available to the Voice API before they can be used. The Audio Manager handles this: it allows you to upload your custom audio files onto our servers.

You are free to create any directory structure on this server. Just make sure you supply the full path from the root of your account when sending an instruction that requires an audio file.

If you want to use custom spelling audio, these files must be placed in the correct structure, as explained in the next section.

Custom audio files for the OTP app

In order to use custom audio files with the OTP app, you need to upload a set of audio files in the following directory structure using the Audio Manager:

/spelling/en-GB/*.wav

Here, ‘en-GB’ is the language (including locale) of the set. Inside this folder, upload a .wav file for every digit or letter you want to have read aloud, for example:

0.wav

1.wav

2.wav

a.wav

b.wav

c.wav

Note that these file names are all lower case.
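
The layout above can be checked before an upload with a short script. The sketch below builds the expected path for each character and reports which files are still missing. The helper names are hypothetical, and the default character set (digits 0-9 and letters a-z) is an assumption; adjust it to the characters you actually need.

```python
import string

SPELLING_ROOT = "/spelling"

# Assumption: digits and lower-case ASCII letters are the characters to cover.
DEFAULT_CHARACTERS = string.digits + string.ascii_lowercase


def spelling_path(language, character):
    """Build the expected path of the audio file for one character.

    File names are lower case, as required by the directory structure.
    """
    return f"{SPELLING_ROOT}/{language}/{character.lower()}.wav"


def missing_spelling_files(language, uploaded_paths, characters=DEFAULT_CHARACTERS):
    """Return the expected paths that are not present in uploaded_paths."""
    uploaded = set(uploaded_paths)
    expected = [spelling_path(language, c) for c in characters]
    return [path for path in expected if path not in uploaded]
```

For example, `spelling_path("en-GB", "A")` returns `/spelling/en-GB/a.wav`.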

Text-To-Speech

The Voice API supports Text-To-Speech (or TTS) in all instructions where you can provide a prompt to the caller/callee. When using TTS, you can provide the voice you want to use. Currently we support the following voices:

Language / Locale | Code | Gender | Number of voices available
Welsh | cy-GB | Female | 1
Danish | da-DK | Female | 1
Danish | da-DK | Male | 1
German | de-DE | Female | 2
German | de-DE | Male | 1
Australian English | en-AU | Female | 1
Australian English | en-AU | Male | 1
UK English | en-GB | Female | 2
UK English | en-GB | Male | 2
Indian English | en-IN | Female | 1
US English | en-US | Female | 5
US English | en-US | Male | 2
Spanish | es-ES | Female | 1
Spanish | es-ES | Male | 1
US Spanish | es-US | Female | 1
US Spanish | es-US | Male | 1
Canadian French | fr-CA | Female | 1
French | fr-FR | Female | 1
French | fr-FR | Male | 1
Hindi | hi-IN | Female | 1
Icelandic | is-IS | Female | 1
Icelandic | is-IS | Male | 1
Italian | it-IT | Female | 1
Italian | it-IT | Male | 1
Japanese | ja-JP | Female | 1
Japanese | ja-JP | Male | 1
Korean | ko-KR | Female | 1
Norwegian Bokmål | nb-NO | Female | 1
Dutch | nl-NL | Female | 1
Dutch | nl-NL | Male | 1
Polish | pl-PL | Female | 2
Polish | pl-PL | Male | 2
Brazilian Portuguese | pt-BR | Female | 1
Brazilian Portuguese | pt-BR | Male | 1
Portuguese | pt-PT | Female | 1
Portuguese | pt-PT | Male | 1
Romanian | ro-RO | Female | 1
Russian | ru-RU | Female | 1
Russian | ru-RU | Male | 1
Swedish | sv-SE | Female | 1
Turkish | tr-TR | Female | 1
Chinese | zh-CHS | Female | 1

When using TTS (or the Spelling Instruction), you can provide the voice to use in the JSON body. The voice part of the JSON body has the following variables:

C#

instruction.Voice.Language = "nl-NL";
instruction.Voice.Gender = Gender.Male;
instruction.Voice.Number = 1;
instruction.Voice.Volume = 2;

JSON

"voice": {
    "language": "nl-NL",
    "gender": "Male",
    "number": 1,
    "volume": 2
}

Variable definition

Variable | Data type | Length | Required | Description
language | Alphanumeric | 5 | No (Default: en-GB) | The language of the voice to use.
gender | Alphanumeric | 6 | No (Default: Female) | The gender of the voice, either 'Male' or 'Female'.
number | Numeric | 3 | No (Default: 1) | The number of the voice to use, if the given combination of language and gender provides multiple voices.
volume | Numeric | 1 | No (Default: 0) | The volume level of the voice. Must be between -4 and 4.
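
The variable definitions above can be turned into a small client-side builder that applies the documented defaults and rejects out-of-range values before a request is sent. This is a sketch only: `build_voice` is a hypothetical helper, not part of any official SDK, and it checks just the constraints stated in the table.

```python
import json


def build_voice(language="en-GB", gender="Female", number=1, volume=0):
    """Build the "voice" part of the JSON body using the documented defaults.

    Hypothetical helper for illustration; raises ValueError when a value
    falls outside the constraints given in the variable definition.
    """
    if gender not in ("Male", "Female"):
        raise ValueError("gender must be 'Male' or 'Female'")
    if not -4 <= volume <= 4:
        raise ValueError("volume must be between -4 and 4")
    return {
        "language": language,
        "gender": gender,
        "number": number,
        "volume": volume,
    }


if __name__ == "__main__":
    # Reproduces the JSON example shown earlier in this section.
    body = {"voice": build_voice("nl-NL", "Male", number=1, volume=2)}
    print(json.dumps(body, indent=4))
```

Calling `build_voice()` without arguments yields the default voice: en-GB, Female, voice number 1, volume 0.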

SSML

With Speech Synthesis Markup Language (SSML) you can enhance your TTS prompts. For example, you can include a pause within a prompt or change the speech rate or pitch.

Example

{
    "prompt": "<speak>Hi! I will wait for 2 seconds. <break time='2s'/> Now I will spell the word hello: <break time='1s'/> <say-as interpret-as='characters'>hello</say-as>. <prosody rate='-50%' pitch='-25%' volume='+2dB'>I am speaking a bit loudly, slowly and with a lower pitch.</prosody></speak>"
}

Attributes

Type | Description | Parameters | Values
<speak> | The root element where the SSML starts. Required in order to use SSML. | -- | --
<break> | Adds a pause between words. | time: time in milliseconds (ms) or seconds (s) | Ns or Nms. Maximum total duration is 10 seconds per API call.
<say-as> | Controls how special types of words are spoken. | interpret-as: determines how to say certain characters, words, and numbers | cardinal, characters, ordinal, fraction, unit, date, time, expletive
<prosody> | Controls the volume, speaking rate, and pitch. | rate: determines how fast or slow the text should be spoken | +/- N%
<prosody> |  | pitch: determines the pitch of the voice | +/- N%
<prosody> |  | volume: changes the volume for the text | +/- NdB
<emphasis> | Emphasizes words. | level: specifies the degree of emphasis | strong, moderate, reduced
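
Because the total pause duration is limited to 10 seconds per API call, it can be useful to add up the `<break>` times in a prompt before sending it. The sketch below does this with Python's standard XML parser. The function names are hypothetical, accepting decimal values such as `1.5s` is an assumption, and the way the API itself counts pauses may differ.

```python
import re
import xml.etree.ElementTree as ET

MAX_TOTAL_BREAK_SECONDS = 10.0

# Matches the documented "Ns" and "Nms" forms; decimals are an assumption.
_DURATION = re.compile(r"^(\d+(?:\.\d+)?)(ms|s)$")


def break_seconds(value):
    """Convert a <break> time value such as '2s' or '500ms' to seconds."""
    match = _DURATION.match(value.strip())
    if match is None:
        raise ValueError(f"unsupported break time: {value!r}")
    amount, unit = match.groups()
    return float(amount) / 1000 if unit == "ms" else float(amount)


def total_break_seconds(ssml):
    """Sum the time attributes of all <break> elements in an SSML prompt."""
    root = ET.fromstring(ssml)
    if root.tag != "speak":
        raise ValueError("SSML must start with a <speak> root element")
    # Only breaks with an explicit time attribute are counted.
    return sum(
        break_seconds(element.get("time"))
        for element in root.iter("break")
        if element.get("time") is not None
    )


def within_break_limit(ssml):
    """Return True if the summed pauses stay within the 10-second limit."""
    return total_break_seconds(ssml) <= MAX_TOTAL_BREAK_SECONDS
```

Applied to the example prompt above, which contains one 2-second and one 1-second pause, `total_break_seconds` returns 3.0.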

More advanced information about SSML and the attributes can be found in the W3C SSML Specification.

