Audio and Voices

Custom audio files

If custom audio files are to be used, they should preferably meet the following specification:

Bit rate: 128 kbps
Sample size: 16 bit
Channels: 1 (Mono)
Audio sample rate: 8 kHz (8000 Hz)
Audio codec: PCM

Other formats will be transcoded to the above.
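Checking a file locally before uploading avoids relying on transcoding. The following is a minimal sketch that uses Python's standard `wave` module to compare a WAV file against the values listed above. The helper names `check_wav` and `make_silence` are illustrative and not part of the Voice API. The 128 kbps bit rate follows from the other values (8000 Hz × 16 bit × 1 channel), so it is not checked separately.

```python
import io
import wave

# Target values taken from the specification above.
EXPECTED_CHANNELS = 1        # mono
EXPECTED_SAMPLE_WIDTH = 2    # bytes per sample, i.e. 16 bit
EXPECTED_FRAME_RATE = 8000   # Hz


def check_wav(source):
    """Return a list of mismatches between a WAV file and the preferred spec.

    `source` is a file path string or a binary file-like object.
    wave.open raises wave.Error for files it cannot parse, which includes
    most non-PCM encodings.
    """
    problems = []
    with wave.open(source, "rb") as wav:
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"channels: {wav.getnchannels()}")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH:
            problems.append(f"sample size: {wav.getsampwidth() * 8} bit")
        if wav.getframerate() != EXPECTED_FRAME_RATE:
            problems.append(f"sample rate: {wav.getframerate()} Hz")
    return problems


def make_silence(channels, sample_width, frame_rate, frames=800):
    """Build an in-memory WAV of zero-valued samples, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(frame_rate)
        wav.writeframes(b"\x00" * frames * channels * sample_width)
    buffer.seek(0)
    return buffer
```

An empty list from `check_wav` means the file already matches the preferred format; any entries describe the properties that would be transcoded.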

To use custom audio files, they must be available to the Voice API. Go to Audio Files within the Voice Management App to upload your custom audio files to our servers.

You are free to create any directory structure on this server. Just make sure you supply the full path from the root of your account when sending an instruction that requires an audio file.

If you want to use custom spelling audio, these files must be placed in the correct structure, as explained in the next chapter.

Custom audio files for the OTP app

In order to use custom audio files with the OTP app, you need to upload a set of audio files in the following directory structure using the Voice Management App:


Where ‘en-GB’ is the language (including locale) of the set. Inside this folder, you need to upload a .wav file for every number or letter you want to be able to read aloud, like:



For these custom audio files to work properly, please make sure that their filenames are in lowercase.
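The lowercase requirement is easy to miss when files come from different sources. As a rough sketch, the helper below lists the .wav files in a local folder whose names are not fully lowercase, so they can be renamed before uploading. The function name `non_lowercase_files` and the local folder are assumptions for illustration; the check runs on your own machine and does not call the Voice API.

```python
from pathlib import Path


def non_lowercase_files(folder):
    """Return the names of .wav files in `folder` that are not all lowercase."""
    return sorted(
        path.name
        for path in Path(folder).iterdir()
        if path.is_file()
        and path.suffix.lower() == ".wav"
        and path.name != path.name.lower()
    )
```

The extension is compared case-insensitively so that a file such as `1.WAV` is reported rather than silently skipped.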


The Voice API supports Text-To-Speech (or TTS) in all instructions where you can provide a prompt to the caller/callee. When using TTS, you can provide the voice you want to use. Currently we support the following voices:

Language / Locale | Code | Gender | Number of voices available
--- | --- | --- | ---
English (United States) | en-US | Male | 19
English (United States) | en-US | Female | 23

See more supported languages

When using TTS (or the Spelling Instruction), you can provide the voice to use in the JSON body. The voice part of the JSON body has the following variables:


instruction.Voice.Language = "nl-NL";
instruction.Voice.Gender = Gender.Male;
instruction.Voice.Number = 1;
instruction.Voice.Volume = 2;


"voice": {
    "language": "nl-NL",
    "gender": "Male",
    "number": 1,
    "volume": 2,
    "premium": false
}

Variable definition

Variable | Data type | Length | Required | Description
--- | --- | --- | --- | ---
language | Alphanumeric | 5 | No (Default: en-GB) | The language of the voice to use.
gender | Alphanumeric | 6 | No (Default: Female) | The gender of the voice, either 'Male' or 'Female'.
number | Numeric | 3 | No (Default: 1) | The number of the voice to use, if the given combination of language and gender provides multiple voices.
volume | Numeric | 1 | No (Default: 0) | The volume level of the voice. Must be between -4 and 4.
premium | Boolean | | false | Option to use premium voices. Please be aware that additional costs will be charged for the use of premium voices.
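The constraints in the variable definition can be checked on the client before a request is sent. The sketch below builds the `voice` object as a Python dict and applies the documented defaults and limits. The function name `build_voice` is hypothetical, and the lower bound of 1 for `number` is an assumption based on the default value rather than a documented rule.

```python
import json

ALLOWED_GENDERS = {"Male", "Female"}


def build_voice(language="en-GB", gender="Female", number=1, volume=0, premium=False):
    """Return the `voice` part of the JSON body as a dict, after basic checks."""
    if len(language) > 5:
        raise ValueError("language must be at most 5 characters, e.g. 'en-GB'")
    if gender not in ALLOWED_GENDERS:
        raise ValueError("gender must be 'Male' or 'Female'")
    # Assumption: voice numbers start at 1 (the default) and have at most 3 digits.
    if not 0 < number < 1000:
        raise ValueError("number must be between 1 and 999")
    if not -4 <= volume <= 4:
        raise ValueError("volume must be between -4 and 4")
    return {
        "language": language,
        "gender": gender,
        "number": number,
        "volume": volume,
        "premium": bool(premium),
    }


# Serialize the same values as the JSON example above.
voice_json = json.dumps({"voice": build_voice("nl-NL", "Male", 1, 2)})
```

Whether a given combination of language, gender, and number actually exists can only be confirmed against the list of supported voices; this sketch checks the value ranges only.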


With Speech Synthesis Markup Language (SSML) you can enhance your TTS prompts. For example, you can include a pause within a prompt or change the speech rate or pitch.


    "prompt": "<speak>Hi! I will wait for 2 seconds. <break time='2s'/> Now I will spell the word hello: <break time='1s'/> <say-as interpret-as='characters'>hello</say-as>. <prosody rate='-50%' pitch='-25%' volume='+2dB'>I am speaking a bit loudly, slowly and with a lower pitch.</prosody></speak>"
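Because the SSML travels inside a JSON string, malformed markup or unescaped quotes are easy to introduce when a prompt is assembled by hand. The sketch below checks that a prompt is well-formed XML rooted at `<speak>` and lets `json.dumps` handle the string escaping. The helper name `ssml_prompt` is hypothetical, and the check covers XML well-formedness only; it does not verify which SSML elements the Voice API accepts.

```python
import json
import xml.etree.ElementTree as ET


def ssml_prompt(ssml):
    """Check that `ssml` is well-formed XML rooted at <speak>, then return it."""
    root = ET.fromstring(ssml)  # raises ET.ParseError on malformed markup
    if root.tag != "speak":
        raise ValueError("SSML must be wrapped in a <speak> element")
    return ssml


ssml = (
    "<speak>Hi! I will wait for 2 seconds. <break time='2s'/> "
    "<say-as interpret-as='characters'>hello</say-as></speak>"
)

# json.dumps escapes any characters that are not allowed in a JSON string.
body_fragment = json.dumps({"prompt": ssml_prompt(ssml)})
```

Using single quotes for the SSML attributes, as in the example above, keeps the serialized JSON free of escaped double quotes.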


Tag | Description | Attribute | Attribute description | Values
--- | --- | --- | --- | ---
<speak> | The root element where the SSML starts. Required in order to use SSML. | - | - | -
<break> | Adds a pause between words. | time | Pause duration in seconds (s) or milliseconds (ms). Maximum total duration is 10 seconds per API call. | Ns or Nms
<say-as> | Controls how special types of words are spoken. | interpret-as | Determines how to say certain characters, words, or numbers. | cardinal, characters, ordinal, fraction, unit, date, time, expletive
<prosody> | Controls the volume, speaking rate, and pitch. | rate | Determines how fast or slow the text should be spoken. | +/- N%
 | | pitch | Determines the pitch of the voice. | +/- N%
 | | volume | Changes the volume for the text. | +/- NdB
<emphasis> | Emphasizes words. | level | Specifies the degree of emphasis. | strong, moderate, reduced
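The 10-second limit on pauses can be checked before sending a prompt. The following is a minimal sketch that sums the `time` attributes of all break elements in an SSML string. It assumes the limit applies to the combined duration of all pauses in one prompt, and the function names are illustrative rather than part of the Voice API.

```python
import re
import xml.etree.ElementTree as ET

MAX_TOTAL_BREAK_SECONDS = 10.0
_TIME_PATTERN = re.compile(r"^(\d+(?:\.\d+)?)(ms|s)$")


def total_break_seconds(ssml):
    """Sum the durations of all <break time='...'/> elements, in seconds."""
    total = 0.0
    for element in ET.fromstring(ssml).iter("break"):
        time_value = element.get("time")
        if time_value is None:
            continue  # no explicit duration to count
        match = _TIME_PATTERN.match(time_value)
        if match is None:
            raise ValueError(f"unsupported time value: {time_value!r}")
        value, unit = match.groups()
        total += float(value) / 1000 if unit == "ms" else float(value)
    return total


def within_break_limit(ssml):
    """Return True if the summed pauses stay within the documented maximum."""
    return total_break_seconds(ssml) <= MAX_TOTAL_BREAK_SECONDS
```

For instance, a prompt with one 2 s pause and one 500 ms pause totals 2.5 seconds and passes the check, while three 4 s pauses would not.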

More advanced information about SSML and the attributes can be found in the W3C SSML Specification.