Stream Speech
Converts text to speech and streams the audio bytes back in real time through our ultra-fast text-in, audio-out API.
This endpoint supports two models:
- Play 3.0 Mini: Our fast and efficient model for single-voice text-to-speech.
- Dialog 1.0: Our flagship model, with the best quality and multi-turn dialogue capabilities.
We also offer Dialog 1.0 Turbo, a faster version of Dialog 1.0, which is available from a separate endpoint.
For more information, see Models.
Check out the How to use Dialog 1.0 Text-to-Speech API guide for a step-by-step approach to using the Dialog 1.0 API to convert text into natural, human-sounding audio.
See also the Create a Multi-Turn Scripted Conversation with the Dialog 1.0 API guide for examples of how to create a multi-turn scripted conversation between two distinct speakers.
Authorizations
Your secret API key from PlayAI, formatted as Bearer YOUR_SECRET_API_KEY.
Body
The model used to synthesize the voice.
Play3.0-mini, PlayDialog
"Play3.0-mini"
The text to be converted to speech. Limited to 20k characters for Play3.0-mini, 50k characters for PlayDialog.
"Country Mouse: Welcome to my humble home, cousin! Town Mouse: Thank you, cousin. It's quite... peaceful here. Country Mouse: It is indeed. I hope you're hungry. I've prepared a simple meal of beans, barley, and fresh roots. Town Mouse: Well, it's... earthy. Do you eat this every day?"
The unique ID for a voice to be used. See voice2 for multi-turn dialogue generations.
"s3://voice-cloning-zero-shot/baf1ef41-36b6-428c-9bdf-50ba54682bd8/original/manifest.json"
The unique ID for a voice to be used as the second character in multi-turn dialogue generations. Only supported with the PlayDialog model.
"s3://voice-cloning-zero-shot/baf1ef41-36b6-428c-9bdf-50ba54682bd8/original/manifest.json"
The quality of the generated audio. Only supported with the Play3.0-mini model.
draft, low, medium, high, premium
The format for the output audio.
mp3, mulaw, raw, wav, ogg, flac
Controls the speed of the generated audio, where 1.0 is the natural pace.
0 < x <= 5
0.8
The sample rate of the output audio in Hz.
8000 <= x <= 48000
24000
The seed used to generate the audio.
If null or not provided, a random seed is used. A fixed seed makes the generated audio reproducible: as long as no other property changes, the same seed should always generate the exact same audio file.
x >= 0
256
The temperature used to generate the audio.
If null or not provided, the model's default temperature is used. The temperature controls variance. Lower temperatures give more predictable results; higher temperatures allow each run to vary more, so the voice may sound less like the baseline voice.
0 <= x <= 2
1.5
The voice guidance used to generate the audio. Only supported with the Play3.0-mini model.
Use lower numbers to reduce how unique your chosen voice will be compared to other voices. Higher numbers will maximize its individuality.
1 <= x <= 6
null
The style guidance used to generate the audio. Only supported with the Play3.0-mini model.
Use lower numbers to reduce how strong your chosen emotion will be. Higher numbers will create a very emotional performance.
1 <= x <= 30
null
The text guidance used to generate the audio. Only supported with the Play3.0-mini model.
This number influences how closely the generated speech adheres to the input text. Use lower values to create more fluid speech, but with a higher chance of deviating from the input text. Higher numbers will make the generated speech more accurate to the input text, ensuring that the words spoken align closely with the provided text.
1 <= x <= 2
1.25
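The limits and ranges listed above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the official API: the dictionary keys (speed, sample_rate, seed, temperature, voice_guidance, style_guidance, text_guidance) are hypothetical field names chosen for illustration, "20k" and "50k" are assumed to mean 20,000 and 50,000, and characters are counted with Python's len(), which may differ from how the service counts them. Treating null as "not provided" is documented only for the seed and temperature and is assumed for the other fields.

```python
from typing import Any, Optional

# Text limits per model, taken from the text parameter description.
MAX_CHARS = {"Play3.0-mini": 20_000, "PlayDialog": 50_000}

# Documented numeric ranges as (low, high, low_is_inclusive).
# The keys are hypothetical field names, not confirmed by this page.
RANGES: dict[str, tuple[float, Optional[float], bool]] = {
    "speed": (0, 5, False),             # 0 < x <= 5
    "sample_rate": (8000, 48000, True), # 8000 <= x <= 48000
    "seed": (0, None, True),            # x >= 0, no upper bound stated
    "temperature": (0, 2, True),        # 0 <= x <= 2
    "voice_guidance": (1, 6, True),     # 1 <= x <= 6
    "style_guidance": (1, 30, True),    # 1 <= x <= 30
    "text_guidance": (1, 2, True),      # 1 <= x <= 2
}

# Guidance options documented as Play3.0-mini only.
MINI_ONLY = {"voice_guidance", "style_guidance", "text_guidance"}


def validate_request(model: str, text: str, options: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems: list[str] = []

    limit = MAX_CHARS.get(model)
    if limit is None:
        problems.append(f"unknown model {model!r}")
    elif len(text) > limit:
        problems.append(f"text has {len(text)} characters; {model} allows {limit}")

    for name, value in options.items():
        # Skip null values (assumed to mean "not provided") and unknown fields.
        if value is None or name not in RANGES:
            continue
        low, high, low_inclusive = RANGES[name]
        too_low = value < low if low_inclusive else value <= low
        too_high = high is not None and value > high
        if too_low or too_high:
            problems.append(f"{name}={value} is outside the documented range")
        if name in MINI_ONLY and model != "Play3.0-mini":
            problems.append(f"{name} is only supported with Play3.0-mini")
    return problems
```

Calling validate_request("PlayDialog", "Hello", {"text_guidance": 1.25}) returns one problem, because the text guidance is documented as Play3.0-mini only. The server remains the authority on what it accepts; a check like this only catches obvious mistakes early.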
The prefix that marks the start of a turn spoken by voice in a multi-turn dialogue. Only supported with the PlayDialog model.
"Country Mouse:"
The prefix that marks the start of a turn spoken by voice2 in a multi-turn dialogue. Only supported with the PlayDialog model.
"Town Mouse:"
The prompt to be used for voice. Only supported with the PlayDialog model.
The prompt to be used for voice2. Only supported with the PlayDialog model.
The number of seconds of conditioning to use from the selected voice. Only supported with the PlayDialog model.
Lower values generate audio less similar to the cloned voice, but lead to more model stability and expressiveness. Higher values create output more similar to the cloned voice, but can lead to model instability and reduced expressiveness.
20
The number of seconds of conditioning to use from the selected voice2. Only supported with the PlayDialog model.
Lower values generate audio less similar to the cloned voice, but lead to more model stability and expressiveness. Higher values create output more similar to the cloned voice, but can lead to model instability and reduced expressiveness.
20
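The PlayDialog-only parameters above fit together as follows: each turn in the text starts with the prefix of the voice that speaks it. The sketch below assembles such a request body from a list of turns. Only voice and voice2 are named on this page; the other keys (model, text, turn_prefix, turn_prefix2) are assumed names used for illustration and should be checked against the actual request schema.

```python
from typing import Any


def build_dialogue_payload(
    turns: list[tuple[int, str]],
    voice: str,
    voice2: str,
    prefix: str = "Country Mouse:",
    prefix2: str = "Town Mouse:",
) -> dict[str, Any]:
    """Build a PlayDialog request body from (speaker, line) pairs.

    speaker is 1 for voice and 2 for voice2.
    """
    parts = []
    for speaker, line in turns:
        if speaker not in (1, 2):
            raise ValueError(f"speaker must be 1 or 2, got {speaker!r}")
        # Each turn begins with the prefix of the voice that speaks it.
        parts.append(f"{prefix if speaker == 1 else prefix2} {line}")

    return {
        "model": "PlayDialog",      # assumed field name
        "text": " ".join(parts),    # assumed field name
        "voice": voice,
        "voice2": voice2,
        "turn_prefix": prefix,      # assumed field name
        "turn_prefix2": prefix2,    # assumed field name
    }


# Example voice ID copied from the voice parameter on this page.
VOICE_ID = "s3://voice-cloning-zero-shot/baf1ef41-36b6-428c-9bdf-50ba54682bd8/original/manifest.json"

payload = build_dialogue_payload(
    [(1, "Welcome to my humble home, cousin!"), (2, "Thank you, cousin.")],
    voice=VOICE_ID,
    voice2=VOICE_ID,
)
```

In practice voice and voice2 would be two different voice IDs so that the speakers are distinguishable; the same ID is reused here only because it is the one example value this page provides.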
The language of the voices. Defaults to english.
afrikaans, albanian, amharic, arabic, bengali, bulgarian, catalan, croatian, czech, danish, dutch, english, french, galician, german, greek, hebrew, hindi, hungarian, indonesian, italian, japanese, korean, malay, mandarin, polish, portuguese, russian, serbian, spanish, swedish, tagalog, thai, turkish, ukrainian, urdu, xhosa
"english"
Response
The response is of type file.
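Because the response is a file delivered as a stream of audio bytes, a client can write it to disk chunk by chunk instead of buffering the whole response. The sketch below uses only the Python standard library and rests on several assumptions not stated on this page: STREAM_URL is a placeholder rather than the real endpoint, the body is assumed to be JSON sent with POST, the key is assumed to travel in an Authorization header, and the payload keys are illustrative. The service may also require headers beyond the ones shown.

```python
import json
import urllib.request

# Placeholder only: this page does not state the endpoint URL.
STREAM_URL = "https://api.example.invalid/tts/stream"


def build_request(api_key: str, payload: dict, url: str = STREAM_URL) -> urllib.request.Request:
    """Prepare a POST request carrying the JSON body and the Bearer key."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Key formatted as "Bearer YOUR_SECRET_API_KEY", per Authorizations.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def stream_to_file(request: urllib.request.Request, path: str, chunk_size: int = 4096) -> int:
    """Send the request and write audio bytes to path as they arrive.

    Returns the number of bytes written.
    """
    written = 0
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        while chunk := response.read(chunk_size):
            out.write(chunk)
            written += len(chunk)
    return written


# Building the request does not touch the network; stream_to_file does.
request = build_request(
    "YOUR_SECRET_API_KEY",
    {"model": "Play3.0-mini", "text": "Hello there."},  # assumed field names
)
```

With the real endpoint URL substituted, stream_to_file(request, "speech.mp3") would save the audio. A playback pipeline could hand each chunk to an audio decoder instead of writing it to a file.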