POST /api/v1/tts/stream

See the How to use PlayDialog Text-to-Speech API guide for a step-by-step approach to converting text into natural, human-sounding audio with the PlayDialog API.

See also the Create a Multi-Turn Scripted Conversation with the PlayDialog API guide for examples of how to create a multi-turn scripted conversation between two distinct speakers.

Get your Credentials

To use the HTTP API you need an API key and a user ID. You can get both at https://play.ai/developers.

Explore our models: Play3.0-mini and PlayDialog

Our API currently supports two models: Play3.0-mini and PlayDialog. Use the model parameter to select the model you want to use. Play3.0-mini is a lightweight model that generates high-quality audio with a focus on speed. PlayDialog is a more advanced model that can generate turn-based dialogues with multiple voices.

For details on the specific properties of each, see the examples below.

Example

For code examples, see the interactive code snippets to the right. The provided examples will return an audio buffer stream that you can use to save locally or stream over the network to a browser, app, or telephony system.

For the complete list of supported parameters, see below.

Authorizations

AUTHORIZATION
string
header
required

API key required for this endpoint. Use Bearer YOUR_SECRET_API_KEY. Get your key from https://play.ai/developers.

X-USER-ID
string
header
required

User ID required for this endpoint. Get it from https://play.ai/developers.
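
As a minimal sketch of the two authorization headers described above, the helper below assembles them into a dictionary. The function name `build_headers` is hypothetical, and the `Content-Type` entry is included only because the request body is documented as application/json.

```python
def build_headers(api_key: str, user_id: str) -> dict[str, str]:
    """Return the request headers for the TTS stream endpoint.

    build_headers is a hypothetical helper, not part of any PlayAI SDK.
    """
    if not api_key or not user_id:
        raise ValueError("both an API key and a user ID are required")
    return {
        # The API key is sent as a Bearer token.
        "AUTHORIZATION": f"Bearer {api_key}",
        # The user ID is sent as a separate header.
        "X-USER-ID": user_id,
        # The body is documented as application/json.
        "Content-Type": "application/json",
    }
```

In practice the key and user ID would typically come from environment variables or a secrets store rather than being written into source code.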

Body

application/json
model
enum<string>
required

The voice engine used to synthesize the voice.

Available options:
PlayDialog
text
string
required

The text to be converted to speech. Limited to 50k characters. See the Create a Multi-Turn Scripted Conversation with the PlayDialog API guide for instructions on how to best use the multi-turn capabilities.

voice
string
required

The unique ID for a PlayAI Voice to be used. See voice2 for multi-turn dialogue generations.

voice2
string

The unique ID for a PlayAI Voice to be used as the second speaker in multi-turn dialogue generations. See the Create a Multi-Turn Scripted Conversation with the PlayDialog API guide for instructions on how to best use the multi-turn capabilities.

outputFormat
enum<string> | null
default: mp3

The format for the output audio.

Available options:
mp3, mulaw, raw, wav, ogg, flac
speed
number

Controls how fast the generated audio is spoken. Must be a number greater than 0 and less than or equal to 5.0.

sampleRate
number

The sample rate of the generated audio. Must be a number greater than or equal to 8000 and less than or equal to 48000.

seed
number | null

An integer number greater than or equal to 0. If equal to null or not provided, a random seed will be used. Useful to control the reproducibility of the generated audio. Assuming all other properties didn't change, a fixed seed should always generate the exact same audio file.

temperature
number | null

A floating point number between 0 and 2, inclusive. If equal to null or not provided, the model's default temperature will be used. The temperature parameter controls variance: lower temperatures give more predictable results, while higher temperatures allow each run to vary more, so the voice may sound less like the baseline voice.
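
The numeric ranges documented for speed, sampleRate, seed, and temperature can be checked on the client before a request is sent. The sketch below is one way to do that. `invalid_options` is a hypothetical helper, and the server's own validation remains authoritative.

```python
def invalid_options(speed=None, sample_rate=None, seed=None, temperature=None):
    """Return the names of parameters that fall outside the documented ranges.

    None means "not provided" and is always accepted.
    """
    bad = []
    # speed: greater than 0 and less than or equal to 5.0
    if speed is not None and not (0 < speed <= 5.0):
        bad.append("speed")
    # sampleRate: between 8000 and 48000, inclusive
    if sample_rate is not None and not (8000 <= sample_rate <= 48000):
        bad.append("sampleRate")
    # seed: an integer greater than or equal to 0 (bool is rejected explicitly)
    if seed is not None and (
        isinstance(seed, bool) or not isinstance(seed, int) or seed < 0
    ):
        bad.append("seed")
    # temperature: between 0 and 2, inclusive
    if temperature is not None and not (0 <= temperature <= 2):
        bad.append("temperature")
    return bad
```

Returning a list of names rather than raising on the first problem lets a caller report every out-of-range value at once.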

turnPrefix
string | null

The prefix to indicate the start of a turn in a multi-turn dialogue with voice.

turnPrefix2
string | null

The prefix to indicate the start of a turn in a multi-turn dialogue with voice2.
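
To illustrate how text, voice, voice2, turnPrefix, and turnPrefix2 might fit together, here is a rough sketch that builds a two-speaker request body. The helper name `build_dialogue_body`, the default prefixes, and the choice to put each turn on its own line are assumptions made for illustration. The voice IDs must be real PlayAI voice IDs from your account.

```python
def build_dialogue_body(turns, voice, voice2,
                        prefix="Speaker 1:", prefix2="Speaker 2:"):
    """Build a JSON-serializable body for a two-speaker PlayDialog request.

    turns is a list of (speaker, line) pairs, where speaker is 1 or 2.
    build_dialogue_body is a hypothetical helper, not part of any SDK.
    """
    prefixes = {1: prefix, 2: prefix2}
    # Assumption: each turn starts with its speaker's prefix, one turn per line.
    text = "\n".join(f"{prefixes[speaker]} {line}" for speaker, line in turns)
    # The text field is documented as limited to 50k characters.
    if len(text) > 50_000:
        raise ValueError("text exceeds the 50k character limit")
    return {
        "model": "PlayDialog",
        "text": text,
        "voice": voice,
        "voice2": voice2,
        "turnPrefix": prefix,
        "turnPrefix2": prefix2,
    }
```

For example, `build_dialogue_body([(1, "Hi there."), (2, "Hello!")], "VOICE_ID_1", "VOICE_ID_2")` yields a body whose text field has two lines, each starting with the matching prefix.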

prompt
string | null

The prompt to be used for the PlayDialog model with voice.

prompt2
string | null

The prompt to be used for the PlayDialog model with voice2.

voiceConditioningSeconds
number | null
default: 20

The number of seconds of conditioning to use from the selected voice. Lower values generate audio less similar to the cloned voice, but lead to more model stability and expressiveness. Higher values create output more similar to the cloned voice, but can lead to model instability and reduced expressiveness.

voiceConditioningSeconds2
number | null
default: 20

The number of seconds of conditioning to use from the selected voice2. Lower values generate audio less similar to the cloned voice, but lead to more model stability and expressiveness. Higher values create output more similar to the cloned voice, but can lead to model instability and reduced expressiveness.

Response

200 - audio/mpeg

The response is of type file.
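
Because the response body is the audio itself, a client can write it to disk in chunks as it arrives rather than buffering the whole file. The sketch below does this with Python's standard library. The host in `BASE_URL` is an assumption (this reference gives only the path), and `build_request` and `stream_to_file` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: the API host is not stated in this reference; adjust as needed.
BASE_URL = "https://api.play.ai"


def build_request(headers, body, base_url=BASE_URL):
    """Create a POST request for the TTS stream endpoint."""
    return urllib.request.Request(
        base_url + "/api/v1/tts/stream",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def stream_to_file(request, path, opener=urllib.request.urlopen, chunk_size=8192):
    """Send the request and copy the audio response to path in chunks.

    Returns the number of bytes written. With the default opener, a non-2xx
    status raises urllib.error.HTTPError before anything is written.
    """
    written = 0
    with opener(request) as response, open(path, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            written += len(chunk)
    return written
```

A call would look like `stream_to_file(build_request(headers, body), "speech.mp3")`, where `headers` carries the AUTHORIZATION and X-USER-ID values and `body` contains at least model, text, and voice. The `opener` parameter exists so the function can be exercised without a network connection.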