POST /api/v1/tts/stream

Play3.0-mini is a lightweight model that generates high-quality audio with a focus on speed.

See the How to use PlayDialog Text-to-Speech API guide for a step-by-step walkthrough of using the PlayDialog API to convert text into natural, human-like audio.

Get your Credentials

To use the HTTP API you will need an API key and a user ID. You can get both from https://play.ai/developers.

Explore our models: Play3.0-mini and PlayDialog

Our API currently supports two models: Play3.0-mini and PlayDialog. Use the model parameter to select the model you want to use. Play3.0-mini is a lightweight model that generates high-quality audio with a focus on speed. PlayDialog is a more advanced model that can generate turn-based dialogues with multiple voices.

For details on the specific properties of each, see the examples below.

Example

The code examples for this endpoint return an audio buffer stream that you can save locally or stream over the network to a browser, app, or telephony system.
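A minimal sketch of building such a request in Python with only the standard library. The host in BASE_URL is an assumption (this reference gives only the path), and the function name and the Content-Type header are illustrative choices rather than documented requirements.

```python
import json
import urllib.request

# Assumption: the host is not stated in this reference; substitute the real base URL.
BASE_URL = "https://api.play.ai"


def build_tts_request(api_key, user_id, text, voice,
                      model="Play3.0-mini", output_format="mp3"):
    """Build (but do not send) a POST request for /api/v1/tts/stream."""
    payload = {
        "model": model,                # required
        "text": text,                  # required
        "voice": voice,                # required: the ID of a PlayAI voice
        "outputFormat": output_format, # optional, mp3 is the documented default
    }
    return urllib.request.Request(
        url=BASE_URL + "/api/v1/tts/stream",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "AUTHORIZATION": f"Bearer {api_key}",
            "X-USER-ID": user_id,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen sends the request; the response body is then the audio stream.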

For the complete list of supported parameters, see below.

Authorizations

AUTHORIZATION
string
header
required

API key required for this endpoint. Use Bearer YOUR_SECRET_API_KEY. Get your key from https://play.ai/developers.

X-USER-ID
string
header
required

User ID required for this endpoint. Get it from https://play.ai/developers.
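One way to assemble the two headers without hard-coding secrets is to read them from the environment. This is only a sketch: the variable names PLAYAI_API_KEY and PLAYAI_USER_ID and the helper auth_headers are hypothetical, not part of the API.

```python
import os


def auth_headers(env=None):
    """Return the two required headers, reading credentials from `env`.

    PLAYAI_API_KEY and PLAYAI_USER_ID are hypothetical variable names
    chosen for this example.
    """
    env = os.environ if env is None else env
    try:
        api_key = env["PLAYAI_API_KEY"]
        user_id = env["PLAYAI_USER_ID"]
    except KeyError as missing:
        raise RuntimeError(f"missing credential: {missing.args[0]}") from None
    return {
        "AUTHORIZATION": f"Bearer {api_key}",
        "X-USER-ID": user_id,
    }
```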

Body

application/json
model
enum<string>
required

The voice engine used to synthesize the voice.

Available options:
Play3.0-mini
text
string
required

The text to be converted to speech. Limited to 20k characters.

voice
string
required

The unique ID for a PlayAI Voice.

language
enum<string> | null
default: english

The language of the voice.

Available options:
afrikaans,
albanian,
amharic,
arabic,
bengali,
bulgarian,
catalan,
croatian,
czech,
danish,
dutch,
english,
french,
galician,
german,
greek,
hebrew,
hindi,
hungarian,
indonesian,
italian,
japanese,
korean,
malay,
mandarin,
polish,
portuguese,
russian,
serbian,
spanish,
swedish,
tagalog,
thai,
turkish,
ukrainian,
urdu,
xhosa
outputFormat
enum<string> | null
default: mp3

The format for the output audio.

Available options:
mp3,
mulaw,
raw,
wav,
ogg,
flac
quality
enum<string>
Available options:
draft,
low,
medium,
high,
premium
sampleRate
number

A number greater than or equal to 8000 and less than or equal to 48000.

Required range: 8000 <= x <= 48000
seed
number | null

An integer greater than or equal to 0. If null or not provided, a random seed is used. Useful for controlling the reproducibility of the generated audio: assuming all other properties are unchanged, a fixed seed should always generate the exact same audio file.

Required range: x >= 0
speed
number

Controls how fast the generated audio should be. A number greater than 0 and less than or equal to 5.0.

Required range: 0.1 <= x <= 5
styleGuidance
number | null

A number between 1 and 30. Use lower numbers to reduce how strong your chosen emotion will be. Higher numbers will create a very emotional performance.

Required range: 1 <= x <= 30
temperature
number | null

A floating-point number between 0 and 2, inclusive. If null or not provided, the model's default temperature is used. The temperature parameter controls variance: lower temperatures give more predictable results, while higher temperatures allow each run to vary more, so the voice may sound less like the baseline voice.

Required range: 0 <= x <= 2
textGuidance
number | null

A number between 1 and 2. This number influences how closely the generated speech adheres to the input text. Lower values create more fluid speech, but with a higher chance of deviating from the input text. Higher values keep the spoken words closer to the provided text.

Required range: 1 <= x <= 2
voiceGuidance
number | null

A number between 1 and 6. Use lower numbers to reduce how unique your chosen voice will be compared to other voices. Higher numbers will maximize its individuality.

Required range: 1 <= x <= 6
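The constraints above can be checked on the client before sending a request. The following is a rough sketch, not an official validator: the bounds are taken from the prose descriptions of each parameter and treated as inclusive, speed uses the 0.1 lower bound from its listed range, "20k characters" is read as 20,000, and validate_body is a hypothetical helper name.

```python
MAX_TEXT_CHARS = 20_000  # assumption: "20k characters" means 20,000

# Inclusive (low, high) bounds read from the parameter descriptions.
# A high bound of None means no upper limit is documented.
BOUNDS = {
    "sampleRate": (8000, 48000),
    "seed": (0, None),
    "speed": (0.1, 5.0),
    "styleGuidance": (1, 30),
    "temperature": (0, 2),
    "textGuidance": (1, 2),
    "voiceGuidance": (1, 6),
}


def validate_body(body):
    """Return a list of problems found in a request body dict (empty if none)."""
    errors = []
    # model, text, and voice are the three required fields.
    for name in ("model", "text", "voice"):
        if not body.get(name):
            errors.append(f"{name} is required")
    text = body.get("text")
    if isinstance(text, str) and len(text) > MAX_TEXT_CHARS:
        errors.append(f"text exceeds {MAX_TEXT_CHARS} characters")
    for name, (low, high) in BOUNDS.items():
        value = body.get(name)
        if value is None:
            continue  # omitted or null: the server-side default applies
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"{name} must be a number")
        elif value < low or (high is not None and value > high):
            errors.append(f"{name} is out of range")
    return errors
```

The server remains the authority on what is accepted; a check like this only catches obvious mistakes before a round trip.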

Response

200 - audio/mpeg

The response is of type file.
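Because the response is streamed, it can be written to disk in chunks rather than buffered in memory. A minimal sketch, assuming the response is any object with a read(size) method (for example the object returned by urllib.request.urlopen); save_audio_stream and the chunk size are illustrative choices.

```python
def save_audio_stream(response, path, chunk_size=8192):
    """Copy a file-like response to `path` in chunks; return bytes written."""
    total = 0
    with open(path, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:  # an empty read signals the end of the stream
                break
            out.write(chunk)
            total += len(chunk)
    return total
```

The same loop can forward each chunk to a socket or player instead of a file when the audio should start playing before generation finishes.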