Speech Synthesis Interface Documentation
Table of Contents
Interface Description
Generates speech from input text. Supports multiple voices and formats, and offers streaming responses suitable for real-time speech synthesis scenarios.
Request
HTTP Request
POST /v1/audio/speech
Request Body
Parameter | Type | Required | Description |
---|---|---|---|
input | string | Yes | The text to generate speech from. Maximum length is 4096 characters |
model | string | Yes | The TTS model to use, e.g.: tts-1, tts-1-hd, or gpt-4o-mini-tts |
voice | string | Yes | The voice to use for speech generation. Supported voices depend on the model |
response_format | string | No | Audio format. Supported formats include mp3, opus, aac, flac, wav, and pcm. Default value depends on the model type |
speed | number | No | The speed of the generated audio. Optional values range from 0.25 to 4.0. Default is 1.0 |
stream | boolean | No | Whether to enable streaming response. Default is true |
sample_rate | integer | No | Audio sampling rate. Default is automatically selected based on model and format |
user | string | No | A unique identifier representing the end user |
Voice Options
Here are the available voice options and their characteristics:
- alloy: Neutral, balanced voice
- ash: Young, clear voice
- ballad: Soft, calm voice
- coral: Warm, friendly voice
- echo: Deep, powerful voice
- fable: Authoritative, confident voice
- onyx: Deep, solemn voice
- nova: Lively, enthusiastic voice
- sage: Calm, steady voice
- shimmer: Bright, cheerful voice
- verse: Lyrical, expressive voice
Response Format Options
Here are the supported audio formats and their characteristics:
- mp3: High compression ratio, suitable for network transmission, default option
- opus: Low latency, suitable for real-time applications
- aac: High quality, suitable for music
- flac: Lossless compression, suitable for high-quality requirements
- wav: Uncompressed, suitable for high-quality requirements
- pcm: Raw audio data
Response
The interface returns audio file content.
- When
stream=false
, the complete audio file is returned - When
stream=true
, audio data is returned as a stream, allowing clients to play while receiving
The Content-Type of the response is set according to the response_format
parameter in the request:
response_format | Content-Type |
---|---|
mp3 | audio/mpeg |
opus | audio/opus |
aac | audio/aac |
flac | audio/flac |
wav | audio/wav |
pcm | audio/pcm |
Error Codes
Error Code | Description |
---|---|
400 | Request parameter error, such as text too long or incorrect parameter format |
401 | Authentication failed, invalid API key |
403 | Insufficient permissions, API key doesn't have permission to access the requested resource |
404 | Requested resource doesn't exist, such as the specified model doesn't exist |
429 | Too many requests, exceeded rate limit |
500 | Internal server error |
503 | Service temporarily unavailable |
Examples
Request Examples
Basic Request
{
"input": "The weather today is really nice, sunny, making people feel happy.",
"model": "tts-1",
"voice": "alloy",
"response_format": "mp3",
"speed": 1.0
}
Streaming Response Request
{
"input": "The weather today is really nice, sunny, making people feel happy.",
"model": "tts-1",
"voice": "nova",
"response_format": "mp3",
"speed": 1.0,
"stream": true
}
High-Quality Speech Request
{
"input": "The weather today is really nice, sunny, making people feel happy.",
"model": "tts-1-hd",
"voice": "shimmer",
"response_format": "wav",
"speed": 0.9,
"sample_rate": 24000
}
Streaming Response
When stream=true
, the server returns audio data as a stream. Clients can play while receiving, suitable for real-time speech synthesis scenarios such as online customer service, navigation systems, etc.
Advantages of streaming responses:
- Low latency: Users don't need to wait for the entire audio generation to complete; they can immediately hear the beginning
- Real-time experience: Suitable for interactive scenarios requiring immediate feedback
- Resource efficiency: Clients can process audio while receiving, reducing memory usage
Implementation Examples
JavaScript Client Example
async function streamSpeech() {
const response = await fetch('https://api.example.com/v1/audio/speech', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'Authorization': 'Bearer YOUR_API_KEY'
},
body: JSON.stringify({
input: "The weather today is really nice, sunny, making people feel happy.",
model: "tts-1",
voice: "nova",
response_format: "mp3",
stream: true
})
});
if (!response.ok) {
throw new Error(`HTTP error! status: ${response.status}`);
}
// Create audio context
const audioContext = new (window.AudioContext || window.webkitAudioContext)();
const reader = response.body.getReader();
// Process audio stream
const processStream = async () => {
const { done, value } = await reader.read();
if (done) return;
// Decode and play audio data
audioContext.decodeAudioData(value.buffer, (buffer) => {
const source = audioContext.createBufferSource();
source.buffer = buffer;
source.connect(audioContext.destination);
source.start(0);
});
// Continue processing the stream
processStream();
};
processStream();
}
Python Client Example
import requests
import pyaudio
import io
from pydub import AudioSegment
def stream_speech():
url = "https://api.example.com/v1/audio/speech"
headers = {
"Content-Type": "application/json",
"Authorization": "Bearer YOUR_API_KEY"
}
data = {
"input": "The weather today is really nice, sunny, making people feel happy.",
"model": "tts-1",
"voice": "nova",
"response_format": "mp3",
"stream": True
}
# Send request and get streaming response
response = requests.post(url, json=data, headers=headers, stream=True)
if response.status_code != 200:
raise Exception(f"Error: {response.status_code}")
# Initialize PyAudio
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16,
channels=1,
rate=24000,
output=True)
# Process audio stream
buffer = io.BytesIO()
for chunk in response.iter_content(chunk_size=4096):
if chunk:
buffer.write(chunk)
# Play when enough data accumulates
if buffer.tell() > 8192:
buffer.seek(0)
audio = AudioSegment.from_mp3(buffer)
stream.write(audio.raw_data)
buffer = io.BytesIO()
# Play remaining data
if buffer.tell() > 0:
buffer.seek(0)
audio = AudioSegment.from_mp3(buffer)
stream.write(audio.raw_data)
# Close stream
stream.stop_stream()
stream.close()
p.terminate()
Best Practices
- Choose appropriate voices: Select suitable voices based on application scenarios; for example, customer service systems can choose warm, friendly voices (coral), while navigation systems can choose clear, authoritative voices (fable)
- Control text length: Longer texts may increase generation time; it is recommended to process long texts in segments
- Adjust speech speed: Adjust speed according to the application scenario; for example, notifications can be slightly faster, explanations can be slightly slower
- Choose appropriate audio formats: For network transmission, mp3 and opus formats are more compact; for high-quality requirements, wav or flac can be chosen
- Utilize streaming responses: In scenarios requiring real-time feedback, enabling streaming responses can significantly enhance user experience