Bella OpenAPI Real-time Voice Dialogue Interface Documentation
Table of Contents
- 1. Overview
- 2. Interface Specifications
- 3. Implementation Guide
- 4. Error Handling
- 5. Best Practices
- 6. Example Code
- 7. Frequently Asked Questions
1. Overview
The Real-time Voice Dialogue Interface provides an all-in-one solution for speech recognition, large language model conversation, and speech synthesis. It supports users inputting speech through a microphone in real-time, with the system automatically recognizing speech content, calling a large language model to generate responses, and converting the response content into speech output. The entire process is implemented through WebSocket protocol, featuring low latency and high real-time performance.
1.1 Features
- Real-time Speech Recognition: Supports real-time speech input from users and converts speech to text
- Streaming Large Language Model Dialogue: Sends recognized text to a large language model and receives streaming responses
- Speech Synthesis: Converts large language model responses into speech, providing a natural voice dialogue experience
- Full-duplex Communication: Based on WebSocket protocol, supports full-duplex communication for true real-time dialogue
1.2 Application Scenarios
- Intelligent voice assistants
- Customer service chatbots
- Voice interaction systems
- Accessibility applications
- In-vehicle voice systems
2. Interface Specifications
2.1 Basic Information
- Interface Path: /v1/audio/realtime or /v1/audio/asr/stream (the latter returns only transcription events)
- Protocol: WebSocket
- Request Method: GET
- Authentication Method: Bearer Token (passed through the Authorization request header)
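As an illustration, a minimal connection sketch in TypeScript is shown below. It assumes a Node.js environment with the ws package, because browser WebSocket objects cannot attach custom request headers; the host name and the BELLA_API_KEY environment variable are placeholders for your own deployment.

import WebSocket from "ws";

// Placeholder endpoint and credential; substitute your own deployment's values.
const ENDPOINT = "wss://your-bella-openapi-host/v1/audio/realtime";
const API_KEY = process.env.BELLA_API_KEY ?? "";

// The Bearer token is passed through the Authorization request header.
const ws = new WebSocket(ENDPOINT, {
  headers: { Authorization: `Bearer ${API_KEY}` },
});

ws.on("open", () => console.log("WebSocket connection established"));
ws.on("error", (err) => console.error("WebSocket error:", err));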
2.2 Request Parameters
2.2.1 URL Parameters
Parameter | Type | Required | Description |
---|---|---|---|
model | string | No | Specifies the model to use; if not specified, the default model for the capability will be used |
2.2.2 WebSocket Startup Messages
After establishing a connection, the client needs to send a startup message in JSON format. For /v1/audio/realtime, the startup message is shown below. For /v1/audio/asr/stream, the llm_option and tts_option parameters are not required.
StartTranscription Message
{
"header": {
"message_id": "uuid-string",
"task_id": "uuid-string",
"namespace": "SpeechTranscriber",
"name": "StartTranscription",
"appkey": "default"
},
"payload": {
"format": "pcm",
"sample_rate": 16000,
"enable_intermediate_result": true,
"enable_punctuation_prediction": true,
"enable_inverse_text_normalization": true,
"llm_option": {
"main": {
"model": "qwen2.5-coder:3b",
"sys_prompt": "You are a versatile voice assistant. Your responses will be converted to audio for the user, so please respond as concisely as possible, and end your first sentence quickly to better facilitate streaming speech synthesis."
}
},
"tts_option": {
"model": "chat-tts",
"sample_rate": 24000
}
}
}
Parameter Description:
- header: Message header information
  - message_id: Unique message identifier, UUID recommended
  - task_id: Unique task identifier, UUID recommended
  - namespace: Fixed as "SpeechTranscriber"
  - name: Fixed as "StartTranscription"
  - appkey: Fixed as "default"
- payload: Task configuration parameters
  - format: Audio format, fixed as "pcm"
  - sample_rate: Audio sampling rate, 16000 recommended
  - enable_intermediate_result: Whether to enable intermediate results, true recommended
  - enable_punctuation_prediction: Whether to enable punctuation prediction, true recommended
  - enable_inverse_text_normalization: Whether to enable inverse text normalization, true recommended
  - llm_option: Large language model options (optional)
    - main: Main model configuration
      - model: Model name, e.g., "qwen2.5-coder:3b"
      - sys_prompt: System prompt, used to set the model's behavior and role; may use Python Jinja2 templates
      - prompt: User prompt; a Python Jinja2 template can be provided to rewrite the user message sent to the large model using the conversation context
      - temperature: Temperature parameter that controls output randomness; higher values increase randomness, range [0.0, 2.0], default 1.0
    - workers: Worker model configuration list (optional); each worker model includes the following properties
      - model: Model name
      - blocking: Whether to block, default is false
      - variable_name: Result variable name
      - variable_type: Variable type
      - sys_prompt: System prompt, can use Python Jinja2 templates
      - prompt: User prompt, can use Python Jinja2 templates
      - temperature: Temperature parameter, default is 1.0
      - json_schema: JSON schema definition
  - tts_option: Speech synthesis options (optional)
    - model: Speech synthesis model, e.g., "chat-tts"
    - sample_rate: Synthesis audio sampling rate, 24000 recommended
    - voice: Synthesis voice (optional); available voices depend on the selected model
  - variables: Variable configuration map (optional), used with the prompt templates
Configuration Example
Below is a complete configuration example including main and worker models, showing how to use Jinja2 templates and variables for message rewriting:
{
"llm_option": {
"main": {
"model": "qwen2.5-coder:3b",
"sys_prompt": "You are a versatile voice assistant. Your responses will be converted to audio for the user, so please respond as concisely as possible, and end your first sentence quickly to better facilitate streaming speech synthesis."
},
"workers": [
{
"model": "qwen2.5-coder:3b",
"blocking": true,
"variable_name": "rewrite_user_message",
"sys_prompt": "You are a user message rewrite expert, responsible for rewriting user messages transcribed from voice to make them more fluent and coherent, while also completing necessary reference disambiguation. Don't explain, just rewrite. If no rewriting is needed, output the user's original content directly.",
"prompt": "## User Historical Dialogue\n<history_messages>\n{% for message in history_messages %}\n{% if message.role == \"user\" %}\n <user timestamp=\"{{loop.index}}\">{{{ message.content }}}</user>\n{% elif message.role == \"assistant\" %}\n <agent timestamp=\"{{loop.index}}\">{{{ message.content }}}</agent>\n{% endif %}\n{% endfor %}\n\n</history_messages>\n\n## User Latest Message\n<user_message>\n {{user_message}}\n</user_message>\n\nPlease rewrite the message now!\n"
}
]
},
"tts_option": {
"model": "chat-tts",
"voice": "zh_female",
"sample_rate": 24000
}
}
This configuration example shows:
- Main model configuration: uses the qwen2.5-coder:3b model for dialogue with the user
- Worker model configuration: uses the same model as a worker, responsible for rewriting user messages
- The worker model uses Jinja2 templates to process the historical messages and the current user message
- TTS configuration: specifies the model, voice, and sampling rate
StopTranscription Message
To end a transcription task, the client sends a StopTranscription message. After the server finishes processing any pending messages, it returns a SessionClose message.
{
"header": {
"message_id": "uuid-string",
"task_id": "uuid-string",
"namespace": "SpeechTranscriber",
"name": "StopTranscription",
"appkey": "default"
},
"payload": {}
}
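For illustration, the two control messages can be wrapped in small helpers like the TypeScript sketch below. The helper names are hypothetical; the message bodies follow the JSON shown above, and ws is the connection opened in the sketch under section 2.1.

import WebSocket from "ws";
import { randomUUID } from "crypto";

// task_id identifies the whole session; message_id must be unique per message.
const taskId = randomUUID();

function buildHeader(name: string) {
  return {
    message_id: randomUUID(),
    task_id: taskId,
    namespace: "SpeechTranscriber",
    name,
    appkey: "default",
  };
}

// Sends StartTranscription; for /v1/audio/asr/stream the llm_option and
// tts_option fields can simply be omitted from the payload.
function startTranscription(ws: WebSocket, payload: object): void {
  ws.send(JSON.stringify({ header: buildHeader("StartTranscription"), payload }));
}

// Sends StopTranscription; the server answers with SESSION_CLOSE once
// all pending messages have been processed.
function stopTranscription(ws: WebSocket): void {
  ws.send(JSON.stringify({ header: buildHeader("StopTranscription"), payload: {} }));
}

A start call might then look like startTranscription(ws, { format: "pcm", sample_rate: 16000, enable_intermediate_result: true, enable_punctuation_prediction: true, enable_inverse_text_normalization: true }).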
2.2.3 Audio Data
After sending the startup message, the client needs to continuously send audio data in binary format. Audio data requirements:
- Format: PCM (Pulse Code Modulation)
- Sampling rate: 16000Hz
- Bit depth: 16-bit
- Channels: Single channel (mono)
- Frame size: 3200 bytes recommended (corresponding to 100ms of audio)
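The recommended frame size follows directly from the format: 16000 samples/s × 2 bytes/sample × 0.1 s = 3200 bytes. The TypeScript sketch below shows one way to convert Float32 microphone samples to 16-bit PCM and emit fixed-size binary frames; the function names are illustrative, not part of the interface.

// Converts Float32 samples in [-1, 1] to 16-bit little-endian PCM.
function floatTo16BitPCM(samples: Float32Array): ArrayBuffer {
  const buffer = new ArrayBuffer(samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}

// Buffers PCM bytes and sends 3200-byte frames (100 ms at 16 kHz, 16-bit, mono).
const FRAME_SIZE = 3200;
let pending = new Uint8Array(0);

function sendAudio(ws: { send(data: Uint8Array): void }, pcmChunk: ArrayBuffer): void {
  const merged = new Uint8Array(pending.length + pcmChunk.byteLength);
  merged.set(pending);
  merged.set(new Uint8Array(pcmChunk), pending.length);

  let offset = 0;
  while (merged.length - offset >= FRAME_SIZE) {
    ws.send(merged.subarray(offset, offset + FRAME_SIZE)); // binary WebSocket frame
    offset += FRAME_SIZE;
  }
  pending = merged.slice(offset); // keep the remainder for the next call
}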
2.3 Response Messages
The server will send various types of text messages and binary data through the WebSocket connection. Text messages are in JSON format and mainly include the following types:
2.3.1 Transcription-related Messages
- TranscriptionStarted: Transcription task started
- SentenceBegin: Speech beginning detected
- TranscriptionResultChanged: Transcription result updated
- SentenceEnd: Speech sentence ended, includes final transcription result
- TranscriptionCompleted: Transcription task completed
- TranscriptionFailed: Transcription task failed
Example:
{
"header": {
"message_id": "uuid-string",
"task_id": "uuid-string",
"namespace": "SpeechTranscriber",
"name": "TranscriptionResultChanged",
"status": 20000000
},
"payload": {
"result": "Hello, I'd like to know what today's weather is like"
}
}
2.3.2 LLM-related Messages
- LLM_CHAT_BEGIN: Large model starts generating response
- LLM_CHAT_DELTA: Large model response content update
- LLM_CHAT_END: Large model response ended
- LLM_CHAT_CANCELLED: Large model response cancelled
Example:
{
"header": {
"message_id": "uuid-string",
"task_id": "uuid-string",
"namespace": "SpeechTranscriber",
"name": "LLM_CHAT_DELTA",
"status": 20000000
},
"payload": {
"data": "Hello! Today in Beijing, the weather is clear, with temperatures ranging from 22°C to 28°C, good air quality, and it's a great day for outdoor activities."
}
}
2.3.3 Speech Synthesis-related Messages
- TTS_BEGIN: Speech synthesis started
- TTS_TTFT: First audio data packet is ready (time to first token)
- TTS_DELTA: Speech synthesis incremental data
- TTS_END: Speech synthesis ended
Speech synthesis audio data is sent through binary messages, in PCM format, with a sampling rate of 24000Hz, 16-bit, single channel.
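As a simplified browser-side sketch (the project's PCMPlayer.ts is the reference implementation and handles buffering more carefully), each binary TTS frame can be decoded into a Web Audio buffer and scheduled back-to-back:

const audioCtx = new AudioContext();
let playCursor = 0; // when the next buffer should start, in AudioContext time

// Plays one binary TTS frame (PCM, 24 kHz, 16-bit, mono).
function playPcmChunk(chunk: ArrayBuffer, sampleRate = 24000): void {
  const int16 = new Int16Array(chunk);
  const float32 = new Float32Array(int16.length);
  for (let i = 0; i < int16.length; i++) {
    float32[i] = int16[i] / 0x8000; // scale to [-1, 1)
  }

  const buffer = audioCtx.createBuffer(1, float32.length, sampleRate);
  buffer.copyToChannel(float32, 0);

  const source = audioCtx.createBufferSource();
  source.buffer = buffer;
  source.connect(audioCtx.destination);

  // Schedule chunks back-to-back so playback stays continuous.
  playCursor = Math.max(playCursor, audioCtx.currentTime);
  source.start(playCursor);
  playCursor += buffer.duration;
}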
2.3.4 VAD-related Messages (Voice Activity Detection)
- VOICE_QUIET: Silence detected
- VOICE_STARTING: Start of speech detected
- VOICE_SPEAKING: Speaking detected
- VOICE_STOPPING: Stop speaking detected
- VOICE_PAUSING: Speech pause detected
Example:
{
"header": {
"message_id": "uuid-string",
"task_id": "uuid-string",
"namespace": "SpeechTranscriber",
"name": "VOICE_SPEAKING",
"status": 20000000,
"status_message": "GATEWAY|SUCCESS|Success."
},
"payload": {
"start_frame": 11,
"end_frame": 29,
"duration": 576,
"data": null,
"timestamp": 1744612674.163157,
"latency": -1
}
}
2.3.5 Session Control and Error Messages
- SESSION_CLOSE: Session closed. After the client sends a StopTranscription message, the server returns this message once it has finished processing any pending messages.
- TASK_FAILED: Task failed
- LLM_CHAT_ERROR: LLM chat error
- LLM_WORKER_ERROR: LLM subtask error
Example:
{
"header": {
"message_id": "uuid-string",
"task_id": "uuid-string",
"namespace": "SpeechTranscriber",
"name": "SESSION_CLOSE",
"status": 20000000
},
"payload": {}
}
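Bringing the message types together, a client-side handler might dispatch on header.name roughly as follows. This is a sketch using the browser WebSocket API with binaryType set to "arraybuffer"; the onTtsAudio callback and the logging are placeholders for application logic, and only fields shown in the examples above (payload.result, payload.data) are accessed by name.

function attachHandlers(ws: WebSocket, onTtsAudio: (chunk: ArrayBuffer) => void): void {
  ws.binaryType = "arraybuffer";

  ws.onmessage = (event: MessageEvent) => {
    // Binary frames carry TTS audio (PCM, 24 kHz, 16-bit, mono).
    if (event.data instanceof ArrayBuffer) {
      onTtsAudio(event.data);
      return;
    }

    const msg = JSON.parse(event.data as string);
    switch (msg.header.name) {
      case "TranscriptionResultChanged":
        console.log("Partial transcript:", msg.payload.result);
        break;
      case "SentenceEnd":
        console.log("Sentence ended:", msg.payload); // contains the final transcription result
        break;
      case "LLM_CHAT_DELTA":
        console.log("LLM delta:", msg.payload.data);
        break;
      case "VOICE_SPEAKING":
      case "VOICE_QUIET":
        console.log("VAD state:", msg.header.name); // drive recording UI state
        break;
      case "SESSION_CLOSE":
        ws.close();
        break;
      case "TASK_FAILED":
      case "LLM_CHAT_ERROR":
        console.error("Server reported an error:", msg);
        break;
    }
  };
}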
3. Implementation Guide
3.1 Java Implementation Example
For Java implementation of the real-time voice interface, refer to the RealtimeDemo.java class in the project.
This example demonstrates how to use the OkHttp library to implement the WebSocket connection and covers the following functionality:
- Establish a WebSocket connection and send a StartTranscription message
- Send audio data (PCM format)
- Process various event types (transcription, VAD, LLM, TTS, etc.)
- Receive and process TTS audio data
- Send a StopTranscription message to end the session
The example includes complete error handling and resource release logic, which can serve as an implementation reference.
3.2 JavaScript Implementation Example
For JavaScript implementation of the real-time voice interface, refer to the RealtimeAudioRecorder.ts class in the project.
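RealtimeAudioRecorder.ts remains the authoritative implementation. For orientation only, a simplified browser capture sketch could look like the following; it uses the deprecated but widely supported ScriptProcessorNode (an AudioWorklet, as suggested in section 5.2, is preferable in production), requests a 16 kHz AudioContext which not every browser honors, and delegates framing to a caller-supplied onSamples callback (for example the helpers sketched in section 2.2.3).

async function startRecording(onSamples: (samples: Float32Array) => void): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

  // Request 16 kHz so the captured samples match the interface's expected rate.
  const ctx = new AudioContext({ sampleRate: 16000 });
  const source = ctx.createMediaStreamSource(stream);

  // 2048 samples per callback ≈ 128 ms of audio at 16 kHz.
  const processor = ctx.createScriptProcessor(2048, 1, 1);
  processor.onaudioprocess = (e) => {
    onSamples(e.inputBuffer.getChannelData(0));
  };

  source.connect(processor);
  processor.connect(ctx.destination);
}

A typical call would forward each block to the framing helpers, e.g. startRecording((samples) => sendAudio(ws, floatTo16BitPCM(samples))).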
3.3 Available Speech Synthesis Voices
Different models have different available voices. For details, please visit the bella-openapi homepage.
4. Error Handling
4.1 Common Error Codes
Error Code | Description | Solution |
---|---|---|
401 | Unauthorized | Check if the API Key is correct |
403 | Forbidden | Check API Key permissions |
429 | Too Many Requests | Reduce request frequency |
500 | Internal Server Error | Contact technical support |
4.2 WebSocket Connection Errors
Error Type | Possible Cause | Solution |
---|---|---|
Connection Failed | Network issues or service unavailable | Check network connection, try again later |
Connection Closed | Server closed connection or network interruption | Implement automatic reconnection mechanism |
Message Parsing Error | Invalid JSON format | Check message format |
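One common way to implement the automatic reconnection suggested above is exponential backoff. The sketch below shows only the retry logic; re-authentication and re-sending StartTranscription after a successful reconnect are left to the application.

const MAX_RETRIES = 5;
let retries = 0;

function connectWithRetry(url: string): void {
  const ws = new WebSocket(url);

  ws.onopen = () => {
    retries = 0; // reset the backoff after a successful connection
    // Re-send StartTranscription here to resume the dialogue session.
  };

  ws.onclose = () => {
    if (retries >= MAX_RETRIES) {
      console.error("Giving up after repeated connection failures");
      return;
    }
    const delay = Math.min(1000 * 2 ** retries, 30_000); // 1s, 2s, 4s, ... capped at 30s
    retries += 1;
    setTimeout(() => connectWithRetry(url), delay);
  };
}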
5. Best Practices
5.1 Audio Quality Optimization
- Use high-quality microphones, reduce environmental noise
- Maintain appropriate volume and distance
- Use noise reduction and echo cancellation technology
- Ensure audio sampling rate and format meet requirements
5.2 Performance Optimization
- Implement connection recovery mechanism
- Optimize audio data processing and sending logic
- Use WebWorker to process audio data, avoid blocking the main thread
- Implement buffering mechanism to smooth network fluctuations
5.3 User Experience Optimization
- Provide clear recording status indication
- Display real-time transcription results, provide visual feedback
- Implement voice activity detection, automatically start and end recording
- Provide volume visualization to help users adjust volume
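For the volume feedback mentioned above, a simple approach is to compute the RMS level of each captured audio block and map it to a meter in the UI (a sketch; the mapping to the UI is application-specific):

// Returns a 0..1 loudness estimate for one block of Float32 samples.
function rmsLevel(samples: Float32Array): number {
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  return Math.sqrt(sumSquares / samples.length);
}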
6. Example Code
For complete frontend implementation examples, refer to the following files in the Bella OpenAPI project:
- web/src/app/playground/v1/audio/realtime/page.tsx: Real-time voice dialogue page
- web/src/components/playground/RealtimeAudioRecorder.ts: Real-time voice recorder
- web/src/components/playground/PCMPlayer.ts: PCM audio player
7. Frequently Asked Questions
Q1: Why is my speech recognition result inaccurate?
A1: This may be due to microphone quality, environmental noise, or network issues. It is recommended to use a high-quality microphone, use it in a quiet environment, and ensure stable network connection.
Q2: How to handle compatibility issues with different browsers?
A2: Use the WebRTC adapter library to handle compatibility issues across different browsers, and provide friendly prompts on unsupported browsers.
Q3: How to optimize performance on mobile devices?
A3: Reduce audio buffer size, optimize audio processing logic, use lower sampling rates, and implement battery optimization strategies.
Q4: How to handle unstable network conditions?
A4: Implement connection recovery mechanism, use audio buffering strategies, and resynchronize dialogue state after network recovery.
Q5: Which languages does the system support?
A5: Supported languages depend on the selected model. Please refer to the model documentation for the specific list of supported languages.