
Single-sentence Voice Recognition Interface Documentation

Interface Description

The Single-sentence Voice Recognition (Flash ASR) interface quickly converts short audio clips to text. It is suited to short-speech scenarios such as voice commands and short voice messages.

Request

HTTP Request

POST /v1/audio/asr/flash

Request Headers

| Parameter | Type | Required | Default Value | Description |
| --- | --- | --- | --- | --- |
| format | string | No | wav | Audio format, supports wav, mp3, pcm and other common formats |
| sample_rate | integer | No | 16000 | Audio sampling rate, in Hz |
| max_sentence_silence | integer | No | 3000 | Maximum silence duration between sentences, in milliseconds |
| model | string | No | - | ASR model to use, uses the default model if not specified |
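
The header parameters above can be assembled with a small helper. The following is a minimal sketch, not part of the official interface: the function name build_flash_headers and its client-side validation are assumptions made for illustration, while the defaults mirror the table and the Bearer scheme follows the request examples in this document.

```python
from typing import Dict, Optional


def build_flash_headers(
    api_key: str,
    audio_format: str = "wav",
    sample_rate: int = 16000,
    max_sentence_silence: int = 3000,
    model: Optional[str] = None,
) -> Dict[str, str]:
    """Build Flash ASR request headers (hypothetical helper).

    Defaults match the documented defaults; header values are sent as strings.
    """
    # Client-side sanity checks (an assumption; the server may validate differently).
    if sample_rate <= 0:
        raise ValueError("sample_rate must be a positive number of Hz")
    if max_sentence_silence < 0:
        raise ValueError("max_sentence_silence must be a non-negative number of ms")

    headers = {
        "Authorization": f"Bearer {api_key}",
        "format": audio_format,
        "sample_rate": str(sample_rate),
        "max_sentence_silence": str(max_sentence_silence),
    }
    # Omit the model header entirely so the service falls back to its default model.
    if model is not None:
        headers["model"] = model
    return headers
```

Calling build_flash_headers("YOUR_API_KEY") yields only the documented defaults; pass model="asr-model-1" to select a specific model.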

Request Body

The request body is the raw binary audio data: send the audio file's content directly as the body.

Supported audio formats:

  • WAV
  • MP3
  • PCM
  • Other common audio formats (specific support depends on the selected model)
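
To illustrate what "sent directly as the request body" means, the sketch below builds (but does not send) a request with Python's standard library. The helper name build_flash_request is hypothetical, and the host api.example.com is the placeholder used elsewhere in this document.

```python
import urllib.request


def build_flash_request(url: str, api_key: str, audio_bytes: bytes,
                        audio_format: str = "wav") -> urllib.request.Request:
    """Create a POST request whose body is the raw audio bytes (hypothetical helper).

    The audio is not wrapped in a form or JSON envelope; the bytes themselves
    are the body.
    """
    return urllib.request.Request(
        url,
        data=audio_bytes,
        headers={
            "Authorization": f"Bearer {api_key}",
            "format": audio_format,
        },
        method="POST",
    )
```

The request would be sent with urllib.request.urlopen(request); that step is left out here so the sketch runs without network access.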

Response

{
  "task_id": "asr-task-123456",
  "user": "user-123",
  "flash_result": {
    "duration": 5600,
    "sentences": [
      {
        "text": "今天天气真不错",
        "begin_time": 0,
        "end_time": 2500
      },
      {
        "text": "我很开心",
        "begin_time": 3000,
        "end_time": 5600
      }
    ]
  }
}

Response Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| task_id | string | Task ID, can be used to track the recognition task |
| user | string | User identifier |
| flash_result | object | Recognition result object |

FlashResult Object

| Parameter | Type | Description |
| --- | --- | --- |
| duration | integer | Total audio duration, in milliseconds |
| sentences | array | Array of recognized sentences |

Sentence Object

| Parameter | Type | Description |
| --- | --- | --- |
| text | string | Recognized text content |
| begin_time | integer | Sentence start time, in milliseconds |
| end_time | integer | Sentence end time, in milliseconds |
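
As an illustration of how the response fields fit together, the sketch below parses a response into typed objects, joins the sentence texts, and measures the pause between consecutive sentences. The class and function names are hypothetical; only the field names come from the tables above, and the sample payload is the one shown in this document.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sentence:
    text: str
    begin_time: int  # milliseconds
    end_time: int    # milliseconds


def parse_sentences(payload: dict) -> List[Sentence]:
    """Extract the recognized sentences from a Flash ASR response body."""
    result = payload["flash_result"]
    return [
        Sentence(s["text"], s["begin_time"], s["end_time"])
        for s in result["sentences"]
    ]


def full_transcript(sentences: List[Sentence], sep: str = "") -> str:
    """Join sentence texts; an empty separator suits Chinese, a space suits English."""
    return sep.join(s.text for s in sentences)


def pauses_ms(sentences: List[Sentence]) -> List[int]:
    """Silence between each sentence and the next, in milliseconds."""
    return [nxt.begin_time - cur.end_time
            for cur, nxt in zip(sentences, sentences[1:])]


example = {
    "task_id": "asr-task-123456",
    "user": "user-123",
    "flash_result": {
        "duration": 5600,
        "sentences": [
            {"text": "今天天气真不错", "begin_time": 0, "end_time": 2500},
            {"text": "我很开心", "begin_time": 3000, "end_time": 5600},
        ],
    },
}

sentences = parse_sentences(example)
print(full_transcript(sentences))  # 今天天气真不错我很开心
print(pauses_ms(sentences))        # [500]
```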

Error Codes

| Error Code | Description |
| --- | --- |
| 400 | Request parameter error, such as unsupported audio format or incorrect parameter format |
| 401 | Authentication failed, invalid API key |
| 403 | Insufficient permissions, API key doesn't have permission to access the requested resource |
| 404 | Requested resource does not exist, for example when the specified model does not exist |
| 413 | Request entity too large, audio file exceeds size limit |
| 415 | Unsupported media type, audio format not supported |
| 429 | Too many requests, exceeded rate limit |
| 500 | Internal server error |
| 503 | Service temporarily unavailable |
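
A client typically needs to decide which of these errors are worth retrying. The sketch below is one possible policy, not documented service behavior: treating only 429 and 503 as retryable, and the exponential backoff schedule, are assumptions, and all names are hypothetical.

```python
from typing import List

# Short descriptions taken from the error code table.
ERROR_DESCRIPTIONS = {
    400: "Request parameter error",
    401: "Authentication failed, invalid API key",
    403: "Insufficient permissions",
    404: "Requested resource does not exist",
    413: "Request entity too large, audio file exceeds size limit",
    415: "Unsupported media type",
    429: "Too many requests, exceeded rate limit",
    500: "Internal server error",
    503: "Service temporarily unavailable",
}

# Assumption: rate limiting and temporary unavailability may succeed on a later
# attempt, while the other errors need a change to the request or credentials.
RETRYABLE_STATUS = frozenset({429, 503})


def describe_error(status_code: int) -> str:
    """Return the documented description, or a generic fallback."""
    return ERROR_DESCRIPTIONS.get(status_code, "Unknown error")


def is_retryable(status_code: int) -> bool:
    """True if retrying the same request later might succeed."""
    return status_code in RETRYABLE_STATUS


def backoff_delays(attempts: int, base_seconds: float = 1.0) -> List[float]:
    """Exponential backoff schedule: base, 2*base, 4*base, ..."""
    return [base_seconds * (2 ** i) for i in range(attempts)]
```

A caller would sleep for each delay in backoff_delays(3) between attempts while is_retryable(status) holds, and otherwise report describe_error(status) immediately.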

Examples

Request Example

Using curl to send a request:

curl -X POST "https://api.example.com/v1/audio/asr/flash" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "format: wav" \
  -H "sample_rate: 16000" \
  -H "max_sentence_silence: 2000" \
  -H "model: asr-model-1" \
  --data-binary @audio_file.wav

Using Python to send a request:

import requests

url = "https://api.example.com/v1/audio/asr/flash"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "format": "wav",
    "sample_rate": "16000",
    "max_sentence_silence": "2000",
    "model": "asr-model-1",
}

# Read the audio file as raw bytes; they are sent directly as the request body.
with open("audio_file.wav", "rb") as audio_file:
    audio_data = audio_file.read()

response = requests.post(url, headers=headers, data=audio_data)
response.raise_for_status()  # raise an exception on 4xx/5xx responses
print(response.json())

Response Example

{
  "task_id": "asr-task-123456",
  "user": "user-123",
  "flash_result": {
    "duration": 5600,
    "sentences": [
      {
        "text": "今天天气真不错",
        "begin_time": 0,
        "end_time": 2500
      },
      {
        "text": "我很开心",
        "begin_time": 3000,
        "end_time": 5600
      }
    ]
  }
}

Best Practices

  1. Audio Quality:

    • Ensure audio is clear with minimal background noise
    • Use an appropriate sampling rate (typically 16 kHz or higher)
    • Avoid audio distortion or excessive compression
  2. Audio Length:

    • This interface is suitable for short audio (typically not exceeding 1 minute)
    • For longer audio, it's recommended to use the file transcription interface
  3. Silence Handling:

    • Adjust the max_sentence_silence parameter according to actual needs
    • Smaller values split sentences at shorter pauses, so sentence boundaries are detected more readily
    • Larger values suit slower speech or speech with long natural pauses
  4. Format Selection:

    • WAV format typically provides the best recognition results
    • For network transmission, a compressed format such as MP3 reduces upload size but may slightly reduce recognition accuracy
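
The Audio Length guidance above can be checked before uploading. The sketch below measures a WAV file's duration with Python's standard wave module. The 60-second limit is an assumption derived from "typically not exceeding 1 minute", not a documented hard limit, and the function names are hypothetical.

```python
import wave

# Assumption: taken from the "typically not exceeding 1 minute" guidance.
MAX_FLASH_SECONDS = 60.0


def wav_duration_seconds(source) -> float:
    """Duration of a WAV file; source is a file name or a binary file-like object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def suits_flash_interface(source, limit_seconds: float = MAX_FLASH_SECONDS) -> bool:
    """True if the audio is short enough for the Flash ASR interface."""
    return wav_duration_seconds(source) <= limit_seconds
```

If suits_flash_interface("audio_file.wav") returns False, the file transcription interface is the better fit. This check only works for WAV input; MP3 or raw PCM would need a different way to measure duration.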