API : Voice : Speech to text
Introduction
This request transcribes an audio file to text.
Request
| URL | https://api.telecomx.dk/voice/stt | | |
|---|---|---|---|
| Method | POST - multipart form data | | |
| Access level | Any authenticated user. | | |
| Body | engine | String | [optional] Which speech engine to use: FREE or ELEVEN. Usage limits apply to non-free engines. Defaults to FREE. |
| | file | Binary | The audio file to convert to text. |
| | language | String | [optional] Language of the spoken audio as an ISO 639-1 or ISO 639-3 code, e.g. en, eng, da, or dan. Helps language detection; the FREE engine works best when it is set. |
| | tag_audio_events | Boolean | [optional] True to tag audio events such as laughter and footsteps. Defaults to true. Only applies to ELEVEN. |
| | timestamps_granularity | String | [optional] Timestamp precision: word, character or none. Defaults to word. Only applies to ELEVEN. |
| | diarize | Boolean | [optional] True to annotate which speaker is speaking. Defaults to false. Only applies to ELEVEN. |
Request body example
{
"engine": "ELEVEN",
"file": <BINARY BLOB>,
"language": "da",
"tag_audio_events": false,
"timestamps_granularity": "word",
"diarize": false
}
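Because the endpoint expects multipart form data rather than JSON, the example above is best read as a list of form fields. The following is a minimal Python sketch of how such a request could be assembled with the standard library only. The helper name build_stt_request is hypothetical, and two details are assumptions that this page does not confirm: that authentication uses a Bearer token in the Authorization header, and that Boolean fields are sent as the strings "true" and "false".

```python
import uuid
import urllib.request

STT_URL = "https://api.telecomx.dk/voice/stt"


def build_stt_request(audio, filename, token, **fields):
    """Build (but do not send) a multipart/form-data POST for /voice/stt.

    Assumptions, not confirmed by this page: the token is passed as a
    Bearer Authorization header, and Boolean fields are sent as the
    strings "true" / "false".
    """
    boundary = uuid.uuid4().hex
    parts = []
    # Plain form fields: engine, language, diarize, ...
    for name, value in fields.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    # The audio itself goes in the "file" field as raw bytes.
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'.encode("utf-8")
        + audio + b"\r\n"
    )
    # Closing boundary terminates the multipart body.
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return urllib.request.Request(
        STT_URL,
        data=b"".join(parts),
        method="POST",
        headers={
            "Content-Type": f"multipart/form-data; boundary={boundary}",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
    )
```

A request built this way, for example with engine="ELEVEN", language="da" and diarize=False, could then be sent with urllib.request.urlopen and the JSON response decoded with json.load.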
Response
| Property | Type | Description |
|---|---|---|
| language_code | String | Language detected, ISO 639-1 format. |
| language_probability | Number | Confidence in language detected, 0 - 1. |
| text | String | The complete transcribed text. |
| words | Array | List of timestamped segments. |
| words[].text | String | Text of the segment. |
| words[].start | Number | Starting time in fractional seconds. |
| words[].end | Number | End time in fractional seconds. |
| words[].type | String | Type of segment: word, spacing. |
| words[].speaker_id | String | Id of who is speaking, if diarize is enabled. |
| words[].characters | Array | List of characters, if granularity is character. |
| words[].characters[].text | String | The character spoken. |
| words[].characters[].start | Number | Start time in fractional seconds. |
| words[].characters[].end | Number | End time in fractional seconds. |
Note that properties holding no value may be omitted from the response.
Example
{
  "language_code": "da",
  "language_probability": 0.9086595773696899,
  "text": "Hej. Goddag, du snakker med Morten Hansen fra TDC. Jeg er ham teknikeren, der skal ud til jer. Så prøv lige at ringe til mig. Det var lige om hvordan adgangsforholdene er. Ring til mig på 71 91 99 99. Det var 71 91 99 99. Hej.",
  "words": [
    { "text": "Hej.", "start": 0.899, "end": 0.959, "type": "word", "speaker_id": "speaker_0" },
    { "text": " ", "start": 0.959, "end": 0.959, "type": "spacing", "speaker_id": "speaker_0" },
    { "text": "Goddag,", "start": 0.959, "end": 1.199, "type": "word", "speaker_id": "speaker_0" },
    { "text": " ", "start": 1.199, "end": 1.22, "type": "spacing", "speaker_id": "speaker_0" },
    { "text": "du", "start": 1.22, "end": 1.299, "type": "word", "speaker_id": "speaker_0" },
    { "text": " ", "start": 1.299, "end": 1.299, "type": "spacing", "speaker_id": "speaker_0" },
    { "text": "snakker", "start": 1.299, "end": 1.539, "type": "word", "speaker_id": "speaker_0" },
    { "text": " ", "start": 1.539, "end": 1.539, "type": "spacing", "speaker_id": "speaker_0" },
    { "text": "med", "start": 1.539, "end": 1.639, "type": "word", "speaker_id": "speaker_0" },
    { ... }
  ]
}
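A response like the one above could be condensed into speaker turns as in the following Python sketch. The function name speaker_turns and the shape of its return value are illustrative choices, not part of the API. The sketch reads every property with .get() because properties holding no value may be omitted from the response.

```python
def speaker_turns(response):
    """Group consecutive segments by speaker.

    Returns a list of dicts with speaker_id, text, start and end.
    Uses .get() throughout because properties holding no value may be
    omitted from the response (e.g. speaker_id when diarize is off).
    """
    turns = []
    for seg in response.get("words", []):
        speaker = seg.get("speaker_id")  # None when the property is omitted
        if turns and turns[-1]["speaker_id"] == speaker:
            # Same speaker as the previous segment: extend the current turn.
            turns[-1]["text"] += seg.get("text", "")
            turns[-1]["end"] = seg.get("end", turns[-1]["end"])
        else:
            turns.append({
                "speaker_id": speaker,
                "text": seg.get("text", ""),
                "start": seg.get("start"),
                "end": seg.get("end"),
            })
    return turns


# First three segments of the example response above.
sample = {
    "language_code": "da",
    "words": [
        {"text": "Hej.", "start": 0.899, "end": 0.959, "type": "word", "speaker_id": "speaker_0"},
        {"text": " ", "start": 0.959, "end": 0.959, "type": "spacing", "speaker_id": "speaker_0"},
        {"text": "Goddag,", "start": 0.959, "end": 1.199, "type": "word", "speaker_id": "speaker_0"},
    ],
}
turns = speaker_turns(sample)
# turns holds one turn for speaker_0 with text "Hej. Goddag,"
```

Spacing segments are kept in the concatenated text so that each turn reads naturally. A response without a words array simply yields an empty list.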
Errors
| Error code | Message | Description |
|---|---|---|
| 404 | file | Audio file missing or invalid format |
| 403 | access_denied | Insufficient access level |
| 403 | quota_exceeded | Quota limit has been reached |
| 422 | file | No speech detected in audio file |
| 500 | internal_error | <Unspecified> |
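Neither the error code nor the message alone identifies an error in the table above: 403 covers two messages, and the message file appears under both 404 and 422. A client could therefore key its lookup on both values, as in this Python sketch. The names STT_ERRORS and describe_error are hypothetical, and since this page does not specify how the message is delivered in the error response, the function takes both values as arguments.

```python
# (HTTP status code, message) -> description, copied from the table above.
STT_ERRORS = {
    (404, "file"): "Audio file missing or invalid format",
    (403, "access_denied"): "Insufficient access level",
    (403, "quota_exceeded"): "Quota limit has been reached",
    (422, "file"): "No speech detected in audio file",
    (500, "internal_error"): "<Unspecified>",
}


def describe_error(status, message):
    """Return the documented description for a status/message pair.

    Falls back to a generic string for combinations not in the table.
    """
    return STT_ERRORS.get((status, message), f"Unknown error {status}: {message}")
```

For example, describe_error(422, "file") returns the "No speech detected" description, while describe_error(404, "file") returns the missing or invalid file description.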
api/voice/stt.txt · Last modified: 2025/05/12 10:21 by Per Møller