POST /v2/gateway/audio/transcriptions
Create transcription
curl --request POST \
  --url https://api.orq.ai/v2/gateway/audio/transcriptions \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: multipart/form-data' \
  --form 'model=<string>' \
  --form 'prompt=<string>' \
  --form enable_logging=true \
  --form diarize=false \
  --form response_format=json \
  --form tag_audio_events=true \
  --form num_speakers=2 \
  --form timestamps_granularity=word \
  --form temperature=0.5 \
  --form 'language=<string>' \
  --form 'timestamp_granularities[0]=word' \
  --form 'timestamp_granularities[1]=segment' \
  --form 'orq={
  "fallbacks": [
    {
      "model": "openai/gpt-4o-mini"
    }
  ],
  "retry": {
    "count": 3,
    "on_codes": [
      429,
      500,
      502,
      503,
      504
    ]
  },
  "contact": {
    "id": "contact_01ARZ3NDEKTSV4RRFFQ69G5FAV",
    "display_name": "Jane Doe",
    "email": "[email protected]",
    "metadata": [
      {
        "department": "Engineering",
        "role": "Senior Developer"
      }
    ],
    "logo_url": "https://example.com/avatars/jane-doe.jpg",
    "tags": [
      "hr",
      "engineering"
    ]
  },
  "load_balancer": [
    {
      "model": "openai/gpt-4o",
      "weight": 0.7
    },
    {
      "model": "anthropic/claude-3-5-sonnet",
      "weight": 0.3
    }
  ],
  "timeout": {
    "call_timeout": 30000
  }
}' \
  --form file='@example-file'
{
  "text": "<string>"
}
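
The same request can be made from any HTTP client. A minimal Python sketch using the requests library (the token, model ID, and file path are placeholders):

import requests

API_TOKEN = "<token>"  # placeholder: your orq.ai auth token

# requests builds the multipart/form-data body and boundary automatically,
# so no explicit Content-Type header is needed.
with open("meeting.mp3", "rb") as audio:
    response = requests.post(
        "https://api.orq.ai/v2/gateway/audio/transcriptions",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        data={"model": "<string>", "response_format": "json"},
        files={"file": audio},
    )

response.raise_for_status()
print(response.json()["text"])  # the json format returns {"text": "..."}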

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Body

multipart/form-data

Transcribes audio into the input language.

model
string
required

ID of the model to use

prompt
string

An optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language.

enable_logging
boolean
default:true

When enable_logging is set to false, zero retention mode is used. This disables history features like request stitching and is only available to enterprise customers.

diarize
boolean
default:false

Whether to annotate which speaker is currently talking in the uploaded file.

response_format
enum<string>

The format of the transcript output, in one of these options: json, text, srt, verbose_json, or vtt.

Available options:
json,
text,
srt,
verbose_json,
vtt
tag_audio_events
boolean
default:true

Whether to tag audio events like (laughter), (footsteps), etc. in the transcription.

num_speakers
number

The maximum number of speakers talking in the uploaded file. Helps the model predict who speaks when; the maximum is 32.

timestamps_granularity
enum<string>
default:word

The granularity of the timestamps in the transcription. word provides word-level timestamps and character provides character-level timestamps within each word.

Available options:
none,
word,
character
temperature
number

The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.

Example:

0.5

language
string

The language of the input audio. Supplying the input language in ISO-639-1 format will improve accuracy and latency.

timestamp_granularities
enum<string>[]

The timestamp granularities to populate for this transcription. response_format must be set to verbose_json to use timestamp granularities. Either or both of these options are supported: "word" or "segment". Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.

Available options:
word,
segment
Example:
["word", "segment"]
orq
object

Gateway options for this request, such as fallback models, retry policy, contact attribution, load balancing, and timeouts, passed as a JSON-encoded form field (see the example request above).
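
Since orq is a JSON object carried inside a multipart form field, it must be serialized before sending. A Python sketch reusing the gateway options from the example request (token, model, and file path are placeholders):

import json

import requests

API_TOKEN = "<token>"  # placeholder

# Gateway options from the example request: a fallback model, retries on
# transient errors, weighted load balancing, and a 30-second call timeout.
orq_options = {
    "fallbacks": [{"model": "openai/gpt-4o-mini"}],
    "retry": {"count": 3, "on_codes": [429, 500, 502, 503, 504]},
    "load_balancer": [
        {"model": "openai/gpt-4o", "weight": 0.7},
        {"model": "anthropic/claude-3-5-sonnet", "weight": 0.3},
    ],
    "timeout": {"call_timeout": 30000},
}

with open("meeting.mp3", "rb") as audio:
    response = requests.post(
        "https://api.orq.ai/v2/gateway/audio/transcriptions",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        data={"model": "<string>", "orq": json.dumps(orq_options)},
        files={"file": audio},
    )
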
file
file

The audio file object (not file name) to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.

Response

Returns the transcription or verbose transcription

text
string
required