---
title: "Speech To Text"
slug: "speech-to-text"
updated: 2026-02-26T07:59:24Z
published: 2026-02-26T07:59:24Z
---

# Speech To Text

The **Speech To Text Operator** transcribes audio from your input source into text in real time and can optionally display it as on-screen subtitles. It also integrates with [Composer's Script Engine](https://composer.docs.vindral.com/docs/scriptengine), so specific words or phrases can trigger actions, which makes voice commands possible.

### Requirements

- **CUDA Toolkit v12.4**: For installation instructions, follow the installation guide for your [Windows](/docs/windows-platform) or [Linux](/docs/linux-platform) platform.
- **Model Requirements**: Composer supports OpenAI Whisper models in the ggml format (the model file format used by whisper.cpp) with a .bin extension. The format is designed for efficient, fast processing.
  - **Supported Language**: Currently, Composer supports only English for speech-to-text processing.
  - **Included Model**: Composer ships with *ggml-tiny.en.bin*. This is the smallest and fastest model, suited to applications where speed and low resource usage are the priority.
  - **Additional Models**: Other Whisper models in the ggml format are available on [HuggingFace](https://huggingface.co/ggerganov/whisper.cpp/tree/main). Models with the "**en**" abbreviation in their file name are optimized for English-only transcription.
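
When browsing the model list, the file name alone tells you whether a model matches the requirements above. The following is a minimal Python sketch, not part of Composer, that checks the .bin extension and the "en" marker. The function name `describe_model` is hypothetical, and the handling of quantized names such as `ggml-tiny.en-q5_1.bin` is an assumption based on how files in that repository are commonly named.

```python
from pathlib import Path


def describe_model(filename: str) -> dict:
    """Summarize a Whisper ggml model file name such as 'ggml-tiny.en.bin'."""
    path = Path(filename)
    # Path.stem drops the final extension: 'ggml-tiny.en.bin' -> 'ggml-tiny.en'
    parts = path.stem.split(".")
    # English-only models carry an 'en' part, optionally followed by a
    # quantization suffix (assumed naming, e.g. 'en-q5_1').
    english_only = any(p == "en" or p.startswith("en-") for p in parts[1:])
    return {
        "name": parts[0],
        "has_bin_extension": path.suffix.lower() == ".bin",
        "english_only": english_only,
    }


if __name__ == "__main__":
    for candidate in ("ggml-tiny.en.bin", "ggml-base.bin", "notes.txt"):
        print(candidate, describe_model(candidate))
```

Since Composer currently supports only English, a model for which `english_only` is true is the natural choice.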

### Getting Started

To get started, add a **Speech To Text Operator** to an input source. Under *Scenes*, right-click your input's Operators icon ![image.png](https://cdn.document360.io/94808959-fd66-406c-ab5e-4691ce952a14/Images/Documentation/image%2848%29.png) and select:

- **Add Operator** -> **AI** -> **Speech To Text**

A *Speech To Text Operator* has now been added to your input and is ready for use.

### Load Model

The Model Source section is where you select and load the Whisper model (a .bin file) for speech-to-text processing.

![image.png](https://cdn.document360.io/94808959-fd66-406c-ab5e-4691ce952a14/Images/Documentation/image%2865%29.png)

1. **Click the "Load" button**: This opens a file dialog.
2. **Browse to the location of the model**: Ensure the model file has a **.bin** extension.
3. **Select the model file and click "Open"**: The selected model is loaded into the application, ready for use.

### State

The State section allows you to monitor and control the current status of the Operator.

![image.png](https://cdn.document360.io/94808959-fd66-406c-ab5e-4691ce952a14/Images/Documentation/image%2866%29.png)

- **State**: Displays the current state of the Operator.
- **Autostart**: Check this box to make the Operator start automatically the next time the project is loaded.
- **Start**: Click to start the Operator.
- **Stop**: Click to stop the Operator.

### Threshold

The Threshold options allow you to fine-tune the behavior of the Operator.

![image.png](https://cdn.document360.io/94808959-fd66-406c-ab5e-4691ce952a14/Images/Documentation/image%2867%29.png)

- **Confidence**: Sets how confident the Operator should be in its speech-to-text transcription.
  - Higher values require the model to be more certain about the detected words, which may reduce errors but also limit output in challenging audio conditions.
  - Lower values allow more transcription attempts but may increase inaccuracies.
- **Audio Buffer**: Sets the amount of audio (in milliseconds) fed into the model for processing.
  - A larger buffer can improve accuracy by providing more context to the model but may slightly increase latency.
  - A smaller buffer reduces latency but may decrease transcription accuracy.
- **Text-on-Screen Timeout (ms)**: Determines how long subtitles remain visible on the screen before being cleared.
- **Reset**: Resets all options in the *Threshold* section to their default values.
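
To make the trade-offs above concrete, here is a minimal Python sketch of what the two main settings control. It is an illustration only and does not reflect Composer's internals. The `Segment` type, the 0.0 to 1.0 confidence scale, and both function names are assumptions. The 16 kHz sample rate is the rate Whisper models expect; whether Composer buffers audio at that rate is also an assumption.

```python
from dataclasses import dataclass

WHISPER_SAMPLE_RATE_HZ = 16_000  # Whisper models expect 16 kHz mono audio


@dataclass
class Segment:
    """A transcribed chunk of speech (hypothetical type for illustration)."""
    text: str
    confidence: float  # assumed scale: 0.0 (no confidence) to 1.0 (certain)


def buffer_samples(buffer_ms: int, sample_rate_hz: int = WHISPER_SAMPLE_RATE_HZ) -> int:
    """Number of audio samples contained in a buffer of the given length."""
    return buffer_ms * sample_rate_hz // 1000


def filter_segments(segments: list[Segment], min_confidence: float) -> list[str]:
    """Keep only the text of segments that meet the confidence threshold."""
    return [s.text for s in segments if s.confidence >= min_confidence]


if __name__ == "__main__":
    segments = [Segment("hello", 0.92), Segment("wurld", 0.41)]
    print(buffer_samples(3000))            # samples in a 3000 ms buffer
    print(filter_segments(segments, 0.6))  # higher threshold, less output
    print(filter_segments(segments, 0.3))  # lower threshold, more output
```

Doubling the buffer doubles the audio the model sees per pass, which is why a larger buffer adds context and latency together.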

### Text Position

The Text Position options let you control where the subtitles appear on the screen.

![image.png](https://cdn.document360.io/94808959-fd66-406c-ab5e-4691ce952a14/Images/Documentation/image%2868%29.png)

- **Show Text**: Toggle this option to show or hide subtitles on the screen.
  - Subtitles are not displayed if this box is unchecked, regardless of other settings.
- **Pos-X**: Adjusts the horizontal position of the subtitles.
- **Pos-Y**: Adjusts the vertical position of the subtitles.
- **Reset**: Resets all settings in the Text Position section to their default values.

### Text Appearance

The Text Appearance section allows you to customize the look of the subtitles.

![image.png](https://cdn.document360.io/94808959-fd66-406c-ab5e-4691ce952a14/Images/Documentation/image%2869%29.png)

- **Font Size**: Sets the font size in pixels.
- **Red**: Sets the red component between **0** (no red) and **255** (full red).
- **Green**: Sets the green component between **0** (no green) and **255** (full green).
- **Blue**: Sets the blue component between **0** (no blue) and **255** (full blue).
- **Text Alpha**: Controls the transparency of the text, from **0** (fully transparent) to **255** (fully opaque).
- **Background Alpha**: Controls the transparency of the text background, from **0** (fully transparent) to **255** (fully opaque).
- **Reset**: Resets all settings in the Text Appearance section to their default values.
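
If you are matching a brand color, it can help to translate between the 0 to 255 channel values used above and the hex notation common in design tools. The sketch below is a standalone Python helper, not a Composer API. The function names are hypothetical, and clamping out-of-range values is an assumption about sensible behavior rather than documented Operator behavior.

```python
def clamp_channel(value: int) -> int:
    """Force a channel value into the valid 0-255 range."""
    return max(0, min(255, value))


def subtitle_color(red: int, green: int, blue: int, alpha: int) -> tuple[str, float]:
    """Return the color as '#RRGGBBAA' plus the alpha as a 0.0-1.0 opacity."""
    channels = [clamp_channel(c) for c in (red, green, blue, alpha)]
    hex_code = "#" + "".join(f"{c:02X}" for c in channels)
    opacity = channels[3] / 255  # 0 -> fully transparent, 255 -> fully opaque
    return hex_code, opacity


if __name__ == "__main__":
    # Yellow text at roughly half opacity
    print(subtitle_color(255, 255, 0, 128))
```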

### Text Settings

The Text Settings section lets you adjust how subtitles are displayed on the screen.

![image.png](https://cdn.document360.io/94808959-fd66-406c-ab5e-4691ce952a14/Images/Documentation/image%2870%29.png)

- **Max Lines**: Sets the maximum number of subtitle lines that can appear on the screen at one time.
- **Max Chars (per line)**: Sets the maximum number of characters allowed per subtitle line.
- **Small Letters Only**: Displays subtitles in lowercase letters only.
  - This can be helpful for a more uniform and minimalist subtitle style.
- **Reset**: Resets all settings in the Text Settings section to their default values.
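
The interaction between these three settings can be sketched in a few lines of Python. This is an approximation for illustration, not Composer's actual layout code. The function name `layout_subtitle` is hypothetical, and two behaviors are assumptions: that text wraps at word boundaries, and that the most recent lines are the ones kept when the text exceeds the line limit.

```python
import textwrap


def layout_subtitle(text: str, max_lines: int, max_chars: int,
                    small_letters_only: bool = False) -> list[str]:
    """Wrap text to max_chars per line and keep at most max_lines lines."""
    if max_lines <= 0 or max_chars <= 0:
        return []
    if small_letters_only:
        text = text.lower()
    # Break at word boundaries so that no line exceeds max_chars
    lines = textwrap.wrap(text, width=max_chars)
    # Assumption: when there are too many lines, the newest ones are kept
    return lines[-max_lines:]


if __name__ == "__main__":
    for line in layout_subtitle("Hello World this is a Test of subtitles",
                                max_lines=2, max_chars=12,
                                small_letters_only=True):
        print(line)
```

A small **Max Chars** value combined with a low **Max Lines** value means long sentences scroll out of view quickly, so the two are best tuned together.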

### Recent Text

The *Recent Text* property displays the most recent spoken text. This text is automatically updated as new speech is transcribed.

![image.png](https://cdn.document360.io/94808959-fd66-406c-ab5e-4691ce952a14/Images/Documentation/image%2871%29.png)

- **Text**: A read-only field that shows the most recent spoken words.

### Script Callback Function (optional)

The Script Callback Function allows advanced users to define a custom [Script Engine function](https://composer.docs.vindral.com/docs/scriptengine) that will be invoked whenever new speech is transcribed.

![image.png](https://cdn.document360.io/94808959-fd66-406c-ab5e-4691ce952a14/Images/Documentation/image%2873%29.png)

- **Function Name**: Define the name of your custom [Script Engine function](https://composer.docs.vindral.com/docs/scriptengine).
  - This function will be called whenever new speech is recognized, enabling you to define your own event handler.
  - Use this feature to trigger specific actions or interact with other components in Composer based on transcribed speech.
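
The logic of a voice-command handler can be sketched as follows. This page does not specify the Script Engine's language or the arguments passed to the callback, so the example is written in Python purely to illustrate the matching logic and is not Script Engine code. The function `on_speech`, the assumption that it receives the transcribed text as a single string, and the phrases and action names in `COMMANDS` are all hypothetical.

```python
import re

# Hypothetical mapping from spoken phrases to action names
COMMANDS = {
    "start recording": "start_recording",
    "next scene": "next_scene",
}


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(cleaned.split())


def on_speech(text: str) -> list[str]:
    """Return the actions whose phrases occur as whole words in the text."""
    spoken = f" {normalize(text)} "
    # Padding with spaces prevents partial matches inside longer words
    return [action for phrase, action in COMMANDS.items()
            if f" {phrase} " in spoken]


if __name__ == "__main__":
    print(on_speech("Okay, Next scene please!"))
    print(on_speech("Nothing to do here."))
```

Normalizing before matching matters because transcriptions often include capitalization and punctuation that a spoken command does not care about.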
