Speech To Text

The Speech To Text Operator transcribes audio from your input source into text in real time and can optionally display it as on-screen subtitles. It also integrates with Composer's Script Engine, so specific words or phrases can trigger actions, which makes voice commands possible.

Model Requirements

Composer supports OpenAI's Whisper models in the ggml format, stored in files with a .bin extension. ggml is the tensor library and model format used by the whisper.cpp project, and it is designed for fast, efficient inference.

Supported Language
Currently, Composer supports only English for speech-to-text processing.

Included Model
Composer ships with the ggml-tiny.en.bin model. It is the smallest and fastest Whisper model, ideal for applications where speed and low resource usage are a priority.

Additional Models
If you require other Whisper models in the ggml format, you can find them on HuggingFace. Models with "en" in the file name (for example, ggml-tiny.en.bin) are English-only models.

Getting Started

To get started, first add a Speech To Text Operator to an input source. Under Scenes, right-click your input's Operators icon and select:

Add Operator -> AI -> Speech To Text

A Speech To Text Operator has now been added to your input and is ready for use.

Load Model

The Model Source section is where you select and load the Whisper model (*.bin) for speech-to-text processing.

Click the "Load" button:
This will open a file dialog box.
Browse to the location of the model:
Ensure the model file has a .bin extension.
Select the model file and click "Open":
The selected model will now be loaded into the application, ready for use.
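
Before loading a model, it can help to confirm that the file matches what this section describes. The following is a minimal Python sketch, not part of Composer: the function name describe_model is hypothetical, and the check relies only on the naming convention seen in ggml-tiny.en.bin (a ".en" part before the ".bin" extension). It inspects the file name only and does not validate the file contents.

```python
from pathlib import PurePath


def describe_model(filename):
    """Report whether a model file name looks usable, based on its name only.

    Hypothetical helper: it does not open the file or verify the ggml format.
    """
    path = PurePath(filename)
    suffixes = [s.lower() for s in path.suffixes]
    return {
        # The operator expects a file ending in .bin
        "has_bin_extension": path.suffix.lower() == ".bin",
        # Assumed convention: English-only models carry ".en" before ".bin"
        "english_only": ".en" in suffixes[:-1],
    }


if __name__ == "__main__":
    print(describe_model("ggml-tiny.en.bin"))
    print(describe_model("ggml-base.bin"))
```

Because Composer currently supports English only, a name without the ".en" part may indicate a multilingual model, which is larger than its English-only counterpart without adding any benefit here.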

State

The State section allows you to monitor and control the current status of the Operator.

State:
Displays the current state of the Operator.
Autostart:
Check this box to make the Operator start automatically the next time the project is loaded.
Start:
Click to start the Operator.
Stop:
Click to stop the Operator.

Threshold

The Threshold options allow you to fine-tune the behavior of the Operator.

Confidence:
Sets how confident the Operator should be in its speech-to-text transcription.
- Higher values require the model to be more certain about the detected words, which may reduce errors but also limit output in challenging audio conditions.
- Lower values allow more transcription attempts but may increase inaccuracies.
Audio Buffer:
Sets the amount of audio (in milliseconds) fed into the model for processing.
- A larger buffer can improve accuracy by providing more context to the model but may slightly increase latency.
- A smaller buffer reduces latency but may decrease transcription accuracy.
Text-on-Screen Timeout (ms):
Determines how long subtitles should remain visible on the screen before being cleared.
Reset:
Resets all options in the Threshold section to their default values.
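
To make the three Threshold options more concrete, here is a small Python sketch of the arithmetic behind them. It is illustrative only and does not reflect Composer's internals: the function names are hypothetical, the confidence scale of 0.0 to 1.0 and the "greater than or equal" comparison are assumptions, and the 16 kHz sample rate is the rate Whisper models operate on (whether Composer resamples the input itself is not stated here).

```python
WHISPER_SAMPLE_RATE_HZ = 16_000  # Whisper models operate on 16 kHz audio


def buffer_to_samples(buffer_ms, sample_rate_hz=WHISPER_SAMPLE_RATE_HZ):
    """Convert an Audio Buffer length in milliseconds to a sample count."""
    return buffer_ms * sample_rate_hz // 1000


def passes_confidence(confidence, threshold):
    """Assumed rule: keep a transcription only if it meets the threshold."""
    return confidence >= threshold


def subtitle_visible(now_ms, last_text_ms, timeout_ms):
    """Assumed rule: a subtitle stays on screen until the timeout elapses."""
    return (now_ms - last_text_ms) < timeout_ms


if __name__ == "__main__":
    print(buffer_to_samples(3000))             # samples in a 3-second buffer
    print(passes_confidence(0.4, 0.6))         # rejected at a 0.6 threshold
    print(subtitle_visible(6000, 3000, 2500))  # cleared after 2.5 seconds
```

The sample count shows why the buffer size matters: doubling the Audio Buffer doubles the audio the model must process per pass, which gives it more context and adds latency.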

Text Position

The Text Position options let you control where the subtitles appear on the screen.

Show Text:
- Toggle this option to show or hide subtitles on the screen.
- Subtitles will not be displayed if this box is unchecked, regardless of other settings.
Pos-X:
Adjusts the horizontal position of the subtitles.
Pos-Y:
Adjusts the vertical position of the subtitles.
Reset:
Resets all settings in the Text Position section to their default values.

Text Appearance

The Text Appearance section allows you to customize the look of the subtitles.

Font Size:
Sets the font size in pixels.
Red:
Sets the red color component, from 0 (no red) to 255 (full red).
Green:
Sets the green color component, from 0 (no green) to 255 (full green).
Blue:
Sets the blue color component, from 0 (no blue) to 255 (full blue).
Text Alpha:
Controls the transparency of the text, from 0 (fully transparent) to 255 (fully opaque).
Background Alpha:
Controls the transparency of the text background, from 0 (fully transparent) to 255 (fully opaque).
Reset:
Resets all settings in the Text Appearance section to their default values.
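
If you are used to color notation where alpha runs from 0.0 to 1.0, the following Python sketch shows how the 0 to 255 values above map onto it. It is a hypothetical helper for working out values, not something Composer provides; the function names and the CSS-style output string are assumptions chosen for illustration.

```python
def clamp_channel(value):
    """Clamp a value to the 0-255 range used by the color and alpha options."""
    return max(0, min(255, int(value)))


def to_css_rgba(red, green, blue, alpha):
    """Format 0-255 channel values as a CSS-style rgba() string."""
    r, g, b, a = (clamp_channel(v) for v in (red, green, blue, alpha))
    # Alpha 0 is fully transparent and 255 is fully opaque
    return f"rgba({r}, {g}, {b}, {a / 255:.2f})"


if __name__ == "__main__":
    print(to_css_rgba(255, 255, 0, 255))  # opaque yellow
    print(to_css_rgba(255, 255, 255, 128))  # roughly half-transparent white
```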

Text Settings

The Text Settings section lets you adjust how subtitles are displayed on the screen.

Max Lines:
Sets the maximum number of subtitle lines that can appear on the screen at one time.
Max Chars (per line):
Sets the maximum number of characters allowed per subtitle line.
Small Letters Only:
Displays subtitles in lowercase letters only.
- This can be helpful for a more uniform and minimalist subtitle style.
Reset:
Resets all settings in the Text Settings section to their default values.
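
The interaction between Max Lines, Max Chars and Small Letters Only can be sketched in a few lines of Python. This is an approximation under stated assumptions, not Composer's actual layout code: the function name format_subtitle is hypothetical, wrapping is done on word boundaries with Python's textwrap module, and when the text needs more lines than allowed the sketch keeps the most recent lines.

```python
import textwrap


def format_subtitle(text, max_lines, max_chars, small_letters_only=False):
    """Wrap text to max_chars per line and keep at most max_lines lines."""
    if small_letters_only:
        text = text.lower()
    lines = textwrap.wrap(text, width=max_chars)
    # Assumption: when there are too many lines, the newest ones are kept
    return lines[-max_lines:]


if __name__ == "__main__":
    for line in format_subtitle(
        "Hello World this is Composer",
        max_lines=2,
        max_chars=12,
        small_letters_only=True,
    ):
        print(line)
```

With these settings the text wraps into three lines ("hello world", "this is", "composer"), and only the last two are kept.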

Recent Text

The Recent Text property displays the most recent spoken text. This text is automatically updated as new speech is transcribed.

Text:
A read-only field that shows the most recent spoken words.

Script Callback Function (optional)

The Script Callback Function allows advanced users to define a custom Script Engine function that will be invoked whenever new speech is transcribed.

Function Name:
Define the name of your custom Script Engine function.
- This function will be called whenever new speech is recognized, enabling you to define your own event handler.
- Use this feature to trigger specific actions or interact with other components in Composer based on transcribed speech.
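
The voice-command idea can be illustrated with a short sketch of the matching logic such a callback might contain. The Script Engine's language and the exact signature of the callback are not described in this section, so the example below is plain Python and every name in it (on_speech, COMMANDS, the action names) is hypothetical. It assumes the callback receives the transcribed text as a string.

```python
import re

# Hypothetical mapping from spoken phrases to action names
COMMANDS = {
    "next scene": "go_to_next_scene",
    "start recording": "start_recording",
}


def normalize(text):
    """Lowercase the text, strip punctuation and split it into words."""
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split()


def match_commands(transcript, commands):
    """Return the actions whose phrase occurs as consecutive whole words."""
    words = normalize(transcript)
    hits = []
    for phrase, action in commands.items():
        target = phrase.lower().split()
        span = len(target)
        if any(words[i:i + span] == target for i in range(len(words) - span + 1)):
            hits.append(action)
    return hits


def on_speech(transcript):
    """Hypothetical callback: map newly transcribed text to action names."""
    return match_commands(transcript, COMMANDS)


if __name__ == "__main__":
    print(on_speech("Okay, NEXT scene please!"))
    print(on_speech("next to the scene"))
```

Matching on whole, consecutive words rather than on substrings avoids false triggers when the words of a command appear apart in ordinary speech, as in the second call.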