Step-by-Step Method to Create an Automatic Text Transcription Program
OpenAI’s Whisper is one of the most powerful solutions for turning your voice into text. However, Whisper can also be annoying to use, since you have to type commands to transcribe an audio file into text. But why do that when we’ve got AutoHotkey?
With AutoHotkey, we can effortlessly create a basic GUI for command-line apps like Whisper. So, let’s do that and see how you can create your own transcription app by combining AutoHotkey’s GUI-making superpowers with OpenAI’s Whisper as the “brain” behind the buttons.
Laying the Foundations for Whisper and AutoHotkey
You can make cool scripts with AutoHotkey, but that’s not all it can do. For this project, we’ll use AutoHotkey to create a GUI for Whisper. This will allow us to use OpenAI’s voice recognition AI tool by clicking buttons and customizing its functionality using menus instead of typing commands.
However, this means that you’ll need to have both AutoHotkey and Whisper installed to follow along.
For the first part of the equation, you can download AutoHotkey from its official site, then run its installer and follow the presented steps.
Note that we’ll use the older “v1” version of the scripting language, not the new v2. That’s important because the two versions use a somewhat different syntax, so the code shown here might not work if you run it under v2.
The second part is more complicated, but you can learn how to do it by checking our article on how to turn your voice into text with OpenAI’s Whisper for Windows.
With both installed, our plan of action is as follows:
- Create a GUI with elements for Whisper’s variables and values.
- Create functions to grab values from the interface, select files and folders, and assemble everything into a usable Whisper command.
- Run the Whisper command to produce results.
Of course, you could always use Windows’ built-in support for Voice Typing, as we saw in our article on how to start Voice Typing on Windows 11. Still, as you’ll see while using it, Whisper is much more accurate (but also slower).
On a more personal note, I should explain that I am not a programmer, and this project is a “remix” of a solution made for personal use.
How to Make a New AutoHotkey Script
The first step is to create a new blank script file. Keep it in its own folder, just in case you decide to tweak or build on it, creating more files.
- Run your favorite file manager (or press Windows Key + E to launch File Explorer) and create a folder for your transcription app anywhere you like.
- Right-click on a blank spot of the window and select New > AutoHotkey Script to create an empty script file.
- Shift + Right Click on the file to access the full context menu and select to open it with your favorite code or text editor. Windows’ own Notepad will do.
- Despite being “an empty script”, your AHK file will already be pre-populated with some “stuff”. Those are useful AutoHotkey variables and flags that define how it should work on your desktop. Ignore them, leave them as they are, and do all your future typing underneath them.
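For reference, the pre-populated lines in a new v1 script usually look something like the sketch below. The exact template can differ between AutoHotkey releases, so treat this as an illustration of what to expect rather than something you need to type in yourself.

```autohotkey
#NoEnv  ; Skips checking whether empty variables are environment variables.
SendMode Input  ; Makes the Send command use the faster SendInput mode.
SetWorkingDir %A_ScriptDir%  ; Starts the script in the folder it is saved in.
```

Everything you add for this project goes below these lines.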
Getting to Know Whisper’s Flags
Since we’re making a GUI for a command line app, it’s handy to have a reference to its major variables and flags that we’ll be using in our project. You can check them out by reading Whisper’s documentation, visiting its official GitHub page, and running it in your terminal.
We’ll list the ones we’ll use in this project for convenience. We suggest you add them to your script as comments (in separate lines, each beginning with a “;” character followed by a space).
; Whisper Flags:
; --initial_prompt PROMPT_TEXT
; --output_format txt
; -o OUTPUT_FOLDER
; --model MODEL_TO_USE
; --task TRANSCRIBE/TRANSLATE
; --language EN/EL
Creating the GUI With AutoHotkey
We suggest you split your script into sections using comments like we did to keep it organized. We’ll start by defining some variables, continue to the actual GUI, and end by defining its functions.
Establishing the Hidden Variables
We begin with a section where we’ll define variables we may want to change in the future, but not so often that we’d like to expose them through the GUI, over-complicating it. You can type “Variable_Name = Content or value of the variable” with one variable and value pair per line.
For this project, we’ve defined an OutputFormat variable that we set to the “txt” value and a WhisperExecutable variable stating Whisper’s executable file name. This way, if we want to use the same solution in the future to create SRT subtitle files instead of TXT documents, or to upgrade Whisper or switch to an alternative app, we can adjust the values of those variables in that single spot instead of throughout the script.
OutputFormat = txt
WhisperExecutable = whisper
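To see how these two variables and the flags listed earlier might eventually fit together, here is a minimal sketch that builds a Whisper command as a string and previews it in a message box. The AudioFile, OutputFolder, and PromptText values are made-up placeholders, and WhisperCommand is just an illustrative variable name; the model, task, and language are hard-coded here only to keep the example short.

```autohotkey
; Sketch only: AudioFile, OutputFolder and PromptText hold made-up placeholder values.
OutputFormat = txt
WhisperExecutable = whisper
AudioFile := "C:\Audio\notes.mp3"
OutputFolder := "C:\Audio\Transcripts"
PromptText := "Transcription of my notes"

; Inside a quoted v1 expression string, "" produces one literal quote mark.
WhisperCommand := WhisperExecutable . " """ . AudioFile . """"
WhisperCommand .= " --initial_prompt """ . PromptText . """"
WhisperCommand .= " --output_format " . OutputFormat
WhisperCommand .= " -o """ . OutputFolder . """"
WhisperCommand .= " --model small --task transcribe --language en"

MsgBox, %WhisperCommand%  ; Preview the command instead of running it.
```

Wrapping the file path, folder, and prompt in quotes keeps values that contain spaces from being split into separate arguments on the command line.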
Setting Up the User Options
When using Whisper on the command line, three of its flags allow you to define:
- If you’re doing translation or transcription
- The audio file’s language
- The language model you want to use (various sizes are available, each trading speed against the quality of the results).
The easiest way to offer the same functionality through a GUI is through tried and tested drop-down lists. The syntax for adding a drop-down list to an AutoHotkey GUI is as follows:
Gui, Add, DropDownList, xPosition yPosition wWidth hHeight vVariable_that_will_hold_selected_value, optionA|optionB|default_optionC||optionD|
Based on that, let’s add three drop-down lists to our script for selecting Whisper’s language (between English/en and Greek/el), model (tiny, base, small, medium, large), and task type (transcribe or translate).
Gui, Add, DropDownList, x5 y5 w165 h50 vSelectedLanguage, en||el
Gui, Add, DropDownList, x175 y5 w165 h100 vSelectedModel, tiny|base|small||medium|large|
Gui, Add, DropDownList, x345 y5 w165 h100 vTaskType, transcribe||translate|
To set an option as the default selection, use a double pipe symbol (“||”) after it. You can see that, in our example, we’ve set SelectedLanguage to en, SelectedModel to small, and TaskType to transcribe.
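If you’d like to check how a drop-down list’s selection ends up in its variable, the following minimal sketch may help. It is meant to be saved and run as a separate test script, not pasted into the main one, and the ShowChoice label and the “Show Choice” button are hypothetical names used only for this test.

```autohotkey
; Standalone test: one drop-down list plus a button that reports its value.
Gui, Add, DropDownList, x5 y5 w165 vSelectedLanguage, en||el
Gui, Add, Button, x5 y35 w165 gShowChoice, Show Choice
Gui, Show,, Drop-Down Test
return

ShowChoice:
Gui, Submit, NoHide  ; Copies each control's current value into its v-variable.
MsgBox, Selected language: %SelectedLanguage%
return

GuiClose:
ExitApp
```

Without clicking anything in the list, the message box should report the default option, which is the one followed by the double pipe.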
How to Guide Whisper
Since Whisper is AI-based, there’s no way to have absolute control over how Whisper transcribes audio. It’s free to choose what it considers optimal.
However, like other AI solutions, Whisper can accept user prompts. By crafting a prompt, you can “guide” how it transcribes your audio.
Did the solution we’re making fail to transcribe something correctly? You can try “explaining” to Whisper “what the voice file is about”, including in your prompt the spelling of words, acronyms, and phrases as you want them to appear in the transcription. For that, we’ll add an AutoHotkey Edit (text) field.
The syntax is not too different from what we used for adding drop-down lists above:
Gui, Add, Edit, x5 w505 h400 vPromptText, %PromptText%
The “%PromptText%” at the end “tells” AHK to show the PromptText variable’s content (if it’s already assigned a value) within the text field. It won’t show anything in the script we’re making, but consider it a placeholder for when you eventually tweak the script to also save and load prompts!
Would you prefer to assign a predefined value to the PromptText variable? Add something like the following to the Variables section of the script. Remember to replace “Your Name’s” with your actual name.
PromptText = Transcription of Your Name's notes
Setting Up the Action Buttons
For choosing files, folders, and running Whisper after we’ve set everything up, it’s better to use buttons. You can add buttons to an AHK-made interface using the following:
Gui, Add, Button, xPosition yPosition wWidth hHeight gFunction_To_Perform, Button Text
Notice that unlike variable names in GUI elements, which are prefixed with the letter “v”, the names of the functions a button runs are prefixed with “g”, for “gosub” (that is, “go to this spot of the script”).
A single button of an AHK interface can also be marked as “the default one”, which will be activated if you press Enter without first clicking anywhere on the GUI. This is defined by adding “Default” in the coordinates-and-function section, as you’ll notice in our “OK” button:
Gui, Add, Button, x5 w505 h50 gSelectFile, Load File
Gui, Add, Button, x5 w505 h50 gSelectFolder, Choose Output Folder
Gui, Add, Button, Default x5 w505 h50 gButtonSubmit, OK
With the above, we’re defining three buttons:
- One labeled “Load File” that, when clicked, will run the SelectFile function.
- One labeled “Choose Output Folder”, which will run the SelectFolder function.
- One labeled “OK”, selected by default, “calling” the ButtonSubmit function.
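To give a rough idea of what could sit behind those three names, here is a minimal standalone sketch that wires the buttons to simple handlers. It is not the finished script: AudioFile and OutputFolder are hypothetical variable names, and the OK handler only reports what was picked instead of running Whisper. Like any standalone test, it is best run as its own script file.

```autohotkey
; Standalone test of the three buttons with placeholder handlers.
Gui, Add, Button, x5 w505 h50 gSelectFile, Load File
Gui, Add, Button, x5 w505 h50 gSelectFolder, Choose Output Folder
Gui, Add, Button, Default x5 w505 h50 gButtonSubmit, OK
Gui, Show,, Button Test
return

SelectFile:
; Option 3 requires that both the file and its path exist.
FileSelectFile, AudioFile, 3,, Select an audio file
return

SelectFolder:
; Option 3 adds a "make new folder" button and an editable path field.
FileSelectFolder, OutputFolder,, 3, Select an output folder
return

ButtonSubmit:
Gui, Submit, NoHide  ; In a fuller script, this would also fetch the list and prompt values.
MsgBox, File: %AudioFile%`nOutput folder: %OutputFolder%
return

GuiClose:
ExitApp
```

If you cancel either dialog, the corresponding variable is left empty, so a fuller version would probably want to check for that before building a command.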
- Author: David
- Created at : 2024-08-08 13:19:50
- Updated at : 2024-08-09 13:19:50
- Link: https://win11.techidaily.com/step-by-step-method-to-create-an-automatic-text-transcription-program/
- License: This work is licensed under CC BY-NC-SA 4.0.