VoiceGoose Manual
This manual will guide you through the process of using VoiceGoose.
What is VoiceGoose
VoiceGoose is an application that uses AI to generate realistic speech. It can generate speech in a variety of languages and voices, and can even clone an existing voice from a reference audio file of just 5-10 seconds. VoiceGoose is designed to run locally on your own computer and does not require an internet connection. This means that your audio is always processed on your own hardware and never sent to any remote server.
System Requirements
VoiceGoose requires:
- M-series processor (M1, M2, M3, M4, etc.)
- 16GB RAM
- 30GB of storage space
- macOS Sequoia (15.6) or later
The speed at which voices can be generated will depend on your hardware.
Voice Design
When using voice design you can give VoiceGoose an instruction in natural language, describing the voice in which you want your text to be spoken.
The best approach is to describe the voice as you would to another person. If you tell VoiceGoose to sound like a toaster it will not know what to do, but if you tell it to sound like an excited young man then it is likely to understand your intent. There are limits to what you can ask for, but you can get a long way with a well-crafted instruction.
For example, consider the following aspects:
- Gender
- Emotion
- Pace
- Pitch
- Speed
- Volume
- Tone
Bear in mind that you may sometimes need to generate a few versions before it comes out just right, even with a well-crafted instruction.
Voice Cloning
With voice cloning you can provide VoiceGoose with a reference audio file of a voice, and VoiceGoose will generate a new clip that sounds like the reference voice but says something different. VoiceGoose can even make voices speak in languages other than the one in the reference audio file. Have you ever heard yourself speaking Japanese, or French?
The reference audio file should be at least 5-10 seconds long and of the highest quality possible. Longer clips of up to 30 seconds can sometimes provide better results. The audio file should contain a single speaker and preferably no background noise. It must be in WAV format.
When generating a voice clone you can also provide a transcript of the reference audio file. This is optional but highly recommended, as it can greatly improve the quality of the clone.
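The length and format guidelines for the reference audio file can be checked before loading a clip into VoiceGoose. The following is a minimal sketch in Python and is not part of VoiceGoose itself. It uses only the standard-library wave module; the function names and the 5 and 30 second thresholds (taken from the guidance in this section) are illustrative assumptions. Note that the wave module reads only uncompressed PCM WAV files, so some valid WAV files may be reported as unreadable.

```python
import wave

# Thresholds based on the guidance in this manual; adjust as needed.
MIN_SECONDS = 5.0
MAX_SECONDS = 30.0


def wav_duration_seconds(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_reference_clip(path):
    """Return a list of warnings; an empty list means no problems were found."""
    try:
        duration = wav_duration_seconds(path)
    except (wave.Error, EOFError):
        # Raised when the file is not a WAV file the wave module can parse.
        return ["file could not be read as a WAV file"]

    warnings = []
    if duration < MIN_SECONDS:
        warnings.append(
            f"clip is {duration:.1f} s; aim for at least {MIN_SECONDS:.0f} s"
        )
    elif duration > MAX_SECONDS:
        warnings.append(
            f"clip is {duration:.1f} s; clips up to {MAX_SECONDS:.0f} s are suggested"
        )
    return warnings
```

Calling check_reference_clip("my_voice.wav") (a hypothetical file name) returns an empty list when the clip length falls within the suggested range. The script cannot judge audio quality, background noise or the number of speakers; those still need to be checked by ear.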
Writing for VoiceGoose
When generating a voice clip you provide VoiceGoose with the text that you want it to say. This text can be in any of the supported languages. When writing your text you can add flavor to it, indicating to VoiceGoose how it should be spoken. For example, if you wanted the speaker to speak up you could write "IN ALL CAPS!" or if you wanted the speaker to be hesitant you could add "Some dots... O-Or maybe hy-hyphenation?". Try to keep the feeling of your text in line with any instructions you are giving the voice, as the model won't know what to do if the text and instructions are at odds with each other.
Common Issues
These are some common issues that users may encounter when using VoiceGoose.
"VoiceGoose quit unexpectedly"
The most typical reason for this is trying to generate a voice clip that is too long. The maximum clip length you can generate depends on your hardware, in particular the amount of memory in your computer. Try quitting other memory-hungry applications while you are using VoiceGoose, as this can reduce the memory pressure on the system. If this does not help, try generating shorter clips.
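One way to produce shorter clips is to split a long text into smaller pieces and generate each piece separately. The following is a minimal sketch in Python and is not part of VoiceGoose itself. The function name and the default limit of 300 characters are assumptions chosen for illustration; this manual does not specify a maximum text length, so the right value for your hardware has to be found by experiment.

```python
import re


def split_into_chunks(text, max_chars=300):
    """Split text at sentence boundaries into chunks of at most max_chars.

    A single sentence longer than max_chars is kept whole rather than cut
    in the middle of a word.
    """
    # Split after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: start a new chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be entered into VoiceGoose as its own generation. Splitting at sentence boundaries keeps each chunk readable on its own, although text written with trailing dots to signal hesitation will also be split at those dots.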
Generating is taking a lot of time
Generating is a computationally intensive process, and VoiceGoose runs on your local hardware. The time it takes to generate a voice clip will therefore depend on your system specs. Note that the first generation takes longer because VoiceGoose has to load the necessary AI models into memory; subsequent generations are usually faster.
There is some noise right at the beginning of the audio clip
This is a known issue with the current model and unfortunately sometimes happens. It tends to happen more when the generated clip is very short. If you are experiencing this issue, try generating a longer clip.
The voice is not doing what I tell it to do
This can happen for a number of reasons.
- The model simply cannot do what you are asking it to do.
- The instruction is not clear enough.
- The instruction is at odds with the text.
- The model is not able to understand the text.
- The generated clip is too short.
Or, at the end of the day, it could just be bad luck. Try generating the clip again!