VoiceGoose Manual

This manual will guide you through the process of using VoiceGoose.

What is VoiceGoose

VoiceGoose is an application that uses AI to generate realistic speech. It can generate speech in a variety of languages and voices, and can even clone an existing voice from a reference audio file as short as 3 seconds. VoiceGoose is designed to run locally on your own computer and does not require an internet connection. This means that your audio is always processed on your own hardware and is never sent to any remote servers.

System Requirements

VoiceGoose requires:

M-series processor (M1, M2, M3, M4, etc.)
8 GB RAM
10 GB of storage space
macOS Sequoia (15.6) or later

The speed at which voices can be generated will depend on your hardware.

Voice Design

With voice design, you give VoiceGoose an instruction in natural language that describes the voice in which it should speak your text.

A good approach is to describe the voice the way you would describe it to another person. If you tell VoiceGoose to sound like a toaster, it will not know what to do, but if you tell it to sound like an excited young man, it is likely to understand your intent. There are limits to what you can ask for, but you can get a long way with a well-crafted instruction.

For example, consider the following aspects:

Gender
Emotion
Pace
Pitch
Speed
Volume
Tone

Keep in mind that even with a well-crafted instruction, you may need to generate a few versions before the result comes out just right.

Voice Cloning

With voice cloning, you provide VoiceGoose with a reference audio file of a voice, and VoiceGoose generates a new clip that sounds like the reference voice but says something different. VoiceGoose can even make a voice speak languages other than the one in the reference audio file. Have you ever heard yourself speaking Japanese, or French?

The reference audio file should preferably be at least 5-10 seconds long and of as high a quality as possible. Longer clips of around 20-30 seconds will often provide better results. The audio file should contain a single speaker and preferably no background noise.
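If you want to check the length of a clip before using it as a reference, a small script can do it. The following is a minimal sketch that runs outside VoiceGoose and is not part of the application. It assumes the clip is an uncompressed WAV file, and the function name and the exact thresholds are illustrative choices based on the guidance above.

```python
import wave

# Thresholds in seconds, chosen to mirror the guidance above (an assumption,
# not limits enforced by VoiceGoose).
MIN_SECONDS = 5.0
GOOD_SECONDS = 20.0


def describe_reference_clip(path):
    """Return (duration_seconds, advice) for an uncompressed WAV file."""
    with wave.open(path, "rb") as clip:
        # Duration is the number of audio frames divided by frames per second.
        duration = clip.getnframes() / clip.getframerate()

    if duration < MIN_SECONDS:
        advice = "too short: aim for at least 5-10 seconds"
    elif duration < GOOD_SECONDS:
        advice = "usable: around 20-30 seconds often gives better results"
    else:
        advice = "good length"
    return duration, advice
```

Calling describe_reference_clip("reference.wav"), where the file name is a placeholder for your own clip, returns the duration in seconds together with a short hint. The script only measures length; it cannot tell you whether the clip contains background noise or more than one speaker.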

When generating a voice clone, you can also provide a transcript of the reference audio file. This is optional but highly recommended, as it can greatly improve the quality of the clone.

Writing for VoiceGoose

When generating a voice clip, you provide VoiceGoose with the text that you want it to say. This text can be in any of the supported languages. When writing your text, you can add flavor to it to indicate how it should be spoken. For example, if you want the speaker to speak up, you could write "IN ALL CAPS!", and if you want the speaker to sound hesitant, you could add "Some dots... O-Or maybe hy-hyphenation?". Try to keep the feeling of your text in line with any instructions you are giving the voice, as the model won't know what to do if the text and the instructions are at odds with each other.

Common Issues

These are some common issues that users may encounter when using VoiceGoose.

"VoiceGoose quit unexpectedly"

The most common cause is trying to generate a voice clip that is too long. How long a clip you can generate depends on your hardware, in particular the amount of memory in your computer. Try quitting other memory-hungry applications while you are using VoiceGoose, as this can reduce the memory pressure on the system. If this does not help, try generating shorter clips.
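If your text is long, one way to produce shorter clips is to split it into smaller pieces at sentence boundaries and generate each piece separately. The following is a minimal sketch of that idea, run outside VoiceGoose. The function name and the 400-character default are assumptions made for illustration; VoiceGoose does not document a specific limit, so adjust the value to what your hardware can handle.

```python
import re


def split_into_chunks(text, max_chars=400):
    """Group whole sentences into chunks that aim to stay within max_chars."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    chunks = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: start a new chunk.
            chunks.append(current)
            current = sentence
        else:
            # Note: a single sentence longer than max_chars is kept whole.
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be generated as its own clip. Splitting on sentence boundaries rather than at a fixed character count avoids cutting a sentence in half, which would likely sound unnatural when spoken.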

Generating is taking a lot of time

Generating is a computationally intensive process, and VoiceGoose runs on your local hardware. The time it takes to generate a voice clip therefore depends on your system specs. Note that the first generation takes longer because VoiceGoose has to load the necessary AI models into memory. Subsequent generations are usually faster.

There is some noise right at the beginning of the audio clip

This is a known issue with the current AI model and unfortunately sometimes happens. It tends to happen more when the generated clip is very short. If you are experiencing this issue, try generating a longer clip.

The voice is not doing what I tell it to do

This can happen for a number of reasons:

The model simply cannot do what you are asking it to do.
The instruction is not clear enough.
The instruction is at odds with the text.
The model is not able to understand the text.
The generated clip is too short.

Or, at the end of the day, it could just be bad luck. Try generating the clip again!