Google SoundStorm: A high-quality generative sound AI platform

By Sazzad Yousuf, Review Editor

Imagine a world where voices can be generated and blended with unmatched clarity, switching between characters in dialogue that stretches across dozens of scenes. Thanks to the game-changing Google SoundStorm, that future is closer than you think.

Google SoundStorm AI Voiceover

Google SoundStorm is an impressive piece of sound technology that has left the world open-mouthed. It can reproduce voices accurately and quickly, opening up a wide range of applications. It generates natural-sounding conversations between multiple speakers and synthesizes audio around one hundred times faster than old-style autoregressive techniques, which look slow as molasses by comparison. Google SoundStorm is ushering in a new era of audio experiences!

Demo of Google SoundStorm

Here is a demo video of Google SoundStorm, from the Google Research blog:

Input: Audio prompt

I didn’t sleep well last night. | Oh, no. What happened? 

Output: Audio prompt + generated audio

I didn’t sleep well last night. | Oh, no. What happened? | I don’t know. I I just couldn’t seem to uh to fall asleep somehow, I kept tossing and turning all night. | That’s too bad. Maybe you should uh try going to bed earlier tonight or uh maybe you could try reading a book. | Yeah, thanks for the suggestions, I hope you’re right. | No problem. I I hope you get a good night’s sleep

What is Google SoundStorm?

Google SoundStorm is a non-autoregressive audio generation model. It is a prime example of the kind of audio model worth preparing for.

Its output is a sequence of tokens from a neural audio codec, and it produces them far more efficiently than the acoustic generation pipeline employed by AudioLM and SPEAR-TTS, while matching their audio quality.

This means SoundStorm users can control what kind of audio gets produced, for example what is spoken and whose voice it is spoken in from a brief voice prompt, without having to laboriously produce a recording the old-fashioned way. It is exciting sound, without the wait. SoundStorm is an AI model created by Google.

Features of Google SoundStorm

SoundStorm is based on a non-autoregressive approach to audio generation. That means it does not generate audio one token at a time; instead, it predicts the tokens for the entire audio signal in parallel. This makes SoundStorm much faster than autoregressive models, and it also allows SoundStorm to generate audio with high fidelity and consistency.

SoundStorm uses a bidirectional, attention-based Conformer network (an architecture that combines Transformers with convolutions). This helps it grasp both the local and global organization of a sequence of tokens. Slow autoregressive decoding is replaced by faster parallel decoding, an improved generation method that produces audio much more quickly. A minimal sketch of a Conformer-style block is shown below.
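To make the idea concrete, here is a minimal, illustrative PyTorch sketch of a Conformer-style block: self-attention for global context plus a depthwise convolution for local context. The dimensions and layer layout are assumptions made for illustration; the actual SoundStorm network is larger and more elaborate.

```python
import torch
import torch.nn as nn

class MiniConformerBlock(nn.Module):
    """Toy Conformer-style block: attention (global) + depthwise conv (local)."""

    def __init__(self, dim=512, heads=8, conv_kernel=31):
        super().__init__()
        self.attn_norm = nn.LayerNorm(dim)
        # Bidirectional (non-causal) multi-head self-attention.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(dim)
        # Depthwise 1D convolution captures local structure along time.
        self.conv = nn.Conv1d(dim, dim, conv_kernel, padding=conv_kernel // 2, groups=dim)
        self.ff_norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                                     # x: (batch, time, dim)
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]     # global mixing
        h = self.conv_norm(x).transpose(1, 2)                 # (batch, dim, time)
        x = x + self.conv(h).transpose(1, 2)                  # local mixing
        x = x + self.ff(self.ff_norm(x))
        return x

tokens = torch.randn(2, 100, 512)            # e.g. 100 token embeddings per item
print(MiniConformerBlock()(tokens).shape)    # torch.Size([2, 100, 512])
```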

This model is built for efficient, non-autoregressive audio generation and is reported to be around 100 times faster than comparable autoregressive models. With the help of a TPU-v4, it takes only half a second to generate 30 seconds of audio, and it does so without any loss of consistency.

Its quality is on par with AudioLM's, and it will ensure you can make consistent audio of the same quality with every run of the model.

Learn More

This is not the final feature set of Google SoundStorm, as Google has not officially announced a public release. For further details, you can visit Google's research blog post on SoundStorm.

How does it work?

SoundStorm works by using a non-autoregressive approach to audio generation. It operates on audio that has been converted from continuous raw waveforms into a sequence of discrete tokens representing its acoustic properties and temporal structure; a toy example of this tokenization idea is shown below.
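To see what "converting audio into tokens" means, here is a toy illustration (not SoundStream itself) of the basic vector-quantization step behind neural audio codecs: each continuous feature frame is replaced by the index of its nearest codebook entry. All shapes and sizes below are made up for the example.

```python
import torch

frames = torch.randn(75, 128)       # e.g. 75 encoded frames from a short clip
codebook = torch.randn(1024, 128)   # 1024 learned codebook vectors

# Replace each frame by the index of its closest codebook entry.
distances = torch.cdist(frames, codebook)   # (75, 1024) pairwise distances
tokens = distances.argmin(dim=1)            # (75,) integer token IDs
print(tokens[:10])
```

A residual vector quantizer, as used in SoundStream, repeats this step on the leftover quantization error with further codebooks, so every frame ends up with several parallel token streams; those are the acoustic tokens SoundStorm predicts.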

SoundStorm uses a bidirectional, attention-based Conformer, which merges Transformers with convolutions to capture both the local and global structure of a token stream.

SoundStorm model architecture

It produces audio tokens in parallel, using a decoding technique modeled on MaskGIT; because it fills in many tokens at once, generation is much faster. SoundStorm produces audio of as high a quality as AudioLM, but with better consistency in terms of speaker identity and acoustic conditions.

Put more simply, SoundStorm starts from a stream of semantic tokens, which capture the meaning and structure of the audio. SoundStorm then turns those semantic tokens into sound by predicting the corresponding codec tokens and decoding them with a neural audio codec, which has been trained on a large dataset of audio.
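Here is a simplified, hedged sketch of that MaskGIT-style parallel decoding loop. The model below is a stand-in for a trained bidirectional network (like the Conformer sketched earlier) that scores every masked position at once, and the linear unmasking schedule is a simplification of the schedule described in the paper.

```python
import torch

def parallel_decode(model, semantic_tokens, length, mask_id=1024, steps=8):
    # Start with every acoustic-token position masked out.
    tokens = torch.full((length,), mask_id, dtype=torch.long)
    for step in range(1, steps + 1):
        logits = model(semantic_tokens, tokens)      # (length, vocab) scores
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)               # best guess per position
        masked = tokens == mask_id
        conf[~masked] = float("-inf")                # never re-decide fixed tokens
        # After this step, int(length * step / steps) positions should be fixed
        # (a simple linear schedule; the paper uses a cosine schedule).
        num_new = int(length * step / steps) - int((~masked).sum())
        if num_new > 0:
            idx = conf.topk(num_new).indices         # most confident masked spots
            tokens[idx] = pred[idx]
    return tokens

# Demo with a dummy "model" that returns random logits, standing in for a
# trained network conditioned on semantic tokens.
dummy_model = lambda semantic, acoustic: torch.randn(acoustic.shape[0], 1024)
print(parallel_decode(dummy_model, semantic_tokens=None, length=50)[:10])
```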

How to Run Google SoundStorm?

There are a couple of things you have to do before you can use Google SoundStorm. First, you need some basic knowledge of Python, because the open-source SoundStorm implementation is built on the popular machine-learning framework PyTorch.

SoundStorm PyTorch

Second, you need the resources to run the model, such as a Tensor Processing Unit (TPU) or a Graphics Processing Unit (GPU). Finally, you must get hold of the SoundStorm model implementation, which is on GitHub. It is also worth noting that SoundStorm uses semantic tokens from AudioLM as its input, so you will need a source for these.

Steps for running it:

Set up everything and then you can begin to use SoundStorm to create and customize sounds. Here's a quick guide on how to do it:

  • 1. Clone the SoundStorm Repository: First, you’ll need to copy the SoundStorm source code from GitHub onto your computer. You can do this by running the command git clone https://github.com/rishikksh20/SoundStorm-pytorch in your terminal.
  • 2. Install the Required Libraries: Then go to the cloned repository and install the required libraries by using pip install -r requirements.txt.
  • 3. Prepare Your Data: SoundStorm takes semantic tokens as its input, so the output of AudioLM’s semantic stage (or a similar source) is used here. Your data must first be preprocessed into this format; the pre-processing and data format are the same as those recommended by Hugging Face’s whisper speech dataset. A hedged sketch of how such paired token data might be organized appears after this list.
  • 4. Train the Model: After your data is ready, you can begin to train the model. In your terminal, run the command python train.py to begin training, supplying the paths to your semantic-token and acoustic-token data.
  • 5. Generate Audio: After training the model, you can use it to create sound. The exact command will depend on your implementation, but it generally involves giving the model semantic tokens and letting it convert them into audio.
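As mentioned in step 3, SoundStorm implementations expect pairs of semantic and acoustic token sequences. The sketch below shows one plausible way to organize such pairs as a PyTorch Dataset; the folder layout, file format, and shapes are assumptions for illustration, so adapt them to whatever the repository you cloned actually expects.

```python
from pathlib import Path

import torch
from torch.utils.data import Dataset

class TokenPairDataset(Dataset):
    """Pairs pre-computed semantic tokens with acoustic (codec) tokens."""

    def __init__(self, root):
        root = Path(root)
        # Assumed layout: one .pt file per utterance in each folder.
        self.semantic_files = sorted((root / "semantic").glob("*.pt"))
        self.acoustic_files = sorted((root / "acoustic").glob("*.pt"))
        assert len(self.semantic_files) == len(self.acoustic_files)

    def __len__(self):
        return len(self.semantic_files)

    def __getitem__(self, i):
        semantic = torch.load(self.semantic_files[i])   # (T_sem,) token IDs
        acoustic = torch.load(self.acoustic_files[i])   # (T_ac, n_quantizers) IDs
        return semantic, acoustic
```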

But these are just the basic steps. What you need to do depends on your specific use case, so it may involve tweaking the model or changing aspects of the training process. Read the official SoundStorm paper on arXiv for the most accurate, up-to-date information.

Applications of SoundStorm

SoundStorm is a breakthrough audio technology that can be put to work in many directions. Here are some examples of how SoundStorm can be used:

  1. Language Learning Tools:
    SoundStorm’s dialogue-generation abilities are of tremendous help to language-learning apps. These apps can create realistic dialogues so learners get a more complete and effective learning experience. Users can practice and improve their listening skills, as well as their pronunciation.
  2. Voice Assistants:
    When SoundStorm is utilized, voice assistants such as Google Assistant and Alexa will be able to converse more naturally, thereby improving the user experience. This can give your voice helper a human touch, instead of feeling like merely another tool.
  3. Podcast Production:
    With SoundStorm, you can make the dialogue in your podcast flow as naturally as it does in everyday life and handle sound effects with the most delicate touch. This can give your podcast a more professional sound and a style that is certainly more interesting, without the cost of studio time or fancy equipment.
  4. Music Production:
    For creative music production, the generative sound capabilities that SoundStorm offers can be used to create new sounds, effects, and beat patterns. This will give your music a character and distinctiveness that sets it apart from anything already on the market.
  5. Audiobook Narration:
    SoundStorm can synthesize beautifully natural speech for audiobooks. This makes for a more pleasant listening experience for your audience, as well as saving you the outlay and effort of employing an accomplished narrator.
  6. Sound Effects:
    SoundStorm can be used in entertainment as a generator of music and sound effects. It can produce an incredible variety of sound effects, including explosions, gunshots, footsteps, and vehicle sounds, making it suitable for applications such as video games and movies.

SoundStorm: blessing or curse?

Google SoundStorm is a revolutionary new technology, and its arrival brings not only advantages but also disadvantages deserving due thought. On the bright side, SoundStorm represents an important step forward, making it possible for researchers and audio fans to explore new ground in sound creation. Its remarkable reduction in memory and computational requirements makes research into audio generation accessible to a wider audience.

But these advantages come attached to a set of challenges that must be approached responsibly. One particularly worrying aspect is that biases built into the training data itself, such as the range of accents and voice characteristics it covers, could carry over into the generated speech and raise ethical questions that are difficult to avoid. While Google has committed to responsible control over speaker characteristics, continuous analysis and work on the limitations of the training data are necessary if the system is to conform with the principles of responsible AI.

Another major drawback is SoundStorm’s vulnerability to exploitation for nefarious ends. Mimicking voices brings up nightmarish possibilities, from bypassing biometric authentication to impersonation. Recognizing the dangers, Google has taken steps to prevent misuse: audio produced by SoundStorm remains detectable by the same kind of dedicated classifier used in its previous research, which adds a layer of security.

Google recognizes its obligation to prevent abuse while making audio generation research more accessible. Its determination to explore further methods, including audio watermarking, represents a forward-looking approach to making sure the technology stays ethically sound. The responsible development and deployment of Google SoundStorm depends absolutely on maintaining a proper balance between innovation and protection against potential risks.

Lastly

The release of SoundStorm is a milestone in audio generation technology. Its unprecedented efficiency, combined with excellent audio quality and consistency, offers new possibilities for speech synthesis and text-to-speech systems. The further researchers delve into this field, the more excited we are to see new applications and improved user experiences. With this faster and more efficient method for generating audio at its disposal, SoundStorm is set to change the way we think about making sounds.

Let me know in the comments if this article is helpful. If there are any mistakes, I beg your pardon. See you next time, until then have a great time.
