Stable Audio consists of several parts that work together to quickly create custom audio. One component compresses the audio so that its important features are preserved while unnecessary noise is removed; this lets the system train faster and generate new audio more quickly. Another component uses text (metadata descriptions of the music and sounds) to steer what type of track is generated.
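To make the idea concrete, here is a minimal, purely illustrative sketch of the latent-approach described above: audio is compressed into a much smaller representation, generation happens in that compressed space guided by a text-conditioning vector, and a decoder expands the result back to a waveform. This is not Stability AI's actual code; the pooling "encoder," the 64x compression factor, and the toy denoising loop are all hypothetical stand-ins for learned neural networks.

```python
import numpy as np

COMPRESSION = 64  # hypothetical downsampling factor of the compressor

def encode(waveform: np.ndarray) -> np.ndarray:
    """Toy 'encoder': average pooling stands in for a learned compressor."""
    n = len(waveform) // COMPRESSION * COMPRESSION
    return waveform[:n].reshape(-1, COMPRESSION).mean(axis=1)

def decode(latent: np.ndarray) -> np.ndarray:
    """Toy 'decoder': nearest-neighbor upsampling stands in for a learned decoder."""
    return np.repeat(latent, COMPRESSION)

def generate(cond: np.ndarray) -> np.ndarray:
    """Toy 'generation': start from noise and nudge it toward the text
    conditioning vector. A real diffusion model would instead run many
    learned denoising steps in this compressed space."""
    rng = np.random.default_rng(0)
    latent = rng.standard_normal(cond.shape)
    for _ in range(10):  # stand-in for iterative denoising
        latent = 0.9 * latent + 0.1 * cond
    return latent

# The efficiency point: the generation loop touches COMPRESSION-times
# fewer values than working on the raw waveform would require.
audio = decode(generate(np.ones(1000)))
```

The key design choice this illustrates is that the expensive iterative step operates on the compact representation, which is why training and inference speed up so dramatically.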
To speed up operations, the Stable Audio architecture works with a heavily simplified, compressed audio representation to reduce inference time, which is the time it takes a machine learning model to generate an output after it is given an input. According to Stability AI, Stable Audio can render 95 seconds of 16-bit stereo audio at a 44.1 kHz sample rate, often called "CD quality," in less than one second on an Nvidia A100, a robust data-center GPU designed for AI use that is much more powerful than a typical gaming desktop GPU.
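For a sense of scale, a quick back-of-the-envelope calculation shows how much raw audio data those quoted figures imply:

```python
# Raw data volume of 95 seconds of 16-bit stereo audio at 44.1 kHz,
# the figures quoted above for Stable Audio's output.
sample_rate = 44_100      # samples per second per channel ("CD quality")
channels = 2              # stereo
bits_per_sample = 16
duration_s = 95

total_samples = sample_rate * channels * duration_s
total_bytes = total_samples * bits_per_sample // 8

print(total_samples)      # 8,379,000 samples
print(total_bytes)        # 16,758,000 bytes, roughly 16.8 MB of raw PCM
```

Producing on the order of eight million samples in under a second is the payoff of running the model in a compressed representation rather than on the raw waveform.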
But Stable Audio isn’t the first music generator to use latent diffusion techniques. In December, Ars Technica reported on the launch of Riffusion, a hobbyist audio spin on Stable Diffusion, although its generations were far below Stable Audio's samples in quality. In January, Google introduced MusicLM, another AI music generator that outputs 24 kHz audio, and in August Meta released a suite of open source audio tools, including a text-to-music generator called AudioCraft. Stable Audio now goes one better with 44.1 kHz stereo sound.
Stability says Stable Audio will be available in a free tier and as part of a $12 monthly Pro plan. The free option allows users to generate up to 20 tracks per month, each with a maximum length of 20 seconds. The Pro plan raises those limits to 500 generations per month and track lengths of up to 90 seconds. Future releases are expected to include open source models based on the Stable Audio architecture, as well as training code for people interested in developing audio generation models.
With Stable Audio's sound fidelity and production quality, it seems we may be approaching commercial-grade AI-generated music. Will musicians be happy if they are replaced by AI models? Probably not, if the anti-AI protests in the visual arts are any guide. For now, a human can easily outperform anything created with this technology, but that may not remain true for long. Either way, AI-generated audio could become another tool in the professional music production toolbox.