
Blog Post


The Noise Enricher

The Noise Enricher was a spontaneous idea that emerged shortly before the project pitch session in the TechLabs winter term 2021/22. When artificial intelligence is used to create texts and images, why not use it to create sounds? Neural networks are indeed used for this, but the field seemed to lag behind the former two (after some research, this turned out not really to be the case). A team of seven was gathered to work on this idea: two people from the UX track, two from the WD track, and three from the AI track. This blog post portrays the project from these three perspectives.

Oh my! Can you hear the spectrograms? Novicing in Artificial Intelligence.

The AI team built two types of neural networks. The first one classifies the music genre of an input sound and was realized with PyTorch. However, we did not include it in our first application prototype due to deployment issues with Heroku. The second was a variational autoencoder (VAE) built with TensorFlow/Keras that we used to manipulate audio signals during the reconstruction process (encoding and decoding). This was the core of our app. A more detailed explanation follows.

We used the GTZAN dataset to train our networks. The dataset can be obtained from here or here. It consists of 1,000 sound files covering ten music genres, each 30 seconds long. For both networks, we split the files into chunks of three seconds and turned them into spectrograms using built-in torchaudio functions and the librosa library. This procedure was identical for both networks. Because of this preprocessing, our networks could only process sounds with a length of three seconds. We solved that limitation by combining all chunks of the same sound file into one input tensor or array, i.e. one batch.
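To make that preprocessing step concrete, here is a minimal sketch of the idea (the function name and parameters such as `n_mels` and `hop_length` are illustrative placeholders, not necessarily the exact settings we used):

```python
# Minimal sketch: split one GTZAN file into 3-second chunks and turn each
# chunk into a log-mel spectrogram with librosa. Parameter values here are
# illustrative placeholders.
import librosa
import numpy as np

def file_to_spectrogram_chunks(path, sr=22050, chunk_seconds=3,
                               n_fft=1024, hop_length=256, n_mels=128):
    signal, sr = librosa.load(path, sr=sr, mono=True)
    samples_per_chunk = sr * chunk_seconds
    n_chunks = len(signal) // samples_per_chunk

    spectrograms = []
    for i in range(n_chunks):
        chunk = signal[i * samples_per_chunk:(i + 1) * samples_per_chunk]
        mel = librosa.feature.melspectrogram(
            y=chunk, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
        spectrograms.append(librosa.power_to_db(mel))

    # all chunks of one file form a single batch: (n_chunks, n_mels, time_frames)
    return np.stack(spectrograms)
```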

For the classification task, we built a deep CNN (convolutional neural network) consisting of five convolutional layers with tanh activation functions. Each convolutional layer was followed by a max-pooling layer to reduce the dimensions. Two dense layers followed, with the linear output layer containing ten features representing the music genres. During training, we varied the architecture and also tried to run the procedure with complex numbers instead of real numbers, but the network's performance did not improve significantly. For the training itself we used Google Colab and ran through 300 epochs, but from epoch 250 on there was no improvement at all. In the end, we reached an accuracy of about 60 percent on the test set.
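A rough PyTorch sketch of this layout might look as follows (channel counts, kernel sizes and dense-layer widths are illustrative; only the overall structure - five conv/tanh/max-pool blocks followed by two dense layers with a ten-feature linear output - mirrors the description above):

```python
# Rough PyTorch sketch of the classifier layout (channel counts, kernel sizes
# and dense widths are illustrative, not our exact configuration).
import torch.nn as nn

class GenreCNN(nn.Module):
    def __init__(self, n_genres=10):
        super().__init__()
        channels = [1, 16, 32, 64, 128, 256]
        blocks = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            # five conv blocks: convolution -> tanh -> max pooling
            blocks += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                       nn.Tanh(),
                       nn.MaxPool2d(2)]
        self.features = nn.Sequential(*blocks)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(128),       # first dense layer (input size inferred)
            nn.Tanh(),
            nn.Linear(128, n_genres)  # linear output layer: ten genre features
        )

    def forward(self, x):             # x: (batch, 1, n_mels, time_frames)
        return self.classifier(self.features(x))
```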

[Image: CNN.png - classification results of the CNN for three example tracks]

The table above shows the categorization results of our network for three different artists and genres. Emma Ruth Rundle's “Return” is classified as a mix of classical, jazz, and blues (38, 31, and 12 percent respectively), Rahu's “Ordeal Of X” is definitely metal (80 percent), and The Midnight Ghost Train's “Spacefaze” seems to be a solid mix of rock, country, and jazz with a pinch of metal. Feel free to click on the links to verify this. We think the network did a sound job.

Spectrograms comparison

For our autoencoder network, we also used five convolutional layers, but this time with ReLU activation and batch normalisation to prevent overfitting. (Overfitting was not really an issue anyway, as you will see later.) The VAE used mirrored architectures for the encoder and decoder and stored mean values and variances in a latent space consisting of dense linear layers. We adapted the structure from Valerio Velardo, also known for his YouTube channel “The Sound of AI”; here you can find his GitHub repo that was our starting point. His videos and tutorials are amazing and were really helpful for us, so it's worth a visit.
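To give an impression of that structure, here is a condensed Keras sketch (adapted in spirit from Velardo's approach; filter counts, the latent size and the 128x128 input shape are placeholders, and the reconstruction/KL loss wrapper is omitted):

```python
# Condensed Keras sketch of the VAE encoder/decoder idea. Filter counts,
# latent size and the 128x128 spectrogram shape are placeholders.
import tensorflow as tf
from tensorflow.keras import layers, Model

LATENT_DIM = 64

def build_encoder(input_shape=(128, 128, 1)):
    inputs = layers.Input(shape=input_shape)
    x = inputs
    for filters in (16, 32, 64, 64, 128):            # five conv blocks
        x = layers.Conv2D(filters, 3, strides=2, padding="same")(x)
        x = layers.ReLU()(x)
        x = layers.BatchNormalization()(x)
    x = layers.Flatten()(x)
    z_mean = layers.Dense(LATENT_DIM, name="z_mean")(x)
    z_log_var = layers.Dense(LATENT_DIM, name="z_log_var")(x)

    def sample(args):
        mean, log_var = args
        eps = tf.random.normal(tf.shape(mean))
        return mean + tf.exp(0.5 * log_var) * eps    # reparameterization trick

    z = layers.Lambda(sample)([z_mean, z_log_var])
    return Model(inputs, [z_mean, z_log_var, z], name="encoder")

def build_decoder():
    latent = layers.Input(shape=(LATENT_DIM,))
    x = layers.Dense(4 * 4 * 128)(latent)
    x = layers.Reshape((4, 4, 128))(x)
    for filters in (128, 64, 64, 32, 16):            # mirrored conv blocks
        x = layers.Conv2DTranspose(filters, 3, strides=2, padding="same")(x)
        x = layers.ReLU()(x)
        x = layers.BatchNormalization()(x)
    outputs = layers.Conv2DTranspose(1, 3, padding="same", activation="sigmoid")(x)
    return Model(latent, outputs, name="decoder")
```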

Our idea at the very beginning of the project was to manipulate the latent space and thus obtain different sound patterns in the reconstruction. This plan did not work: when we “disturbed” the latent space, the final result was a mess. So we decided to train multiple networks on just one music genre each, hoping that each network would learn the characteristic features of its genre. If any other music is processed through such a network, the trained features should be emphasized and the music should sound more like that genre. We trained three separate networks on Metal, Blues and HipHop - the genres the user can currently select in the app. After building our network structure and training procedure locally, we wanted to outsource the training to Google Colab. Here we encountered heavy versioning issues with Python and TensorFlow: our code relied, among other things, on TensorFlow 2.1.0 and Python 3.6, a combination not supported by Google Colab short of building TensorFlow from source. So we had to train our networks locally and were not able to train them for more than 20 epochs without risking the health of our CPUs. Thus, the quality of the output was below expectations.
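A hypothetical sketch of that per-genre training loop, reusing the helpers from the sketches above (`build_vae` is assumed to wrap the encoder and decoder with a reconstruction plus KL loss; the folder layout, batch size and epoch count are illustrative):

```python
# Hypothetical per-genre training loop. Folder layout, batch size and epoch
# count are illustrative; build_vae() is assumed to wrap the encoder/decoder
# sketched above with a reconstruction + KL divergence loss.
import glob
import numpy as np

GENRES = ["metal", "blues", "hiphop"]   # the genres selectable in the app

def train_genre_networks(gtzan_root, epochs=20):
    models = {}
    for genre in GENRES:
        # collect spectrogram batches only from this genre's GTZAN folder
        files = glob.glob(f"{gtzan_root}/{genre}/*.wav")
        batches = [file_to_spectrogram_chunks(f) for f in files]
        # add a channel axis; dimensions must match the VAE's input shape
        data = np.concatenate(batches)[..., np.newaxis]

        vae = build_vae()
        vae.fit(data, data, epochs=epochs, batch_size=32)
        models[genre] = vae
    return models
```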

Even after a handful of epochs, we could indeed observe different outputs from our networks for identical inputs, as you can see in the spectrograms above. While the Blues network emphasized the lower and middle frequencies, the Metal network overemphasized the higher frequencies typical of sawing guitars, for example. The HipHop network seems to have learned to reconstruct beats and hits as well as lower frequencies.

But a running prototype is more than a fancy cluster of weights and parameters only accessible over the command line. Because the VAE reconstructs spectrograms rather than sound signals, we had to design a pipeline that gives the web-app backend clear access points. The pipeline starts with the uploaded audio file. This file is snipped into chunks of three seconds, and each chunk is turned into a spectrogram (the training was done with the same audio length and spectrogram dimensions, otherwise it would not work). As one batch of numpy arrays, these snippets are processed and reconstructed by one of the networks. The reconstructed spectrograms, now carrying the genre-specific bias, are then transformed back into three-second signals, and all signals are combined into one audio file that is provided to the user.
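The pipeline could be sketched roughly like this (function names and parameters are illustrative; inverting the mel spectrograms via librosa's Griffin-Lim based `mel_to_audio` is one possible way back to audio, not necessarily the exact routine we used):

```python
# Sketch of the inference pipeline behind the web app. Function names and
# parameters are illustrative; inverting the mel spectrograms with librosa's
# Griffin-Lim based mel_to_audio is one possible route back to audio.
import librosa
import numpy as np
import soundfile as sf

def enrich(upload_path, vae, output_path, sr=22050, n_fft=1024, hop_length=256):
    # 1) uploaded file -> batch of 3-second spectrogram chunks (see sketch above)
    batch = file_to_spectrogram_chunks(upload_path, sr=sr,
                                       n_fft=n_fft, hop_length=hop_length)

    # 2) run the whole batch through the selected genre-specific VAE
    reconstructed = vae.predict(batch[..., np.newaxis])[..., 0]

    # 3) invert each reconstructed spectrogram back into a three-second signal
    signals = []
    for spec_db in reconstructed:
        mel_power = librosa.db_to_power(spec_db)
        signals.append(librosa.feature.inverse.mel_to_audio(
            mel_power, sr=sr, n_fft=n_fft, hop_length=hop_length))

    # 4) concatenate all chunks into one audio file for the user
    sf.write(output_path, np.concatenate(signals), sr)
```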

High-five on GitHub - Web Development on filling structural holes

Web development UML components diagram


The WD team was composed of Gianna and Valerii, responsible for the front end and back end respectively. The two had great pleasure working together, thanks to Valerii's energy and Gianna's proactive nature. We always consulted each other before any big code-related decision, through endless conversations on Git and Slack. We set up the Node.js backend and frontend with an Express server and server-side embedded JavaScript templates (EJS).

Note from one AI guy: these two pals kept the process really running - with competence, commit volleys, and by keeping up the momentum. But mostly, they solved issues and had our backs.

There were indeed a lot of challenges. The biggest difficulty from the front-end point of view was dealing with an unsteady idea of what the landing page should look like; very often it was necessary to rebuild the code almost from scratch. The interaction between the JS backend and the Python scripts was an issue, too. We solved it by setting up both backends (Node.js and AI) on one server and simply calling the Python script from Node.js after receiving a request from a user. And about the third big thing - the deployment on Heroku - you can read in the README.md.

The team was dreaming of a fully implemented website with a “Pro” page where the user could personalize the track in deeper and finer ways. Nevertheless, due to difficulties in implementing the technology behind those functionalities, we jointly decided to keep the page as a dummy one. This way, the visitor can still get a glimpse of what we envision for the tool.

If we had to start the project again, we would probably get better clarity on what the machine learning team can achieve in such a short amount of time, and push the UX team to give indications on the UI and functionalities much earlier. Furthermore, from a front-end perspective, it would be interesting to rebuild the tool with React. Because we didn't have a clear idea about how to deploy the tool and connect the AI and the backend, we decided to keep it simple with HTML, CSS, and JS, and maybe switch to React later. Nevertheless, because of the fast pace of the project, the focus was on having something ready for the final delivery, and the switch didn't happen. (But Gianna is probably going to do it on her own to test her capacity ;)

Another possible next step is to set up several separate Flask servers with different AI algorithms (classification and audio generation) and connect these servers to the Node.js backend.
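Such a Flask service could look roughly like this (the route, form fields and file paths are hypothetical and not part of the current project; `enrich()` refers to the pipeline sketched in the AI section, and the model files are assumed to exist from the training step):

```python
# Hypothetical Flask sketch of one such AI service. The route, form fields and
# file paths are placeholders; enrich() is the pipeline sketched above, and the
# genre-specific VAE model files are assumed to have been saved beforehand.
from flask import Flask, request, send_file
from tensorflow import keras

app = Flask(__name__)

GENRES = ["metal", "blues", "hiphop"]
models = {g: keras.models.load_model(f"models/{g}_vae") for g in GENRES}

@app.route("/enrich", methods=["POST"])
def enrich_endpoint():
    genre = request.form.get("genre", "metal")   # metal / blues / hiphop
    upload = request.files["audio"]
    upload.save("/tmp/input.wav")

    # process the upload with the genre-specific VAE and return the result
    enrich("/tmp/input.wav", models[genre], "/tmp/output.wav")
    return send_file("/tmp/output.wav", mimetype="audio/wav")

if __name__ == "__main__":
    app.run(port=5000)
```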

You need something in the window - late ignition in User Experience Design

The UX design work started with generating ideas and creating first wireframe prototypes for the web app, leading to the initial mockup. During this period, the team was in a continuous ideation process. Next, a survey of potential users was conducted to get first impressions of interested users, their motivation, and their expectations regarding such an app. The findings also contributed to the personas.

This user persona was created by bringing together all the user survey findings and conclusions.


User survey key findings:

  • The majority of potential users would be willing to use The Noise Enricher for 10 to 30 minutes.
  • Most of them are more interested in “unknown transformations” of their instrumental music ideas.
  • Most potential users see themselves using the app on a computer.
  • They might be interested in a PRO version with more features.
  • Roughly 90% of the answers came from musicians between 25 and 40 years old.

One of our findings was that users were more interested in playing around with the app than in selecting specific sound transformations. However, this did not fit the progress of the AI team (see above), which rejected the idea of flexibly manipulating the autoencoder's latent space and switched to training networks on certain music genres. So the team was not able to meet this particular user need.

More on User Research:

  • The AI team didn't have time to implement the “unknown transformation” feature that we proposed based on the potential-user survey. This was a UX proposal but did not make it into the MVP because the AI team was already working on genre classification; that was their priority, and it was totally reasonable. It remains something that could be implemented in the future.
  • We proposed to include a playground where users could spend more time playing inside The Noise Enricher app.
  • The WD track was indeed able to prioritise the desktop design over the mobile one.

After the picture of the Noise Enricher app became clearer, the collaboration with the web development team started and the first draft was realized. However, impressed by the designs the other TechLabs teams presented at the midterm meeting, we rethought the whole design and rebuilt it almost from scratch. We designed a new logo and favicon, defined a color palette with icons and fonts, and developed fancy background animations with the WD team. Especially the animated waveform/equalizer led to a handful of iterations between UX and WD.

As our minimum viable product only provides basic functionality, our work on the sitemap - including PRO version features, for example - goes beyond the pages visible in the prototype. The PRO version is a wishlist for The Noise Enricher and consists of a series of features that are not implemented yet but would be nice to have in the future. One example is the option of using mic input as an alternative to uploading a track, so that the user could sing or hum something straight into the app instead of uploading a prerecorded file. This feature would also make the app more accessible to non-musicians!

User testing conclusion:

  • Implement the backend part of the mic input feature
  • Polish the responsiveness of the app and fix CSS bugs
  • Support more audio file formats
  • Explain the style transformation a little bit more so users understand what they are doing
  • Read the full user testing document here

Thank you, TechLabs

... for forcing us not only to learn technical skills but also process management and collaboration. Not everything went as we expected, but now we know which screws to turn in the future!