openAUDIO.eu

Editors:

Björn Schuller (Imperial College London, UK)
Florian Eyben (audEERING GmbH)



Content and Software

openBliSSART

Authors: Felix Weninger, Alexander Lehmann, Björn Schuller

openBliSSART is a C++ framework and toolbox that provides "Blind Source Separation for Audio Recognition Tasks". Its areas of application include, but are not limited to, instrument separation (e.g. extraction of drum tracks from popular music), speech enhancement, and feature extraction. It features various source separation algorithms, with a strong focus on variants of Non-Negative Matrix Factorization (NMF).
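
The underlying technique can be illustrated independently of the toolkit: NMF approximates a non-negative magnitude spectrogram V (frequency × time) by the product of a basis matrix W and an activation matrix H. The following NumPy sketch of the classic Euclidean multiplicative updates (Lee & Seung) is a hedged illustration of the algorithm only; it does not use openBliSSART's API, and all names in it are made up for the example.

   import numpy as np

   def nmf_euclidean(V, n_components, n_iter=200, eps=1e-9, seed=0):
       """Factorize a non-negative matrix V (freq x time) as V ~ W @ H
       using the Euclidean multiplicative-update rules."""
       rng = np.random.default_rng(seed)
       n_freq, n_time = V.shape
       W = rng.random((n_freq, n_components)) + eps
       H = rng.random((n_components, n_time)) + eps
       for _ in range(n_iter):
           H *= (W.T @ V) / (W.T @ W @ H + eps)  # update activations
           W *= (V @ H.T) / (W @ H @ H.T + eps)  # update bases
       return W, H

   # Toy usage with a random stand-in for a magnitude spectrogram:
   V = np.random.default_rng(1).random((257, 100))
   W, H = nmf_euclidean(V, n_components=2)
   components = [np.outer(W[:, k], H[k, :]) for k in range(2)]  # per-component spectrograms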

Besides basic blind (unsupervised) source separation, it provides support for component classification by Support Vector Machines (SVM) using common acoustic features from speech and music processing. For component playback and data set creation, a Qt-based GUI is available. Furthermore, supervised NMF can be performed for source separation as well as audio feature extraction.
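
In the supervised case, the basis matrix W is pre-trained on isolated examples of the target source (e.g. drums) and kept fixed, so that only the activations H are estimated for a new mixture. A minimal sketch of this idea, again illustrative only and not the openBliSSART API:

   import numpy as np

   def nmf_supervised(V, W_fixed, n_iter=200, eps=1e-9, seed=0):
       """Estimate activations H for a mixture V given fixed, pre-trained bases W_fixed."""
       rng = np.random.default_rng(seed)
       H = rng.random((W_fixed.shape[1], V.shape[1])) + eps
       for _ in range(n_iter):
           H *= (W_fixed.T @ V) / (W_fixed.T @ W_fixed @ H + eps)
       return H  # W_fixed @ H approximates the target source's contribution to V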

openBliSSART is fast: typical real-time factors are on the order of 0.1 (Euclidean NMF) on a state-of-the-art desktop PC. It is written in C++, enforcing strict coding standards and adhering to modular design principles for seamless integration into multimedia applications.

Interfaces are provided to Weka and HTK (Hidden Markov Model Toolkit).

openBliSSART is free software and licensed under the GNU General Public License.

We provide a demonstrator that uses various features of openBliSSART to separate drum tracks from popular music. This demonstrator, along with extensive documentation, including a tutorial, reference manual, and description of the framework API, can be found in the openBliSSART source distribution.

If you want to use openBliSSART for your research, please cite the following paper:

Björn Schuller, Alexander Lehmann, Felix Weninger, Florian Eyben, Gerhard Rigoll: "Blind Enhancement of the Rhythmic and Harmonic Sections by NMF: Does it help?", in Proc. NAG/DAGA 2009, Rotterdam, The Netherlands, pp. 361-364.

 


openSMILE.

Authors: Florian Eyben, Martin Wöllmer, Björn Schuller

The openSMILE tool enables you to extract large audio feature spaces in real time. SMILE is an acronym for Speech & Music Interpretation by Large Space Extraction. It is written in C++ and is available both as a standalone command-line executable and as a dynamic library (a GUI version is to come soon). The main features of openSMILE are its capability for on-line incremental processing and its modularity. Feature extractor components can be freely interconnected to create new and custom features, all via a simple configuration file. New components can be added to openSMILE via an easy plugin interface and a comprehensive API.
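
The incremental, component-chain architecture can be pictured with a small, purely illustrative sketch; it does not use openSMILE's actual component names, configuration syntax, or API. A source yields audio frames one at a time, and each downstream "component" consumes and produces data incrementally:

   import numpy as np

   def frame_source(signal, frame_size=400, hop=160):
       """Yield overlapping frames from a 1-D signal (25 ms frames, 10 ms hop at 16 kHz)."""
       for start in range(0, len(signal) - frame_size + 1, hop):
           yield signal[start:start + frame_size]

   def windower(frames, frame_size=400):
       win = np.hamming(frame_size)
       for f in frames:
           yield f * win

   def log_energy(frames):
       for f in frames:
           yield np.log(np.sum(f ** 2) + 1e-10)

   signal = np.random.default_rng(0).standard_normal(16000)  # 1 s of noise as a stand-in
   features = list(log_energy(windower(frame_source(signal))))  # one value per 10 ms hop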

openSMILE is free software licensed under the GPL and is currently available in a pre-release state via Subversion (http://subversion.tigris.org/) from the SourceForge repository given below. Commercial licensing options are available upon request.

To check out the Subversion repository directly, type the following command at a command-line prompt on a system where SVN is installed:
   svn co https://opensmile.svn.sourceforge.net/svnroot/opensmile opensmile

If you use openSMILE for your research, please cite the following paper:
Florian Eyben, Martin Wöllmer, Björn Schuller: "openSMILE - The Munich Versatile and Fast Open-Source Audio Feature Extractor", Proc. ACM Multimedia (MM), ACM, Firenze, Italy, 25.-29.10.2010.

A brief summary of openSMILE's features is given here:

  • Cross-platform (Windows, Linux, Mac)
  • Fast and efficient incremental processing in real-time
  • High modularity and reusability of components
  • Plugin support
  • Multi-threading support for parallel feature extraction
  • Audio I/O:
    • WAVE file reader/writer
    • Sound recording and playback via the PortAudio library
    • Acoustic echo cancellation for full duplex recording/playback in an open-microphone setting
  • General audio signal processing:
    • Windowing Functions (Hamming, Hann, Gauss, Sine, ...)
    • Fast Fourier Transform (FFT)
    • Pre-emphasis filter
    • Comb filter (available soon)
    • FIR/IIR filter (available soon)
    • Autocorrelation
    • Cepstrum
  • Extraction of speech-related features:
    • Signal energy
    • Loudness (pseudo)
    • Mel-spectra
    • MFCC
    • Pitch
    • Voice quality
    • Formants (available soon)
    • LPC (available soon)
  • Music-related features:
    • Pitch classes (semitone spectrum)
    • Chroma features
    • Chroma-based CENS features
    • Tatum and Meter vector
  • Moving average smoothing of feature contours
  • Moving average mean subtraction (e.g. for on-line cepstral mean subtraction)
  • Delta Regression coefficients of arbitrary order (see the sketch after this list)
  • Functionals:
    • Means, Extremes
    • Moments
    • Segments
    • Peaks
    • Linear and quadratic regression
    • Percentiles
    • Durations
    • Onsets
    • DCT coefficients
    • ...
  • Popular feature file formats supported:
    • Hidden Markov Toolkit (HTK) parameter files (write)
    • WEKA Arff files (currently only non-sparse) (read/write)
    • Comma separated value (CSV) text
    • LibSVM feature file format
  • Fully HTK compatible MFCC, energy, and delta regression coefficient computation
  • Fast: 6k features extracted at a real-time factor (RTF) of 0.02
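
As an illustration of the delta regression coefficients listed above, the standard HTK-style regression formula d_t = Σ_θ θ (c_{t+θ} - c_{t-θ}) / (2 Σ_θ θ²) can be written down in a few lines. This is a sketch of the textbook formula, not openSMILE's implementation:

   import numpy as np

   def delta(features, window=2):
       """HTK-style delta regression coefficients.
       features: (n_frames, n_dims) array; window: regression window (HTK's DELTAWINDOW)."""
       n = len(features)
       denom = 2.0 * sum(th * th for th in range(1, window + 1))
       padded = np.pad(features, ((window, window), (0, 0)), mode='edge')  # replicate edge frames like HTK
       deltas = np.zeros((n, features.shape[1]))
       for t in range(n):
           for th in range(1, window + 1):
               deltas[t] += th * (padded[t + window + th] - padded[t + window - th])
       return deltas / denom

   # Higher-order coefficients (acceleration etc.) are obtained by applying delta() repeatedly.
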
Acknowledgment: openSMILE's development has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement No. 211486 (SEMAINE).

 


openEAR.

Authors: Florian Eyben, Martin Wöllmer, Björn Schuller

The Munich openEAR toolkit is a complete package for automatic speech emotion recognition. Its acronym stands for open Emotion and Affect Recognition Toolkit. It is based on the openSMILE feature extractor and is thus capable of real-time on-line emotion recognition. Pre-trained models on various standard corpora are included, as well as scripts and tools to quickly build and evaluate custom model sets. The classifiers currently included are Support Vector Machines using the LibSVM library. Soon to come are Bidirectional Long Short-Term Memory (BLSTM) Recurrent Neural Networks, Discriminative Multinomial Bayesian Networks, and Lazy Learners.
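
The classification step itself is standard: one acoustic feature vector per utterance (e.g. openSMILE functionals) is fed to an SVM. The hedged sketch below uses scikit-learn's SVC, which wraps LIBSVM, as a stand-in for the toolkit's own training scripts; the file name and column layout are purely hypothetical.

   import numpy as np
   from sklearn.svm import SVC
   from sklearn.preprocessing import StandardScaler
   from sklearn.pipeline import make_pipeline
   from sklearn.model_selection import cross_val_score

   # Hypothetical CSV export: one row per utterance, last column = emotion label.
   data = np.loadtxt("features.csv", delimiter=",", dtype=str, skiprows=1)
   X = data[:, :-1].astype(float)
   y = data[:, -1]

   clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))  # SVC is built on LIBSVM
   print(cross_val_score(clf, X, y, cv=5).mean())  # speaker-independent folds are preferable in practice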

openEAR is free software licensed under the GPL. The first release (including model sets and pre-compiled openSMILE) will be available soon on SourceForge: openEAR. Meanwhile, please refer to the openSMILE project, where we provide the feature extraction engine.

If you use openEAR for your research, please cite the following paper:

Florian Eyben, Martin Wöllmer, Björn Schuller: "openEAR - Introducing the Munich Open-Source Emotion and Affect Recognition Toolkit", in Proc. 4th International HUMAINE Association Conference on Affective Computing and Intelligent Interaction 2009 (ACII 2009), IEEE, Amsterdam, The Netherlands, 10.-12.09.2009.

Acknowledgment: openEAR's development has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement No. 211486 (SEMAINE).

DOWNLOAD: The first release of openEAR can be downloaded at: http://www.mmk.ei.tum.de/~eyb/openEAR-0.1.0.tar.gz . A short tutorial is included with the release. Further, the release contains pre-compiled binaries of the openSMILE engine for Windows and Linux, including PortAudio support. The live emotion recognition GUI is not yet included in the release; it will be made available within the next few weeks.

 

 


iHEARu-EAT Database.

Authors: Simone Hantke, Björn Schuller, and others (cf. below)

The iHEARu-EAT database contains audio and video of subjects speaking under eating condition, i.e. while eating different types of food. The audio track was featured as a Sub-Challenge of the Interspeech 2015 Computational Paralinguistics Challenge (Interspeech ComParE 2015). Here, we provide a richer version including additional annotations and mappings as well as a video track of the subjects.

30 subjects (15 female, 15 male; 26.1 ± 2.7 years) were recorded in a quiet, low-reverberant office room (27 of German and one each of Chinese, Indian, and Tunisian origin, all with close-to-native competence in German; no speaker displayed significant speech impediments). Food classes were chosen with partly similar consistency (for instance, crisps and biscuits) and partly dissimilar consistency (for instance, nectarine vs. crisps). These food classes represent snacks which are likely to be encountered in practical scenarios and enable the subjects to speak while eating. For read speech, the German version of the phonetically balanced standard story “The North Wind and the Sun” (“Der Nordwind und die Sonne”) was chosen (71 word types with 108 tokens, 172 syllables). The subjects had to read the whole text with each sort of food. Spontaneous narrative speech was elicited by prompting subjects to briefly comment on, e.g., their favourite travel destination, genre of music, or sports activity. The narratives were segmented into units whose length roughly equals the length of the six pre-defined units in the read story. All in all, 1,414 turns and 2.9 hours of speech (sampled at 16 kHz) were recorded. Note that the number of utterances per class differs slightly, because some subjects chose not to eat all types of food.

If you use iHEARu-EAT for your research, please cite the following paper, where you will find an extensive description and baseline results:

Simone Hantke, Felix Weninger, Richard Kurle, Fabien Ringeval, Anton Batliner, Amr El-Desoky Mousa, and Björn Schuller: "I Hear You Eat and Speak: Automatic Recognition of Eating Condition and Food Type, Use-Cases, and Impact on ASR Performance", PLOS ONE, 2016.

Acknowledgment: The research leading to this database and these results has received funding from the European Community's Seventh Framework Programme under grant agreement No. 338164 (ERC Starting Grant iHEARu).

DOWNLOAD: Please obtain the License Agreement to get a password and further instructions for downloading the dataset: fill it out, print, sign, scan, and email it accordingly (simone.hantke@uni-passau.de). The agreement has to be signed by a permanent staff member. After downloading the data, you can directly start your experiments with the dataset.

 


Annotations.

Authors: Björn Schuller

The annotation of the MTV music data set for Automatic Mood Classification is accessible as a PDF or as Comma-Separated Values (CSV) text files. For details, please refer to (and cite in case of usage) the following paper:

Björn Schuller, Clemens Hage, Dagmar Schuller, Gerhard Rigoll: "'Mister D.J., Cheer Me Up!': Musical and Textual Features for Automatic Mood Classification", Journal of New Music Research, Routledge Taylor & Francis, Vol. 39, Issue 1, pp. 13-34, 2010.

The annotation of the NTWICM music data set for Automatic Mood Classification is accessible as an ARFF file. This file is readable as plain text and resembles Comma-Separated Values (CSV) with an explanatory header. The corresponding labelling tool can be downloaded as a Foobar2000 plugin; it allows for annotation of audio in the valence-arousal plane. For details, please refer to (and cite in case of usage of the annotation or the tool) the following paper:

Björn Schuller, Johannes Dorfner, Gerhard Rigoll: "Determination of Non-Prototypical Valence and Arousal in Popular Music: Features and Performances", EURASIP Journal on Audio, Speech, and Music Processing (JASMP), Special Issue on "Scalable Audio-Content Analysis", vol. 2010, Article ID 735854, 19 pages, 2010.
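
Since an ARFF file is plain text with an @data section of comma-separated rows, the annotation can be read without special tooling. The sketch below is a minimal, generic ARFF reader; the actual attribute names and columns in the NTWICM annotation are an assumption here.

   def read_arff(path):
       """Return one dict per data row, keyed by the @attribute names."""
       attributes, rows, in_data = [], [], False
       with open(path, encoding="utf-8", errors="replace") as f:
           for line in f:
               line = line.strip()
               if not line or line.startswith('%'):
                   continue  # skip blanks and comments
               if line.lower().startswith('@attribute'):
                   attributes.append(line.split()[1])
               elif line.lower().startswith('@data'):
                   in_data = True
               elif in_data:
                   rows.append(dict(zip(attributes, line.split(','))))
       return rows

   # e.g. [{'song': '...', 'valence': '-1', 'arousal': '2'}, ...] (field names hypothetical)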

The annotation of the UltraStar Singer Traits Database is subdivided into an ARFF file containing the singer meta-data and a ZIP file containing the beat-level alignments of the singers in the songs, in the UltraStar format. The subdivision of the songs into training, development, and test set is defined by the folder structure in the ZIP file. For copyright reasons, lyrics have been blinded in the alignments. Each singer change is annotated by the "word" _SINGERid=nnnn.

In case you use this data set for your own research, please cite:

Felix Weninger, Martin Wöllmer, Björn Schuller: "Automatic Assessment of Singer Traits in Popular Music: Gender, Age, Height and Race", Proc. 12th International Society for Music Information Retrieval Conference (ISMIR) 2011, ISMIR, Miami, FL, USA, pp. 37-42, 24.-28.10.2011.
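
Because each singer change appears as the pseudo-word _SINGERid=nnnn inside the otherwise blinded note lines, per-beat singer labels can be recovered with a simple scan. The sketch below assumes the usual UltraStar note-line layout (type, start beat, length, pitch, text) and is only an illustration, not an official parser for the distributed files.

   import re

   def singer_per_note(path):
       """Return (start_beat, singer_id) pairs; a singer id applies until the next change."""
       current, out = None, []
       with open(path, encoding="utf-8", errors="replace") as f:
           for line in f:
               if line[:1] not in (':', '*'):   # only note lines carry text
                   continue
               parts = line.split(None, 4)      # type, beat, length, pitch, text
               if len(parts) < 5:
                   continue
               beat, text = int(parts[1]), parts[4].strip()
               m = re.match(r'_SINGERid=(\d+)', text)
               if m:
                   current = int(m.group(1))
               out.append((beat, current))
       return out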

The annotation of the Emotional Sound Database is available as a plain-text, readable CSV file containing the sound category, the sound file names, and the four individual labeler ratings each for arousal and valence. For copyright reasons, the sound files need to be retrieved via the FINDSOUNDS page.

In case you use this data set for your own research, please cite:

Björn Schuller, Simone Hantke, Felix Weninger, Wenjing Han, Zixing Zhang, Shrikanth Narayanan: "Automatic Recognition of Emotion Evoked by General Sound Events", to appear in Proc. 37th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Kyoto, Japan, 25.-30.03.2012.

 


Coding Schemes.

Authors: Björn Schuller and others (cf. below)

The coding scheme of acoustic features for inter-site comparison as used in CEICES is accessible as a PDF file. For details, please refer to (and cite in case of usage) the following paper:

Anton Batliner, Stefan Steidl, Björn Schuller, Dino Seppi, Thurid Vogt, Johannes Wagner, Laurence Devillers, Laurence Vidrascu, Vered Aharonson, Loic Kessous, Noam Amir: "Whodunnit - Searching for the Most Important Feature Types Signalling Emotion-Related User States in Speech", Computer Speech and Language (CSL), Special Issue on "Affective Speech in real-life interactions", ELSEVIER, vol. 25, issue 1, pp. 4-28, 2011.

 


Demo Sounds.

Authors: Björn Schuller and others (cf. below)

Examples of music written by our Deep Neural Network: Clip 1, Clip 2, Clip 3, Clip 4, Clip 5.
For details, please refer to (and cite in case of usage) the following paper:

Romain Sabathe, Eduardo Coutinho, Björn Schuller: "Deep Recurrent Music Writer: Memory-enhanced Variational Autoencoder-based Musical Score Composition and an Objective Measure", under review, 2017.




In any case, do not hesitate to contact us.
Looking forward to hearing from you,


Björn Schuller
Florian Eyben

More information will follow shortly.




Last updated: August 19, 2011