To improve performance on emotion recognition tasks, we introduce AVEmotion, a novel multimodal dataset of video and audio data collected from 3 male actors expressing 4 emotions: happy, sad, angry, and afraid. Each actor expresses each emotion in three different tempos, for a total of 12 emotion-tempo combinations. The dataset is composed of four partially different modality setups, in which 240 videos of the voice are combined with (1) 169 audio-only recordings, (2) 576 video-only recordings, (3) 172 audio and video recordings, and (4) 69 video and audio recordings. Each modality combination is recorded with the three actors performing the same emotion in no more than two different tempos. The same scaling factors are used for every modality combination; only the audio-video ratio changes. The dataset is split into three sets, with 50% and 90% splits for training and validation, respectively. The audio and video recordings were acquired using two different digital camcorders, an LG G8 and an ELG-EF12 (using LED lights), both plugged into the same pair of in-ear monitor (IEM) headphones. The recordings were patched and then resampled to 48 kHz, and the video has an average resolution of 1920$\times$1080 (excluding the single-modality audio and video recordings). All recordings were aligned to the video recordings. Of the audio and video recordings, 21 carry their own audio under the heading of the audio recording, which serves as an anchor for aligning the remaining modalities. The only recordings not aligned this way are the remaining 14, each of which has its own audio file.
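As a rough illustration of the recording design described above, the following sketch enumerates the emotion and tempo conditions. The four emotion names come from the text; the actor and tempo labels are hypothetical placeholders, since the text gives only their counts (3 actors, 3 tempos). It reads the design as 4 emotions times 3 tempos per actor, which is one plausible interpretation rather than a confirmed layout of the dataset.

```python
from itertools import product

# Emotion labels as listed in the text.
EMOTIONS = ["happy", "sad", "angry", "afraid"]
# Hypothetical labels: the text states only the counts (3 actors, 3 tempos).
ACTORS = ["actor_1", "actor_2", "actor_3"]
TEMPOS = ["tempo_1", "tempo_2", "tempo_3"]


def conditions_per_actor():
    """Return every (emotion, tempo) pair a single actor performs."""
    return list(product(EMOTIONS, TEMPOS))


def all_conditions():
    """Return every (actor, emotion, tempo) triple in the full design."""
    return list(product(ACTORS, EMOTIONS, TEMPOS))


if __name__ == "__main__":
    print(len(conditions_per_actor()))  # 12
    print(len(all_conditions()))        # 36
```

Under this reading, each actor covers 12 conditions and the three actors together cover 36; the per-setup recording counts quoted in the text are separate figures and are not derived here.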
This dataset contains videos and corresponding audio files of 40 different people speaking and singing in Malayalam. The recordings were classified into emotion categories through manual listening of the audio files. The context of the recordings was taken from the emotion recognition literature and from manual listening of the audio recordings. The audio was captured using a portable 8-channel Android phone, and all audio recordings were down-sampled to 16 kHz for the final files. The videos cover different environmental backgrounds (indoor and outdoor).
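The down-sampling step mentioned above can be sketched in a few lines. This is a minimal standard-library illustration, not the procedure actually used for the dataset: it assumes a 48 kHz source rate (the text does not state the original rate) and uses plain block averaging as a crude low-pass filter. The function name `decimate_by_averaging` is hypothetical; in practice a polyphase resampler such as `scipy.signal.resample_poly` would be the more usual choice.

```python
import math


def decimate_by_averaging(samples, factor):
    """Downsample by an integer factor, averaging each block of `factor` samples.

    The block average acts as a crude low-pass filter before the rate is
    reduced. A trailing partial block is dropped.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    n_blocks = len(samples) // factor
    return [
        sum(samples[i * factor:(i + 1) * factor]) / factor
        for i in range(n_blocks)
    ]


SOURCE_RATE = 48_000  # assumed input rate in Hz (not stated in the text)
TARGET_RATE = 16_000  # target rate in Hz, as stated in the text
FACTOR = SOURCE_RATE // TARGET_RATE  # integer decimation factor of 3

# One second of a 440 Hz test tone at the assumed source rate.
tone = [math.sin(2 * math.pi * 440 * n / SOURCE_RATE) for n in range(SOURCE_RATE)]
downsampled = decimate_by_averaging(tone, FACTOR)
print(len(downsampled))  # 16000
```

Block averaging only works for integer rate ratios and attenuates high frequencies unevenly, so it is shown here for clarity rather than audio quality.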