Asynchronous recordings of speech mixtures

Introduction

Asynchronous recording is a new and interesting source separation task. The multichannel observation is obtained from multiple independent recording devices, which opens a wide range of applications using portable recorders such as smartphones and IC recorders. The main differences from conventional synchronous multichannel recording are the unknown time offset and the drift. The time offset is caused by the different recording start times of the devices. The drift is the gradual divergence of the sampling instants caused by small mismatches between the sampling frequencies of the independent A/D converters. Although the typical sampling frequency mismatch is below 100 ppm (= 0.01 %), the drift is particularly serious in source separation because it makes the time differences of arrival (TDOAs) of the sources change over time. A drift-robust separation scheme is therefore indispensable for this task. Asynchronous source separation also raises other issues, such as differences in microphone characteristics and the distributed positioning of the microphones.
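To give a feeling for the size of the drift, the following minimal sketch converts a sampling frequency mismatch in ppm into an accumulated time shift. The function names and the 16 kHz sampling rate are assumptions made for illustration only; the 20-second duration is the excerpt length mentioned for the test2/dev2 datasets. At 100 ppm, a 20-second recording drifts by 2 ms, which is 32 samples at the assumed 16 kHz.

```python
def drift_seconds(mismatch_ppm: float, elapsed_s: float) -> float:
    """Time shift accumulated between two clocks after elapsed_s seconds."""
    return elapsed_s * mismatch_ppm * 1e-6


def drift_samples(mismatch_ppm: float, elapsed_s: float, fs_hz: float) -> float:
    """The same shift expressed in samples at the nominal rate fs_hz."""
    return drift_seconds(mismatch_ppm, elapsed_s) * fs_hz


if __name__ == "__main__":
    FS_HZ = 16000.0  # assumed nominal sampling rate, not stated in the text
    for ppm in (40.0, 100.0):
        # Drift at the end of a 20-second excerpt
        print(ppm, "ppm ->", drift_samples(ppm, 20.0, FS_HZ), "samples")
```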

Description of the datasets

In addition to the test/dev datasets of SiSEC 2013, we propose the new datasets test2/dev2, which contain both simulated and real asynchronous recordings.

The test/dev datasets were generated by a simple simulation using real room impulse responses and resampling. To simulate the time offset, random delays were added to the room impulse responses. To simulate the drift, the signals were resampled to introduce small artificial sampling frequency mismatches.
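The idea of simulating drift by resampling can be sketched as follows. This is not the script used to build the datasets: the function name, its parameters, and the choice of linear interpolation are assumptions made to keep the example short. A realistic implementation would use band-limited (sinc) interpolation instead.

```python
import math


def resample_with_drift(x, mismatch_ppm, offset_samples=0.0):
    """Resample x as if recorded by a device whose clock is off by mismatch_ppm.

    A device running at fs * (1 + mismatch_ppm * 1e-6) takes its n-th sample at
    position n / (1 + mismatch_ppm * 1e-6), measured in nominal samples.
    offset_samples (>= 0) models a later recording start.
    Linear interpolation is a simplification chosen for this sketch.
    """
    if offset_samples < 0:
        raise ValueError("offset_samples must be non-negative")
    step = 1.0 / (1.0 + mismatch_ppm * 1e-6)  # input samples per output sample
    out = []
    n = 0
    while True:
        t = offset_samples + n * step  # fractional read position in x
        i = math.floor(t)
        if i + 1 >= len(x):  # need x[i] and x[i + 1] to interpolate
            break
        frac = t - i
        out.append((1.0 - frac) * x[i] + frac * x[i + 1])
        n += 1
    return out
```

With a mismatch of 0 ppm and no offset the signal is returned unchanged, apart from the last sample, which is dropped because the interpolation needs a right-hand neighbour.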

The new test2/dev2 datasets contain both real asynchronous recordings made with multiple IC recorders and simulations. The signals and the impulse responses were recorded under similar conditions in the same room, so the real asynchronous recording and the simulation can be compared and the size of the difference between them can be assessed. To mimic a meeting environment, the loudspeakers that play back the speech are placed around a table on which the microphones are set.

The difficulty in evaluating source separation performance on real asynchronous recordings lies in generating the reference in the presence of drift. Although an accurate reference sound can easily be produced by loudspeaker playback, reproducing the same drift is not simple. To equalize the drift in the mixture and the reference, we used a time marker. Assuming that the drift arises from constant sampling frequency mismatches without jitter, the mixture and the reference undergo exactly the same time warping when they have the same time offsets. A chirp signal is played back from a loudspeaker as the time marker, and the time offsets of the mixture and the reference are aligned by equalizing the times at which the chirp is recorded. Since the distortion of the sums of the references from the real observations is smaller than 30 dB in our data, the constant drift model ignoring jitter is sufficient for the evaluation of source separation.
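The alignment by a chirp marker can be illustrated with a small sketch. The text does not say how the arrival time of the chirp was detected; locating it by the peak of the cross-correlation is an assumption of this example, and all function names and parameters are hypothetical.

```python
import math


def linear_chirp(n_samples, f0_hz, f1_hz, fs_hz):
    """Linear chirp sweeping from f0_hz to f1_hz over n_samples samples."""
    duration = n_samples / fs_hz
    rate = (f1_hz - f0_hz) / duration  # sweep rate in Hz per second
    return [
        math.sin(2.0 * math.pi * (f0_hz * t + 0.5 * rate * t * t))
        for t in (n / fs_hz for n in range(n_samples))
    ]


def find_marker(recording, marker):
    """Index in recording where the cross-correlation with marker peaks."""
    if len(recording) < len(marker):
        raise ValueError("recording is shorter than the marker")
    best_lag, best_score = 0, float("-inf")
    for lag in range(len(recording) - len(marker) + 1):
        score = sum(r * m for r, m in zip(recording[lag:lag + len(marker)], marker))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def align_by_marker(mixture, reference, marker):
    """Trim the later-marked signal so the marker starts at the same index in both."""
    shift = find_marker(mixture, marker) - find_marker(reference, marker)
    if shift > 0:
        mixture = mixture[shift:]
    else:
        reference = reference[-shift:]
    return mixture, reference
```

The direct correlation loop is quadratic in the signal length, so it only suits short search windows around the expected marker position. An FFT-based correlation would be the usual choice for full recordings.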

Test Data

Download test.zip (18.8 MB)

Download test2.zip (23.8 MB)

test.zip contains 18 stereo WAV audio files that can be imported in Matlab using the wavread command (audioread in recent Matlab releases).

The data has two different recording environments:

150ms: all the microphone elements are spaced in a linear arrangement. The spacing of each stereo microphone pair is about 2.15 cm. The reverberation time is about 150 ms.
300ms: all the microphone elements are spaced in a radial fashion. The spacing of each stereo microphone pair is about 7.65 cm. The reverberation time is about 300 ms.

These files are named test_<srcset>_<cond>_mix_<ch>.wav, where

<srcset>: source sets male2, male3 and male4, which correspond to the mixture of two, three and four male speakers’ utterances, respectively.
<cond>: the recording conditions 150ms and 300ms.
<ch>: the indexes of the stereo channels ch12, ch34 and ch56. The channels are synchronized within each file, but no two channels in different files are synchronized to each other.

Each combination of <srcset> and <cond> determines one source set. The time offsets, the sampling frequency mismatches and the directions of the sources differ between source sets. The sampling frequency mismatches are smaller than 100 ppm (= 0.01 %).
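As a quick check of the naming scheme, the sketch below lists the three stereo files that together form the six-channel observation of one source set. Enumerating all combinations gives the 18 files stated above. The function and constant names are hypothetical.

```python
from itertools import product

SRCSETS = ("male2", "male3", "male4")
CONDS = ("150ms", "300ms")
CHANNEL_PAIRS = ("ch12", "ch34", "ch56")


def mixture_files(srcset, cond):
    """The three stereo files that together form one six-channel observation."""
    return [f"test_{srcset}_{cond}_mix_{ch}.wav" for ch in CHANNEL_PAIRS]


def all_mixture_files():
    """Every mixture file expected in test.zip under this naming scheme."""
    return [
        name
        for srcset, cond in product(SRCSETS, CONDS)
        for name in mixture_files(srcset, cond)
    ]
```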

test2.zip contains 24 stereo WAV audio files that can be imported in Matlab using the wavread command.

The data emulates asynchronous recording in two different ways. The signals and impulse responses were recorded in the same room, whose reverberation time is 800 ms.

asynchrec: real asynchronous recording using four stereo IC recorders.
simulated: simulation of four asynchronous stereo recording devices with artificial drift generated by resampling. All the microphones are omnidirectional. The microphone spacing of each stereo set is about 5 cm.

Note that the microphone positions differ slightly between conditions and source sets.

These files are named test2_<srcset>_<cond>_<mixtype>_<ch>.wav, where

<srcset>: source sets mix3 and mix4, which correspond to the mixture of three and four speakers’ utterances, respectively.
<cond>: the recording conditions asynchrec and simulated, corresponding to the real asynchronous recording and the simulation.
<mixtype>: mixing types realmix, sumrefs, and mix, describing how the mixed observation was generated. The type realmix is the real asynchronous recording, sumrefs is the summation of asynchronous recording of source images, and mix is the mixture in the simulation.
<ch>: the indexes of the stereo channels ch12, ch34, ch56 and ch78. The channels are synchronized within each file, but no two channels in different files are synchronized to each other.
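The test2 file names can be enumerated in the same spirit. The pairing of conditions and mixing types used below (realmix and sumrefs with asynchrec, mix with simulated) is an assumption inferred from the descriptions above. It is at least consistent with the stated total of 24 files. All names in the code are hypothetical.

```python
SRCSETS = ("mix3", "mix4")
# Assumed pairing of <cond> and <mixtype>, inferred from the descriptions
COND_MIXTYPES = (
    ("asynchrec", "realmix"),
    ("asynchrec", "sumrefs"),
    ("simulated", "mix"),
)
CHANNEL_PAIRS = ("ch12", "ch34", "ch56", "ch78")


def list_test2_files():
    """Every file name expected in test2.zip under the assumed pairing."""
    return [
        f"test2_{srcset}_{cond}_{mixtype}_{ch}.wav"
        for srcset in SRCSETS
        for cond, mixtype in COND_MIXTYPES
        for ch in CHANNEL_PAIRS
    ]
```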

The IC recorders we used are the SANYO ICR-PS603RM and the OLYMPUS LS-14. The sampling frequency mismatches of these devices appear to be within 40 ppm (= 0.004 %), and we applied artificial sampling frequency mismatches of a similar scale to the simulated data.

Development data

Download dev.zip (75.5 MB)

Download dev2.zip (78.8 MB)

dev.zip consists of 72 stereo and 4 monaural WAV audio files and six Matlab MAT files, which can be imported in Matlab using the commands wavread and load, respectively. These files are named as follows:

dev_src_<src>.wav: single-channel speech signal, shared across the whole dev dataset.
dev_<srcset>_<cond>_<src>_<ch>.wav: two-channel spatial image of each source.
dev_<srcset>_<cond>_mix_<ch>.wav: two-channel observed signal of each stereo channel pair.
dev_<srcset>__filt.mat: MAT file containing room impulse responses as a multidimensional array A, whose size is [number of the channels, number of the sources, number of samples]. Note that the recording time offset is included in the impulse responses.

Here the variables are determined as follows.

<srcset>: source set male2, male3 and male4, which correspond to the mixture of two, three and four male speakers’ utterances.
<cond>: recording conditions 150ms and 300ms.
<ch>: indexes of the stereo channels ch12, ch34 and ch56. The channels are synchronized within each file, but the files are not synchronized with each other.
<src>: indexes of the sources.
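The role of the impulse response array A can be illustrated with a minimal sketch, assuming A has been loaded into nested Python lists indexed as A[channel][source][sample] to mirror the array size given above. The spatial image of a source at a channel is the convolution of the source signal with the corresponding impulse response, and a mixture channel is the sum of the images. The drift introduced by resampling is not modelled here, and the function names are hypothetical.

```python
def convolve(signal, impulse_response):
    """Full linear convolution of two sequences (direct form, for clarity)."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        if s == 0.0:
            continue  # skip silent samples, they contribute nothing
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def source_image(A, ch, src, source_signal):
    """Image of one source at one channel; A is indexed [channel][source][sample]."""
    return convolve(source_signal, A[ch][src])


def mix_channel(A, ch, source_signals):
    """Sum of all source images at channel ch (no drift applied)."""
    images = [source_image(A, ch, j, s) for j, s in enumerate(source_signals)]
    length = max(len(image) for image in images)
    return [
        sum(image[t] for image in images if t < len(image))
        for t in range(length)
    ]
```

Because the recording time offset is contained in the impulse responses, a delayed response such as [0, 0, 1] simply shifts the image of the source by two samples.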

The signals and the impulse responses were recorded in the same room as those of the test2 dataset, but the microphone positions are slightly different.

dev2.zip consists of 80 stereo and 4 monaural WAV audio files, two Matlab MAT files and two Matlab script M-files. The WAV and MAT files can be imported in Matlab using the commands wavread and load, respectively. These files are named as follows:

dev2_src_<src>.wav: single-channel speech signal, shared across the whole dev2 dataset.
dev2_<srcset>_<cond>_sim<src>_<ch>.wav: two-channel spatial image of each source.
dev2_<srcset>_<cond>_mix_<ch>.wav: two-channel observed signal of each stereo channel pair.
dev2_<srcset>_simulated_filt.mat: MAT file which has impulse responses used for the simulated data as a variable A, sized [number of the channels, number of the sources, number of samples].
dev2_<srcset>_simulated_generate.m: script M-file to generate the simulated data from the source signals and the impulse responses.

As in the test2 dataset, the sampling frequency mismatches of the real asynchronous recordings appear to be within 40 ppm (= 0.004 %), and the simulated data uses artificial sampling frequency mismatches of a similar scale.

Tasks

The test/dev and test2/dev2 datasets have slightly different tasks.

test/dev datasets

The task is to separate the individual sources in the first channel. Since the data has time offsets, only the middle 15 seconds are taken into account in the evaluation. However, each submitted file must have the same length as the observed mixture.
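For local scoring on the development data, the evaluated region could be selected as in the sketch below. It assumes that "middle" means a segment centered in the file. The function name is hypothetical, and the sampling rate has to be read from the WAV header because it is not stated here.

```python
def middle_segment(n_samples, fs_hz, seconds=15.0):
    """Start and end sample indices of a centered segment (end is exclusive)."""
    segment = int(round(seconds * fs_hz))
    if segment > n_samples:
        raise ValueError("signal is shorter than the requested segment")
    start = (n_samples - segment) // 2
    return start, start + segment
```

The cropping applies to evaluation only. The submitted file itself keeps the full length of the mixture.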

The submitted files must be monaural WAV data named as follows:

<dataset>_<srcset>_<cond>_<src>.wav

where

<dataset>: test or dev
<srcset>: male2, male3 or male4
<cond>: 150ms or 300ms
<src>: 1, 2, 3 or 4

test2/dev2 datasets

The task is to estimate the source images in each channel. Since the offsets are small in these datasets, we will evaluate the whole 20 seconds. It is preferable to submit results for all the conditions and mixing types with the same tuning parameters, so that the performances can be compared.

The submitted files must be monaural WAV data named as follows:

<dataset>_<srcset>_<cond>_<mixtype>_sim<src>_ch<ch>.wav

where

<dataset>: test2 or dev2
<srcset>: mix3 or mix4
<cond>: asynchrec or simulated
<mixtype>: realmix, sumrefs or mix
<src>: 1, 2, 3 or 4
<ch>: 1, 2, 3, 4, 5, 6, 7 or 8
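A small helper can guard against misnamed submissions. The sketch below builds a file name from the fields listed above and rejects values outside the listed ranges. The function name is hypothetical, and limiting <src> to the number of speakers in the source set is an assumption based on the meaning of mix3 and mix4.

```python
DATASETS = ("test2", "dev2")
N_SOURCES = {"mix3": 3, "mix4": 4}  # assumed: one index per speaker in the set
CONDS = ("asynchrec", "simulated")
MIXTYPES = ("realmix", "sumrefs", "mix")
N_CHANNELS = 8


def submission_name(dataset, srcset, cond, mixtype, src, ch):
    """File name for one estimated source image at one channel."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset}")
    if srcset not in N_SOURCES:
        raise ValueError(f"unknown source set: {srcset}")
    if cond not in CONDS:
        raise ValueError(f"unknown condition: {cond}")
    if mixtype not in MIXTYPES:
        raise ValueError(f"unknown mixing type: {mixtype}")
    if not 1 <= src <= N_SOURCES[srcset]:
        raise ValueError(f"source index {src} out of range for {srcset}")
    if not 1 <= ch <= N_CHANNELS:
        raise ValueError(f"channel index {ch} out of range")
    return f"{dataset}_{srcset}_{cond}_{mixtype}_sim{src}_ch{ch}.wav"
```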

Submissions

Participants may submit separation results for any of the above-mentioned tracks of the test and development mixtures.
In addition, each participant is asked to provide basic information about their algorithm (e.g. a bibliographical reference) and to declare its average running time, expressed in seconds per test excerpt and per GHz of CPU.

Evaluation criteria

We will use the criteria defined in the BSS_EVAL toolbox. Since the tasks of test/dev and test2/dev2 datasets are not exactly the same, we will use slightly different criteria.
For the test/dev datasets, the submitted results will be evaluated with SDR, SIR, and SAR using original sources at the first channel as “s” in bss_eval_sources.m.
The submitted results of the test2/dev2 datasets will be evaluated with SDR, SIR, and SAR using original sources at all the channels as in bss_eval_images.m.

Licensing issues

All files are distributed under the terms of the Creative Commons Attribution-Noncommercial-ShareAlike 3.0 license. The files to be submitted by participants will be made available on a website under the terms of the same license.

The recordings are authored by Shigeki Miyabe.

ASY 2015
