Casualty Transport and Evacuation
01-Combat Casualty Management During Aeromedical Evacuation from a Role 2 to a Role 3 Medical Facility
02-Decision Support System Proposal for Medical Evacuations in Military Operations
03-Collective aeromedical evacuations of SARS-CoV-2-related ARDS patients in a military tactical plane: a retrospective descriptive study
04-Characteristics of Medical Evacuation by Train in Ukraine, 2022
05-Unmanned Aircraft Systems for Casualty Evacuation: What Needs to be Done
06-Target localization using information fusion in WSNs-based marine search and rescue
07-RESCUESPEECH: A GERMAN CORPUS FOR SPEECH RECOGNITION IN SEARCH AND RESCUE DOMAIN
08-Radar human breathing dataset for applications of ambient assisted living and search and rescue operations
09-Optimal search path planning of UUV in battlefield ambush scene
10-Volunteer Medics Evacuate Wounded Soldiers from Ukraine's Front Line
11-MASCAL Medical Evacuation: Connecticut Army Guard Medics Prove Their Capability in Mass-Casualty Training
12-EU and WHO Join Forces to Further Strengthen Medical Evacuation Operations in Ukraine
13-Characteristics of Medical Evacuation by Train in Ukraine
14-A Multi-Objective Optimization Method for Maritime Search and Rescue Resource Allocation: An Application to the South China Sea
15-An Integrated YOLOv5 and Hierarchical Human Weight-First Path Planning Approach for Efficient UAV Searching Systems
16-YOLOv5s maritime distress target detection method based on Swin Transformer
17-Ukrainian Healthcare Professionals' Experiences During Operation Gunpowder: Implications for Increasing and Enhancing Training Partnerships
18-Decision Support System Proposal for Medical Evacuations in Military Operations
19-THE USE OF ARTIFICIAL INTELLIGENCE AT THE STAGES OF EVACUATION, DIAGNOSIS AND TREATMENT OF WOUNDED SOLDIERS IN THE WAR IN UKRAINE
20-Ukrainian Healthcare Professionals' Experiences During Operation Gunpowder: Implications for Increasing and Enhancing Training Partnerships
21-Artificial Intelligence for Medical Evacuation in Great-Power Conflict
22-Air Force Accelerates Change to Save Lives: How the Aeromedical Evacuation Mission Has Advanced over 20 Years
23-CASUALTY TRANSPORT AND EVACUATION
24-Simulation Analysis of a Military Casualty Evacuation System
25-Unmanned Aircraft Systems for Casualty Evacuation: What Needs to be Done
26-Aeromedical Evacuation, the Expeditionary Medicine Learning Curve, and the Peacetime Effect
27-Characteristics of Medical Evacuation by Train in Ukraine, 2022
28-Collective aeromedical evacuations of SARS-CoV-2-related ARDS patients in a military tactical plane: a retrospective descriptive study
29-Decision Support System Proposal for Medical Evacuations in Military Operations
30-Evaluation of Topical Off-the-Shelf Therapies to Reduce the Need to Evacuate Battlefield-Injured Warfighters
31-Medical Evacuation of Seriously Injured in Emergency Situations: Experience of EMERCOM of Russia and Directions of Development
32-Decision Support Applications in Search and Rescue: A Systematic Literature Review
32-The Syrian civil war: Timeline and statistics
33-eAppendix 1. Information leaflet basic medical evacuation train MSF – Version April 2022
34-Characteristics of Medical Evacuation by Train in Ukraine, 2022
35-Indonesian National Armed Forces Prepare Aircraft to Pick Up and Transport Casualties 1
36-Combat Medics on the Battlefield
37-The Road from Death to Life
40-Aeromedical Evacuation
41-Evacuation Hospital
42-Army Pairs Helicopters, Boats, and Artificial Intelligence for Casualty Evacuation
43-How Long Can the US Military's Golden Hour Last?
45-The Evacuation Chain for Wounded and Sick Soldiers
46-The History of Casualty Evacuation: From Horse-Drawn Carts to Helicopters
47-Evacuation of Wounded Soldiers
50-Medical Evacuation: Safeguarding the Lives of the Wounded
51-Medics Evacuate Wounded Ukrainian Soldiers amid Fierce Fighting
52-The Indian Army's Experience with Aeromedical Evacuation of Casualties
53-"Journey Through Hell": Evacuating Wounded Ukrainian Soldiers
54-Motivated but Under-Resourced Soldiers Left to Fend for Themselves
57-Medical Evacuation by Train in Ukraine, 2022
59-Ukraine Shows Off Its Medical Evacuation Train
60-"Mobile Intensive Care Unit": 24 Hours with Combat Medics in Ukraine's Donbas
61-Russian Soldiers Deploy Homemade UGVs for Medical Evacuation in Ukraine
Alaska Air National Guard Medevacs Injured Army Paratrooper
Aeromedical Evacuation: The Indian Experience (Abstract)
Solving the Military Medical Evacuation Problem with a Random-Forest Simulation-Planning Approach
Characteristics of Medical Evacuation by Train in Ukraine, 2022
Instructor Guide for Tactical Field Care 3E: Preparing for Casualty Evacuation and Key Points
Military Medical Evacuation
Casualty Evacuation in Arctic and Extreme Cold Environments: A Paradigm Shift in Managing Traumatic Hypothermia in Tactical Combat Casualty Care
Field Casualty Evacuation (On-Scene CASEVAC)
Casualty Evacuation Images
Combat Casualty Management During Transfer from a Role 2 to a Role 3 Medical Facility
Decision Support System Proposal for Medical Evacuations in Military Operations
Collective Aeromedical Evacuations of SARS-CoV-2-Related ARDS Patients in a Military Tactical Plane: A Retrospective Descriptive Study
Foreign Militaries' Rescue and Evacuation Echelons and Medical Support Measures Viewed Through the Evolution of Warfare
Wounded Warrior Battalion East (_Wounded_Warrior_Battalion_East)
Organization of Emergency Medical Consultation and Medical Evacuation, 2015 (Russian)
07-RESCUESPEECH- A GERMAN CORPUS FOR SPEECH RECOGNITION IN SEARCH AND RESCUE DOMAIN
RESCUESPEECH: A GERMAN CORPUS FOR SPEECH RECOGNITION IN SEARCH AND RESCUE DOMAIN

Sangeet Sagar (1,4), Mirco Ravanelli (2), Bernd Kiefer (1,4), Ivana Kruijff-Korbayová (4), Josef van Genabith (1,4)

1 Saarland University, Germany
2 Concordia University, Mila-Quebec AI Institute, Canada
4 German Research Center for Artificial Intelligence (DFKI), Germany

arXiv:2306.04054v3 [eess.AS] 25 Sep 2023

sangeetsagar2020@gmail.com, ravanellim@mila.quebec, {bernd.kiefer, josef.van_genabith}@dfki.de, ivana.kruijff@rettungsrobotik.de

ABSTRACT

Despite the recent advancements in speech recognition, there are still difficulties in accurately transcribing conversational and emotional speech in noisy and reverberant acoustic environments. This poses a particular challenge in the search and rescue (SAR) domain, where transcribing conversations among rescue team members is crucial to support real-time decision-making. The scarcity of speech data and the associated background noise in SAR scenarios make it difficult to deploy robust speech recognition systems.

To address this issue, we have created and made publicly available a German speech dataset called RescueSpeech. This dataset includes real speech recordings from simulated rescue exercises. Additionally, we have released competitive training recipes and pre-trained models. Our study highlights that the performance attained by state-of-the-art methods in this challenging scenario is still far from reaching an acceptable level.

Index Terms: speech recognition, search and rescue, noise robustness.

1. INTRODUCTION

Automatic speech recognition (ASR) can be crucial in situations like search and rescue (SAR) missions. These scenarios often involve making critical decisions in extremely hostile conditions, such as underground rescue operations, nuclear accidents, fire evacuations, or a collapsed building after an earthquake. In such cases, rescue workers must act quickly and accurately to prevent loss of life and damage. Transcribing and automatically analyzing the conversations within the rescue team can provide useful support to help the team make the right decisions in a limited amount of time. The context of search and rescue missions poses significant challenges for current speech recognition technologies. Speech recognizers must be able to handle conversational speech that is fast, emotional, and spoken under stressful conditions. Additionally, the acoustic environment in which rescuers operate is often extremely noisy, and recordings may be corrupted by various non-stationary noises, such as engine noise, vehicle sirens, radio chatter, helicopter noise, and other unpredictable disturbances. In recent years, a significant amount of research has focused on addressing these challenges [1–3]. Advanced deep learning techniques, such as self-supervised learning coupled with large datasets [4], have been instrumental in achieving impressive performance improvements. One of the most intriguing aspects of the SAR domain is that all of the aforementioned challenges occur simultaneously, creating an incredibly difficult and complex task.
This not only makes it an area of significant scientific interest but also underscores the urgent need for continued research and development in this field.

Developing a speech recognition system in this context is made even more challenging by the limited availability of data in this critical domain. Collecting speech data related specifically to the SAR domain can be difficult, and privacy restrictions often limit access to such data by the scientific community. To encourage research in this field, we have released RescueSpeech (available at https://zenodo.org/record/8077622), a German dataset for speech in the search and rescue domain. This dataset contains authentic speech recordings between members of a rescue team during several rescue exercises. To the best of our knowledge, we are the first to publicly release an audio dataset in the SAR domain. RescueSpeech contains approximately 2 hours of annotated speech material. Although this amount may seem limited, it is actually quite valuable and can be effectively used to fine-tune large pretrained models such as wav2vec 2.0 [5], WavLM [6], and Whisper [7]. In fact, we demonstrate that this material is also suitable for training models from scratch when combined with proper data augmentation techniques and multi-condition training.

This paper presents a comprehensive collection of experimental evidence for the task at hand: noise-robust German speech recognition. It employs state-of-the-art methods for both speech recognition and speech enhancement, as well as a combination of the two. Despite excelling in simpler scenarios, our results show that even modern ASR systems like Whisper [7] struggle to perform well in the demanding search and rescue domain. We have made our training recipes and pretrained models available to the community within the SpeechBrain toolkit (https://github.com/speechbrain/speechbrain/tree/develop/recipes/RescueSpeech). With the release of the RescueSpeech dataset we hope to foster research in this field and establish a common benchmark. We believe that our effort can help raise awareness about the importance of speech technology in SAR missions and the need for continued research in this domain.

2. THE RESCUESPEECH DATASET

RescueSpeech contains a blend of microphone- and radio-recorded speech, including excerpts from communication among robot-assisted emergency response team members during several simulated SAR exercises. These exercises involve real firefighters speaking in high-stress situations, such as fire rescue or explosions, that can elicit heightened emotions. The speakers involved in the exercises are native speakers of German, and the conversations were carried out between team members, radio operators, and the team leader. These dialogues loosely follow a typical radio communication style, in which the start and end of a conversation are indicated by certain words, connection quality is relayed, and acceptance or rejection of requests is conveyed. The practical use of our dataset is not limited to robot control; it also serves speech recognition, with its main application being the support of decision-makers and process monitors in disaster situations.
The ASR output is analyzed by a natural language understanding (NLU) component and fused with sensor data, including GPS coordinates from robots or drones. In this way we extract mission-related information from conversations and use it to offer assistance later in the deployment of the full system.

Initially captured at a 44.1 kHz sampling rate, the recordings are down-sampled to 16 kHz and further segmented into a set of mono-speaker, single-channel audio recordings. All utterances are also manually transcribed. The total length of the dataset is 1.6 h, comprising 2412 sentences split into 1591/245/576 sentences for the train/valid/test sets. We call this the RescueSpeech clean dataset. Figure 1 shows a histogram of the lengths of the segmented utterances; the average length is 2.39 s.

[Fig. 1: Histogram of utterance lengths in RescueSpeech, in seconds (average length: 2.39 s).]

We also created a noisy version of RescueSpeech by contaminating our dataset with noisy clips from the AudioSet dataset [8], covering five noise types: emergency vehicle siren, breathing, engine, chopper, and static radio noise. We utilized both real and synthetic room impulse responses (RIRs) (SLR26, SLR27 [9]) to add reverberation as well. We then added noise sequences to generate noisy utterances at different signal-to-noise ratios (SNRs), from -5 dB to 15 dB in steps of 1 dB. Each clean utterance is randomly corrupted with one of the noise types to generate 4500/1350/1350 train/valid/test utterances. We also ensure that a noise clip used in the train set appears only in that set. This randomness and exclusivity ensure that each split has an equal proportion of each noise type and that the noises in each split are different. The resulting dataset provides a diverse set of noise and reverberation conditions that enables fine-tuning of our speech enhancement model for improved accuracy on noisy RescueSpeech. We call this the RescueSpeech noisy dataset. Table 1 shows the distribution of utterances and durations for the clean and noisy versions of the dataset.

Table 1: Distribution of utterances and duration in the RescueSpeech clean and noisy datasets.

          Clean               Noisy
          Mins     #Utts.     Hrs     #Utts.
  Train   61.86    1591       7.20    4500
  Valid    9.61     245       2.16    1350
  Test    24.68     576       2.16    1350

2.1. Related Corpora

To improve the accuracy of speech recognition systems in noisy and reverberant environments, several corpora have been developed, such as CHiME [10–13], DIRHA [14–17], AMI [18], VOiCES [19], and COSINE [20]. Among these, CHiME-5 [12] and CHiME-6 [13] are especially challenging because they contain conversational speech recorded during a dinner party in a domestic setting, where noise and reverberation are common. RescueSpeech also contains conversational speech recorded in challenging acoustic environments, but the scenario addressed in this corpus is unique and different from a dinner party. The acoustic conditions, emotions, and lexicon used in RescueSpeech are distinct, and thus provide an additional set of challenges for speech recognition systems.

The noisy version of RescueSpeech can be utilized to train speech enhancement systems that are robust to the acoustic conditions present in the SAR domain. Numerous datasets have been released for speech enhancement purposes, including the deep noise suppression (DNS) dataset [21], the VoiceBank-DEMAND corpus [22], and the WHAM! and WHAMR! corpora [23], all of which are helpful for training speech enhancement models. However, the key difference with RescueSpeech is that it has been specifically designed for the SAR domain, where characteristic sounds such as sirens, radio signals, helicopters, trucks, and others affect the recordings. This unique characteristic of RescueSpeech makes it an especially valuable resource for training speech enhancement systems that can perform well in SAR environments.
3. EXPERIMENTAL SETUP

We explored multiple training strategies to perform noise-robust speech recognition. Speech recognizers and enhancement models are trained on large corpora and then fine-tuned and evaluated on RescueSpeech data.

3.1. ASR training

We follow two approaches for ASR training: one based on sequence-to-sequence (seq2seq) modeling and another based on connectionist temporal classification (CTC). For the seq2seq model, we employ a CRDNN (convolutional, recurrent, and dense neural network) architecture [24, 25]. The CRDNN encoder is trained on the full 1200 h of the German CommonVoice corpus [26]. Decoding uses an attentional GRU decoder and a beam search coupled with an RNN-based language model (LM). The LM is trained on Tuda-De (https://www.inf.uni-hamburg.de/en/inst/ab/lt/resources/data/acoustic-models.html) [27] (8M sentences), the Leipzig news corpus [28] (9M sentences), and the training transcripts of the CommonVoice corpus. For the CTC-based models, we use the wav2vec 2.0 and WavLM architectures as encoders for the ASR pipeline. These encoders use a self-supervised approach to learn high-level contextualized speech representations. No language model is needed, and decoding is performed using greedy search. For wav2vec 2.0 and WavLM we use the pre-trained encoders facebook/wav2vec2-large-xlsr-53-german (https://huggingface.co/facebook/wav2vec2-large-xlsr-53-german) and microsoft/wavlm-large (https://huggingface.co/microsoft/wavlm-large), respectively. Additionally, we employ the pre-trained Whisper [7] model openai/whisper-large-v2 (https://huggingface.co/openai/whisper-large-v2) to benchmark our systems against a competitive state-of-the-art model.

The CRDNN combines two CNN blocks (each with 2 CNN layers and channel sizes of 128 and 256), an RNN block (4 bidirectional LSTM layers with 1024 neurons per layer), and a dense neural network layer. The inputs are 40-dimensional mel-filterbank features, and the network is trained with an AdaDelta [29] optimizer with a learning rate (LR) of 1 (during fine-tuning we use an LR of 0.1). The model is trained for 25 epochs with a batch size of 8. During testing, beam search is used with a beam size of 80. Each epoch takes approximately 8 h on a single RTX A6000 GPU with 48 GB of memory. For the wav2vec 2.0 and WavLM CTC models, training is performed for 45 and 20 epochs, respectively, with an LR of 1e-4 and a batch size of 8 using an Adam [30] optimizer. Each epoch takes approximately 5.5 h on a single RTX A6000 GPU with 48 GB of memory. The LR is annealed, and the sampling frequency is set to 16 kHz for both approaches. More details on training and model parameters can be found in the repository.
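As a rough illustration of the greedy CTC decoding used for the wav2vec 2.0 and WavLM models, the minimal sketch below (not taken from the released SpeechBrain recipes) picks the most likely token per frame, collapses repeats, and drops the blank symbol; the toy character vocabulary is hypothetical.

```python
# Minimal sketch of greedy CTC decoding (not the SpeechBrain recipe itself).
# Assumes `logits` has shape (time, vocab_size) and index 0 is the CTC blank;
# the toy vocabulary below is hypothetical.
import torch

vocab = ["<blank>", " ", "a", "b", "r", "u", "f"]   # hypothetical label set

def ctc_greedy_decode(logits: torch.Tensor, blank_id: int = 0) -> str:
    """Pick the most likely token per frame, collapse repeats, drop blanks."""
    frame_ids = logits.argmax(dim=-1).tolist()       # best token index per frame
    decoded, prev = [], None
    for idx in frame_ids:
        if idx != prev and idx != blank_id:          # collapse repeats, skip blanks
            decoded.append(vocab[idx])
        prev = idx
    return "".join(decoded)

# Toy example: 8 frames of fake logits spelling "ab"
toy = torch.full((8, len(vocab)), -5.0)
for t, idx in enumerate([0, 2, 2, 0, 3, 3, 0, 0]):   # blank, a, a, blank, b, b, blank, blank
    toy[t, idx] = 5.0
print(ctc_greedy_decode(toy))                         # -> "ab"
```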
3.2. Speech enhancement training

In this work, we perform speech enhancement using SepFormer [31], a multi-head-attention, transformer-based source separation architecture. It uses a fully learnable masking-based architecture composed of an encoder, a masking network, and a decoder. The encoder and decoder blocks are essentially convolutional layers, and we learn a deep masking network based on self-attention that estimates element-wise masks. These masks are used by the decoder to reconstruct the enhanced signal in the time domain. We use the DNS4 dataset (https://github.com/microsoft/DNS-Challenge) to synthesize the training and evaluation sets. Using the provided clean utterances, noisy clips (150 noise types), and RIRs, we generate 1300 h of training data and 6.7 h of validation data at varying SNRs (from -5 dB to 15 dB in steps of 1 dB), and the DNS-2022 baseline dev set is used as the test set. The sampling rate is set to 16 kHz, and only 30% of the clean speech is convolved with RIRs.

SepFormer employs an encoder and a decoder with 256 convolutional filters with a kernel size of 16 and a stride of 8. The masking network has 2 layers of dual-composition blocks and a chunk length of 250. With each clean-noisy pair fixed at 4 s in length, the model is trained in a supervised fashion using the scale-invariant SNR (SI-SNR) loss and an Adam optimizer with an LR of 1.5e-4. We utilize a multi-GPU distributed data parallel (DDP) training scheme to train the network for 50 epochs with a batch size of 4. Each epoch takes approximately 9 h on 8 × RTX A6000 GPUs.

3.3. Training strategies

We use various training methods to create a robust speech recognition system that operates in the SAR (search and rescue) domain. These methods are described below:

1. Clean training: After pretraining the ASR and language model (LM), we fine-tune them on the RescueSpeech clean dataset. This process helps to adapt the models to our target domain. We keep the model and training parameters the same as described in Section 3.1.

2. Multi-condition training: Using the same pretrained model as above, we perform multi-condition training, which involves training the ASR model on an equal mix of clean and noisy audio from the RescueSpeech noisy dataset. In this way, the model learns to adapt to the different noises present in the utterances, which helps it perform speech recognition. This method forms the baseline for all our results. We set the learning rate (LR) to 0.1 and keep the other parameters the same as above.

3. Model combination I: Independent training: We pretrain a speech enhancement model and then fine-tune it on the RescueSpeech noisy dataset. This model is then integrated with the ASR model trained in the clean training stage to perform noise-robust speech recognition. In this stage, we freeze the enhancement model.

4. Model combination II: Joint training: This is a continuation of the previous stage, where we follow a joint training approach. We unfreeze the enhancement model and allow gradients from the ASR to propagate back to the speech enhancement model (a minimal sketch of this joint setup follows the list). Updating the weights of the model in this way enables it to generate output that is as clean as possible, as required by the ASR model.
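The joint training in strategy 4 can be pictured as a single computation graph in which the enhancement front-end feeds the ASR loss. The PyTorch sketch below is not the released SpeechBrain recipe; `Enhancer` and `ASRModel` are placeholder stand-ins for SepFormer and the CTC acoustic model, used only to show how unfreezing the enhancer lets the ASR gradient reach it.

```python
# Minimal sketch of joint enhancement + ASR fine-tuning (strategy 4).
# `Enhancer` and `ASRModel` are placeholders, not the paper's released models.
import torch
import torch.nn as nn

class Enhancer(nn.Module):                     # stand-in for SepFormer
    def __init__(self):
        super().__init__()
        self.net = nn.Conv1d(1, 1, kernel_size=9, padding=4)
    def forward(self, noisy):                  # (batch, 1, samples) -> same shape
        return self.net(noisy)

class ASRModel(nn.Module):                     # stand-in for the CTC acoustic model
    def __init__(self, vocab_size=32):
        super().__init__()
        self.encoder = nn.Conv1d(1, 64, kernel_size=400, stride=160)
        self.head = nn.Linear(64, vocab_size)
    def forward(self, wav):                    # (batch, 1, samples) -> (batch, T, vocab)
        feats = self.encoder(wav).transpose(1, 2)
        return self.head(feats).log_softmax(dim=-1)

enhancer, asr = Enhancer(), ASRModel()

# Model combination I would freeze the enhancer:
#   for p in enhancer.parameters(): p.requires_grad = False
# Model combination II (joint training) leaves it trainable, so the ASR loss
# backpropagates through the enhanced waveform into the enhancer.
optim = torch.optim.Adam(list(enhancer.parameters()) + list(asr.parameters()), lr=1e-4)
ctc_loss = nn.CTCLoss(blank=0)

noisy = torch.randn(2, 1, 16000)               # dummy 1 s batch at 16 kHz
targets = torch.randint(1, 32, (2, 12))        # dummy label sequences
target_lens = torch.full((2,), 12)

enhanced = enhancer(noisy)                     # front-end cleans the waveform
log_probs = asr(enhanced)                      # back-end transcribes it
input_lens = torch.full((2,), log_probs.size(1))
loss = ctc_loss(log_probs.transpose(0, 1), targets, input_lens, target_lens)
loss.backward()                                # gradients reach the enhancer too
optim.step()
```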
4. RESULTS

4.1. ASR Performance

As a first attempt, we created a simple pipeline consisting solely of an ASR model, with no speech enhancement in the front end. Table 2 compares the different ASR models on both clean and noisy audio recordings from the RescueSpeech dataset. The models included in the comparison are CRDNN, wav2vec 2.0, WavLM, and Whisper. During the pre-training stage, all models (except Whisper) used only the CommonVoice dataset; during the clean training and multi-condition fine-tuning stages, the RescueSpeech dataset was used.

Table 2: Comparison of test WERs (%) for the CRDNN, wav2vec2.0-large, WavLM-large, and whisper-large-v2 models using different training strategies on clean and noisy speech inputs from the RescueSpeech dataset.

  Training strategy      ASR model   Clean   Noisy
  Pre-training           CRDNN       52.03   81.14
                         Wav2vec2    47.92   76.98
                         WavLM       46.28   73.84
                         Whisper     27.01   50.85
  Clean training         CRDNN       31.18   60.10
                         Wav2vec2    27.69   62.60
                         WavLM       23.93   58.28
                         Whisper     23.14   46.70
  Multi-cond. training   CRDNN       33.22   58.95
                         Wav2vec2    29.89   57.98
                         WavLM       25.22   52.75
                         Whisper     24.11   45.84

Table 3: Speech enhancement performance on the RescueSpeech noisy test inputs when combining speech enhancement and speech recognition (Model Comb. I vs. Model Comb. II).

  Metric    Model Comb. I   Model Comb. II
                            CRDNN   Wav2vec2   WavLM   Whisper
  SI-SNRi   6.516           6.618   7.205      7.140   7.482
  SDRi      7.439           7.490   7.765      7.694   8.011
  PESQ      2.008           2.010   2.060      2.064   2.083
  STOI      0.842           0.844   0.854      0.854   0.859

Table 4: Word error rate (WER, %) achieved with independent training (Model Comb. I) and joint training (Model Comb. II) of the speech enhancement and ASR modules.

  ASR model   Model Comb. I   Model Comb. II
  CRDNN       54.98           54.55
  Wav2vec2    50.68           49.24
  WavLM       48.24           46.04
  Whisper     48.04           45.29

Unsurprisingly, the clean training approach is the most effective when tested on clean audio recordings. The top-performing model in this scenario is Whisper, which achieved a WER of 23.14%.
On the other hand, multi-condition training proved to be the superior strategy when dealing with noisy recordings. In this scenario, the best model is again Whisper, which achieved a WER of 45.84%. The performance gap with respect to clean signals highlights once more the significant decline in recognition performance under challenging acoustic conditions, even for models that were pretrained using state-of-the-art self-supervised techniques like wav2vec 2.0, WavLM, and Whisper (the latter of which is even semi-supervised).

[Fig. 2: Log-power spectrograms of clean, noisy, and SepFormer-enhanced utterances for the emergency vehicle siren and chopper noise types at -5 dB SNR.]
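The WERs reported in Tables 2 and 4 follow the standard definition: the word-level edit distance between hypothesis and reference divided by the reference length. The small implementation below is only illustrative; the paper's numbers come from its released recipes, not from this snippet.

```python
# Illustrative word error rate (WER) computation via edit distance;
# not the scoring script used in the paper's released recipes.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)  # sub/del/ins
    return 100.0 * dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Toy example: one word deleted out of five reference words -> 20.0
print(round(wer("bitte kommen over und aus", "bitte kommen und aus"), 2))
```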
4.2. Combining ASR and Speech Enhancement

To improve the ASR performance, we developed a speech enhancement system to clean up the recordings. To accomplish this, we utilized the SepFormer model, which has demonstrated competitive performance in speech separation and enhancement tasks [32]. Specifically, we trained the model on the DNS4 dataset, achieving SIG, BAK, and OVRL scores of 2.999, 3.076, and 2.437, respectively. Figure 2 shows the log-power spectrograms for two types of noisy audio recordings, emergency vehicle siren and chopper noise, both at an SNR of -5 dB, using the SepFormer model fine-tuned on the RescueSpeech noisy dataset. From a qualitative standpoint, SepFormer appears to perform well on the noises that affect the SAR domain. Figure 3 presents PESQ vs. SNR as well as SI-SNRi and SDRi vs. SNR for the same noise types. We observed that the improvements in SI-SNR and SDR were greater for utterances with an SNR of -5 dB, indicating a more significant enhancement in speech intelligibility and reduction of distortion than for higher-SNR utterances. This pattern is consistent across all noise types.

[Fig. 3: PESQ, SDRi, and SI-SNRi vs. SNR of SepFormer-enhanced utterances for two noise types: emergency vehicle siren and chopper noise.]

Table 3 displays the speech enhancement results obtained by incorporating a speech recognizer into the pipeline. In Section 3.3 we explored two approaches: independent training (Model Comb. I) and joint training (Model Comb. II). The joint training approach resulted in improvements across all considered speech enhancement metrics (SI-SNRi, SDRi, PESQ, STOI) and all ASR modules (CRDNN, Wav2vec2, WavLM, Whisper). Table 4 presents the final speech recognition output at the end of the pipeline.

As anticipated, the joint training approach outperformed a simple combination of independently trained speech enhancement and speech recognition modules. It is important to note that both the speech enhancement and speech recognition models undergo fine-tuning using enhanced signals from the unfrozen SepFormer. We postulate that backpropagating the ASR gradient to the speech enhancement model enables SepFormer to denoise utterances according to the specific requirements of the ASR model, facilitating better convergence. Training both models jointly allows the enhancement model to adapt its cleaning capabilities to align better with the needs of the ASR system. Overall, the best-performing model is the combination of SepFormer with the Whisper ASR, which achieved a WER of 45.29%.

5. CONCLUSIONS

Our work addresses some major challenges that arise in the SAR domain: the lack of speech data, the need for robustness to SAR noises, and conversational speech. To overcome these challenges, we have introduced RescueSpeech, a new dataset of German speech that we use to perform robust speech recognition in a hostile, noise-filled environment. To achieve this, we proposed multiple training strategies that involve fine-tuning pretrained models on our in-domain data. We tested different self-supervised models (e.g., Wav2Vec2, WavLM, and Whisper) for speech recognition. Despite leveraging these cutting-edge systems, our best model only achieves a WER of 45.29% on our test set. This result highlights the significant difficulty and the urgent need for further research in this crucial domain.

Overall, our work represents a step forward in addressing the challenges of speech recognition in the SAR domain. By introducing a new dataset, we hope to establish a useful benchmark and foster more studies in this field.

6. ACKNOWLEDGEMENTS

Our work was supported under the project "A-DRZ: Setting up the German Rescue Robotics Center" and funded by the German Ministry of Education and Research (BMBF), grant No. I3N14856.
535–557, 2017.</p><p>[12] Jon Barker, Shinji Watanabe, Emmanuel Vincent, and Jan Trmal, “The fifth ’CHiME’ Speech Separation and Recognition Challenge: Dataset, task and baselines,” in <a id="bookmark3"></a><em>Proc. of Interspeech</em>, 2018.</p><p>[13] Shinji Watanabe et al., “CHiME-6 Challenge: Tack- ling Multispeaker Speech Recognition for Unsegmented Recordings,” in <em>Proc. 6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020)</em>, 2020.</p><p>[14] Mirco Ravanelli, Luca Cristoforetti, Roberto Gretter, Marco Pellin, Alessandro Sosi, and Maurizio Omologo, “The DIRHA-ENGLISH corpus and related tasks for <a id="bookmark38"></a>distant-speech recognition in domestic environments,” in <em>Proc. of ASRU</em>, 2015.</p><p>[15] Marco Matassoni, Ram<img src="/media/202408//1724838579.7425869.png" />n Fernandez Astudillo,Athana- sios Katsamanis, and Mirco Ravanelli, “The DIRHA- <a id="bookmark18"></a>GRID corpus: baseline and tools for multi-room distant speech recognition using distributed microphones,” in <em>Proc. of Interspeech</em>, 2014.</p><p>[16] Mirco Ravanelli and Maurizio Omologo, “On the selec- tion of the impulse responses for distant-speech recog- <a id="bookmark19"></a>nition based on contaminated speech training,” in <em>Proc. </em><a id="bookmark39"></a><em>of Interspeech</em>, Haizhou Li, Helen M. Meng, Bin Ma, Engsiong Chng, and Lei Xie, Eds., 2014.</p><p>[17] Mirco Ravanelli and Maurizio Omologo, “Contami- nated speech training methods for robust DNN-HMM <a id="bookmark12"></a><a id="bookmark22"></a><a id="bookmark21"></a>distant speech recognition,” in <em>Proc. of Interspeech</em>, 2015.</p><p>[18] Steve Renals, Thomas Hain, and Herve Bourlard, “Recognition and interpretation of meetings: The AMI and AMIDA projects,” in <em>Proc. of ASRU</em>, 2007.</p><p>[19] Colleen Richey, Maria A. Barrios, Zeb Armstrong, Chris Bartels, Horacio Franco, Martin Graciarena, Aaron Lawson, Mahesh Kumar Nandwana, Allen Stauf- fer, Julien van Hout, Paul Gamble, Jeff Hetherly, Cory Stephenson, and Karl Ni, “Voices Obscured in Complex Environmental Settings (VOICES) corpus,” 2018.</p><p>[20] Alex Stupakov, Evan Hanusa, Deepak Vijaywargi, Di- eter Fox, and Jeff A. Bilmes, “The design and collec- <a id="bookmark27"></a>tion of COSINE, a multi-microphone in situ speech cor- <a id="bookmark28"></a>pus recorded in noisy environments,” <em>Comput. Speech </em><a id="bookmark14"></a><a id="bookmark34"></a><em>Lang.</em>, vol. 26, no. 1, pp. 52–66, 2012.</p><p>[21] Harishchandra Dubey, Vishak Gopal, Ross Cutler, Ashkan Aazami, Sergiy Matusevych, Sebastian Braun, Sefik Emre Eskimez, Manthan Thakker, Takuya Yosh- ioka, Hannes Gamper, and Robert Aichner, “ICASSP</p><p>2022 Deep Noise Suppression Challenge,” 2022.</p><p><em>International Conference on Acoustics, Speech and Sig- nal Processing (ICASSP)</em>, 2015, pp. 4580–4584.</p><p>[25] Yusheng Xiang, Tian Tang, Tianqing Su, Christine Brach, Libo Liu, Samuel S. Mao, and Marcus Geimer, “Fast CRDNN: Towards on Site Training of Mobile Construction Machines,” <em>IEEE Access</em>, vol. 9, pp. 124253–124267, 2021.</p><p>[26] Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor We- ber, “Common Voice: A Massively-Multilingual Speech Corpus,” 2019.</p><p>[27] Benjamin Milde and Arne Koehn, “Open Source Auto- maticSpeech Recognition for German,” in <em>Speech Com- munication; 13th ITG-Symposium</em>, 2018, pp. 
1–5.</p><p>[28] Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff, “Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Lan- <a id="bookmark26"></a>guages,” in <em>Proceedings of the Eighth International Conference on Language Resources and Evaluation</em></p><p><em>(LREC’12)</em>, Istanbul, Turkey, May 2012, pp. 759–765, European Language Resources Association (ELRA).</p><p>[29] Matthew D. Zeiler, “ADADELTA: An Adaptive Learn- ing Rate Method,” 2012.</p><p>[30] Diederik P. Kingma and Jimmy Ba, “Adam: A Method for Stochastic Optimization,” 2014.</p><p>[31] Cem Subakan, Mirco Ravanelli, Samuele Cornell, Mirko Bronzi, and Jianyuan Zhong, “Attention is All You Need in Speech Separation,” 2020.</p><p>[32] Cem Subakan, Mirco Ravanelli, Samuele Cornell, Fran- cois Grondin, and Mirko Bronzi, “On Using Transform- ers for Speech-Separation,” 2022.</p><p>[22] Christophe Veaux, Junichi Yamagishi, and Simon King, “The voice bank corpus: Design, collection <a id="bookmark15"></a>and data analysis of a large regional accent speech <a id="bookmark40"></a>database,” in <em>2013 International Conference Orien- tal COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O- COCOSDA/CASLRE)</em>, 2013, pp. 1–4.</p><p>[23] Gordon Wichern, Joe Antognini, Michael Flynn, Licheng Richard Zhu, Emmett McQuinn, Dwight Crow, Ethan Manilow, and Jonathan Le Roux, “WHAM!: Ex- tending Speech Separation to Noisy Environments,” in <a id="bookmark17"></a><em>Proc. Interspeech</em>, Sept. 2019.</p><p>[24] Tara N. Sainath, Oriol Vinyals, Andrew Senior, and Has¸im Sak, “Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks,” in <em>2015 IEEE</em></p>