Casualty Transport and Evacuation
01-Combat Casualty Management During Aeromedical Evacuation from a Role 2 to a Role 3 Medical Facility
02-Decision Support System Proposal for Medical Evacuations in Military Operations
03-Collective aeromedical evacuations of SARS-CoV-2-related ARDS patients in a military tactical plane: a retrospective descriptive study
04-Characteristics of Medical Evacuation by Train in Ukraine, 2022
05-Unmanned Aircraft Systems for Casualty Evacuation: What Needs to be Done
06-Target localization using information fusion in WSNs-based marine search and rescue
07-RESCUESPEECH: A GERMAN CORPUS FOR SPEECH RECOGNITION IN SEARCH AND RESCUE DOMAIN
08-Radar human breathing dataset for applications of ambient assisted living and search and rescue operations
09-Optimal search path planning of UUV in battlefield ambush scene
10-Volunteer Medics Evacuate Wounded Soldiers from Ukraine's Front Line
11-MASCAL Medical Evacuation: Connecticut Army Guard Medics Prove Their Capability in Mass-Casualty Training
12-EU and WHO Join Forces to Further Strengthen Medical Evacuation Operations in Ukraine
13-Characteristics of Medical Evacuation by Train in Ukraine
14-A Multi-Objective Optimization Method for Maritime Search and Rescue Resource Allocation: An Application to the South China Sea
15-An Integrated YOLOv5 and Hierarchical Human Weight-First Path Planning Approach for Efficient UAV Searching Systems
16-YOLOv5s maritime distress target detection method based on Swin Transformer
17-Ukrainian Healthcare Professionals' Experiences During Operation Gunpowder: Implications for Increasing and Enhancing Training Partnerships
18-Decision Support System Proposal for Medical Evacuations in Military Operations
19-THE USE OF ARTIFICIAL INTELLIGENCE AT THE STAGES OF EVACUATION, DIAGNOSIS AND TREATMENT OF WOUNDED SOLDIERS IN THE WAR IN UKRAINE
20-Ukrainian Healthcare Professionals' Experiences During Operation Gunpowder: Implications for Increasing and Enhancing Training Partnerships
21-Artificial Intelligence for Medical Evacuation in Great-Power Conflict
22-Air Force Accelerates Change to Save Lives: How the Aeromedical Evacuation Mission Has Advanced over 20 Years
23-CASUALTY TRANSPORT AND EVACUATION
24-Simulation Analysis of a Military Casualty Evacuation System
25-Unmanned Aircraft Systems for Casualty Evacuation: What Needs to be Done
26-Aeromedical Evacuation, the Expeditionary Medicine Learning Curve, and the Peacetime Effect
27-Characteristics of Medical Evacuation by Train in Ukraine, 2022
28-Collective aeromedical evacuations of SARS-CoV-2-related ARDS patients in a military tactical plane: a retrospective descriptive study
29-Decision Support System Proposal for Medical Evacuations in Military Operations
30-Evaluation of Topical Off-the-Shelf Therapies to Reduce the Need to Evacuate Battlefield-Injured Warfighters
31-Medical Evacuation of Seriously Injured in Emergency Situations: Experience of EMERCOM of Russia and Directions of Development
32-Decision Support Applications in Search and Rescue: A Systematic Literature Review
32-The Syrian civil war: Timeline and statistics
33-eAppendix 1. Information leaflet basic medical evacuation train MSF – Version April 2022
34-Characteristics of Medical Evacuation by Train in Ukraine, 2022
35-Indonesian National Armed Forces Prepare Aircraft to Pick Up and Transport Casualties 1
36-Combat Medics on the Battlefield
37-The Road from Death to Life
40-Aeromedical Evacuation
41-Evacuation Hospital
42-Army Pairs Helicopters, Boats, and Artificial Intelligence for Casualty Evacuation
43-How Long Can the US Military's Golden Hour Last?
45-The Evacuation Chain for Wounded and Sick Soldiers
46-The History of Casualty Evacuation: From Horse-Drawn Carts to Helicopters
47-Evacuation of Wounded Soldiers
50-Medical Evacuation: Safeguarding the Lives of the Wounded
51-Medics Evacuate Wounded Ukrainian Soldiers amid Fierce Fighting
52-The Indian Army's Experience with Aeromedical Evacuation of Casualties
53-"Journey Through Hell": Evacuating Wounded Ukrainian Soldiers
54-Motivated but Under-Resourced Soldiers Left to Fend for Themselves
57-Medical Evacuation by Train in Ukraine, 2022
59-Ukraine Shows Off Its Medical Evacuation Train
60-"Mobile Intensive Care Unit": 24 Hours with Combat Medics in Ukraine's Donbas
61-Russian Soldiers Deploy Homemade UGVs for Medical Evacuation in Ukraine
Alaska Air National Guard Medevacs Injured Army Paratrooper
Aeromedical Evacuation: The Indian Experience (Abstract)
Solving the Military Medical Evacuation Problem with a Random-Forest Simulation-Planning Approach
Characteristics of Medical Evacuation by Train in Ukraine, 2022
Instructor Guide for Tactical Field Care 3E: Preparing for Casualty Evacuation and Key Points
Military Medical Evacuation
Casualty Evacuation in Arctic and Extreme Cold Environments: A Paradigm Shift in Managing Traumatic Hypothermia in Tactical Combat Casualty Care
Field Casualty Evacuation (On-Scene CASEVAC)
Casualty Evacuation Images
Combat Casualty Management During Transfer from a Role 2 to a Role 3 Medical Facility
Decision Support System Proposal for Medical Evacuations in Military Operations
Collective Aeromedical Evacuations of SARS-CoV-2-Related ARDS Patients in a Military Tactical Plane: A Retrospective Descriptive Study
Foreign Militaries' Rescue and Evacuation Echelons and Medical Support Measures Viewed Through the Evolution of Warfare
Wounded Warrior Battalion East (_Wounded_Warrior_Battalion_East)
Organization of Emergency Medical Consultation and Medical Evacuation, 2015 (Russian)
07-RESCUESPEECH- A GERMAN CORPUS FOR SPEECH RECOGNITION IN SEARCH AND RESCUE DOMAIN
RESCUESPEECH: A GERMAN CORPUS FOR SPEECH RECOGNITION IN SEARCH AND RESCUE DOMAIN

Sangeet Sagar (1,4), Mirco Ravanelli (2), Bernd Kiefer (1,4), Ivana Kruijff-Korbayová (4), Josef van Genabith (1,4)

1 Saarland University, Germany
2 Concordia University, Mila-Quebec AI Institute, Canada
4 German Research Center for Artificial Intelligence (DFKI), Germany

arXiv:2306.04054v3 [eess.AS] 25 Sep 2023

sangeetsagar2020@gmail.com, ravanellim@mila.quebec, {bernd.kiefer, josef.van_genabith}@dfki.de, ivana.kruijff@rettungsrobotik.de

ABSTRACT

Despite the recent advancements in speech recognition, there are still difficulties in accurately transcribing conversational and emotional speech in noisy and reverberant acoustic environments. This poses a particular challenge in the search and rescue (SAR) domain, where transcribing conversations among rescue team members is crucial to support real-time decision-making. The scarcity of speech data and the associated background noise in SAR scenarios make it difficult to deploy robust speech recognition systems.

To address this issue, we have created and made publicly available a German speech dataset called RescueSpeech. This dataset includes real speech recordings from simulated rescue exercises. Additionally, we have released competitive training recipes and pre-trained models. Our study highlights that the performance attained by state-of-the-art methods in this challenging scenario is still far from reaching an acceptable level.

Index Terms: speech recognition, search and rescue, noise robustness.

1. INTRODUCTION

Automatic speech recognition (ASR) can be crucial in situations like search and rescue (SAR) missions. These scenarios often involve making critical decisions in extremely hostile conditions, such as underground rescue operations, nuclear accidents, fire evacuations, or a collapsed building after an earthquake. In such cases, rescue workers must act quickly and accurately to prevent loss of life and damage. Transcribing and automatically analyzing the conversations within the rescue team can provide useful support to help the team make the right decisions in a limited amount of time. The context of search and rescue missions poses significant challenges for current speech recognition technologies. Speech recognizers must be able to handle conversational speech that is fast, emotional, and spoken under stressful conditions. Additionally, the acoustic environment in which rescuers operate is often extremely noisy, and recordings may be corrupted by various non-stationary noises, such as engine noise, vehicle sirens, radio chatter, helicopter noise, and other unpredictable disturbances. In recent years, a significant amount of research has focused on addressing these challenges [1–3]. Advanced deep learning techniques, such as self-supervised learning coupled with large datasets [4], have been instrumental in achieving impressive performance improvements. One of the most intriguing aspects of the SAR domain is that all of the aforementioned challenges occur simultaneously, creating an incredibly difficult and complex task.
This not only makes it an area of significant scientific interest but also underscores the urgent need for continued research and development in this field.

Developing a speech recognition system in this context is made even more challenging by the limited availability of data in this critical domain. Collecting speech data related specifically to the SAR domain can be difficult, and privacy restrictions often limit access to such data by the scientific community. To encourage research in this field, we have released RescueSpeech (available at https://zenodo.org/record/8077622), a German dataset for speech in the search and rescue domain. This dataset contains authentic speech recordings between members of a rescue team during several rescue exercises. To the best of our knowledge, we are the first to publicly release an audio dataset in the SAR domain. RescueSpeech contains approximately 2 hours of annotated speech material. Although this amount may seem limited, it is actually quite valuable and can be effectively used to fine-tune large pretrained models such as wav2vec 2.0 [5], WavLM [6], and Whisper [7]. In fact, we demonstrate that this material is also suitable for training models from scratch when combined with proper data augmentation techniques and multi-condition training.

This paper presents a comprehensive collection of experimental evidence for the task at hand: noise-robust German speech recognition. It employs state-of-the-art methods for both speech recognition and speech enhancement, as well as a combination of the two. Despite excelling in simpler scenarios, our results show that even modern ASR systems like Whisper [7] struggle to perform well in the demanding search and rescue domain. We have made our training recipes and pretrained models available to the community within the SpeechBrain toolkit (https://github.com/speechbrain/speechbrain/tree/develop/recipes/RescueSpeech). With the release of the RescueSpeech dataset we hope to foster research in this field and establish a common benchmark. We believe that our effort can help raise awareness about the importance of speech technology in SAR missions and the need for continued research in this domain.

2. THE RESCUESPEECH DATASET

RescueSpeech contains a blend of microphone- and radio-recorded speech, including excerpts from communication among robot-assisted emergency response team members during several simulated SAR exercises. These exercises involve real firefighters speaking in high-stress situations, such as fire rescue or explosions, that can elicit heightened emotions. The speakers involved in the exercises are native speakers of German, and the conversations were carried out between team members, radio operators, and the team leader. These dialogues loosely follow a typical radio communication style, in which the start and end of a conversation are indicated by certain words, connection quality is relayed, and acceptance or rejection of requests is conveyed. The practical use of our dataset is not limited to robot control; it also serves speech recognition, with its main application being the support of decision-makers and process monitors in disaster situations.
The ASR output is analyzed by a natural language understanding (NLU) component and fused with sensor data, including GPS coordinates from robots or drones. In this way we extract mission-related information from conversations and use it to offer assistance later in the deployment of the full system.

Initially captured at a 44.1 kHz sampling rate, the recordings are down-sampled to 16 kHz and further segmented into a set of mono-speaker, single-channel audio recordings. All utterances are also manually transcribed. The total length of the dataset is 1.6 h, comprising 2412 sentences split into 1591/245/576 sentences for the train/valid/test sets. We call this the RescueSpeech clean dataset. Figure 1 shows a histogram of the lengths of the segmented utterances; the average length is 2.39 s.

[Fig. 1: Histogram of utterance lengths in RescueSpeech, in seconds (average length: 2.39 s).]

We also created a noisy version of RescueSpeech by contaminating our dataset with noisy clips from the AudioSet dataset [8], covering five noise types: emergency vehicle siren, breathing, engine, chopper, and static radio noise. We utilized both real and synthetic room impulse responses (RIRs) (SLR26, SLR27 [9]) to add reverberation as well. We then added noise sequences to generate noisy utterances at different signal-to-noise ratios (SNRs), from -5 dB to 15 dB in steps of 1 dB. Each clean utterance is randomly corrupted with one of the noise types to generate 4500/1350/1350 train/valid/test utterances. We also ensure that a noise clip used in the train set appears only in that set. This randomness and exclusivity ensure that each split has an equal proportion of each noise type and that the noises in each split are different. The resulting dataset provides a diverse set of noise and reverberation conditions that enables fine-tuning of our speech enhancement model for improved accuracy on noisy RescueSpeech. We call this the RescueSpeech noisy dataset. Table 1 shows the distribution of utterances and durations for the clean and noisy versions of the dataset.

Table 1: Distribution of utterances and duration in the RescueSpeech clean and noisy datasets.

          Clean               Noisy
          Mins     #Utts.     Hrs     #Utts.
  Train   61.86    1591       7.20    4500
  Valid    9.61     245       2.16    1350
  Test    24.68     576       2.16    1350

2.1. Related Corpora

To improve the accuracy of speech recognition systems in noisy and reverberant environments, several corpora have been developed, such as CHiME [10–13], DIRHA [14–17], AMI [18], VOiCES [19], and COSINE [20]. Among these, CHiME-5 [12] and CHiME-6 [13] are especially challenging because they contain conversational speech recorded during a dinner party in a domestic setting, where noise and reverberation are common. RescueSpeech also contains conversational speech recorded in challenging acoustic environments, but the scenario addressed in this corpus is unique and different from a dinner party. The acoustic conditions, emotions, and lexicon used in RescueSpeech are distinct, and thus provide an additional set of challenges for speech recognition systems.

The noisy version of RescueSpeech can be utilized to train speech enhancement systems that are robust to the acoustic conditions present in the SAR domain. Numerous datasets have been released for speech enhancement purposes, including the deep noise suppression (DNS) dataset [21], the VoiceBank-DEMAND corpus [22], and the WHAM! and WHAMR! corpora [23], all of which are helpful for training speech enhancement models. However, the key difference with RescueSpeech is that it has been specifically designed for the SAR domain, where characteristic sounds such as sirens, radio signals, helicopters, trucks, and others affect the recordings. This unique characteristic of RescueSpeech makes it an especially valuable resource for training speech enhancement systems that can perform well in SAR environments.
3. EXPERIMENTAL SETUP

We explored multiple training strategies to perform noise-robust speech recognition. Speech recognizers and enhancement models are trained on large corpora and then fine-tuned and evaluated on RescueSpeech data.

3.1. ASR training

We follow two approaches for ASR training: one based on sequence-to-sequence (seq2seq) modeling and another based on connectionist temporal classification (CTC). For the seq2seq model, we employ a CRDNN (convolutional, recurrent, and dense neural network) architecture [24, 25]. The CRDNN encoder is trained on the full 1200 h of the German CommonVoice corpus [26]. Decoding uses an attentional GRU decoder and a beam search coupled with an RNN-based language model (LM). The LM is trained on Tuda-De (https://www.inf.uni-hamburg.de/en/inst/ab/lt/resources/data/acoustic-models.html) [27] (8M sentences), the Leipzig news corpus [28] (9M sentences), and the training transcripts of the CommonVoice corpus. For the CTC-based models, we use the wav2vec 2.0 and WavLM architectures as encoders for the ASR pipeline. These encoders use a self-supervised approach to learn high-level contextualized speech representations. No language model is needed, and decoding is performed using greedy search. For wav2vec 2.0 and WavLM we use the pre-trained encoders facebook/wav2vec2-large-xlsr-53-german (https://huggingface.co/facebook/wav2vec2-large-xlsr-53-german) and microsoft/wavlm-large (https://huggingface.co/microsoft/wavlm-large), respectively. Additionally, we employ the pre-trained Whisper [7] model openai/whisper-large-v2 (https://huggingface.co/openai/whisper-large-v2) to benchmark our systems against a competitive state-of-the-art model.

The CRDNN combines two CNN blocks (each with 2 CNN layers and channel sizes of 128 and 256), an RNN block (4 bidirectional LSTM layers with 1024 neurons per layer), and a dense neural network layer. The inputs are 40-dimensional mel-filterbank features, and the network is trained with an AdaDelta [29] optimizer with a learning rate (LR) of 1 (during fine-tuning we use an LR of 0.1). The model is trained for 25 epochs with a batch size of 8. During testing, beam search is used with a beam size of 80. Each epoch takes approximately 8 h on a single RTX A6000 GPU with 48 GB of memory. For the wav2vec 2.0 and WavLM CTC models, training is performed for 45 and 20 epochs, respectively, with an LR of 1e-4 and a batch size of 8 using an Adam [30] optimizer. Each epoch takes approximately 5.5 h on a single RTX A6000 GPU with 48 GB of memory. The LR is annealed, and the sampling frequency is set to 16 kHz for both approaches. More details on training and model parameters can be found in the repository.
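As a rough illustration of the greedy CTC decoding used for the wav2vec 2.0 and WavLM models, the minimal sketch below (not taken from the released SpeechBrain recipes) picks the most likely token per frame, collapses repeats, and drops the blank symbol; the toy character vocabulary is hypothetical.

```python
# Minimal sketch of greedy CTC decoding (not the SpeechBrain recipe itself).
# Assumes `logits` has shape (time, vocab_size) and index 0 is the CTC blank;
# the toy vocabulary below is hypothetical.
import torch

vocab = ["<blank>", " ", "a", "b", "r", "u", "f"]   # hypothetical label set

def ctc_greedy_decode(logits: torch.Tensor, blank_id: int = 0) -> str:
    """Pick the most likely token per frame, collapse repeats, drop blanks."""
    frame_ids = logits.argmax(dim=-1).tolist()       # best token index per frame
    decoded, prev = [], None
    for idx in frame_ids:
        if idx != prev and idx != blank_id:          # collapse repeats, skip blanks
            decoded.append(vocab[idx])
        prev = idx
    return "".join(decoded)

# Toy example: 8 frames of fake logits spelling "ab"
toy = torch.full((8, len(vocab)), -5.0)
for t, idx in enumerate([0, 2, 2, 0, 3, 3, 0, 0]):   # blank, a, a, blank, b, b, blank, blank
    toy[t, idx] = 5.0
print(ctc_greedy_decode(toy))                         # -> "ab"
```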
3.2. Speech enhancement training

In this work, we perform speech enhancement using SepFormer [31], a multi-head-attention, transformer-based source separation architecture. It uses a fully learnable masking-based architecture composed of an encoder, a masking network, and a decoder. The encoder and decoder blocks are essentially convolutional layers, and we learn a deep masking network based on self-attention that estimates element-wise masks. These masks are used by the decoder to reconstruct the enhanced signal in the time domain. We use the DNS4 dataset (https://github.com/microsoft/DNS-Challenge) to synthesize the training and evaluation sets. Using the provided clean utterances, noisy clips (150 noise types), and RIRs, we generate 1300 h of training data and 6.7 h of validation data at varying SNRs (from -5 dB to 15 dB in steps of 1 dB), and the DNS-2022 baseline dev set is used as the test set. The sampling rate is set to 16 kHz, and only 30% of the clean speech is convolved with RIRs.

SepFormer employs an encoder and a decoder with 256 convolutional filters with a kernel size of 16 and a stride of 8. The masking network has 2 layers of dual-composition blocks and a chunk length of 250. With each clean-noisy pair fixed at 4 s in length, the model is trained in a supervised fashion using the scale-invariant SNR (SI-SNR) loss and an Adam optimizer with an LR of 1.5e-4. We utilize a multi-GPU distributed data parallel (DDP) training scheme to train the network for 50 epochs with a batch size of 4. Each epoch takes approximately 9 h on 8 × RTX A6000 GPUs.

3.3. Training strategies

We use various training methods to create a robust speech recognition system that operates in the SAR (search and rescue) domain. These methods are described below:

1. Clean training: After pretraining the ASR and language model (LM), we fine-tune them on the RescueSpeech clean dataset. This process helps to adapt the models to our target domain. We keep the model and training parameters the same as described in Section 3.1.

2. Multi-condition training: Using the same pretrained model as above, we perform multi-condition training, which involves training the ASR model on an equal mix of clean and noisy audio from the RescueSpeech noisy dataset. In this way, the model learns to adapt to the different noises present in the utterances, which helps it perform speech recognition. This method forms the baseline for all our results. We set the learning rate (LR) to 0.1 and keep the other parameters the same as above.

3. Model combination I: Independent training: We pretrain a speech enhancement model and then fine-tune it on the RescueSpeech noisy dataset. This model is then integrated with the ASR model trained in the clean training stage to perform noise-robust speech recognition. In this stage, we freeze the enhancement model.

4. Model combination II: Joint training: This is a continuation of the previous stage, where we follow a joint training approach. We unfreeze the enhancement model and allow gradients from the ASR to propagate back to the speech enhancement model (a minimal sketch of this joint setup follows the list). Updating the weights of the model in this way enables it to generate output that is as clean as possible, as required by the ASR model.
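The joint training in strategy 4 can be pictured as a single computation graph in which the enhancement front-end feeds the ASR loss. The PyTorch sketch below is not the released SpeechBrain recipe; `Enhancer` and `ASRModel` are placeholder stand-ins for SepFormer and the CTC acoustic model, used only to show how unfreezing the enhancer lets the ASR gradient reach it.

```python
# Minimal sketch of joint enhancement + ASR fine-tuning (strategy 4).
# `Enhancer` and `ASRModel` are placeholders, not the paper's released models.
import torch
import torch.nn as nn

class Enhancer(nn.Module):                     # stand-in for SepFormer
    def __init__(self):
        super().__init__()
        self.net = nn.Conv1d(1, 1, kernel_size=9, padding=4)
    def forward(self, noisy):                  # (batch, 1, samples) -> same shape
        return self.net(noisy)

class ASRModel(nn.Module):                     # stand-in for the CTC acoustic model
    def __init__(self, vocab_size=32):
        super().__init__()
        self.encoder = nn.Conv1d(1, 64, kernel_size=400, stride=160)
        self.head = nn.Linear(64, vocab_size)
    def forward(self, wav):                    # (batch, 1, samples) -> (batch, T, vocab)
        feats = self.encoder(wav).transpose(1, 2)
        return self.head(feats).log_softmax(dim=-1)

enhancer, asr = Enhancer(), ASRModel()

# Model combination I would freeze the enhancer:
#   for p in enhancer.parameters(): p.requires_grad = False
# Model combination II (joint training) leaves it trainable, so the ASR loss
# backpropagates through the enhanced waveform into the enhancer.
optim = torch.optim.Adam(list(enhancer.parameters()) + list(asr.parameters()), lr=1e-4)
ctc_loss = nn.CTCLoss(blank=0)

noisy = torch.randn(2, 1, 16000)               # dummy 1 s batch at 16 kHz
targets = torch.randint(1, 32, (2, 12))        # dummy label sequences
target_lens = torch.full((2,), 12)

enhanced = enhancer(noisy)                     # front-end cleans the waveform
log_probs = asr(enhanced)                      # back-end transcribes it
input_lens = torch.full((2,), log_probs.size(1))
loss = ctc_loss(log_probs.transpose(0, 1), targets, input_lens, target_lens)
loss.backward()                                # gradients reach the enhancer too
optim.step()
```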
4. RESULTS

4.1. ASR Performance

As a first attempt, we created a simple pipeline consisting solely of an ASR model, with no speech enhancement in the front end. Table 2 compares the different ASR models on both clean and noisy audio recordings from the RescueSpeech dataset. The models included in the comparison are CRDNN, wav2vec 2.0, WavLM, and Whisper. During the pre-training stage, all models (except Whisper) used only the CommonVoice dataset; during the clean training and multi-condition fine-tuning stages, the RescueSpeech dataset was used.

Table 2: Comparison of test WERs (%) for the CRDNN, wav2vec2.0-large, WavLM-large, and whisper-large-v2 models using different training strategies on clean and noisy speech inputs from the RescueSpeech dataset.

  Training strategy      ASR model   Clean   Noisy
  Pre-training           CRDNN       52.03   81.14
                         Wav2vec2    47.92   76.98
                         WavLM       46.28   73.84
                         Whisper     27.01   50.85
  Clean training         CRDNN       31.18   60.10
                         Wav2vec2    27.69   62.60
                         WavLM       23.93   58.28
                         Whisper     23.14   46.70
  Multi-cond. training   CRDNN       33.22   58.95
                         Wav2vec2    29.89   57.98
                         WavLM       25.22   52.75
                         Whisper     24.11   45.84

Table 3: Speech enhancement performance on the RescueSpeech noisy test inputs when combining speech enhancement and speech recognition (Model Comb. I vs. Model Comb. II).

  Metric    Model Comb. I   Model Comb. II
                            CRDNN   Wav2vec2   WavLM   Whisper
  SI-SNRi   6.516           6.618   7.205      7.140   7.482
  SDRi      7.439           7.490   7.765      7.694   8.011
  PESQ      2.008           2.010   2.060      2.064   2.083
  STOI      0.842           0.844   0.854      0.854   0.859

Table 4: Word error rate (WER, %) achieved with independent training (Model Comb. I) and joint training (Model Comb. II) of the speech enhancement and ASR modules.

  ASR model   Model Comb. I   Model Comb. II
  CRDNN       54.98           54.55
  Wav2vec2    50.68           49.24
  WavLM       48.24           46.04
  Whisper     48.04           45.29

Unsurprisingly, the clean training approach is the most effective when tested on clean audio recordings. The top-performing model in this scenario is Whisper, which achieved a WER of 23.14%.
On the other hand, multi-condition training proved to be the superior strategy when dealing with noisy recordings. In this scenario, the best model is again Whisper, which achieved a WER of 45.84%. The performance gap with respect to clean signals highlights once more the significant decline in recognition performance under challenging acoustic conditions, even for models that were pretrained using state-of-the-art self-supervised techniques like wav2vec 2.0, WavLM, and Whisper (the latter of which is even semi-supervised).

[Fig. 2: Log-power spectrograms of clean, noisy, and SepFormer-enhanced utterances for the emergency vehicle siren and chopper noise types at -5 dB SNR.]
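The WERs reported in Tables 2 and 4 follow the standard definition: the word-level edit distance between hypothesis and reference divided by the reference length. The small implementation below is only illustrative; the paper's numbers come from its released recipes, not from this snippet.

```python
# Illustrative word error rate (WER) computation via edit distance;
# not the scoring script used in the paper's released recipes.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)  # sub/del/ins
    return 100.0 * dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Toy example: one word deleted out of five reference words -> 20.0
print(round(wer("bitte kommen over und aus", "bitte kommen und aus"), 2))
```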
4.2. Combining ASR and Speech Enhancement

To improve the ASR performance, we developed a speech enhancement system to clean up the recordings. To accomplish this, we utilized the SepFormer model, which has demonstrated competitive performance in speech separation and enhancement tasks [32]. Specifically, we trained the model on the DNS4 dataset, achieving SIG, BAK, and OVRL scores of 2.999, 3.076, and 2.437, respectively. Figure 2 shows the log-power spectrograms for two types of noisy audio recordings, emergency vehicle siren and chopper noise, both at an SNR of -5 dB, using the SepFormer model fine-tuned on the RescueSpeech noisy dataset. From a qualitative standpoint, SepFormer appears to perform well on the noises that affect the SAR domain. Figure 3 presents PESQ vs. SNR as well as SI-SNRi and SDRi vs. SNR for the same noise types. We observed that the improvements in SI-SNR and SDR were greater for utterances with an SNR of -5 dB, indicating a more significant enhancement in speech intelligibility and reduction of distortion than for higher-SNR utterances. This pattern is consistent across all noise types.

[Fig. 3: PESQ, SDRi, and SI-SNRi vs. SNR of SepFormer-enhanced utterances for two noise types: emergency vehicle siren and chopper noise.]

Table 3 displays the speech enhancement results obtained by incorporating a speech recognizer into the pipeline. In Section 3.3 we explored two approaches: independent training (Model Comb. I) and joint training (Model Comb. II). The joint training approach resulted in improvements across all considered speech enhancement metrics (SI-SNRi, SDRi, PESQ, STOI) and all ASR modules (CRDNN, Wav2vec2, WavLM, Whisper). Table 4 presents the final speech recognition output at the end of the pipeline.

As anticipated, the joint training approach outperformed a simple combination of independently trained speech enhancement and speech recognition modules. It is important to note that both the speech enhancement and speech recognition models undergo fine-tuning using enhanced signals from the unfrozen SepFormer. We postulate that backpropagating the ASR gradient to the speech enhancement model enables SepFormer to denoise utterances according to the specific requirements of the ASR model, facilitating better convergence. Training both models jointly allows the enhancement model to adapt its cleaning capabilities to align better with the needs of the ASR system. Overall, the best-performing model is the combination of SepFormer with the Whisper ASR, which achieved a WER of 45.29%.

5. CONCLUSIONS

Our work addresses some major challenges that arise in the SAR domain: the lack of speech data, the need for robustness to SAR noises, and conversational speech. To overcome these challenges, we have introduced RescueSpeech, a new dataset of German speech that we use to perform robust speech recognition in a hostile, noise-filled environment. To achieve this, we proposed multiple training strategies that involve fine-tuning pretrained models on our in-domain data. We tested different self-supervised models (e.g., Wav2Vec2, WavLM, and Whisper) for speech recognition. Despite leveraging these cutting-edge systems, our best model only achieves a WER of 45.29% on our test set. This result highlights the significant difficulty and the urgent need for further research in this crucial domain.

Overall, our work represents a step forward in addressing the challenges of speech recognition in the SAR domain. By introducing a new dataset, we hope to establish a useful benchmark and foster more studies in this field.

6. ACKNOWLEDGEMENTS

Our work was supported under the project "A-DRZ: Setting up the German Rescue Robotics Center" and funded by the German Ministry of Education and Research (BMBF), grant No. I3N14856.
535–557, 2017.</p><p>[12] Jon Barker, Shinji Watanabe, Emmanuel Vincent, and Jan Trmal, “The fifth ’CHiME’ Speech Separation and Recognition Challenge: Dataset, task and baselines,” in <a id="bookmark3"></a><em>Proc. of Interspeech</em>, 2018.</p><p>[13] Shinji Watanabe et al., “CHiME-6 Challenge: Tack- ling Multispeaker Speech Recognition for Unsegmented Recordings,” in <em>Proc. 6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020)</em>, 2020.</p><p>[14] Mirco Ravanelli, Luca Cristoforetti, Roberto Gretter, Marco Pellin, Alessandro Sosi, and Maurizio Omologo, “The DIRHA-ENGLISH corpus and related tasks for <a id="bookmark38"></a>distant-speech recognition in domestic environments,” in <em>Proc. of ASRU</em>, 2015.</p><p>[15] Marco Matassoni, Ram<img src="/media/202408//1724838579.7425869.png" />n Fernandez Astudillo,Athana- sios Katsamanis, and Mirco Ravanelli, “The DIRHA- <a id="bookmark18"></a>GRID corpus: baseline and tools for multi-room distant speech recognition using distributed microphones,” in <em>Proc. of Interspeech</em>, 2014.</p><p>[16] Mirco Ravanelli and Maurizio Omologo, “On the selec- tion of the impulse responses for distant-speech recog- <a id="bookmark19"></a>nition based on contaminated speech training,” in <em>Proc. </em><a id="bookmark39"></a><em>of Interspeech</em>, Haizhou Li, Helen M. Meng, Bin Ma, Engsiong Chng, and Lei Xie, Eds., 2014.</p><p>[17] Mirco Ravanelli and Maurizio Omologo, “Contami- nated speech training methods for robust DNN-HMM <a id="bookmark12"></a><a id="bookmark22"></a><a id="bookmark21"></a>distant speech recognition,” in <em>Proc. of Interspeech</em>, 2015.</p><p>[18] Steve Renals, Thomas Hain, and Herve Bourlard, “Recognition and interpretation of meetings: The AMI and AMIDA projects,” in <em>Proc. of ASRU</em>, 2007.</p><p>[19] Colleen Richey, Maria A. Barrios, Zeb Armstrong, Chris Bartels, Horacio Franco, Martin Graciarena, Aaron Lawson, Mahesh Kumar Nandwana, Allen Stauf- fer, Julien van Hout, Paul Gamble, Jeff Hetherly, Cory Stephenson, and Karl Ni, “Voices Obscured in Complex Environmental Settings (VOICES) corpus,” 2018.</p><p>[20] Alex Stupakov, Evan Hanusa, Deepak Vijaywargi, Di- eter Fox, and Jeff A. Bilmes, “The design and collec- <a id="bookmark27"></a>tion of COSINE, a multi-microphone in situ speech cor- <a id="bookmark28"></a>pus recorded in noisy environments,” <em>Comput. Speech </em><a id="bookmark14"></a><a id="bookmark34"></a><em>Lang.</em>, vol. 26, no. 1, pp. 52–66, 2012.</p><p>[21] Harishchandra Dubey, Vishak Gopal, Ross Cutler, Ashkan Aazami, Sergiy Matusevych, Sebastian Braun, Sefik Emre Eskimez, Manthan Thakker, Takuya Yosh- ioka, Hannes Gamper, and Robert Aichner, “ICASSP</p><p>2022 Deep Noise Suppression Challenge,” 2022.</p><p><em>International Conference on Acoustics, Speech and Sig- nal Processing (ICASSP)</em>, 2015, pp. 4580–4584.</p><p>[25] Yusheng Xiang, Tian Tang, Tianqing Su, Christine Brach, Libo Liu, Samuel S. Mao, and Marcus Geimer, “Fast CRDNN: Towards on Site Training of Mobile Construction Machines,” <em>IEEE Access</em>, vol. 9, pp. 124253–124267, 2021.</p><p>[26] Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor We- ber, “Common Voice: A Massively-Multilingual Speech Corpus,” 2019.</p><p>[27] Benjamin Milde and Arne Koehn, “Open Source Auto- maticSpeech Recognition for German,” in <em>Speech Com- munication; 13th ITG-Symposium</em>, 2018, pp. 
1–5.</p><p>[28] Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff, “Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Lan- <a id="bookmark26"></a>guages,” in <em>Proceedings of the Eighth International Conference on Language Resources and Evaluation</em></p><p><em>(LREC’12)</em>, Istanbul, Turkey, May 2012, pp. 759–765, European Language Resources Association (ELRA).</p><p>[29] Matthew D. Zeiler, “ADADELTA: An Adaptive Learn- ing Rate Method,” 2012.</p><p>[30] Diederik P. Kingma and Jimmy Ba, “Adam: A Method for Stochastic Optimization,” 2014.</p><p>[31] Cem Subakan, Mirco Ravanelli, Samuele Cornell, Mirko Bronzi, and Jianyuan Zhong, “Attention is All You Need in Speech Separation,” 2020.</p><p>[32] Cem Subakan, Mirco Ravanelli, Samuele Cornell, Fran- cois Grondin, and Mirko Bronzi, “On Using Transform- ers for Speech-Separation,” 2022.</p><p>[22] Christophe Veaux, Junichi Yamagishi, and Simon King, “The voice bank corpus: Design, collection <a id="bookmark15"></a>and data analysis of a large regional accent speech <a id="bookmark40"></a>database,” in <em>2013 International Conference Orien- tal COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O- COCOSDA/CASLRE)</em>, 2013, pp. 1–4.</p><p>[23] Gordon Wichern, Joe Antognini, Michael Flynn, Licheng Richard Zhu, Emmett McQuinn, Dwight Crow, Ethan Manilow, and Jonathan Le Roux, “WHAM!: Ex- tending Speech Separation to Noisy Environments,” in <a id="bookmark17"></a><em>Proc. Interspeech</em>, Sept. 2019.</p><p>[24] Tara N. Sainath, Oriol Vinyals, Andrew Senior, and Has¸im Sak, “Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks,” in <em>2015 IEEE</em></p>