<p><a href="http://crossmark.crossref.org/dialog/?doi=10.1049%2Fipr2.13024&domain=pdf&date_stamp=2024-01-11"><img src="/media/202408//1724838587.168684.jpeg" /></a></p><p>Received: 14 October 2023 <img src="/media/202408//1724838587.1888719.png" /> Revised: 6 December 2023 <img src="/media/202408//1724838587.1926851.png" /> Accepted: 21 December 2023 <img src="/media/202408//1724838587.196497.png" /><strong> IET Image Processing</strong></p><p>DOI: 10.1049/ipr2.13024</p><p><strong>ORIGINAL RESEARCH</strong></p><p><img src="/media/202408//1724838587.218859.png" /></p><p>The Institution of</p><p>Engineering and Technology</p><p>WILEY</p><p><strong>YOLOv5s maritime distress target detection method based on </strong><a id="bookmark1"></a><strong>swin transformer</strong></p><p><strong>Kun Liu</strong><a href="#bookmark1"><strong>1</strong></a><strong> </strong><img src="/media/202408//1724838587.243517.png" /><strong> Yueshuang Qi</strong><a href="#bookmark2"><strong>2</strong></a><a href="https://orcid.org/0000-0003-0559-5675"><img src="/media/202408//1724838587.247562.png" /><strong> </strong></a><img src="/media/202408//1724838587.254153.png" /><strong> Guofeng Xu</strong><a href="#bookmark1"><strong>1</strong></a><strong> </strong><img src="/media/202408//1724838587.268887.png" /><strong> Jianglong Li</strong><a href="#bookmark1"><strong>1</strong></a></p><p>1PLA Naval Aviation University, Qingdao Campus, Qingdao, China</p><p>2 School of Information Science and Engineering, Qilu Normal University, Jinan, China</p><p><strong>Correspondence</strong></p><p>Yueshuang Qi, School of Information Science and Engineering, Qilu Normal University, Jinan 250200, China.</p><p>Email:<a href="mailto:free_qys@163.com">free_qys@163.com</a></p><p><a id="bookmark2"></a><strong>Abstract</strong></p><p>In recent years, the task of maritime emergency rescue has increased, while the cost of time for traditional methods of search and rescue is pretty long with poor effect subject to the constraints of the complex circumstances around thesea, the effective conditions, and the support capability. This paper applies deep learning and proposes a YOLOv5s-SwinDS algorithm for target detection in distress at sea. Firstly, the backbone network of the YOLOv5s algorithm is replaced by swin transformer, and a multi-level feature fusion mod- ule is introduced to enhance the feature expression ability for maritime targets. Secondly, deformable convolutional networks v2 (DCNv2) is used instead of traditional convolution to improve the recognition capability for irregular targets when the neck network features are output. Finally, the CIoU loss function is replaced with SIoU to reduce the redun- dant box effectively while accelerating the convergence and regression of the predicted box. Experimenting on the publicly dataset SeaDronesSee, the <em>Precision</em>, <em>Recall</em>, <em>mAP</em>0.5 and <em>mAP</em>0.5−0.95 of YOLOv5s-SwinDS model are 87.9%, 75.8%, 79.1% and 42.9%, respec- tively, which get higher results than the original YOLOv5s model, the YOLOv7 series of models, and the YOLOv8 series of models. The experiments verifies that the algorithm has good performance in detecting maritime distress targets.</p><p><strong>1 </strong><img src="/media/202408//1724838587.286735.png" /><strong> INTRODUCTION</strong></p><p>As human activities at sea become more and more frequent, people swim, dive, surf, ride motor boats, or take boats for sea fishing, etc. 
Maritime activities have become an important part of daily life, and the number of boats operating at sea keeps increasing. However, many people die every year because they fall into the sea for various reasons and are not rescued in time. Traditional maritime search and rescue relies mainly on expert experience and the spot where the target fell into the sea to set the rescue area, which can take a long time and give relatively poor results, restricted by the complex circumstances at sea, the available equipment, and the support capability.

In recent years, with the rapid development of Big Data and the Internet of Things [1-6], the use of unmanned intelligent equipment for target detection in distress at sea has become a new research hotspot. J. Wang et al. proposed advanced methods such as the Multi-Stage Self-Guided Separation Network and the Representation-Enhanced Status Replay Network [7, 8]. M. Zhang et al. proposed SOT-NET [9]. Owing to their low cost, sensitivity, intelligence, and autonomy, unmanned aerial vehicles (UAVs) have gradually been adopted in the maritime search and rescue field [10-12].

Object detection by UAVs is an important research direction in computer vision. Compared with images taken from a horizontal perspective, images taken by UAVs not only have variable angles and heights but also feature a large field of view, a small proportion of target pixels, complex backgrounds, and sensitivity to sunlight [13], all of which make object detection on UAV images difficult. Therefore, we propose an improved YOLOv5s algorithm to increase the accuracy of UAV-based detection of maritime distress targets. Deep-learning-based target detection methods fall into two kinds: two-stage models based on R-CNN [14], Fast R-CNN [15] and Faster R-CNN [16], and one-stage models based on YOLO [17-21] and SSD [22]. Common two-stage models
have high accuracy and detect well, but require a large amount of computation.

To overcome the low accuracy of the YOLO series of algorithms, [20] used cross-stage-partial (CSP) modules to extract features, introduced spatial pyramid pooling (SPP) and a path aggregation network (PAN) [23], fused high-level and low-level feature information, and combined this with the mosaic data enhancement method to improve the accuracy of object detection.

To further improve detection accuracy, YOLOv5 [24] introduced adaptive candidate boxes (anchors) [25] and adaptive image scaling on the basis of the YOLOv4 networks to enhance the data and improve the robustness of the network. YOLOv5 adopted the focus structure and designed two CSP structures, which strengthened the fusion capability for network features and retained richer feature information. In addition, YOLOv5 applied the CIoU loss function [26] to improve the accuracy and convergence stability of the network. YOLOv5 also introduced network depth and width as scaling factors, yielding four models of sequentially increasing size and accuracy according to the number of layers and channels, namely YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, which makes the network family more flexible. YOLOv5s has higher detection efficiency and lower hardware requirements, so the algorithm proposed in this paper builds on YOLOv5s.

We propose the YOLOv5s-SwinDS algorithm for target detection in distress at sea, based on YOLOv5s with a swin transformer. The main contributions of this paper are as follows:

1. The YOLOv5s algorithm uses a traditional CNN to extract features. Because of the locality of CNNs, it can only establish connections within local areas and cannot model long-range dependencies with distant locations. The swin transformer can connect any two locations thanks to its self-attention mechanism. We therefore combine the swin transformer with the YOLOv5s algorithm, using the swin transformer as the backbone to extract features so that the two complement each other and long-range dependencies are captured better: the backbone network of the YOLOv5s algorithm is replaced by the swin transformer, and a multi-level feature fusion module is introduced to enhance the feature expression ability for maritime objects.

2. Deformable convolutional networks v2 (DCNv2) is used instead of traditional convolution in the feature output of the neck network to improve the recognition capability for irregular targets.

3. The CIoU loss function is replaced with SIoU to reduce redundant boxes effectively while accelerating the convergence and regression of the predicted box.

Experimental results on the open SeaDronesSee dataset show that the proposed YOLOv5s-SwinDS model is superior to the traditional YOLOv5s model, the YOLOv7 series of models, and the YOLOv8 series of models in terms of Precision, Recall, mAP0.5 and mAP0.5-0.95. Our algorithm improves accuracy while also taking detection speed into account, which makes it better suited to searching for distress targets.

2 YOLOv5s NETWORK ARCHITECTURE

The framework of YOLOv5s can be divided into an input end, a backbone network, a neck network, and an output end; its structure is shown in Figure 1.

FIGURE 1 Structure of YOLOv5s.

2.1 Input

The main techniques at the input end are mosaic enhancement of the dataset and adaptive anchor box calculation. Mosaic enhancement effectively enriches the dataset by scaling, cutting, arranging, and splicing images at will. During training, adaptive anchor box calculation outputs predicted boxes from the pre-set initial anchor boxes and compares them with the real boxes; the optimal anchor box values are obtained by reverse updating on the difference between the predicted and real boxes.
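As a rough illustration of the mosaic step, the sketch below stitches four images around a random centre point into a single 640 x 640 training sample. It is a minimal standalone version written for this description (label handling and the exact Ultralytics implementation are omitted), so the fill value and nearest-neighbour resizing are assumptions.

```python
import random
import numpy as np

def mosaic4(images, out_size=640):
    """Stitch four HxWx3 uint8 images around a random centre point,
    as in mosaic data enhancement (label handling omitted)."""
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)   # grey fill
    cx = random.randint(out_size // 4, 3 * out_size // 4)            # random centre
    cy = random.randint(out_size // 4, 3 * out_size // 4)
    regions = [(0, 0, cx, cy), (cx, 0, out_size, cy),                # four quadrants
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, regions):
        h, w = y2 - y1, x2 - x1
        ys = np.linspace(0, img.shape[0] - 1, h).astype(int)         # nearest-neighbour
        xs = np.linspace(0, img.shape[1] - 1, w).astype(int)         # resize indices
        canvas[y1:y2, x1:x2] = img[ys][:, xs]
    return canvas

imgs = [np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8) for _ in range(4)]
print(mosaic4(imgs).shape)   # (640, 640, 3)
```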
2.2 Backbone

The feature extraction network of YOLOv5s draws on the design of CSPDarkNet and is composed of spatial pyramid pooling fast (SPPF) and C3 modules, the latter containing three standard convolutional layers and several bottlenecks. The C3 module contains a residual structure, which reduces the number of network parameters and improves the training speed. SPPF improves on spatial pyramid pooling (SPP) by deleting redundant operations, so feature fusion is performed faster.

2.3 Neck

The neck network combines image features and transmits them to the head module for prediction. It uses an FPN+PAN structure: FPN is a pyramid-reinforced structure in charge of the semantic features of the high-level network, transmitted from top to bottom, while PAN is responsible for the positioning features of the underlying network, transmitted from bottom to top. FPN and PAN are shown in Figure 2.

FIGURE 2 Feature pyramid network (FPN) and path aggregation network (PAN).

2.4 Head

The output end, also known as the head, is the classifier and regressor of YOLOv5s. It judges the previously obtained feature points and whether there are objects corresponding to them. The output end uses non-maximum suppression (NMS) and adopts CIoU loss as the bounding-box loss function, which solves the misalignment problem of the bounding box and effectively improves the speed and accuracy of predicted-box regression.
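To make the NMS post-processing step concrete, here is a minimal sketch using torchvision.ops.nms on a few hand-made detections; the box coordinates, scores and threshold are illustrative values, not settings from the paper.

```python
import torch
from torchvision.ops import nms

# Dummy raw detections in (x1, y1, x2, y2) format with confidence scores.
boxes = torch.tensor([[100., 100., 200., 200.],
                      [105., 102., 198., 205.],   # heavy overlap with the first box
                      [400., 300., 480., 380.]])
scores = torch.tensor([0.90, 0.75, 0.60])

keep = nms(boxes, scores, iou_threshold=0.45)     # indices of boxes kept after NMS
print(keep)                                       # tensor([0, 2]): the duplicate is suppressed
```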
3 YOLOv5s-SwinDS NETWORK ARCHITECTURE

In this paper, we propose the YOLOv5s-SwinDS algorithm for target detection in distress at sea, based on YOLOv5s with a swin transformer. Firstly, the backbone network of the YOLOv5s algorithm is replaced by the swin transformer, and a multi-level feature fusion module is introduced to enhance the feature expression ability for maritime objects. Secondly, DCNv2 is used instead of traditional convolution in the feature output of the neck network to improve the recognition capability for irregular targets and enable adaptive feature sampling. Finally, at the head, the CIoU loss function is replaced with SIoU to reduce redundant boxes effectively while accelerating the convergence and regression of the predicted box. The network structure of YOLOv5s-SwinDS is shown in Figure 3.

FIGURE 3 Structure of YOLOv5s-SwinDS.

3.1 Replace the backbone network with swin transformer

The transformer was first used in natural language processing; it arose because RNNs cannot exploit parallel computation and GPU acceleration during training, which had led to CNNs being used for parallel acceleration instead of RNNs. However, applying a transformer to computer vision poses certain challenges because of the differences between transformers and CNNs. Firstly, the detection speed of the transformer in computer vision is slow. Secondly, directly applying the transformer model used in natural language processing leads to very high computational cost, because the amount of information in computer vision tasks is much greater than that of text.
The swin transformer overcomes the transformer's slow detection speed in the computer vision field while keeping characteristics of convolutional neural networks such as displacement invariance and stage-by-stage resolution reduction, and it has achieved state-of-the-art results in many fields. The swin transformer [27] is a network model that introduces a sliding window and a hierarchical structure. It consists of four stages, each of which reduces the resolution of its input features, similar to the way a convolutional neural network expands the receptive field layer by layer. The swin transformer model mainly includes a patch embedding (segmentation coding) module, the swin transformer block (sliding window) module, and a patch merging (splicing) module.

The detection process of the swin transformer is as follows. Firstly, patch embedding splits the input image into multiple image blocks, which makes subsequent per-block operations convenient and reduces the amount of computation. Secondly, each stage contains a patch merging module and multiple blocks, in which LayerNorm, MLP, window attention, and shifted window attention together constitute the swin transformer block. The patch merging module acts at the beginning of each stage to reduce the resolution of the image. The structure of the swin transformer is shown in Figure 4a,b.

FIGURE 4 (a) Structure of swin transformer. (b) Two successive swin transformer blocks.
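As a minimal sketch of the window mechanism described above (not the authors' implementation), the snippet below shows how a feature map is split into non-overlapping windows for window attention and how the cyclic shift used by shifted window attention is applied; the feature-map size and channel count are assumed stage-1 values.

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows of shape
    (num_windows * B, window_size, window_size, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

def cyclic_shift(x, shift):
    """Roll the feature map so the next block attends across the previous
    window boundaries (the 'shifted window' trick)."""
    return torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))

feat = torch.randn(1, 56, 56, 96)            # assumed stage-1 feature map
wins = window_partition(feat, window_size=7)
print(wins.shape)                            # torch.Size([64, 7, 7, 96])
shifted = window_partition(cyclic_shift(feat, shift=3), window_size=7)
print(shifted.shape)                         # torch.Size([64, 7, 7, 96])
```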
3.2 Using deformable convolution instead of traditional convolution

The objects in the SeaDronesSee dataset used in this paper for maritime distress target detection have diverse shapes and attitudes; they are irregular targets. Traditional convolution is insufficient to extract the feature information of irregular targets when the neck network features are output.

Convolution kernels are used to extract the features of input images, and the kernel size is usually fixed. The biggest problem with such a kernel is that it adapts poorly to unknown changes and generalizes poorly. The disadvantages of the traditional standard convolution kernel are as follows:

1. The convolution unit samples the input feature map at fixed positions.
2. The pooling layer continuously reduces the size of the feature map.
3. The RoI pooling layer generates RoIs with limited spatial locations.

The localized sampling of traditional convolutional neural networks is hard to adapt to the deformation of objects. The model formula of this process is

y(p_0) = \sum_{p_n \in R} w(p_n) \, x(p_0 + p_n), (1)

where x is the input feature map sampled by the convolution kernel on square grid points and w is the weight. For position p_0 of the output y, the output feature value equals the sum of the sampled values weighted by w. R is the sampling grid, given by

R = \{(-1, -1), (-1, 0), \ldots, (0, 1), (1, 1)\}. (2)

Deformable convolution adds a directional parameter to each element of the convolution kernel so that the kernel can extend over a larger range during training. Using deformable convolution in the output stage of the neck network, the features of both regular and irregular targets can be fully extracted. Zhu et al. [28] proposed a deformable convolution that adjusts the direction vector of the convolution kernel on the basis of traditional convolution and can adaptively sample according to the shape of objects. Therefore, to adapt to the various forms of marine distress targets, this paper introduces deformable convolution to sample freely at positions that are not limited to square grid units.

FIGURE 5 Convolution: (a) Traditional standard convolution and (b) Deformable convolution.

The main advantage of deformable convolution is that it samples features adaptively and can learn geometric deformation, which suits objects of different sizes and shapes, while only increasing the computation time to a certain extent. The deformable convolution formula, with an additional learned offset for each sampling point, is

y(p_0) = \sum_{p_n \in R} w(p_n) \, x(p_0 + p_n + \Delta p_n). (3)

After an offset \Delta p_n is assigned to each sampling position, the sampling becomes irregular, which gives the method better transformation-modelling capability than a traditional convolutional neural network. Deformable convolution is illustrated in Figure 5.
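The modulated deformable convolution used here is available in torchvision. Below is a minimal sketch of the usual DCNv2 pattern, in which a plain convolution predicts the offsets and a modulation mask that are then passed to DeformConv2d; the channel sizes are assumptions, and this is a generic block rather than the exact layer used in the paper.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DCNv2Block(nn.Module):
    """Modulated deformable convolution (DCNv2): a plain conv predicts
    per-location offsets and a sigmoid mask for DeformConv2d."""
    def __init__(self, c_in, c_out, k=3, stride=1, padding=1):
        super().__init__()
        # 2*k*k offset channels (x and y per sampling point) + k*k mask channels
        self.offset_mask = nn.Conv2d(c_in, 3 * k * k, k, stride, padding)
        self.deform = DeformConv2d(c_in, c_out, k, stride, padding)

    def forward(self, x):
        o1, o2, mask = torch.chunk(self.offset_mask(x), 3, dim=1)
        offset = torch.cat((o1, o2), dim=1)      # the learned offsets of Equation (3)
        mask = torch.sigmoid(mask)               # modulation scalar in [0, 1]
        return self.deform(x, offset, mask)

x = torch.randn(1, 256, 40, 40)                  # assumed neck feature map
print(DCNv2Block(256, 256)(x).shape)             # torch.Size([1, 256, 40, 40])
```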
3.3 Improvement of loss function

In YOLOv5s, CIoU [29] is used to calculate the regression loss of the predicted box. As shown in Equations (4) and (5), the penalty term of CIoU adds an influence factor to the penalty term of DIoU:

L_{DIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2}, (4)

L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v, (5)

where \rho(b, b^{gt}) is the distance between the centres of the predicted box b and the real box b^{gt}, c is the diagonal length of the smallest enclosing box, and \alpha is the trade-off parameter defined in Equation (6):

\alpha = \frac{v}{(1 - IoU) + v}, (6)

while v measures the consistency of the aspect ratio, as defined in Equation (7):

v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2. (7)

CIoU does not take into account the mismatch between the directions of the predicted box and the real box, which leads to slow and inefficient convergence: the predicted box may "wander around" during training and ultimately produce a worse model.

SIoU [30] considers the direction and angle between the regression vectors. It introduces the vector angle between the real box and the predicted box to constrain the predicted box along the X or Y axis and thereby improve the convergence speed. SIoU is composed of an angle cost, a distance cost, a shape cost and an IoU cost.

3.3.1 Angle cost

B(b_{cx}, b_{cy}) is the centre point of the predicted box. The anchor box is first driven onto the horizontal or vertical axis through the centre point B^{GT}(b_{cx}^{gt}, b_{cy}^{gt}) of the real box, which reduces the degrees of freedom of the anchor box and lets it approach the real box rapidly along the relevant axis, as shown in Figure 6. The formula is

\Lambda = 1 - 2\sin^2\left(\arcsin\frac{c_h}{\sigma} - \frac{\pi}{4}\right) = 1 - 2\sin^2\left(\alpha - \frac{\pi}{4}\right), (8)

where \sigma and c_h are the distance between the centre points of the real box and the predicted box and their height difference, respectively.

FIGURE 6 Calculation process of angle cost.

3.3.2 Distance cost

The diagonal of the smallest enclosing rectangle of the predicted box and the real box is used for the distance cost, as shown in Figure 7. The distance cost is given by Equations (9) and (10).
FIGURE 7 Calculation process of distance cost.

\Delta = 2 - e^{-(2-\Lambda)\rho_x} - e^{-(2-\Lambda)\rho_y}, (9)

\rho_x = \left(\frac{b_{cx}^{gt} - b_{cx}}{c_w}\right)^2, \quad \rho_y = \left(\frac{b_{cy}^{gt} - b_{cy}}{c_h}\right)^2, (10)

where c_w and c_h are the width and height of the smallest enclosing rectangle.

3.3.3 Shape cost

\Omega = \left(1 - e^{-\omega_w}\right)^\theta + \left(1 - e^{-\omega_h}\right)^\theta, (11)

\omega_w = \frac{|w - w^{gt}|}{\max(w, w^{gt})}, \quad \omega_h = \frac{|h - h^{gt}|}{\max(h, h^{gt})}, (12)

where w and w^{gt} are the widths of the predicted box and the real box, h and h^{gt} are their heights, and \theta is close to 4.

The SIoU regression loss is shown in Equation (13):

Loss_{SIoU} = 1 - IoU + \frac{\Delta + \Omega}{2}, (13)

where IoU is the intersection over union of the predicted box and the real box.
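The sketch below writes Equations (8)-(13) directly in PyTorch for boxes in (x1, y1, x2, y2) format; it is meant to make the three cost terms concrete under those assumptions, not to reproduce the authors' exact implementation.

```python
import math
import torch

def siou_loss(pred, target, theta=4, eps=1e-7):
    """SIoU regression loss for (x1, y1, x2, y2) boxes, following Eqs. (8)-(13)."""
    w1, h1 = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w2, h2 = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    cx1, cy1 = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx2, cy2 = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2

    # IoU term
    iw = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(0)
    ih = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(0)
    inter = iw * ih
    iou = inter / (w1 * h1 + w2 * h2 - inter + eps)

    # smallest enclosing box
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0]) + eps
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1]) + eps

    # angle cost, Eq. (8)
    sigma = torch.sqrt((cx2 - cx1) ** 2 + (cy2 - cy1) ** 2) + eps
    sin_alpha = (cy2 - cy1).abs() / sigma
    angle = 1 - 2 * torch.sin(torch.arcsin(sin_alpha) - math.pi / 4) ** 2

    # distance cost, Eqs. (9)-(10)
    rho_x, rho_y = ((cx2 - cx1) / cw) ** 2, ((cy2 - cy1) / ch) ** 2
    gamma = 2 - angle
    dist = (1 - torch.exp(-gamma * rho_x)) + (1 - torch.exp(-gamma * rho_y))

    # shape cost, Eqs. (11)-(12)
    omega_w = (w1 - w2).abs() / torch.max(w1, w2)
    omega_h = (h1 - h2).abs() / torch.max(h1, h2)
    shape = (1 - torch.exp(-omega_w)) ** theta + (1 - torch.exp(-omega_h)) ** theta

    return 1 - iou + (dist + shape) / 2          # Eq. (13)

print(siou_loss(torch.tensor([[10., 10., 50., 60.]]),
                torch.tensor([[12., 15., 55., 58.]])))
```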
4 EXPERIMENT

4.1 Experimental dataset

The SeaDronesSee dataset was presented at the WACV 2022 conference by Varga and other researchers at the University of Tuebingen in Germany. SeaDronesSee is a large-scale object detection and tracking dataset containing more than 54,000 images and 400,000 instances taken by drones at altitudes of 5 to 260 m and viewing angles of 0° to 90°, with the corresponding height, angle and other metadata provided. Swimmer, boat, jet ski, life_saving_appliances, and buoy are selected as the detection objects in this dataset. Acquiring images of actual distress targets is difficult and the amount of such data is relatively small; the SeaDronesSee dataset contains several types of targets that can simulate the state of actual targets in maritime distress, so we chose SeaDronesSee as the experimental dataset.

To prove the effectiveness of our algorithm and apply it to searching for and rescuing targets at sea, we used stratified sampling to build a subset of the SeaDronesSee dataset, with 893 images in the training set and 155 images in the verification set.
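A minimal sketch of such a stratified split with scikit-learn is shown below; the per-image labels and class counts are invented placeholders (in practice they come from the SeaDronesSee annotations), and only the 893/155 split sizes match the paper.

```python
from sklearn.model_selection import train_test_split

# Hypothetical per-image dominant-class labels; in practice these would be
# derived from the SeaDronesSee annotation files.
image_ids = [f"img_{i:04d}.jpg" for i in range(1048)]
labels = (["swimmer"] * 500 + ["boat"] * 300 + ["jet_ski"] * 120
          + ["life_saving_appliances"] * 80 + ["buoy"] * 48)

train_ids, val_ids = train_test_split(
    image_ids,
    test_size=155,        # 155 verification images, 893 remain for training
    stratify=labels,      # keep the class mix the same in both splits
    random_state=0,
)
print(len(train_ids), len(val_ids))   # 893 155
```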
4.2 Experimental environment and parameter settings

The CPU used for the experiments is an AMD Ryzen 9 3900XT 12-core processor at 3.80 GHz with 32 GB of memory, paired with an NVIDIA GeForce RTX 3080 Ti GPU. The PyTorch framework and CUDA are used to implement the detection model. Table 1 shows the detailed environment configuration.

TABLE 1 Environment configuration.

| Parameter | Configuration |
| --- | --- |
| Operating system | Windows 10 |
| PyTorch version | 1.12.1 |
| CUDA version | 11.7 |
| cuDNN version | 8.9.0 |
| Python version | 3.9.12 |
| OpenCV version | 4.6.0.66 |

The input image size is adjusted to 640 × 640, the initial learning rate is 0.01, and the final learning rate is 0.0001. The optimizer is SGD, and cosine learning rate decay is used to control the learning-rate schedule, with a momentum of 0.937 and a weight decay of 0.0005. Training runs for up to 1000 epochs with an early stopping mechanism. We use the same dataset and parameters in all training runs. The loss curve on the verification set is shown in Figure 8: the model fits quickly in the first 100 epochs and converges gently at about 200 epochs.

FIGURE 8 Loss curve on the verification set.
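These settings map directly onto standard PyTorch components; a minimal sketch follows, where the one-layer model is only a stand-in for the detector and warm-up, data loading and early stopping are omitted.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)            # placeholder for the detection network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.937, weight_decay=0.0005)
# cosine decay from the initial LR (0.01) down to the final LR (0.0001)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=1000, eta_min=0.0001)

for epoch in range(1000):
    # ... forward pass on 640x640 batches and loss.backward() would go here ...
    optimizer.step()
    scheduler.step()

print(scheduler.get_last_lr())         # [0.0001] after the full schedule
```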
4.3 Evaluation metrics

In this paper, we choose precision, recall and mean average precision (mAP) as the metrics to evaluate the performance of our model [31].

Suppose that TP and FP are the numbers of positive and negative samples, respectively, that are predicted as positive, and FN is the number of positive samples that are predicted as negative. The formulas for Precision, Recall and mAP are as follows:

Precision = \frac{TP}{TP + FP}, (14)

Recall = \frac{TP}{TP + FN}, (15)

AP = \int_0^1 P(R)\, dR, (16)

mAP = \frac{1}{N_C} \sum_{i=1}^{N_C} AP_i \times 100, (17)

where N_C is the number of classes.

mAP0.5 and mAP0.5-0.95 are used to evaluate the detection accuracy of the model at different IoU thresholds. mAP0.5 is the value calculated at an IoU threshold of 0.5. mAP0.5-0.95 is the mean of the mAP values at the ten IoU thresholds from 0.5 to 0.95 in steps of 0.05, which evaluates the detection capability of the model more comprehensively, as shown in Equation (18):

mAP_{0.5\text{-}0.95} = \frac{1}{10}\left(mAP_{0.5} + mAP_{0.55} + \dots + mAP_{0.95}\right). (18)
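To make Equations (14)-(17) concrete, the sketch below computes AP for one class at a single IoU threshold from per-detection true-positive flags; the flags, scores and ground-truth count are toy values, and matching detections to ground truth by IoU is assumed to have been done already.

```python
import numpy as np

def average_precision(tp_flags, scores, num_gt):
    """AP as the area under the precision-recall curve (Eq. (16)),
    using a monotone (interpolated) precision envelope."""
    order = np.argsort(-scores)                   # rank detections by confidence
    tp = np.asarray(tp_flags, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / num_gt                      # Eq. (15)
    precision = cum_tp / (cum_tp + cum_fp)        # Eq. (14)
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([1.0], precision, [0.0]))
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))

# toy example: four detections of one class, three ground-truth boxes
print(average_precision([1, 1, 0, 1], np.array([0.9, 0.8, 0.7, 0.6]), num_gt=3))
# mAP averages this over classes (Eq. (17)); mAP0.5-0.95 additionally averages
# over the IoU thresholds 0.5, 0.55, ..., 0.95 (Eq. (18)).
```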
4.4 Ablation experiments

To verify the effect of each improved module, we carried out eight ablation experiments on the subset of the SeaDronesSee dataset. With YOLOv5s as the benchmark, the backbone network is replaced with the swin transformer, the deformable convolution DCNv2 is adopted in the neck network instead of traditional convolution, and SIoU takes the place of the CIoU loss function. The experimental results are shown in Table 2, where + denotes adding the corresponding module.

TABLE 2 Results of ablation experiment.

| Serial number | Model | Precision | Recall | mAP0.5 | mAP0.5-0.95 |
| --- | --- | --- | --- | --- | --- |
| 1 | YOLOv5s | 82.3 | 71.1 | 74.9 | 39.6 |
| 2 | YOLOv5s+Swin transformer | 81.8 | 73.4 | 77.6 | 41.2 |
| 3 | YOLOv5s+DCNv2 | 89.3 | 71.8 | 75.8 | 41.3 |
| 4 | YOLOv5s+SIoU | 84.7 | 73.3 | 75.5 | 41.8 |
| 5 | YOLOv5s+Swin transformer+DCNv2 | 85.3 | 74.4 | 77.4 | 42.1 |
| 6 | YOLOv5s+Swin transformer+SIoU | 90.6 | 74.4 | 77.7 | 41.8 |
| 7 | YOLOv5s+DCNv2+SIoU | 82.7 | 73.9 | 75.9 | 41.4 |
| 8 | YOLOv5s-SwinDS | 87.9 | 75.8 | 79.1 | 42.9 |

Replacing the backbone network of the YOLOv5s algorithm with the swin transformer slightly decreases the Precision of YOLOv5s+Swin transformer by 0.5%, while the Recall, mAP0.5 and mAP0.5-0.95 increase significantly by 2.3%, 2.7%, and 1.6%, respectively.

Irregular targets account for a relatively high proportion of the dataset, so we adopt the deformable convolution DCNv2 instead of traditional convolution to improve the recognition capability for irregular objects at the neck network output. The Precision, Recall, mAP0.5 and mAP0.5-0.95 of YOLOv5s+DCNv2 increase by 7%, 0.7%, 0.9%, and 1.7%, respectively, compared with YOLOv5s.

Considering the direction and angle between regression vectors, SIoU introduces the vector angle between the real box and the predicted box, which improves the convergence speed by constraining the predicted box along the X or Y axis. The Precision, Recall, mAP0.5 and mAP0.5-0.95 of YOLOv5s+SIoU increase by 2.4%, 2.2%, 0.6%, and 2.2%, respectively, over YOLOv5s.

Combining the three improvements, the performance of YOLOv5s+Swin transformer+DCNv2+SIoU improves greatly: the Precision, Recall, mAP0.5, and mAP0.5-0.95 increase by 5.6%, 4.7%, 4.2% and 3.3% compared with YOLOv5s.

The ablation results in Table 2 show that each improvement module enhances the network. After experimental verification, the Precision, Recall, mAP0.5 and mAP0.5-0.95 of the YOLOv5s-SwinDS model on held-out data are similar to those on the validation set in this experiment.

4.5 Experimental comparison

The YOLOv7 and YOLOv8 series models are relatively advanced target detection models at present. To estimate the effectiveness and performance of our YOLOv5s-SwinDS model, we conducted comparison experiments between the YOLOv5s-SwinDS model and the YOLOv7 and YOLOv8 series models.

No pretrained weights are used in the comparison experiments, to ensure fairness. The comparison results are shown in Table 3. YOLOv5s-SwinDS is superior to the other algorithms in terms of Recall, mAP0.5 and mAP0.5-0.95, while its Precision is lower than that of some algorithms, which demonstrates the superiority of our model.

TABLE 3 Evaluation metrics results.

| Serial number | Model | Precision | Recall | mAP0.5 | mAP0.5-0.95 |
| --- | --- | --- | --- | --- | --- |
| 1 | YOLOv5s | 82.3 | 71.1 | 74.9 | 39.6 |
| 2 | YOLOv7 | 90.7 | 68.4 | 76.7 | 39.4 |
| 3 | YOLOv7tiny | 94.5 | 65.7 | 71.4 | 37.1 |
| 4 | YOLOv7X | 81.4 | 73.8 | 77.6 | 41.0 |
| 5 | YOLOv8n | 74.8 | 58.5 | 60.1 | 33.5 |
| 6 | YOLOv8s | 86.2 | 56.4 | 67.6 | 38.6 |
| 7 | YOLOv8m | 74.1 | 60.4 | 67.7 | 41.4 |
| 8 | YOLOv8l | 81.1 | 58.6 | 66.8 | 40.0 |
| 9 | YOLOv8x | 90.9 | 58.1 | 71.7 | 42.4 |
| 10 | YOLOv5s-SwinDS | 87.9 | 75.8 | 79.1 | 42.9 |
4.6 Visual comparisons

To compare and evaluate the improvement of our model more intuitively, we selected seven images from the SeaDronesSee dataset for testing, as shown in Figures 9 to 15.

FIGURE 9 Comparison results of the first scene: (a) YOLOv5s and (b) YOLOv5s-SwinDS.
FIGURE 10 Comparison results of the second scene: (a) YOLOv5s and (b) YOLOv5s-SwinDS.
FIGURE 11 Comparison results of the third scene: (a) YOLOv5s and (b) YOLOv5s-SwinDS.
FIGURE 12 Comparison results of the fourth scene: (a) YOLOv5s and (b) YOLOv5s-SwinDS.
FIGURE 13 Comparison results of the fifth scene: (a) YOLOv5s and (b) YOLOv5s-SwinDS.
FIGURE 14 Comparison results of the sixth scene: (a) YOLOv5s and (b) YOLOv5s-SwinDS.
FIGURE 15 Comparison results of the seventh scene: (a) YOLOv5s and (b) YOLOv5s-SwinDS.

Figure 9a,b shows that YOLOv5s failed to detect the boat target because of light reflection from the sea surface, while YOLOv5s-SwinDS detected it. Figure 10a,b shows that YOLOv5s does not work well for swimmer targets with few pixels, and Figure 11a,b likewise shows that YOLOv5s was unable to detect some swimmer targets because of sunlight reflection and the small size of the objects.

In addition, Figures 12a,b and 13a,b show that YOLOv5s repeatedly detected the same swimmer targets multiple times. As shown in Figures 14a,b and 15a,b, YOLOv5s produced erroneous results in which a life_saving_appliances target and a buoy target were detected as swimmers, while YOLOv5s-SwinDS obtains correct detections.

These detection results show that the YOLOv5s-SwinDS algorithm is better at detecting small targets and accurately detects small targets missed by YOLOv5s, which reduces the miss rate of small-target detection. The YOLOv5s algorithm failed to detect some expected targets because of undesirable factors such as light reflection.

The improved YOLOv5s-SwinDS algorithm can effectively deal with the interference of complex backgrounds and successfully detect the expected targets, because the backbone network is replaced with the swin transformer, which lets the model prioritize the region of interest under a complex background, increases its weight, and suppresses the influence of noise.

Compared with the original YOLOv5s model, although the YOLOv5s-SwinDS model improves Precision by 5.6%, Recall by 4.7%, mAP0.5 by 4.2% and mAP0.5-0.95 by 3.3%, it also has some limitations.
1. The weight file obtained after training the YOLOv5s-SwinDS model is 64.1 MB, significantly larger than the weight file obtained after training the original YOLOv5s model (14.5 MB).

2. The floating point operations (FLOPs) of the YOLOv5s-SwinDS model are 79.0 GFLOPs, significantly higher than those of the original YOLOv5s model (15.8 GFLOPs).

The computing power carried by drones is limited, so overcoming these two limitations and deploying the YOLOv5s-SwinDS model on the onboard computer is the next focus of our work.
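A quick way to quantify the first limitation for any candidate model is to count its parameters and measure the saved state_dict, as in the hedged sketch below; the two-layer network is only a stand-in for the detector, and FLOPs would additionally need a profiler such as fvcore or thop (not shown).

```python
import os
import torch
import torch.nn as nn

model = nn.Sequential(                          # stand-in for the detection network
    nn.Conv2d(3, 64, 3, padding=1), nn.SiLU(),
    nn.Conv2d(64, 128, 3, padding=1), nn.SiLU(),
)

n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e6:.2f} M")

torch.save(model.state_dict(), "demo_weights.pt")
size_mb = os.path.getsize("demo_weights.pt") / 1e6
print(f"weight file: {size_mb:.1f} MB")         # compare 14.5 MB (YOLOv5s) vs 64.1 MB (SwinDS)
```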
5 CONCLUSION

With the development of unmanned aerial vehicles (UAVs) and remote sensing techniques, target detection in aerial images taken by UAVs is widely applied and has significant value in transportation planning, military reconnaissance, and environmental monitoring. In this paper, we apply deep learning and propose YOLOv5s-SwinDS, a target search algorithm for maritime distress based on YOLOv5s with a swin transformer. Firstly, the backbone network of the YOLOv5s algorithm is replaced by the swin transformer, and a multi-level feature fusion module is introduced to enhance the feature expression ability of the model for maritime distress targets. Secondly, DCNv2 is used instead of traditional convolution to improve the recognition ability for irregular targets when the neck network features are output. Finally, the CIoU loss function is replaced with SIoU to effectively reduce redundant boxes while accelerating the convergence and regression of the predicted box. Experimenting on a subset of the publicly available SeaDronesSee dataset, our proposed YOLOv5s-SwinDS model is superior to the original YOLOv5s model, the YOLOv7 series of models, and the YOLOv8 series of models, with better recognition efficiency and speed, and can be widely used for maritime distress target detection.

AUTHOR CONTRIBUTIONS
Kun Liu: Formal analysis; methodology; project administration; validation; visualization; writing - original draft. Yueshuang Qi: Data curation; visualization; writing - original draft; writing - review and editing. Guofeng Xu: Conceptualization; validation. Jianglong Li: Supervision.

CONFLICT OF INTEREST STATEMENT
The authors declare no conflicts of interest.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Experimental dataset: https://pan.baidu.com/s/1JnuHeCChTsnPwtjWt8HaUQ?pwd=yggf
YOLOv5s-SwinDS model source code: https://github.com/liukun6606/YOLOv5s-SwinDS

ORCID
Yueshuang Qi: https://orcid.org/0000-0003-0559-5675

REFERENCES
1. Cao, D., Ren, X., Zhu, M., Song, W.: Visual question answering research on multi-layer attention mechanism based on image target features. Hum.-centric Comput. Inf. Sci. 11, 11 (2021). https://doi.org/10.2296/HCIS.2021.11.011
2. Yuan, H., Zhou, H., Cai, Z., Zhang, S., Wu, R.: Dynamic pyramid attention networks for multi-orientation object detection. J. Internet Technol. 23(1), 79-90 (2022)
3. Wang, J., Zou, Y., Lei, P., Sherratt, R.S., Wang, L.: Research on recurrent neural network based crack opening prediction of concrete dam. J. Internet Technol. 21(4), 1161-1169 (2020)
4. Wang, J., Yang, Y., Wang, T., Sherratt, R.S., Zhang, J.: Big data service architecture: A survey. J. Internet Technol. 21(2), 393-405 (2020)
5. Zhang, J., Zhong, S., Wang, T., Chao, H.-C., Wang, J.: Blockchain-based systems and applications: A survey. J. Internet Technol. 21(1), 1-14 (2020)
6. Wang, J., Zhao, C., He, S., Gu, Y., Alfarraj, O., Abugabah, A.: LogUAD: Log unsupervised anomaly detection based on Word2Vec. Comput. Syst. Sci. Eng. 41(3), 1207-1222 (2022)
7. Wang, J., Li, W., Zhang, M., Tao, R., Chanussot, J.: Remote sensing scene classification via multi-stage self-guided separation network. IEEE Trans. Geosci. Remote Sens. 61, 5615312 (2023)
8. Wang, J., Li, W., Wang, Y., Tao, R., Du, Q.: Representation-enhanced status replay network for multisource remote-sensing image classification. IEEE Trans. Neural Netw. Learn. Syst. (2023). https://doi.org/10.1109/TNNLS.2023.3286422
9. Zhang, M., Li, W., Zhang, Y., Tao, R., Du, Q.: Hyperspectral and LiDAR data classification based on structural optimization transmission. IEEE Trans. Cybern. 53(5), 3153-3164 (2022)
10. Otote, D.A., Li, B., Ai, B., Gao, S., Xu, J., Chen, X., Lv, G.: A decision-making algorithm for maritime search and rescue plan. Sustainability 11(7), 2084 (2019)
11. Rahmes, M.D., Chester, D., Hunt, J., Chiasson, B.: Optimizing cooperative cognitive search and rescue UAVs. In: Autonomous Systems: Sensors, Vehicles, Security and the Internet of Everything. SPIE, Bellingham, WA (2018)
12. Dai, J., Xu, F., Chen, Q.: Multi-UAV cooperative search on region division and path planning. Acta Aeronaut. Astronaut. Sin. 41(S1), 149-156 (2020)
13. Mao, G., Deng, T., Yu, N.: Object detection in UAV images based on multi-scale split attention. Acta Aeronaut. Astronaut. Sin. 43(12), 326738 (2022)
14. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580-587. IEEE, Piscataway, NJ (2014)
15. Girshick, R.: Fast R-CNN. In: Proceedings of the 2015 IEEE International Conference on Computer Vision, pp. 1440-1448. IEEE, Piscataway, NJ (2015)
16. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137-1149 (2017)
17. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp. 779-788. IEEE, Piscataway, NJ (2016)
18. Redmon, J., Farhadi, A.: YOLO9000: Better, faster, stronger. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263-7271. IEEE, Piscataway, NJ (2017)
19. Redmon, J., Farhadi, A.: YOLOv3: An incremental improvement. arXiv:1804.02767 (2018)
20. Bochkovskiy, A., Wang, C.-Y., Liao, H.-Y.M.: YOLOv4: Optimal speed and accuracy of object detection. arXiv:2004.10934 (2020)
21. Khalfaoui, A., Badri, A., Mourabit, I.E.: Comparative study of YOLOv3 and YOLOv5's performances for real-time person detection. In: Proceedings of the 2nd International Conference on Innovative Research in Applied Science, Engineering and Technology (IRASET), pp. 1-5. IEEE, Piscataway, NJ (2022)
22. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., Berg, A.C.: SSD: Single shot multibox detector. In: Proceedings of the 14th European Conference on Computer Vision - ECCV 2016, pp. 21-37. Springer, Cham (2016)
23. Liu, S., Qi, L., Qin, H., Shi, J., Jia, J.: Path aggregation network for instance segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8759-8768. IEEE, Piscataway, NJ (2018)
24. Jocher, G., Nishimura, K., Mineeva, T., Vilariño, R.: YOLOv5. Code repository (2020)
25. Zhihong, X., Xiafei, T., et al.: Anchor-free scale adaptive pedestrian detection algorithm. J. Control Decis. 36(2), 295-302 (2021)
26. Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., Savarese, S.: Generalized intersection over union: A metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 658-666. IEEE, Piscataway, NJ (2019)
27. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012-10022. IEEE, Piscataway, NJ (2021)
28. Zhu, X., Hu, H., Lin, S., Dai, J.: Deformable ConvNets v2: More deformable, better results. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9308-9316. IEEE, Piscataway, NJ (2019)
29. Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., Ren, D.: Distance-IoU loss: Faster and better learning for bounding box regression. Proc. AAAI Conf. Artif. Intell. 34, 12993-13000 (2020)
30. Gevorgyan, Z.: SIoU loss: More powerful learning for bounding box regression. arXiv:2205.12740 (2022)
31. Lin, S., Liu, M., Tao, Z.: Detection of underwater treasures using attention mechanism and improved YOLOv5. Trans. Chin. Soc. Agric. Eng. 37(18), 307-314 (2021)

How to cite this article: Liu, K., Qi, Y., Xu, G., Li, J.: YOLOv5s maritime distress target detection method based on swin transformer. IET Image Process. 1-10 (2024). https://doi.org/10.1049/ipr2.13024