DE GRUYTER. Current Directions in Biomedical Engineering 2023;9(1): 634-637. Open Access. © 2023 The Author(s), published by De Gruyter. This work is licensed under the Creative Commons Attribution 4.0 International License.

Susann Kriebisch*, Ralf Bruder, and Floris Ernst

Generation of 3D models of victims within their surroundings at rescue sites

*Corresponding author: Susann Kriebisch, University of Luebeck, Ratzeburger Allee 160, 23562 Luebeck, Germany, e-mail: susann.kriebisch@student.uni-luebeck.de
Ralf Bruder, Floris Ernst, Institute for Robotics and Cognitive Systems, University of Luebeck, Ratzeburger Allee 160, 23562 Luebeck, Germany

Abstract: The use of state-of-the-art technology in emergencies enables fast assessment of the situation. Modern rescue robots, but also rescuers with camera systems, record vast amounts of image data. In addition to the benefits for rescue operations, a quick overview based on this data can help medical staff in the search for suitable treatment strategies. However, such data must be examined and fused into an intuitive display. In this work, exemplary image streams of RGB-D cameras are searched for victims and transferred into separate 3D models per detected person for subsequent intuitive viewing, rotating, and zooming. The method was tested and preliminarily evaluated for its functionality, detail accuracy and readability of textures at the German Rescue Robotics Center.

Keywords: Locating Victims, Rescue Robotics, Rescue Site 3D Models, German Rescue Robotics Center

1 Introduction

In the context of rescue scenarios with accident victims, it is reasonable to record video material on site, capturing the situation at the scene. Rescue robots, like those of the German Rescue Robotics Center [1], record huge amounts of image data at the scene of accidents so that it can be used to directly guide the rescue forces on site. For attending physicians, it is impracticable to view all the collected video data. Yet, it can be useful for them to view a victim at the scene, enabling conclusions to be drawn about the type and the course of the accident. Sometimes, additional information is provided by warning signs or chemicals close by.

The continuous development of the DICOM standard, specifically the work done by Workgroup 17 [2], enables the integration of 3D objects into a patient's medical files. Combining this new option with the opportunity presented by the image data recorded at accident scenes, it is desirable to fuse the recorded image data into a handy 3D model format.

Fig. 1: Simulated rescue scenario at the German Rescue Robotics Center. The victim is overlaid with his detected skeleton.
2 Methods

Using the example of an RGB-D depth imaging camera, the aim of this work is to convert image data into a 3D model of the location of a victim, and thus into a data format that is easy to handle for medical professionals, so that they can view the scene accurately from all sides by rotating and zooming. A major aim is to ensure that the textures of the 3D model are of high quality, so that text on warning signs and packaging remains legible as far as possible.

To accomplish the goal of generating victim-specific 3D models, a series of problems needs to be solved. The assignment of the image data is done in two steps: first, victims in a rescue scene need to be recognized (section 2.1) and the detections filtered for errors (section 2.2). In section 2.3 the observing camera's own position is determined, and with this information individual victims are separated by their location in section 2.4, which is done by clustering. Finally, in section 2.5 the transformation from RGB-D image data into textured 3D models can be performed.

2.1 Human detection

To reliably identify the RGB-D images displaying victims, humans must be detected by the system at the rescue site. It is important to consider the conditions in which they are to be found. A rescue site might be devastated, with debris that might also cover parts of a victim. The victims themselves may be in a condition that deviates from normal regarding vital signs and appearance, and are typically not able to move by themselves.

There are numerous ways to detect humans known in the literature. Sensors like thermal cameras are an option for detecting humans according to their body temperature. This, however, requires additional hardware and remains far from ideal, since body temperature is also affected by the condition of the victim or the surroundings of the rescue site. In favor of simplicity and cost, sticking with only RGB-D data as the source of information is desirable. For this, the use of pre-trained object detectors, like YOLO [3], is a common option.

Thinking one step further into the future, however, knowing not only a victim's location but also its pose bears additional benefits for the medical documentation of a victim. For tracking individual body parts, two common skeleton trackers, Nuitrack by 3DiVi and Openpose [4], were considered and tested regarding their suitability for rescue scenarios with victims in various poses. Nuitrack did not perform well for the highly relevant lying poses. Openpose was not only able to identify humans in all tested poses, including lying, sitting, and standing positions, but also had a high detection rate for victims whose limbs, parts of the thorax or the head were covered by other objects such as debris. Due to its overall best capabilities, Openpose was chosen for the human detection. The detected skeletons come with certainty scores specific to each of the 25 detectable points per skeleton, the 'joints'. The detection takes place on single RGB images of an image stream, as can be seen in Figure 1. One problem for identifying every skeleton, unfortunately, is the high rate of false detections, where not only humans on posters but also other non-human objects were identified as skeletons.
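As an illustration of what this detection step yields, the following minimal sketch parses Openpose's per-frame JSON export into per-person arrays of (x, y, certainty) triples, assuming Openpose was run with the BODY_25 model and JSON output enabled; the file name is hypothetical and not from the paper.

```python
import json
import numpy as np

def load_skeletons(json_path):
    """Parse one Openpose JSON output file into a list of (25, 3) arrays.

    Each row holds (x, y, confidence) for one BODY_25 joint; joints that
    Openpose could not see are reported as (0, 0, 0).
    """
    with open(json_path) as f:
        frame = json.load(f)
    skeletons = []
    for person in frame.get("people", []):
        kp = np.asarray(person["pose_keypoints_2d"], dtype=float).reshape(-1, 3)
        skeletons.append(kp)
    return skeletons

# Example: count how many joints per detected skeleton exceed 50% certainty
for skel in load_skeletons("frame_000123_keypoints.json"):
    confident = int((skel[:, 2] > 0.5).sum())
    print(f"skeleton with {confident} joints above 50% certainty")
```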
2.2 Filtering skeleton detections

To handle false detections in the Openpose output, two methods were applied. A simple first step to counteract errors is to discard skeleton detections that do not contain a minimum number of detected joints with a certainty score above a defined threshold. Setting the minimum number of detected joints per valid skeleton to five and the certainty threshold to above 50% proved to be a useful measure in our use cases. A further increase in either number would also prevent partially covered victims from being detected and is therefore undesirable.

Secondly, since Openpose's detection algorithm works solely on color data without taking depth information into account, filtering skeleton detections with unrealistic body-part dimensions has a great impact. Using the depth information from the RGB-D camera, the lengths of detected body parts were calculated as the Euclidean distance between the corresponding joints. They were then compared with realistic body dimensions derived from the DIN 33402-2 norm, with an additional safety margin of up to 20 cm per body part. These two measures were able to filter out most of the false detections.
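A minimal sketch of this two-stage filter is shown below. It assumes a depth image registered to the color frame and given in metres, pinhole intrinsics (fx, fy, cx, cy), and the BODY_25 joint indexing; the segment length limits are placeholders standing in for the DIN 33402-2 values plus the 20 cm margin, not the figures used in the paper.

```python
import numpy as np

# Placeholder upper bounds (metres) per skeleton segment; in the paper these
# are derived from DIN 33402-2 body dimensions plus a 20 cm safety margin.
MAX_SEGMENT_LENGTH = {("Neck", "MidHip"): 1.05, ("RShoulder", "RElbow"): 0.60}
BODY25_INDEX = {"Neck": 1, "RShoulder": 2, "RElbow": 3, "MidHip": 8}

def backproject(u, v, z, fx, fy, cx, cy):
    """Lift a pixel (u, v) with depth z (metres) into the camera frame."""
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

def is_plausible(skel, depth, intrinsics, min_joints=5, min_score=0.5):
    """Stage 1: enough confident joints. Stage 2: realistic segment lengths."""
    confident = skel[:, 2] > min_score
    if confident.sum() < min_joints:
        return False
    for (a, b), limit in MAX_SEGMENT_LENGTH.items():
        ia, ib = BODY25_INDEX[a], BODY25_INDEX[b]
        if not (confident[ia] and confident[ib]):
            continue  # cannot check a segment whose endpoints were not seen
        pts = []
        for i in (ia, ib):
            u, v = int(skel[i, 0]), int(skel[i, 1])
            z = depth[v, u]  # depth image aligned to the color image, in metres
            if z <= 0:
                break  # missing depth reading, skip this segment
            pts.append(backproject(u, v, z, *intrinsics))
        if len(pts) == 2 and np.linalg.norm(pts[0] - pts[1]) > limit:
            return False
    return True
```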
2.3 Self-positioning on RGB-D data

The next step towards generating a victim-specific 3D model is to identify which of the color and depth images correspond to which individual victim at the rescue site. As Openpose is not able to re-identify a specific person across frames, it is necessary to know which images were taken at which location within the rescue site. The Openpose output contains the coordinates of each detected skeleton joint in the local coordinate frame of the camera; the location of these images in the global coordinate frame of the rescue site is not provided. In order to transform skeleton detections from the local to the global coordinate frame, odometry information must be acquired. There are several options that involve adding a source of information to the system, such as GPS. GPS, however, does not perform sufficiently well when odometry information is to be gathered indoors.

Odometry information can also be retrieved from the color and depth information that is already acquired as the basis for the 3D model generation. A variety of visual SLAM algorithms have emerged over time [5]. Specifically, the software package RTAB-Map [6] is applicable and usable from within the Robot Operating System (ROS), a framework that enables the management and processing of data from various sensors and is commonly used in robotic applications. With the odometry information RTAB-Map provides, it becomes possible to identify where at the rescue site the images were recorded and, using the depth information, to generate a ground plan of the site. The upper image in Figure 2 shows a generated obstacle map with the detected skeleton joints overlaid in a global world coordinate system. In addition to 2D maps, it is possible to match the color and depth image pairs with the locations where they were taken and thus generate a colorized 3D map of the rescue site as a point cloud.
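To illustrate the frame transformation this step enables, the sketch below maps a joint position from the camera frame into the rescue-site frame, assuming the camera pose is available as a translation plus an (x, y, z, w) quaternion, as a ROS odometry topic would provide it. The helper names and the usage values are mine, chosen for illustration only.

```python
import numpy as np

def pose_to_matrix(translation, quaternion):
    """Build a 4x4 camera-to-world transform from a position and an
    (x, y, z, w) orientation quaternion."""
    x, y, z, w = quaternion
    R = np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w),     2 * (x * z + y * w)],
        [2 * (x * y + z * w),     1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w),     2 * (y * z + x * w),     1 - 2 * (x * x + y * y)],
    ])
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = translation
    return T

def to_global(point_cam, T_world_cam):
    """Map a 3D joint position from the camera frame into the rescue-site frame."""
    p = np.append(point_cam, 1.0)
    return (T_world_cam @ p)[:3]

# Usage: a joint seen 2 m in front of a camera that sits 1.5 m along the world x axis
T = pose_to_matrix([1.5, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0])
print(to_global(np.array([0.0, 0.0, 2.0]), T))  # -> [1.5, 0.0, 2.0]
```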
2.4 Clustering skeleton detections

To differentiate between individual victims based on the skeleton detections of section 2.2 and the self-positioning of section 2.3, clustering can be used. For this, the skeleton joint coordinates are transformed into global 3D points within the rescue site. Assuming the height above ground of a victim to be irrelevant at the rescue site, it is sufficient to work with global 2D instead of 3D points. As the number of victims is unknown, a clustering algorithm has to be chosen that does not require the number of resulting clusters as an input parameter. The Mean Shift algorithm [7] is a suitable choice. Its basic principle is to identify all points within a given radius of a randomly chosen starting point. The mean position of all points within that radius is calculated, which then serves as the new point around which all points within the radius are identified. In this way, a vector field can be generated in which the vector at each point points towards its cluster center. Once the algorithm has converged to a cluster center, it starts anew with a random, not previously visited point, until all points of the data set have been inspected and thus clustered.

Fig. 2: From top to bottom: obstacle map for orientation, detected skeleton parts shown with color-coded likelihoods in the map, and the derived clusters of localized human detections.

The only parameter Mean Shift requires is its radius. A reasonable consideration for this radius is the expected maximum distance between two Openpose joint detections belonging to the same person. Among the Openpose joint points, the pair that can lie furthest apart are the neck and the mid-hip. Consulting the norm DIN 33402-2 once again, in a sitting position 855 mm covers the distance between eye level and chair for 95% of German men between the ages of 18 and 65 years. Choosing the radius is a trade-off between possibly identifying several individuals as one, should they be located too close to each other, and identifying more individuals than are present, due to imperfect odometry. The former is more acceptable when trying to identify the locations of victims at the rescue site itself, so that rescue staff are provided with positions to head towards. The latter might be more suitable for the generation of 3D models of single patients, which should not contain representations of other patients. However, in cases where several victims lie on top of or partially cover each other, this argument is weak, since the other victims' bodies are likely to have an impact on the condition of a single victim. In our use case, where victims did not occur in groups, a suitable choice of the radius proved to be 100 cm. Using this as an input parameter, Mean Shift makes it possible to find a location for each victim, the corresponding cluster center. This can be seen in the bottom image of Figure 2.
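As a sketch of this clustering step, the snippet below runs scikit-learn's MeanShift with a 1 m bandwidth (the radius discussed above) on invented ground-plane joint coordinates; the paper does not specify a particular implementation, so the library choice and the example data are assumptions.

```python
import numpy as np
from sklearn.cluster import MeanShift

# Ground-plane (x, y) positions of all filtered joint detections, in metres.
# The coordinates below are invented purely to illustrate the call.
joints_2d = np.array([
    [2.1, 3.0], [2.3, 3.2], [1.9, 2.8],   # detections around one victim
    [7.5, 1.1], [7.8, 1.0], [7.6, 1.4],   # detections around another victim
])

# Bandwidth plays the role of the 100 cm radius discussed above.
ms = MeanShift(bandwidth=1.0)
labels = ms.fit_predict(joints_2d)

print("victim locations (cluster centers):", ms.cluster_centers_)
print("joint-to-victim assignment:", labels)
```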
2.5 Person-centric 3D model creation

The main goal of this paper is the fusion of color and depth images recorded around stationary victims in a rescue scenario into three-dimensional overall models specific to each of those victims at the site. The resulting 3D views aim to depict detailed color information on the surface. To achieve this, the point clouds generated from the color, depth, and odometry data were finely adjusted to each other using the Iterative Closest Point [8] and Sparse Bundle Adjustment [9] algorithms. Victim-specific point clouds can be cropped around each victim, using the Mean Shift cluster centers, which determine the location of each victim, as reference points. After that, the point clouds can be transformed into 3D surface models using tessellation. In doing so, the relations of individual points to each other are determined, so that triangles forming the surface of a 3D object can be identified. The last step is the projection of color information onto the surface. The necessary algorithms for this are provided by the RTAB-Map toolbox [6]. It is then possible to filter out noise in the form of coherent tessellated constructs below a certain size. To optimize the resulting 3D model, it is advisable to apply surface smoothing techniques so that flat surfaces are represented with a minimum of triangles. This makes the color projection less distorted and renders possible writing and warning signs in the surroundings of a victim more clearly visible. Parameters that proved to be useful can be found in Table 1.

The resulting 3D models can be saved in several file formats, including the OBJ format by Wavefront, which is DICOM-compatible since version 2020a.

Tab. 1: Useful RTAB-Map parameters for 3D model generation
- Post-processing: detect more loop closures: 1 m cluster radius, 30 degrees cluster angle, 5 iterations; refine links with ICP: true; Sparse Bundle Adjustment (SBA): g2o, 20 iterations, 1.00 pixel variance
- Cloud filtering: search radius: 5 cm; minimum neighbours in radius: 5
- Cloud smoothing: MLS search radius: 4 cm
- Meshing: surface reconstruction approach: Poisson; transferring color radius: 3 cm; texture mapping: true
- Poisson surface reconstruction: target polygon size: 3 cm; point weight: 4; min depth: 5
- Texture mapping: output texture size: 8192x8192; maximum distance from the camera for polygons to be textured by this camera: 3 m; minimum polygon cluster size for texturing: 50; distance-to-camera policy: true
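Purely to illustrate the cropping step described above, the sketch below cuts a victim-specific sub-cloud out of a global rescue-site point cloud around one Mean Shift cluster center, using Open3D; the library choice, the 1.5 m crop radius, and the file names are assumptions for illustration, whereas the paper performs this step with RTAB-Map's own tools.

```python
import numpy as np
import open3d as o3d

def crop_victim_cloud(scene_path, center_xy, radius=1.5):
    """Cut a victim-specific sub-cloud out of the global rescue-site cloud.

    center_xy is a Mean Shift cluster center on the ground plane (metres);
    the crop keeps every point whose horizontal distance to it is below radius.
    """
    scene = o3d.io.read_point_cloud(scene_path)
    pts = np.asarray(scene.points)
    dist = np.linalg.norm(pts[:, :2] - np.asarray(center_xy), axis=1)
    keep = np.where(dist < radius)[0]
    return scene.select_by_index(keep)

# Usage with hypothetical file names and the cluster center from the example above
victim = crop_victim_cloud("rescue_site.ply", center_xy=[2.1, 3.0])
o3d.io.write_point_cloud("victim_01.ply", victim)
```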
3 Evaluation

The evaluation aims to demonstrate the feasibility of generating detailed 3D models of victims in realistic rescue scenarios for medical purposes. A simulated rescue site at the German Rescue Robotics Center [1] containing victims in typical poses was chosen for generating RGB-D data streams, either from aboard a rescue robot or from a hand-carried camera operated by rescue staff. On each stream, the process of human detection, filtering, self-localization, clustering and finally 3D model creation was performed. Human detection worked reliably through Openpose and the subsequent filtering. The estimation of visual odometry was in some cases affected by light conditions or by unsteady camera movements along the chosen trajectories. These findings coincide with the difficulties described in the literature [10]. Nonetheless, neither the resulting self-localization inaccuracies nor the inaccuracies in the skeleton detections degraded the clustering enough to be a drawback for the generation of the 3D models. Detailed 3D models were successfully created from the individual data streams per victim. The parameters from Table 1 enabled the detailed visualization of victim models at a rescue scene, as can be seen in Figure 3.

Fig. 3: Freely rotatable 3D model of a victim (left) and detailed views with good readability of text and drawings (right).

4 Conclusion

Using only one RGB-D data stream, it is possible to detect and locate victims within rescue sites. Knowing the location of each victim, a 3D model displaying them at the scene and capturing their immediate surroundings can be generated with a high level of detail. Such 3D models can easily be incorporated into medical files, as they are DICOM-compatible.

Author Statement

Research funding: This work is funded by the Federal Ministry of Education and Research (BMBF) under grant 13N14862 (A-DRZ), cf. https://rettungsrobotik.de. We thank our project partners and collaborators. Conflict of interest: Authors state no conflict of interest. Informed consent: Informed consent has been obtained from all individuals included in this study.

References

[1] Kruijff-Korbayová, I., et al. "German Rescue Robotics Center (DRZ): A holistic approach for robotic systems assisting in emergency response." 2021 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR). IEEE, 2021.
[2] Mustra, M., Kresimir, D., and Mislav, G. "Overview of the DICOM standard." 2008 50th International Symposium ELMAR. Vol. 1. IEEE, 2008.
[3] Jocher, G., Chaurasia, A., Qiu, J. (2023). YOLO by Ultralytics (Version 8.0.0) [Computer software]. https://github.com/ultralytics/ultralytics
[4] Cao, Zhe, et al. "Realtime multi-person 2D pose estimation using part affinity fields." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
[5] Ragot, Nicolas, et al. "Benchmark of visual SLAM algorithms: ORB-SLAM2 vs. RTAB-Map." 2019 Eighth International Conference on Emerging Security Technologies (EST). IEEE, 2019.
[6] Labbé, Mathieu, and François Michaud. "RTAB-Map as an open-source lidar and visual simultaneous localization and mapping library for large-scale and long-term online operation." Journal of Field Robotics 36.2 (2019): 416-446.
[7] Comaniciu, D., and Meer, P. "Mean shift: A robust approach toward feature space analysis." IEEE Transactions on Pattern Analysis and Machine Intelligence 24(5): 603-619, 2002.
[8] Zhang, Z. (1994). Iterative point matching for registration of free-form curves and surfaces. International Journal of Computer Vision, 13(2), 119-152.
[9] Konolige, K., Garage, W. (2010, September). Sparse Bundle Adjustment. In BMVC (Vol. 10, pp. 102-1).
[10] Merzlyakov, A., Macenski, S. (2021, September). A comparison of modern general-purpose visual SLAM approaches. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 9190-9197). IEEE.