Complex & Intelligent Systems
https://doi.org/10.1007/s40747-023-01230-0

ORIGINAL ARTICLE

EGFA-NAS: a neural architecture search method based on explosion gravitation field algorithm

Xuemei Hu1 · Lan Huang1,2 · Jia Zeng1 · Kangping Wang1 · Yan Wang1,3

Received: 3 March 2023 / Accepted: 3 September 2023 / Published online: 30 September 2023
© The Author(s) 2023

Abstract
Neural architecture search (NAS) is an extremely complex optimization task. Recently, population-based optimization algorithms, such as evolutionary algorithms, have been adopted as search strategies for designing neural networks automatically. Various population-based NAS methods are promising in searching for high-performance neural architectures. The explosion gravitation field algorithm (EGFA), inspired by the formation process of planets, is a novel population-based optimization algorithm with excellent global optimization capability and remarkable efficiency compared with classical population-based algorithms such as GA and PSO. Thus, this paper attempts to develop a more efficient NAS method, called EGFA-NAS, by utilizing the work mechanisms of EGFA: it relaxes the discrete search space to a continuous one and then utilizes EGFA and gradient descent in conjunction to optimize the weights of the candidate architectures. To reduce the computational cost, a training strategy that utilizes the population mechanism of EGFA-NAS is proposed. In addition, a weight inheritance strategy for the newly generated dust individuals is proposed for the explosion operation to improve performance and efficiency. The performance of EGFA-NAS is investigated in two typical micro search spaces, NAS-Bench-201 and DARTS, and compared with various kinds of state-of-the-art NAS competitors. The experimental results demonstrate that EGFA-NAS is able to match or outperform the state-of-the-art NAS methods on image classification tasks with a remarkable efficiency improvement.

Keywords: Neural architecture search · Explosion gravitation field algorithm · Complex optimization task · Deep neural networks

Corresponding authors: Lan Huang (huanglan@jlu.edu.cn) and Yan Wang (wy6868@jlu.edu.cn)
Xuemei Hu (huxm18@mails.jlu.edu.cn) · Jia Zeng (zengjia22@mails.jlu.edu.cn) · Kangping Wang (wangkp@jlu.edu.cn)

1 College of Computer Science and Technology, Jilin University, Changchun 130012, China
2 Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
3 School of Artificial Intelligence, Jilin University, Changchun 130012, China

Introduction

Deep neural networks (DNNs) have made significant progress in various challenging tasks, including image classification [1–4], object detection [5–7], and segmentation [8, 9]. One of the key factors behind this progress lies in the innovation of neural architectures.
For example, VGGNet [1] suggested the use of smaller convolutional filters and stacked a series of convolution layers to achieve better performance. ResNet [10] introduced residual blocks to benefit the training of deeper neural networks. DenseNet [11] designed densely connected blocks to stack features from different depths. Generally, manually designing a powerful and efficient neural network architecture requires a lot of expert experimentation and domain knowledge. It was not until recently that a series of neural architecture search (NAS) methods were proposed, bringing great convenience to ordinary users and learners and allowing them to benefit from the success of deep neural networks.

Generally, a NAS task can be regarded as a complex optimization problem. In machine learning and computational intelligence, population-based intelligent optimization algorithms, such as the genetic algorithm (GA) and particle swarm optimization (PSO), were widely adopted under the concept of neuroevolution to optimize the topology and hyperparameters of neural networks in the late 1990s [12–14]. Recently, many NAS methods employing population-based intelligent optimization algorithms as search strategies have attracted increasing attention. Although intelligent optimization algorithms, such as GA, have competitive search performance on various complex optimization tasks, they still suffer from high computational costs. This shortcoming is particularly pronounced in NAS tasks, since the NAS process involves a large number of architecture evaluations. More specifically, for a NAS task, each architecture evaluation involves the complete training of a deep neural network on a large amount of data from scratch. For example, Hierarchical EA [15] consumes 300 GPU days, and AmoebaNet-A [16] consumes 3150 GPU days, to search architectures on CIFAR-10.

In addition, reinforcement learning (RL) has also been adopted to design neural architectures automatically, as in [7, 17, 18]. A significant limitation of RL-based NAS methods is that they are also computationally expensive despite their remarkable performance. For example, it takes 2000 GPU days for the typical RL-based method NASNet-A to obtain an optimized CNN architecture on CIFAR-10. These methods require a large amount of computational resources, which is unaffordable for most researchers and learners. To reduce the computational cost, ENAS [18] proposed a parameter-sharing strategy, which shares weights among the architectures through the use of a superset and is adopted in various gradient-descent (GD) NAS methods, such as [19–21]. Compared with EA-based and RL-based NAS methods, GD-based NAS methods, which apply gradient descent to optimize the weights of candidate architectures, are usually more efficient.
However, GD-based NAS methods still have some limitations, such as requiring excessive GPU memory during the search and converging prematurely to a local optimum [22, 23].

Recently, some population-based methods, such as the various EA-based methods [15, 16, 24–28], have been utilized for NAS tasks and have achieved some progress. The explosion gravitation field algorithm (EGFA) [29], inspired by the formation process of planets, is a novel intelligent optimization algorithm with excellent global optimization capability and remarkable efficiency compared with the classical population-based optimization algorithms, such as GA and PSO. Nowadays, computational time and resource limitations remain the major bottleneck in using and developing NAS methods. Thus, this paper attempts to develop a more efficient NAS method by utilizing the work mechanisms of EGFA, so as to discover an optimal neural architecture with competitive learning accuracy while consuming only a little computational time and resources. Specifically, the proposed EGFA-NAS utilizes EGFA and gradient descent in conjunction to optimize the weights of the candidate architectures. To reduce the computational cost, EGFA-NAS proposes a training strategy that utilizes its population mechanism. To improve efficiency and performance, EGFA-NAS proposes a weight inheritance strategy for the newly generated dust individuals during the explosion operation. The main contributions of this paper are summarized as follows.

1. A novel population-based NAS method is proposed, called EGFA-NAS, which utilizes EGFA and gradient descent to optimize the weights of candidate architectures jointly, and is applicable to any universal micro search space with a fixed number of edges and a determined candidate operation set, such as the NAS-Bench-201 and DARTS search spaces.

2. A training strategy is proposed to reduce the computational cost by utilizing the population mechanism. Specifically, all dust individuals cooperate to complete the training of the dataset at each epoch. Although each dust individual is only trained on part of the batches at each epoch, it will be trained on all batches over a large number of epochs.

3. A weight inheritance strategy is proposed to improve performance and efficiency. Specifically, during the explosion operation, the weights w of each newly generated dust individual are inherited from the center dust. By utilizing this strategy, the newly generated dust individuals can be evaluated directly at the current epoch without retraining.

4. The experimental results show that the optimal neural network architectures searched by EGFA-NAS have competitive learning accuracy and require the least computational cost, compared with four kinds of state-of-the-art NAS methods.

The remainder of the paper is organized as follows. "Related work" introduces the related work. "Proposed NAS method" describes the details of the proposed NAS method. The experimental design and results are presented in "Experimental design" and "Experimental results", respectively.
The final part is the conclusion, placed in "Conclusion".

Related work

General formulation of the NAS task

NAS is an extremely complex optimization task, the primary objective of which is to transform the process of manually designing neural networks into automatically searching for optimal architectures. The process of NAS is depicted in Fig. 1. During the search, the search strategy samples a candidate architecture from the search space. We then train the architecture to convergence and evaluate its performance. Next, the search strategy picks another candidate architecture for training and evaluation according to the evaluation result of the previous architecture.

Fig. 1 Process of neural architecture search

In NAS tasks, denote a neural network architecture as A and the weights of all functions of the neural network as w_A. The goal of NAS is then to find an architecture A that achieves the minimum validation loss L_V after being trained by minimizing the training loss L_T, as shown in Eq. (1):

$\min_{A}\ L_V(w^{*}, A) \quad \text{s.t.} \quad w^{*} = \arg\min_{w} L_T(w, A), \qquad (1)$

where w* is the best weight of A, achieving the minimum loss on the training dataset. L_T and L_V are the losses on the training dataset and validation dataset, respectively. Both losses are determined not only by the architecture A but also by the weights w. This is a bi-level optimization problem [30], with A as the upper-level variable and w as the lower-level variable.
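To make the bi-level problem in Eq. (1) concrete, the sketch below shows the alternating first-order scheme that weight-sharing NAS methods typically use: the network weights w are updated on the training split and the architecture parameters A on the validation split. It is a minimal sketch, not the paper's implementation; `supernet`, its `weight_params()`/`arch_params()` accessors, the data loaders, and the learning rates are placeholders.

```python
# Minimal first-order sketch of the bi-level problem in Eq. (1):
# w is optimized on the training split, A on the validation split.
# `supernet`, `train_loader`, and `val_loader` are placeholders.
import torch

def alternate_search(supernet, train_loader, val_loader, epochs=10,
                     lr_w=0.025, lr_a=3e-4):
    criterion = torch.nn.CrossEntropyLoss()
    opt_w = torch.optim.SGD(supernet.weight_params(), lr=lr_w, momentum=0.9)
    opt_a = torch.optim.Adam(supernet.arch_params(), lr=lr_a)
    for _ in range(epochs):
        for (x_t, y_t), (x_v, y_v) in zip(train_loader, val_loader):
            # lower level: update weights w on the training loss L_T
            opt_w.zero_grad()
            criterion(supernet(x_t), y_t).backward()
            opt_w.step()
            # upper level: update architecture A on the validation loss L_V
            opt_a.zero_grad()
            criterion(supernet(x_v), y_v).backward()
            opt_a.step()
    return supernet
```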
NAS methods

The search strategy determines how to sample neural network architectures. According to the kind of search strategy, NAS methods can be roughly divided into three categories: EA-based, RL-based, and GD-based NAS methods.

EA-based NAS methods

EA-based NAS methods use evolutionary algorithms (EAs) to sample neural architectures. Early EA-based research on the optimization of networks was proposed under the concept of neuroevolution [12–14], which optimizes not only the network's topology but also the hyperparameters and connection weights associated with the network. Over the past years, EA-based NAS methods have attracted increasing attention. For example, Xie et al. published the first EA-based NAS work, GeNet [31], in 2017, which encodes the candidate architectures using fixed-length binary strings. Real et al. searched network architectures by EA, starting the search from trivial initial conditions [27]. Subsequently, Real et al. evolved an image classifier, AmoebaNet-A [16], which modifies tournament selection by introducing a concept of age and surpasses hand designs for the first time. Liu et al. proposed Hierarchical EA [15], which combines a novel hierarchical genetic representation scheme that imitates the modularized design pattern with an expressive search space. Elsken et al. proposed LEMONADE [24], an evolutionary algorithm for multi-objective architecture search. Suganuma et al. constructed CNN architectures based on Cartesian genetic programming (CGP) [25]. Sun et al. proposed CNN-GA [26] and AE-CNN [32], which evolve CNN architectures using GA based on ResNet and DenseNet blocks. To accelerate the fitness evaluation in evolutionary deep learning, Sun and Wang et al. proposed an end-to-end offline performance predictor based on the random forest [33].

Although the neural network architectures searched by the above EA-based NAS methods have achieved competitive performance compared with the state-of-the-art hand-designed CNNs, as population-based methods they still suffer from huge resource costs because they involve a large number of fitness evaluations. During the search phase, each newly generated candidate architecture needs to be trained on a training dataset and evaluated on a validation dataset, so most EA-based NAS methods are time-consuming. For example, to search architectures on the CIFAR-10 dataset, Hierarchical EA [15] needs 300 GPU days, AmoebaNet-A [16] needs 3150 GPU days, CNN-GA [26] needs 35 GPU days, and AE-CNN [32] needs 27 GPU days. It is therefore essential to accelerate the evaluation process for EA-based NAS methods, especially under the condition of limited computational resources.

RL-based NAS methods

The agent, the environment, and the reward are the three factors of reinforcement learning (RL). In the context of NAS, sampling network architectures from the search space by the controller is defined as the action of the agent, the performance of the network is regarded as the reward, and the controller is updated based on the reward in the next iteration. The earliest RL-based NAS method was proposed by Zoph et al. in 2017; it used RNNs as controllers to sample the network architecture and generated actions via policy gradients [7]. Subsequently, Zoph et al. used a proximal optimization strategy to optimize the RNN controller [17]. Cai et al. presented an RL-based algorithm, ProxylessNAS [34], which offers an alternative strategy to handle hardware metrics. BlockQNN [35] automatically builds high-performance networks using the Q-learning paradigm with an epsilon-greedy exploration strategy.

Earlier RL-based NAS methods are usually computationally expensive. To reduce the computational cost, work [17] proposed the well-known NASNet search space, which allows searching for the best cell on the CIFAR-10 dataset and then applying this cell to the ImageNet dataset by stacking together more copies of the cell.
ENAS [18] proposed a parameter-sharing strategy and the one-shot estimator (OSE), which regards all candidate architectures as subgraphs of a super-network, so that all candidate architectures can share parameters.

GD-based NAS methods

Recently, there has been increasing interest in adopting gradient-descent (GD) methods for NAS tasks. A typical GD-based NAS method is DARTS [19], which optimizes the network architecture parameters by GD after converting the discrete search space into a continuous one through a relaxation strategy. Subsequently, Dong et al. proposed GDAS [20], which develops a learnable differentiable sampler to accelerate the search procedure. Xie et al. proposed SNAS [21], which trains neural operation parameters and architecture distribution parameters with a novel search gradient. The above-mentioned ProxylessNAS [34] proposed a gradient-based approach to handle non-differentiable hardware objectives.

Compared with EA-based and RL-based NAS methods, GD-based NAS methods are very efficient, because they represent the structures of the candidate networks as directed acyclic graphs (DAGs) and use the parameter-sharing strategy. However, GD-based NAS methods have some drawbacks. For example, references [22, 23] point out that DARTS tends to select skip-connection operations, which leads to performance degradation of the searched architectures. To overcome this shortcoming of DARTS [19], several variants of DARTS have been proposed, such as DARTS− [36], DARTS+ [37], RC-DARTS [38], and β-DARTS [39].

Besides the above three kinds of NAS methods, there are also NAS methods that are not mentioned here or do not fully fall into the above categories. For example, Liu et al. proposed PNAS [40], which uses a sequential model-based optimization (SMBO) strategy.

Explosion gravitation field algorithm

The explosion gravitation field algorithm (EGFA) [29] is a novel optimization algorithm based on the original GFA [40–43], which simulates the formation process of planets based on the SNDM [44]. It was proposed by our research team in 2019 and has achieved good performance on optimization problems and tasks, such as benchmark functions [29] and feature selection tasks [45]. Compared with classical population-based intelligent algorithms, such as the genetic algorithm (GA) and particle swarm optimization (PSO), EGFA has better global optimization capability and remarkable efficiency. In addition, it has been proven that EGFA converges to the global best solution with probability 1 under some conditions [29].

In EGFA, all individuals are mimicked as dust particles with mass, and each of them belongs to a certain group. In every group, the particle with the maximum mass value is regarded as the center dust and the others are surrounding dust particles.
Based on the idea of the SNDM [44], each center dust attracts its surrounding dust through a gravitation field, and the gravitation field makes all surrounding dust particles move toward their centers. In EGFA, each dust particle can be represented by a four-tuple (location, mass, group, flag), where flag is a Boolean value indicating whether it is a center, location corresponds to a solution of the problem, group indicates the group number, and mass is the value of the objective function. A larger mass value indicates a better solution. There are six basic operations in EGFA, as shown in Fig. 2: (1) dust sampling (DS), (2) initialize, (3) group, (4) move and rotate, (5) absorb, and (6) explode. The detailed process of EGFA is summarized as follows:

Step 1: Subspace location by dust sampling (DS). The task of DS is to efficiently locate a search space small enough that it most likely contains the optimal solution.

Step 2: Initialize the dust population randomly based on the subspace located in Step 1.

Step 3: Divide the dust population into several subgroups randomly and calculate the mass value of all individuals. In each group, set the dust particle with the maximum mass value as the center and set its flag to 1; set the other individuals as surrounding dust particles and set their flags to 0.

Step 4: Check the stop condition. If the stop condition is met, return the best solution and terminate the algorithm; otherwise go to Step 5.

Step 5: Perform the movement and rotation operation. In each group, the center attracts its surrounding dust particles by the gravitation field, which makes all surrounding dust particles move toward their centers.

Step 6: Perform the absorption operation. Surrounding dust particles that are close enough to their centers are absorbed by the centers. The size of the dust population decreases in this process.

Step 7: Perform the explosion operation, in which new dust particles are generated around the centers. When the explosion operation is finished, the algorithm goes back to Step 4.

Fig. 2 Flow chart of EGFA

In addition, DS in Step 1 avoids a long iterative process because the algorithm only searches in a subspace that is small compared with the original search space. The explosion operation maintains the size of the population and keeps the algorithm from stagnating after falling into local optima.

In this work, we propose a NAS method based on the explosion gravitation field algorithm, EGFA-NAS for short. In EGFA-NAS, an individual (a dust particle) represents a candidate network architecture. EGFA-NAS aims to discover a network architecture with the best performance, such as accuracy on the test dataset. For the NAS task, a subspace small enough to contain the best architecture is hard to locate and computationally intensive to find. Therefore, EGFA-NAS abandons the first operation, DS.
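As a reading aid, the sketch below restates the dust four-tuple and Steps 2–7 in plain Python (Step 1, dust sampling, is omitted, just as EGFA-NAS omits it). It is schematic only: `objective`, `sample_location`, `move_toward`, and `spawn_near` are placeholder callables supplied by the caller, and the absorption rule shown simply keeps the strongest surviving particles, a simplification of the distance-based rule described above.

```python
# Schematic EGFA loop (Steps 2-7) built around the dust four-tuple.
import random
from dataclasses import dataclass

@dataclass
class Dust:
    location: list        # a candidate solution
    mass: float = 0.0     # objective value (larger is better)
    group: int = 0
    flag: bool = False    # True if this particle is its group's center

def egfa(objective, sample_location, move_toward, spawn_near,
         n=20, groups=2, absorptivity=0.5, epochs=80):
    dust = [Dust(sample_location(), group=i % groups) for i in range(n)]  # Step 2
    for d in dust:
        d.mass = objective(d.location)
    for _ in range(epochs):                                  # Step 4: stop condition
        for g in range(groups):                              # Step 3: pick group centers
            members = [d for d in dust if d.group == g]
            center = max(members, key=lambda d: d.mass)
            for d in members:
                d.flag = d is center
        centers = {d.group: d for d in dust if d.flag}
        for d in dust:                                       # Step 5: move and rotate
            if not d.flag:
                d.location = move_toward(d.location, centers[d.group].location)
                d.mass = objective(d.location)
        surrounding = sorted((d for d in dust if not d.flag),
                             key=lambda d: d.mass, reverse=True)
        keep = int(len(surrounding) * (1 - absorptivity))    # Step 6: absorb weak dust
        dust = list(centers.values()) + surrounding[:keep]
        while len(dust) < n:                                 # Step 7: explode around centers
            c = random.choice(list(centers.values()))
            loc = spawn_near(c.location)
            dust.append(Dust(loc, objective(loc), c.group))
    return max(dust, key=lambda d: d.mass)
```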
As a population-based method for the NAS task, there are several key issues to be addressed, namely: (1) which type of search space to search, (2) how to represent and encode a CNN, (3) how to accelerate the architecture evaluation process, and (4) how to use heuristic information to guide the search process.

Proposed NAS method

Micro search spaces, such as the NASNet [17], DARTS [19], and NAS-Bench-201 [23] search spaces, have recently been popular for NAS tasks; they search for neural cells to form blocks and construct the macro skeleton of the network by stacking multiple blocks multiple times, as in [16–20, 23, 46]. In this work, we propose an efficient NAS method for micro search spaces. To investigate the performance of the proposed method sufficiently, we choose two classical micro search spaces for testing, i.e., the NAS-Bench-201 and DARTS search spaces.

Representation of search space

In this work, we search for a computation cell as the building block of the final architecture and represent a cell as a directed acyclic graph (DAG). Specifically, a node represents the information flow, e.g., a feature map in a CNN, and an edge between two nodes denotes a candidate operation, i.e., one of the successful modules designed by human experts. We denote the candidate operation set as O. To process the intermediate nodes more efficiently in the forward propagation, two kinds of cells need to be searched: a normal cell with stride 1 and a reduction cell (block) with stride 2. Once the two kinds of cells are identified, we can stack multiple copies of the searched cells to make up a whole neural network. In the rest of this section, we introduce the two search spaces, NAS-Bench-201 and DARTS, respectively.

NAS-Bench-201

NAS-Bench-201 was proposed by Dong et al. [23] and is an algorithm-agnostic micro search space. Specifically, a cell from NAS-Bench-201 includes one input node and three computational nodes; the last computational node is also the output node for the next cell. Every edge in a cell has five candidate options. A cell in NAS-Bench-201 can therefore be represented as a fully connected DAG, and there are 5^6 = 15,625 cell candidates in total. In NAS-Bench-201, the candidate operation set O contains the following five operations: (1) zeroize, (2) skip-connection, (3) 1 × 1 convolution, (4) 3 × 3 convolution, and (5) 3 × 3 average pooling.

As shown in Fig. 3, the macro skeleton of NAS-Bench-201 is mainly stacked from three normal blocks connected by two reduction blocks. Each normal block consists of B normal cells. The reduction block is the basic residual block [10], which serves to down-sample the spatial size and double the channels of the input feature map. The skeleton starts with one 3 × 3 convolution and ends with a global average pooling layer that flattens the feature map into a feature vector.

Fig. 3 Macro skeleton of NAS-Bench-201
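The figure of 15,625 candidates follows directly from the cell topology: 6 edges between the 4 nodes, each carrying one of 5 operations. The short sketch below enumerates the discrete cell encodings to make that count explicit; the operation names follow NAS-Bench-201's convention and the list above.

```python
# Enumerate all discrete NAS-Bench-201 cells: 6 edges x 5 operations each.
from itertools import product

OPS = ["none", "skip_connect", "nor_conv_1x1", "nor_conv_3x3", "avg_pool_3x3"]
EDGES = [(j, i) for i in range(1, 4) for j in range(i)]    # 6 edges of the 4-node DAG

cells = list(product(range(len(OPS)), repeat=len(EDGES)))  # one op index per edge
print(len(cells))                                          # 5**6 = 15625
example = {EDGES[e]: OPS[k] for e, k in enumerate(cells[0])}
```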
Additionally, work [23] evaluates each candidate architecture of NAS-Bench-201 on three different datasets: CIFAR-10, CIFAR-100 [47], and ImageNet-16-120 [48]. Hence, once the final architecture is found, the retraining process is not essential, and we can directly obtain the network's final performance through the API provided by [23].

DARTS search space

The DARTS [19] search space is a popular micro search space, proposed by Liu et al. in 2019, which is similar to the NASNet [17] search space but removes some unused operations and adds some powerful operations. Specifically, a cell from the DARTS search space contains two input nodes, four computational nodes, and one output node. The output node is the concatenation of the four computational nodes. As depicted in Fig. 4, there are 14 edges in a cell for search, and each edge has 8 options. Unlike NAS-Bench-201, the nodes in a cell are not fully connected during the search phase; moreover, during the evaluation phase, each node connects with only two previous nodes. In the DARTS search space, the candidate operation set O contains the following eight operations: (1) identity, (2) zeroize, (3) 3 × 3 depth-wise separable convolution, (4) 3 × 3 dilated depth-wise separable convolution, (5) 5 × 5 depth-wise separable convolution, (6) 5 × 5 dilated depth-wise separable convolution, (7) 3 × 3 average pooling, and (8) 3 × 3 max pooling.

As shown in Fig. 4, B normal cells are stacked into one normal block. A given image is forwarded through a 3 × 3 convolution and then through three normal blocks with two reduction cells in between. In this paper, we follow work [19] to set up the overall network architecture for the DARTS search space.

Fig. 4 Macro skeleton of DARTS search space

Overall search process

Figure 5 shows the overall search process of EGFA-NAS: (a) the operations on the edges are initially unknown; (b) the search space is continuously relaxed and candidate operations for the edges are sampled with the mixing probabilities; (c) the mixing probabilities and the weights of the cells are optimized simultaneously; (d) the final cell structure is inferred from the learned mixing probabilities.

Fig. 5 Overall search process

Representation and encoding of cell

As discussed in "Representation of search space", the cells to be searched in this work can be represented by DAGs. Specifically, each computational node represents a feature map, which is transformed from the previous feature maps. Each edge in the DAG is associated with an operation transforming the feature map from one node to another, and all possible operations are selected from the candidate operation set O. The output of any node j can then be formulated as Eq. (2):

$I_j = \sum_{i<j} o_{i,j}(I_i), \qquad (2)$

where I_i and I_j represent the outputs of node i and node j, respectively, and o_{i,j} represents the operation transforming the feature map from node i to node j, which is selected from the candidate operation set O.
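To illustrate Eq. (2), the sketch below computes the node outputs of a discrete cell represented as a DAG: each node sums the transformed outputs of all of its predecessors. The `ops` dictionary, which maps an edge (i, j) to a callable, stands in for the chosen candidate operations and would be torch.nn modules in a real implementation.

```python
# Discrete cell forward pass following Eq. (2): I_j = sum_{i<j} o_{i,j}(I_i).
# `ops[(i, j)]` is the operation chosen for edge (i, j); placeholders here.
def cell_forward(x, num_nodes, ops):
    outputs = [x]                         # node 0 is the cell input
    for j in range(1, num_nodes):
        outputs.append(sum(ops[(i, j)](outputs[i]) for i in range(j)))
    return outputs[-1]                    # last node is the cell output

# Example with identity placeholders on a 4-node (NAS-Bench-201 style) cell:
ops = {(i, j): (lambda t: t) for j in range(1, 4) for i in range(j)}
print(cell_forward(3.0, num_nodes=4, ops=ops))
```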
In NAS-Bench-201 [23], a normal cell contains four nodes, i.e., {I_i | 0 ≤ i ≤ 3}. I_0 is the output tensor of the previous layer, and I_1, I_2, I_3 are the output tensors of nodes 1, 2, and 3, calculated by Eq. (2). According to work [23], a normal cell contains six edges and each edge has five candidate operations.

In the DARTS search space, a cell contains seven nodes, i.e., {I_i | 0 ≤ i ≤ 6}. I_0 and I_1 are the input tensors, and I_2, I_3, I_4, I_5 are the output tensors of nodes 2, 3, 4, and 5. I_6 is the output of the cell, which is the concatenation of the four computational nodes, i.e., I_6 = concat(I_2, I_3, I_4, I_5).

Define e as the number of edges of a cell and |O| as the size of the candidate operation set O. According to the above description of the NAS-Bench-201 and DARTS search spaces, a cell can be encoded as A with size e × |O|. In NAS-Bench-201, e = 6 and |O| = 5, so A is a tensor of size 6 × 5. In the DARTS search space, e = 14 and |O| = 8, so A is a tensor of size 14 × 8. A general representation of a cell is formulated as Eq. (3):

$A = \begin{bmatrix} a_0^{0} & a_0^{1} & \cdots & a_0^{q} & \cdots & a_0^{|O|-1} \\ a_1^{0} & a_1^{1} & \cdots & a_1^{q} & \cdots & a_1^{|O|-1} \\ \vdots & \vdots & & \vdots & & \vdots \\ a_p^{0} & a_p^{1} & \cdots & a_p^{q} & \cdots & a_p^{|O|-1} \\ \vdots & \vdots & & \vdots & & \vdots \\ a_{e-1}^{0} & a_{e-1}^{1} & \cdots & a_{e-1}^{q} & \cdots & a_{e-1}^{|O|-1} \end{bmatrix}, \qquad (3)$

where the pth row contains the mixing parameters of the candidate operations for edge p, and a_p^{q} represents the probability of sampling the qth candidate operation for edge p. In fact, the encoding of a cell as in Eq. (3) can be used for any micro search space whose cells have a fixed number of edges e and a defined candidate operation set O.
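As a concrete illustration of the e × |O| encoding in Eq. (3), the sketch below builds a random mixing matrix for a NAS-Bench-201 cell and derives the corresponding discrete cell by taking, for each edge, the operation with the highest probability, which is the rule used when the final architecture is inferred from the learned mixing probabilities. The random logits are purely illustrative.

```python
# Encode a cell as an e x |O| matrix of mixing probabilities (Eq. (3))
# and decode it to a discrete cell by per-edge argmax.
import numpy as np

OPS = ["none", "skip_connect", "nor_conv_1x1", "nor_conv_3x3", "avg_pool_3x3"]
e, num_ops = 6, len(OPS)                      # NAS-Bench-201: 6 edges, 5 operations

rng = np.random.default_rng(0)
logits = rng.normal(size=(e, num_ops))
A = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # rows sum to 1

discrete_cell = [OPS[k] for k in A.argmax(axis=1)]   # one operation per edge
print(discrete_cell)
```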
Continuous relaxation of the search space

As described in "Representation of search space", a neural network architecture consists of many copies of the cell. These cells are sampled from the NAS-Bench-201 or DARTS search space. Specifically, from node j to node i, we sample the transformation function from the candidate operation set O with a discrete probability α^{(i←j)}. During the search, each node in a cell is calculated by Eq. (4):

$I_i = \sum_{j<i} \sum_{k=1}^{|O|} \alpha_k^{(i\leftarrow j)}\, o_k\big(I_j;\ w_k^{(i\leftarrow j)}\big), \qquad (4)$

where |O| is the number of candidate operations in the set O, α_k^{(i←j)} represents the probability that edge (i←j) (from node j to node i) selects the kth candidate operation as the transformation function, o_k represents the kth candidate operation, I_j is the output of node j, and w_k^{(i←j)} is the weight of the function o_k on edge (i←j). To make the search space continuous, we relax the probability of a particular operation α_k^{(i←j)} to a softmax over all possible operations by Eq. (5):

$\alpha_k^{(i\leftarrow j)} = \frac{\exp\big((a_k^{(i\leftarrow j)} + c_k)/\tau\big)}{\sum_{k'=1}^{|O|} \exp\big((a_{k'}^{(i\leftarrow j)} + c_{k'})/\tau\big)}, \qquad (5)$

where the c_k are i.i.d. samples drawn from Gumbel(0, 1), i.e., c_k = −log(−log(u)) with u ∼ Unif[0, 1], and τ is a softmax temperature; in this work, τ is set to 10, the same as in study [23].

Training strategy

In this work, we aim to reduce the computational cost by utilizing the population mechanism of EGFA-NAS. The main idea of the training strategy is illustrated in Fig. 6. Specifically, define D_T as the training dataset, batch_num as the number of batches of D_T, and n as the population size. At each epoch, each dust individual is trained on k batches, where k = ⌈batch_num/n⌉. All dust individuals cooperate to complete the training of the dataset at each epoch, and this process repeats until the maximum number of epochs is reached. Each dust individual (architecture) will be trained on many different batches, since the number of batches batch_num is usually larger than the population size n and the training process is repeated for a large number of epochs. In this work, we set batch_num = 98, n = 20, and k = 5 for CIFAR-10, and set the maximum number of epochs to 80 and 200 for the NAS-Bench-201 and DARTS search spaces, respectively. Although each dust individual (architecture) is trained only on a subset (1/n of the training data) at each epoch, it will be trained on all training data over a large number of epochs by this training strategy.

Fig. 6 Training strategy for EGFA-NAS
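The sketch below shows one way to implement this partitioning: at every epoch the batches of D_T are dealt out to the n dust individuals in contiguous chunks of k = ⌈batch_num/n⌉, and the chunk assigned to each individual shifts from epoch to epoch so that, over many epochs, every individual sees every batch. The rotation rule is our own illustrative choice, not a detail taken from the paper.

```python
# Illustrative batch-partitioning for the population training strategy:
# each of the n dust individuals trains on k = ceil(batch_num / n) batches
# per epoch; the assignment rotates across epochs (an assumption) so every
# individual eventually sees every batch.
import math

def batch_assignment(batch_num, n, epoch):
    k = math.ceil(batch_num / n)
    offset = (epoch * k) % batch_num          # rotate the split each epoch
    batches = [(offset + b) % batch_num for b in range(batch_num)]
    return [batches[i * k:(i + 1) * k] for i in range(n)]

# Example with the paper's CIFAR-10 setting: 98 batches, 20 individuals, k = 5.
assign = batch_assignment(batch_num=98, n=20, epoch=0)
print(len(assign), [len(a) for a in assign[:3]])   # 20 groups of (up to) 5 batches
```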
In addition, because each dust individual is responsible for only part of the training work and the complete training of each epoch is done with the participation of all individuals, the efficiency of EGFA-NAS is not sensitive to the setting of the population size n, which will be experimentally confirmed in "Parameter settings for NAS-Bench-201".

Explosion operation and weight inheritance

In the context of neural architecture search, a dust individual in EGFA-NAS represents a candidate architecture. It maintains not only the original four attributes (location, mass, group number, and a Boolean flag indicating whether it is a center, as described in "Explosion gravitation field algorithm"), but also an attribute w that records the weights of the functions in the cells. Each dust particle can therefore be represented by a five-tuple (location, w, mass, group, flag). In EGFA-NAS, the location is the operation mixing probability matrix A, so a neural network architecture can be represented as (A, w, mass, group, flag).

As a population-based NAS method, the main computational bottleneck of EGFA-NAS is the large number of architecture evaluations involved. In this work, we attempt to reduce the computational cost by taking advantage of the working mechanism of EGFA. On the one hand, at each epoch additional computational cost arises because a number of newly generated dust particles (architectures) need to be trained during the explosion operation. On the other hand, the new dust particles are generated based on the center dust, so there is a close relationship between the newly generated dust particles and their centers. Based on these two observations, we propose a weight inheritance strategy for the explosion operation. The details of the explosion operation in EGFA-NAS are described in Algorithm 1.

Algorithm 1 Explosion operation

Input: the size of the dust population n, the absorptivity abs, the number of epochs epoch_max, the maximum radius r_max and minimum radius r_min, the current epoch epoch_cur, the dust population Dust_absorb, the center dust center, the newly generated dust population Dust_new = ∅.
Output: the dust population Dust_explode.

1. compute the explosion radius r from r_max, r_min, epoch_cur, and epoch_max
2. for each group do
3.   for each newly generated individual dust_i do
4.     dust_i.A = center.A · (1 − r) + A_random · r
5.     dust_i.w = center.w   (weight inheritance)
6.     Dust_new = Dust_new ∪ {dust_i}
7.   end for
8. end for
9. for each individual dust_i in Dust_new do
10.   construct the architecture based on the parameters A and w of dust_i
11.   calculate L_T and L_V
12.   calculate dust_i.mass by Eq. (6)
13.   update dust_i.w
14. end for
15. Dust_explode = Dust_absorb ∪ Dust_new
16. for each group do
17.   update the center dust
18. end for
19. return Dust_explode
As shown in Algorithm 1, the first part (lines 1–8) is the process of generating new individuals based on the center dust: the mixing probabilities A of the candidate operations of dust_i are computed in line 4, and the weights w of the functions in the cells are inherited from the center dust in line 5. The second part (lines 9–14) calculates the mass value of each newly generated dust particle and updates the parameter w. Line 15 combines the dust population Dust_absorb (the output of the previous operation) with the newly generated dust population Dust_new. The last part (lines 16–18) updates the center dust of each group. By utilizing weight inheritance, the newly generated dust can be evaluated directly at the current epoch without retraining.

Figure 7 illustrates the process of generating new dust particles by means of weight inheritance during the explosion operation, where A_i denotes the mixing probabilities of the candidate operations for edge i and w_i records the weights of the functions for edge i. The right part of Fig. 7 shows the newly generated dust population of size m: the mixing probabilities A of the new dust particles are computed from their center as in line 4 of Algorithm 1, and the parameters w are inherited from their center dust particle as in line 5 of Algorithm 1.

Fig. 7 Process of generating new dust particles by weight inheritance during the explosion operation
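A compact sketch of the explosion step with weight inheritance is given below. New candidates blend the center's mixing matrix with a random matrix (line 4 of Algorithm 1) and reuse the center's trained weights (line 5), so they can be scored immediately. It is a simplified illustration: `evaluate_losses` is a placeholder for one forward pass on the current training and validation batches, the Dirichlet sampling of A_random is our own choice, and dust particles are plain dictionaries rather than the paper's data structures.

```python
# Sketch of the explosion operation with weight inheritance (Algorithm 1).
import copy
import numpy as np

def explode(center, m, r, evaluate_losses):
    """Generate m new dust particles around `center` with radius r in (0, 1]."""
    new_dust = []
    for _ in range(m):
        A_random = np.random.dirichlet(np.ones(center["A"].shape[1]),
                                       size=center["A"].shape[0])
        A_new = center["A"] * (1.0 - r) + A_random * r       # Algorithm 1, line 4
        w_new = copy.deepcopy(center["w"])                   # line 5: inherit weights
        L_T, L_V = evaluate_losses(A_new, w_new)             # lines 10-11, no retraining
        new_dust.append({"A": A_new, "w": w_new,
                         "mass": L_T + L_V,                  # Eq. (6)
                         "group": center["group"], "flag": False})
    return new_dust
```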
Process of EGFA-NAS

As described above, during the NAS process the two parameters, the architecture A and the weights w, need to be optimized simultaneously. To solve this bi-level optimization problem, we divide the original training dataset into two parts, a new training dataset D_T and a validation dataset D_V, and then use D_T to optimize the parameter w and D_V to optimize the parameter A. In EGFA-NAS, we apply EGFA and gradient descent jointly to optimize the parameter w and the architecture A simultaneously in an iterative way. The process of EGFA-NAS is described in detail as follows.

Step 1: Initialize all parameters, including the size of the dust population n, the number of groups g, the absorptivity abs for the absorption operation, the number of epochs epoch_max, and the maximum radius r_max and minimum radius r_min for the explosion strategy; initialize the dust population Dust = {dust_0, dust_1, ..., dust_{n−1}} randomly. For each dust_i, the location (the ith cell architecture dust_i.A) is initialized randomly as an e × |O| tensor, as in Eq. (3). After initialization, each cell can be stacked into a neural network, and the loss on the training dataset, L_T, and the loss on the validation dataset, L_V, can be calculated. To optimize the two parameters w and A simultaneously, we use Eq. (6) to evaluate the performance of a network architecture and denote it as the mass value of dust_i. Note that L_T and L_V are not the losses of the network architecture after full training, but the losses on the training dataset and validation dataset at the current epoch, respectively:

$dust_i.mass = L_T + L_V, \qquad (6)$

where the losses L_T and L_V are calculated by Eq. (7), the cross-entropy loss function [49]:

$L = -\frac{1}{s} \sum_{x} y \log \hat{y}, \qquad (7)$

where x represents a data sample, y is the true label, ŷ represents the predicted label, and s is the size of the data.

Step 2: Divide the dust population into g subgroups; in EGFA-NAS, g is set to 2. In each group, set the dust particle with the maximum mass as the center dust; the others are surrounding dust particles. For dust_i, the attribute flag is set as in Eq. (8), where best_mass_j is the maximum mass value in group j:

$dust_i.flag = \begin{cases} 1, & \text{if } dust_i.mass = best\_mass_j, \\ 0, & \text{otherwise.} \end{cases} \qquad (8)$

Step 3: Check the termination conditions. There are two termination conditions in EGFA-NAS: one is reaching the maximum number of epochs, and the other is based on the average change of the mass value of the dust population. Once one condition is met, the main loop of EGFA-NAS ends, the optimal network architecture A is returned, and the structure of the neural network is deduced; otherwise, go to Step 4.

Step 4: Perform the movement and rotation operation. The surrounding dust particles move toward the center dust. For each dust particle dust_i, the pace of movement is calculated by Eq. (9):

$\Delta A_1 = p \cdot \big(\exp(center.A + 3) - \exp(dust_i.A + 3)\big) + q \cdot A_{random}, \qquad (9)$

where center.A represents the cell structure of the center dust, dust_i.A represents the ith cell structure, and A_random is a 6 × 5 tensor generated randomly. p is the pace of movement and q is a value close to zero; in this work, we set p = 0.1 and q = 0.001. We denote the pace of the movement and rotation operation on the location of dust_i as ΔA_1. In addition, EGFA-NAS also applies gradient descent to optimize the parameters A and w. We denote the pace of gradient descent on the location of dust_i as ΔA_2, which is calculated by Eq. (10):

$\Delta A_2 = -\xi_2\, \nabla_{dust_i.A} L_V(dust_i.w,\ dust_i.A), \qquad (10)$

where ξ_2 is the learning rate and ∇_{dust_i.A} L_V represents the architecture gradient on the validation dataset.

As shown in Fig. 8, considering the impact of the above two factors on the cell structure A, the location of dust_i is updated as Eq. (11):

$dust_i.A = dust_i.A + \Delta A_1 + \Delta A_2. \qquad (11)$

During this process, for each dust particle dust_i we need to optimize not only the parameter dust_i.A but also the parameter dust_i.w, which is updated by Eq. (12):

$dust_i.w = dust_i.w - \xi_1\, \nabla_{dust_i.w} L_T(dust_i.w,\ dust_i.A), \qquad (12)$

where ξ_1 is the learning rate and ∇_{dust_i.w} L_T represents the weight gradient on the training dataset.
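The sketch below applies one Step-4 update to a single dust particle, combining the population-driven move toward the center (Eq. (9)) with the gradient-based corrections of Eqs. (10)–(12). It is illustrative only: gradients are obtained with PyTorch autograd, `loss_T` and `loss_V` are placeholder callables returning scalar losses that depend on the tensors `dust_w` and `dust_A` (both created with requires_grad=True), and drawing the random term from a standard normal is our own assumption about A_random.

```python
# One Step-4 update for a dust particle: Eqs. (9)-(11) for A, Eq. (12) for w.
# `loss_T(w, A)` and `loss_V(w, A)` are placeholders returning scalar losses.
import torch

def step4_update(dust_A, dust_w, center_A, loss_T, loss_V,
                 p=0.1, q=0.001, lr_w=0.025, lr_a=3e-4):
    # Eq. (9): population-driven move toward the center plus a small random term
    dA1 = p * (torch.exp(center_A + 3) - torch.exp(dust_A + 3)) \
          + q * torch.randn_like(dust_A)
    # Eq. (10): architecture gradient on the validation loss
    dA2 = -lr_a * torch.autograd.grad(loss_V(dust_w, dust_A), dust_A)[0]
    # Eq. (11): update the cell structure A
    new_A = (dust_A + dA1 + dA2).detach().requires_grad_(True)
    # Eq. (12): gradient descent on the training loss for the weights w
    grad_w = torch.autograd.grad(loss_T(dust_w, dust_A), dust_w)[0]
    new_w = (dust_w - lr_w * grad_w).detach().requires_grad_(True)
    return new_A, new_w
```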
<a href="#bookmark70">( 12)</a>.</p><p><em>dusti </em>.w = <em>dusti </em>.w − ξ1 ▽<em>dusti</em>.w<em>LT </em>(<em>dusti </em>.w, <em>dusti</em>.<em>A</em>),</p><p><a id="bookmark70"></a>(12)</p><p>where ξ1 is the learning rate, ▽<em>dusti</em>.w<em>LT </em>represents the archi- tecture gradient on training dataset.</p><p>Step 5: Perform the absorption operation. Some surround- ing dust particles with small mass value will be absorbed by their center dust. During this process, the size of dust population will change, the new size is determined by the absorptivity abs as Eq. <a href="#bookmark71">( 13)</a>.</p><p><a id="bookmark71"></a><em>n </em>= <em>n </em>* (1 − abs), (13)</p><p>where <em>n </em>is the size of the initial population, abs represents the absorptivity. In this work, we set abs as 0.5.</p><p><img src="/media/202408//1724856291.914452.png" />Step 6: Perform the explosion operation. During the pro- cess of Step 5, some dust particles with small mass value are absorbed by their center dust particles. To maintain the size of dust population, some new dust particles will be gen- erated around the center dust particles during this process. <a href="#bookmark60">This part is descripted in “Explosion operation and weights inheritance” in detail.</a></p><p>Once Step 6 finishes, go to Step 3.</p><p>According to the above detailed description of EGFA- NAS, the pseudo-code of EGFA-NAS is shown in Algorithm 1. Step 1 (lines 1–3) is the initialization. Step 2 (lines 4–5) is the operation of grouping. Step 3 (line 6) checks the ter- mination conditions. Step 4 (lines 7–12) is the process of movement and rotation. Step 5 (line 13) is the absorption operation. Step 6 (line 14) is the explosion operation.</p><table><tr><td></td></tr><tr><td><p>Algorithm 1: EGFA<s> </s>NAS</p></td></tr><tr><td><p>Input: the training data set Dr, the validation data set D, , the populationsize n , the number of group g , the absorptivity abs, the maximum radius Thax and minimum radius rmin for explosion strategy, the maximum and current number of epochs epoch , epoch cur=0 , dust population Dust <img src="/media/202408//1724856291.929507.png" /> , best dust particle best <img src="/media/202408//1724856291.937825.png" /> ·</p><p>output: center , best</p></td></tr><tr><td><p><img src="/media/202408//1724856292.04976.png" /></p><p>2. Dust e initialize the dustpopulationwith size of n randomly</p><p><img src="/media/202408//1724856292.0822191.png" /></p><p>4.Divide thedustpopulation into g grou ups</p><p><img src="/media/202408//1724856292.1216881.png" /></p><p>6. while termination conditions are not met do 7. for each individual dust (flag=0)do</p><p><img src="/media/202408//1724856292.185666.png" /></p><p>8. update dustiA by Eq. (9)-(11) 9. update dustiw by Eq. (12)</p><p>10. update dusti mass by Eq. (6) 11. end for</p><p><img src="/media/202408//1724856292.214438.png" /></p><p>12. update the center , best</p><p><img src="/media/202408//1724856292.2662.png" /></p><p><img src="/media/202408//1724856292.425337.png" /></p><p>15. update center , best 16. end while</p><p><a id="bookmark25"></a>17. return center , best</p></td></tr></table><p>i '</p><p>ΔA2</p><table><tr><td></td><td><p><img src="/media/202408//1724856292.4706602.png" /></p><p>ΔA1</p><p><img src="/media/202408//1724856292.51296.png" /></p><p><img src="/media/202408//1724856292.6065652.png" />></p><p><img src="/media/202408//1724856292.671936.png" /></p></td></tr><tr><td></td><td></td></tr></table><p><strong>Fig. 
Experimental design

The goal of EGFA-NAS is to automatically search for an optimal neural network architecture that achieves satisfactory performance on a complex task, such as image classification. For this purpose, a series of experiments is designed to demonstrate the advantages of the proposed EGFA-NAS compared with state-of-the-art NAS methods. First, we utilize the proposed EGFA-NAS to search neural network architectures in the benchmark search space NAS-Bench-201, and evaluate the performance of EGFA-NAS by investigating the classification accuracy and computational cost of the searched architectures on CIFAR-10, CIFAR-100, and ImageNet-16-120. Second, we investigate the consistency of the relative evaluation with the absolute evaluation, in terms of accuracy and loss. Third, we investigate the effectiveness of the weight inheritance strategy. Finally, we examine the proposed EGFA-NAS in the larger and more practical DARTS search space, and investigate the performance and universality of EGFA-NAS.

We first run the proposed EGFA-NAS in the benchmark search space NAS-Bench-201. When the search process terminates, the absolute performance evaluation of the optimal architecture can be obtained directly through NAS-Bench-201's API with negligible computational cost. By utilizing NAS-Bench-201, we verify the consistency of the relative performance evaluation and the absolute performance evaluation of the searched network architectures without retraining from scratch. In addition, we verify the effectiveness of weight inheritance in the NAS-Bench-201 search space. In contrast, when the search process in the DARTS search space terminates, the optimal network architecture needs to be retrained from scratch and tested on the test datasets; the test classification accuracy is reported as the result of our experiments. In the rest of this section, we introduce the peer competitors chosen for comparison with the proposed EGFA-NAS, the benchmark datasets, and finally the parameter settings for the two typical search spaces, NAS-Bench-201 and DARTS.

Peer competitors

To demonstrate the advantages of the proposed EGFA-NAS, a series of competitors are chosen for comparison.
"Competitors of NAS-Bench-201" introduces the competitors compared against the optimal architecture searched by EGFA-NAS in the NAS-Bench-201 search space, and "Competitors of DARTS search space" introduces the competitors compared against the optimal architecture searched by EGFA-NAS in the DARTS search space.

Competitors of NAS-Bench-201

Because NAS-Bench-201 (with only five candidate operations) is a smaller search space, and its best architecture has lower classification accuracy than the best ones searched in other search spaces, the performance of the optimal architecture searched by EGFA-NAS in NAS-Bench-201 is only compared with competitors that have reported results in the NAS-Bench-201 search space.

The selected competitors are mainly efficient GD-based NAS methods, including DARTS-V1 [19], DARTS-V2 [19], SETN [50], iDARTS [51], and GDAS [20]. The other three selected NAS competitors, namely ENAS [18], RSPS [22], and EvNAS [52], utilize RL, random search, and EA as their search strategies, respectively.

Competitors of DARTS search space

The DARTS search space is a functional search space for NAS tasks, in which the optimal network architecture has promising performance compared with the state-of-the-art manually designed CNN architectures. To compare the performance of the optimal network architecture searched by EGFA-NAS in the DARTS search space, we select four different kinds of competitors.

1. The first kind of competitors are the state-of-the-art CNN architectures manually designed by domain experts, including ResNet-101 [10], DenseNet-BC [11], SENet [53], IGCV3 [54], ShuffleNet [55], VGG [1], and Wide ResNet [56].

2. The second kind of competitors are the state-of-the-art EA-based NAS methods, including Hierarchical EA [15], AmoebaNet-A [16], LEMONADE [24], CGP-CNN [25], CNN-GA [26], AE-CNN [32], AE-CNN + E2EPP [33], LargeEvo [27], GeNet [31], SI-EvoNet [57], NSGA-Net [28], and MOEA-PS [58].

3. The third kind of competitors utilize RL to search for CNN architectures, such as NASNet-A [17], NASNet-A + CutOut [17], ProxylessNAS [34], BlockQNN [35], DPP-Net [59], MetaQNN [60], and ENAS [18].

4. The fourth kind of competitors are mainly GD-based NAS methods, such as DARTS-V1 + CutOut [19], DARTS-V2 + CutOut [19], RC-DARTS [38], and SNAS [21].
In addition, PNAS [<a href="#bookmark43">40</a>] is also selected for comparison; it uses a sequential model-based optimization (SMBO) strategy.</p><p><strong>Benchmark datasets</strong></p><p>To investigate the performance of EGFA-NAS on NAS tasks, we test EGFA-NAS in two different search spaces: NAS-Bench-201 and the DARTS search space. All experiments involve three benchmark datasets: CIFAR-10, CIFAR-100 [<a href="#bookmark51">47</a>], and ImageNet-16-120 [<a href="#bookmark52">48</a>], which are widely adopted in experimental studies of state-of-the-art CNNs and NAS methods. In this work, each architecture searched in NAS-Bench-201 is trained and evaluated on CIFAR-10, CIFAR-100 [<a href="#bookmark51">47</a>], and ImageNet-16-120 [<a href="#bookmark52">48</a>]. Each architecture searched in the DARTS search space is trained and evaluated on CIFAR-10 and CIFAR-100. Each dataset is split into three subsets: a training set, a validation set, and a test set.</p><p>CIFAR-10: an image classification dataset consisting of 60K images in ten classes. The original set contains 50K training images and 10K test images. Because a validation set is needed, the original training set is randomly split into two subsets of equal size, each containing 25K images across the ten classes. In this work, we regard one subset as the new training set and the other as the validation set.</p><p>CIFAR-100: it has the same images as CIFAR-10, but categorizes them into 100 fine-grained classes. The original CIFAR-100 contains 50K images in the training set and 10K images in the test set. In this work, the original training set is randomly split into two subsets of equal size; one is regarded as the training set and the other as the new validation set.</p><p>ImageNet-16-120: ImageNet is a large-scale and well-known dataset for image classification. ImageNet-16-120 was built with 16 × 16 pixels from the down-sampling variant of ImageNet [<a href="#bookmark86">61</a>] (i.e., ImageNet 16 × 16) and contains all images with labels ∈ [0, 119]. In sum, ImageNet-16-120 consists of 151.7K images for training, 3K images for validation, and 3K images for testing, with 120 classes.</p><p><a id="bookmark59"></a><strong>Parameter settings</strong></p><p>This section introduces the parameter settings for EGFA-NAS in detail.</p><p><a id="bookmark83"></a><strong>Table 1 </strong>Hyperparameter settings of the searching process</p><table><tr><td><p>Parameter</p></td><td><p>Value</p></td></tr><tr><td><p>Initial channels</p></td><td><p>16</p></td></tr><tr><td><p><em>B</em></p></td><td><p>5</p></td></tr><tr><td><p>Optimizer</p></td><td><p>SGD</p></td></tr><tr><td><p>Nesterov</p></td><td><p>1</p></td></tr><tr><td><p>Momentum</p></td><td><p>0.9</p></td></tr><tr><td><p>Batch size</p></td><td><p>256</p></td></tr><tr><td><p>LR scheduler</p></td><td><p>Cosine</p></td></tr><tr><td><p>Initial LR</p></td><td><p>2.5 × 10−2</p></td></tr><tr><td><p>min_LR</p></td><td><p>1 × 10−3</p></td></tr><tr><td><p>Weight decay</p></td><td><p>5 × 10−4</p></td></tr><tr><td><p>Random flip</p></td><td><p>0.5</p></td></tr></table><p><strong>Parameter settings for NAS-Bench-201</strong></p><p>For the NAS-Bench-201 search space, the parameter settings are only involved in the search process, because NAS-Bench-201 provides the absolute (final) performance evaluation for each architecture, and we can obtain the evaluation of the optimal architecture directly without retraining from scratch.
We adopt the same skeleton network as [<a href="#bookmark19">23</a>], shown in Fig. <a href="#bookmark50">3</a>. Specifically, we set the number of initial channels of the first convolution layer to 16 and the number of cells in one normal block <em>B</em> to 5. During the search, almost all parameter settings follow [<a href="#bookmark19">23</a>], as shown in Table <a href="#bookmark83">1</a>. Specifically, we train each architecture via Nesterov momentum SGD, using the cross-entropy loss as the loss function with batch size 256. We set the weight decay to 5 × 10−4 and decay the learning rate from 2.5 × 10−2 to 1 × 10−3 with a cosine annealing scheduler.</p><p>In the NAS-Bench-201 search space, we use the same hyperparameters for the three datasets CIFAR-10, CIFAR-100 [<a href="#bookmark51">47</a>], and ImageNet-16-120 [<a href="#bookmark52">48</a>], except for the data augmentation, due to the slight difference in image resolution. For CIFAR-10 and CIFAR-100, we use a random flip with probability 0.5, a random 32 × 32 crop with 4-pixel padding, and normalization over the RGB channels. For ImageNet-16-120, we use the same strategies, except that the random crop is 16 × 16 with 2-pixel padding.</p><p>The parameters listed in Table <a href="#bookmark83">1</a> are related to the neural network architecture. As a population-based method, EGFA-NAS has its own parameters. Specifically, we set the number of groups <em>g</em> to 2, the absorptivity abs to 0.5 for the absorb operation, the maximum radius <em>r</em>max to 0.1, and the minimum radius <em>r</em>min to 0.001 for the explosion operation.</p>
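<p>To make the settings above concrete, the following is a minimal, illustrative PyTorch sketch of the CIFAR-10 data pipeline (random flip, padded random crop, normalization, and the 25K/25K train/validation split) together with the SGD optimizer and cosine annealing schedule from Table 1. It is our own reconstruction, not the released EGFA-NAS code; the super-network is replaced by a tiny stand-in model, and the normalization statistics are the commonly used CIFAR-10 values, which the paper does not list.</p>
<pre><code>import torch
import torchvision
import torchvision.transforms as T

# Commonly used CIFAR-10 channel statistics (assumed; not specified in the paper).
MEAN, STD = (0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)

train_tf = T.Compose([
    T.RandomHorizontalFlip(p=0.5),      # random flip with probability 0.5
    T.RandomCrop(32, padding=4),        # 32 x 32 random crop with 4-pixel padding
    T.ToTensor(),
    T.Normalize(MEAN, STD),             # normalization over the RGB channels
])

full_train = torchvision.datasets.CIFAR10("./data", train=True, download=True, transform=train_tf)

# Split the original 50K training images into a 25K training set and a 25K validation set.
# (In practice the validation half would use a non-augmented transform.)
train_set, val_set = torch.utils.data.random_split(
    full_train, [25_000, 25_000], generator=torch.Generator().manual_seed(0))
train_loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True, num_workers=4)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=256, shuffle=False, num_workers=4)

# Tiny stand-in for the one-shot super-network (16 initial channels, B = 5 in the paper).
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(), torch.nn.Linear(16, 10))

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=2.5e-2, momentum=0.9,
                            nesterov=True, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=80, eta_min=1e-3)
</code></pre>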
<p>As a population-based NAS method, a larger number of epochs may lead to better performance, but the computational cost will also increase. We investigate the impact of the maximum number of epochs on the performance and computational cost on the CIFAR-10 dataset. The relative and absolute performance (accuracy) of the best architectures searched by EGFA-NAS on CIFAR-10 with different numbers of epochs is shown in Table <a href="#bookmark87">2</a>. The relative performance of the searched architectures is evaluated at the last epoch of the search phase without retraining. The absolute performance of the searched architecture is queried via the API provided by NAS-Bench-201. From the results in Table <a href="#bookmark87">2</a>, we can observe that the best performance (93.67% accuracy on CIFAR-10) is achieved when the number of epochs is set to 80. When the number of epochs is increased to 100, the absolute performance does not improve, while the computational cost increases. Hence, we set the number of epochs to 80 in the experiments for NAS-Bench-201.</p><p><a id="bookmark87"></a><strong>Table 2 </strong>Relative and absolute performance (accuracy) of the best architectures searched by EGFA-NAS on CIFAR-10 with different numbers of epochs</p><table><tr><td><p>Dataset</p></td><td><p>Number of epochs</p></td><td><p>Relative performance</p></td><td><p>Absolute performance</p></td><td><p>Search cost (GPU days)</p></td></tr><tr><td rowspan="5"><p>CIFAR-10</p></td><td><p>40</p></td><td><p>38.12</p></td><td><p>91.71</p></td><td><p>0.025</p></td></tr><tr><td><p>60</p></td><td><p>43.91</p></td><td><p>92.16</p></td><td><p>0.037</p></td></tr><tr><td><p>80</p></td><td><p>48.27</p></td><td><p>93.67</p></td><td><p>0.048</p></td></tr><tr><td><p>100</p></td><td><p>53.05</p></td><td><p>93.67</p></td><td><p>0.062</p></td></tr><tr><td><p>120</p></td><td><p>57.58</p></td><td><p>93.67</p></td><td><p>0.076</p></td></tr></table><p><a id="bookmark88"></a><strong>Table 3 </strong>Relative and absolute performance (accuracy) of the best architecture searched by EGFA-NAS on CIFAR-10 with different population sizes</p><table><tr><td><p>Dataset</p></td><td><p>Population size</p></td><td><p>Relative performance</p></td><td><p>Absolute performance</p></td><td><p>Search cost (GPU days)</p></td></tr><tr><td rowspan="5"><p>CIFAR-10</p></td><td><p>10</p></td><td><p>50.08</p></td><td><p>93.28</p></td><td><p>0.0481</p></td></tr><tr><td><p>15</p></td><td><p>49.00</p></td><td><p>93.36</p></td><td><p>0.0482</p></td></tr><tr><td><p>20</p></td><td><p>51.02</p></td><td><p>93.67</p></td><td><p>0.0482</p></td></tr><tr><td><p>25</p></td><td><p>48.83</p></td><td><p>93.67</p></td><td><p>0.0481</p></td></tr><tr><td><p>30</p></td><td><p>49.61</p></td><td><p>93.67</p></td><td><p>0.0482</p></td></tr></table><p>Note that all experimental settings are constrained by the computational resources available to us. All experiments are implemented with PyTorch 1.7 on one NVIDIA GeForce RTX 3090 GPU card. The computational cost is evaluated in terms of “GPU days”, calculated by multiplying the number of GPU cards by the search time in days, following [<a href="#bookmark16">19</a>, <a href="#bookmark38">20</a>, <a href="#bookmark89">62</a>].</p><p>Generally, population size is a vital factor for the performance and efficiency of a population-based method: a larger population size usually leads to better performance but also increases the search cost. In EGFA-NAS, however, we propose a training strategy that utilizes all dust individuals to complete the pass over the training data at each epoch.</p>
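<p>One plausible reading of this training strategy is that the mini-batches of an epoch are shared among the dust individuals, so that the population as a whole completes exactly one pass over the training data per epoch while each individual only trains on a fraction of it. The sketch below illustrates this reading; the function and variable names are ours, and the actual EGFA-NAS implementation may differ.</p>
<pre><code>import torch

def train_population_one_epoch(population, train_loader, criterion, device="cuda"):
    """`population` is a list of (model, optimizer) pairs, one per dust individual.
    The batches of one epoch are distributed round-robin over the individuals, so
    the whole population jointly covers the training data once per epoch."""
    n = len(population)
    for step, (images, labels) in enumerate(train_loader):
        model, optimizer = population[step % n]   # round-robin batch assignment
        model.train()
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
</code></pre>
<p>Under this reading, the per-epoch training cost stays roughly that of training a single network regardless of the population size, which is consistent with the nearly constant search cost reported in Table 3.</p>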
<p>This training strategy reduces the sensitivity of performance to the population size, which can be verified by the results in Table <a href="#bookmark88">3</a>. Specifically, with different population sizes, EGFA-NAS achieves not only similar performance but also similar search cost (GPU days). In addition, the architectures searched by EGFA-NAS achieve the best absolute performance when the population size <em>n</em> ≥ 20. In view of the above observations, we set the population size <em>n</em> to 20 in this work. In short, the absolute performance (accuracy) and search cost (GPU days) of EGFA-NAS are closely related to the maximum number of epochs, but not strongly related to the population size.</p><p><strong>Parameter settings for DARTS search space</strong></p><p>The neural cells for CNNs are searched in the DARTS search space on CIFAR-10/100 following [<a href="#bookmark4">7</a>, <a href="#bookmark14">17</a>]. The macro skeleton of the DARTS search space is shown in Fig. <a href="#bookmark53">4</a>. The parameter settings for the DARTS search space can be divided into two parts: (1) the searching phase and (2) the evaluation phase.</p><p>During the searching phase, we set the number of initial channels of the first convolutional layer to 16, the number of cells in a normal block <em>B</em> to 2, and the number of epochs to 200. For the training parameter w, we optimize each architecture via Nesterov momentum SGD with a batch size of 256, set the initial learning rate to 2.5 × 10−2, and anneal it down to 1 × 10−3 with a cosine annealing scheduler. We set the momentum to 0.9 and the weight decay to 5 × 10−4. To optimize the parameter <em>A</em>, we use the Adam optimizer with default settings.</p><p>During the evaluation phase, we train the searched network for 600 epochs in total. We set the initial channels to 33 and the number of cells in a normal block <em>B</em> to 6 or 8. We start with a learning rate of 2.5 × 10−2 and reduce it to 0 with the cosine scheduler. We set the probability of path drop to 0.2 and use an auxiliary tower with a weight of 0.4. Other parameter settings are the same as in the searching phase (Table <a href="#bookmark90">4</a>).</p><p><a id="bookmark90"></a><strong>Table 4 </strong>Hyperparameter settings for DARTS search space</p><table><tr><td><p>Parameter</p></td><td><p>Searching</p></td><td><p>Evaluation</p></td></tr><tr><td><p>Epochs</p></td><td><p>200</p></td><td><p>600</p></td></tr><tr><td><p>Initial channels</p></td><td><p>16</p></td><td><p>33</p></td></tr><tr><td><p><em>B</em></p></td><td><p>2</p></td><td><p>6/8</p></td></tr><tr><td><p>Optimizer</p></td><td><p>SGD/Adam</p></td><td><p>SGD</p></td></tr><tr><td><p>Batch size</p></td><td><p>256</p></td><td><p>256</p></td></tr><tr><td><p>Nesterov</p></td><td><p>1</p></td><td><p>1</p></td></tr><tr><td><p>Momentum</p></td><td><p>0.9</p></td><td><p>0.9</p></td></tr><tr><td><p>Scheduler</p></td><td><p>Cosine</p></td><td><p>Cosine</p></td></tr><tr><td><p>Initial LR</p></td><td><p>2.5 × 10−2</p></td><td><p>2.5 × 10−2</p></td></tr><tr><td><p>Min_LR</p></td><td><p>1 × 10−3</p></td><td><p>0</p></td></tr><tr><td><p>Weight decay</p></td><td><p>5 × 10−4</p></td><td><p>5 × 10−4</p></td></tr></table><p>Compared with NAS-Bench-201 (<em>e </em>= 6, |<em>O</em>| = 5), the DARTS search space (<em>e </em>= 14, |<em>O</em>| = 8) is larger. Therefore, we set the number of epochs to 200 to explore the DARTS search space.</p>
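<p>For orientation, the sketch below shows the alternating update of the network weights w (Nesterov momentum SGD, as configured above) and the architecture parameters A (Adam with default settings) in one search step. It is a simplified, DARTS-style illustration written by us: A is assumed to be registered inside the model and relaxed via a softmax over candidate operations, the EGFA-specific operators (explosion, absorption) that EGFA-NAS applies on top of this are omitted, and whether A is updated on training or held-out batches is not a detail taken from the paper.</p>
<pre><code>import torch

def search_step(model, w_optimizer, a_optimizer, criterion, train_batch, val_batch, device="cuda"):
    """One alternating optimization step: first the weights w, then the architecture
    parameters A (assumed to be the parameter group driven by `a_optimizer`)."""
    model.train()
    # 1) update the weights w on a training batch
    x_tr, y_tr = (t.to(device) for t in train_batch)
    model.zero_grad()
    criterion(model(x_tr), y_tr).backward()
    w_optimizer.step()
    # 2) update the architecture parameters A on a held-out batch
    x_va, y_va = (t.to(device) for t in val_batch)
    model.zero_grad()
    criterion(model(x_va), y_va).backward()
    a_optimizer.step()

# Optimizers configured with the searching-phase settings listed above (illustrative):
# w_optimizer = torch.optim.SGD(weight_params, lr=2.5e-2, momentum=0.9,
#                               nesterov=True, weight_decay=5e-4)
# a_optimizer = torch.optim.Adam(arch_params)  # Adam with default settings
</code></pre>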
The other parameters of EGFA-NAS, such as the population size <em>n</em>, the number of groups <em>g</em>, the absorptivity abs, the maximum radius <em>r</em>max, and the minimum radius <em>r</em>min, are set the same as in “<a href="#bookmark59">Parameter settings for NAS-Bench-201</a>”.</p><p><a id="bookmark9"></a><strong>Experimental results</strong></p><p><strong>Overall results in NAS-Bench-201 search space</strong></p><p>The experimental results of the optimal networks discovered by EGFA-NAS and the other competitors in NAS-Bench-201, in terms of classification accuracy and computational cost (GPU days), are presented in Table <a href="#bookmark91">5</a>. The symbol “–” means that the corresponding result was not reported. The results of iDARTS [<a href="#bookmark75">51</a>] and EvNAS [<a href="#bookmark76">52</a>] are sourced from the original published papers, and the results of the other competitors are extracted from [<a href="#bookmark19">23</a>]. The results highlighted in bold are those of the theoretically optimal architecture and of the architectures searched by EGFA-NAS.</p><p>From the results in Table <a href="#bookmark91">5</a>, we can observe that EGFA-NAS achieves better performance than the peer competitors DARTS-V1 [<a href="#bookmark16">19</a>], DARTS-V2 [<a href="#bookmark16">19</a>], SETN [<a href="#bookmark74">50</a>], iDARTS [<a href="#bookmark75">51</a>], GDAS [<a href="#bookmark38">20</a>], ENAS [<a href="#bookmark15">18</a>], RSPS [<a href="#bookmark18">22</a>], and EvNAS [<a href="#bookmark76">52</a>]. Specifically, in the NAS-Bench-201 search space, EGFA-NAS discovers a network architecture with only 1.29M parameters, which consumes 0.048 GPU days and achieves 93.67% accuracy on CIFAR-10. For the CIFAR-100 dataset, EGFA-NAS achieves 71.29% accuracy with 1.23M parameters and consumes 0.094 GPU days. For ImageNet-16-120, the architecture searched by EGFA-NAS obtains 42.33% accuracy with 1.32M parameters at a cost of 0.236 GPU days. Limited by the small NAS-Bench-201 search space, the performance of the searched network architectures is not comparable with that of state-of-the-art manually designed CNN networks. However, among all competitors in the NAS-Bench-201 search space, the network architecture searched by EGFA-NAS shows the smallest gap to the optimal theoretical architecture (0.7% worse on CIFAR-10, 2.22% worse on CIFAR-100, and 4.95% worse on ImageNet-16-120). In addition, the proposed EGFA-NAS has the best efficiency compared with all selected peer competitors.</p><p>Note that the search cost (GPU days) of the competitors listed in Table <a href="#bookmark91">5</a> is extracted from [<a href="#bookmark19">23</a>], but reference [<a href="#bookmark19">23</a>] does not indicate to which dataset the reported cost belongs. The number of parameters (Params) of the peer competitors is obtained by running the code provided by [<a href="#bookmark19">23</a>] on the CIFAR-10 dataset.</p>
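<p>For reference, the two bookkeeping quantities reported throughout these tables can be computed with a few lines of PyTorch; the helpers below are our own illustration rather than code from [23] or from EGFA-NAS.</p>
<pre><code>import torch

def count_parameters_m(model: torch.nn.Module) -> float:
    """Number of trainable parameters in millions (the 'Params (M)' column)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

def gpu_days(num_gpu_cards: int, search_time_seconds: float) -> float:
    """Search cost as (number of GPU cards) x (search time in days)."""
    return num_gpu_cards * search_time_seconds / 86_400.0
</code></pre>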
<p>The search cost (GPU days) of EGFA-NAS is the computational consumption measured separately for each of the three datasets on a computational platform with one NVIDIA GeForce RTX 3090 GPU card.</p><p><strong>Effectiveness of the relative performance evaluation</strong></p><p>Because NAS-Bench-201 [<a href="#bookmark19">23</a>] provides the evaluation information for each candidate architecture, in this section we utilize the API provided by NAS-Bench-201 to obtain the absolute (final) performance evaluation (loss and accuracy) of the searched architectures without retraining, and thereby verify the effectiveness of the evaluation strategy adopted by EGFA-NAS. Figure <a href="#bookmark92">9</a> compares the relative performance evaluation with the absolute performance evaluation, in terms of loss (Fig. <a href="#bookmark92">9</a>a) and accuracy (Fig. <a href="#bookmark92">9</a>b) on CIFAR-10. In Fig. <a href="#bookmark92">9</a>, the label “rel” represents the relative performance and the label “abs” represents the absolute performance. The relative performance of the searched architectures is obtained on the validation set at the current epoch during the architecture search phase. From the results in Fig. <a href="#bookmark92">9</a>, we can observe that the relative performance of the searched architectures is not comparable with their absolute performance, because the architectures evaluated during the search phase are not trained sufficiently. However, Fig. <a href="#bookmark92">9</a> illustrates that the trend of the relative performance is consistent with that of the absolute performance of the searched architectures. In addition, we can observe that EGFA-NAS is unstable only during the first several epochs and achieves architectures with stable performance once the number of epochs exceeds 30.</p>
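<p>To make the distinction concrete, the sketch below contrasts the two evaluation modes compared in Fig. 9: the relative score is simply the validation accuracy of the partially trained candidate at the current search epoch, while the absolute score is looked up from the NAS-Bench-201 results. This is our own illustration; `benchmark_table` stands in for whatever structure is built from the NAS-Bench-201 release and is not its real API.</p>
<pre><code>import torch

@torch.no_grad()
def relative_accuracy(model, val_loader, device="cuda"):
    """Validation accuracy of a partially trained candidate at the current epoch."""
    model.eval()
    correct = total = 0
    for images, labels in val_loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return 100.0 * correct / total

def absolute_accuracy(arch_encoding, benchmark_table):
    """Final (fully trained) accuracy looked up from pre-computed NAS-Bench-201 results.
    `benchmark_table` is assumed to map an architecture encoding to its test accuracy."""
    return benchmark_table[arch_encoding]
</code></pre>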
The observation above verifies the effectiveness of the evaluation strategy adopted by EGFA-NAS.</p><p><strong>Table 5 </strong>Comparison of</p><p>EGFA-NAS with the peer</p><p><a id="bookmark91"></a>competitors in terms of the</p><p>classification accuracy (%) and the computational cost (GPU</p><p>days) on CIFAR-10, CIFAR-100, and ImageNet-16-120 datasets</p><table><tr><td><p>Method</p></td><td><p>Search strategy</p></td><td><p>GPU days</p></td><td><p>Params(M)</p></td><td><p>CIFAR-10</p></td><td><p>CIFAR-100</p></td><td><p>ImageNet-16-120</p></td></tr><tr><td><p>DARTS-V1 <a href="#bookmark16">[ 19]</a></p></td><td><p>GD</p></td><td><p>0.13</p></td><td><p>0.07a</p></td><td><p>54.30</p></td><td><p>15.61</p></td><td><p>16.32</p></td></tr><tr><td><p>DARTS-V2 <a href="#bookmark16">[ 19]</a></p></td><td><p>GD</p></td><td><p>0.41</p></td><td><p>0.07a</p></td><td><p>54.30</p></td><td><p>15.61</p></td><td><p>16.32</p></td></tr><tr><td><p>iDARTS <a href="#bookmark75">[51]</a></p></td><td><p>GD</p></td><td><p>–</p></td><td><p>–</p></td><td><p>93.58</p></td><td><p>70.83</p></td><td><p>40.89</p></td></tr><tr><td><p>SETN <a href="#bookmark74">[50]</a></p></td><td><p>GD</p></td><td><p>0.35</p></td><td><p>0.41a</p></td><td><p>86.19</p></td><td><p>56.87</p></td><td><p>31.90</p></td></tr><tr><td><p>GDAS <a href="#bookmark38">[20]</a></p></td><td><p>GD</p></td><td><p>0.33</p></td><td><p>1.2a</p></td><td><p>93.51</p></td><td><p>70.61</p></td><td><p>41.71</p></td></tr><tr><td><p>ENAS <a href="#bookmark15">[ 18]</a></p></td><td><p>RL</p></td><td><p>0.15</p></td><td><p>0.07a</p></td><td><p>54.30</p></td><td><p>15.61</p></td><td><p>16.32</p></td></tr><tr><td><p>RSPS <a href="#bookmark18">[22]</a></p></td><td><p>Random</p></td><td><p>0.10</p></td><td><p>0.43a</p></td><td><p>87.66</p></td><td><p>58.33</p></td><td><p>31.44</p></td></tr><tr><td><p>EvNAS <a href="#bookmark76">[52]</a></p></td><td><p>EA</p></td><td><p>0.26</p></td><td><p>–</p></td><td><p>92.18</p></td><td><p>66.74</p></td><td><p>39.00</p></td></tr><tr><td><p>Optimal</p></td><td><p> </p></td><td><p> </p></td><td><p> </p></td><td><p><strong>94.37</strong></p></td><td><p><strong>73.51</strong></p></td><td><p><strong>47.31</strong></p></td></tr><tr><td><p>EGFA-NAS</p></td><td><p><strong>EGFA</strong></p></td><td><p><strong>0.048</strong></p></td><td><p><strong>1.29</strong></p></td><td><p><strong>93.67</strong></p></td><td><p>–</p></td><td><p>–</p></td></tr><tr><td><p>EGFA-NAS</p></td><td><p><strong>EGFA</strong></p></td><td><p><strong>0.094</strong></p></td><td><p><strong>1.23</strong></p></td><td><p>–</p></td><td><p><strong>71.29</strong></p></td><td><p>–</p></td></tr><tr><td><p>EGFA-NAS</p></td><td><p><strong>EGFA</strong></p></td><td><p><strong>0.246</strong></p></td><td><p><strong>1.32</strong></p></td><td><p>–</p></td><td><p>–</p></td><td><p><strong>42.33</strong></p></td></tr></table><p>aCalculated by running the code publicly released by <a href="#bookmark19">[23]</a></p><p><img src="/media/202408//1724856293.044161.png" /></p><p><strong>Fig. 9 </strong>Comparison of relative evaluation and absolute evaluation of the architecture searched by EGFA-NAS</p><p><a id="bookmark92"></a><strong>Effectiveness of weight inheritance strategy</strong></p><p><img src="/media/202408//1724856293.1771169.png" />To improve the efficiency ofEGFA-NAS andreducethe com- putational cost, we propose a weight inheritance strategy <a href="#bookmark60">during the explosion operation as described in “Explosion operation and weights inheritance” . 
Specifically, the parameters</a> w of newly generated dust individuals are inherited from their centers. In this section, we attempt to verify the effectiveness of the weight inheritance strategy by replacing it with random generation of the parameters w on CIFAR-10, keeping all other settings unchanged. To observe the difference between the proposed strategy and random generation of the parameters w more clearly, we set the number of epochs to 300 in this experiment. The estimated (relative) performance of the network architectures searched with weight inheritance and with randomly generated parameters w is shown in Fig. <a href="#bookmark93">10</a>a and c, in terms of accuracy and loss, respectively. The final (absolute) performance of the network architectures searched by the two strategies is shown in Fig. <a href="#bookmark93">10</a>b and d, in terms of accuracy and loss, respectively. The results in Fig. <a href="#bookmark93">10</a> show a large difference between the estimated (relative) performance of the two strategies. Although the final (absolute) performance of the architectures searched by the two strategies is similar on CIFAR-10, EGFA-NAS with the proposed weight inheritance reaches the best network architecture earlier than with randomly generated parameters w. In addition, the final performance of the architecture searched with weight inheritance is slightly better (93.67% accuracy) than that obtained with randomly generated parameters w (96.36% accuracy).</p><p><img src="/media/202408//1724856293.2076328.png" /></p><p><strong>Fig. 10 </strong>Comparison of the performance of EGFA-NAS using the weight inheritance strategy and generating the parameter w randomly on CIFAR-10</p><p><a id="bookmark93"></a><strong>Overall results in DARTS search space</strong></p><p>The experimental results of the optimal network discovered by EGFA-NAS in the DARTS search space, in terms of classification accuracy and computational cost (GPU days), are presented in Table <a href="#bookmark94">6</a>. The symbol “–” means that the corresponding results were not reported. The symbol “*” means that the results are extracted from [<a href="#bookmark16">19</a>]. The notation “a/b” in Table <a href="#bookmark94">6</a> means that “a” is the result for CIFAR-10 and “b” is the result for CIFAR-100. The results of most competitors are extracted from the original published papers. <em>B </em>= 6 or 8 denotes the number of normal cells in a normal block in the retraining phase. The results highlighted in bold are those of the architectures searched by EGFA-NAS.</p><p>The results in Table <a href="#bookmark94">6</a> show that EGFA-NAS (<em>B </em>= 8) achieves better performance than most state-of-the-art manually designed CNN networks, including ResNet-101, ResNet + CutOut, SENet, IGCV3, ShuffleNet, VGG, and Wide ResNet, but slightly worse (by 1.05% on CIFAR-100) than DenseNet-BC. Compared with VGG, the optimal network architecture searched by EGFA-NAS (<em>B </em>= 8) improves accuracy by 13.9% on CIFAR-100 and 3.89% on CIFAR-10.</p><p>Compared with the 12 EA-based NAS methods, EGFA-NAS (<em>B </em>= 8) achieves better performance than Hierarchical EA, AmoebaNet-A, CGP-CNN, CNN-GA, AE-CNN, AE-CNN + E2EPP, LargeEvo, GeNet, SI-EvoNet, and MOEA-PS, but slightly worse than LEMONADE (0.19%) and NSGA-Net (0.02%) on CIFAR-10.
EGFA-NAS (<em>B </em>= 8) achieves the best classification accuracy (81.85%) on the CIFAR-100, and consumes the least search cost (0.21 GPU days) than all selected EA-based NAS methods.</p><p>Compared with the six RL-based NAS methods, EGFA- NAS (<em>B </em>= 8) achieves better performance than NASNet-A, NASNet-A + CutOut, BlockQNN, DPP-Net, MetaQNN, and ENAS, but a little worse than Proxyless NAS (0.86%) on the CIFAR-10. The performance improvement of the opti- mal network architecture searched by EGFA-NAS (<em>B </em>= 8) is 4.15% on the CIFAR-10, and 8.99% on the CIFAR-100,</p><p><strong>Table 6 </strong>Comparison of</p><p>EGFA-NAS with the peer</p><p><a id="bookmark94"></a>competitors in terms of the</p><p>classification accuracy (%) and the computational cost (GPU</p><p>days) on CIFAR-10, CIFAR-100</p><table><tr><td><p>Method</p></td><td><p>Search strategy</p></td><td><p>GPU days</p></td><td><p>Params (M)</p></td><td><p>CIFAR-10</p></td><td><p>CIFAR-100</p></td></tr><tr><td><p>ResNet-101 <a href="#bookmark7">[ 10]</a></p></td><td><p>Manual</p></td><td><p>–</p></td><td><p>1.7</p></td><td><p>93.57</p></td><td><p>74.84</p></td></tr><tr><td><p>ResNet + CutOut <a href="#bookmark7">[ 10]</a></p></td><td><p>Manual</p></td><td><p> </p></td><td><p>1.7</p></td><td><p>95.39</p></td><td><p>77.90</p></td></tr><tr><td><p>DenseNet-BC <a href="#bookmark8">[ 11]</a></p></td><td><p>Manual</p></td><td><p>–</p></td><td><p>25.6</p></td><td><p>96.54</p></td><td><p>82.82</p></td></tr><tr><td><p>SENet <a href="#bookmark77">[53]</a></p></td><td><p>Manual</p></td><td><p>–</p></td><td><p>11.2</p></td><td><p>95.95</p></td><td><p>–</p></td></tr><tr><td><p>IGCV3 <a href="#bookmark78">[54]</a></p></td><td><p>Manual</p></td><td><p>–</p></td><td><p>2.2</p></td><td><p>94.96</p></td><td><p>77.95</p></td></tr><tr><td><p>ShuffleNet <a href="#bookmark79">[55]</a></p></td><td><p>Manual</p></td><td><p> </p></td><td><p>1.06</p></td><td><p>90.87</p></td><td><p>77.14</p></td></tr><tr><td><p>VGG <a href="#bookmark1">[ 1]</a></p></td><td><p>Manual</p></td><td><p>–</p></td><td><p>28.05</p></td><td><p>93.34</p></td><td><p>67.95</p></td></tr><tr><td><p>Wide ResNet <a href="#bookmark80">[56]</a></p></td><td><p>Manual</p></td><td><p>–</p></td><td><p>36.48</p></td><td><p>95.83</p></td><td><p>79.50</p></td></tr><tr><td><p>Hierarchical EA <a href="#bookmark12">[15]</a></p></td><td><p>EA</p></td><td><p>300</p></td><td><p>61.3</p></td><td><p>96.37</p></td><td><p>–</p></td></tr><tr><td><p>AmoebaNet-A <a href="#bookmark13">[ 16]</a></p></td><td><p>EA</p></td><td><p>3150</p></td><td><p>3.2</p></td><td><p>96.66</p></td><td><p>81.07</p></td></tr><tr><td><p>LEMONADE <a href="#bookmark20">[24]</a></p></td><td><p>EA</p></td><td><p>90</p></td><td><p>13.1</p></td><td><p>97.42</p></td><td><p>–</p></td></tr><tr><td><p>CGP-CNN <a href="#bookmark32">[25]</a></p></td><td><p>EA</p></td><td><p>27</p></td><td><p>1.7</p></td><td><p>94.02</p></td><td><p>–</p></td></tr><tr><td><p>CNN-GA <a href="#bookmark33">[26]</a></p></td><td><p>EA</p></td><td><p>35/40</p></td><td><p>2.9/4.1</p></td><td><p>96.78</p></td><td><p>79.47</p></td></tr><tr><td><p>AE-CNN <a href="#bookmark34">[32]</a></p></td><td><p>EA</p></td><td><p>27/36</p></td><td><p>2.0/5.4</p></td><td><p>95.3</p></td><td><p>77.6</p></td></tr><tr><td><p>AE-CNN + E2EPP <a href="#bookmark35">[33]</a></p></td><td><p>EA</p></td><td><p>7/10</p></td><td><p>4.3/20.9</p></td><td><p>94.7</p></td><td><p>77.98</p></td></tr><tr><td><p>LargeEvo <a 
href="#bookmark31">[27]</a></p></td><td><p>EA</p></td><td><p>2750/2750</p></td><td><p>5.4/40.4</p></td><td><p>94.6</p></td><td><p>77.00</p></td></tr><tr><td><p>GeNet <a href="#bookmark30">[31]</a></p></td><td><p>EA</p></td><td><p>–</p></td><td><p>–</p></td><td><p>94.61</p></td><td><p>74.88</p></td></tr><tr><td><p>SI-EvoNet <a href="#bookmark81">[57]</a></p></td><td><p>EA</p></td><td><p>0.46/0.81</p></td><td><p>0.51/0.99</p></td><td><p>96.02</p></td><td><p>79.16</p></td></tr><tr><td><p>NSGA-Net <a href="#bookmark21">[28]</a></p></td><td><p>EA</p></td><td><p>4/8</p></td><td><p>3.3/3.3</p></td><td><p>97.25</p></td><td><p>79.26</p></td></tr><tr><td><p>MOEA-PS <a href="#bookmark82">[58]</a></p></td><td><p>EA</p></td><td><p>2.6/5.2</p></td><td><p>3.0/5.8</p></td><td><p>97.23</p></td><td><p>81.03</p></td></tr><tr><td><p>NASNet-A <a href="#bookmark14">[ 17]</a></p></td><td><p>RL</p></td><td><p>2000</p></td><td><p>3.3</p></td><td><p>96.59</p></td><td><p> </p></td></tr><tr><td><p>NASNet-A + CutOut <a href="#bookmark14">[ 17]</a></p></td><td><p>RL</p></td><td><p>2000</p></td><td><p>3.1</p></td><td><p>97.17</p></td><td><p>–</p></td></tr><tr><td><p>Proxyless NAS <a href="#bookmark36">[34]</a></p></td><td><p>RL</p></td><td><p>1500</p></td><td><p>5.7</p></td><td><p>97.92</p></td><td><p>–</p></td></tr><tr><td><p>BlockQNN <a href="#bookmark37">[35]</a></p></td><td><p>RL</p></td><td><p>96</p></td><td><p>39.8</p></td><td><p>96.46</p></td><td><p>–</p></td></tr><tr><td><p>DPP-Net <a href="#bookmark84">[59]</a></p></td><td><p>RL</p></td><td><p>8</p></td><td><p>0.45</p></td><td><p>94.16</p></td><td></td></tr><tr><td><p>MetaQNN <a href="#bookmark85">[60]</a></p></td><td><p>RL</p></td><td><p>90</p></td><td><p>11.2</p></td><td><p>93.08</p></td><td><p>72.86</p></td></tr><tr><td><p>ENAS <a href="#bookmark15">[ 18]</a></p></td><td><p>RL</p></td><td><p>0.5</p></td><td><p>4.6</p></td><td><p>97.06</p></td><td><p>–</p></td></tr><tr><td><p>ENAS <a href="#bookmark15">[ 18</a>]*</p></td><td><p>RL</p></td><td><p>4</p></td><td><p>4.2</p></td><td><p>97.09</p></td><td><p>–</p></td></tr><tr><td><p>DARTS-V1 + CutOut <a href="#bookmark16">[ 19]</a></p></td><td><p>GD</p></td><td><p>1.5</p></td><td><p>3.3</p></td><td><p>97.00</p></td><td></td></tr><tr><td><p>DARTS-V2 + CutOut <a href="#bookmark16">[ 19]</a></p></td><td><p>GD</p></td><td><p>4</p></td><td><p>3.4</p></td><td><p>97.18</p></td><td><p>82.46</p></td></tr><tr><td><p>RC-DARTS <a href="#bookmark41">[38]</a></p></td><td><p>GD</p></td><td><p>1</p></td><td><p>0.43</p></td><td><p>95.83</p></td><td></td></tr><tr><td><p>SNAS <a href="#bookmark17">[21]</a></p></td><td><p>GD</p></td><td><p>1.5</p></td><td><p>2.8</p></td><td><p>97.15</p></td><td><p>–</p></td></tr><tr><td><p>PNAS <a href="#bookmark43">[40]</a></p></td><td><p>SMBO</p></td><td><p>225</p></td><td><p>3.2</p></td><td><p>96.37</p></td><td><p>80.47</p></td></tr><tr><td><p>EGFA-NAS (<em>B </em>= 6)</p></td><td><p><strong>EGFA</strong></p></td><td><p><strong>0.21/0.4</strong></p></td><td><p><strong>2.56/2.15</strong></p></td><td><p><strong>96.57</strong></p></td><td><p><strong>80.08</strong></p></td></tr><tr><td><p>EGFA-NAS (<em>B </em>= 8)</p></td><td><p><strong>EGFA</strong></p></td><td><p><strong>0.21/0.4</strong></p></td><td><p><strong>3.47/2.88</strong></p></td><td><p><strong>97.23</strong></p></td><td><p><strong>81.85</strong></p></td></tr></table><p>*Extracted from the reference <a href="#bookmark16">[ 19]</a></p><p>compared with MetaQNN. 
The proposed EGFA-NAS (<em>B </em>= 8) has the best efficiency and consumes the fewest GPU days, even compared with ENAS, which is reported to consume only 0.5 GPU days on CIFAR-10 in its published paper.</p><p>Compared with the four GD-based NAS methods and PNAS, EGFA-NAS (<em>B </em>= 8) achieves better performance than DARTS-V1 + CutOut, RC-DARTS, and SNAS, but slightly worse (by 0.61%) than DARTS-V2 + CutOut on CIFAR-100. Although GD-based NAS methods usually have better efficiency than EA-based and RL-based methods, the proposed EGFA-NAS (<em>B </em>= 8) has the best efficiency compared with all selected GD-based NAS methods.</p><p>In addition, EGFA-NAS can obtain better final learning accuracy when a larger number of cells per normal block is used during the retraining phase, although this leads to a larger number of parameters. The overall results in Table <a href="#bookmark94">6</a> show that the proposed EGFA-NAS not only has competitive learning accuracy but also has the best efficiency compared with the four kinds of competitors.</p><p><a id="bookmark26"></a><strong>Conclusion</strong></p><p>This paper proposes an efficient population-based NAS method based on EGFA, called EGFA-NAS, which can achieve an optimal neural architecture with competitive learning accuracy at a small computational cost. Specifically, EGFA-NAS relaxes the discrete search space to a continuous one and then utilizes EGFA and gradient descent in conjunction to optimize the weights of the candidate architectures. The proposed training and weight inheritance strategies reduce the computational cost of EGFA-NAS dramatically. The experimental results in two typical micro search spaces, NAS-Bench-201 and DARTS, demonstrate that EGFA-NAS is able to match or outperform state-of-the-art NAS methods on image classification tasks with a remarkable efficiency improvement. Specifically, when searching on CIFAR-10 with <a id="bookmark2"></a>one NVIDIA GeForce RTX 3090 GPU card, EGFA-NAS obtains <a id="bookmark3"></a>the optimal neural architecture in the NAS-Bench-201 search space with 93.67% accuracy while consuming only 0.048 GPU days, and discovers the optimal neural architecture in the DARTS search space with 97.23% accuracy at a cost of 0.21 GPU days.</p><p>Although EGFA-NAS is promising for designing high-performance neural networks automatically, it still has one limitation. Similar to other NAS methods using a low-fidelity evaluation strategy, the relative evaluation adopted by EGFA-NAS during the search phase may lead to missing some promising architectures. In future work, we will attempt to design a better evaluation strategy with better rank consistency for lightweight NAS.</p><p><strong>Acknowledgements </strong>This work was supported by the National Natural Science Foundation of China (no. 62072212), the Development Project of Jilin Province of China (no. 20220508125RC, 20230201065GX), and the Jilin Provincial Key Laboratory of Big Data Intelligent Cognition (no. 20210504003GH).</p><p><strong>Data availability </strong>Data will be made available on request.</p><p>
<strong>Declarations</strong></p><p><strong>Conflict of interest </strong>On behalf of all authors, the corresponding author states that there is no conflict of interest.</p><p><strong>Open Access </strong>This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing,adap- tation, distribution and reproduction in any medium or format, as <a id="bookmark14"></a>long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indi- cate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, <a id="bookmark15"></a>unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the</p><p><img src="/media/202408//1724856293.5380611.png" />permitteduse, youwillneedtoobtainpermissiondirectlyfromthe copy- <a href="http://creativecommons.org/licenses/by/4.0/">right holder. To view a copy of this licence, visit http://creativecomm ons.org/licenses/by/4.0/.</a></p><p><a id="bookmark1"></a><strong>References</strong></p><p><img src="/media/202408//1724856293.5474.png" />1. Simonyan K, Zisserman A (2015) Very Deep Convolutional Net- <a href="https://arxiv.org/abs/1409.1556">works for Large-Scale Image Recognition. arXiv preprint, arXiv: 1409.1556</a></p><p>2. Huang G, Sun Y, Liu Z etal (2016) Deep networks with stochastic depth. In: Leibe B, Matas J, Sebe N, Welling M (eds) Computer Vision—ECCV 2016. Springer International Publishing, Cham, pp 646–661</p><p>3. CiresanD, MeierU, SchmidhuberJ(2012)Multi-columndeep neu- ral networks for image classification. In: Proceedings of the IEEE international conference on computer vision. CVPR, Providence, pp 3642–3649</p><p>4. Krizhevsky A, Sutskever I, Hinton GE (2017) ImageNet classifi- cation with deep convolutional neural networks. Commun ACM 60:84–90. <a href="https://doi.org/10.1145/3065386">https://doi.org/10.1145/3065386</a></p><p>5. Girshick R (2015) Fast r-cnn. In: Proceedings of the IEEE interna- tional conference on computer vision. ICCV, pp 1440–1448</p><p>6. Zhao Z,Zheng P, XuS (2019) Object detection with deep learning: a review. IEEE Trans Neural Netw Learning Syst 30:3212–3232.</p><p><a id="bookmark4"></a><a href="https://doi.org/10.1109/TNNLS.2018.2876865">https://doi.org/10.1109/TNNLS.2018.2876865</a></p><p>7. Zoph B, Le QV (2017) Neural Architecture Search with Reinforce- <a id="bookmark5"></a>ment Learning. arXiv preprint, <a href="https://arxiv.org/abs/1611.01578">arXiv:1611.01578</a></p><p><img src="/media/202408//1724856293.594785.png" />8. Hesamian MH, Jia W, He X, Kennedy P (2019) Deep learning techniques for medical image segmentation: achievements and <a href="https://doi.org/10.1007/s10278-019-00227-x">challenges. J Digit Imaging 32:582–596. </a><a href="https://doi.org/10.1007/">https://doi.org/10.1007/</a><a href="https://doi.org/10.1007/s10278-019-00227-x"> </a><a id="bookmark6"></a><a href="https://doi.org/10.1007/s10278-019-00227-x">s10278-019-00227-x</a></p><p>9. Ghosh S, Das N, Das I, Maulik U (2020) Understanding deep learning techniques for image segmentation. ACM Comput Surv <a id="bookmark7"></a>52:1–35. <a href="https://doi.org/10.1145/3329784">https://doi.org/10.1145/3329784</a></p><p>10. He K,ZhangX, RenS,etal(2016)Deepresiduallearningforimage recognition. 
In: Proceedings of the IEEE conference on computer <a id="bookmark8"></a>vision and pattern recognition. CVPR, pp 770–778</p><p>11. Huang G, Liu Z, VanDer Maaten L et al (2017) Densely connected convolutionalnetworks. In:ProceedingsoftheIEEE conference on computer vision and pattern recognition. pp 4700–4708</p><p><a id="bookmark10"></a>12. Praczyk T (2016) Cooperative co-evolutionary neural networks. IFS 30:2843–2858. <a href="https://doi.org/10.3233/IFS-162095">https://doi.org/10.3233/IFS-162095</a></p><p><img src="/media/202408//1724856293.711373.png" />13. Garcia-Pedrajas N, Hervas-Martinez C, Munoz-Perez J (2003) COVNET: a cooperative coevolutionary model for evolving artifi- <a href="https://doi.org/10.1109/TNN.2003.810618">cialneuralnetworks. IEEE Trans Neural Netw14:575–596. https:// </a><a id="bookmark11"></a><a href="https://doi.org/10.1109/TNN.2003.810618">doi.org/10.1109/TNN.2003.810618</a></p><p>14. Yao X (1999) Evolving artificial neural networks. Proc IEEE <a id="bookmark12"></a>87:1423–1447. <a href="https://doi.org/10.1109/5.784219">https://doi.org/10.1109/5.784219</a></p><p><img src="/media/202408//1724856293.731762.png" />15. Liu H, Simonyan K, Vinyals O, et al (2018) Hierarchical Repre- <a href="https://arxiv.org/abs/1711.00436">sentations for Efficient Architecture Search. arXiv preprint, arXiv: 1711.00436</a></p><p>16. Real E, Aggarwal A, Huang Y, Le QV (2019) Regularized evolu- <a id="bookmark13"></a>tion for image classifier architecture search. AAAI 33:4780–4789.</p><p><a href="https://doi.org/10.1609/aaai.v33i01.33014780">https://doi.org/10.1609/aaai.v33i01.33014780</a></p><p>17. Zoph B, Vasudevan V, Shlens J, Le QV (2018) Learning transfer- able architectures for scalable image recognition. In: Proceedings oftheIEEE conference on computervisionandpatternrecognition. CVPR, pp 8697–8710</p><p>18. Pham H, Guan M,Zoph B, et al (2018) Efficient neural architecture search via parameters sharing. In: Proceedings of the 35th Interna- tional Conference on Machine Learning. PMLR, pp 4095–4104</p><p>19. Liu H, Simonyan K, Yang Y (2019) DARTS: Differentiable Archi- <a id="bookmark16"></a>tecture Search. arXiv preprint, <a href="https://arxiv.org/abs/1806.09055">arXiv:1806.09055</a></p><p>20. Dong X, Yang Y (2019) Searching for a robust neural architec- <a id="bookmark38"></a><a id="bookmark43"></a>ture in four gpu hours. In: Proceedings of the IEEE international <a id="bookmark17"></a>conference on computer vision. CVPR, pp 1761–1770</p><p>21. Xie S, Zheng H, Liu C, Lin L (2020) SNAS: Stochastic Neural Architecture Search. arXiv preprint, <a href="https://arxiv.org/abs/1812.09926">arXiv:1812.09926</a></p><p>22. Li L, Talwalkar A (2020) Random search and reproducibility for <a id="bookmark18"></a><a id="bookmark19"></a>neural architecture search. In: Adams RP, Gogate V (eds) Proceed- ings of the 35th uncertainty in artificial intelligence conference. PMLR, pp 367–377</p><p><img src="/media/202408//1724856293.812905.png" />23. Dong X, Yang Y (2020) NAS-Bench-201: Extending the Scope of <a href="https://arxiv.org/abs/2001.00326">Reproducible Neural Architecture Search. arXiv preprint, arXiv: 2001.00326</a></p><p><a id="bookmark20"></a>24. Elsken T, Metzen JH, Hutter F (2019) Efficient Multi-objective Neural Architecture Search via Lamarckian Evolution. arXiv <a id="bookmark32"></a>preprint, <a href="https://arxiv.org/abs/1804.09081">arXiv:1804.09081</a></p><p>25. 
Suganuma M, Shirakawa S, Nagao T (2017) A genetic pro- gramming approach to designing convolutional neural network <a id="bookmark46"></a>architectures. In: Proceedings of the genetic and evolutionary com- <a id="bookmark33"></a>putation conference. ACM, Berlin, pp 497–504</p><p><img src="/media/202408//1724856293.834065.png" />26. SunY, Xue B, Zhang Metal (2020) Automatically designing CNN architectures using the genetic algorithm for image classification. IEEETrans Cybern50:3840–3854. <a href="https://doi.org/10.1109/TCYB">https://doi.org/10.1109/TCYB</a>. 2020.2983860</p><p><a id="bookmark31"></a>27. Real E, Moore S, Selle A et al (2017) Large-scale evolution of <a id="bookmark51"></a>image classifiers. In: Proceedings of the 34th international confer- <a id="bookmark21"></a>ence on machine learning. PMLR, pp 2902–2911</p><p>28. Lu Z, Whalen I, Boddeti V, et al (2019) NSGA-Net: neural architecture search using multi-objective genetic algorithm. In: Proceedings of the genetic and evolutionary computation confer- ence. ACM, Prague, pp 419–427</p><p>29. Hu X, Huang L, Wang Y, Pang W (2019) Explosion gravitation <a id="bookmark22"></a>field algorithm with dust sampling for unconstrained optimization. Appl Soft Comput81:105500. <a href="https://doi.org/10.1016/j.asoc.2019">https://doi.org/10.1016/j.asoc.2019</a>. 105500</p><p>30. Gould S, Fernando B, Cherian A, et al (2016) On differentiating <a id="bookmark29"></a><a id="bookmark75"></a>parameterized argmin and argmax problems with application to <a id="bookmark30"></a>bi-level optimization. arXiv:1607.05447</p><p>31. Xie L, Yuille A (2017) Genetic CNN. In: Proceedings of the IEEE <a id="bookmark34"></a><a id="bookmark76"></a>international conference on computer vision. ICCV, pp 1379–1388</p><p>32. Sun Y, Xue B, Zhang M, Yen GG (2020) Completely automated CNN architecturedesignbased on blocks. IEEE Trans Neural Netw Learn Syst 31:1242–1254. <a href="https://doi.org/10.1109/TNNLS.2019">https://doi.org/10.1109/TNNLS.2019</a>. 2919608</p><p><a id="bookmark35"></a>33. SunY, Wang H, Xue B et al (2020) Surrogate-assisted evolutionary deep learning using an end-to-end random forest-based perfor- mance predictor. IEEE Trans Evol Comput 24:350–364. https:// <a id="bookmark36"></a>doi.org/10.1109/TEVC.2019.2924461</p><p>34. Cai H, Zhu L, Han S (2019) ProxylessNAS: Direct Neural Archi- tectureSearch on Target Task and Hardware. arXiv preprint, arXiv: 1812.00332</p><p>35. Zhong Z, Yang Z, Deng B et al (2021) BlockQNN: efficient block- <a id="bookmark37"></a>wise neural network architecture generation. IEEE Trans Pattern <a id="bookmark39"></a>Anal Mach Intell 43:2314–2328. <a href="https://doi.org/10.1109/TPAMI">https://doi.org/10.1109/TPAMI</a>. 2020.2969193</p><p>36. Chu X, Wang X, ZhangB,etal(2021)DARTS-:RobustlyStepping out of Performance Collapse Without Indicators. arXiv preprint,</p><p><a href="https://arxiv.org/abs/2009.01027">arXiv:2009.01027</a></p><p>37. Liang H, Zhang S, Sun J, et al (2020) DARTS+: Improved Differ- entiable Architecture Search with Early Stopping. arXiv preprint,</p><p><a href="https://arxiv.org/abs/1909.06035">arXiv:1909.06035</a></p><p><img src="/media/202408//1724856293.938485.png" />38. Jin X, Wang J, Slocum J, et al (2019) RC-DARTS: Resource Con- <a href="https://arxiv.org/abs/1912.12814">strained Differentiable Architecture Search. arXiv preprint, arXiv: 1912.12814</a></p><p>39. Ye P, LiB, Li Y, et al (2022) β-DARTS: Beta-Decay Regularization for Differentiable Architecture Search. 
In:ProceedingsoftheIEEE <a id="bookmark42"></a>conference on computer vision and pattern recognition. CVPR,</p><p><img src="/media/202408//1724856293.955188.png" />New Orleans, LA, USA, pp 10864–10873. <a href="https://doi.org/10.1109/">https://doi.org/10.1109/</a> CVPR52688.2022.01060</p><p>40. Liu C, Zoph B, Neumann M et al (2018) Progressive neural archi- tecture search. In: Proceedings of the European conference on computer vision. ECCV, pp 19–34</p><p>41. ZhengM,LiuG,ZhouCet al(2010)Gravitationfieldalgorithmand its application in gene cluster. Algorithms Mol Biol 5:32. https:// doi.org/10.1186/1748-7188-5-32</p><p>42. Zheng M, Sun Y, Liu G et al (2012) Improved gravitation field algorithm and its application in hierarchical clustering. PLoS One <a id="bookmark44"></a>7:e49039. <a href="https://doi.org/10.1371/journal.pone.0049039">https://doi.org/10.1371/journal.pone.0049039</a></p><p><img src="/media/202408//1724856293.985219.png" />43. Zheng M, Wu J, Huang Y et al (2012) Inferring gene regulatory networks by singular value decomposition and gravitation field <a href="https://doi.org/10.1371/journal.pone.0051141">algorithm. PLoS One 7:e51141. </a><a href="https://doi.org/10.1371/journal">https://doi.org/10.1371/journal</a><a href="https://doi.org/10.1371/journal.pone.0051141">. pone.0051141</a></p><p>44. Safronov VS (1972) Evolution of the protoplanetary cloud and <a id="bookmark45"></a>formation of the earth and the planets. Israel Program for Scientific Translations, Jerusalem</p><p>45. Huang L, Hu X, Wang Y, Fu Y (2022) EGFAFS: a novel feature selection algorithm based on explosion gravitation field algorithm. Entropy 24:873. <a href="https://doi.org/10.3390/e24070873">https://doi.org/10.3390/e24070873</a></p><p><a id="bookmark48"></a>46. Real E, Moore S, Selle A, et al (2017) Large-scale evolution of imageclassifiers. In:International conference on machinelearning. PMLR, pp 2902–2911</p><p>47. Krizhevsky A, Hinton G (2009) Learning multiple layers of fea- <a id="bookmark52"></a>tures from tiny images. 7.</p><p>48. ChrabaszczP, Loshchilov I, Hutter F (2017) A Downsampled Vari- ant of ImageNet as an Alternative to the CIFAR datasets. arXiv <a id="bookmark64"></a>preprint, <a href="https://arxiv.org/abs/1707.08819">arXiv:1707.08819</a></p><p>49. Zhang Z, Sabuncu M (2018) Generalized cross entropy loss for training deep neural networks with noisy labels. Adv Neural Inf Process Syst. 31.</p><p><a id="bookmark74"></a>50. Dong X, Yang Y (2019) One-shot neural architecture search via self-evaluated template network. In: Proceedings of the IEEE inter- national conference on computer vision. ICCV, pp 3681–3690</p><p>51. Zhang M, Su SW, Shirui P et al (2021) iDARTS: Differentiable architecture search with stochastic implicit gradients. In: Proceed- ings of the 38th international conference on machine learning. PMLR, pp 12557–12566</p><p>52. Sinha N, Chen K-W (2021) Evolving neural architecture using one shot model. In: Proceedings of the genetic and evolutionary computation conference. ACM, Lille France, pp 910–918</p><p><a id="bookmark77"></a>53. Jie H, Li S, Gang S (2018) Squeeze-and-excitation networks. In: ProceedingsoftheIEEE conference on computervisionandpattern <a id="bookmark78"></a>recognition. CVPR, pp 7132–7141</p><p>54. Sun K, Li M, Liu D, Wang J (2018) IGCV3: Interleaved Low- Rank Group Convolutions for Efficient Deep Neural Networks. arXiv preprint, <a href="https://arxiv.org/abs/1806.00178">arXiv:1806.00178</a></p><p>55. 
Zhang X, Zhou X, Lin M, Sun J (2018) ShuffleNet: an extremely <a id="bookmark79"></a>efficient convolutional neural network for mobile devices. In: Pro- ceedings of the IEEE conference on computer vision and pattern <a id="bookmark80"></a>recognition. CVPR, pp 6848–6856</p><p>56. ZagoruykoS, KomodakisN(2017)WideResidualNetworks. arXiv <a id="bookmark81"></a>preprint, <a href="https://arxiv.org/abs/1605.07146">arXiv:1605.07146</a></p><p><img src="/media/202408//1724856294.01585.png" />57. Zhang H, Jin Y, Cheng R, Hao K (2021) Efficient evolutionary searchofattention convolutionalnetworks viasampledtraining and node inheritance. IEEE Trans Evol Comput 25:371–385. https:// <a id="bookmark40"></a><a id="bookmark82"></a>doi.org/10.1109/TEVC.2020.3040272</p><p>58. Xue Y, Chen C, Słowik A (2023) Neural architecture search based on a multi-objective evolutionary algorithm with probability stack. <a id="bookmark41"></a>IEEE Trans Evol Comput 27:778–786. <a href="https://doi.org/10.1109/">https://doi.org/10.1109/</a> TEVC.2023.3252612</p><p>59. Dong J, Cheng AC, Juan D, et al (2018) DPP-Net: device-aware <a id="bookmark84"></a>progressive search for pareto-optimal neural architectures. In: Pro- ceedings of the European conference on computer vision. ECCV, pp 517–531</p><p>60. Baker B, Gupta O, NaikN, RaskarR (2017) Designing Neural Net- work Architectures using Reinforcement Learning. arXiv preprint,</p><p><a href="https://arxiv.org/abs/1611.02167">arXiv:1611.02167</a></p><p><a id="bookmark85"></a><a id="bookmark86"></a>61. Deng J, Dong W, Socher R, et al (2009) ImageNet: a large-scale hierarchical image database. In: Proceedings of the IEEE interna- tional conference on computer vision. CVPR, Miami, pp 248–255</p><p><a id="bookmark89"></a>62. Fan L, Wang H (2022) Surrogate-assisted evolutionary neural architecture search with network embedding. Complex Intell Syst.</p><p><a href="https://doi.org/10.1007/s40747-022-00929-w">https://doi.org/10.1007/s40747-022-00929-w</a></p><p><strong>Publisher’s Note </strong>Springer Nature remains neutral with regard to juris- dictional claims in published maps and institutional affiliations.</p>