Abstract:
Optimising computing times of applications is an increasingly important task in many different areas such as scientific and industrial applications. Graphics processing unit (GPU) is considered as one of the powerful engines for computationally demanding applications since it proposes a highly parallel architecture. In this context, the authors introduce an algorithm to optimise the computing time of feature extraction methods for the colour image. They choose generalised Fourier descriptor (GFD) and generalised colour Fourier descriptor (GCFD) models, as a method to extract the image feature for various applications such as colour object recognition in real-time or image retrieval. They compare the computing time experimental results on central processing unit and GPU. They also present a case study of these experimental results descriptors using two platforms: a NVIDIA GeForce GT525M and a NVIDIA GeForce GTX480. Their experimental results demonstrate that the execution time can considerably be reduced until 34× for GFD and 56× for GCFD.
1 Introduction
Feature extraction and object recognition are subjects of extensive research in the field of image processing. Colour object recognition is widely used in the machine vision industry in real-time applications. The main issue is the recognition of objects rapidly to simplify the classification process. Unfortunately, an extracting feature of invariant descriptors on different sizes of the object remains a crucial challenge. In fact, this treatment often consumes the most important part of the computation time of the recognition process. We, therefore, focused on the acceleration of this block computation by using graphics processing units (GPUs).
Many-cores devices as GPUs have gained popularity in the past years. NVIDIA and AMD manufacture have invested widely in this category. The important distinction between these processors and traditional multi-core processors is that these devices provide a large number of low-overhead hardware threads with low-overhead context switching between them [1]. The GPU technology maybe advantageous for running experiments demanding high computing time such as embarrassingly parallel computing problems in different fields. In fact, GPU is used as a co-processor to accelerate applications running on the central processing unit (CPU) by discharging some of the compute-intensive and time-consuming portions of the code. The GPU part is using the massively parallel processing power of this device to boost performance. This is called the heterogeneous or the hybrid computing architecture. We can note two different ways in which GPUs usually contribute to the overall computation. The first is to carry out some specific tasks through user-designed kernels and the second is to execute some data-parallel primitives provided by off-the-shelf GPUaccelerated libraries such as (e.g. CUDPP, CUDNN, CUFFT, CUBLAS etc.) [2].
Many researchers have attempted to accelerate the Fourier descriptor (FD) computing, which is used for object recognition, classification and image retrieval domain. Smach et al. [3] have accelerated for the first time the FD computation based on hardware/software co-design using field programmable gate array (FPGA) technology. The image acquisition and the final classification using SVM are executed in software while the feature vector computation is implemented in hardware.
In [4], Batard et al. have proposed a definition of the Fourier transform for the L 2 functions. They have demonstrated that the previous generalisations for a colour image, where three is the number of colour channels, are particular cases of their proposal. Owing to the domain of our application (colour image processing), only the Clifford–Fourier transform defined from morphisms is used.
Mennesson et al. [5] have defined new generalised colour FD (GCFD) from the Clifford–Fourier transform and the generalised FD (GFD). They obtained better recognition rates with GCFD than those with the marginal approach for many choices of bi-vector B. Their computational costs are still low. This method was implemented purely on software using MATLAB and C.
Mohamed et al. [6] have implemented this GCFD based on codesign using, FPGA. They used the same partitioning of the system used by Smach et al. [3], but for two different architectures of hardware part. They obtained better execution time than those in [3].
In our previous works [7], we have implemented the twodimensional (2D) fast Fourier transform (FFT) [8] using many-core NVIDIA GPU architecture CUDA. The parallel version was running more than 24 times faster than a sequential version on a CPU. We extended this work in [9], to analyse the GFD and GCFD by implementing the FFT and shifting kernel under GPU, whereas the computing of FD algorithm under CPU. The speed-up of the GPU version of GFD has reached an improvement of around 9.7 times and up to 12 times for GCFD models.
In this paper, we tried to optimise the computing time of FD algorithm on GPU. We present a method to compute the different FD blocks (2D FFT, shift the FFT image and calculating FD) both on the GPU and the CPU. These computations can be performed faster than on CPU, mainly because of the high parallelism allowed by the internal structure of the GPU. Thus, we took into account some of CUDA programming optimisation such as variation in the number of threads per block and parallelising as maximum the ‘for’ loop to run simultaneously on GPU cores [10].
Here, we tried to compare the computing time of two FD models on two different platforms. We implemented these models first, on purely CPU software, then on GPU.
This paper is organised as follows. Section 2 presents an overview of FDs models. In Section 3, we describe the design of a parallel algorithm of FD. Section 4 affords case studies and experimental results of the FD models. Finally, Section 5 concludes this paper.
2 FD: GFD and GCFD
Development of storage transmission and acquisition systems induced to create great images databases. The diversity of these databases (medicals, objects and faces image) makes image representation a dynamic domain of research. Representing image by its contents is not the only way to characterise it. For all images, we can extract different attributes using some mathematical tool. FD is one of the tools that represent all information of an image in a coefficients vector. We quote two models of FD whose difference is linked to the processing image method: GFD and GCFD. These models use FFT to extract different features of an image. Those features were represented in coefficients vector that will be used as an input for learning machine to identify their class (SVM, neuron network etc.).
2.1 Fast FT
Fourier transform is the analogous of theoretical Fourier series for non-periodic functions. It is an operator that transforms an integral function to another function giving its frequency spectrum. In the Fourier domain, each point represents a particular frequency contained in the spatial domain. Fourier transform is used in a wide range of applications such as image analysis, image filtering, image reconstruction and image compression. The Fourier transform of spatial function f (x) results in a frequency function as given in (1). Its inverse transform showed in (2), where ‘v’ is a frequency variable, ‘x’ is a spatial variable and ‘i’ is a complex number
For some specialty, these equations can be recovered without losing information. They can be written in two variables u and v as in (3). Its inverse transform showed in (4)
This transform is an important image processing tool which is used to decompose an image into its sine and cosine components. The output of this transform represents the image in the frequency domain (also called Fourier domain), whereas the input image is the equivalent spatial domain. For a digital image processing, FD used discrete Fourier transform (DFT) which is the basis of most digital images processing. The DFT and inverse DFT can be written as follows:
The intensity of a digital image f (x,y) is a spatial function defined in R 2 function will be transformed into R 2 frequency function using DFT [as in (7)] as well as its inverse transform as in (8)
In (7), the DFT is used with a summation operator for the discrete function. However, in (3), it is used with an integral operator for the continuous function. DFT is a sampled Fourier transform; therefore, it does not contain all frequencies forming an image. Only a set of samples is large enough to describe the spatial domain image fully. There are various forms of FFT, but most of them restrict the size of the input image that maybe transformed. It is used to access the geometric characteristics of a spatial domain image. In most implementations, the Fourier image is shifted in such a way that the output image F(0, 0) is displayed in the centre. The FFT is based on the complex DFT, a developed model of the real DFT. The complexity of FFT varies on O[n log (n)]. However, DFT algorithms varied on O(n 2 ) with n as the number of points to transform. For n = 1024, the execution time of FFT can be 100 times faster than DFT algorithms.
2.2 Fourier descriptor
The descriptor includes all the information of the object in the image and keeps the invariance characters when applying the geometric, noise and lighting transform. It can be in a global model to represent all pixels in the image or in a local model where it shows a region of interest or interesting points. We can cite some models of descriptor such as a colour descriptors, shape descriptors and texture descriptors. We are interested in shape descriptors model as an application domain. In this context, we studied the FD for the recognition and classification of images. It uses the FFT to extract all information of a colour image. The data represented in a vector which will be utilised as input to the classifier to identify its class. FDs can be used in various applications such as object recognition and forms, tracking of objects or person and real-time image recovery. We can describe the FD steps in a few lines. Initially, divide the colour image into sub-plans. Then transform these plans using FFT. Afterwards, group the FFT images blocks and calculate the square modules of these resulting images. Finally, apply the suitable algorithm to determine the FD.
2.2.1 Generalised FD: Historically, GFD is the first model of FD used as feature vector components in colour object recognition or image retrieval. This model computes the descriptor on each colour channel of an image separately. It uses the FFT to extract the feature of the colour channel of image.
The image intensity relations were defined in Cartesian coordinates, but Zhang and Lul [11] have used them in two dimensions polar coordinates as in (9)
To use the full information of the colour object, one of the most conventional methods of GFD is to calculate the FD for the three planes red, green and blue (RGB) separately. These descriptors will be concatenated to build a single descriptor feature. This approach has been successfully applied to the handle Fourier generalised by Smach [1]. It has defined some FD invariant generalised to all digital images in colour. This following relation notes these invariants:
The integral is replaced by a finite sum in the discrete domain to produce the FD. In fact, we summarise the computation of descriptors in four main steps. First, decompose a colour image into three separate channels images. Then, apply the FFT algorithm and compute the characterising descriptor for each channel. Finally, the resulting vectors will be concatenated into a single descriptor to construct the descriptor of a colour image (see Fig. 1a) to be used as an input to the classifier.
2.2.2 Generalised colour FD: In this section, we present the second model of FD: GCFD. It can minimise the loss of data compared with GFD. This descriptor is based on CFT, the Fourier transform defined in Clifford algebra. For GCFD, we work in two channels instead of three channels (see Fig. 1b). It consists of decomposing a colour image into parallel part according to the bivector B and another orthogonal part according to I4B using CFT [5]. Like the previous model, it should apply the FFT algorithm for both channels and compute the characterising descriptor for each plan. Finally, the invariant noted by GCFD that it could be defined for the two vectors GCFD ∥ B and GCFD⊥B as follows:
The resulting descriptor is the concatenation of parallel and orthogonal parts of the descriptor. The size of the FD vector is reduced compared with the marginal method from 2 × m instead of 3 × m of GFD (m: the number of coefficients of one channel descriptor). For example, we can present an image of 128 × 128 resolution by 64 × 3 coefficients vector for GFD and 64 × 2 for GCFD instead of 1282 colour pixels.
The algorithmic complexity of CFT is O[n log (n)]; indeed, it requires the compute of eight projections [in O(n)] and two FFT [in O(n log (n))]. The GFD provides a time complexity corresponding to O[n log (n)] since it mainly uses the FFT function instead of DFT. While GCFD has the same complexity, but it provides a gain over GFD because the Clifford–Fourier transform requires only two Fourier transforms, instead of three for the marginal method.
3 Methods and algorithms
Our proposed algorithm consists of accelerating the process of FD of these previous models. We showed in Fig. 2 the designed parallel algorithm of FD that explains all transactions from CPU to GPU and inversely. The proposed algorithm is composed of three major stages. In the first stage, a colour image is generated to CPU that will be split according to the FD model. In the next stage, FFT images are computed and shifted separately. Each FFT images are
passed to the FD algorithm to extract the feature vector. Finally, the
resulting vector is normalised and concatenated for displaying.
3.1 Image acquisition and preprocessing
Currently, we use RGB colour images located in CPU for image acquisition and preprocessing stage. We read the colour image from the database to be ready for the preprocessing. Then, we split the image depending on the choice of the FD model. In the first pick, we work on the GFD model by decomposing the image into RGB plans. However, in the second pick, we work on the GCFD model, and we use the CFT transform to split the image into two plans: one in a parallel way and the second in an orthogonal way with the bi-vector. For example in Fig. 1, we show the GCFD model using a B bi-vector for the parallel component; however, a combination of R and G according to CFT transform for the orthogonal component. Therefore, we prepare the image plans to provide the feature extraction on GPU.
Algorithm 1 (see Fig. 3) shows the pseudo-code of the designing parallel algorithm of FD for the CPU portion. It starts with the declaration of algorithm variables, one to receive the image and another to store the result of FD. Before split image, it should choose one of FD model (see Fig. 1) to perform the adequate preprocessing of the colour image receipt. After splitting, each image plan accesses to its suitable variable. Thereafter, the remaining steps are the same for each image plan for both FD models. One of the most important tasks of the hybrid computing is to transfer data between CPU and GPU because it should respect the same kind of variable and the same size of memory allocated to obtain accurate results. Transfer the image plan to GPU is the next step, FD variable is also transferred and retained as a computing result at the end of this algorithm. Up to now the acquisition, the preprocessing and the transfer of data are realised. Next, the first state of computing is to launch the CUFFT kernel where resides the computing of FFT on GPU for the image plan. The result of this transform is not organised, that is why we call the FFT shift kernel from GPU which consists to reorganise and to compute the square module the FFT image. The last kernel is called from GPU to compute the FD vector using its specific algorithm, it will be clarified in Section 4.3. At the end of this algorithm, the resulting vector of the single image plan is transferred to CPU. However, before concatenating all vectors to display the FD, it shall normalise each image plan vector.
The proposed algorithm can be used for GFD and GCFD, the only difference is the appropriate number of image plans of the FD model. In addition, the kernels called from GPU, which will be explained in the next section, are linked between them. Therefore, the output data from CUFFT kernel is the input of FFT shift kernel and the same for FD kernel.
3.2 Feature extraction
At this stage, we detail the process that extracts the image feature to build the FD of a colour image. In fact, it is composed of three steps as shown in Fig. 2, all of them are executed on GPU. CUFFT is the function directly responsible for FFT transform in two dimensions on GPU [12]. Therefore, if the image plan is ready on the device, CUFFT applies the FFT transform to provide the magnitude form of frequency image. Owing to the unorganised form of the resulting image, the shifting and square module steps consist of reorganising the order of image blocks. Then, it computes the square module of the shifted image plan to complete this stage with the FD kernel.
Algorithm 2 (see Fig. 4) and Algorithm 3 (see Fig. 5) are the GPU portions of the designing parallel algorithm of FD. They describe the second and the third steps of feature extraction stage. Algorithm 2 shows the pseudo-code of the FFT shift and square module kernel that realises the exchange of the image blocks. At first, it declares the input variable where it receives the FFT image form; an auxiliary variable is required to exchange the image blocks for the shifting step and an output variable to return the resulting form of image plan of this kernel. Then, the input image plan will be divided into four equivalent blocks. Phase 1 of Algorithm 2 explains how the exchange between the first and the third blocks are made. This process is repeated for the second and the fourth blocks of the image plan. Therefore, the shifting step consists of reorganising the image blocks such that the most feature data of FFT image appear in the centre. Phase 2 computes the square module of the shifted image in order to return an organized form of image plan permitting to extract an accurate FD which is the subject of Algorithm 3.
Algorithm 3 translates the method of computing the FD of image plan after the previous treatment. It describes the last step of feature extraction stage that is FD kernel. As explained in [3], this algorithm extracts the feature vector of the image from its pixels. The first coefficient of this vector is the intensity of the pixel in the middle. Then, the second is the sum of the pixels in around of the middle, until the latest set of pixels of the image. Thereafter, this algorithm returns the FD vector that represents the whole data of the image. At the end of this stage, all FD of each image plans will be sent to CPU to finish the last stage of our designing parallel algorithm.
3.3 Display FD
The following stage is combining different results received from GPU to build the FD of our original image. However, before that we need to normalise each image plan vector with the first coefficient, and then we concatenate all vectors to present and display the FD. To conclude this section, we realised an algorithm which transforms the colour image into FD by computing the main stage through GPU. Hence, our contribution consists of implementing the feature extraction stage on GPU in order to accelerate the execution time of FD models using CUDA. In the next sections, we can check the performance of our contribution in terms of execution time and the speed-up factor between CPU and GPU.
4 Case studies and experimental results
In this section, we evaluate the parallel algorithm of FD on GPU using CUDA and compare it with the traditional algorithms on CPU. First, we define the materials used, then we present and discuss our results.
4.1 Hardware selection
The experiments were carried out on both different personal computers: the first one with a graphic card NVIDIA GeForce GT525M and an Intel Core I3-2350M 2.30 GHz CPU with 4 GB memory, but the second with NVIDIA GeForce GTX480 and an Intel Core I7-3770 3.40 GHz CPU with 8 GB memory. For a reasonable comparison, the GPU software was coded in the same programming language with CUDA Toolkits version 6.5. The CPU software was created in Visual C++ 2010. Table 1 summarises the hardware used in this paper.
4.2 Impact of block size variation
We chose to implement the FD algorithm just for a grey-scale image using the GT525M graphics card to see the influence of the block size variation in terms of execution times. Fig. 6 depicts performance comparison with different image sizes. We tested the images from 256 × 256 until 2048 × 2048 resolutions and from 1 to 1024 threads per block. The X-axis represents the variation of the number of threads per block, but the Y-axis includes the execution times in milliseconds. For 2048 × 2048 image size, we obtained the execution time just from the 32 thread per block and for 1024 × 1024 from four threads per block. These exceptions are due to the configuration limits of our equipment.
The results show that the time performances are improved by increasing the number of thread per block. The execution time obtained for 256 × 256 resolutions does not give a clear idea since all values are, very low, about some milliseconds. Therefore, we cannot confirm the best block size. However, for the other lines, we can note the decrease of the execution time for each images size when we increase the block size until 256 threads per block. After that, for 512 and 1024 threads per block, the time increases again for almost all resolutions of images. It is clear that the minimum peak of execution time at 256 threads per block. Consequently, we decided to fix this block size for all kernel implementation on GPU. The number of blocks used for this paper depends on the number of threads per block and the image size. Therefore, the number of blocks is equivalent to: num_block = image_size/ (4*threads_block).
Also very helpful is the CUDA Occupancy Calculator [13], a spreadsheet – able to determine how many blocks of threads may run in parallel on a specific GPU, depending on the number of resources (registers and shared memory) the threads use. With this tool, one can estimate how well a specific implementation makes use of the capacity of the GPU. It offers a nice way to estimate the number of threads that will be able to run at the same time. We used this tool to tune the suitable kernel launch parameters (threads and blocks) and subsequently to confirm our findings.
We empirically determined the optimal configuration using the CUDA Occupancy Calculator. When we enter the number of threads per thread block, the registers used per thread, and the total shared memory used per thread block in bytes, we found a configuration consisting of 256 threads per block. This configuration requires 29 registers per thread and 68 B of shared memory per block.
4.3 GFD implementation
In this section, we tried to evaluate the GFD on two different platforms. The first one is a GT525M GPU and the second one is a GTX480 GPU. For each platform, we compared the execution time of both FD models on CPU, and on GPU. For all experimental results, the software times are evaluated using the predefined function of time library under C++ as QueryPerformanceCounter (&depart) and QueryPerformanceCounter (&fin). The GPU time is assessed using the NVIDIA Compute Visual Profiler of CUDA Toolkit 6.5 [14].
In Table 2, we can observe the GFD execution times for different image sizes on the GT525M platform including computational and communication times. It is interesting to note the significant difference between GPU and CPU execution times.
Table 3 shows the execution times for the second platform which is GTX480 GPU. At first sight, we note that the times are too lower than those obtained using the first platform. The high number of cores, of SM, and the memory bandwidth of GTX480 explain the performance of the obtained experimental results compared with those of GT525M. Similar findings are valid for the difference between CPU and GPU execution times. Indeed, GPU reduces times compared with CPU up to nine times at least. By against, the speed-up factor of Table 3 behaves differently than the previous table. It increases at the first images size until 256 × 256,then it begins the decrease for the big size of images.
For GT525M, we note that the speed-up first rise and then fall; this is due to the communication time especially for large images, since the memory bandwidth is only 28.8 GB/s. Whereas when using GTX480 the behaviour of speed-up changes, it reaches a pic for medium image size (256 × 256) because the bandwidth on these platforms is larger and it reaches 177 GB/s.
In Fig. 7a, we focused on the speed-up by using GT525M graphic card. We can note that the speed-up factor for most of the image sizes is significant. It can reach 39 times between CPU and GPU algorithms.
In the same way, Fig. 7b shows the speed-up factor comparing GTX480 GPU and its appropriate CPU. Besides, we found that GPU improved enormously the execution times of application. The speed-up compared with CPU can reach up 26 times.
4.4 GCFD implementation
In this section, we are interested in the GCFD by studying its performance and comparing the experimental implementation results under different platforms. According to the tables above, we can mention that the GPU algorithm boosts the performance comparing with CPU. In Table 4, we summarise a full experimental result of GCFD implementation under GT525M. We note an enormous gain in execution time between GPU and CPU algorithms. For the speed-up factor, the results showed in Table 4 have the same behaviour as Table 2, it provides a very high factor with the smallest image size, then it decreases until it reaches around to seven times.
Table 5 shows the importance of hardware materials to reduce the execution time for an application.
As in the third table, the speed-up factor increases for the images size
images size /???and then it decreases to achieve 12 times factor for the biggest image size. Here, we combine the efficient material which is GTX480 GPU and GCFD to achieve a speed-up more than 37 times.
The decrease of the speed-up showed from Tables 2–5 was likely due to configurations limits of both graphics cards. For GFD, we implemented the parallel algorithm of FD in three times and for GCFD in two times relative to the number of colour image plan. Despite the decline in speed-up for large image sizes, we still find a significant acceleration factor.
Least but not last, we represent below the bar charts of speed-up of GCFD indicating the GPU performance comparing with CPU. In fact for GT525M (see Fig. 8a), the speed-up factor between optimal GPU and CPU algorithms reached 55. Even for GTX 480
(see Fig. 8b), about 37 times of speed-up between GPU and CPU.
4.5 Related work
There have been some published works that deal with the GFD or CGFD optimisation. All of these effort, however, are targeted toward DSP or FPGA platforms. GFD-related GPGPU applications, on the other hand, are rare. Smach [15] has implemented the GFD using FPGA Virtex-II 3000K, and he obtained an execution time of 2.24 ms for a 128 × 128 image size against 0.86 and 0.52 ms obtained, respectively, with our two graphics cards GT525M GPU and GTX480 GPU for the same image size.
Another comparison can be made between the implementation of Mohamed et al. [6] and ours concerning the GCFD. Mohamed et al. have implemented this descriptor using FPGA Spartan 3A DSP-XC3SD3400A using two different architectures. They obtained an execution time of 0.718 ms for a 128 × 128 image size using the first architecture and for the second one they obtained 0.704 ms. However, when using GT525M GPU and GTX 480 GPU, we obtained, respectively, 0.61 and 0.34 ms. Hence, we can conclude that our implementations on GPUs are more efficient than others on FPGAs.
5 Conclusion
In this paper, we presented an extensive case study of FD programming. This work has been accentuated by applying both models of FD: GFD and GCFD, under CPU and GPU systems. We also tested our algorithms on two platforms: once containing an Intel I3-2350M CPU and a GT525M GPU. However, the second has an Intel I7-3770 CPU and a GTX480 GPU. In each case, we varied the images resolution between 64 × 64 and 2048 × 2048. We should not forget that we studied the impact of varying the block size to find the best execution time. Thus, we chose to work with the most optimal number of thread per block for all applications. Also, we have achieved a significant performance in terms of execution time on GPU. Our results indicate that the GPU algorithm may improve the overall application performance in many cases compared with a purely CPU algorithm. Our implementation attains much faster execution times, more than six and sometimes even beyond seven orders at least of time speed-up factors compared with the CPU.
6 References