Intra-frame Based Video Compression Using Deep Convolutional Neural Network (DCNN)

Abstract— In principle, a video codec is built by implementing various algorithms and their developments. The next generation of codecs involves more artificial intelligence applications. DCNN (Deep Convolutional Neural Network) is a multi-layer neural network concept based on the deep learning approach in the field of artificial intelligence. This study proposes a DCNN with three hidden layers for intra-frame-based video compression. The DCT and fractal methods were used to compare the performance of the proposed method. The training image (obtained from the average of all down-sampled frames) is divided into several square blocks using a square block shift operation until all parts of the image are covered. All pixels in each block act as an input data pattern. After the training process, the trained DCNN was used to construct the feature and sub-feature images, obtained through a max function operation on the feature bank and sub-feature bank. These feature and sub-feature images then serve as a spatial redundancy minimizer with specific manipulation techniques and simultaneously as a quantizer, without converting the frame's pixels to a bit-stream. The result of this process is a compressed image. Experiments on the entire dataset resulted in an AAPR (Average Approximate Performance Ratio) of 147.71%, i.e., on average about 1.5 times better than the other methods. For further studies, the performance of the proposed DCNN can be improved by modifying its structure so that it outputs the feature and sub-feature images directly. Another option is to combine it with the DCT or fractal method to improve the performance of the result.


I. INTRODUCTION
Various needs for image and video capture lead to ever-increasing amounts of data, which inevitably require methods and techniques to reduce the amount of data sent or stored. The application of these methods and techniques is classified as compression technology. Video coding standards, over the years, have always faced the same core problem: how to reduce the size of the video data as much as possible from the original video to the compressed video that is stored or transmitted [1]. Video data contains a high degree of redundancy. The pixels within a frame often repeat or are similar to adjacent pixels, and the correlation between these pixels is known as spatial redundancy. Sequential frames in a video are usually similar, and the correlation between successive frames is referred to as temporal redundancy. In principle, video compression reduces the number of video data bits by encoding the information while eliminating spatial or temporal redundancy [2].
Video compression reduces the data used to encode digital video content. This reduction is intended to meet smaller storage and lower transmission bandwidth requirements for video content. Compressing and decompressing video requires a codec (encoder-decoder). An encoder is used to compress video data at a certain target bit rate, while the decoder decompresses the video signal to make it similar to the original [3]. Several video compression standards have been developed since the 1990s. AVI (Audio Video Interleave) is one of the oldest video formats, created by Microsoft in 1992. AVI files are usually created without compression, resulting in large file sizes, and are often used for recording before conversion to other formats. The International Telecommunication Union (ITU) and the International Organization for Standardization (ISO) have developed a family of video compression standards under MPEG (Moving Picture Experts Group). MPEG-1 was the first MPEG standard, finalized in 1992 and widely used for video CDs [4]. The second generation, MPEG-2, was completed in 1995 and is widely used for DVD and digital TV broadcasting [5]. MPEG-4 Part 10, Advanced Video Coding (AVC/H.264), was completed in 2003 and is widely used for HDTV and IP-based video services [6]. MPEG-H High-Efficiency Video Coding (HEVC/H.265) was completed in 2013 and is widely used for HDR video applications [7].
There are two main video compression classes: lossy and lossless. Lossy compression permanently eliminates data redundancy, especially in perceptual coding based on human color perception. This method allows compressing files to a smaller size or lower bit rate; however, it affects the quality of the image or video when it is decompressed. In contrast, lossless compression eliminates data redundancy without affecting quality: the process maintains data integrity, and the data can be completely decompressed. Unfortunately, lossless compression does not significantly reduce the number of video data bits [8]. There are several common approaches to video compression. Inter-frame-based video compression eliminates the temporal redundancy of consecutive frames; several studies applying inter-frame-based video compression have been conducted in [9]-[13]. Intra-frame-based video compression eliminates spatial redundancy within individual frames [14]; several studies applying intra-frame-based video compression have been conducted in [14]-[19]. Block-based video compression is a combination of the two approaches: video frames are grouped into coding blocks for prediction, transformation, quantization, and encoding. The first frame of each block is predicted and coded using the intra-frame-based concept; the intra-frame and inter-frame-based concepts are then applied to the remaining frames [20]. Several studies applying block-based video compression have been conducted in [20]-[23].
In principle, codecs are built by implementing various algorithms and their development. The next generation of codecs involves more artificial intelligence and smart applications. Deep learning, as one of the latest developments in the field of artificial intelligence, has been widely used in various studies on video compression [9], [12], [15], [24]- [27].
Deep learning is a machine learning method that captures the details of the learning process by composing various mathematical functions. The goal is to obtain more abstract, more multi-level, and more complex data features. Deep learning is a sophisticated development of the multi-layer ANN concept [28]. A DNN (Deep Neural Network) is a multi-layer neural network with more than three layers. A DNN's ability to solve problems increases as more layers are used, and it can consist of various layer types (fully connected, convolutional, autoencoder, min/max, dropout, SoftMax, recurrent, etc.) [29]. A Convolutional Neural Network (CNN) is a type of ANN that contains convolutional layers. Each neuron in a convolutional layer is usually connected to only a few input neurons, reducing the computational complexity and the number of parameters. Neurons in this layer convolve a kernel matrix with their input; the input region connected to a neuron in a convolutional layer is sometimes referred to as the neuron's visual field. Due to the inherent spatial dependence between pixels in an image or video, CNNs have proven very effective in analyzing spatially structured image and video data, and they have been widely used in various image and video processing studies [24], [30]-[37].
This study proposes intra-frame-based video compression using a DCNN. Video compression is carried out through three main stages: (a) redundancy elimination, (b) quantization, and (c) entropy coding. The DCNN is used to extract the features and sub-features of each frame. These features and sub-features later function as a spatial redundancy minimizer with specific manipulation techniques and simultaneously as a quantizer, without converting the frame's pixels to a bit-stream. The resulting performance is compared with other intra-frame-based video compression methods (DCT [38]-[40] and fractal [41]), which are commonly used for image compression.

A. Convolutional Neural Network
A CNN is one type of ANN that adopts the concept of image convolution operations. Neurons in the convolutional layer convolve a kernel matrix with their input. The convolutional kernel functions as a filter that extracts features from the input image; the kernel size and values can be freely selected as needed. Suppose the input image is M × M with a k × k convolutional kernel where all kernel values are 1; the convolution operation is then performed by shifting the kernel across the image, as illustrated in the accompanying figure.
The convolutional kernel overlaps the input image starting from the top-left corner. It then calculates the product between the numbers in the convolutional kernel and the input image according to their location, and sums all the resulting products to obtain one output pixel value; the (*) symbol denotes the convolution operator. The kernel is then shifted by one pixel (stride 1) to obtain the next convolution result, until all parts of the input image are covered. This concept of image convolution was adopted by the CNN, as shown in the accompanying figure.
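The kernel-shift operation described above can be sketched as follows. This is an illustrative implementation, not the paper's MATLAB code; it uses an all-ones kernel (as in the example above) so each output pixel is simply the sum of the covered block. Note that, with an asymmetric kernel, CNN libraries typically compute cross-correlation (no kernel flip), which coincides with convolution for a symmetric kernel like this one.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image with stride 1 ("valid" mode):
    at each position, multiply overlapping entries and sum them."""
    m, n = image.shape
    k, _ = kernel.shape
    out = np.zeros((m - k + 1, n - k + 1))
    for i in range(m - k + 1):
        for j in range(n - k + 1):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

# A 4x4 image convolved with a 3x3 all-ones kernel gives a 2x2 result,
# each entry being the sum of the covered 3x3 block.
img = np.arange(16, dtype=float).reshape(4, 4)
ones = np.ones((3, 3))
print(conv2d_valid(img, ones))  # -> [[45. 54.] [81. 90.]]
```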

B. The proposed Deep Convolutional Neural Network
In intra-frame-based video compression, compressing each frame is like compressing an image. This study proposes using a DCNN to perform convolution operations on each square block of pixels of a specific size. A block shift operation is performed until all parts of the input image are covered, and all pixels in a square block act as one input data pattern for the DCNN. For an image of M × N pixels processed with a k × k square block and a one-pixel shift, there are (M − k + 1) × (N − k + 1) square blocks, and hence the same number of input data patterns. The proposed DCNN uses a k × k kernel with three hidden layers, as illustrated in the accompanying figure. After the training process is complete, the DCNN is ready to build the feature image. Suppose W is the weight matrix of a DCNN layer and x is the column vector of the input layer with N neurons. The output of the layer is represented by

y = f(Wx + b) (1)

where W is the convolution matrix, b is the bias vector of the layer, f is the activation function, and y is the activated output of the layer. The composition of the mathematical functions of the proposed DCNN is based on Eq. (1). The learning process reduces the network error using a gradient descent algorithm. Each layer has an outer, inner, and local gradient, and backpropagation is carried out at each gradient point of each layer by applying the chain rule. All layer weights are updated using the gradient descent rule [42].
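A minimal sketch of the per-layer computation y = f(Wx + b) and the gradient-descent weight update. This is illustrative, not the paper's MATLAB implementation; it assumes the tanh activation that the training-strategy section names, and the learning rate value is a placeholder.

```python
import numpy as np

def layer_forward(W, x, b):
    """One DCNN layer: affine transform followed by tanh activation,
    i.e. y = f(Wx + b) as in Eq. (1)."""
    return np.tanh(W @ x + b)

def gradient_descent_update(W, grad_W, lr=0.01):
    """Gradient-descent weight update: W <- W - lr * dE/dW,
    where grad_W is the backpropagated gradient of the error."""
    return W - lr * grad_W

# With zero weights and bias, tanh(0) = 0 for every output neuron.
y = layer_forward(np.zeros((2, 3)), np.ones(3), np.zeros(2))
```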
In general, the proposed DCNN for intra-frame-based video compression is shown in the accompanying figure. Each frame's RGB image is first down-sampled to a square size using bicubic interpolation, and each component (R, G, B) is used as an input image of the proposed DCNN. The DCNN is trained so that it produces a feature bank containing eight feature images and a sub-feature bank containing 16 sub-feature images. This study uses a square block of fixed size; for an input image of M × N pixels, the size of the feature image follows accordingly. The quantization step uses the standard deviation of the feature image together with a coefficient of 0.9, which was obtained experimentally; the result is a quantized gray image of each component (e.g., the R component), which is also the compressed image.
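The paper does not fully specify the max-function operation on the banks, but a plausible reading is an element-wise max across the maps of a bank; a sketch under that assumption (the bank contents here are hypothetical):

```python
import numpy as np

def feature_image(bank):
    """Collapse a bank of feature maps (n_maps x H x W) into a single
    feature image by taking the element-wise max across the maps.
    This is an assumed reading of the paper's max function operation."""
    return np.max(np.stack(bank), axis=0)

# Two toy 2x2 maps: each output pixel keeps the larger of the two values.
bank = [np.array([[1, 5], [3, 0]]), np.array([[2, 4], [1, 6]])]
print(feature_image(bank))  # -> [[2 5] [3 6]]
```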

C. Training Strategy
All layers in the proposed DCNN use the tangent-sigmoid (tanh) activation function. Before training, the input data are normalized to zero mean and unit variance:

m_i = (x_i − μ) / σ

where x_i is the ith sampled data, N is the number of sampled data, μ is the mean of the sampled data, σ is the standard deviation of the sampled data, and m_i is the ith normalized sample. The variance between units in a layer must be close to unity to avoid correlation and to ensure the convergence of the training process. For this purpose, the weights of each layer should be initialized from a normal random distribution with zero mean.
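The normalization and activation described above, sketched in Python (illustrative, not the original MATLAB code):

```python
import numpy as np

def tansig(x):
    """Tangent-sigmoid activation used in every layer (i.e., tanh)."""
    return np.tanh(x)

def zscore(samples):
    """Normalize data to zero mean and unit variance:
    m_i = (x_i - mu) / sigma."""
    samples = np.asarray(samples, dtype=float)
    return (samples - samples.mean()) / samples.std()

# After normalization the sample mean is 0 and the standard deviation is 1.
z = zscore([2.0, 4.0, 6.0, 8.0])
```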
The DCNN requires a network error and an error function to control its training process. The network error is expressed as the SSE (Sum Squared Error):

E = Σ_{i=1}^{n} (t_i − y_i)²

where y_i is the ith net output, t_i is the corresponding target, and n is the number of training data; for the first net output, the first normalized data pattern m is used as the input. The training process is stopped when E ≤ target error, where the target error is selected as small as possible (close to zero).
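The SSE and the stopping rule can be sketched as follows; t_i here denotes the target for the ith output (an assumed name, since the source's notation is garbled), and the target-error value is a placeholder:

```python
import numpy as np

def sse(targets, outputs):
    """Sum squared error E = sum_i (t_i - y_i)^2 over all net outputs."""
    t = np.asarray(targets, dtype=float)
    y = np.asarray(outputs, dtype=float)
    return float(np.sum((t - y) ** 2))

def should_stop(error, target_error=1e-6):
    """Training halts once E <= target error (chosen close to zero)."""
    return error <= target_error

# One output off by 1 contributes 1.0 to the SSE.
e = sse([1.0, 2.0], [1.0, 1.0])  # -> 1.0
```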

The reference (training) image is the average of all down-sampled frames, computed per RGB component:

F̄_c = (1/N_f) Σ_{k=1}^{N_f} F_{c,k}

where c is the index of the RGB component (R = 1, G = 2, B = 3), F_{c,k} is component c of the kth down-sampled frame, and N_f is the number of frames.
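The per-component frame averaging that builds the training image can be sketched as follows, assuming the down-sampled frames are stacked as an N_f × H × W × 3 array (an assumed layout):

```python
import numpy as np

def reference_frame(frames):
    """Average all down-sampled frames per RGB component to build the
    single training image. frames: N_f x H x W x 3 array."""
    return np.mean(np.asarray(frames, dtype=float), axis=0)

# Two toy frames with constant values 0 and 2: every averaged pixel is 1.
frames = np.stack([np.zeros((2, 2, 3)), np.full((2, 2, 3), 2.0)])
ref = reference_frame(frames)
```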

D. Dataset
This study uses selected MMA (Mixed Martial Arts) video clips downloaded from YouTube (mp4) as a dataset. The video clips were converted into the uncompressed AVI format using a commonly used video converter application. The dataset specifications are shown in the accompanying table.

E. Performance Measurement
1) PSNR (Peak Signal-to-Noise Ratio): PSNR is the ratio between the maximum possible power of a signal and the power of the noise that affects the fidelity of its representation. In image or video compression, the noise is the error introduced by the compression process. This noise is usually expressed as the MSE (Mean Squared Error), the mean squared difference between the compression result and the original. Hence, PSNR is considered an estimate of the human perception of the reconstructed compression quality. If F and F_comp are the original and compressed frames with M × N spatial resolution in pixels and N_f is the number of frames, then for frame j = 1 … N_f:

MSE_j = (1/(3MN)) Σ_c Σ_{x,y} [F_c(x, y) − F_comp,c(x, y)]²
PSNR_j = 10 log10(MAX² / MSE_j)

where MAX is the maximum pixel value of the image (255 for an 8-bit image) and c is the index of the RGB component. The average PSNR across all frames is taken as the PSNR value of the compressed video.
Typical PSNR values in lossy image and video compression are between 30 and 50 dB for an 8-bit bit depth, where higher is better; for 16-bit data, values are usually between 60 and 80 dB [43].
2) SSIM (Structural Similarity Index Measurement): SSIM is a metric used to measure the similarity between two images. The SSIM index is a full-reference quality metric [44], meaning that image quality is predicted with the uncompressed, undistorted image as the reference. If x and y are corresponding windows of the original and compressed images, the SSIM index is stated as

SSIM(x, y) = [(2 μ_x μ_y + c1)(2 σ_xy + c2)] / [(μ_x² + μ_y² + c1)(σ_x² + σ_y² + c2)]

where μ_x and μ_y are the window means, σ_x² and σ_y² are the variances, σ_xy is the covariance, and c1 and c2 are small stabilizing constants.
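The per-frame PSNR computation described above, as an illustrative sketch (the video-level PSNR is the average of this value over all frames):

```python
import numpy as np

def psnr(original, compressed, max_val=255.0):
    """PSNR = 10*log10(MAX^2 / MSE) for one frame, in dB."""
    o = np.asarray(original, dtype=float)
    c = np.asarray(compressed, dtype=float)
    mse = np.mean((o - c) ** 2)
    if mse == 0:
        return float("inf")  # identical frames: no noise at all
    return 10.0 * np.log10(max_val ** 2 / mse)

# Worst case for 8-bit data: every pixel off by 255 gives MSE = 255^2,
# so PSNR = 10*log10(1) = 0 dB.
worst = psnr(np.zeros(4), np.full(4, 255.0))  # -> 0.0
```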

3) CR (Compression Ratio) and SS (Space Saving):
The video compression ratio (CR) is defined as the ratio between the original video size and the compressed video size, while space saving (SS) is defined as the size reduction relative to the original size. These metrics are denoted as [45]:

CR = original size / compressed size
SS = 1 − (compressed size / original size)
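These two metrics are direct to compute; an illustrative sketch:

```python
def compression_ratio(original_size, compressed_size):
    """CR = original size / compressed size."""
    return original_size / compressed_size

def space_saving(original_size, compressed_size):
    """SS = 1 - compressed size / original size."""
    return 1.0 - compressed_size / original_size

# A 100 MB video compressed to 25 MB: CR = 4.0, SS = 0.75 (75% saved).
cr = compression_ratio(100, 25)  # -> 4.0
ss = space_saving(100, 25)       # -> 0.75
```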

4) DCT (Discrete Cosine Transform):
Commonly, video compression decreases video quality: the smaller the PSNR and SSIM, the lower the quality of the compression result. A gray image read as a sequence of samples can be considered a discrete signal, and signal energy is one of a signal's essential characteristics, like a feature. The DCT is a signal transformation with good energy compaction properties: it concentrates the main energy components in only a few transformation coefficients. Suppose there is a discrete signal X of length N. The DCT of X is mathematically expressed as

X_DCT(k) = α(k) Σ_{n=0}^{N−1} X(n) cos[π(2n + 1)k / (2N)], k = 0 … N − 1

where α(0) = √(1/N), α(k) = √(2/N) for k > 0, and X_DCT(k) are the DCT coefficients of X.
The DCT coefficient of a frame is the average of the DCT coefficients of the R, G, and B components, and the average DCT coefficient over all frames is considered a feature of the video's energy. Video manipulation for various purposes changes the average absolute DCT coefficient; in this study, these changes are taken as changes in video quality. Suppose A_DCT and B_DCT are the average DCT coefficients of the original and compressed video, respectively. The percentage change in the quality of the compressed video relative to the original is represented by

Ā_abs = (1/N) Σ_j |A_DCT(j)|
ΔAB = (1/N) Σ_j (|B_DCT(j)| − |A_DCT(j)|)
ΔQ_DCT = (ΔAB / Ā_abs) × 100% (11)

where Ā_abs is the average absolute value of A_DCT, ΔAB is the average difference between the absolute values of B_DCT and A_DCT, and ΔQ_DCT is the percentage change of the absolute DCT coefficients between B_DCT and A_DCT. A positive value is considered a quality improvement, and vice versa. The illustration is shown in the accompanying figure.
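An illustrative sketch of the 1-D orthonormal DCT-II and the ΔQ_DCT-style percentage change; the symbol names are mine, since the source's notation is garbled:

```python
import numpy as np

def dct_ii(x):
    """Orthonormal DCT-II of a 1-D signal x of length N:
    X(k) = alpha(k) * sum_n x(n) cos(pi*(2n+1)*k / (2N))."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    n = np.arange(N)
    C = np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
    X = C @ x
    X[0] *= np.sqrt(1.0 / N)   # alpha(0) = sqrt(1/N)
    X[1:] *= np.sqrt(2.0 / N)  # alpha(k) = sqrt(2/N), k > 0
    return X

def quality_change(a_dct, b_dct):
    """Percent change of the mean absolute DCT coefficients between the
    original (a) and compressed (b) video; positive = quality gain."""
    a_abs = np.mean(np.abs(a_dct))
    diff = np.mean(np.abs(b_dct) - np.abs(a_dct))
    return diff / a_abs * 100.0

# A constant signal has all its energy in the DC coefficient:
coeffs = dct_ii([1.0, 1.0, 1.0, 1.0])  # -> [2, 0, 0, 0]
```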

5) APR (Approximate Performance Ratio) and AAPR (Average APR):
APR is used to compare the performance of the proposed method against the other methods on a specific performance metric. Suppose P_l is the lth performance metric of the proposed method and P_{k,l} is the same metric for the kth other method. The APR and AAPR are expressed as

APR_{k,l} = (P_l / P_{k,l}) × 100%
AAPR = (1/(K·L)) Σ_{k=1}^{K} Σ_{l=1}^{L} APR_{k,l}

where K and L are the numbers of other methods and performance metrics used, respectively.
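The source's APR formula is garbled, so the sketch below assumes one plausible reading: APR is the ratio of the proposed method's score to each other method's score (in %), and AAPR averages these ratios over all methods and metrics, which is consistent with the reported AAPR of 147.71% being described as roughly 1.5 times better. The scores used are hypothetical.

```python
import numpy as np

def aapr(proposed, others):
    """Assumed AAPR: for each other method k and metric l, take the ratio
    of the proposed method's score to that method's score (in percent),
    then average over all K methods and L metrics.
    proposed: length-L vector; others: K x L matrix."""
    proposed = np.asarray(proposed, dtype=float)
    others = np.asarray(others, dtype=float)
    ratios = proposed[None, :] / others * 100.0
    return float(np.mean(ratios))

# Hypothetical scores: the proposed method beats two baselines 1.5x
# on a single metric, giving an AAPR of 150%.
print(aapr([30.0], [[20.0], [20.0]]))  # -> 150.0
```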

III. RESULT AND DISCUSSION
The proposed DCNN training process uses MATLAB programming. The reference frame generated for each dataset, shown in the accompanying figure, was used as the input image. Each frame was then compressed using the trained DCNN. The discussion in this section uses the "MMA Elbow KO 2" video clip file. An example of the result of each stage, illustrated with the 200th frame, is shown in the accompanying figures. The results of comparing all methods for the 200th frame, and the comparison of PSNR and SSIM across all frames, are also shown in the accompanying figures.
Referring to these results, the output of the spatial redundancy elimination has a smaller correlation between pixels than the original (a 46.33% decrease), while the quantization process produces a smaller file size than the original (a 35.62% decrease).
Referring to the comparison results, the PSNR value of the proposed DCNN is greater than that of the other methods, while the SSIM values of all methods are almost the same. Although the SSIM value of the fractal method is greater than that of the proposed DCNN (in line with the fractal method's ΔQ_DCT value, which is greater than that of the proposed method), its PSNR value falls below the range expected for 8-bit images (between 30 and 50 dB). This indicates that the proposed DCNN is still much better than the other methods, which is also supported by the ΔQ_DCT value of the proposed DCNN being smaller than that of the other methods. Furthermore, the mean PSNR value of the proposed method is greater than that of the other methods, meaning that the proposed method produces better compression quality. The variance of the proposed method's PSNR is also greater than that of the other methods, showing the proposed method's greater adaptability in improving the quality of the compression results; this is in line with the variance of the proposed method's SSIM, which is smaller than that of the other methods.
A summary of the performance comparison of all methods over the entire dataset is shown in TABLE I and the accompanying table. From both tables, an AAPR of 147.71% was obtained.

IV. CONCLUSION
This study has proposed a DCNN with three hidden layers for intra-frame-based video compression. The DCT and fractal methods were used to compare the performance of the proposed method. The reference frame, the average of all frames after the down-sampling process, is used as the training input image. The training image is divided into several square blocks using a square block shift operation until all parts of the image are covered. All pixels in each block act as one input data pattern, and the number of square blocks of the training image is the number of training data for the proposed DCNN.
The trained DCNN was then used to construct the feature and sub-feature images, obtained through a max function operation on the feature bank and sub-feature bank. The feature and sub-feature images are used to minimize spatial redundancy and quantize the original image, producing a compressed image. Experiments on the entire dataset resulted in an AAPR (Average Approximate Performance Ratio) of 147.71%. For further studies, the performance of the proposed DCNN can be improved by modifying its structure so that it outputs the feature and sub-feature images directly. Another option is to combine it with the DCT or fractal method to improve the performance of the result.