Analysis: Real time super resolution using ESPCN


Super-resolution is an important image processing technique in computer vision. It has a variety of applications, such as medical imaging, satellite imaging, surveillance and security, and astronomical imaging. Real-time super-resolution is especially significant for the video streaming industry. So, let's use a Q&A format to understand one model with the potential for real-time super-resolution: the Efficient Sub-Pixel Convolutional Neural Network (ESPCN).

Comparison of super-resolution with bicubic interpolation, ESPCN and corresponding high-resolution image

What is the ESPCN model about?

The ESPCN model reconstructs a low-resolution image into a high-resolution one using a sub-pixel convolution layer at the end of the network, which reduces the computation required.

Basically, the HR (high-resolution) images from a dataset are downsampled, i.e. their spatial resolution is decreased, to produce LR (low-resolution) images. The LR images are then processed by the ESPCN model, and the output is the reconstructed SR (super-resolved) image at the specified upscale factor. For video super-resolution, the datasets used are the Xiph and Ultra Video Group databases, which consist of frames in YUV format. For image super-resolution, ImageNet, BSD500, Set14, etc. are used.
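The downsampling step can be sketched in a few lines. This is only an illustration: it uses plain r×r block averaging as a stand-in, and the function name `downsample` and the toy 4×4 image are assumptions, not the exact procedure used to build the training data.

```python
def downsample(hr, r):
    """Shrink a 2-D image (a list of rows) by an integer factor r using r x r block averaging."""
    height, width = len(hr), len(hr[0])
    if height % r or width % r:
        raise ValueError("image dimensions must be divisible by r")
    lr = []
    for i in range(0, height, r):
        row = []
        for j in range(0, width, r):
            # Average the r x r block whose top-left corner is (i, j).
            block = [hr[i + di][j + dj] for di in range(r) for dj in range(r)]
            row.append(sum(block) / (r * r))
        lr.append(row)
    return lr


# Toy 4x4 "HR" image with pixel values 0..15 (hypothetical data).
hr = [[float(4 * i + j) for j in range(4)] for i in range(4)]
lr = downsample(hr, 2)
print(lr)  # [[2.5, 4.5], [10.5, 12.5]]
```

In practice an image library would do this with a proper blur or bicubic kernel; the point is only that the LR image has 1/r of the HR height and width.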

What is the ESPCN network structure?

The input YUV LR image is processed by a 3-layer ESPCN model. The first two layers are convolution layers, which extract feature maps from the input (LR) image, and the last layer is an efficient sub-pixel convolution layer, which produces the output image at the specified upscale factor.

ESPCN model with two convolution layers for feature maps extraction,
and a sub-pixel convolution layer that aggregates the feature maps from LR space and builds the SR image in a single step.
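The rearrangement done by the sub-pixel layer (often called pixel shuffle) can be sketched as follows. This is a minimal single-channel illustration in plain Python, assuming the channel ordering r·(y mod r) + (x mod r) described in the ESPCN paper; the function name and the toy feature maps are hypothetical.

```python
def pixel_shuffle(maps, r):
    """Rearrange r*r feature maps of size H x W into one (H*r) x (W*r) image."""
    if len(maps) != r * r:
        raise ValueError("expected r*r feature maps")
    height, width = len(maps[0]), len(maps[0][0])
    out = [[0.0] * (width * r) for _ in range(height * r)]
    for y in range(height * r):
        for x in range(width * r):
            # Each output pixel picks one feature map based on its sub-pixel position.
            channel = r * (y % r) + (x % r)
            out[y][x] = maps[channel][y // r][x // r]
    return out


# Four 1x1 feature maps become one 2x2 image for r = 2 (toy values).
maps = [[[1]], [[2]], [[3]], [[4]]]
print(pixel_shuffle(maps, 2))  # [[1, 2], [3, 4]]
```

No arithmetic happens in this step: it only reorders values, which is why the upscaling itself is cheap.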

Explain the ESPCN model layers.

Say there are L layers in the network: the first L-1 layers are convolution layers and the last layer is the sub-pixel layer. Here, f is the kernel size and n is the number of filters.

So, let's consider the 3 layers shown in the figure above. The parameters for each layer turn out as follows:

  1. The first layer has (f1, n1) = (5, 64), which means there are 64 filters with a kernel size of 5×5, followed by a tanh activation.
  2. The second layer has (f2, n2) = (3, 32), which means there are 32 filters with a kernel size of 3×3, followed by a tanh activation.
  3. The third layer, which is the sub-pixel layer, has f3 = 3, which means the kernel size is 3×3. For an upscale factor of r, its convolution outputs r² feature maps, which are rearranged into a single output channel. It is then followed by a sigmoid function.
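To get a feel for how small this network is, the layer sizes above can be turned into weight counts. The sketch below assumes a single input channel (Y) and that the last convolution emits r² feature maps per output channel, as in the ESPCN paper; the helper names are hypothetical.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights plus biases of one 2-D convolution layer."""
    return kernel * kernel * in_channels * out_channels + out_channels


def espcn_params(r, channels=1):
    """Per-layer parameter counts for the 3-layer configuration (5, 64), (3, 32), f3 = 3."""
    layers = [
        (5, channels, 64),          # layer 1: 5x5 kernels, 64 filters
        (3, 64, 32),                # layer 2: 3x3 kernels, 32 filters
        (3, 32, channels * r * r),  # layer 3: 3x3 kernels, r*r maps per output channel
    ]
    return [conv_params(*layer) for layer in layers]


counts = espcn_params(r=3)
print(counts, sum(counts))  # [1664, 18464, 2601] 22729
```

Under these assumptions the whole model has only a few tens of thousands of parameters for an upscale factor of 3.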

Why are only YUV images considered?

In the YUV color space, Y denotes luminance, which carries intensity information, whereas U and V denote chrominance, which carries color information. The human eye is more sensitive to the Y channel than to the U and V channels. Also, super-resolving all three RGB channels may be slower than working in YUV.
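For readers starting from RGB data, the split into luminance and chrominance can be sketched as below. The coefficients are the common BT.601 analog YUV weights, which is an assumption here: the article does not say which YUV variant the datasets use, and `rgb_to_yuv` is a hypothetical helper.

```python
def rgb_to_yuv(r, g, b):
    """Convert one RGB pixel (components in [0, 1]) to YUV using BT.601 weights."""
    y = 0.299 * r + 0.587 * g + 0.114 * b        # luminance (intensity)
    u = -0.14713 * r - 0.28886 * g + 0.436 * b   # blue-difference chrominance
    v = 0.615 * r - 0.51499 * g - 0.10001 * b    # red-difference chrominance
    return y, u, v


# A pure white pixel has full luminance and (almost exactly) zero chrominance.
print(rgb_to_yuv(1.0, 1.0, 1.0))
```

Any gray pixel behaves the same way: all of its information sits in Y, which is one reason it makes sense to spend the network's effort on that channel.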

How is super-resolution done on YUV images?

The input YUV image is split into its three channels: Y, U and V. The Y channel is super-resolved using ESPCN, whereas the U and V channels are upscaled using bicubic interpolation. After that, the channels are resized to match and merged to get the output SR YUV image.
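The flow of this step can be sketched as follows. Both upscalers are passed in as functions; in the example, a simple nearest-neighbour repeat stands in for both the ESPCN model and bicubic interpolation, purely so the sketch runs on its own. All names are hypothetical, and the three planes are assumed to start at the same size.

```python
def upscale_nearest(plane, r):
    """Stand-in upscaler: repeat every pixel r times along both axes."""
    return [[value for value in row for _ in range(r)]
            for row in plane for _ in range(r)]


def super_resolve_yuv(y, u, v, r, sr_model, interpolate):
    """Upscale Y with sr_model and U, V with interpolate, then merge the planes."""
    y_sr = sr_model(y, r)
    u_sr = interpolate(u, r)
    v_sr = interpolate(v, r)
    sizes = {(len(p), len(p[0])) for p in (y_sr, u_sr, v_sr)}
    if len(sizes) != 1:
        raise ValueError("upscaled planes must share one size before merging")
    # Merge: one (Y, U, V) tuple per output pixel.
    return [list(zip(*rows)) for rows in zip(y_sr, u_sr, v_sr)]


# Toy 1x2 planes, upscale factor 2; nearest-neighbour replaces both real upscalers.
merged = super_resolve_yuv([[1, 2]], [[3, 4]], [[5, 6]], 2,
                           sr_model=upscale_nearest, interpolate=upscale_nearest)
print(merged[0])  # [(1, 3, 5), (1, 3, 5), (2, 4, 6), (2, 4, 6)]
```

In a real pipeline, `sr_model` would be the trained network and `interpolate` a bicubic resize from an image library.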

What problem does the ESPCN model solve?

Traditional SR convolutional neural networks (CNNs) first upscale the low-resolution (LR) image and then perform convolutions on it to get the high-resolution (HR) image. As a result, every convolution takes place on the upscaled image, which requires more computation, and the fixed upscaling step makes these methods sub-optimal. ESPCN instead performs its convolutions in LR space and upscales only in the last layer.
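The extra cost of convolving after upscaling is easy to quantify with a back-of-the-envelope count of multiply-accumulate operations (MACs). The sketch below assumes "same" padding, so each layer produces one output value per input pixel per filter; the frame size and the helper name are hypothetical.

```python
def conv_macs(height, width, kernel, in_channels, out_channels):
    """Multiply-accumulates for one convolution layer with 'same' padding."""
    return height * width * kernel * kernel * in_channels * out_channels


lr_height, lr_width, r = 360, 640, 3  # hypothetical LR frame and upscale factor

# The same 5x5, 64-filter layer applied in LR space vs. on the upscaled image.
macs_lr = conv_macs(lr_height, lr_width, 5, 1, 64)
macs_hr = conv_macs(lr_height * r, lr_width * r, 5, 1, 64)
print(macs_hr // macs_lr)  # 9
```

Every layer that runs on the upscaled image costs r² times more than the same layer in LR space, which is the saving ESPCN targets.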

What GPU is used for this process in the paper?

In the ESPCN paper, the authors used an NVIDIA K2 GPU. Training took around three hours on 91 images and seven days on images from ImageNet, for an upscaling factor of 3.

How good are the results of the ESPCN network?

The mean PSNR (dB) with relu activation.
PSNR(dB) for BSD500
PSNR(dB) for Xiph Database

From the tables above:

For image super-resolution, the PSNR values are in the range 26–30 dB and SSIM is in the range 0.8–0.9.

For video super-resolution, the PSNR values are in the range 25–31 dB and SSIM is in the range 0.8–0.9.
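For reference, PSNR is computed from the mean squared error (MSE) between the reference image and the reconstruction as 10·log10(MAX² / MSE). The sketch below assumes 8-bit images (MAX = 255) stored as lists of rows; the function name and the toy data are hypothetical.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2-D images."""
    errors = [(a - b) ** 2
              for ref_row, est_row in zip(reference, estimate)
              for a, b in zip(ref_row, est_row)]
    mse = sum(errors) / len(errors)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


# Every pixel is off by exactly 1 gray level, so MSE = 1.
print(round(psnr([[10, 20], [30, 40]], [[11, 21], [31, 41]]), 2))  # 48.13
```

Higher is better: a range of 26–30 dB corresponds to a much larger per-pixel error than in this toy case.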

How good is the inference time of the ESPCN model?

Inference time depends on the upscale factor: as the upscale factor increases, inference time decreases, since for a fixed output resolution a larger factor means a smaller LR input to convolve.

On the Xiph and Ultra Video Group databases, the inference time is around 0.038 s per frame for an upscale factor of 3 and around 0.029 s per frame for an upscale factor of 4. This corresponds to roughly 26 to 34 frames per second (fps).

For image super-resolution, super-resolving a single image from the Set14 dataset took around 0.0047 s.
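The frame rate is simply the reciprocal of the per-frame inference time, which can be checked with the two timings quoted above (the helper name is hypothetical):

```python
def frames_per_second(seconds_per_frame):
    """Frame rate implied by a per-frame inference time."""
    return 1.0 / seconds_per_frame


print(round(frames_per_second(0.038), 1))  # 26.3 (upscale factor 3)
print(round(frames_per_second(0.029), 1))  # 34.5 (upscale factor 4)
```

Both values are at or above typical video frame rates of around 24 to 30 fps.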

Comparison of the speed of different SR models with ESPCN

Up to what resolution can the ESPCN model super-resolve?

The ESPCN model can super-resolve an image frame up to 1080p. For example, if you input a 360p frame with an upscale factor of 3, the output will have 1080p resolution.

What is the overall conclusion about the ESPCN network model?

So, ESPCN turns out to be a great network in terms of low computation cost. It is also useful for video streaming thanks to its low inference time, and it has the potential to super-resolve HD videos in real time with good quality. However, after processing through ESPCN, some scene content from the video frames can disappear, leading to data loss. This problem is addressed by VESPCN, an upgraded version of ESPCN.
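The relation between input and output resolution is a plain multiplication by the upscale factor. The sketch below assumes a 16:9 frame of 640×360 pixels for "360p"; the helper name is hypothetical.

```python
def output_resolution(width, height, r):
    """Output frame size after upscaling both dimensions by an integer factor r."""
    return width * r, height * r


print(output_resolution(640, 360, 3))  # (1920, 1080)
```

So a 360p input reaches 1080p with an upscale factor of 3, while a factor of 4 would need only a 480×270 input to reach the same output size.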


[1] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, Zehan Wang. (2016) Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network

[2] Andrew Aitken, Christian Ledig, Lucas Theis, Jose Caballero, Zehan Wang, Wenzhe Shi. (2017) Checkerboard artifact-free sub-pixel convolution: A note on sub-pixel convolution, resize convolution and convolution resize

[3] Jose Caballero, Christian Ledig, Andrew Aitken, Alejandro Acosta, Johannes Totz, Zehan Wang, Wenzhe Shi. (2017) Real-Time Video Super-Resolution with Spatio-Temporal Networks and Motion



For more such content, follow my page on Instagram: Two-Ards AI




Jaivanti Dhokey
