Depth-wise [Separable] Convolution Explained in TensorFlow

Soroush Hashemifar
7 min read · Sep 24, 2021


Over-fitting: A common story of lazy networks

Training a deep neural network that can generalize well to new data is a challenging problem. A model with too little capacity cannot learn the problem, whereas a model with too much capacity can learn it too well and over-fit the training dataset. Both cases result in a model that does not generalize well. As IBM mentioned:

Over-fitting is a concept in data science, which occurs when a statistical model fits exactly against its training data. When this happens, the algorithm unfortunately cannot perform accurately against unseen data, defeating its purpose. Generalization of a model to new data is ultimately what allows us to use machine learning algorithms every day to make predictions and classify data.

A standard Convolution layer involves input * output * width * height weight parameters (plus one bias per output channel), where width and height are the width and height of the filter. For 10 input channels and 20 output channels with a 7 * 7 filter, this gives 10 * 20 * 7 * 7 + 20 = 9820 parameters. Having so many parameters increases the chance of over-fitting.
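This parameter count can be checked with a few lines of plain Python (the 20 extra parameters are the per-filter biases, which Keras counts by default):

```python
# Parameters of a standard convolution layer:
# in_channels * out_channels * kernel_height * kernel_width weights,
# plus one bias term per output channel.
in_channels, out_channels, k_h, k_w = 10, 20, 7, 7

weights = in_channels * out_channels * k_h * k_w  # 9800
biases = out_channels                             # 20
total = weights + biases

print(total)  # 9820
```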

Depth-wise Convolution has been introduced as a way to reduce the number of parameters of a convolutional network and, hence, to mitigate over-fitting. The remainder of the article is organized as follows: first, I illustrate how an ordinary Convolution works; then, the concept of Depth-wise Convolution is discussed; and finally, we generalize this concept to Depth-wise Separable Convolution.

Ordinary Convolution

The standard Convolution is a fairly simple operation at heart: you start with a kernel, which is simply a small matrix (or tensor) of weights. This kernel slides over the input data, performing an element-wise multiplication with the part of the input it is currently on, and then summing up the results into a single output pixel. The kernel repeats this process for every location it slides over, converting a 3D matrix of features into a 2D matrix of features. The output features are essentially the weighted sums (with the weights being the values of the kernel itself) of the input features located at roughly the same position as the output pixel.

It is worth mentioning that the part of the input the kernel is operating on (the red-colored 3D tensor in the above figure) is called the Local Receptive Field. More formally, each neuron of each layer is connected to only a region of the previous layer. That region in the previous layer is called the Local Receptive Field of the current neuron. It’s a little window on the previous layer.

Here is a practical example of an ordinary Convolution layer:

And the number of parameters and the shape of each layer are shown below:
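Such an example can be sketched with tf.keras (the 7 * 7 * 3 input size and 128 filters are illustrative choices, matching the concrete example discussed later):

```python
import tensorflow as tf

# Ordinary convolution: 128 filters of size 3x3x3 over a 7x7x3 input.
inputs = tf.keras.Input(shape=(7, 7, 3))
outputs = tf.keras.layers.Conv2D(filters=128, kernel_size=3)(inputs)
model = tf.keras.Model(inputs, outputs)

model.summary()
print(model.output_shape)    # (None, 5, 5, 128)
# Parameters: 3*3*3*128 weights + 128 biases = 3584
print(model.count_params())  # 3584
```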

Depth-wise Convolution

Depth-wise Convolution is a type of convolution where we apply a single Convolutional filter to each input channel. This type of Convolution keeps each channel separate. To summarize the steps, we:

  1. Split the input and filter into channels
  2. Convolve each input with the respective filter
  3. Stack the convolved outputs together
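The three steps above can be sketched as a naive NumPy reference implementation (not the optimized TensorFlow kernel; valid padding and stride 1 are assumed):

```python
import numpy as np

def depthwise_conv2d(image, kernels):
    """Naive depth-wise convolution.

    image:   (H, W, C) input
    kernels: (kH, kW, C) one 2D filter per input channel
    """
    h, w, c = image.shape
    kh, kw, kc = kernels.shape
    assert c == kc, "one filter per input channel"
    out = np.zeros((h - kh + 1, w - kw + 1, c))
    for ch in range(c):                       # 1. split input/filter into channels
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):     # 2. convolve each channel separately
                patch = image[i:i + kh, j:j + kw, ch]
                out[i, j, ch] = np.sum(patch * kernels[:, :, ch])
    return out                                # 3. channels stacked back along depth

image = np.random.rand(7, 7, 3)
kernels = np.random.rand(3, 3, 3)
print(depthwise_conv2d(image, kernels).shape)  # (5, 5, 3)
```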

In a Depth-wise Convolution layer, the filter size remains the same; meanwhile, this Convolution gives you three output channels with only a single 3-channel filter, whereas you would require three 3-channel filters to get three output channels with an ordinary Convolution.

For a concrete example, let’s say the input layer is of size 7 * 7 * 3 (height * width * channels), and the filter is of size 3 * 3 * 3. After the 2D Convolution with one filter, the output layer is of size 5 * 5 * 1 (with only 1 channel). Typically, multiple filters are applied between two neural-net layers. Let’s say we have 128 filters here. After applying these 128 2D Convolutions, we have 128 output maps of size 5 * 5 * 1. We then stack these maps into a single layer of size 5 * 5 * 128, thereby transforming the input layer (7 * 7 * 3) into the output layer (5 * 5 * 128). As can be seen, this type of Convolution shrinks the spatial dimensions, i.e. height and width, while extending the depth.

In contrast, let’s apply a Depth-wise Convolution to the input layer. Instead of using a single filter of size 3 * 3 * 3, we use 3 kernels separately, each of size 3 * 3 * 1. Each kernel convolves with 1 channel of the input layer (1 channel only, not all channels!), producing a feature map of size 5 * 5 * 1. We then stack these feature maps together to create a 5 * 5 * 3 output. The spatial dimensions are still shrunk, but the depth now stays the same as before.
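The shapes in this concrete example can be verified directly in tf.keras:

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(7, 7, 3))

# Ordinary convolution: 128 filters of 3x3x3 -> depth grows from 3 to 128.
ordinary = tf.keras.layers.Conv2D(128, 3)(inputs)

# Depth-wise convolution: one 3x3 kernel per input channel -> depth stays 3.
depthwise = tf.keras.layers.DepthwiseConv2D(3)(inputs)

print(ordinary.shape)   # (None, 5, 5, 128)
print(depthwise.shape)  # (None, 5, 5, 3)
```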

Here is a practical example of a Depth-wise Convolution layer:

As you can see, there is no parameter indicating the number of filters anymore in Depth-wise Convolution. The number of parameters and the shape of each layer are shown below:

In the figure above, you can see how much the number of parameters has shrunk in comparison to ordinary Convolution!
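The reduction can be checked with count_params (same illustrative 7 * 7 * 3 input as before):

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(7, 7, 3))
outputs = tf.keras.layers.DepthwiseConv2D(kernel_size=3)(inputs)
model = tf.keras.Model(inputs, outputs)

print(model.output_shape)    # (None, 5, 5, 3)
# One 3x3 kernel per input channel (3 channels) + 3 biases = 30 parameters,
# versus 3*3*3*128 + 128 = 3584 for an ordinary Conv2D with 128 filters.
print(model.count_params())  # 30
```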

Now, let’s play with a new parameter, depth_multiplier. It specifies how many Depth-wise Convolutions are performed over each channel of the input image. Traditionally this is 1, as we’ve seen in the drawing above. However, you might wish to set it to a larger value manually; note that you must account for the required resources, though.

The total number of Depth-wise Convolution output channels will be equal to input_channels * depth_multiplier.
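For example, with 3 input channels and depth_multiplier=2:

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(7, 7, 3))
outputs = tf.keras.layers.DepthwiseConv2D(
    kernel_size=3, depth_multiplier=2)(inputs)
model = tf.keras.Model(inputs, outputs)

# Each of the 3 input channels now gets 2 filters: output depth = 3 * 2 = 6.
print(model.output_shape)    # (None, 5, 5, 6)
# Parameters: 3*3*3*2 weights + 6 biases = 60
print(model.count_params())  # 60
```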

Depth-wise Separable Convolution

The Depth-wise Separable Convolution is so named because it deals not just with the spatial dimensions, but with the depth dimension — the number of channels — as well. This Convolution originated from the idea that the depth and spatial dimensions of a filter can be separated.

You can separate the height and width dimensions of a matrix into two vectors: a 3 * 3 matrix of 9 values can (if it has rank 1) be represented by two 3 * 1 vectors, i.e. only 6 values. Applying the same idea to separate the depth dimension from the spatial (width * height) dimensions gives us the Depth-wise Separable Convolution, where we perform a Depth-wise Convolution and then use a 1 * 1 filter to cover the depth dimension.

One thing to notice is how many parameters this Convolution saves while producing the same number of output channels. The Depth-wise Convolution step needs 3 * 3 * 3 = 27 parameters, and each further point-wise Convolution over the depth dimension needs 1 * 1 * 3 = 3 parameters. So if we need 3 output channels, we only need three 1 * 1 * 3 point-wise filters on top, giving us a total of 36 (= 27 + 9) parameters, while for the same number of output channels an ordinary Convolution needs three 3 * 3 * 3 filters, i.e. a total of 81 parameters.
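The arithmetic can be written out directly (biases ignored for clarity):

```python
# Depth-wise separable vs ordinary convolution:
# 3x3 kernels, 3 input channels -> 3 output channels.
k, c_in, c_out = 3, 3, 3

depthwise = k * k * c_in          # 27: one 3x3 filter per input channel
pointwise = 1 * 1 * c_in * c_out  # 9: three 1x1x3 point-wise filters
separable_total = depthwise + pointwise

ordinary = k * k * c_in * c_out   # 81: three full 3x3x3 filters

print(separable_total, ordinary)  # 36 81
```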

Here is a practical example of a Depth-wise Separable Convolution layer:

And the number of parameters and the shape of each layer are shown below:
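Such an example can be sketched with the SeparableConv2D layer (again using the illustrative 7 * 7 * 3 input and 128 filters):

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(7, 7, 3))
outputs = tf.keras.layers.SeparableConv2D(filters=128, kernel_size=3)(inputs)
model = tf.keras.Model(inputs, outputs)

print(model.output_shape)    # (None, 5, 5, 128)
# Depth-wise stage: 3*3*3 = 27, point-wise stage: 1*1*3*128 = 384,
# biases: 128 -> total 539 parameters (vs 3584 for an ordinary Conv2D).
print(model.count_params())  # 539
```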

Depth-wise Separable Convolution also provides the parameter depth_multiplier, as shown below:
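A sketch with depth_multiplier=2 (the filter count is again an illustrative choice):

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(7, 7, 3))
outputs = tf.keras.layers.SeparableConv2D(
    filters=128, kernel_size=3, depth_multiplier=2)(inputs)
model = tf.keras.Model(inputs, outputs)

# The depth-wise stage now produces 3 * 2 = 6 intermediate channels;
# the point-wise stage still maps them to the requested 128 filters.
print(model.output_shape)    # (None, 5, 5, 128)
# Parameters: 3*3*3*2 + 1*1*6*128 + 128 biases = 54 + 768 + 128 = 950
print(model.count_params())  # 950
```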

Note: pay attention to the number of parameters of the separable layers in comparison to the case where depth_multiplier was equal to 1. For the Depth-wise Separable Convolution, the depth-wise stage produces input_channels * depth_multiplier intermediate channels, while the final number of output channels is still set by the filters parameter.
