**Depth-wise Separable Convolutions** (shorthand: DepSep convolutions) have been proposed as an efficient alternative to traditional convolutions. They are used in models such as MobileNet (Howard et al., 2017), EfficientNet (Tan et al., 2019), and more. They have fewer parameters and require fewer floating point operations (FLOPs) to compute. However, due to the complexities of modern compute accelerators such as GPUs, metrics such as FLOPs and parameter counts may not correspond with real-world performance.

In this post, we will explore some of the differences between normal convolutions and DepSep convolutions. We will investigate how these differences translate to real-world performance through benchmarks, and try to explain the disparities between theoretical and real-world performance on GPUs.

###### Outline

- Comparing Convolutions to DepSep Convolutions
- Performance Benchmarks
- Estimating Arithmetic Intensity
- Conclusion

#### Comparing Convolutions to DepSep Convolutions

2D convolution operations act on an “image”: an array of `C` channels of `H*W` feature maps. The table below shows the characteristics of a normal convolution and a DepSep convolution operation when performed on a **single** image. For simplicity, we will not consider the bias term, as it does not affect the outcome of our analysis. The rest of this post assumes the reader is familiar with the basic differences between a normal convolution and a DepSep convolution. If you need a quick refresher, check out this blog post by Eli Bendersky.
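As a quick illustration, a DepSep convolution factorises a normal convolution into a per-channel (depthwise) convolution followed by a 1×1 (pointwise) convolution. Below is a minimal NumPy sketch of this factorisation; it is illustrative only (hypothetical helper, “valid” padding, no bias), not the implementation used in the benchmarks:

```python
import numpy as np

def depthwise_separable_conv(x, dw, pw):
    """x: (H, W, C) image; dw: (M, N, C) depthwise filters; pw: (C, K) pointwise."""
    H, W, C = x.shape
    M, N, _ = dw.shape
    Ho, Wo = H - M + 1, W - N + 1  # "valid" padding
    mid = np.empty((Ho, Wo, C), dtype=x.dtype)
    for i in range(Ho):
        for j in range(Wo):
            patch = x[i:i + M, j:j + N, :]            # (M, N, C)
            # depthwise step: each channel is filtered independently
            mid[i, j] = np.einsum("mnc,mnc->c", patch, dw)
    # pointwise step: a 1x1 convolution mixes the C channels into K outputs
    return mid @ pw                                   # (Ho, Wo, K)

x = np.random.rand(8, 8, 4).astype(np.float32)
dw = np.random.rand(3, 3, 4).astype(np.float32)
pw = np.random.rand(4, 6).astype(np.float32)
out = depthwise_separable_conv(x, dw, pw)
print(out.shape)  # (6, 6, 6)
```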

| | Convolution | DepSep Convolution |
|---|---|---|
| Parameters | M·N·C·K | M·N·C + C·K |
| FLOPs | M·N·C·K·H·W | (M·N + K)·C·H·W |

Using the formulas above, we can calculate the number of parameters and FLOPs for a convolution operation. The notation used is as follows:

- Kernel size: `(M, N)`
- Input and output channels: `C`, `K`
- Height and width of image: `H`, `W`

We will use the parameters `M=N=3`, `C=K=128`, `H=W=224`, representing a typical convolution layer inside a ResNet-like model.

| | Convolution | DepSep Convolution |
|---|---|---|
| Parameters | 147k | 18k |
| FLOPs | 7.4 GFLOPs | 0.88 GFLOPs |

We immediately see that DepSep convolutions have almost an order of magnitude fewer parameters and FLOPs! Surely this means that DepSep convolutions will be much faster?
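These counts can be reproduced with a few lines of Python. Note that one multiply-accumulate is counted as one FLOP here, which is the convention that matches the figures above; bias terms are excluded as noted earlier:

```python
M = N = 3; C = K = 128; H = W = 224

# normal convolution: one M*N*C filter per output channel,
# applied at every one of the H*W output positions
conv_params = M * N * C * K
conv_flops = M * N * C * K * H * W

# DepSep: depthwise (M*N filter per channel) + pointwise (1x1, C -> K)
depsep_params = M * N * C + C * K
depsep_flops = (M * N * C + C * K) * H * W

print(f"conv:   {conv_params:,} params, {conv_flops / 1e9:.1f} GFLOPs")
print(f"depsep: {depsep_params:,} params, {depsep_flops / 1e9:.2f} GFLOPs")
```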

#### Performance Benchmarks

We will be performing two sets of benchmarks on an NVIDIA Tesla V100 GPU. For both benchmarks, we will be using **TensorFlow 2.1** compiled from source with **CUDA 10.1** and **cuDNN 7.6.5**. Using the latest CUDA and cuDNN is important, as performance optimisations are typically introduced in new versions. The wheel for the TensorFlow binary we built can be downloaded here. The two benchmarks are:

- A **microbenchmark** comparing the performance of a single convolution layer
- A **training benchmark** comparing the training throughput of a simple convolutional neural network (CNN) on CIFAR10

We will be measuring **GPU utilization metrics**, as well as **throughput metrics** such as convolutions or images per second. The GPU utilization metrics provided by NVIDIA are summarised below:

- **SM utilization**: percent of time over the past sample period during which kernels were executing on the GPU
- **Memory utilization**: percent of time over the past sample period during which global (device) memory was being read or written

##### Microbenchmark

In this benchmark, we will compare the performance of a single convolution layer with `M=N=3`, `C=K=128`, `H=W=224`, on two different batch sizes, 1 and 16. For this benchmark:

- We use **half-precision** (`float16`) since it is a more performant numerical format that works well in both training and inference, and enables the use of NVIDIA’s Tensor Cores
- We use `tf.function` to reduce overheads by running the convolution in graph execution mode

The results from running the benchmark are shown in the table below:

| | Convolution | DepSep Convolution |
|---|---|---|
| **batch = 1** | | |
| SM Util. | 100.0% | 100.0% |
| Mem Util. | 31.0% | 48.0% |
| Conv/sec | 3585 | 4241 |
| FLOPS | 26.5 TFLOPS | 3.73 TFLOPS |
| **batch = 16** | | |
| SM Util. | 100.0% | 100.0% |
| Mem Util. | 35.0% | 54.0% |
| Conv/sec | 4097 | 4335 |
| FLOPS | 30.3 TFLOPS | 3.81 TFLOPS |

We see that not only is the performance of DepSep convolutions (measured in Conv/sec) only slightly better than that of normal convolutions, but their achieved TFLOPS is only about 13% of that of normal convolutions. Memory utilization is also consistently about 50% higher.

###### Results with other Accelerators

We also benchmarked two other accelerators using the same experimental setup, except for the data type. The two additional accelerators tested are Google’s **TPUv2-8** (available on Google Colab) and an **Intel Xeon E5-2698v4** 20-core CPU (view datasheet).

Of note is that the TPUv2-8 consists of **four** TPUv2 chips, each with two cores. This adds some complexity, as our benchmark code can only run on one *core* (as far as I can tell). To estimate the per-*chip* (two-core) performance, we can halve the workload, run it on one core, and then multiply the score by two. For completeness, we include both benchmarks (per *core* and per *chip*).

| Chip | dtype | Conv. | DepSep Conv. |
|---|---|---|---|
| V100 | float16 | 4097 (30 TFLOPS) | 4335 (1.1×, 3.8 TFLOPS) |
| TPUv2 core | bfloat16 | 1459 (11 TFLOPS) | 519 (0.36×, 0.5 TFLOPS) |
| TPUv2 chip | bfloat16 | 2606 (19 TFLOPS) | 536 (0.21×, 0.5 TFLOPS) |
| E5-2698v4 | float32 | 85 (0.6 TFLOPS) | 245 (2.9×, 0.2 TFLOPS) |

The results are reported **per chip** for consistency. Interestingly, the TPUv2 performs much worse on DepSep convolutions than on normal convolutions, while the CPU performs much better. For both the TPUv2 and the V100, the estimated TFLOPS measured by the convolution microbenchmark is only a fraction of the maximum throughput (30 of 125 TFLOPS for the V100 and 19 of 45 TFLOPS for a TPUv2 chip).

We also perform the same benchmark on the V100 and TPUv2 across a range of input/output channel sizes, from 8 to 2048, and plot the results below. The TPUv2 (8 GB per core) hits an out-of-memory error at 1024, and scores from then onwards are reported as 0.

##### Training Benchmark

We create a simple CNN model to classify images from the CIFAR10 dataset. The model consists of 4 convolutional layers, each with `M=N=3`, `C=K=128`, except for the first layer where `C=3`. Training is done with **automatic mixed-precision**, with tf.data and XLA enabled. Model parameters are kept in single-precision (`float32`) while computation is done in half-precision, and a dynamic loss scale is used to prevent numerical instability. Training was done with a batch size of 128.

| | Normal CNN | DepSep CNN |
|---|---|---|
| Parameters | 529k | 136k |
| SM Util. | 57.5% | 61.2% |
| Mem Util. | 17.9% | 38.4% |
| Images/sec | 11123 | 9094 |
| Train Acc. | 81% | 74% |

We again see the disparity between performance suggested by model size (parameter count) and actual measured performance (images/sec). While the lower parameter count of the DepSep CNN suggests it should be faster, we see that the normal CNN achieves higher training throughput. The normal CNN also achieves better training accuracy.

###### Results with other Accelerators

Other accelerators were also benchmarked with the same model, with batch-size changes to better accommodate the different accelerators. XLA is enabled across all the benchmarks. On a Colab TPUv2-8, the batch size is quadrupled so that each TPUv2 chip uses a batch size of 128, for a global batch size of 512. The Keras mixed-precision policy (`bfloat16`) is also used. A comparison with a global batch size of 128 (per-chip batch size of 32) is included for completeness.

| Chip | Batch Size | Normal CNN | DepSep CNN |
|---|---|---|---|
| V100 | 128 | 11123 | 9094 |
| TPUv2 chip | 32 per chip | 2470 | 2693 |
| TPUv2 chip | 128 per chip | 6418 | 8151 |
| E5-2698v4 | 32 | 150 | 687 |

The CPU benchmark is the only one that bucks the general trend, showing almost five times better performance for the DepSep CNN compared to the normal CNN.

#### Estimating Arithmetic Intensity

We can understand the computation profile of each convolution better by attempting to estimate the **arithmetic intensity** of each type of convolution. This gives us the ratio of computation to memory access for each convolution. A naive method of estimation is to divide the number of FLOPs by the number of memory accesses, where we estimate memory accesses as the sum of the number of parameters and the number of input and output activations, assuming complete reuse of weights and activations. This assumption implies an ideal situation in which no redundant memory accesses are performed.

| | Convolution | DepSep Convolution |
|---|---|---|
| Parameters | M·N·C·K | M·N·C + C·K |
| FLOPs | M·N·C·K·H·W | (M·N + K)·C·H·W |
| Input+Output Activations | (C + K)·H·W | (C + K)·H·W |
| Memory Accesses | M·N·C·K + (C + K)·H·W | M·N·C + C·K + (C + K)·H·W |
| Arithmetic Intensity | FLOPs ÷ Memory Accesses | FLOPs ÷ Memory Accesses |

Using the formula above, we can estimate the arithmetic intensity for a convolution operation. We will use the parameters `M=N=3`, `C=K=128`, `H=W=224`, representing a typical convolution layer inside a ResNet-like model. We get the following results:

| | Convolution | DepSep Convolution |
|---|---|---|
| FLOPs | 7.4 GFLOPs | 0.88 GFLOPs |
| Memory Access | 26.0 MB | 25.7 MB |
| Arithmetic Intensity | 569.5 | 68.4 |

We immediately see that, according to our naive estimation, **normal convolutions have more than eight times the arithmetic intensity of DepSep convolutions**! In addition, this naive method assumes perfect reuse of all the weights and activations. In reality, the number of memory accesses would be higher, especially for DepSep convolutions, due to the need to split and combine the image channels. With this information in hand, we can also estimate the effective memory throughput achieved in our microbenchmarks:

| | Convolution | DepSep Convolution |
|---|---|---|
| Est. Mem. Throughput | 106.5 GB/s | 111.4 GB/s |

The estimated bandwidth for both types of convolution is very similar (<5% difference), which hints that, by the design of the microbenchmark, we have hit the practical limit of our memory bandwidth.
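The intensity and throughput estimates above can be reproduced directly from the formulas, assuming 2 bytes per element for `float16` and using the batch-16 Conv/sec figures from the microbenchmark:

```python
M = N = 3; C = K = 128; H = W = 224
BYTES = 2  # bytes per element in float16

def stats(params, flops, conv_per_sec):
    # idealised accesses: weights + input/output activations, each touched once
    accesses = params + (C + K) * H * W
    intensity = flops / accesses                   # FLOPs per element accessed
    throughput = conv_per_sec * accesses * BYTES   # bytes per second
    return accesses * BYTES / 1e6, intensity, throughput / 1e9

conv = stats(M * N * C * K, M * N * C * K * H * W, 4097)
depsep = stats(M * N * C + C * K, (M * N * C + C * K) * H * W, 4335)
print(conv)    # ~ (26.0 MB, 569.5, 106.5 GB/s)
print(depsep)  # ~ (25.7 MB, 68.4, ~111 GB/s)
```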

##### Confirming Our Hypothesis

In order to confirm our hypothesis about the arithmetic intensity, we can profile each convolution (**main compute kernel only**) using Nsight Compute. Interestingly, I realised that while the cuDNN kernel was used for normal convolutions, a TensorFlow-specific kernel was used for DepSep convolutions.

First, we look at the amount of memory access, categorised by loads and stores to the device’s DRAM memory.

| | Convolution | DepSep Convolution |
|---|---|---|
| Kernel Time | 3 ms | 2.2 ms |
| Mem. Throughput | 137.7 GB/s | 218.7 GB/s |
| Mem. Load | 202 MB | 193 MB |
| Mem. Store | 193 MB | 193 MB |

We see more or less the *trend* we are expecting: the amount of memory access is effectively identical for both types of convolution, despite the large difference in compute FLOPS.

Next, we compare this to the actual compute workload on the GPU. Because the various pipelines can execute in parallel, the overall utilization is **not** a sum of the individual pipeline utilizations.

| | Convolution | DepSep Convolution |
|---|---|---|
| Overall Util. | 77.1% | 74.5% |
| Tensor Util. | 76.1% | 0.0% |
| ALU Util. | 6.39% | 64.4% |
| FP32 Util. | 0.6% | 48.9% |
| FP16 Util. | 0.1% | 12.2% |
| XU/SFU Util. | 2.1% | 27.1% |

From the results, we can verify that **TensorFlow’s DepSep kernel does not use Tensor Cores**. Frankly, I don’t fully understand the workload characteristics I am seeing. The kernel seems to spend the majority of its time executing ALU (integer/logic) and FP32 instructions instead of FP16 instructions, which theoretically have double the throughput. There is also much greater utilization of the XU/SFU pipeline, which (I’m guessing) in this context is mainly used for converting between FP32 and FP16. My main conclusion here is that TensorFlow uses a highly optimised cuDNN kernel for normal convolutions, but a suboptimal kernel for DepSep convolutions.

## Conclusion

If we accept that achievable memory bandwidth is the limiting factor and that all our benchmarks here are **memory-bound**, then we can conclude that DepSep convolutions are not faster on GPUs because of memory bandwidth limitations: the main reason for their disappointing performance relative to normal convolutions is their high ratio of memory access to computation.

However, there is also some evidence to suggest that the compute kernel used for DepSep convolutions in TensorFlow is suboptimal. This can of course be a contributing factor to the disappointing performance of DepSep convolutions.

The code used in this blog post can be found in this GitHub repository.

###### Future Work

- Test with other frameworks (PyTorch? JAX?)

## 2 replies on “Depth-wise Separable Convolutions: Performance Investigations”

Hey Timothy! That’s a really nice post. I stumbled upon it while trying to figure out, why isn’t TF using the tensor cores for the depthwise convolution x_x

I wanted to point out, that from my understanding the memory access number of the depsep convolution is higher than MCN + CK + (C + K)HW. I think the intermediate channel map after the depthwise part should also be counted as memory access. So we need to access the original channel map: HWC, output the result of the depthwise convolution: HWC, load the result of the depthwise convolution again: HWC, and output the result of the 1×1 convolution: HWK. So in total we have HW(3C + K) memory accesses of inputs outputs and inner activations. Although from the fact that DRAM load/store is not changing between normal and depsep conv, these additional HW2C of accesses are to the L2 cache, right?

PS Could you perhaps add a way to subscribe to your blog per email? I am generally interested in performance aspects of network execution, and would love to read your new posts when they come out.

Hey, sorry it took so long to reply; as you might notice, I am not very active on my blog. You’re right that in practice the memory access count of the depsep convolution is higher than MCN + CK + (C + K)HW. However, for simplicity I was doing the comparison with a perfect/idealised implementation of depsep convolution, hence disregarding the intermediate memory access. I don’t know for sure if all the parameters are stored in the L2 cache for reuse, but I would assume that is the case for a good implementation.