Art’Em is an application that hopes to bring artistic style transfer to virtual reality. It aims to increase the stylization speed by using low precision networks.
Neural networks are built on matrix multiplications and dot products, and the key to this project is making forward and back propagation fast. In the Art’Em - Week 2 article, we saw how forward propagation involves numerous large matrix multiplications, and we walked through a proof of concept for binarized matrix multiplication. Let’s see how all of this looks in some basic code.
Getting Our Hands Dirty
#include <stdio.h>
#include <stdlib.h>

// printBits prints the binary format of the unsigned int passed to it.
void printBits(size_t const size, void const * const ptr)
{
    unsigned char *b = (unsigned char*) ptr;
    unsigned char byte;
    int i, j;
    for (i = size - 1; i >= 0; i--)
        for (j = 7; j >= 0; j--)
        {
            byte = (b[i] >> j) & 1;
            printf("%u", byte);
        }
    puts("");
    printf("\n");
}

int main()
{
    // Declare and initialize arrays
    float arra[32], arrb[32];
    for (int i = 0; i < 32; i++)
    {
        double x = (double) rand() / RAND_MAX;
        arra[i] = ((x > 0.3) ? 1 : -1);
        arrb[i] = ((x > 0.5) ? -1 : 1);
    }

    // Converting A and B to unsigned ints
    unsigned int returnera = 0, returnerb = 0, sign;
    for (int i = 0; i < 32; i++)
    {
        sign = (arra[i] >= 0);               // << is the left bitshift operator: if sign is 001
        returnera = returnera | (sign << i); // and i is 2, sign<<i gives 100.
        sign = (arrb[i] >= 0);               // returnera starts at 0, and OR-ing in the shifted
        returnerb = returnerb | (sign << i); // sign bits turns the arra matrix into a single
    }                                        // unsigned int (of length 32). Same for arrb.
    printBits(sizeof(returnera), &returnera);
    printBits(sizeof(returnerb), &returnerb);

    // Dot product of matrices
    unsigned int tempj = ~(returnera ^ returnerb); // Very important part: the XNOR operation.
    printBits(sizeof(tempj), &tempj);
    int jj = 2 * (__builtin_popcount(tempj)) - 32; // population count, implemented in hardware as popcnt

    int sum = 0;
    for (int i = 0; i < 32; i++)
        sum += arra[i] * arrb[i];

    printf("\nVerified sum: XNOR: %d and Normal: %d", jj, sum);
}
The above code should give the following result:
00101011111010111111111110011110
11011100101111100010010101100011
00001000101010100010010100000010

Verified sum: XNOR: -14 and Normal: -14
You can see above how arra and arrb have been converted to unsigned ints, and how their XNOR operation is followed by a population count (popcnt) operation.
The same approach can be extended to higher-dimensional matrices.
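To make that concrete, here is a minimal sketch of how the same packing trick could look for a full matrix product. It assumes every row of A and every column of B have already been packed into 32-bit words exactly as above, and that the inner dimension K is a multiple of 32; the name xnor_gemm and its arguments are illustrative, not code from the project.

#include <stdint.h>

// C[m][n] = dot(row m of A, column n of B) for +1/-1 matrices.
// a_packed holds M packed rows of A, b_packed holds N packed columns of B,
// each as K/32 unsigned 32-bit words (one bit per +1/-1 entry).
void xnor_gemm(int M, int N, int K,
               const uint32_t *a_packed,
               const uint32_t *b_packed,
               int *C)
{
    int words = K / 32;                     // K is assumed to be a multiple of 32
    for (int m = 0; m < M; m++)
        for (int n = 0; n < N; n++)
        {
            int pop = 0;
            for (int w = 0; w < words; w++)
                // XNOR marks the positions where the signs match;
                // popcount sums the matches 32 at a time.
                pop += __builtin_popcount(~(a_packed[m * words + w] ^
                                            b_packed[n * words + w]));
            // Same rescaling as before: matches minus mismatches.
            C[m * N + n] = 2 * pop - K;
        }
}

For a 1024-element inner dimension, each output entry then costs 32 XNORs and 32 popcounts instead of 1024 floating point multiply-adds.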
The past two weeks have largely been about benchmarking general matrix multiply (GEMM) operations to explore the applications of this technique.
Benchmarking
I believe that, if properly trained, neural networks designed to run on the XNOR-net architecture can perform almost as well as full precision networks. But the true efficacy of using this architecture for general purpose deep learning will remain an open question until more research is done.
Before implementing XNOR for dot products and general matrix multiplication, I decided to do a small rudimentary case study on how I can use the Intel® Xeon Phi™ cluster as well as GPUs for bringing my project, Art’Em, to life. This is especially important as I hope to achieve really good results with the Intel® Xeon Phi™ cluster for the XNOR-net.
The case study below tests multiplication of full precision matrices, where each matrix is of size 2^n. The choice of 2^n is primarily because CUDA parallelization works much better when the matrix dimensions are multiples of the block size. The matrix multiplication method used with CUDA was the shared-memory approach. When I modified the code to support matrices of all sizes, there was a significant slowdown in the matrix multiplication.
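For context, the shared-memory approach mentioned above typically looks something like the tiled kernel sketched below. This is not the exact benchmark code: the kernel name gemm_shared and the tile size of 16 are illustrative, and the matrix dimension n is assumed to be a multiple of the tile size, which is exactly why power-of-two sizes map so cleanly onto the block size.

#define TILE 16

// Each thread block computes one TILE x TILE tile of C = A * B,
// staging tiles of A and B in shared memory to cut global memory traffic.
// All matrices are n x n and row-major, with n a multiple of TILE.
__global__ void gemm_shared(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; t++)
    {
        // Each thread loads one element of the current A tile and B tile.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        // Multiply the two tiles held in shared memory.
        for (int k = 0; k < TILE; k++)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}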
One of the problems in this case study was the inability of my GPU to run matrix multiply operations for sizes greater than 8192. With a more powerful GPU, much larger matrices can be multiplied. However, I have restricted myself to 8192.
I ran the same matrix multiplication algorithm on the Intel® Xeon Phi™ cluster. I used the MKL and OpenMP for this purpose. I have also benchmarked rudimentary CUDA XNOR GEMM code for fixed matrix sizes. You can find the specifications of the Intel® Xeon Phi™ cluster used as well as the GPU at the bottom of this article.
The X-axis is the size of both square matrices, and the Y-axis is the time taken for the multiplication, in seconds.
GPU
I used three kernels: the cuBLAS kernel, a CUDA GEMM kernel, and the XNOR GEMM kernel. It is extremely important to note that the Y-axis is logarithmic.
On the GPU, the homemade XNOR GEMM kernel significantly outperforms even the highly optimized cuBLAS function. This shows great promise for the XNOR-net strategy.
However, once we modify the code to support custom size matrix multiplication, we observe some slowdown in performance of the GEMM function. This is due to the extra redundant multiplications.
This should not be a problem once we design neural networks specifically for the XNOR architecture. We can still expect a significant increase in throughput with this strategy provided the networks are trained to work with this architecture.
CPU
I used two kernels: the optimized ‘cblas_sgemm’ function from the Intel® Math Kernel Library (Intel® MKL), and a classical matrix multiply function. A great in-depth analysis of classical matrix multiply functions and their efficiency can be found here. It is extremely important to note that the Y-axis is logarithmic.
While the XNOR GEMM code is not yet ready for general matrix multiplication, I benchmarked multiplication of full precision matrices. The highly optimized CBLAS MKL outperforms the homemade classical matrix multiplication code significantly. This is not surprising at all.
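For reference, here is a rough sketch of the two CPU paths being compared: a classical triple-loop multiply and a call to MKL’s cblas_sgemm. The function names naive_gemm and mkl_gemm, the row-major layout, and the square size n are assumptions for illustration, not the benchmark code itself.

#include <mkl.h>

// Classical O(n^3) multiply: C = A * B, all matrices n x n, row-major.
void naive_gemm(const float *A, const float *B, float *C, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
        {
            float acc = 0.0f;
            for (int k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
}

// The MKL path: C = 1.0 * A * B + 0.0 * C, same layout and sizes.
void mkl_gemm(const float *A, const float *B, float *C, int n)
{
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0f, A, n, B, n, 0.0f, C, n);
}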
However, I hope to get a significant speedup once I begin packing the matrices into unsigned integers and running bitwise operations instead of multiplying the matrices in full precision.
The next phase of this project will aim at creating highly optimized CUDA (GPU) and Intel® MKL-supported (CPU) code for XNOR general matrix multiplication of matrices of any size. I also hope to delve into optimizing convolution operations with the XNOR-net architecture.
While it is wiser to focus on creating networks with matrix sizes of the order 2^n, it is also very important to bring XNOR-nets to existing architectures such as VGG-16, AlexNet, etc.
CPU used:
Processor name: Intel® Xeon Phi™ processor 7210
Cores: 64
Processors (CPUs): 256
Cores per package: 64
Threads per core: 4
On-Package Memory: 16 GB high bandwidth MCDRAM (bandwidth ~400 GB/s)
DDR4 Memory: 96 GB 6 Channel (Bandwidth ~ 80 GB/s)
ISA: Intel® Advanced Vector Extensions 512 (Intel® AVX-512), (Vector length 512-bit)
GPU used:
Manufacturer: NVIDIA
GPU name: NVIDIA* GeForce 840M
Core Speed:1029 MHz
Memory Speed: 2000 MHz
Max. Amount of Memory: 4096 MB
Read the Week 2 Update