By Lokesh Kumar T
Sept 2018
IIT Madras
$$\vec{a} = \begin{bmatrix}1 \\ 2 \\ \vdots \\ n\end{bmatrix}_{n \times 1}$$
Let's take two vectors $\vec{a}, \vec{b}$, both in $\mathbb{R}^n$ ($n$-dimensional space).
$ \vec{a} = \begin{bmatrix}a_1 \\a_2\\\vdots\\a_n\end{bmatrix} $
$ \vec{b} = \begin{bmatrix}b_1 \\b_2\\\vdots\\b_n\end{bmatrix} $
What is $\vec{a} + \vec{b}$? (Vector addition)
$ \vec{a} + \vec{b} = \begin{bmatrix}a_1 + b_1\\a_2 + b_2\\\vdots\\a_n + b_n\end{bmatrix} $
# Serial version: compute one element of c per iteration
for i in range(n):
    c[i] = a[i] + b[i]
In other words, can we parallelize it?
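Notice that each $c[i]$ depends only on $a[i]$ and $b[i]$, so the iterations can be computed in any order, or all at once. A minimal sketch of that independence (the array sizes and names here are only illustrative):

import numpy as np

n = 8
a = np.arange(n, dtype=np.float64)
b = np.arange(n, dtype=np.float64)

# Serial version: one element per iteration, in order.
c_serial = np.empty(n)
for i in range(n):
    c_serial[i] = a[i] + b[i]

# Because every c[i] is independent, a vectorised (data-parallel) add
# gives the same answer; the GPU will exploit the same independence
# by giving each element to a different thread.
c_parallel = a + b

assert np.allclose(c_serial, c_parallel)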
Fundamental operation: addition of two numbers
Input: one element from each of the two vectors, $a[i]$ and $b[i]$
The same operation is performed on different data items (here $a[i]$ and $b[i]$).
SIMD Processing: Single Instruction, Multiple Data
The same addition is performed on different compute units of the GPU.

Numba gives you the power to speed up your applications with high-performance functions written directly in Python.
We will look at a basic program and understand the basics of Numba programming.
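Before moving to the GPU, here is a minimal sketch of what Numba's CPU JIT does to ordinary Python code (the function name `scale_by_two` and the sizes are only illustrative, not part of the original example):

import numpy as np
from numba import njit

@njit   # compile this function to machine code on first call
def scale_by_two(x):
    out = np.empty_like(x)
    for i in range(x.size):
        out[i] = x[i] * 2
    return out

x = np.ones(10)
print(scale_by_two(x))   # the loop runs at compiled, not interpreted, speed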
# !conda install -c numba numba
import numba
from numba import cuda
print(cuda.gpus)  # list the CUDA-capable GPUs visible to Numba
<Managed Device 0>
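For a little more detail about the hardware Numba found, it also provides `cuda.is_available()` and `cuda.detect()`; a small optional check (the exact report depends on your machine):

from numba import cuda

# detect() prints a summary of the CUDA devices Numba can see and
# returns True if at least one supported device is present.
if cuda.is_available():
    cuda.detect()
else:
    print("No CUDA-capable GPU found")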
import numpy as np
# SCALING A VECTOR BY 2
# Create the data array - usually initialized some other way
data = np.ones(256*4096) # 1,048,576 elements
threadsperblock = 256
# Ceiling division: enough blocks to cover every element
blockspergrid = (data.size + (threadsperblock - 1)) // threadsperblock
print ("Blocks in one grid:\t" + str(blockspergrid))
print ("Threads in one Block:\t" + str(threadsperblock))
Blocks in one grid:	4096
Threads in one Block:	256
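The `(size + threads - 1) // threads` idiom is just ceiling division, so array sizes that are not an exact multiple of the block size are still fully covered; a quick sketch with made-up sizes:

# Hypothetical sizes, just to illustrate the rounding-up behaviour.
for size in (1024, 1000, 1048576):
    threads = 256
    blocks = (size + (threads - 1)) // threads
    print(size, "elements ->", blocks, "blocks of", threads, "threads")
    assert blocks * threads >= size   # every element gets a thread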
from numba import cuda

@cuda.jit
def my_kernel(io_array):
    pos = cuda.grid(1)
    if pos < io_array.size:  # Check array boundaries
        io_array[pos] *= 2   # do the computation
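For reference, `cuda.grid(1)` is shorthand for the usual absolute-thread-index computation; a sketch of the same kernel written out explicitly (the name `my_kernel_explicit` is only illustrative, and it is launched exactly like `my_kernel`):

@cuda.jit
def my_kernel_explicit(io_array):
    tx = cuda.threadIdx.x   # thread index within the block
    bx = cuda.blockIdx.x    # block index within the grid
    bw = cuda.blockDim.x    # number of threads per block
    pos = tx + bx * bw      # same value as cuda.grid(1)
    if pos < io_array.size:
        io_array[pos] *= 2

# my_kernel_explicit[blockspergrid, threadsperblock](data)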
%%timeit
# Now start the kernel
# And time the GPU execution time also
my_kernel[blockspergrid, threadsperblock](data)
13.6 ms ± 66.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
# timing the CPU Operations
data_2 = data*2
2.35 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
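Part of the reason the GPU looks slower here is that `data` is a host NumPy array, so every launch in the timed loop pays for a host-to-device copy (and a copy back) on top of a very light kernel. A sketch of keeping the array resident on the device instead (variable names are illustrative); timing the launch on `d_data` then isolates kernel cost from transfer cost:

# Move the array to the GPU once, outside any timed region.
d_data = cuda.to_device(data)

# Launch on the device-resident array; no per-call host<->device copies.
my_kernel[blockspergrid, threadsperblock](d_data)

# Copy back to the host only when the result is actually needed.
result = d_data.copy_to_host()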
The next problem: matrix multiplication, $C = A \times B$. How will you approach it?
How will you assign the threads and blocks?
Is it 1D, as we saw in scalar multiplication?
# Host code
# Initialize the data arrays
m = 2**11 # 2048
n = 2**11
p = 2**11
A = np.full((m, n), 1, np.float64) # matrix containing all 1's
B = np.full((n, p), 1, np.float64) # matrix containing all 1's
# Copy the arrays to the device
A_global_mem = cuda.to_device(A)
B_global_mem = cuda.to_device(B)
# Allocate memory on the device for the result
C_global_mem = cuda.device_array((m, p))
@cuda.jit
def matmul(A, B, C):
    """Perform matrix multiplication of C = A * B."""
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = 0.
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp
threadsperblock = (32, 32)
# Dimension of the matrix we defined is 2048x2048
blockspergrid_x = int(np.ceil(A.shape[0] / threadsperblock[0]))
blockspergrid_y = int(np.ceil(B.shape[1] / threadsperblock[1]))
blockspergrid = (blockspergrid_x, blockspergrid_y)
# Start the kernel
matmul[blockspergrid, threadsperblock](A_global_mem, B_global_mem, C_global_mem)
# Copy the result back to the host
C = C_global_mem.copy_to_host()
print(C)
[[2048. 2048. 2048. ... 2048. 2048. 2048.]
 [2048. 2048. 2048. ... 2048. 2048. 2048.]
 [2048. 2048. 2048. ... 2048. 2048. 2048.]
 ...
 [2048. 2048. 2048. ... 2048. 2048. 2048.]
 [2048. 2048. 2048. ... 2048. 2048. 2048.]
 [2048. 2048. 2048. ... 2048. 2048. 2048.]]
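Since both inputs are filled with ones, every entry of $C$ should equal the shared dimension $n = 2048$, which is what the output shows. A quick sanity check of the kernel against NumPy on a smaller, made-up problem size (all names and sizes here are illustrative):

# Sanity check on a small, hypothetical size so the CPU reference is cheap.
ms, ns, ps = 64, 48, 32
As = np.random.rand(ms, ns)
Bs = np.random.rand(ns, ps)
Cs_gpu = cuda.device_array((ms, ps))

tpb = (16, 16)
bpg = (int(np.ceil(ms / tpb[0])), int(np.ceil(ps / tpb[1])))
matmul[bpg, tpb](cuda.to_device(As), cuda.to_device(Bs), Cs_gpu)

assert np.allclose(Cs_gpu.copy_to_host(), As.dot(Bs))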
%%timeit -n 10
matmul[blockspergrid, threadsperblock](A_global_mem, B_global_mem, C_global_mem)
299 µs ± 134 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n 10
C = A.dot(B)
463 ms ± 20.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
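One caveat when reading these two numbers: Numba kernel launches are asynchronous, so the GPU cell above largely measures launch overhead rather than the complete multiplication. A rough sketch of a fairer timing cell, blocking until the kernel has actually finished:

%%timeit -n 10
matmul[blockspergrid, threadsperblock](A_global_mem, B_global_mem, C_global_mem)
cuda.synchronize()   # wait for the kernel to complete before the timer stops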
Computers no longer get faster, just Wider
Rethink your algorithms to be parallel
Data-Parallel Computing is the most scalable solution
This presentation and extensive resources can be found on my GitHub: tlokeshkumar.
You will also find projects and other code on GPU programming, deep learning, and more there. Do check them out if interested!
Feel free to contact me at lokesh.karpagam@gmail.com