Playing with ArrayFire: 06 Introduction to Vectorization


Preface

Programmers and data scientists want to take advantage of fast parallel computing devices. To get the best performance from current parallel hardware and scientific computing software, code must be vectorized. However, writing vectorized code is not always intuitive. ArrayFire provides several ways to vectorize a given code segment. This article introduces those ways and discusses the advantages and disadvantages of each.


1. General/default vectorization

    At its core, ArrayFire is a vectorized library. Most functions operate on an array as a whole, applying the operation to all elements in parallel. Whenever possible, use the existing vectorized functions instead of indexing into the array manually. For example, consider the following code:

	af::array a = af::range(10); // [0,  9]
	for(int i = 0; i < a.dims(0); ++i)
	{
	    a(i) = a(i) + 1;         // [1, 10]
	}

    Although this code is perfectly valid, it is very inefficient because each iteration launches a kernel that operates on only one element. Instead, developers should use ArrayFire's overloaded + operator:

	af::array a = af::range(10);  // [0,  9]
	a = a + 1;                    // [1, 10]

    This code launches a single kernel that operates on all 10 elements of a in parallel.
    Most ArrayFire functions are vectorized. A small subset of them includes:

    Operator type                            Functions
    Arithmetic operations                    +, -, *, /, %, >>, <<
    Logical operations                       &&, ||, <, >, ==, !=
    Numeric functions                        abs(), floor(), round(), min(), max()
    Complex number operations                real(), imag(), conj()
    Exponential and logarithmic functions    exp(), log(), expm1(), log1p()
    Trigonometric functions                  sin(), cos(), tan()
    Hyperbolic functions                     sinh(), cosh(), tanh()

    In addition to element-wise operations, many other functions in ArrayFire are also vectorized.
    Note that even functions that perform some form of aggregation (such as sum() or min()), signal processing (such as convolve()), or image processing (such as rotate()) can be vectorized across different columns or images. For example, if we have NUM images of width WIDTH and height HEIGHT, we can convolve every image in a single vectorized call:

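	// 3x3 smoothing kernel coefficients (normalized by 1/16 below)
	float g_coef[] = { 1, 2, 1,
	                   2, 4, 2,
	                   1, 2, 1 };
	af::array filter = 1.f/16 * af::array(3, 3, g_coef);
	af::array signal = randu(WIDTH, HEIGHT, NUM);
	// Convolve all NUM images with the same filter in one call
	af::array conv = convolve2(signal, filter);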

    Similarly, you can use the following code to rotate 100 images by 45 degrees in one call:

	// Construct an array of 100 WIDTH x HEIGHT images of random numbers
	af::array imgs = randu(WIDTH, HEIGHT, 100);
	// Rotate all of the images in a single command.
	// rotate() takes the angle in radians, so 45 degrees is Pi/4.
	af::array rot_imgs = rotate(imgs, af::Pi / 4);
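    To make the per-column behavior of aggregation functions such as sum() concrete, here is a minimal sketch in plain C++ that does not use ArrayFire. It models what a column-wise sum of a 2-D array computes: one result per column. The column_sums name and the column-major std::vector layout are assumptions made for this illustration only; ArrayFire would do the equivalent work on the device in one call.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not an ArrayFire function): sums each column of a
// rows x cols matrix stored in column-major order, i.e. all elements of
// column 0 first, then column 1, and so on.
std::vector<float> column_sums(const std::vector<float>& data,
                               std::size_t rows, std::size_t cols)
{
    assert(data.size() == rows * cols);
    std::vector<float> sums(cols, 0.0f);
    for (std::size_t c = 0; c < cols; ++c) {
        // Each column is an independent reduction, so a parallel device
        // can reduce all columns at the same time.
        const auto first = data.begin() + static_cast<std::ptrdiff_t>(c * rows);
        const auto last = first + static_cast<std::ptrdiff_t>(rows);
        sums[c] = std::accumulate(first, last, 0.0f);
    }
    return sums;
}
```

    Because no column depends on any other, the outer loop carries no data dependency, which is what allows a library to process all columns (or all images) in one batched call.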

    Although most functions in ArrayFire support vectorization, some do not. The most notable exceptions are the linear algebra functions. Even though they are not vectorized, linear algebra operations still execute in parallel on your hardware.
    Using the built-in vectorized operations is the best and preferred way to vectorize any code written with ArrayFire.

2. GFOR: Parallel for loop

     Another vectorization method that ArrayFire introduces is GFOR, a replacement construct for the for loop. GFOR allows all iterations of a loop to be launched in parallel on the GPU or device, as long as the iterations are independent. A standard for loop executes each iteration sequentially, while ArrayFire's gfor loop executes all iterations at the same time (in parallel). ArrayFire accomplishes this by "tiling" the values of all loop iterations and then performing the computation once on those tiled values. You can think of gfor as automatic vectorization of your code: you write a gfor loop that increments each element of a vector, and behind the scenes ArrayFire rewrites it to operate on the entire vector in parallel.
    The for loop example at the beginning of this article can be rewritten using GFOR as follows:

	af::array a = af::range(10);
	gfor(seq i, 10)              // one iteration per element of a
	    a(i) = a(i) + 1;

    In this case, each iteration of the gfor loop is independent, so ArrayFire automatically tiles the array a in device memory and executes the increment kernel in parallel.
    As another example, you can run accum() on each slice of a matrix in a for loop, or you can "vectorize" the work and do everything in a single gfor loop:

	// runs each accum() in sequence
	for (int i = 0; i < N; ++i)
	    B(span,i) = accum(A(span,i));
	// runs N accums in parallel
	gfor (seq i, N)
	    B(span,i) = accum(A(span,i));

    However, returning to our first vectorization technique, accum() is already vectorized, so the whole operation reduces to:

	B = accum(A);
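    To show what the single accum() call computes for a matrix, here is a minimal sketch in plain C++ that does not use ArrayFire. It models an inclusive running sum down each column, which is the same result the per-column loops above produce. The column_accum name and the column-major std::vector layout are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not an ArrayFire function): inclusive running sum
// down each column of a rows x cols matrix stored in column-major order.
std::vector<float> column_accum(const std::vector<float>& a,
                                std::size_t rows, std::size_t cols)
{
    assert(a.size() == rows * cols);
    std::vector<float> b(a.size());
    for (std::size_t c = 0; c < cols; ++c) {
        const auto offset = static_cast<std::ptrdiff_t>(c * rows);
        const auto len = static_cast<std::ptrdiff_t>(rows);
        // Columns do not depend on each other, which is why the loop over
        // columns can be replaced by one batched call.
        std::partial_sum(a.begin() + offset, a.begin() + offset + len,
                         b.begin() + offset);
    }
    return b;
}
```

    For a 3 x 2 input with columns (1, 2, 3) and (10, 20, 30), the sketch yields columns (1, 3, 6) and (10, 30, 60).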

    It is best to use vectorized computation wherever possible to avoid the overhead of both for loops and gfor loops. The gfor construct is most useful in the narrow case of broadcast-style operations. Consider a situation where we have a vector of constants that we want to apply to a collection of variables, such as when expressing the values of a linear combination of multiple vectors. Broadcasting one set of constants to many vectors works well in a gfor loop:

	const static int p=4, n=1000;
	af::array consts = af::randu(p);
	af::array var_terms = randn(p, n);
	af::array combination = constant(0, p, n);  // output, one column per iteration
	gfor(seq i, n)
	    combination(span, i) = consts * var_terms(span, i);
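    The broadcast performed by this gfor loop can be modeled in plain C++ without ArrayFire. The sketch below multiplies every column of a p x n matrix element-wise by the same length-p vector of constants. The scale_columns name and the column-major std::vector layout are assumptions made for this illustration; it shows only the arithmetic, not how ArrayFire schedules it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an ArrayFire function): multiplies every column
// of a p x n column-major matrix element-wise by the same vector of
// p constants.
std::vector<float> scale_columns(const std::vector<float>& consts,
                                 const std::vector<float>& var_terms,
                                 std::size_t p, std::size_t n)
{
    assert(consts.size() == p);
    assert(var_terms.size() == p * n);
    std::vector<float> combination(p * n);
    // The outer index i plays the role of the gfor iterator: every value
    // of i reads the same consts and writes its own column.
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t r = 0; r < p; ++r) {
            combination[i * p + r] = consts[r] * var_terms[i * p + r];
        }
    }
    return combination;
}
```

    Since each column is written by exactly one value of i and the constants are only read, the iterations are independent, which is the condition gfor requires.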

    Using GFOR requires following several rules and guidelines for optimal performance. The details of this vectorization method will be introduced in a later section.

3. Batch processing

     The batchFunc() function allows an existing ArrayFire function to be broadcast across multiple data sets. In effect, batchFunc() lets ArrayFire functions run in "batch" mode. In this mode, the function finds the dimension that contains the "batch" of data to be processed and parallelizes the work across it.
    Consider the following example. Here, we create a filter and apply it to each weight vector. The naive solution is to use a for loop, as we saw earlier:

	// Create the filter and the weight vectors
	af::array filter = randn(5, 1);
	af::array weights = randu(5, 5);
	// Apply the filter using a for-loop
	af::array filtered_weights = constant(0, 5, 5);
	for(int i=0; i<weights.dims(1); ++i){
	    filtered_weights.col(i) = filter * weights.col(i);
	}

    However, as we discussed above, this solution will be very inefficient. One might try to implement a vectorized solution as follows:

	// Create the filter and the weight vectors
	af::array filter = randn(1, 5);
	af::array weights = randu(5, 5);
	af::array filtered_weights = filter * weights; // fails due to dimension mismatch

    However, the dimensions of the filter and the weights do not match, so ArrayFire generates a runtime error. batchFunc() was created to solve this specific problem. The function declaration is as follows:

	array batchFunc(const array &lhs, const array &rhs, batchFunc_t func);

    Here, batchFunc_t is a function pointer type of the form:

	typedef array (*batchFunc_t) (const array &lhs, const array &rhs);

    Therefore, to use batchFunc(), we need to provide the function that we want to apply as a batch operation. For illustration, let us "implement" a multiplication function with the required signature:

	af::array my_mult (const af::array &lhs, const af::array &rhs){
	    return lhs * rhs;
	}

    The final batch call is not much more difficult than the ideal syntax we imagined.

	// Create the filter and the weight vectors
	af::array filter = randn(1, 5);
	af::array weights = randu(5, 5);
	// Apply the batch function
	af::array filtered_weights = batchFunc( filter, weights, my_mult );
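    The tiling that batch mode is described as performing can be modeled in plain C++ without ArrayFire. The sketch below is a simplified model under the assumption that a dimension of size 1 is repeated to match the other operand; the Matrix type and the batch_multiply name are hypothetical and exist only for this illustration. It is not ArrayFire's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 2-D matrix type for this sketch, stored in column-major order.
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> data;  // element (r, c) lives at data[c * rows + r]
    float at(std::size_t r, std::size_t c) const { return data[c * rows + r]; }
};

// Hypothetical helper: element-wise multiply in which a dimension of size 1
// is repeated ("tiled") to match the other operand.
Matrix batch_multiply(const Matrix& lhs, const Matrix& rhs)
{
    // Each dimension must either match or have size 1 in one operand.
    assert(lhs.rows == rhs.rows || lhs.rows == 1 || rhs.rows == 1);
    assert(lhs.cols == rhs.cols || lhs.cols == 1 || rhs.cols == 1);
    const std::size_t rows = lhs.rows > rhs.rows ? lhs.rows : rhs.rows;
    const std::size_t cols = lhs.cols > rhs.cols ? lhs.cols : rhs.cols;
    Matrix out{rows, cols, std::vector<float>(rows * cols)};
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            // A size-1 dimension always reads index 0, which has the same
            // effect as tiling that operand along the dimension.
            const float l = lhs.at(lhs.rows == 1 ? 0 : r, lhs.cols == 1 ? 0 : c);
            const float v = rhs.at(rhs.rows == 1 ? 0 : r, rhs.cols == 1 ? 0 : c);
            out.data[c * rows + r] = l * v;
        }
    }
    return out;
}
```

    In this model, a 1 x 5 filter multiplied with a 5 x 5 weights matrix scales column c of the weights by filter element c, without ever materializing a tiled 5 x 5 copy of the filter.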

    The batch function can be used with many of the vectorized ArrayFire functions mentioned earlier. It can even apply a combination of those functions if they are wrapped in a helper function that matches the batchFunc_t signature. One limitation of batchFunc() is that it currently cannot be used inside gfor() loops.

4. Advanced vectorization

    We have now seen the different methods ArrayFire provides for vectorizing code. Combining them is a somewhat more involved process that requires considering data dimensions and layout, memory usage, nesting order, and so on. The official blog has a good example and discussion of these factors:

    http://arrayfire.com/how-to-write-vectorized-code/

    It is worth noting that the code discussed in that blog post has since been turned into the convenient af::nearestNeighbour() function. Before writing something from scratch, check whether ArrayFire already implements it. Besides replacing dozens of lines of code, ArrayFire's default vectorization and large collection of functions can also make your code faster.


Origin blog.csdn.net/weixin_42467801/article/details/113620857