I was curious how finicky autovectorization tends to be--gcc's documentation lists lots of types of loops that can be vectorized, while the section labeled "Unvectorizable Loops" contains only a single example (which reminds me of the "Bugs" section of many GNU man pages).
One article I found from last year tested out the vectorization claims in gcc 4.7. Essentially, operation ordering and memory alignment can make a huge difference in determining whether the code can be effectively vectorized. In particular, even if arrays are set up to be contiguous in memory, if gcc can't assume that the arrays are aligned on n-word boundaries (where n is the vectorization width), it can produce inefficient, half-vectorized monstrosities or just fail to vectorize altogether.
Judging from the assert((n % 8) == 0), do we need to pad every array whose length is not a multiple of 8 up to a multiple of 8?
Yes: if you wanted to run this code, you would need to make sure your array size was a multiple of 8, since each store instruction writes 8 floats. Otherwise you would be writing to memory that wasn't yours.
This is kind of a contrived example; in practice you'd probably instead write a special case to handle the extra slop that didn't fit into a full vector...
Why are square or rectangular tiles better than horizontal spans? Is it because they give the smallest number of tiles?
I think the sample points in a square or rectangular tile are all clustered in a tight area, so it is quite likely that some tiles don't overlap the triangle at all, and we can cull those tiles entirely. With horizontal spans, however, every span may contain part of the triangle, so it is not possible to cull any of them.
Right. I was also confused about this until I saw the next slides. Normally you want to avoid accessing things in tiles unless you have to (e.g., for box blurs), since it requires out-of-order memory access, but it's not a problem if you pre-arrange everything else into a tiled format. This does make it hard to change tile sizes on the fly, though.
I guess the worry here is that we could get caught in a really long conditionally-executed block that only one, or a few, lanes are using.
In my opinion, branch divergence was not a very critical issue in GPUs' original graphics applications, since shaders had few branches and those branches were short compared to the rest of the code, so performance did not decrease significantly. But in general-purpose GPU applications such as high-performance computing, people care more about branching, and they try to apply optimizations such as reorganizing the warp (the group of threads that execute together) to get higher occupancy.
Another issue that appears here, which is not very theoretically interesting but tends to bite GPU programmers in practice, is that some kinds of complicated control flow are only supported on a subset of graphics cards. On unsupported cards, these instructions kick the shader driver back into a CPU-based emulation mode, causing a slowdown far more drastic than the one discussed here. If you're writing an application that's supposed to be portable across many machines, thorough testing and profiling on lots of graphics cards is critical to catch these slowdowns (they can usually be avoided with clever shader writing).
There was a faculty candidate talk about this recently. It's a language for writing image processing algorithms that allows the compiler to automatically optimize your code.