12-15-2010, 12:21 AM
12-15-2010, 09:09 AM
(12-15-2010, 12:21 AM)gat3way Wrote: [ -> ]Doesn't the clc compiler generate BFI_INT code when bitselect() is used?
Well, I tried it with a simple OpenCL kernel that uses bitselect(), but the resulting .isa code contains no BFI_INT, so I guess it's not using it.
12-15-2010, 05:48 PM
(12-15-2010, 09:09 AM)atom Wrote: [ -> ]well i tried it with a simple opencl kernel where i put in the bitselect() but in the resulting .isa code there is no BFI_INT so i guess its not using it.
clc compiles C into IL, and then calclCompile() is used for the IL-to-ISA compilation. As there is no BFI_INT at the IL level, it isn't possible to use bitselect() in the intended way. Can only say "thanks" to ATI again.
12-15-2010, 07:33 PM
I did not use clc. I used the GPU_DUMP_DEVICE_KERNEL environment variable. If it is defined, the runtime dumps the .il and the .isa when it executes an OpenCL kernel.
12-15-2010, 08:24 PM
(12-15-2010, 07:33 PM)atom Wrote: [ -> ]i did not use clc. i use the GPU_DUMP_DEVICE_KERNEL environment variable. if defined it dumps the .il and the .isa when its execing a opencl kernel.
Whether clc or the OpenCL routines are used doesn't matter, as either way it ends in the same OpenCL C -> IL -> ISA compilation chain.
In theory you can also post-process the generated kernels to replace the required instructions with BFI_INT. I haven't taken a look at OpenCL for a long time, but I guess that the binary images are the same as for CAL itself (obtained with the calclCompile(), calclLink(), calclImageGetSize(), calclImageWrite() sequence) -- an ELF binary with several sections. So the process should be exactly the same as for CAL/IL kernels.
12-15-2010, 09:22 PM
Yes, I totally agree on that. But what I originally wanted to say is that it is not "enough" to have C compiled to IL using clc, since there is no BFI_INT instruction in IL. That's why I used the GPU_DUMP_DEVICE_KERNEL environment variable to get the ISA. The reason for this was just to find out whether we can use bitselect() from OpenCL.
12-15-2010, 11:06 PM
It seems that bitselect() does not offer any benefit over using (a&b)|(~a&c). None at all. Not that I hoped the compiler would somehow magically generate better IL code using some magic instructions, but I was thinking there could be some wise trick they do. I thought that bitselect() was there for some reason.
Decided to make a simple test. Basically the round function in MD5 (round1 and 2) is:
1) (b & c) | ((~b) & d)
which can be directly replaced with
2) bitselect(d,c,b)
and the third option is to rewrite it as:
3) d ^ (b & (c ^ d))
1) and 2) give the same speed and 3) is a bit faster, but the difference is negligible.
Argh
Guess what...
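For reference, here is a minimal C sketch of the three forms, just to check that they compute the same function. It assumes the OpenCL definition of bitselect(a, b, c), which takes each bit from b where the matching bit of c is set and from a otherwise, so the call that matches form 1) is bitselect(d, c, b). The names bitselect_u32, md5_f_form1..3 and md5_f_forms_agree are made up for this illustration; bitselect_u32 is only a scalar stand-in, not the OpenCL built-in itself.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar stand-in for OpenCL bitselect(a, b, c) (hypothetical helper):
 * result bit = bit of b where c is 1, bit of a where c is 0. */
uint32_t bitselect_u32(uint32_t a, uint32_t b, uint32_t c) {
    return (a & ~c) | (b & c);
}

/* Form 1: the textbook select, b chooses between c and d. */
uint32_t md5_f_form1(uint32_t b, uint32_t c, uint32_t d) {
    return (b & c) | (~b & d);
}

/* Form 2: the same select written through bitselect, with b as the mask. */
uint32_t md5_f_form2(uint32_t b, uint32_t c, uint32_t d) {
    return bitselect_u32(d, c, b);
}

/* Form 3: one operation fewer, but each step depends on the previous one. */
uint32_t md5_f_form3(uint32_t b, uint32_t c, uint32_t d) {
    return d ^ (b & (c ^ d));
}

/* All three forms work bit by bit, so checking the 8 combinations of
 * all-zeros / all-ones inputs covers every possible bit pattern.
 * Returns 1 if the forms agree everywhere, 0 otherwise. */
int md5_f_forms_agree(void) {
    const uint32_t pattern[2] = { 0x00000000u, 0xFFFFFFFFu };
    for (int i = 0; i < 2; ++i) {
        for (int j = 0; j < 2; ++j) {
            for (int k = 0; k < 2; ++k) {
                uint32_t b = pattern[i], c = pattern[j], d = pattern[k];
                uint32_t expected = md5_f_form1(b, c, d);
                if (md5_f_form2(b, c, d) != expected) return 0;
                if (md5_f_form3(b, c, d) != expected) return 0;
            }
        }
    }
    return 1;
}
```

This only shows that the forms are interchangeable in terms of results; it says nothing about which one the compiler turns into faster ISA code.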
12-16-2010, 04:21 PM
Just for completeness, other possible options similar to #1 are:
4) (b & c) ^ ((~b) & d)
5) (b & c) + ((~b) & d)
Results should be the same as #1, but who knows, let's try it :-D
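The reason 4) and 5) should give the same result as 1) is that (b & c) and ((~b) & d) can never have the same bit set: the first term only has bits where b is 1, the second only where b is 0. With no overlapping bits, OR, XOR and ADD all produce the same value (the addition never carries). A rough C sketch that brute-forces this over all 8-bit inputs; count_combine_mismatches is a made-up name for this illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Counts the (b, c, d) triples of 8-bit values for which combining
 * (b & c) and (~b & d) with |, ^ and + does not give identical results,
 * or for which the two terms share a set bit. */
unsigned long count_combine_mismatches(void) {
    unsigned long mismatches = 0;
    for (unsigned b = 0; b < 256; ++b) {
        for (unsigned c = 0; c < 256; ++c) {
            for (unsigned d = 0; d < 256; ++d) {
                uint8_t lhs = (uint8_t)(b & c);   /* bits only where b is 1 */
                uint8_t rhs = (uint8_t)(~b & d);  /* bits only where b is 0 */
                uint8_t with_or  = (uint8_t)(lhs | rhs);
                uint8_t with_xor = (uint8_t)(lhs ^ rhs);
                uint8_t with_add = (uint8_t)(lhs + rhs);
                if ((lhs & rhs) != 0 || with_or != with_xor || with_or != with_add) {
                    ++mismatches;
                }
            }
        }
    }
    return mismatches;
}
```

The same argument holds for 32-bit words, since it only relies on the two terms being bit-disjoint.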
12-17-2010, 10:02 AM
Tested yesterday, both are the same as 1) on GPU and on CPU.
I find it a bit strange that 1), 4) and 5) are slower than 3) though. Option 3) has one bitwise operation fewer, but I would expect it to be worse because the dependency chain is longer: b&(c^d) has to wait for c^d, and the final XOR with d has to wait for b&(c^d), while in 1), 4) and 5) the terms (b&c) and ((~b)&d) can be processed in parallel. What is stranger, both behaved the same on the CPU, even though more dependencies should cause pipeline bubbles. But then, I interlace several MD5 operations and so keep the pipeline busy, which is probably why I see no difference. I may try to test that without interlacing, but that would require rewriting a lot of stuff just for the test.
I don't interlace MD5 in my GPU code though, so that behavior seems a bit irrational. Then again, I am very far from knowing the ATI GPU internals; this probably has some simple explanation.
P.S. Another weird thing is that going from uint4s to uint8s gave a strong performance boost in the GPU code, about 20%. I don't think better VLIW5 utilization can fully explain that; it's just another ATI GPU paradox I cannot explain.
12-17-2010, 03:17 PM
(12-17-2010, 10:02 AM)gat3way Wrote: [ -> ]I find it a bit strange that 1), 4) and 5) are slower than 3) though. You have one bitwise operation less, however it should be worse as there is one more instruction dependency (b depends on c^d, then d depends on b&(c^d) while in 1),4) and 5) (b&c) and ((~b)&d) can be processed in parallel). What is more strange, both behaved the same on CPU, even though more dependencies would cause a pipeline bubble.
CPU version: If you want to use variable "b" later, you can't overwrite it, so you must make a copy...
So the first case will translate into something like this:
Code:
movdqa tmp1, b    ; tmp1 = b
movdqa tmp2, b    ; tmp2 = b
pand tmp1, c      ; tmp1 = b & c
pandn tmp2, d     ; tmp2 = (~b) & d
por tmp1, tmp2    ; tmp1 = (b & c) | ((~b) & d)
Option 3 will produce a longer dependency chain, but one instruction fewer:
Code:
movdqa tmp, d     ; tmp = d
pxor tmp, c       ; tmp = c ^ d
pand tmp, b       ; tmp = b & (c ^ d)
pxor tmp, d       ; tmp = d ^ (b & (c ^ d))
On the GPU there is a different problem: there is no single AND-NOT instruction (the equivalent of pandn), so you must also compute the bitwise NOT separately.
But maybe I'm wrong, please correct me, don't have much time now...
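For anyone who wants to try the two CPU sequences from C rather than assembly, they map roughly onto SSE2 intrinsics as sketched below. This is only an illustration: the names f_form1_sse2, f_form3_sse2 and sse2_forms_agree are invented here, the compiler is free to schedule or rewrite the instructions differently, and on targets without SSE2 the check falls back to plain scalar code.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2_SKETCH 1
#endif

#ifdef HAVE_SSE2_SKETCH
/* Form 1: pand + pandn + por. The two AND terms are independent. */
static __m128i f_form1_sse2(__m128i b, __m128i c, __m128i d) {
    __m128i t1 = _mm_and_si128(b, c);     /* pand:  b & c    */
    __m128i t2 = _mm_andnot_si128(b, d);  /* pandn: (~b) & d */
    return _mm_or_si128(t1, t2);          /* por             */
}

/* Form 3: pxor + pand + pxor. Each step needs the previous result. */
static __m128i f_form3_sse2(__m128i b, __m128i c, __m128i d) {
    __m128i t = _mm_xor_si128(d, c);      /* c ^ d             */
    t = _mm_and_si128(t, b);              /* b & (c ^ d)       */
    return _mm_xor_si128(t, d);           /* d ^ (b & (c ^ d)) */
}
#endif

/* Returns 1 if both forms match the scalar reference for the given
 * inputs (broadcast to all four 32-bit lanes when SSE2 is available). */
int sse2_forms_agree(uint32_t b, uint32_t c, uint32_t d) {
    uint32_t expected = (b & c) | (~b & d);
#ifdef HAVE_SSE2_SKETCH
    __m128i vb = _mm_set1_epi32((int)b);
    __m128i vc = _mm_set1_epi32((int)c);
    __m128i vd = _mm_set1_epi32((int)d);
    uint32_t r1[4], r3[4];
    _mm_storeu_si128((__m128i *)r1, f_form1_sse2(vb, vc, vd));
    _mm_storeu_si128((__m128i *)r3, f_form3_sse2(vb, vc, vd));
    for (int i = 0; i < 4; ++i) {
        if (r1[i] != expected || r3[i] != expected) return 0;
    }
    return 1;
#else
    /* Scalar fallback so the sketch still builds without SSE2. */
    return (d ^ (b & (c ^ d))) == expected;
#endif
}
```

Note that _mm_andnot_si128(a, b) computes (~a) & b, which is what lets form 1 avoid a separate NOT on the CPU.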