Over the last few days, I tried to get JCuda running on my Windows 7 system [Home Premium, 64-bit] with an Intel Core Quad CPU.

I installed the CUDA Toolkit, installed both the 32-bit and the 64-bit versions of JCuda’s DLLs/JARs, and downloaded the JCublas sample application.

Compilation went fine, but the app terminated as soon as it tried to load JCuda’s interface DLLs (which are called via JNI). The VM printed the more or less useless message: ‘JCudaRuntime-windows-x86.dll isn’t a win32-application’.

After various attempts to make the app load the DLLs, I posted a question on the JCuda forum. Shortly afterwards, Marco13 (who, I suppose, is the owner of the JCuda project) pointed me in the right direction:

I installed the 64-bit JDK for Windows, passed -D=64 to the VM, and the JCublas sample application ran to completion. Thanks, Marco!
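
In hindsight, the error above looks like a mismatch between the bitness of the VM and that of the native DLL. A quick way to check which kind of VM you are actually running is to print its data model. The following is only a minimal sketch: the class name JvmBitness and its detect method are made up for this post, and the property sun.arch.data.model is specific to HotSpot-style VMs, so the code falls back to the standard os.arch property when it is missing.

```java
import java.util.Locale;

// Sketch (hypothetical helper, not part of JCuda): guess whether the
// running JVM is a 32-bit or a 64-bit one.
final class JvmBitness {

    /** Returns 32 or 64, or -1 if the bitness cannot be determined. */
    static int detect(String dataModel, String osArch) {
        // "sun.arch.data.model" is VM-specific and may be null elsewhere.
        if ("32".equals(dataModel) || "64".equals(dataModel)) {
            return Integer.parseInt(dataModel);
        }
        if (osArch == null) {
            return -1;
        }
        // Fallback heuristic based on the standard "os.arch" property,
        // e.g. "amd64" or "x86_64" for 64-bit, "x86" for 32-bit VMs.
        String arch = osArch.toLowerCase(Locale.ROOT);
        if (arch.contains("64")) {
            return 64;
        }
        if (arch.equals("x86") || arch.matches("i[3-6]86")) {
            return 32;
        }
        return -1;
    }

    public static void main(String[] args) {
        int bits = detect(System.getProperty("sun.arch.data.model"),
                          System.getProperty("os.arch"));
        System.out.println("JVM data model: "
                + (bits < 0 ? "unknown" : bits + "-bit"));
    }
}
```

If this reports a 32-bit VM on a 64-bit Windows, the VM will not be able to load 64-bit native libraries, and vice versa.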

Although the Nvidia GeForce 210 in my PC is not a particularly fast card, the speedup was enormous: a JCuda-based multiplication of 2500 x 2500 matrices finished 40 times faster than the equivalent pure Java code [which used all 4 cores, too]. Not bad ;-)
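
For illustration, a multi-core pure Java baseline for such a comparison could look roughly like this. It is a sketch under assumptions, not the code I actually measured: the class name CpuMatMul, the flat row-major float arrays and the use of parallel streams (which need Java 8 or later) are all choices made for this example.

```java
import java.util.stream.IntStream;

// Sketch of a multi-threaded CPU matrix multiplication (hypothetical
// baseline). Matrices are square, n x n, stored row-major in flat arrays.
final class CpuMatMul {

    /** Computes C = A * B and returns C as a new array. */
    static float[] multiply(float[] a, float[] b, int n) {
        float[] c = new float[n * n];
        // Each row of C is computed by exactly one task, so the
        // parallel writes never touch the same element.
        IntStream.range(0, n).parallel().forEach(i -> {
            for (int k = 0; k < n; k++) {
                float aik = a[i * n + k];
                // i-k-j loop order walks B and C row by row,
                // which is friendlier to the CPU cache.
                for (int j = 0; j < n; j++) {
                    c[i * n + j] += aik * b[k * n + j];
                }
            }
        });
        return c;
    }

    public static void main(String[] args) {
        int n = 512; // kept small here; the post used 2500
        float[] a = new float[n * n];
        float[] b = new float[n * n];
        for (int i = 0; i < n * n; i++) {
            a[i] = (float) Math.random();
            b[i] = (float) Math.random();
        }
        long start = System.nanoTime();
        multiply(a, b, n);
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("CPU multiply of " + n + " x " + n
                + " took " + millis + " ms");
    }
}
```

Timings from a single run like this are only a rough indication, since the JIT compiler needs some warm-up before the loop runs at full speed.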

An important aspect to note: JCuda comes with JCublas [linear algebra], JCufft [fast Fourier transforms] and JCudpp [data-parallel primitives]. These libraries provide access to ‘pre-configured’ high-level CUDA routines. E.g., a matrix multiplication is basically a single call to a routine provided by JCublas.

This mode of operation is much simpler than writing a so-called CUDA kernel, a piece of C code with a few syntax extensions. On the other hand, CUDA kernels are more versatile.