Hi,
I have a great deal of experience working with gpu and multiprocess computations.
What algorithm are you using?
I imagine we will write a cuda kernel for you, that will do the computations on a gpu. If you need to run it from python, we could also make python bindings.
There are things that need to be discussed like, does your clustering need to be run on multiple machines? Which gpus are you going to use? (if I know the specifics I can make a lot of optimizations)
Hope to work with you!