Driven by market demand for high-definition 3D graphics, commodity graphics processing units (GPUs) have evolved into highly parallel, multi-threaded, many-core processors that are well suited to data-parallel computing. Many applications have been ported to run on a single GPU with tremendous speedups using C-style programming languages such as CUDA. However, large applications require multiple GPUs and demand explicit message passing. The HPC group has developed GMH (GPU Message Handler), a message-passing toolkit for NVIDIA GPUs. The toolkit uses a data-parallel thread group to map multiple GPUs on a single host to an MPI rank, and introduces the notion of virtual GPUs to bind a thread to a GPU automatically. It provides high-performance MPI-style point-to-point and collective communication and, more importantly, offers event-driven APIs that allow an application to be managed and executed by the toolkit at runtime.