The function is very small: it just adds two 1024-dimensional vectors into a third vector. It ran twice in my program, sequentially at first and then in many threads on the GPU, so I could compare the processing time between the two.
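To give an idea of scale, here is roughly what such a function looks like on both sides. This is a sketch rather than the tutorial's exact code, and the names (vectorAdd, vectorAddCpu) are mine:

```cuda
#include <cuda_runtime.h>

// GPU kernel: each thread adds one element of a and b into c.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// The sequential version used for the CPU run.
void vectorAddCpu(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```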
As it turns out, the parallel version was a little better, but not as good as I expected: only 6.5 times faster than the sequential one. So I did some very simple profiling, timing each CUDA function. What surprised me most was that cudaMalloc took up most of the processing time, almost 70%. That means 70% of the time was not spent on calculating at all.
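The profiling was nothing fancier than a host-side stopwatch around each call, something like this (variable names are mine; the same pattern goes around cudaMemcpy and the kernel launch):

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    float *d_buf = nullptr;
    size_t bytes = 1024 * sizeof(float);

    // Time a single cudaMalloc with the host clock. The very first CUDA
    // call in a process also does one-time runtime setup, which is part
    // of why it shows up as so expensive.
    auto t0 = std::chrono::high_resolution_clock::now();
    cudaMalloc(&d_buf, bytes);
    auto t1 = std::chrono::high_resolution_clock::now();

    printf("cudaMalloc: %.3f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count());

    cudaFree(d_buf);
    return 0;
}
```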
I guessed that, like a cache, GPU memory also needs a warm-up. So I ran the function in parallel once more, right after the second run. Bang! It ran 22.33 times faster than the sequential one, and the cudaMalloc time was very close to zero.
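Continuing the sketch above in the same file (the vectorAdd kernel is assumed to be defined there), the experiment boils down to calling the same parallel pass twice and timing each call; only the second call is free of the setup cost. runParallelAdd is an illustrative helper, not my exact code:

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One full parallel pass: allocate, copy in, launch, copy out, free.
void runParallelAdd(const float *a, const float *b, float *c, int n)
{
    float *d_a, *d_b, *d_c;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&d_a, bytes);                               // allocate device memory
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);     // copy inputs to the GPU
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);
    vectorAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n); // one thread per element
    cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);     // copy the result back
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}

int main()
{
    const int n = 1024;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);

    auto time_call = [&] {
        auto t0 = std::chrono::high_resolution_clock::now();
        runParallelAdd(a.data(), b.data(), c.data(), n);
        cudaDeviceSynchronize();                           // wait for the GPU before stopping the clock
        auto t1 = std::chrono::high_resolution_clock::now();
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
    };

    printf("first parallel run:  %.3f ms\n", time_call()); // pays the one-time setup cost
    printf("second parallel run: %.3f ms\n", time_call()); // the warmed-up run
    return 0;
}
```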
Actually, I didn't write my program from scratch; I modified the source code from a YouTube tutorial. Below is the CUDA workflow, and cudaMalloc is responsible for step 1. What my program actually did was warm up the GPU memory on the first parallel run, so the next run got all the benefit of GPU multi-threading.