CPU and GPU: Two Different Beasts — Don’t Mix Them Up
Unfortunately, many explanations and books about programming GPUs (NVIDIA, AMD) carry over techniques that only make sense on CPUs, such as spinlocks, mutexes, traditional locks, and thread sleep/wake-up mechanisms, all of which are totally at odds with the way GPUs should be programmed.
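To make this concrete before we get to the why, here is a minimal CUDA sketch (the kernel names are my own, purely for illustration) of what happens when you drag a CPU-style spinlock onto a GPU, next to the lock-free atomic you should reach for instead. On hardware without independent thread scheduling (pre-Volta NVIDIA chips), the spinlock version can flat-out deadlock, because the lane that wins the lock cannot advance to the unlock while its warp-mates keep replaying the spin loop; even where it does not deadlock, it crawls.

```cuda
// CPU-style spinlock dragged onto the GPU (don't do this).
// Every lane in a warp fights for the same lock while executing in lockstep;
// on pre-Volta hardware the winning lane may never reach the release below,
// because the warp keeps re-executing the spin loop for the losing lanes.
__global__ void increment_with_spinlock(int *lock, int *counter) {
    while (atomicCAS(lock, 0, 1) != 0) { /* spin */ }
    *counter += 1;           // "critical section"
    atomicExch(lock, 0);     // release
}

// GPU-style alternative: skip the lock entirely and let the hardware
// atomic do the serialization for you.
__global__ void increment_with_atomic(int *counter) {
    atomicAdd(counter, 1);
}
```

The second kernel is not just safer; it maps to a single hardware atomic per thread instead of a spin loop the whole warp has to keep replaying.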
The reason these CPU habits backfire, as many tech-bros around here know, lies in how CPU and GPU architectures differ. A CPU is built around time-slicing and the MIMD model (Multiple Instruction, Multiple Data), while a GPU organizes its super-stellar execution style, SIMD (Single Instruction, Multiple Data), into warps of 32 threads (NVIDIA) or wavefronts of 32 or 64 threads (AMD) that march through instructions in lockstep.
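If you want to see that lockstep with your own eyes, here is a tiny CUDA sketch (again, the kernel name is my own invention) built around a warp vote: __ballot_sync gathers a predicate bit from every lane of a warp in one warp-wide step, something that only works because all 32 lanes are issuing the same instruction at the same time.

```cuda
#include <cstdio>

// One warp = 32 lanes issuing the same instruction together.
// __ballot_sync collects one bit per lane of the warp in a single step;
// lane 0 then prints the resulting mask for its warp.
// Launch with a block size that is a multiple of 32, e.g.
//   warp_vote_demo<<<1, 64>>>(d_data);
__global__ void warp_vote_demo(const int *data) {
    int lane = threadIdx.x % warpSize;
    unsigned mask = __ballot_sync(0xffffffffu, data[threadIdx.x] > 0);
    if (lane == 0)
        printf("warp %d positive-lane mask: 0x%08x\n",
               threadIdx.x / warpSize, mask);
}
```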
From an easy abstract-algebra point of view, the reason CPU and GPU code screw each other up is that there is no perfect mapping (no clean set of morphisms) from CPU code into GPU code. Why? Take the easy category-theory explanation: in MIMD, a CPU deals with objects (processing units) and arrows (operations) acting on different data streams. Each processing unit (object) can execute different operations (arrows) on independent data streams. So instead of a massive number of threads (GPU), you have a massive number of independent categories. From this mathematical perspective, a CPU is nothing more than a lot of independent operations, each with its own composition of morphisms, generating a vast number of independent execution paths (and saying ‘a lot’ is still an understatement).
In contrast, a GPU with its SIMD involves only a single operation (an arrow, in category terms) applied simultaneously across multiple data elements (the objects). This can be seen as a single category where the same morphism is applied to many objects in parallel (for the math-savvy tech-bros: in reality we have something like an endomorphism operating within the same category, a bit like applying one and the same function to differently typed values in traditional statically checked programming).
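To put the ‘one morphism, many objects’ picture in plain CUDA terms, here is the classic SAXPY kernel (a standard textbook example, nothing exotic): one instruction stream, and the only thing that differs from thread to thread is which element of the arrays it touches.

```cuda
// y[i] = a * x[i] + y[i] for every i: a single operation (the "morphism")
// applied across all data elements (the "objects") in parallel.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // guard against the ragged last block
        y[i] = a * x[i] + y[i];
}

// Typical launch: one thread per element, e.g.
//   saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
```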
OK, now, after this brief excursion into the realm of category theory, we are ready to understand the core of the problem: why the parallelism you practice on a CPU sucks when transferred to a GPU.
MIMD architectures support complex, branching workloads where each processor can follow its own execution path. Translating this to SIMD requires flattening all those diverse paths into a single, unified instruction stream (kinda a catamorphism from the general morphisms of MIMD down to the endomorphism within SIMD, and not a trivial task, I tell you). The morphisms in MIMD let processors operate independently, without waiting for one another; SIMD demands data and operation synchronization across all processing elements, and that is exactly where the complexity of squeezing independent MIMD work into a SIMD framework comes from. On real hardware this shows up as warp divergence: when lanes of the same warp disagree on a branch, the hardware executes each side in turn with the other lanes masked off. And thus, my tech-bros, the flexibility and independence we enjoy in MIMD is lost in SIMD.
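Here is what that path-flattening looks like in a toy CUDA sketch (kernel names are mine, and a modern compiler will often predicate a branch this small on its own, so treat it as an illustration, not a benchmark): let lanes of one warp take different branches and the warp runs both sides back to back; express the same decision as data and every lane stays on one instruction stream.

```cuda
// Divergent version: lanes of the same warp may take different paths,
// so the warp executes the 'then' side and the 'else' side one after
// the other, with the non-participating lanes masked off.
__global__ void relu_scale_divergent(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] > 0.0f)
        out[i] = 2.0f * in[i];
    else
        out[i] = 0.0f;
}

// Converged version: the same decision expressed as a select, so every
// lane executes the identical instruction stream with no serialization.
__global__ void relu_scale_converged(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = in[i];
    out[i] = (x > 0.0f) ? 2.0f * x : 0.0f;
}
```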
So, next time, don’t mix them up, and you will be rewarded with a performance gain of several orders of magnitude.



