Because of shared memory, it’s awkward to process data bigger than a few K in size right now, so even 128K per SM would be very comfortable headroom.

Of course what we want is not just more control over the cache, but also to make it BIG… even if that means it’s off-die and higher latency.

It’d be even better to be able to invalidate just certain ranges of the texture cache, but even an all-or-nothing option is useful. NV could expose this with a simple, crude `_flushcache()` kind of opcode, meaning “assume all the texture cache for this SM is dirty; reload things as you re-read them.” So you might create a small precompute table in global memory, call `_flushcache()`, then repeatedly use that table in your block (there’s a sketch of this pattern after these comments). Yes, other SMs would not be able to read that data reliably, but that’s fine: if it’s only used locally, it doesn’t matter.

To be honest, exposing “write to texture” would STILL be useful, especially for local-memory-like effects. It’s really likely the GPU hardware would have no problem with you writing to a texture; it’s just that you’d get undefined results when reading those values, unless you’d never read them before. This is also why CUDA only lets you define textures between kernels (there’s no such thing as creating a texture now and declaring it “cacheable” later). The texture caches are coherent only because they’re read-only! That means cores could never see different values anyway.

Honestly, a shared, coherent cache would be tricky to implement but not so bad on performance (i.e. without any of that texture resource bs). It could even allow writes, dangerous as they’d be. But yeah, NVIDIA can certainly expand it and make it easier to use.

Then you’re really back to the texture cache. I guess you can get around that by supporting true “global” access only on non-cached portions. The current global mem is cache-coherent, and it also supports atomics (a tiny example follows below).

Nope, Larrabee isn’t very superscalar, but it DOES have two execution pipes: one runs the full x86 stream and can do anything; the other handles just the common math and memory ops. For hyperthreading you have to store each thread’s registers natively (i.e. not on a stack), and that takes transistors. This makes sense, because the x86 compute model has quite a few fixed registers plus the new fat SSE-like math registers.

Does it even have warps? Did they make it superscalar, too? Garbaaage lol

What what wahaaat? Larrabee isn’t massively hyper-threaded? That’s the whole fn point of the architecture! The 21st-century demon of latency, slain by time-sliced parallelism.
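A minimal CUDA sketch of the `_flushcache()` pattern described above, under stated assumptions: the texture-reference API (`texture<>`, `tex1Dfetch`, `cudaBindTexture`) is real, era-appropriate CUDA, while `__flushcache()` is NOT a real intrinsic — it stands in for the hypothetical “mark this SM’s texture cache dirty” opcode. Without it, the texture read is exactly the undefined read-after-write case the comment warns about:

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Read-only texture-cache view of the table (classic texture-reference API).
texture<float, 1, cudaReadModeElementType> tableTex;

// Placeholder for whatever the precompute actually is.
__device__ float expensive(int i) { return sinf(i * 0.1f); }

__global__ void useTable(float *table, const int *in, float *out, int n)
{
    // Each block (re)builds the small table in plain global memory.
    // Every block writes identical values, so the cross-block race is benign.
    if (threadIdx.x < 256)
        table[threadIdx.x] = expensive(threadIdx.x);
    __syncthreads();

    // The proposed opcode: mark this SM's texture cache dirty so the writes
    // above get re-fetched. HYPOTHETICAL -- no such intrinsic exists.
    // __flushcache();

    // Now hammer the table through the texture cache. Without the flush,
    // these reads may return stale (undefined) values.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(tableTex, in[i] & 255);
}

// Host side, before launch:
//   float *dTable;  cudaMalloc(&dTable, 256 * sizeof(float));
//   cudaBindTexture(NULL, tableTex, dTable, 256 * sizeof(float));
//   useTable<<<gridSize, 256>>>(dTable, dIn, dOut, n);
```

Note that `__syncthreads()` only orders the writes within one block, which is exactly the “locally used only” caveat: other SMs still couldn’t rely on reading the table through their own texture caches.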
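The global-memory atomics, by contrast, are real shipping API (32-bit global atomics since compute capability 1.1). A minimal histogram, with many blocks across many SMs safely bumping the same global counters:

```cuda
__global__ void hist256(const unsigned char *data, int n, unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);  // atomic read-modify-write in global memory
}
```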