Cooling a Tesla P40 in the living room
Nvidia’s Tesla P40 has become a pretty popular choice for running large AI models at home.
That is mostly because they are quite cheap, not because they are that great. I bought mine for less than 200€ not too long ago, though as of early 2025 the GPU market is once again a complete dumpster fire and so they’re closer to 500€ on eBay right now, plus taxes. I would never have imagined any of my hardware appreciating in price, but here we are I guess. Nvidia is obviously busy making the most of their monopoly.
But I digress. The P40 is popular because its 24GB of VRAM is just nowhere to be found in any comparably priced GPU, and the card’s inference performance is reasonable. It can’t compete with 3090s and up, but it’ll leave any CPU in the dust, including Apple Silicon. For example, using Gemma 3 27B quantized to 4 bits, I get over 120 tokens/sec during prompt processing, and 12t/s during inference. And there’s plenty of VRAM to spare for context and the vision adapter.
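(For reference, numbers like these can be measured with llama.cpp's bundled llama-bench tool. This is a sketch, not my exact invocation - the model filename is a placeholder, and `-ngl 99` assumes the whole model fits in VRAM:)

```shell
# Hypothetical invocation; point -m at your own quantized GGUF file.
# -p 512: benchmark prompt processing with a 512-token prompt
# -n 128: benchmark text generation over 128 tokens
# -ngl 99: offload all layers to the GPU
./llama-bench -m gemma-3-27b-q4_0.gguf -p 512 -n 128 -ngl 99
```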
It’s basically a 1080 Ti with 24GB of VRAM. People have even used it for gaming successfully, despite the card having no video outputs.
There is only one catch. It’s a datacenter GPU with a passive heat sink, designed for server racks that provide airflow. To run it at home, you will need to provide this airflow yourself, and it isn’t that straightforward. Most desktop PC fans are not only way too large, but also lack the required static pressure - the card’s heatsinks produce formidable airflow resistance, more than desktop fans are designed for, and the air just escapes some other way.
You need either a radial fan, or small 40mm fans with high static pressure (i.e. designed for servers). Invariably, you end up with a rather loud solution. That is a problem if your machine is going to be running 24/7, and you don’t have a room to spare for it. Another problem is you’ll need a fan duct, or shroud, to channel the air from the fan to the heatsinks.
For example, this is what I initially used - two Noctua 40x20mm fans, the quietest fans of that size that I’m aware of, mounted to a 3D printed shroud. (By the way, don’t use PLA for this part, and don’t ask how I found out.)
This is quite a typical solution for DIY cooling these cards. With the fans running behind two Noctua low-noise adapters, it actually was quiet enough for the living room, but it wasn’t cool enough for the card. It would throttle down to about 3.6t/s once it warmed up, sitting at a rather toasty 93°C. Obviously not a long-term solution.
In my desperation, I wound up messing around with a side-mounted 120mm fan, which against my expectations immediately made quite a difference. However, it blew straight onto the card’s plastic cover, which isn’t exactly known to be a great conductor of heat. Hence the idea of deshrouding the card.
It’s a very simple job - just a few tiny Torx screws to remove - but the cover does serve an important purpose: by themselves, the fins don’t form a sealed duct for the air to pass through. I am particularly proud of my elegant solution to this issue:
Heat-proof duct tape isn’t pretty, but it has proven extremely effective so far.
And it wasn’t just good for sealing the fins. Once the plastic cover was removed, you could see that my particular fan shroud left quite a gap to the fins. The clearance between the 3D printed shroud and the card’s plastic cover wasn’t perfect either, leaving plenty of opportunity for the static pressure to escape.
The fix was obvious.
With a (mostly) airtight tunnel through the fins set up, the extra 120mm fan is now directly cooling the heatsink.
And it’s pretty effective. Whereas I’d see throttling to 3.6t/s previously, now it’s good for a consistent 10t/s. It’s fast enough to be usable, AND quiet enough for my living room. The holy grail of P40 cooling.
To be clear, under constant heavy load, it still runs hot and throttles (nvidia-smi will show the card at 90°C, power limiting to about 130W). But the 120mm fan helps cool it down much faster, and during typical LLM usage, the card gets plenty of breaks from inference. Combine those two effects, and I don’t think I’ve seen it throttle during typical usage.
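(If you want to keep an eye on the card without staring at nvidia-smi, a small polling script does the job. The `--query-gpu` flags below are real nvidia-smi options; everything else is just a minimal sketch:)

```python
import subprocess
import time

def parse_gpu_stats(csv_line):
    """Parse one nvidia-smi CSV line ("<temp>, <power>") into (temp_c, power_w)."""
    temp, power = (field.strip() for field in csv_line.split(","))
    return int(temp), float(power)

def poll(interval=5):
    """Print GPU temperature and power draw every `interval` seconds."""
    while True:
        out = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=temperature.gpu,power.draw",
             "--format=csv,noheader,nounits"],
            text=True,
        )
        # One line per GPU; this assumes the P40 is GPU 0.
        temp, power = parse_gpu_stats(out.strip().splitlines()[0])
        print(f"{temp}°C  {power:.0f}W")
        time.sleep(interval)

# poll()  # uncomment to start watching the card
```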
P.S.: Planning to pick up blogging again! Got some very fun projects to cover, coming up.