At zero temp there is still non-determinism: floating-point addition is not associative, so the order in which parallel hardware accumulates partial sums changes the result, and you will get varying outputs from run to run.
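For example, here's a quick Python sketch (arbitrary made-up values, but the effect is general): summing the same floats in a different order, the way a parallel reduction might, gives a slightly different total.

    import random

    # The same values summed in two different orders.
    random.seed(0)
    values = [random.uniform(-1e6, 1e6) for _ in range(100_000)]
    shuffled = random.sample(values, len(values))

    a = sum(values)
    b = sum(shuffled)

    print(a == b)  # usually False
    print(a - b)   # a tiny but nonzero difference

A GPU effectively does the shuffled version: the order in which partial sums meet depends on scheduling, so the rounding error differs from run to run.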
(Disclaimer: I know literally nothing about LLMs.) Wouldn't there still be issues of sensitivity, though? Like, wouldn't you still have to ensure that the wording of your commands stays exactly the same every time? And with models that operate on less discrete input (e.g. ChatGPT's new "Advanced Voice Mode", which works on audio directly), this seems even harder.
Not really, not in practice. The order of execution is non-deterministic when running on a cluster, on a GPU, or on more than one CPU core, and rounding errors propagate differently on each run.
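And at temperature zero those rounding differences can still flip the output: greedy decoding takes the argmax over the logits, so two near-tied logits can swap winners depending on how the error fell, and the divergence then compounds token by token. A toy sketch with made-up logit values:

    # Two runs whose logits differ only by accumulated rounding error.
    logits_run_1 = [2.3000001, 2.3000000, 1.1]
    logits_run_2 = [2.3000000, 2.3000001, 1.1]

    pick_1 = max(range(3), key=lambda i: logits_run_1[i])
    pick_2 = max(range(3), key=lambda i: logits_run_2[i])

    print(pick_1, pick_2)  # 0 vs 1: a different token, and everything after diverges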