The suggestion that any large-scale AI model research today isn’t ingesting output of its predecessors is laughable.
Even if they didn’t directly, intentionally use o1 output (and they didn’t claim they didn’t, so far as I know), AI slop is everywhere. We passed peak original content years ago. Everything is tainted and everything should be understood in that context.
Draining the battery from 100% to 0% in under an hour (as other commenters have mentioned) is the opposite of this, as that will necessarily dump a lot of heat.
No, it's not actually draining the battery faster. It's setting low-level limits on the maximum and minimum charge levels (and also on the charge rate, which further limits heat).
Some folks have hooked up devices between the phone and the charge cable that do the coulomb counting (or similar), and in the one case I saw, the phone was only taking in about 40% of the watt-hours that it did before the OTA update.
From what I've seen, no one can point to any actual incidents of these phones getting dangerously hot. But the Google FAQ did have a "Yes, you can still take the Pixel 4a on flights" entry, reminiscent of the Samsung phone issue several years ago...
Libra was shuttered due to feds saying “we have no legal reason to stop you but we still wish you to stop” and Zuckerberg (after some things we are not privy to) agreeing. There was a submission on this from an insider sometime in the past year.
CPU-only is, very unfortunately, infeasible for reasoning models. This setup would be great for DeepSeek V3 or (more fittingly) the 405B Llama 3.1 model, but 6-7 tokens per second on a reasoning model puts you well into seconds-per-token territory if you count only the final answer, since most of the generated tokens are spent on the reasoning trace.
(You don’t have to take it from me: if CPU were good enough, AMD’s valuation would be 100x its current value.)
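To make that concrete, here's the back-of-the-envelope version; the token counts below are assumptions for illustration, not measurements of any particular model:

    // Rough illustration of why ~6.5 tok/s feels like seconds-per-token on a
    // reasoning model: most generated tokens go to the hidden reasoning trace,
    // not the answer you actually read.
    fn main() {
        let decode_speed_tps = 6.5; // assumed raw decode speed, tokens/second
        let reasoning_tokens = 4000.0; // assumed hidden chain-of-thought length
        let answer_tokens = 300.0; // assumed visible answer length

        let total_seconds = (reasoning_tokens + answer_tokens) / decode_speed_tps;
        let effective_tps = answer_tokens / total_seconds;

        println!("total generation time: {:.0} s", total_seconds); // ~662 s
        println!(
            "effective speed for the final answer: {:.2} tok/s ({:.1} s per answer token)",
            effective_tps,
            1.0 / effective_tps
        ); // ~0.45 tok/s, ~2.2 s per answer token
    }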
Given what we just saw in terms of the DeepSeek team squeezing a lot of extra performance out of a more efficient implementation on GPU, and the model still being optimized for GPU rather than CPU - is it unreasonable to think that in the $6k setup described, some performance might still be left on the table that could be squeezed out with better optimization for these particular CPUs?
The TLDR is that llama.cpp’s NUMA support is suboptimal, which is hurting performance versus what it should be on this machine. A single socket version likely would perform better until it is fixed. After it is fixed, a dual socket machine would likely run at the same speed as a single socket machine.
If someone implemented a GEMV that scales with NUMA nodes (i.e. PBLAS, but for the data types used in inference), it might be possible to get higher performance from a dual socket machine than we get from a single socket machine.
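A very rough sketch of the idea, not llama.cpp or PBLAS code: split the rows of the weight matrix so each socket streams only memory that is local to it. Thread-to-NUMA-node pinning and node-local allocation are assumed to be handled elsewhere (e.g. via libnuma/numactl) and are omitted here:

    use std::thread;

    /// Minimal row-partitioned GEMV: y = A * x, with A stored row-major and
    /// split into per-node partitions. Ideally each partition is allocated on
    /// (and its worker thread pinned to) its own NUMA node.
    fn gemv_partitioned(partitions: Vec<Vec<f32>>, x: &[f32]) -> Vec<f32> {
        let n = x.len();
        thread::scope(|s| {
            // One worker per partition; each computes its block of output rows.
            let handles: Vec<_> = partitions
                .iter()
                .map(|rows| {
                    s.spawn(move || {
                        rows.chunks(n)
                            .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
                            .collect::<Vec<f32>>()
                    })
                })
                .collect();
            // Concatenate the per-node results in order.
            handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
        })
    }

    fn main() {
        // Two "nodes", each holding 2 rows of a 4x3 matrix (row-major).
        let partitions = vec![
            vec![1.0, 0.0, 0.0, 0.0, 1.0, 0.0],
            vec![0.0, 0.0, 1.0, 1.0, 1.0, 1.0],
        ];
        let x = [1.0, 2.0, 3.0];
        println!("{:?}", gemv_partitioned(partitions, &x)); // [1.0, 2.0, 3.0, 6.0]
    }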
No, because the bottleneck is RAM bandwidth. This is already quantized and otherwise is essentially random so can't be compressed in any meaningful way.
How much bandwidth do we actually need per-token generation? Let's take one open-source model as a starting point since not all models are created the same.
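As a rule of thumb, each generated token has to stream every active parameter through memory once, so required bandwidth ≈ active parameters × bytes per weight × tokens per second. A sketch with assumed numbers (an MoE model with ~37B active parameters per token, roughly the figure cited for DeepSeek-V3, at ~4-bit quantization):

    // Back-of-the-envelope memory-bandwidth estimate for token generation.
    // Rule of thumb: every active parameter is read once per generated token.
    fn main() {
        let active_params = 37e9; // assumed: MoE model, ~37B active params per token
        let bytes_per_weight = 0.5; // assumed: ~4-bit quantization
        let target_tps = 6.5; // target decode speed, tokens per second

        let bytes_per_token = active_params * bytes_per_weight;
        let required_bw_gbs = bytes_per_token * target_tps / 1e9;

        println!("~{:.1} GB read per token", bytes_per_token / 1e9); // ~18.5 GB
        println!("~{:.0} GB/s needed for {target_tps} tok/s", required_bw_gbs); // ~120 GB/s
    }

For a dense model you'd plug in the full parameter count instead, which is why the 405B Llama mentioned above is much more bandwidth-hungry per token.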
It honestly depends on your use case. I often run bigger, slower models on my PC and let them just tootle along in the background grinding out their response while I work on something else.
For people who have a disciplinary background in neural networks and machine learning, I imagine that replicating that paper in some type of framework would be straightforward, right? Or am I mistaken?
The model itself, yes. The changes from previous architectures are often quite small code-wise; quite often it's just adding or changing a few lines in a torch model.
Things like tweaking all the hyperparameters to make the training process actually work may be trickier, though.
With an LLM, the actual 0s and 1s of the model are fairly standard, common, and freely available to anyone that wants to use them. The "source code" for an LLM is the process used to create the outcome, and to an extent, the data used to train with. DeepSeek released a highly detailed paper that describes the process used to create the outcome. People and companies are actively trying to reproduce DeepSeek's work to confirm the findings.
It's more akin to scientific research where everyone is using the same molecules, but depending on the process you put the molecules through, you get a different outcome.
> With an LLM, the actual 0s and 1s of the model are fairly standard, common, and freely available to anyone that wants to use them
How is that different than the 0s and 1s of a program?
Assembly instructions are literally standard. What’s more, if said program uses something like Java, the byte code is even _more_ understandable. So much so that there is an ecosystem of Java decompilers.
Binary files are not the “source” in question when talking about “open source”
There is no way to decompile an LLM's weights and obtain a somewhat meaningful, reproducible source, like with a program binary as you say. In fact, if we were to compare both in this way that would make a program binary more "open source".
Actually, a lot of the hacks that mold uses to be the fastest linker would be, ironically, harder to reproduce in Rust because they're antithetical to its approach. E.g. mold intentionally never frees most of what it allocates in order to speed up execution (it'll all be cleaned up by the OS when the process exits), while Rust's strong RAII conventions would, by default, reintroduce those deallocations and their slowdowns.
You can absolutely introduce free-less allocators and the like, as well as use `ManuallyDrop` or `Box::leak`. Rust just asks that you're explicit about it.
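A tiny sketch of both escape hatches, standard library only (the SymbolTable type is made up for illustration):

    use std::mem::ManuallyDrop;

    struct SymbolTable {
        names: Vec<String>,
    }

    fn main() {
        // Box::leak: allocate once, never free, get a &'static out of it.
        // The OS reclaims the memory when the process exits.
        let table: &'static mut SymbolTable = Box::leak(Box::new(SymbolTable {
            names: vec!["main".to_string(), "memcpy".to_string()],
        }));
        table.names.push("strlen".to_string());

        // ManuallyDrop: keep normal ownership and borrowing, but opt out of the
        // destructor (and thus the deallocation) when the value goes out of scope.
        let scratch = ManuallyDrop::new(vec![0u8; 1 << 20]);
        println!("{} symbols, {} scratch bytes", table.names.len(), scratch.len());
    }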
You could, for example, take advantage of a bump arena allocator in Rust, which would let the linker get away with just one alloc/dealloc. Mold is still using more traditional allocators under the covers, which won't be as fast as a bump allocator. (Nothing would stop mold from doing the same.)
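A sketch of what that looks like with the bumpalo crate (one of several arena crates; the Symbol type here is made up for illustration, and this needs bumpalo as a dependency):

    use bumpalo::Bump;

    // Hypothetical linker-ish data, just for illustration.
    struct Symbol<'arena> {
        name: &'arena str,
        address: u64,
    }

    fn main() {
        // One arena for the whole link; everything allocated into it is released
        // in a single shot when `arena` is dropped (or never, if you leak it).
        let arena = Bump::new();

        let symbols: Vec<_> = (0..3u64)
            .map(|i| {
                // alloc/alloc_str just bump a pointer; no per-object free ever runs.
                &*arena.alloc(Symbol {
                    name: arena.alloc_str(&format!("sym_{i}")),
                    address: 0x1000 + i * 8,
                })
            })
            .collect();

        println!("allocated {} symbols in the arena", symbols.len());
    }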
Traditional allocators are fast if you never introduce much fragmentation with free, though you may still get some gaps and some other overhead, so they won't be quite as fast. But why couldn't you just LD_PRELOAD a malloc for mold that worked as a bump/stack/arena allocator and simply ignored free, if third-party stuff isn't making that many allocations?
Really it's allocators in general. Allocations are perceived as expensive only because they are mostly dependent on the amortized cost of prior deallocations. As an extreme example, even GCs can be fast if you avoid deallocation because most typically have a long-lived object heap that rarely gets collected - so if you keep things around that can be long-lived (pooling) their cost mostly goes away.
Allocation is perceived as slow because it is. Getting memory from the OS is somewhat expensive because pages of memory need to be allocated and accounted for. Getting memory from traditional allocators is expensive because free space needs to be tracked. When you say "I need 5 bytes", the allocator needs to find 5 free bytes to give back to you.
Bump allocators are fast because the operation of "I need 5 bytes" is incrementing the allocation pointer forward by 5 bytes and maybe doing a new page allocation if that's exhausted.
GC allocators are fast because they are generally bump allocators! The only difference is that when exhaustion happens the GC says "I need to run a GC".
Traditional allocators are a bit slower because they are typically something like an arena with skiplists used to find free space. When you free up memory, that skiplist needs to be updated.
But further, unlike bump and GC allocators another fault of traditional allocators is they have a tendency to scatter memory which has a negative impact on CPU cache performance. With the assumption that related memory tends to be allocated at the same time, GCs and bump allocators will colocate memory. But, because of the skiplist, traditional allocators will scattershot allocations to avoid free memory fragmentation.
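A stripped-down illustration of that "I need 5 bytes" difference: in a bump arena the whole allocation is an offset increment, and there is no free list to maintain because there is no free (which is also essentially what the LD_PRELOAD trick mentioned above would give you):

    /// Toy bump allocator: allocation is just bumping an offset.
    /// There is no free; the whole region is released at once when dropped.
    struct BumpArena {
        buf: Vec<u8>,
        offset: usize,
    }

    impl BumpArena {
        fn with_capacity(cap: usize) -> Self {
            BumpArena { buf: vec![0; cap], offset: 0 }
        }

        /// Hands out `n` bytes, or None when the arena is exhausted (a real
        /// allocator would grab another chunk here; a GC would treat this as
        /// the trigger to run a collection).
        fn alloc(&mut self, n: usize) -> Option<&mut [u8]> {
            if self.offset + n > self.buf.len() {
                return None;
            }
            let start = self.offset;
            self.offset += n; // the entire cost of an allocation
            Some(&mut self.buf[start..start + n])
        }
    }

    fn main() {
        let mut arena = BumpArena::with_capacity(1024);
        let a = arena.alloc(5).unwrap();
        a.copy_from_slice(b"hello");
        let b = arena.alloc(16).unwrap();
        b[0] = 42;
        println!("used {} of {} bytes", arena.offset, arena.buf.len());
    }

Note that consecutive allocations also land next to each other in the buffer, which is the cache-locality point above.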
All this said, for most apps this doesn't matter a whole lot. However, if you are doing a CPU/memory intense operation then this is stuff to know.
> Getting memory from traditional allocators is expensive because freespace needs to be tracked.
If you've never deallocated then there is no free space to track, hence the cost of allocation is _mostly_ (per my above comment) affected by the amortized cost of deallocations. The cost becomes extremely apparent with GCs because a failed allocation is usually what triggers a collection (and subsequently any deallocations that need to happen).
Still, your comment goes into detail that I probably should have.
Nothing about Rust requires the use of the heap or RAII.
Also, if wild is indeed faster than mold even without incrementalism, as the benchmarks show, then it seems quite silly to go around making the argument that it's harder to write a fast linker in Rust. It's apparently not that hard.
Not really. You would have to either wrap standard library types in newtypes that use ManuallyDrop or (for some of them) use a custom allocator. And if you want to free some things in one go but not others, that gets much harder, especially when you look at how easy a language like Zig makes it.
And if you intentionally leak everything, it is onerous to get the borrow checker to accept that unless you use a leaked Box for all declarations/allocations, which introduces both friction and performance regressions (due to memory access patterns), because the use of custom allocators doesn't factor into lifetime analysis.
(Spoken as a die-hard Rust dev who still thinks it's a better language than Zig for most everything.)
There’s been a lot of interest in faster linkers spurred by the adoption and popularity of rust.
Even modest statically linked Rust binaries can take a couple of minutes in the link stage of compilation in release mode (using mold). It's not a Rust-specific issue but an amalgam of (usually) strictly static linking, advanced link-time optimizations enabled by LLVM like LTO and BOLT, and a general dissatisfaction with compile times in the Rust community. Rust's strong relationship with (read: dependency on) LLVM makes it the most popular language where LLVM's link-time magic has been most universally adopted; you could face these issues with C++ too, but there they would be chalked up to your toolchain rather than to the language.
I’ve been eyeing wild for some time as I’m excited by the promise of an optimizing incremental linker, but to be frank, see zero incentive to even fiddle with it until it can actually, you know, link incrementally.
C++ can be rather faster to compile than Rust, because some compilers do have incremental compilation and incremental linking.
Additionally, the acceptance of binary libraries across the C and C++ ecosystem means that more often than not, you only need to care about compiling your own application, and not the world, every time you clone a repo or switch development branches.
Imagine if Linus needed a gaming rig to develop Linux...
And he also did not have cargo at his disposal.
No need to point out it is C instead, as they share common roots, including place of birth.
Or how we used to compile C++ between 1986 and the 2000s, mostly on single-core machines, developing games, GUIs, and distributed computing applications with CORBA and DCOM.
I solved this by using Wasm. Your outer application shell calls into Wasm business logic, only the inner logic needs to get recompiled, the outer app shell doesn't even need to restart.
Very similar, but Wasm has additional safety properties and affordances. I am trying to get away from dynamic libs as an app extension mechanism. It is especially nice when application extension is open to end users, they won't be able to crash your application shell.
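A minimal sketch of that split; the function name and numbers are made up, and the host side would use whatever Wasm runtime you prefer (wasmtime, wasmer, etc.):

    // Inner business logic, compiled separately to Wasm, e.g.:
    //   cargo build --target wasm32-unknown-unknown --release
    // (with crate-type = ["cdylib"] in Cargo.toml). The outer application shell
    // loads the resulting .wasm at runtime and can swap in a newly built module
    // without restarting itself. If the module traps, only that call fails; the
    // shell keeps running.

    /// Hypothetical exported entry point for one piece of business logic.
    #[no_mangle]
    pub extern "C" fn apply_discount(price_cents: i64, percent: i64) -> i64 {
        price_cents - (price_cents * percent) / 100
    }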
My problems with Python are startup time and packaging complexity (either dependency hell or a full-blown venv with pipx/uv). I’ve been rewriting shell scripts to either Makefiles (crazy, but it works, is rigorous, and you get free parallelism) or rust “scripts” [0], depending on their nature (number of outputs, number of command executions, etc.)
Also, using a better shell language can be a huge productivity (and maintenance and sanity) boon, making it much less “write once, read never”. Here’s a repo where I have a mix of fish-shell scripts with some converted to rust scripts [1].
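For what it's worth, this is roughly the shape those rust "scripts" take; a std-only illustration (so no embedded manifest needed), not one of the actual scripts from the repo, using the rust-script shebang convention:

    #!/usr/bin/env rust-script
    // A shell-script replacement: run a command, check its status, use the output.
    use std::process::Command;

    fn main() {
        let out = Command::new("git")
            .args(["status", "--porcelain"])
            .output()
            .expect("failed to run git");

        if !out.status.success() {
            eprintln!("git failed: {}", String::from_utf8_lossy(&out.stderr));
            std::process::exit(1);
        }

        let dirty = String::from_utf8_lossy(&out.stdout).lines().count();
        println!("{dirty} modified/untracked paths");
    }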
I've often read that people have a problem with Python's startup time, but that's not at all my experience.
Yes, if you're going to import numpy or pandas or other heavy packages, that can be annoyingly slow.
But we're talking using Python as a bash script alternative here. That means (at least to me) importing things like subprocess, pathlib. In my experience, that doesn't take long to start.
34 milliseconds doesn't seem like a lot of time to me. If you're going to run it in a tight loop then yes, that's going to be annoying, but in interactive use I don't even notice delays as small as that.
As for packaging complexity: when using Python as a bash script alternative, I can mostly get by with only stuff from the standard library. In that case, packaging is trivial. If I do need other packages, then yes, that can be a major nuisance.
Not necessarily. Shell scripts often embody the Unix “do one thing and do it right” principle. To download a file in a bash script you wouldn’t (sanely) source a bash script that implements an HTTP client; you would just shell out to curl or wget. Same for parsing a JSON file/response: you would just depend on and defer to jq. Whereas in Python you could do the same, but you would most likely (and more idiomatically) pull in the imports to do it in Python.
It’s what makes shell scripts so fast and easy for a lot of tasks.