The suggestion that any large-scale AI model research today isn’t ingesting output of its predecessors is laughable.
Even if they didn’t directly, intentionally use o1 output (and they didn’t claim they didn’t, so far as I know), AI slop is everywhere. We passed peak original content years ago. Everything is tainted and everything should be understood in that context.
Draining the battery from 100% to 0% in under an hour (as other commenters have mentioned) is the opposite of this, as that will necessarily dump a lot of heat.
No, it's not actually draining the battery faster. It's setting low-level limits on the maximum and minimum charge levels (and also on the charge rate, which further limits heat).
Some folks have hooked up devices between the phone and the charge cable that do the coulomb counting (or similar), and in the one case I saw, the phone was only taking in about 40% of the watt-hours that it did before the OTA update.
From what I've seen, no one can point to any actual incidents of these phones getting dangerously hot. But the Google FAQ did have a "Yes, you can still take the Pixel 4a on flights" entry, reminiscent of the Samsung phone issue several years ago...
Libra was shuttered due to feds saying “we have no legal reason to stop you but we still wish you to stop” and Zuckerberg (after some things we are not privy to) agreeing. There was a submission on this from an insider sometime in the past year.
CPU-only is, very unfortunately, infeasible for reasoning models. This setup would be great for DeepSeek V3 or (more fittingly) the 405B Llama 3.1 model, but 6-7 tokens per second on a reasoning model puts you well into seconds-per-token territory if you count only the final answer, since most of the generated tokens are spent on the reasoning trace.
(You don’t have to take it from me: if CPU were good enough, AMD’s valuation would be 100x its current value.)
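To make that concrete, here's the back-of-the-envelope version; the token counts below are assumptions for illustration, not measurements of any particular model:

    // Rough illustration of why ~6.5 tok/s feels like seconds-per-token on a
    // reasoning model: most generated tokens go to the hidden reasoning trace,
    // not the answer you actually read.
    fn main() {
        let decode_speed_tps = 6.5; // assumed raw decode speed, tokens/second
        let reasoning_tokens = 4000.0; // assumed hidden chain-of-thought length
        let answer_tokens = 300.0; // assumed visible answer length

        let total_seconds = (reasoning_tokens + answer_tokens) / decode_speed_tps;
        let effective_tps = answer_tokens / total_seconds;

        println!("total generation time: {:.0} s", total_seconds); // ~662 s
        println!(
            "effective speed for the final answer: {:.2} tok/s ({:.1} s per answer token)",
            effective_tps,
            1.0 / effective_tps
        ); // ~0.45 tok/s, ~2.2 s per answer token
    }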
Given what we just saw in terms of the DeepSeek team squeezing a lot of extra performance out of a more efficient implementation on GPU, and the model still being optimized for GPU rather than CPU - is it unreasonable to think that in the $6k setup described, some performance might still be left on the table that could be squeezed out with better optimization for these particular CPUs?
The TLDR is that llama.cpp’s NUMA support is suboptimal, which is hurting performance versus what it should be on this machine. A single socket version likely would perform better until it is fixed. After it is fixed, a dual socket machine would likely run at the same speed as a single socket machine.
If someone implemented a GEMV that scales with NUMA nodes (i.e. PBLAS, but for the data types used in inference), it might be possible to get higher performance from a dual socket machine than we get from a single socket machine.
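A very rough sketch of the idea, not llama.cpp or PBLAS code: split the rows of the weight matrix so each socket streams only memory that is local to it. Thread-to-NUMA-node pinning and node-local allocation are assumed to be handled elsewhere (e.g. via libnuma/numactl) and are omitted here:

    use std::thread;

    /// Minimal row-partitioned GEMV: y = A * x, with A stored row-major and
    /// split into per-node partitions. Ideally each partition is allocated on
    /// (and its worker thread pinned to) its own NUMA node.
    fn gemv_partitioned(partitions: Vec<Vec<f32>>, x: &[f32]) -> Vec<f32> {
        let n = x.len();
        thread::scope(|s| {
            // One worker per partition; each computes its block of output rows.
            let handles: Vec<_> = partitions
                .iter()
                .map(|rows| {
                    s.spawn(move || {
                        rows.chunks(n)
                            .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
                            .collect::<Vec<f32>>()
                    })
                })
                .collect();
            // Concatenate the per-node results in order.
            handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
        })
    }

    fn main() {
        // Two "nodes", each holding 2 rows of a 4x3 matrix (row-major).
        let partitions = vec![
            vec![1.0, 0.0, 0.0, 0.0, 1.0, 0.0],
            vec![0.0, 0.0, 1.0, 1.0, 1.0, 1.0],
        ];
        let x = [1.0, 2.0, 3.0];
        println!("{:?}", gemv_partitioned(partitions, &x)); // [1.0, 2.0, 3.0, 6.0]
    }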
No, because the bottleneck is RAM bandwidth. This is already quantized and otherwise is essentially random so can't be compressed in any meaningful way.
How much bandwidth do we actually need per-token generation? Let's take one open-source model as a starting point since not all models are created the same.
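As a rule of thumb, each generated token has to stream every active parameter through memory once, so required bandwidth ≈ active parameters × bytes per weight × tokens per second. A sketch with assumed numbers (an MoE model with ~37B active parameters per token, roughly the figure cited for DeepSeek-V3, at ~4-bit quantization):

    // Back-of-the-envelope memory-bandwidth estimate for token generation.
    // Rule of thumb: every active parameter is read once per generated token.
    fn main() {
        let active_params = 37e9; // assumed: MoE model, ~37B active params per token
        let bytes_per_weight = 0.5; // assumed: ~4-bit quantization
        let target_tps = 6.5; // target decode speed, tokens per second

        let bytes_per_token = active_params * bytes_per_weight;
        let required_bw_gbs = bytes_per_token * target_tps / 1e9;

        println!("~{:.1} GB read per token", bytes_per_token / 1e9); // ~18.5 GB
        println!("~{:.0} GB/s needed for {target_tps} tok/s", required_bw_gbs); // ~120 GB/s
    }

For a dense model you'd plug in the full parameter count instead, which is why the 405B Llama mentioned above is much more bandwidth-hungry per token.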
It honestly depends on your use case. I often run bigger, slower models on my PC and let them just tootle along in the background grinding out their response while I work on something else.
For people who have a disciplinary background in neural networks and machine learning, I imagine that replicating that paper in some type of framework would be straightforward, right? Or am I mistaken?
The model itself, yes. The changes from previous architectures are often quite small code-wise; quite often it's just adding or changing a few lines in a torch model.
Things like tweaking all the hyperparameters to make the training process actually work may be trickier, though.
With an LLM, the actual 0s and 1s of the model are fairly standard, common, and freely available to anyone that wants to use them. The "source code" for an LLM is the process used to create the outcome, and to an extent, the data used to train with. DeepSeek released a highly detailed paper that describes the process used to create the outcome. People and companies are actively trying to reproduce DeepSeek's work to confirm the findings.
It's more akin to scientific research where everyone is using the same molecules, but depending on the process you put the molecules through, you get a different outcome.
> With an LLM, the actual 0s and 1s of the model are fairly standard, common, and freely available to anyone that wants to use them
How is that different than the 0s and 1s of a program?
Assembly instructions are literally standard. What’s more, if said program uses something like Java, the byte code is even _more_ understandable. So much so that there is an ecosystem of Java decompilers.
Binary files are not the “source” in question when talking about “open source”
There is no way to decompile an LLM's weights and obtain a somewhat meaningful, reproducible source, like with a program binary as you say. In fact, if we were to compare both in this way that would make a program binary more "open source".
Actually, a lot of the hacks that mold uses to be the fastest linker would be, ironically, harder to reproduce in Rust because they're antithetical to its approach. E.g. mold intentionally never frees most of what it allocates in order to speed up execution (it'll all be cleaned up by the OS when the process exits), while Rust's strong RAII conventions would, by default, reintroduce those deallocations and their slowdowns.
You can absolutely introduce free-less allocators and the like, as well as use `ManuallyDrop` or `Box::leak`. Rust just asks that you're explicit about it.
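A tiny sketch of both escape hatches, standard library only (the SymbolTable type is made up for illustration):

    use std::mem::ManuallyDrop;

    struct SymbolTable {
        names: Vec<String>,
    }

    fn main() {
        // Box::leak: allocate once, never free, get a &'static out of it.
        // The OS reclaims the memory when the process exits.
        let table: &'static mut SymbolTable = Box::leak(Box::new(SymbolTable {
            names: vec!["main".to_string(), "memcpy".to_string()],
        }));
        table.names.push("strlen".to_string());

        // ManuallyDrop: keep normal ownership and borrowing, but opt out of the
        // destructor (and thus the deallocation) when the value goes out of scope.
        let scratch = ManuallyDrop::new(vec![0u8; 1 << 20]);
        println!("{} symbols, {} scratch bytes", table.names.len(), scratch.len());
    }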
You could, for example, take advantage of a bump arena allocator in Rust, which would let the linker get away with just one alloc/dealloc. Mold is still using more traditional allocators under the covers, which won't be as fast as a bump allocator. (Nothing would stop mold from doing the same.)
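A sketch of what that looks like with the bumpalo crate (one of several arena crates; the Symbol type here is made up for illustration, and this needs bumpalo as a dependency):

    use bumpalo::Bump;

    // Hypothetical linker-ish data, just for illustration.
    struct Symbol<'arena> {
        name: &'arena str,
        address: u64,
    }

    fn main() {
        // One arena for the whole link; everything allocated into it is released
        // in a single shot when `arena` is dropped (or never, if you leak it).
        let arena = Bump::new();

        let symbols: Vec<_> = (0..3u64)
            .map(|i| {
                // alloc/alloc_str just bump a pointer; no per-object free ever runs.
                &*arena.alloc(Symbol {
                    name: arena.alloc_str(&format!("sym_{i}")),
                    address: 0x1000 + i * 8,
                })
            })
            .collect();

        println!("allocated {} symbols in the arena", symbols.len());
    }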
Traditional allocators are fast if you never introduce much fragmentation with free, though you may still get some gaps and some other overhead, so they won't be quite as fast. But why couldn't you just LD_PRELOAD a malloc for mold that worked as a bump/stack/arena allocator and simply ignored free, if third-party stuff isn't making that many allocations?
Really it's allocators in general. Allocations are perceived as expensive only because they are mostly dependent on the amortized cost of prior deallocations. As an extreme example, even GCs can be fast if you avoid deallocation because most typically have a long-lived object heap that rarely gets collected - so if you keep things around that can be long-lived (pooling) their cost mostly goes away.
Allocation is perceived as slow because it is. Getting memory from the OS is somewhat expensive because pages of memory need to be allocated and accounted for. Getting memory from traditional allocators is expensive because free space needs to be tracked. When you say "I need 5 bytes", the allocator needs to find 5 free bytes to give back to you.
Bump allocators are fast because the operation of "I need 5 bytes" is incrementing the allocation pointer forward by 5 bytes and maybe doing a new page allocation if that's exhausted.
GC allocators are fast because they are generally bump allocators! The only difference is that when exhaustion happens the GC says "I need to run a GC".
Traditional allocators are a bit slower because they are typically something like an arena with skiplists used to find free space. When you free up memory, that skiplist needs to be updated.
But further, unlike bump and GC allocators another fault of traditional allocators is they have a tendency to scatter memory which has a negative impact on CPU cache performance. With the assumption that related memory tends to be allocated at the same time, GCs and bump allocators will colocate memory. But, because of the skiplist, traditional allocators will scattershot allocations to avoid free memory fragmentation.
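A stripped-down illustration of that "I need 5 bytes" difference: in a bump arena the whole allocation is an offset increment, and there is no free list to maintain because there is no free (which is also essentially what the LD_PRELOAD trick mentioned above would give you):

    /// Toy bump allocator: allocation is just bumping an offset.
    /// There is no free; the whole region is released at once when dropped.
    struct BumpArena {
        buf: Vec<u8>,
        offset: usize,
    }

    impl BumpArena {
        fn with_capacity(cap: usize) -> Self {
            BumpArena { buf: vec![0; cap], offset: 0 }
        }

        /// Hands out `n` bytes, or None when the arena is exhausted (a real
        /// allocator would grab another chunk here; a GC would treat this as
        /// the trigger to run a collection).
        fn alloc(&mut self, n: usize) -> Option<&mut [u8]> {
            if self.offset + n > self.buf.len() {
                return None;
            }
            let start = self.offset;
            self.offset += n; // the entire cost of an allocation
            Some(&mut self.buf[start..start + n])
        }
    }

    fn main() {
        let mut arena = BumpArena::with_capacity(1024);
        let a = arena.alloc(5).unwrap();
        a.copy_from_slice(b"hello");
        let b = arena.alloc(16).unwrap();
        b[0] = 42;
        println!("used {} of {} bytes", arena.offset, arena.buf.len());
    }

Note that consecutive allocations also land next to each other in the buffer, which is the cache-locality point above.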
All this said, for most apps this doesn't matter a whole lot. However, if you are doing a CPU/memory intense operation then this is stuff to know.
> Getting memory from traditional allocators is expensive because freespace needs to be tracked.
If you've never deallocated then there is no free space to track, hence the cost of allocation is _mostly_ (per my above comment) affected by the amortized cost of deallocations. The cost becomes extremely apparent with GCs because a failed allocation is usually what triggers a collection (and subsequently any deallocations that need to happen).
Still, your comment goes into detail that I probably should have.
Nothing about Rust requires the use of the heap or RAII.
Also, if wild is indeed faster than mold even without incrementalism, as the benchmarks show, then it seems quite silly to go around making the argument that it's harder to write a fast linker in Rust. It's apparently not that hard.
Not really. You would have to either wrap standard library types in newtypes that use ManuallyDrop or (for some of them) use a custom allocator. And if you want to free some things in one go but not others, that gets much harder, especially when you look at how easy a language like Zig makes it.
And if you intentionally leak everything, it is onerous to get the borrow checker to accept that unless you use a leaked Box for all declarations/allocations, which introduces both friction and performance regressions (due to memory access patterns), because the use of custom allocators doesn't factor into lifetime analysis.
(Spoken as a die-hard Rust dev who still thinks it's a better language than Zig for most everything.)
There’s been a lot of interest in faster linkers spurred by the adoption and popularity of rust.
Even modest statically linked Rust binaries can take a couple of minutes in the link stage of compilation in release mode (using mold). It's not a Rust-specific issue but an amalgam of (usually) strictly static linking, advanced link-time optimizations enabled by LLVM like LTO and BOLT, and a general dissatisfaction with compile times in the Rust community. Rust's strong relationship with (read: dependency on) LLVM makes it the most popular language where LLVM's link-time magic has been most universally adopted; you could face these issues with C++ too, but there they would be chalked up to your toolchain rather than to the language.
I’ve been eyeing wild for some time as I’m excited by the promise of an optimizing incremental linker, but to be frank, see zero incentive to even fiddle with it until it can actually, you know, link incrementally.
C++ can be rather faster to compile than Rust, because some compilers do have incremental compilation and incremental linking.
Additionally, the acceptance of binary libraries across the C and C++ ecosystem means that more often than not, you only need to care about compiling your own application, and not the world, every time you clone a repo or switch development branches.
Imagine if Linus needed a gaming rig to develop Linux...
And he also did not have cargo at his disposal.
No need to point out it is C instead, as they share common roots, including place of birth.
Or how we used to compile C++ between 1986 and the 2000s, mostly on single-core machines, developing games, GUIs, and distributed computing applications with CORBA and DCOM.
I solved this by using Wasm. Your outer application shell calls into Wasm business logic, only the inner logic needs to get recompiled, the outer app shell doesn't even need to restart.
Very similar, but Wasm has additional safety properties and affordances. I am trying to get away from dynamic libs as an app extension mechanism. It is especially nice when application extension is open to end users, they won't be able to crash your application shell.
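A minimal sketch of that split; the function name and numbers are made up, and the host side would use whatever Wasm runtime you prefer (wasmtime, wasmer, etc.):

    // Inner business logic, compiled separately to Wasm, e.g.:
    //   cargo build --target wasm32-unknown-unknown --release
    // (with crate-type = ["cdylib"] in Cargo.toml). The outer application shell
    // loads the resulting .wasm at runtime and can swap in a newly built module
    // without restarting itself. If the module traps, only that call fails; the
    // shell keeps running.

    /// Hypothetical exported entry point for one piece of business logic.
    #[no_mangle]
    pub extern "C" fn apply_discount(price_cents: i64, percent: i64) -> i64 {
        price_cents - (price_cents * percent) / 100
    }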
My problems with Python are startup time and packaging complexity (either dependency hell or a full-blown venv with pipx/uv). I’ve been rewriting shell scripts to either Makefiles (crazy, but it works, is rigorous, and you get free parallelism) or rust “scripts” [0], depending on their nature (number of outputs, number of command executions, etc.)
Also, using a better shell language can be a huge productivity (and maintenance and sanity) boon, making it much less “write once, read never”. Here’s a repo where I have a mix of fish-shell scripts with some converted to rust scripts [1].
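For what it's worth, this is roughly the shape those rust "scripts" take; a std-only illustration (so no embedded manifest needed), not one of the actual scripts from the repo, using the rust-script shebang convention:

    #!/usr/bin/env rust-script
    // A shell-script replacement: run a command, check its status, use the output.
    use std::process::Command;

    fn main() {
        let out = Command::new("git")
            .args(["status", "--porcelain"])
            .output()
            .expect("failed to run git");

        if !out.status.success() {
            eprintln!("git failed: {}", String::from_utf8_lossy(&out.stderr));
            std::process::exit(1);
        }

        let dirty = String::from_utf8_lossy(&out.stdout).lines().count();
        println!("{dirty} modified/untracked paths");
    }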
I've often read that people have a problem with Python's startup time, but that's not at all my experience.
Yes, if you're going to import numpy or pandas or other heavy packages, that can be annoyingly slow.
But we're talking using Python as a bash script alternative here. That means (at least to me) importing things like subprocess, pathlib. In my experience, that doesn't take long to start.
34 milliseconds doesn't seem like a lot of time to me. If you're going to run it in a tight loop then yes, that's going to be annoying, but in interactive use I don't even notice delays as small as that.
As for packaging complexity: when using Python as a bash script alternative, I can mostly get by with only stuff from the standard library. In that case, packaging is trivial. If I do need other packages, then yes, that can be a major nuisance.
Not necessarily. Shell scripts often embody the Unix “do one thing and do it right” principle. To download a file in a bash script you wouldn’t (sanely) source a bash script that implements an HTTP client; you would just shell out to curl or wget. Same for parsing a JSON file/response: you would just depend on and defer to jq. Whereas in Python you could do the same, but you would most likely (and more idiomatically) pull in the imports to do it in Python.
It’s what makes shell scripts so fast and easy for a lot of tasks.