I would prefer a computer that was slower, but worked the way it was supposed to (i.e. without rowhammer & spec-ex bugs, AMT/secret minix kernel, encrypted microcode mystery-meat, etc).
The typical user doesn't need to bean-count every microsecond. He desperately needs a computer that is his own tool rather than an infosec quagmire & the business-end of tech sector money-proboscis.
It is, sadly, the path of least resistance. "Big number good" is much easier for a consumer to grasp than the (qualitative) corners that were cut to achieve those big numbers. Lies, damn lies, & benchmarks!
The combinatorics of making distributed the new default are astronomical. That he dashes this all off so lightly, and in so little detail, makes me wonder if he's ever really done the work. Whether he has, or he hasn't, this is still 100% the wrong direction for anyone outside of a data center.
> The typical user [...] He desperately needs a [...]
citation needed. typical users want communication, entertainment, in general they want to achieve ends, and for them the important thing about the means is that it should look good/cool. hence Apple things.
My computers from 15 years ago, PC or Mac, did all of these things--even high-res FPS games--beautifully. If all you do is surf/email/music/video/games--and I don't think it requires a citation to believe that this is the lion's share of users--current speeds are beyond sufficient.
The quality issues in modern machines are the byproduct of a quantity-cult pissing contest rather than necessity. If anything, the excesses are only wasted by sloppier software (which was enabled by the excess to begin with), piggier websites, covert crypto miners, etc. Wirth's law is a vicious cycle.
And it's a false economy: cut corners on the clean-up of discarded speculative branches, get some great numbers on the latest benchmark; the chipmakers who stoop to this get an edge in the market, but by the time it hits the consumer (who has no understanding of such things), the mitigation costs exceed the original "improvement". Because it is such a multi-variate & workload-dependent situation, it may not be cut-and-dry fraud, even though it smells a lot like it.
>How many would notice, if we hijacked runtime calls and wrote to a remote blob storage instead of disks
We replaced all file access with calls to S3 storage a few months ago (the goal was to make the service completely stateless, for other technical reasons), and just yesterday we had yet another connectivity problem with S3. Disks break too, but it feels like connectivity issues are much more frequent, at least in our country.
Which is innately more fragile: a communication system you rely on for data, running dozens to hundreds of miles through some communication medium that you don't control, or disks sitting right in your machine?
> Which is innately more fragile, a communication system […]
"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." — https://en.wikiquote.org/wiki/Leslie_Lamport
An alternative perspective - which is innately more fragile: a communication system with support and incident management by professionals whose sole job it is to keep the system running 24x7 and bring it back online in the rare event it fails; or, you, whose main job is something else entirely, and you would rather not have to think about disks at all.
For: I've seen a company very nearly destroyed by not having the skills to deal with a single disk failure in raid.
Against: A Large Hosting Company we used couldn't read my simple instructions and lost a backup.
BTW read the SLAs of your provider and my guess is you'll agree with what a lawyer who worked for us said - shite. Basically, while it was down our provider wouldn't charge us for it. The end. What does yours do? If similar, what motivation have they to fix breaks? And is the provider responsible for loss of connectivity between them and you?
Put another way, who gets hurt more in downtime, the provider or you?
"SLA compensation doesn’t even scratch the surface of these losses. If a single virtual machine goes down for less than 7 hours, 18 minutes (99% monthly availability), AWS will pay 10% of the monthly cost of that virtual machine. Considering the price of a small instance (a ‘t4g.nano’) in the large US-East-1 region (in Northern Virginia, US) is around $3 per month, total compensation for this outage would be 30 cents.
If a virtual machine goes down for less than 36 hours (95% availability in a month), the compensation is just 30% — just under a dollar. The user only receives a full refund for the month if the resource is down for more than one day, 12 hours and 31 minutes in total."
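For anyone who wants to sanity-check the math, here's a rough Python sketch of the credit tiers as quoted above. It's purely illustrative, not an official AWS calculator, and the 99.5% upper threshold is an assumption based on the published instance-level EC2 SLA:

    def sla_credit(monthly_cost, uptime_pct):
        """Credit tiers as quoted above; purely illustrative."""
        if uptime_pct >= 99.5:        # assumed SLA target; no credit above it
            return 0.0
        if uptime_pct >= 99.0:        # down less than ~7h18m in a 730-hour month
            return monthly_cost * 0.10
        if uptime_pct >= 95.0:        # down less than ~36h30m
            return monthly_cost * 0.30
        return monthly_cost           # below 95%: full refund for the month

    print("$%.2f" % sla_credit(3.00, 99.2))   # t4g.nano-ish instance -> $0.30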
yep, that sounds about right.
Edit for context:
"In the 2021 Uptime Institute data center survey, the average cost of respondents’ most significant recent downtime incident was $973,000. This average does not include the 2% of respondents who estimate they lost more than $40M for their most recent worst downtime incident."
The AWS SLA compensation is also very much rigged against you, beyond the percentage-based outage durations.
For example, a couple of months ago AWS had an outage that caused all of our customer-facing domains to go down in us-west-2. It meant example.com wasn't resolving to our site, due to a confirmed AWS outage.
For a few hours all of our RDS instances, EC2 instances, etc. were being charged for but providing $0 value, since the entire org's sites were down. All revenue halted because the site wasn't accessible. When I contacted AWS support about the outage, they said we only qualified for some microscopic amount because the outage wasn't directly related to RDS, EC2, VPC, and so on.
>Against: A Large Hosting Company we used couldn't read my simple instructions and lost a backup.
One of our previous incidents happened because an employee at a large hosting company misunderstood the ticket and manually shut down our entire live server without warning
This is only feasible for applications where latency is not a concern. The overhead of just the HTTP call to an S3 bucket (not to mention all the other bucket access overhead) is much higher than the overhead of a disk read request. Try performing 1000 random file accesses against a bucket, and 1000 random accesses against a disk. The performance won't even be close.
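If anyone wants to try it, here's a rough Python sketch of that comparison. The bucket name and key layout are made up, and it assumes boto3 with AWS credentials configured plus a reasonably large local test file:

    import os, time, random
    import boto3

    BUCKET = "my-test-bucket"   # hypothetical bucket holding objects key-0000 .. key-0999

    def time_s3(n=1000):
        s3 = boto3.client("s3")
        start = time.perf_counter()
        for _ in range(n):
            key = "key-%04d" % random.randrange(n)
            s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        return time.perf_counter() - start

    def time_disk(path="testfile.bin", n=1000, block=4096):
        # path should point at a local file much larger than `block`
        size = os.path.getsize(path)
        start = time.perf_counter()
        with open(path, "rb") as f:
            for _ in range(n):
                f.seek(random.randrange(size - block))
                f.read(block)
        return time.perf_counter() - start

    print("s3   : %.3fs" % time_s3())
    print("disk : %.3fs" % time_disk())

Even on a good day, the per-request round trip to S3 is roughly tens of milliseconds, versus tens of microseconds for a local NVMe read - a few orders of magnitude.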
Should mention Kerbal Space Program in this context. Rather than deal with complicated IO, moving data between disks and ram, the devs just loaded absolutely everything into memory at program start. This had all sorts of trickle-down benefits. In my experience, many Linux versions of that huge game were more stable than basic office software.
That's a pretty standard approach in games. It avoids the problem of loads during gameplay making the game choppy, and it also makes you aware of your total maximum memory budget (important especially for consoles) - whereas, with dynamic loads and mallocs, you'll only find that out via extensive testing.
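In its simplest form the pattern is just this - a toy Python sketch with a hypothetical asset directory; real engines do the same thing with textures/meshes/audio and far more care:

    import os

    ASSET_DIR = "assets"   # hypothetical
    assets = {}

    def preload_all():
        # Read every asset into RAM once, at startup (the "loading screen").
        for root, _, files in os.walk(ASSET_DIR):
            for name in files:
                path = os.path.join(root, name)
                with open(path, "rb") as f:
                    assets[path] = f.read()

    preload_all()
    # From here on, gameplay code only ever touches `assets`, never the disk
    # (except for saves), so there's nothing left to stall on mid-frame.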
Not unless your game is less than a few gigs in total size. The average size of a 3D simulation game is much larger than the 8/16 GB of memory on typical gaming rigs. KSP loaded literally everything, every texture, at startup. The only time the drive was touched after that was for saves.
KSP (the original) used the Unity engine. When the original devs quit, the game got more and more buggy, to the point where it was unplayable - it kept crashing every other hour or so. I played without quicksaving for a more tense experience, and the crashes ruined that.
Never had a crash with my M1 (8gb RAM), although it's pretty choppy even with nothing else open, especially if I go into maneuver mode zoomed out to show other planets.
> Rather than deal with complicated IO, moving data between disks and ram, the devs just loaded absolutely everything into memory at program start.
Don't most games do this at some level? That's what the whole "loading screen" is for: bring in all the bits for the level (or whatever) in question so there's less chance of an I/O hiccup during gameplay.
I would have assumed that video game engines would have solved the problem of loading assets dynamically / on demand already, I guess not. It reminds me of Cyberpunk's bad PS4 / xbone release and issues, where a famous clip on the internet shows assets not having loaded yet during a cutscene; again, I would've thought loading assets was a solved problem.
There's a talk by one of the developers of Crash Bandicoot about all the hacky (and brilliant) things they had to do to get the game to load quickly off the (2X?) CD-ROM into the very limited RAM of the PlayStation 1.
In large part, while a "solved problem" in aggregate, every engine has different solutions in play.
In part, because of trade-offs baked into engine assumptions. It's certainly possible no two engines will ever agree on asset loading due to such tradeoffs and how engines themselves are built/optimized.
In part, because there's still not much coordination between different game companies and few shared libraries/middleware/"content/asset pipelines". Think about how many bundlers there are in JS today that we know of because of open source (and HN headlines about them) and multiply that by how proprietary most game companies still see every bit of their code in 2022 and generally avoid HN headlines about low level details of their tech stacks (for "competitive advantage").
Loading assets in a simple, naive way is indeed solved. If you want "next-gen graphics", you need to push against the limit of your system, much as always. That is always tricky, new techniques are being developed every day, game studios have departments dedicated to finding new and better ways to stream data to-and-fro as needed.
When we are talking nanoseconds or low microseconds, distance is still important. Networking has become a lot faster, but we are still talking milliseconds when doing stuff in the "cloud" - if not seconds when we are talking enterprise/business software in the cloud. Distributed systems have few advantages and many disadvantages. For speed/performance, keep the data as local as possible.
> For speed/performance keep the data as local as possible.
Yeah, I agree. For scalability and resilience though, the cloud-based design patterns - async queues and workers, eventual consistency, etc. - are probably better. I've yet to really have a need for something like that though.
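For what it's worth, the core of the "async queue + worker" pattern is tiny. A bare-bones Python sketch, using an in-process queue as a stand-in for SQS/RabbitMQ/whatever:

    import queue, threading

    jobs = queue.Queue()

    def worker():
        while True:
            job = jobs.get()
            if job is None:              # sentinel: shut down
                break
            print("processing", job)     # the real work would go here
            jobs.task_done()

    threading.Thread(target=worker, daemon=True).start()

    for i in range(5):
        jobs.put({"order_id": i})        # producers enqueue and move on
    jobs.join()                          # the work completes "eventually"
    jobs.put(None)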
Filesystems are close to being obsolete in the cloud. We still need them to get applications "started", but after that, applications shouldn't really be using a filesystem. I/O should take place with a remote (or local) service that deals with the many moving parts behind the thing the application really wants (look up some data, read some data, write some data, delete some data) in a large distributed system (logging, tracing, authn+z, storing/retrieving, sharing, archiving, etc). How it gets stored and on what media is pretty inconsequential to the app, and in many cases things are just more difficult because of the inherent limitations of storing data in files in filesystems on disks.
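To make that concrete, here is a minimal Python sketch of the kind of get/put/delete service interface described above. The names are illustrative, not any particular product's API; an S3-backed or disk-backed implementation would plug in behind the same three methods without the application changing:

    from abc import ABC, abstractmethod

    class BlobStore(ABC):
        """What the app actually wants: look up, read, write, delete some data."""
        @abstractmethod
        def get(self, key: str) -> bytes: ...
        @abstractmethod
        def put(self, key: str, data: bytes) -> None: ...
        @abstractmethod
        def delete(self, key: str) -> None: ...

    class InMemoryStore(BlobStore):
        def __init__(self):
            self._blobs = {}
        def get(self, key): return self._blobs[key]
        def put(self, key, data): self._blobs[key] = data
        def delete(self, key): del self._blobs[key]

    # Application code never sees files, paths, or media.
    store: BlobStore = InMemoryStore()
    store.put("users/42/profile.json", b'{"name": "example"}')
    print(store.get("users/42/profile.json"))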
The other thing is, once disks are as fast as RAM, there's really no point in having RAM. It only exists because running programs would have been really slow if we had to wait for the disk to seek 100 times to print "Hello World".
Combine these two scenarios and you land smack into one new reality: all applications should be doing I/O in a giant pool of state that is both data storage and computational memory, as a global, distributed, virtual storage memory manager. In effect, clusters of computers should be sharing CPU and storage, and applications should run across all of those resources, like one big computer cut into 100 pieces. The idea of this was called SSI before, but it can actually become reality a lot easier if we can make the abstractions so simple that we don't have to code around ugly problems like threaded-shared-memory apps.
Basically we need to make a new "layer" that will allow us to delete a couple of old ones, and that simplicity will enable whole new designs and make currently-hard problems easy.
There is sadly still a huge performance gap between most filesystems and other storage APIs.
If you tried to boot windows with all system files stored on S3, it would take forever.
It turns out the overhead of HTTP, encoding, many memory copies, splitting into TCP/UDP packets, etc., is huge compared to a DMA transfer straight from an SSD into an application's memory space.
> all applications should be doing I/O in a giant pool of state
I'm somewhat infamous as one of the last people to have given up on distributed filesystems, but even I have always believed the "giant pool of state" approach is a better model for applications. Storage should be designed for applications, not the other way around. Hide the tiers and locality and layouts behind an abstraction layer as much as possible, let 90% of the code "think" in objects or rows or arrays or whatever makes sense for it. Yes, the abstraction will leak. You'll have to deal with it when performance tuning, and probably add some other user-visible concept of consistency/durability points, but all of that should be minimized. In addition to lessening the cognitive load for most programmers, having such a layer makes it easier to adapt to new technologies in a rapidly changing landscape.
> The idea of this was called SSI before, but it can actually become reality
This is where we diverge a bit. SSI vs. explicit distribution is really orthogonal to storage models and abstractions. Most SSI attempts failed because the coordination/coherency cost even for compute/memory stuff was too high. Also the semantics around things like signals, file descriptors, process termination and so on were always a mess. POSIX contains too many things that only really work in a single system, barely a shared-memory multiprocessor let alone a more loosely coupled type of system, but that goes well beyond the storage parts. (Yes, young 'uns, there's more to POSIX than the filesystem part.) Tanenbaum's and Deutsch's critiques of RPC mostly apply to any kind of SSI, especially with respect to handling partial failure. While abstracting away the storage part makes sense, I don't think abstracting away a system's distributed nature is a good idea (or even possible) for most domains.
The irony is that this circles back to the early days of Java Application Servers and the idea that all that the EAR depends on (besides configuration resources) should live on the data layers, configured via JNDI.
It's intriguing that XFS, which lacks the integrity checks that ZFS or BTRFS offer, is the recommended filesystem for databases, even distributed ones like CockroachDB.
I don't have any experience with ZFS, and very little with BTRFS, but the reason I pick XFS is that it has "just worked", out of the box, on every Linux distribution I've managed for about 20 years. Back then, it was one of the few choices that didn't have a static inode limit, and it could be grown while mounted. That was a big convenience on file servers when combined with LVM and/or hardware RAID. ZFS sounds hard to manage, I thought BTRFS was still eating peoples' data for a while, and XFS just keeps working fine.
The main log provides CoW-like functionality, & archive logs provide snapshot-like functionality. Having the DB & the file system duel over these responsibilities is trouble waiting to happen.
It's interesting that XFS is the recommended filesystem for databases (even distributed ones like CockroachDB), when it does not have the integrity checks that ZFS or BTRFS have.
Is performance really more important than integrity for databases?
XFS has a long and solid track record, but without the workload-dependent regressions which can be encountered in a copy-on-write filesystem. Decent hard drives have had internal checksums for ages. If the drive is even halfway functioning, it will catch the bad blocks--making that particular ZFS feature largely redundant, and databases that matter are going to be on a RAID anyway, which has its own additional layer of parity.
ZFS is fantastic for some applications, not so much for others.
I've had many checksum errors with ZFS - really no difference between enterprise HDDs and the cheapest possible HDDs (new ones). Or I just got unlucky. But some of my HDDs are 15+ years old, with bad sectors, and still usable.
Checksum vs parity. I would wager that the drive firmware also detected the error. This is why I think ZFS checksumming is largely redundant. Parity / error correction over a zpool being another issue entirely.
This is not to ding ZFS. If I were in Sun's position of having to take responsibility for 3rd-party manufactured drives, I'd have done the same thing. And circa 2006, the reliability of in-drive error detection may have been in a poorer state.
Never had a corruption issue the past 10 years that wasn't preceded by months of SMART errors/warnings.
It is absolutely not redundant: ZFS RAIDZ can reconstruct damaged data, thanks to the extra checksum provided by ZFS, while traditional RAID does not recover from silent data corruption because it does not know which block to reconstruct. This is exceptionally sensitive when backups are encrypted, because with even a few bits off you can't decrypt the backups. Since this happens transparently, regularly testing backups is very important in non-ZFS systems.
Read what I wrote more carefully. Checksumming is error detection only, while parity schemes can handle both error detection & correction (see the toy sketch below). Checksumming blocks in software that are already being checksummed in hardware is redundant. Parity, on the other hand, is an essential part of any data redundancy system.
For a transactional database, XFS + hardware RAID is going to be a more reliable setup than ZFS + JBOD. Database logs are similar in function to CoW, & the fancier DBs have time-machine features similar to snapshots. Having the filesystem & the database struggling against each other to solve the same problem, but with different solutions, is a situation that invites so many Heisenbugs. That, and a good RAID controller is going to handle drive-reported errors sooner and more gracefully than a ZFS scrub.
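The detection-vs-correction distinction in a toy Python sketch, purely illustrative: a per-block checksum tells you which block went bad (detection), and XOR parity - the simplest RAID-style scheme - lets you rebuild it (correction):

    import hashlib

    def checksum(block):
        return hashlib.sha256(block).hexdigest()

    blocks = [b"aaaa", b"bbbb", b"cccc"]
    sums = [checksum(b) for b in blocks]
    parity = bytes(a ^ b ^ c for a, b, c in zip(*blocks))

    blocks[1] = b"bXbb"                      # simulate silent corruption

    # Detection: the checksum pinpoints the damaged block...
    bad = next(i for i, blk in enumerate(blocks) if checksum(blk) != sums[i])
    # ...correction: XOR the surviving blocks with parity to rebuild it.
    survivors = [blk for i, blk in enumerate(blocks) if i != bad]
    blocks[bad] = bytes(x ^ y ^ z for x, y, z in zip(*survivors, parity))
    assert checksum(blocks[bad]) == sums[bad]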
I don't have the answer, but it's an interesting question. Given that XFS doesn't do replication like ZFS or BTRFS do, having integrity checks is a bit weird. If a check fails, what is XFS supposed to do about it?
If the filesystem can "recover" or deal with failed integrity checks, it makes sense to have them at that layer. If it can't, I'd say that the application, in this case a database, should do its own integrity checks. If it's just a few rows or tables, the operator can easily recover those from a backup. Having the filesystem, which knows nothing about the data, report the failure will mean having to do a full restore from backups.
ZFS is more akin to a logical volume manager (inc. software raid) and a filesystem though.
XFS is just a filesystem, it's not trying to solve the same issues.
Higher end RAID cards also do block level validations in the background. It isn't filesystem aware but it is sufficient to suss out bad clusters and relocate data proactively.
The point of ZFS is to avoid needing such hardware-based checksumming systems, but RAID cards have been doing this validation for many years now.
The hardware solutions we have now provide silent corruption protection from the storage controller down to the actual media.
ZFS extends that protection to the filesystem layer in the kernel and when used in isolation of modern hardware solutions can protect the whole stack.