> How many would notice if we hijacked runtime calls and wrote to remote blob storage instead of disks?
We replaced all file access with calls to S3 storage a few months ago (the goal was to make the service completely stateless, for other technical reasons), and just yesterday we had yet another connectivity problem to S3. Disks break too, but it feels like connectivity issues are much more frequent, at least in our country.
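Roughly, the swap looks like this (a sketch, not the actual code: the bucket and function names are made up, and it assumes boto3 with credentials already configured):

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-service-state"  # hypothetical bucket name

    def write_blob(key: str, data: bytes) -> None:
        # was: open(f"/var/lib/svc/{key}", "wb").write(data)
        s3.put_object(Bucket=BUCKET, Key=key, Body=data)

    def read_blob(key: str) -> bytes:
        # was: open(f"/var/lib/svc/{key}", "rb").read()
        return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()

Every caller keeps the same interface; only the backing store changes, which is also why a single connectivity blip now touches every read and write.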
Which is innately more fragile: a communication system you rely on for data, running dozens to hundreds of miles through a medium you don't control, or disks?
> Which is innately more fragile: a communication system […]
"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." — https://en.wikiquote.org/wiki/Leslie_Lamport
An alternative perspective: which is innately more fragile, a communication system with support and incident management by professionals whose sole job is to keep the system running 24x7 and bring it back online in the rare event it fails; or you, whose main job is something else entirely, and who would rather not have to think about disks at all?
For: I've seen a company very nearly destroyed by not having the skills to deal with a single disk failure in a RAID array.
Against: A Large Hosting Company we used couldn't read my simple instructions and lost a backup.
BTW, read your provider's SLAs and my guess is you'll agree with what a lawyer who worked for us said: shite. Basically, while the service was down our provider wouldn't charge us for it. The end. What does yours do? If it's similar, what motivation do they have to fix breaks? And is the provider responsible for loss of connectivity between them and you?
Put another way, who gets hurt more in downtime, the provider or you?
"SLA compensation doesn’t even scratch the surface of these losses. If a single virtual machine goes down for less than 7 hours, 18 minutes (99% monthly availability), AWS will pay 10% of the monthly cost of that virtual machine. Considering the price of a small instance (a ‘t4g.nano’) in the large US-East-1 region (in Northern Virginia, US) is around $3 per month, total compensation for this outage would be 30 cents.
If a virtual machine goes down for less than 36 hours (95% availability in a month), the compensation is just 30% — just under a dollar. The user only receives a full refund for the month if the resource is down for more than one day, 12 hours and 31 minutes in total."
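To make the arithmetic concrete, a rough sketch of those tiers (percentages and thresholds taken from the quoted figures only, simplified; the actual SLA has more conditions and a no-credit tier above these):

    # Back-of-the-envelope credit per the quoted EC2 instance-level tiers.
    HOURS_PER_MONTH = 730  # ~30.4-day month

    def sla_credit(monthly_cost: float, downtime_hours: float) -> float:
        uptime = 1 - downtime_hours / HOURS_PER_MONTH
        if uptime >= 0.99:
            rate = 0.10   # down up to ~7h18m: 10% of that instance's bill
        elif uptime >= 0.95:
            rate = 0.30   # down up to ~36h30m: 30%
        else:
            rate = 1.00   # beyond that: full refund for the instance
        return monthly_cost * rate

    print(sla_credit(3.00, 7.0))   # ~$0.30 for a $3/month t4g.nano
    print(sla_credit(3.00, 36.0))  # ~$0.90, "just under a dollar"

Note the credit is a percentage of that one resource's bill, not of anything you lost while it was down.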
yep, that sounds about right.
Edit for context:
"In the 2021 Uptime Institute data center survey, the average cost of respondents’ most significant recent downtime incident was $973,000. This average does not include the 2% of respondents who estimate they lost more than $40M for their most recent worst downtime incident."
The AWS SLA compensation is also very much rigged against you, beyond the percent-based outage durations.
For example, a couple of months ago AWS had an outage that caused all of our customer-facing domains in us-west-2 to go down. Going to example.com simply wasn't resolving to our site, due to a confirmed AWS outage.
For a few hours all of our RDS instances, EC2 instances, etc. were still being charged for while providing $0 of value, since the entire org's sites were down. All revenue halted because the site wasn't accessible. When I contacted AWS support about the outage, they said we only qualified for some microscopic amount because the outage wasn't directly related to RDS, EC2, VPC, etc.
> Against: A Large Hosting Company we used couldn't read my simple instructions and lost a backup.
One of our previous incidents happened because an employee at a large hosting company misunderstood the ticket and manually shut down our entire live server without warning.
This is only feasible for applications where latency is not a concern. The overhead of just the HTTP call to an S3 bucket (not to mention all the other bucket-access overhead) is much higher than that of a disk read request. Try performing 1000 random file accesses against a bucket and 1000 random accesses against a disk; the performance won't even be close.
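If you want to see the gap for yourself, something like this rough benchmark will show it (bucket, keys, and local paths are made up; assumes boto3, credentials, and that the objects/files already exist):

    import time
    import random
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "some-bucket"
    KEYS = [f"objects/{i}" for i in range(1000)]             # hypothetical small objects
    FILES = [f"/var/data/objects/{i}" for i in range(1000)]  # same data on local disk

    def bench(read_one, items):
        start = time.perf_counter()
        for item in random.sample(items, len(items)):  # random access order
            read_one(item)
        return time.perf_counter() - start

    s3_secs = bench(lambda k: s3.get_object(Bucket=BUCKET, Key=k)["Body"].read(), KEYS)
    disk_secs = bench(lambda p: open(p, "rb").read(), FILES)

    # Each S3 GET is a full HTTPS round trip (typically tens of milliseconds),
    # while a local read is microseconds to a millisecond, so the totals
    # usually differ by one or two orders of magnitude.
    print(f"S3: {s3_secs:.2f}s  disk: {disk_secs:.2f}s")

Batching, caching, or range requests can narrow the gap, but the per-request latency is the part you can't make go away.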