
>How many would notice, if we hijacked runtime calls and wrote to a remote blob storage instead of disks

We replaced all file access with calls to S3 storage a few months ago (the goal was to make the service completely stateless, for other technical reasons), and just yesterday we had yet another connectivity problem to S3. Disks break too, but connectivity issues feel much more frequent, at least in our country.
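For context, "replacing file access with S3 calls" mostly means swapping open()/read()/write() for object PUT/GET requests. A minimal sketch of the idea with boto3; the bucket name and key layout here are made up, and real code needs retries and error handling for exactly the connectivity problems mentioned above:

    import boto3

    BUCKET = "my-service-state"  # hypothetical bucket name
    s3 = boto3.client("s3")

    def write_blob(key: str, data: bytes) -> None:
        # Previously: open(path, "wb").write(data) on a local disk.
        s3.put_object(Bucket=BUCKET, Key=key, Body=data)

    def read_blob(key: str) -> bytes:
        # Previously: open(path, "rb").read() from a local disk.
        # Any network hiccup between here and S3 now surfaces as an exception.
        return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
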




Which is innately more fragile: a communication system you rely on for your data, running dozens to hundreds of miles over a medium you don't control?

Or a disk (or five) in your own chassis?

It's not quite that simple, but still...


> Which is innately more fragile, a communication system […]

"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." — https://en.wikiquote.org/wiki/Leslie_Lamport


An alternative perspective - which is innately more fragile: a communication system with support and incident management by professionals whose sole job is to keep it running 24x7 and bring it back online in the rare event it fails; or a disk looked after by you, whose main job is something else entirely and who would rather not have to think about disks at all?


That's a very good point.

For: I've seen a company very nearly destroyed by not having the skills to deal with a single disk failure in a RAID array.

Against: A Large Hosting Company we used couldn't read my simple instructions and lost a backup.

BTW, read your provider's SLAs and my guess is you'll agree with what a lawyer who worked for us said: shite. Basically, while the service was down our provider wouldn't charge us for it. The end. What does yours do? If it's similar, what motivation do they have to fix outages? And is the provider responsible for loss of connectivity between them and you?

Put another way, who gets hurt more in downtime, the provider or you?

Some reading for me <https://journal.uptimeinstitute.com/cloud-slas-punish-not-co...>

Quick extract:

"SLA compensation doesn’t even scratch the surface of these losses. If a single virtual machine goes down for less than 7 hours, 18 minutes (99% monthly availability), AWS will pay 10% of the monthly cost of that virtual machine. Considering the price of a small instance (a ‘t4g.nano’) in the large US-East-1 region (in Northern Virginia, US) is around $3 per month, total compensation for this outage would be 30 cents.

If a virtual machine goes down for less than 36 hours (95% availability in a month), the compensation is just 30% — just under a dollar. The user only receives a full refund for the month if the resource is down for more than one day, 12 hours and 31 minutes in total."

yep, that sounds about right.

Edit for context:

"In the 2021 Uptime Institute data center survey, the average cost of respondents’ most significant recent downtime incident was $973,000. This average does not include the 2% of respondents who estimate they lost more than $40M for their most recent worst downtime incident."


The AWS SLA compensation is also very much rigged against you, beyond the percentage-based outage durations.

For example, a couple of months ago AWS had an outage that took all of our customer-facing domains down in us-west-2. Going to example.com simply didn't resolve to our site, due to a confirmed AWS outage.

For a few hours all of our RDS instances, EC2 instances, etc. were still being billed but providing $0 of value, since the entire org's sites were down. All revenue halted because the site wasn't accessible. When I contacted AWS support about the outage, they said we only qualified for some microscopic amount because the outage wasn't directly related to RDS, EC2, VPC and so on.


>Against: A Large Hosting Company we used couldn't read my simple instructions and lost a backup.

One of our previous incidents happened because an employee at a large hosting company misunderstood the ticket and manually shut down our entire live server without warning.


It sounds like AWS gives almost a 10x credit for the quick cases and a 6x credit for the not quick cases.

Business interruption insurance should be covering the actual downtime cost.
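
Rough arithmetic behind those ratios, assuming a ~730-hour month and the $3/month t4g.nano figure quoted above; note the credit is measured against the instance cost, not against whatever the downtime actually cost you:

    HOURS_PER_MONTH = 730          # roughly 30.4 days
    INSTANCE_COST = 3.00           # t4g.nano monthly cost from the quote, USD

    # Just under 7h18m down (99% monthly availability) -> 10% service credit
    downtime = 7 + 18 / 60                       # 7.3 hours
    share_of_month = downtime / HOURS_PER_MONTH  # ~1% of the month
    credit = 0.10 * INSTANCE_COST                # $0.30
    print(credit, 0.10 / share_of_month)         # ~0.30, ratio ~10x

    # Just under 36h down (95% monthly availability) -> 30% service credit
    share_of_month = 36 / HOURS_PER_MONTH        # ~5% of the month
    print(0.30 * INSTANCE_COST, 0.30 / share_of_month)  # ~0.90, ratio ~6x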


> It sounds like...

How very generous.

The insurance is an interesting idea. Why not have MS/Google/Amazon roll that insurance into their offerings? It makes sense.


All else being equal: those people aren’t going to care that it’s down for you specifically.

Their nines are not your nines.

https://rachelbythebay.com/w/2019/07/15/giant/


I would much (much) rather support a simple server with standard file access than a complex network abstraction.


This is only feasible for applications where latency is not a concern. The overhead of just the HTTP call to an S3 bucket (not to mention all the other bucket-access overhead) is much higher than the overhead of a disk read request. Try performing 1000 random accesses against a bucket and 1000 random accesses against a disk; the performance won't even be close.
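
If you want to measure it rather than argue about it, a crude benchmark along those lines could look like the sketch below; the bucket, key names and local paths are placeholders, and a fair test would also control for object size and OS page-cache effects:

    import random
    import time

    import boto3

    BUCKET = "my-test-bucket"   # hypothetical bucket holding objects key-0 .. key-999
    s3 = boto3.client("s3")

    def bench(label, access, n=1000):
        order = random.sample(range(n), n)   # random access pattern
        start = time.perf_counter()
        for i in order:
            access(i)
        per_call_ms = (time.perf_counter() - start) * 1000 / n
        print(f"{label}: {per_call_ms:.2f} ms per access")

    # Each bucket access is at least one HTTPS round trip
    bench("s3", lambda i: s3.get_object(Bucket=BUCKET, Key=f"key-{i}")["Body"].read())

    # Equivalent files on a local disk
    bench("disk", lambda i: open(f"/tmp/bench/key-{i}", "rb").read())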


I always forget the exact factor, but local disk operations are generally a few orders of magnitude faster than network calls: a random SSD read is on the order of a hundred microseconds, while a round trip to remote object storage is typically tens of milliseconds.



