WD Sets the Record Straight: Lists All Drives That Use Slower SMR Tech (tomshardware.com)
237 points by rbanffy on April 24, 2020 | 147 comments



We had to drag them kicking and screaming, but we got them to do it. Good job internet.

I wonder where they keep the voices of common sense at most companies: the ones who realize a company is stronger in the long term if it is upfront with its customers rather than playing games with misdirection and delay tactics.


By the time the decision is made and the reasoning is something like money, it's already got momentum.

I was in a meeting once where some new support / warranty policies were announced. Now this wasn't a decision-making meeting, so I've no idea what happened there. The policy change was a straight-up violation of some contracts we had with some customers. You didn't have to understand law or anything; it was pretty obvious. I spoke up, if only out of curiosity to see what would happen.

Everyone with any level of power in the room seemed convinced that it couldn't be illegal, because the folks higher up the chain made that decision....

Everyone had delegated common sense to the folks upstairs.

Often when I hear of these decisions I think "Someone had to say something at a meeting, right?" but actually, I suspect that isn't the case.

Epilogue: Company got sued by customers first, then a state AG got in on the action, then the feds came after them. The law firm hired by the company actually quit on them before they finally settled.


That sounds so beautiful


Wow. Love how you buried the lede in the epilogue! ;) Womp, womp.


The voices with common sense are removed vertically and horizontally in the organizational structure.

Concrete example: The compliance department in a different country mandates that all passwords must be salted (great) and use modern hash algorithms from a whitelist. That whitelist consisted mostly of the SHA-2 and RIPEMD families. We were using bcrypt to hash passwords, which is an actual password hashing algorithm instead of a general-purpose hashing algorithm. Since we were organizationally so far removed from the compliance department, even escalating the issue several management levels had no effect other than being told "compliance is important, don't waste time on meetings about this, just change it" and "this is not a hill to die on".

So we worked around this by using scrypt instead (which is not in the whitelist but is based on a hash that is), but the asinine policy is still in place because it was considered too much effort to get them to change it. Of course the next team that wants to implement a web service with passwords will face the very same issue and may take them at their word and use SHA-2 directly. Great success.
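For what it's worth, the workaround is tiny in most languages. Here's a minimal sketch using Python's standard library (the function names and scrypt cost parameters are mine and purely illustrative, not what any particular compliance regime requires):

    import hashlib, hmac, os

    def hash_password(password: str):
        # scrypt uses PBKDF2-HMAC-SHA-256 internally, which is what made the
        # "based on a whitelisted hash" argument workable for us.
        salt = os.urandom(16)
        digest = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
        return salt, digest

    def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
        digest = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
        return hmac.compare_digest(digest, expected)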


They get kicked out because they're not making as much money. Ultimately it goes to the top: CEOs are appointed by shareholders who generally care only about the bottom line, and who have shown themselves to be quite short-termist more often than not.

If we want to change this, then we need to change how control of companies is determined in our economy.


I think there is a perceived need to be short term. Mainly, because most CEOs are appointed. In the case of Microsoft, Amazon, Google, etc. the founders had enough equity to think long term.

Apple is an interesting one, in that it fired Job's. Only to rehire and let the vision go to work, even if it takes a few years. I think that's the exception though, not the rule.


And once they rehired Jobs, they let him stop paying dividends and build up an unbelievable war chest worth of cash, which nominally was against the short term interests of the shareholders.


> Job’s

You just gave me an aneurysm.


I would assume autocorrect is to blam'e


Of course given the premium of their Red Pro line, I probably won't be buying from them... It does make me curious what this will mean for shuckers on the externals... as I would expect SMR to actually be a good use case for most external drives.


It's probably worth changing the link to the official blog post: https://blog.westerndigital.com/wd-red-nas-drives/

Anyway, this is all I wanted from them. "Here are what technologies each drive uses, and we'll be more transparent in the future."

Now I can be confident in what I'm getting when I buy a Western Digital drive. Good for them!


> Now I can be confident in what I'm getting when I buy a Western Digital drive. Good for them!

Are you confident that you'll get a drive that can work in a NAS when on the very same blog page they claim "WD Red HDDs are ideal for home and small businesses using NAS systems" and, indeed, on the data sheet[1], they claim that WD Red drives are "the hard drive of choice for 1 to 8 bay systems"?

At the very least they need to fix the data sheet; that's where I go to assess a drive's capabilities, not a vague blog post.

[1] https://documents.westerndigital.com/content/dam/doc-library...


All this has done is show me that WD are willing to repeatedly lie to their customers about things.

Just because they're no longer going to lie about this specific thing because they got enough bad press about it to force them to be honest doesn't mean they deserve any respect.


Case in point: on April 23, they proposed to retire the only flag that identifies device-managed SMR drives in the new SCSI standard.

Weber, Ralph O (April 23, 2020). "SBC-5, ZBC-2: Obsolete the ZONED field". <https://www.t10.org/cgi-bin/ac.pl?t=d&f=20-054r0.pdf> (registration needed)


Hopefully Seagate and the other manufacturers that've been caught doing this will follow suit. It sucks to be a tech consumer buying expensive hardware like this and not feeling like you can trust the company selling it.


Which is precisely why I was surprised when Western Digital's initial response didn't include a list like this one.

From my perspective as a consumer, WD has turned a disaster into an advantage. I now know that WD drives will disclose which technology they use—just like I know that Apple will bend over backwards to not secretly throttle battery performance. I can't say as much for other manufacturers.


> From my perspective as a consumer, WD has turned a disaster into an advantage.

I disagree. They explicitly denied it for months until the tech press started reporting on it. They showed nothing but contempt for their customers.

Edit: I’m never going to forget this: “Well the higher team contacted me back and informed me that the information I requested about whether or not the WD60EFAX was a SMR or PMR would not be provided to me. They said that information is not disclosed to consumers. LOL. WOW.“

From: https://blocksandfiles.com/2020/04/14/wd-red-nas-drives-shin...


Vote with your money and don’t buy WD.


Who to buy? Seagate did the same thing.


What exact same thing? Using SMR, or using SMR without mentioning it while claiming that the drives were suitable for RAID?


I read here they used SMR without mentioning it and clarified it on their website. Don't know if they said the drives were suitable for RAID.


They only used SMR without disclosing for general purpose drives... while bad, not as bad... their IronWolf and IronWolf Pro (NAS) drives are not SMR.

As I mentioned in another comment, I'm equally pissed WD did this on a Black label drive (albeit the smallest size). Why they would do this in their "high performance" segment is beyond me, and completely unreasonable.


They did not. Seagate did not disclose that some of their Desktop Channel drives were SMR; the NAS drives, aka IronWolf, are all CMR, have always been CMR, and according to Seagate will always be CMR.

https://arstechnica.com/information-technology/2020/04/seaga...


All HDD manufacturers shipped SMR drives without telling anyone, WD was just the one that got the press because they tried lying about it first.


No, Western Digital is the only one that advertised (and continues to advertise!) SMR drives as optimized for RAID. The drives are failing in RAID rebuilds, that's why this is getting press. Seagate and Toshiba are not using SMR in their NAS drives.


WD got press because they shipped SMR drives for a use case that SMR cannot work in, and people's drives failed to work properly.


It makes sense that the parent does not view WD's actions as bad, since they also did not view Apple's actions as bad.

Of course both companies are on my never buy list now, but some people have lower standards.


I disagree. These drives are not fit for installation into a RAID, and even being transparent about that doesn't fix the technical issues with SMR drives in that setting. And yet, WD is still explicitly advertising them for that purpose. From their site:

"Reliable

Designed to operate in the always-on environment of a NAS or RAID configuration"

So right now, today, they are steering people to buy these drives for an application that they are 100% not fit for. To me, that's still a disaster.


That's certainly an interpretation.

From my point of view, by doing the wrong thing initially, and then (especially) by persisting in it for so long, they've just turned a disaster into a bigger disaster.

I haven't walked away from this feeling like WD is fundamentally honest and open and can be trusted to disclose critical information without prompting; I walked away feeling like WD is fundamentally dishonest, and will lie even after being caught. "They stopped lying eventually!" is not a strong defense.

> I now know that WD drives will disclose which technology they use

Do you? I certainly don't. All I know is they have made, under extreme pressure, a once-off disclosure of unknown accuracy around their current drive models.


You appear to be under the mistaken perception that SMR drives are an appropriate product for some markets. SMR drives are not suitable for any market. SMR drives were supposed to cut costs by 25+%, but the savings never materialized. Instead, we now have a situation where products with a complete nightmare of a performance profile are being forced into consumers' hands, when an educated consumer would reject the product outright if they knew what they were buying.

Sorry, SMR HDDs are junk. The sooner the industry realizes that they're not suitable for any purpose, the better. Having a device where 4KB writes sometimes take seconds to complete is just plain unacceptable in 2020.


What if you're trying to fit a large amount of data into the smallest physical form factor possible?

What if you have data you want to write once and then store for an extended period?

What if you're just on an ultra low budget?


The highest capacity drives are not SMR, they're CMR. Why? Because nobody who stores large amounts of data wants to buy SMR drives. SMR drives are garbage and will continue to remain garbage. They're a dead end technology.

For those wondering why SMR drives are so bad, consider the following: SMR drives basically make the same trade-off that flash makes, but with a magnetic media that has a significant seek time and bandwidth penalty compared to flash. Any time a non-sequential write is made to an SMR drive, the whole "block" (which can be many megabytes in size) needs to be read from the disk into memory, written out somewhere else on the disk after which you can finally perform the equivalent of a flash sector erase and start writing data out sequentially again. The catch is that your hard drive can only perform sequential i/o at maybe 200MB/s vs the 3GB/s flash can. Flash sectors are also measured in kilobytes (128KB is a common size), while SMR "sectors" can end up being 16-128MB in size. Do the math. The latency is a disaster as we're talking hundreds to thousands of milliseconds of latency to do an erase. Flash can erase a sector in a millisecond, and you can have multiple erases occurring in parallel across multiple planes and dies. An SMR drive can have exactly 1 erase operation in flight at a time.
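To put rough numbers on "do the math" (a back-of-envelope sketch; the zone size, seek time, and throughput figures below are assumptions, not measurements of any specific drive):

    # Rough cost of one read-modify-write cycle on an SMR zone vs. a flash block.
    zone_mb = 64          # assumed SMR zone size (zones reportedly range ~16-256 MB)
    hdd_mb_s = 200        # typical HDD sequential throughput
    seek_ms = 10          # one seek for the read pass, one for the write pass

    smr_rmw_ms = 2 * (zone_mb / hdd_mb_s) * 1000 + 2 * seek_ms
    print(f"SMR zone rewrite: ~{smr_rmw_ms:.0f} ms")   # ~660 ms for one small random write

    flash_block_kb = 128
    flash_mb_s = 3000
    flash_ms = (flash_block_kb / 1024) / flash_mb_s * 1000 + 1   # ~1 ms for the erase itself
    print(f"Flash block rewrite: ~{flash_ms:.2f} ms")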

If SMR drives were of any value, you'd see them at the high end of the capacity spectrum. It's funny how those products don't seem to exist.


But we do use SMR drives at the high end of the capacity spectrum! Dropbox uses them heavily in their data centers[1], estimating that 40% of their data would be stored on SMR drives at the end of 2019.

Now, that's a data center, but don't consumers sometimes have similar cold-storage use cases? As long as they know what they're buying?

[1] https://dropbox.tech/infrastructure/smr-what-we-learned-in-o...


That reads like a fluff piece that just doesn't add up. Dropbox mentions testing 14TB SMR drives in the post from June 12 2018 https://www.anandtech.com/show/15457/western-digital-roadmap... . If I have a look on Amazon right now, the largest HDDs are 16TB CMR drives, not SMR.

Anandtech has a post based off of WD's press release from December 23rd 2019 here https://www.anandtech.com/show/15457/western-digital-roadmap... . The energy assisted MAMR drive is 18TB in capacity, while the SMR drive is 20TB in capacity. Where's the 25% capacity benefit? Shouldn't the SMR drive offer a capacity of 22.5TB?

Have a look at this https://www.anandtech.com/show/15457/western-digital-roadmap... article at Anandtech posted January 31st 2020. Note the side by side comparisons of projected growth of SMR drive capacity shipments. SMR growth projections are always right around the corner in the next 1-2 years. Why is this? Quite simply the market doesn't want SMR. Non-SMR technology continues to improve, eliminating SMR's capacity advantage after a period of time without SMR's orders of magnitude performance penalties.

If SMR was more than just a one-time 25% capacity advantage, it might be worth buying into. But it isn't. It's a technology in search of a market that just doesn't exist because of all the downsides. HAMR and MAMR are the promising technologies to increase HDD capacity, not SMR.


WD black or NAS drives don't really fit the ultra low budget criteria.


You buy tape.


Good point. I actually just discovered via a link in the TomsHardware article that the Seagate drive I purchased a couple weeks ago is, in fact, SMR. While I shouldn't have issues with it because my workloads aren't particularly write-heavy, this has definitely not helped my trust in Seagate.


That's one reaction. Another reaction might be analogous to Dieselgate - someone might regard this as reflecting on all drives with spinning platters.

I tend to think the revealing of dishonesty has negative externalities as a general rule. Even though of course we would rather know than not.



"Now" yes, not sure if that (the matrix) will change in a few weeks => then, once the "time"-factor plays a role, it will become difficult again :(

They should just state it clearly in the drive's specs / model ID.


They said:

> We will update our marketing materials, as well as provide more information about SMR technology, including benchmarks and ideal use cases.

I interpreted that to mean "Future SMR drives will disclose their use of the technology". We'll of course have to see whether they follow through, but for now they're doing the right things.


Ok, you're right, I didn't notice that sentence. We'll see... :)


Click through for the PDF on their site to confirm based on model numbers whether a drive will be SMR. The capacity-based list on their site seems to be for models in current production, but there are still plenty of "old" drives available with those capacities that are CMR. So be sure to look closely at the model numbers!


If you drop an email to the HN mods using the Contact footer link, they can definitely fix that.


What is crazy about this is that there are real technical reasons why you shouldn't put an SMR drive (even with drive-managed SMR) in a generic RAID. The fact that they thought they could get away with this for a _NAS_-marketed drive means that the technical people are being kept far away from the actual marketing/sales people. You wonder what other gotchas they have left lying around.


Why would a WD "BLACK" that is supposed to be high performance ever use SMR? Of course if the product placement of high/low performance makes no sense, how are people supposed to know what to buy?


As I understand it they've borrowed techniques from SSD firmware. Random writes are written to a linear cache area of the drive and transparently remapped. When the drive gets idle time it will rewrite the SMR zones where those blocks are supposed to go.

As a result the write speed of the drive is even faster than a non-SMR drive until the cache area fills up. If your workload does not try to write gigabytes of random junk without giving the drive any idle time, it will work great.
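A toy model of what I understand the firmware to be doing (purely my guess at the mechanism, not WD's actual design; the class, names, and numbers are made up):

    # Toy model of drive-managed SMR: random writes are staged in a small
    # CMR cache area and remapped; idle time flushes them into SMR zones.
    class ToyDmSmrDrive:
        def __init__(self, cache_blocks=5_000_000):    # e.g. ~20 GB of 4K blocks
            self.cache = {}                            # LBA -> staged data
            self.cache_blocks = cache_blocks

        def write(self, lba, data):
            if len(self.cache) < self.cache_blocks:
                self.cache[lba] = data                 # fast path: sequential append to CMR area
                return "fast"
            # Cache full: the drive must read-modify-write whole SMR zones before
            # it can accept more data; this is where the long stalls come from.
            self.flush_to_smr_zones()
            self.cache[lba] = data
            return "slow"

        def idle(self):
            self.flush_to_smr_zones()                  # background cleanup while the host is quiet

        def flush_to_smr_zones(self):
            self.cache.clear()                         # stands in for the expensive zone rewrites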


Let's re-encode some video files and see how that works out...

edit: or bring down some large CAD files from the server to work on locally...

The fact is, it's labelled a performance drive, much like the Red drives... SMR should not even be an option here... sure, add the extra smart caching/remapping, but you still want performance for larger files.


> As a result the write speed of the drive is even faster than a non-SMR drive until the cache area fills up

I'm interested if there's a benchmark showing that?


Could the cache be flash based?


No, cache is reserved as a non-SMR area on platters.


can't they employ this technology for CMR drives? in theory the window of time it takes to write will be even smaller...


Weren't the "Hybrid" drives just that?


Apparently, they don't sell any 2.5″ 7mm CMR drives with 1 TB+. Still weird and misleading that they call it "Black", obviously.


Curious, is this because they can't physically fit such a large CMR drive in that form factor? What's the actual capacity/size limit?


There are 2TB 7mm SMR drives, so ~1.8TB CMR should be easily doable.

I think there were 1TB 7mm CMR drives available as far back as 5-6 years ago, although I'm not sure about the 7mm part.


Interesting. Regarding 1.8 TB though, that's a weird size to put on a package, so I can see why they wouldn't do that.


Sorry, I meant up to ~1.8TB. When that's possible, 1TB is trivial.

Regarding your side point, there existed 320GB, 750GB and 1.5TB hard drives, so 1.5TB or even 1.75TB would not be a weird size for an SKU.


This looks like another non-apology from a PR person, just a slightly more competent PR person than whoever wrote the epic failure that was their previous blog post on this subject.

If someone tries to steal your car and gets caught, then says they only meant to steal your TV, and then when challenged says they understand that they should have been more upfront with you that they were going to try to harm you, they are not a good person. Sadly, I don't perceive any more legitimate remorse or sincerity about fixing the root cause here than there was last time.

The fact is that right now, several days after this all kicked off, if you go to WD's site and choose Products and then NAS internal drives from the menu, it still links to Reds and says (in very big letters, repeatedly) things like:

"Designed and Optimized for NAS Compatibility"

and

"WD Red Provides Storage Compatible with Leading NAS Systems"

Based on that, it appears that they are still selling drives that are not fit for their stated purpose. Everything else is just spin.


All I want to see is "WD recalls SMR drives for all customers who affirm that the drive gave them trouble, gives full retail price credit toward any other WD drive purchase". A percentage bonus for our trouble would be nice but unlikely.

It's not perfect, but it's a compromise. People who care get new drives; WD pays a cost for their mistake.


It shouldn't be that I have to affirm that a drive has already given me trouble. I should just have to show that I'm using a drive in an application where I would have problems in the future.

Product safety recalls don't require you to get food poisoning or take shrapnel in your face from a bad airbag before fixing the issue. I don't want to wait until my RAID fails a rebuild and I lose data.


Are SMR drives actually more likely to lose data than CMR drives? I was under the impression they had performance issues, but were reliable with regard to data integrity.


That depends on your situation. If you have a single redundant drive in your storage system and you lose a drive, and if a rebuild takes you multiple weeks instead of hours and during that time a second drive fails, you are going to lose the whole system and need to restore everything from backups. That's a pretty serious flaw in a drive advertised specifically for use in these systems, when reducing the risk of that downtime is the purpose of having the system in the first place.


They wrote:

> If you have purchased a drive, please call our customer care if you are experiencing performance or any other technical issues. We will have options for you. We are here to help.


I just attempted to call their customer service. The representative had no idea what SMR was.

After I asked to be transferred to someone else, the representative told me that "WD has no official answer yet and to call again in a day or two", I pointed out the blog post that specifically directs customers to call their hotline, but then I got back into the loop of "WD has no official answer yet and to call again in a day or two"....


Which is just smoke and mirrors. I bet you they will push back hard.

If they were here to help they would have done so already. It’s just a distraction.


Shouldn't we wait and see what reports are before jumping to that conclusion? I'm generally willing to assume good faith until a company gives me reason to believe otherwise.

Expecting them to proactively contact customers and to recall all existing drives seems like an unreasonable ask.


> I'm generally willing to assume good faith until a company gives me reason to believe otherwise.

I think the point is that in this case, WD have already given us plenty of reason to believe otherwise. There isn't much doubt left now to give them its benefit.


What would you expect a sufficient response to look like?

As I said above, I don't think it's reasonable to ask WD to proactively contact everyone who bought one of these drives, so a message on their official corporate blog which reads "please reach out to customer care if you're experiencing performance problems" seems like exactly the right action to take.

So, that's the lens I'm looking through: what I'm seeing, versus what would I expect to see from a company reacting appropriately. Right now, they match.

Now, if WD had a history of posting such explicit, official messages publicly and then refusing to help customers privately, I would no longer be willing to give WD the benefit of the doubt. As far as I'm aware, we're not there yet.

I would present Facebook as an example of a company that has lost this trust. I unfortunately can no longer believe a single thing they say in any capacity. They have issued far, far too many apologies.


When the GTX 970 was released and it turned out that 500MB of its 4GB memory ran at a much slower speed, they were sued. They settled the suit and paid out $30/card.

I think WD ought to offer some compensation for everyone who bought drives with worse performance than advertised. Likely they will be sued and be forced to pay something like that anyways.


> What would you expect a sufficient response to look like?

I'm not sure there is only one good answer to that, but I'm quite sure that any good answer would involve not still having the broken product and misleading presentation all over the WD website right now.


We have seen their reaction so far. Why keep having faith in them given their behavior so far?


I think it's a little disingenuous to compare it to theft. They left some information out of their datasheet that was relevant to people. They said some really stupid stuff when they got called out on it. And I bet if you had access to the internal communications of WD after that blog post you'd see a bunch of big direct customers threatening to pull blanket POs for WD Reds unless WD added some information to their data sheets.

All I see is a company playing fast and loose with the specs, and their marketing people not understanding why SMR would not be fit for purpose. I would bet WD doesn't have a US-based tech team making those low-level decisions. They probably outsource all of it to their contract manufacturers that make the actual drives. And since the CM isn't responsible for marketing, they were like "you can save $5.00 per drive if you use this new recording technology." The marketing people have no idea what the trade offs are and the CM doesn't care.

Then 2 years later when the first batch of these drives start failing, the marketing person who made the original decision isn't there, the CM doesn't have any record of any of the conversations, and nobody has any idea what's going on. So in a bid to try to manage the conversation without actually knowing anything their marketing people release that first blog article.

Then the customer outcry happens, people threaten to pull blanket POs, and finally the marketer realizes they have to do some real work because it's not just the angry grognard brigade making noise.

Most modern companies that sell commodity electronics work this way.


To clarify, I wasn't intending to equate what WD are doing with theft. I was looking for analogy to illustrate the failure in trying to do something bad, trying to do essentially the same bad thing but to a slightly lesser degree, and then pretending that admitting to this makes it somehow better while you're still in fact doing the bad thing.

I think "left some information out of their datasheet that is relevant" is being rather kind here. They were advertising these products for one specific purpose, and the change in spec makes them unsuitable for that purpose. They then tried to upsell to a more expensive product that retained the original spec and was still presumably suitable for that purpose in their initial non-apology apology.

I understand the mechanics of how this can happen, but the fact is that this isn't the customer's fault. It's a failure of basic management and leadership skills, and criticising the company for that is entirely fair.


"unsuitable for that purpose"

"WESTERN DIGITAL DOES NOT PROVIDE ANY OTHER WARRANTIES OF ANY KIND, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OR CONDITIONS OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE"


Consumer rights laws in various places now render such statements impotent. You don't get to make claims about what your product is for or how it performs in nice big letters on your website and then hide behind some legal weasel words on page 93 of an agreement no customer is ever likely to read.


Of course any major end customer or OEM are definitely going to know if the drives are SMR. Do you really think they’re going to slip that by EMC or AWS? Those customers are intimately familiar with the technologies used in these drives as well as technology development.


I can’t believe they put SMR into a WD Black drive considering its “performance” designation.


Apparently nothing is sacred if they can squeeze a few bucks out of it


It's interesting that the Purple (surveillance) line are all listed as CMR since I was just looking up Seagate's Surveillance drives, which are apparently SMR.

Might grab me a couple of Purples as they're more available locally to me than Seagate's IronWolf CMR NAS drives.


I would of thought the Purples would have been one of the least bad places for SMR in consumer drives, since it's pretty much just all sequential load.

On the other hand, the Black with SMR is just shocking, considering that's their "performance" line.


> I would of thought the Purples would have been one of the least bad places for SMR in consumer drives, since it's pretty much just all sequential load.

I've tried out a couple Seagate SMR drives for NVR use. They weren't advertised as SMR but I figured it out from the observed performance and the "rated workload" on the spec sheet.

It's okay if you stay within the rated workload, but that's pretty limiting. This is given in the spec sheet in terabytes per year; if you divide by 4 you have approximately Mbps. One of my drives (ST8000DM004-2CX188) is rated for 55 TB/year, or 14 Mbps. It seems fine with 2 cameras that each have a <= 6 Mbps main stream and <= 1 Mbps sub stream.
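The divide-by-4 rule of thumb checks out; a quick sanity check (just arithmetic, nothing drive-specific beyond the 55 TB/year rating):

    # Convert a "rated workload" in TB/year to a sustained bitrate in Mbps.
    tb_per_year = 55
    mbps = tb_per_year * 1e12 * 8 / (365 * 24 * 3600) / 1e6
    print(f"{tb_per_year} TB/year ~= {mbps:.1f} Mbps")   # ~13.9 Mbps, i.e. roughly TB/year / 4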

But if you want to record a bunch of streams to one drive (perfectly reasonable with CMR), these can't do it. When device-managed SMR is used with a standard filesystem (I'm using ext4) and non-SMR-aware software/write patterns (my software just writes new files with ~1-minute segments of H.264 and deletes old ones), the write amplification gets bad and the drive can't keep up. The software just gets stuck waiting for writes/syncs to go through. And it's really confusing to observe on the host because all the mapping and stuff is hidden from you. You just see in "iostat" that the "%util" is much higher than what you'd calculate based on seek + throughput for the data you're transferring.

I imagine with host-managed SMR and SMR-aware software, this could be much better. Most NVRs mostly just record in a ring buffer—for a given stream, they almost always overwrite the oldest data. You could bypass the whole CMR cache area and not read/rewrite any adjacent areas if you do it well, and then the write amplification would go away.


Surveillance is sequential write load with hard real-time requirements, though. The unpredictable delays inherent to SMR could present serious problems.


No, the write load is nearly constant. With a fixed-size buffer in the drive, you're pretty close to optimal use case for SMR. There's no source of variance except for where the SMR mechanism adds it, and that's analogous to a garbage collector sweep. Size and frequency depend on dirty rate.


Even on a freshly formatted drive, you'd want your write rate to stay below what the SMR rewrite process can handle. If the drive fills up and it needs to delete something to make room before writing new video, its ability to keep up with the incoming video data is going to be reduced.

Lots of recording systems won't record unless the camera sees motion. Or they record at a much lower frame rate until motion is detected & then go into high speed mode to capture fine details. So that helps, and the CMR buffer also helps. But if the recording fills up the CMR buffer, you're back to incoming data potentially arriving faster than SMR can rewrite.

Time to drop back to 1080 HD cameras from 4k cameras maybe. Or reduce the number of cameras.

edit This is comparing an SMR drive to a similar CMR drive. Obviously, any drive can run into trouble if you exceed its write capabilities. But an SMR drive's write rate is going to be lower than a CMR drive's once its buffer area is full, and especially if you're doing random writes.


The write load will be nearly constant but it won’t be sequential. The host device is going to have to update file system data structures periodically as it is writing the video stream; those structures will be at various different places on the drive. So the write load will look like 99% sequential, 1% random. That small amount of random writes in a non-contiguous stream is enough to cause SMR-related rewrite stuttering, especially since with DMSMR the host has no idea how logical blocks presented by the drive map to SMR regions.


> all sequential load

Surveillance data is recorded on top of the old data, which means all writes are rewrites, and that is the worst-case scenario for an SMR drive.

A host-managed drive with carefully aligned zones may avoid this issue in this scenario, but I don't know how far a drive-managed SMR drive can mitigate it.


These drives don't even support TRIM. So I think as far as they're concerned, once the drive is filled, every write is a rewrite no matter the exact workload. And I think the concept of drop-in, device-managed SMR that tries to hide what it's doing is nowhere near as good as what's possible with host management in terms of the write load it can sustain.


It looks like it was just the 1TB Black. Still... I would have never suspected they would use SMR in the Black line. Truth be told, it was the last thing I expected in the Red.


Why you would buy a spinning rust drive that's only 1TB if you care about performance is a big mystery. 1TB is available in not overly expensive SSDs now. That is a drive with almost no use case. A quick search suggests it's about $100 cheaper than the SSD, but unless your budget is absolutely inflexible that extra $100 would be incredibly well spent on upgrading the drive. More so than almost any other place on the machine.


I'd never consider buying one new, but I do have some 2-4TB Blacks that I had for Steam. Funny how terribly unloved the old 15k 300GB VelociRaptor sitting next to me is, replaced by something costing about eight quality beers.

Had I a stack of them, I would have considered the black line safe for home NAS. Apparently not, now that I know.


If they're old drives they probably don't use SMR.


You can probably tune how much of a hit SMR is by changing the size of the CMR region. The more CMR area you have, the more you can sustain bursts of writes without needing to reduce performance while recording recent writes into their respective final resting places. It's more or less the same logic as hybrid drives, with the difference that reading from CMR is not faster than from SMR, so you can deallocate already-SMR'ed information immediately after a successful write.


When you rebuild a drive in a RAID array the size of the CMR region doesn't matter though, your array is either going to explode due to the latency spikes that come or take weeks to rebuild.


Isn't it possible to query the drive and, if it is SMR, just send the data to the disk in the order it expects so that it can resilver the drive one SMR region at a time?

This seems more like a misunderstanding between the drive and the OS in that the drive doesn't tell the OS it'll need a break between write batches and an OS that tries to write to the disk in a way that's not compatible with the device technology.

The same more or less applies to SSDs - the drive can't rewrite a single disk block, only a whole flash block at once.


‘would of’ followed by ‘would have’ in the same sentence! you’re grammar trolling, aren’t you?


> one of the least bad places for SMR

uh, except surveillance usage is often RAIDed for redundancy.


This aggressive segmenting of drives into colors/use cases just serves to confuse me. I wish they'd just clearly state on boxes exactly what feature set the drives have - i.e. "vibration sensor, sustained write speed, SMR/CMR" etc


According to the website, they are designed to run cooler and with lower power, which may negatively affect the performance in relation to the more common Blue ones.


I don't understand why the WD Red 2-6 TB range was specifically assigned to SMR, not the smaller or larger drives in that line.

Is it especially price competitive?


Just spitballing, but it's probably the size range where the difference between one and two physical platters is whether they use SMR or not. A cost-cutting measure.


But also they've segmented away business users to Red Pro, Gold and Ultrastar drives for a few years, so I'd imagine sales of Reds today mostly fall in that range so more motivation to shave costs. With SSD pricing these days, <2TB hard drives just don't really make sense, but also I'm guessing very few home users need/want more than 8TB of storage.


There's a large market for "power" users who want high specs on paper but don't need them. The market response is to fake the specs or make compromising tradeoffs in lesser known specs. Often it works out fine and everyone is happy, but sometimes a bunch of customer use cases get broken.


That sounds right. I got a WD60EFAX and a Seagate ST6000VN0033 a few months ago, and the WD is significantly lighter than the Seagate.


Because they assume anyone buying a drive that small is price-sensitive. Except they didn't actually lower the price when they crippled the drives with SMR.


My theory would be that they’re the highest volume sellers, and where the margins are thinnest.


Could be a combination of yields and binning.


Is the core issue with RAID rebuilds that the RAID rebuild timeouts are too low for what happens when an SMR drive's buffer overflows during a rebuild?

Are these "5 seconds", or "60 seconds", or "300+ seconds" timeouts that are triggering the rebuild failure?


It sounds like at least 60 seconds. Even with no timeouts RAID rebuilds go from a day with CMR to over a week with SMR.


Okay, so they do succeed, they just take a long time. Good to know! Thanks.


Only if your RAID software/hardware is exceptionally tolerant of the drive.

Don't forget the reason "NAS" drives exist in the first place is that several years ago drive manufacturers added a feature where if a read failed they would go into an extremely thorough but long (60+ second) recovery effort to get the sector back. RAID controllers would just see the drive stop responding to commands and mark it dead. So the NAS drives come with firmware that doesn't do the extreme recovery and instead just returns "read error" and lets the RAID controller rebuild it with the parity information.

If the drives go out to lunch due to an SMR writeback bottleneck then they will have lost their main selling point. Presumably in the normal case the drive will write the data just fine, but at a slower rate, so you can rebuild your array but it will take all week. However, if one of the sectors fails the CRC check after the write and it has to try several times to get it, I can definitely see the RAID controller getting frustrated and kicking it out.

I would be interested to see if any RAID software comes with a "SMR" mode where if a drive stops responding to commands during a rebuild the controller lets the drive take a 20 minute break before resuming the rebuild.


> So the NAS drives come with firmware that doesn't do the extreme recovery and instead just returns "read error" and lets the RAID controller rebuild it with the parity information.

Hang on a sec. Is this documented somewhere?

I bought a WD Red to plug into my Raspberry Pi which I use as a file server. There's no RAID, just the one disk. I thought I was buying a more energy efficient or bulk-storage-oriented drive.

But if what you say is true, then the "NAS" or "Red" drives should _never_ be used outside of a RAID because robust error correction was removed from them by design. Do I have that right?


That's exactly correct: see https://en.wikipedia.org/wiki/Error_recovery_control for details.

Basically, NAS drives have a hard limit on how long they'll try to recover from errors before just reporting the failure back to the RAID controller so that it can handle them.


Yes, that's the right idea. NAS/RAID drives have a different error recovery strategy, because the assumption is that they'll be part of a multi-drive arrangement where failing fast (and allowing the containing system to handle recovery) is preferable to avoiding failure if at all possible (but potentially taking a long time and thus causing the containing system to think the drive has stopped functioning properly and fail the whole thing out). I can't point you to any specific documentation off the top of my head, but this is a well-known position that I've seen described explicitly several times.

I'm afraid that does mean your choice of a Red for a single-disk system was not ideal. Presumably you keep backups of any valuable data anyway, but if downtime for recovery would be a significant problem for you then you might want to consider replacing that drive with something more suitable for your situation.


I should hand in my geek card, this feels like something I should have known about. In my defense, though, the HD manufacturers offer little to no information about the _technical_ differences between their drive lines. All of their documentation just says, "designed for X use case".

I do have backups, that's not my concern. My concern is that _when_ there is a read/write error (which are completely normal events with today's hard drive technology), the drive just gives up right away instead of making a few attempts. This could easily translate into (silently!) lost data in a single-disk scenario.


If one uses ZFS, one can instruct ZFS to keep multiple copies of the data. It will try to spread those copies among multiple disks, but in single-disk systems it will just spread the duplicate blocks over that disk.

Since ZFS does checksum verification on every read, it has a much better chance of recovering from a few bad sectors.

Downside though is that the default RPi installs are 32bit and ZFS was written with 64bit-only in mind, and AFAIK there are still some issues and limitations when running on a 32bit system.


You can set the TLER value via smartctl, though that might not work through a USB interface.

smartctl -l scterc,<READTIME>,<WRITETIME> /dev/xxxx

WD Red drives should retain this setting across reboots. Some drives don't, and some don't support the command.


I think this is called TLER (Time Limited Error Recovery.)

https://en.wikipedia.org/wiki/Error_recovery_control


If you are ever in a situation where this happens, the drive is end of life and should be tossed, and the new one rebuilt from backups. You do have backups, right? Drives fail without notice often.


Wait, if it's a NAS drive, the drive firmware will ensure that it doesn't timeout due to media failure. Which the RAID can trust, because it's a NAS drive.

So.. why do the RAID rebuilds have timeouts on NAS drives at all? If you paid all that extra money for a special firmware that doesn't time out on media error, and the drive is still accepting and processing commands in less than X hours per command, then wiring in your own timeout seems like a really bad idea.

When the cache is full and something sends a write to the drive, does the drive still accept "are you still there?" commands while the write is queued?


> Which the RAID can trust, because it's a NAS drive.

There's the problem right there! :) Drives aren't trustworthy, regardless of label.


So the raid software thinks it knows better than the drive firmware, ignores the fact that it's operating a drive with no I/O timeouts, and helpfully times out the drive from the array because obviously it's not behaving 'correctly' in line with the unverified assumptions of the RAID software?

It reads to me like the fault here isn't just on the hard drive manufacturers, like everyone's made it appear in top-level comments of both issues about it this week. I'm glad I asked more questions so that I'm better informed to help my friends when they encounter this. I appreciate everyone in the thread offering help with the details.


There've been lots of detailed reports about rebuilds not succeeding. For example: https://github.com/openzfs/zfs/issues/10214


This seems like a different problem than people are making it out to be.

SMR drives have slow random writes and paper over it with caching, until the cache gets full. Then in theory what should happen is that the actual write speed of the drive is exposed. That means resilvers would take a long time, but they should still finish.

What seems to be actually happening is that some of these drives have a firmware bug such that when caching is enabled and the cache gets full, the write speed drops to zero. The system then regards the drive as faulty and boots it out.

So it seems like this should be solvable with a firmware update that causes the drive to behave differently (slow rather than stopped) when the cache gets full.

This also implies that some other SMR drives with different firmware might not behave like this, and that it might not happen with ordinary RAID rebuilds as opposed to ZFS resilvering because then the writes should be almost entirely sequential (i.e. what SMR is good at) as opposed to ZFS which is more random.


It doesn't have to stop completely to trigger the problem case, does it? It seems like just being slow enough to trigger the containing system's timeout response would be enough.


SMR drives shouldn't be that slow. They're slower than PMR drives but it shouldn't be by so much that individual writes are taking tens of seconds.

What's probably happening is one of two things. Either the cache gets full and the drive is blocking while it flushes the cache to the disk, or the drive is advertising that it can do a large number of simultaneous write operations which it can't actually do all of in a reasonable amount of time when the cache is full, and then the ones queued last time out. In the first case they could have the firmware continue to process uncached writes when the cache is full, in the second case don't advertise the ability to do as many simultaneous writes.

Another alternative might be for the system to use a longer timeout value for these drives, but whether that's reasonable depends on how long it would actually have to be.


This is precisely the origin of my question upstream: Why, precisely, in specific detail, are these rebuilds failing?


> "...Thank you for letting us know how we can do better. We will update our marketing materials, as well as provide more information about SMR technology, including benchmarks and ideal use cases."

I'd love to see the CEO read that with a straight face.


CEOs become CEOs for a reason.


For those unwilling to disable their Adblocker, here is the official WD blog post with PDF containing make and model numbers:

https://blog.westerndigital.com/wd-red-nas-drives/


I had zero issues with uBlock Origin on the Tom's Hardware site. I do also use Privacy Badger, though I doubt that would make it more likely to work.

One thing I have started doing is running NoScript but by default I allow scripts. But when a site puts up notices about adblocking or other annoying things I disable scripts for them. Surprisingly it often allows me to read the website without issues. I did not do that with Tom's Hardware though.


I am using uBO. Blocking 3rd party scripts is what triggers it.


If you can't view the site with your adblocker then you're using a bad adblocker. uBlock Origin doesn't have any problems.


Sorry, but Black and Red labelled drives should NEVER be SMR... they're specifically labelled for a purpose... Blue, sure... but their High Performance Black, and NAS labelled drives should not be using this tech.


I am annoyed that they missed an opportunity here to fix SMR support in the non-GooFaceAmaSoft DC ecosystem.

Announce a trade-in for anyone who has a Red and has a problem with device-managed SMR, with a CMR Red. Many, many people won't notice or care (RAID1 users who don't do massive writes). Then reflash the SMR Reds with the host-managed firmware they already have for the megascalers (who of course have their own highly proprietary software management layers to handle SMR) and sell those at some deep discount to anyone who wants them.

Announce a $100,000 bounty to anyone who adds host-managed SMR support (with some slightly ambitious perf target vs CMR) to ZFS, bcachefs, whatever.

They come out of this looking good (eventually) instead of looking like complete assholes. And they end up with an expanded ecosystem that allows them to create a new product family using firmware work that they've already done years ago!


Drive-managed SMR is borderline. Host-managed is stupid and only worth the effort if you're a hyperscaler with a particular workload. A workload that happens to look a lot like old-school tape workloads, which is probably what they should have sold SMR as: hard drives that look like tapes.

Which is why, beyond some basic hinting, it's not worth putting in mainstream filesystems.


> Hard drives that look like tapes.

Tapes don't have fast (by spinning rust standards) random reads.

Hinting in ext4 or NTFS is of course a waste of time, but CoW filesystems like ZFS or btrfs are in a much better position to handle host-managed SMR.

And a generic "SMRTL" layer that utilizes (hopefully mirrored) SSDs as a write staging area could do massively better than the dumb host-managed SMR on these drives.


I really want to hear from someone inside WD (outside the marketing/PR department). The level of organizational dysfunction that must exist for this to have happened would be a real spectacle.


>Some users claim that SMR drives also do not work correctly when rebuilding ZFS arrays,

Is this true? On the assumption that the drive firmware is working correctly as intended, why would this be a problem?

I have not been following the problem closely, as I don't quite understand what the fuss is about. I expect HDDs to be slow, so that's not much of a problem to me. I only care about reliability. (Which is something that theoretically SMR should be worse at; there's no data to back that up yet.)


>I expect HDD to be slow

The problem is how slow. 10MB/s sounds ok since it's faster than many home internet connections. But it's going to take you 3 weeks to re-build a 15TB drive in a raid array. Much longer again if the array is used at all (causing non-linear access as it has to rewrite changed blocks.)

https://news.ycombinator.com/item?id=15623937
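A rough sanity check on that rebuild time (assuming a sustained 10 MB/s; the degraded write rate is an assumption, not a measured figure):

    # How long a 15 TB rebuild takes if SMR write-back throttles the drive to ~10 MB/s.
    capacity_tb = 15
    throughput_mb_s = 10
    seconds = capacity_tb * 1e6 / throughput_mb_s    # TB -> MB
    print(f"~{seconds / 86400:.1f} days")            # ~17 days, in the ballpark of three weeks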


And not knowing that rebuild time can cause you to incorrectly allocate physical drives to the array (based on AFR, size, and rebuild speeds). Changing the drive allocation for zfs requires destruction of the array.


How can I tell whether a non-WD drive - say, Seagate - uses SMR or CMR?



Dang, this is confusing, but it shouldn't be a dupe even though the URL is the same!

WD added new information. Their previous post didn't say which drives actually used SMR, and took a considerably different tone.

Why WD decided to keep the same URL is anyone's guess, but it's different content now.


Thanks for clearing that up! I've taken "[dupe]" off the submission.


Thanks, it's still worth changing the link though! Not sure what to do with the title.



