Re: ZFS + mysql appears to be killing my SSD's
Date: Mon, 05 Jul 2021 15:09:37 UTC
On 7/5/2021 10:30, Pete French wrote:
>
> On 05/07/2021 14:37, Stefan Esser wrote:
>> Hi Pete,
>>
>> have you checked the drive state and statistics with smartctl?
>
> Hi, thanks for the reply - yes, I did check the statistics, and they
> don't make a lot of sense. I was just looking at them again in fact.
>
> So, one of the machines that we changed a drive on when this first
> started, which was 4 weeks ago:
>
> root@telehouse04:/home/webadmin # smartctl -a /dev/ada0 | grep Perc
> 169 Remaining_Lifetime_Perc 0x0000   082   082   000   Old_age   Offline   -   82
> root@telehouse04:/home/webadmin # smartctl -a /dev/ada1 | grep Perc
> 202 Percent_Lifetime_Remain 0x0030   100   100   001   Old_age   Offline   -   0
>
> Now, from that you might think the second drive was the one changed, but
> no - it's the first one, which is now at 82% lifetime remaining! The
> other drive, still at 100%, has been in there a year. The drives are
> from different manufacturers, which makes comparing most of the numbers
> tricky, unfortunately.
>
> I am now even more worried than when I sent the first email - if that
> 18% is accurate then I am going to be doing this again in another 4
> months, and that's not sustainable. It also looks as if this problem
> has got a lot worse recently, though I wasn't looking at the numbers
> before, only noticing the failures. If I look at the 'Percentage Used
> Endurance Indicator' instead of the 'Percent_Lifetime_Remain' value
> then I see some of those well over 200%. That value is, on the newer
> drives, 100 minus the 'Percent_Lifetime_Remain' value, so I guess they
> have the same underlying metric.
>
> I didn't mention in my original email, but I am encrypting these with
> geli. Does geli do any write amplification at all? That might explain
> the high write volumes...
>
> -pete.

As noted elsewhere, assuming ashift=12 the answer on write amplification
is no. Geli should be initialized with -s 4096; I'm assuming you did that?

I have a 5-unit geli-encrypted root pool, all Intel 240 GB SSDs. They do
not report remaining lifetime via SMART, but they do report indications
of trouble.
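As a quick sanity check, both settings can be read back from a live
system. This is only a sketch - the pool name (zsr) and provider
(ada0p4) are from my own configuration further down, so substitute yours:

    # ashift per vdev; 12 means the pool issues 4 KiB-aligned writes
    zdb -C zsr | grep ashift

    # each geli provider should report Sectorsize: 4096
    geli list ada0p4.eli | grep -i sectorsize

    # and when a provider is first created, set the sector size explicitly:
    #   geli init -s 4096 /dev/ada0p4

If ashift comes back as 9, or geli is presenting 512-byte sectors, sub-4K
writes land misaligned and the drive's controller has to do read-modify-write
internally - that is where real write amplification would come from.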
Here's one example snippet from one of the drives in that pool:

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   -O--CK   098   098   000    -    0
  9 Power_On_Hours          -O--CK   100   100   000    -    53264
 12 Power_Cycle_Count       -O--CK   100   100   000    -    100
170 Available_Reservd_Space PO--CK   100   100   010    -    0
171 Program_Fail_Count      -O--CK   100   100   000    -    0
172 Erase_Fail_Count        -O--CK   100   100   000    -    0
174 Unsafe_Shutdown_Count   -O--CK   100   100   000    -    41
175 Power_Loss_Cap_Test     PO--CK   100   100   010    -    631 (295 5442)
183 SATA_Downshift_Count    -O--CK   100   100   000    -    0
184 End-to-End_Error        PO--CK   100   100   090    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
190 Temperature_Case        -O---K   068   063   000    -    32 (Min/Max 29/37)
192 Unsafe_Shutdown_Count   -O--CK   100   100   000    -    41
194 Temperature_Internal    -O---K   100   100   000    -    32
197 Current_Pending_Sector  -O--CK   100   100   000    -    0
199 CRC_Error_Count         -OSRCK   100   100   000    -    0
225 Host_Writes_32MiB       -O--CK   100   100   000    -    1811548
226 Workld_Media_Wear_Indic -O--CK   100   100   000    -    205
227 Workld_Host_Reads_Perc  -O--CK   100   100   000    -    49
228 Workload_Minutes        -O--CK   100   100   000    -    55841
232 Available_Reservd_Space PO--CK   100   100   010    -    0
233 Media_Wearout_Indicator -O--CK   089   089   000    -    0
234 Thermal_Throttle        -O--CK   100   100   000    -    0/0
241 Host_Writes_32MiB       -O--CK   100   100   000    -    1811548
242 Host_Reads_32MiB        -O--CK   100   100   000    -    1423217
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

Device Statistics (GP Log 0x04)
Page  Offset Size         Value Flags Description
0x01  =====  =                =  ===  == General Statistics (rev 2) ==
0x01  0x008  4              100  ---  Lifetime Power-On Resets
0x01  0x018  6     118722148102  ---  Logical Sectors Written
0x01  0x020  6         89033895  ---  Number of Write Commands
0x01  0x028  6      93271951909  ---  Logical Sectors Read
0x01  0x030  6          6797990  ---  Number of Read Commands

Roughly 6 years in use, and no indication of anything going on in terms of
warnings about utilization or wear-out. There is a MySQL database on this
box used by Cacti that is running all the time, and while the traffic isn't
real high, it's there (there is also a Postgres server running on there
that sees some traffic too.)

These specific drives were selected due to having power-fail protection for
data in-flight -- they were one of only a few that I've tested which passed
a "pull the cord" test, even though they're actually the 730s, NOT the "DC"
series.

Raidz2 configuration:

root@NewFS:/home/karl # zpool status zsr
  pool: zsr
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:07:05 with 0 errors on Mon Jun 28 03:43:58 2021
config:

        NAME            STATE     READ WRITE CKSUM
        zsr             ONLINE       0     0     0
          raidz2-0      ONLINE       0     0     0
            ada0p4.eli  ONLINE       0     0     0
            ada1p4.eli  ONLINE       0     0     0
            ada2p4.eli  ONLINE       0     0     0
            ada3p4.eli  ONLINE       0     0     0
            ada4p4.eli  ONLINE       0     0     0

errors: No known data errors

Micron appears to be the only people making suitable replacements if and
when these do start to fail on me, but from what I see here it will be a
good while yet.

--
Karl Denninger
karl@denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/
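As a rough cross-check on the write volume in the output above, the raw
counters can be converted to total bytes written. A quick sketch - the
attribute names here are Intel-specific (other vendors expose the same
data under different IDs), and /dev/ada0 is just an example device:

    # total host writes; the raw counter is in units of 32 MiB
    smartctl -A /dev/ada0 | awk '/Host_Writes_32MiB/ { printf "%.1f TiB written\n", $NF * 32 / 1048576; exit }'

    # same figure from the GP log, counted in 512-byte logical sectors
    smartctl -x /dev/ada0 | awk '/Logical Sectors Written/ { printf "%.1f TiB written\n", $4 * 512 / 2^40; exit }'

For the drive above, 1811548 x 32 MiB and 118722148102 x 512 bytes both
work out to about 55 TiB - roughly 25 GiB/day over its ~53,000 power-on
hours.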