adding BBU relearn support to mfiutil
Mark Johnston
markj at freebsd.org
Thu Nov 7 18:44:23 UTC 2013
On Thu, Nov 07, 2013 at 12:56:16PM -0500, Charles Owens wrote:
> On 11/6/13 6:03 PM, Mark Johnston wrote:
> > On Wed, Nov 06, 2013 at 12:01:55PM -0500, Charles Owens wrote:
> >> Hi, we've been playing with this patch in the context of 8.4-RELEASE-p4
> >> (we extracted r250483 and r250497 from stable/8 and applied to
> >> releng/8.4). I'm seeing some results that make me question whether or
> >> not caching is really working correctly after a BBU relearn operation
> >> has completed -- or maybe whether or not the new BBU patch is talking to
> >> LSI controller properly.
> >>
> >> Our test system had a BBU in the failed state (relearn needed). We used
> >> the "start learn" command and it seemed to go well, but strangely, now
> >> that the process seems to have completed, several days later, the status
> >> is still LEARN_CYCLE_REQUESTED (as seen with "mfiutil show battery").
> >> This may be entirely normal -- maybe it says that because the autolearn
> >> feature is now enabled?
> > I suspect that the status is bogus and that the battery is in fact dead.
> > There seem to be a few firmware bugs in the BBU status reporting, at
> > least with iBBU07. In your output below, I see:
> >
> > Design Capacity: 1215 mAh
> > Full Charge Capacity: 65262 mAh
> > Current Capacity: 61543 mAh
> >
> > which clearly isn't right. I've seen this problem before as well: over
> > time, the full charge capacity decreases, and eventually it seems to
> > wrap around to 65535. MegaCli (LSI's binary RAID management tool) reports
> > exactly the same thing, so it's a problem with the controller firmware.
> > If you look at MegaCli output you get things like "Absolute charge: 6000%".
> > So I suspect that the status is incorrect as well; when I've run into
> > this problem, I still see "status: normal".
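
The wrap-around described above can be checked mechanically. The following is a minimal sketch, not part of mfiutil: it assumes the capacity fields are reported as unsigned 16-bit values (suggested by the wrap toward 65535, but not confirmed here), and the helper names and the factor-of-two threshold are made up for illustration.

```python
import re

# Values copied from the "mfiutil show battery" output quoted in this thread.
BATTERY_OUTPUT = """\
Design Capacity: 1215 mAh
Full Charge Capacity: 65262 mAh
Current Capacity: 61543 mAh
"""

def parse_capacities(text):
    """Return {field name: value in mAh} for every '<name>: <n> mAh' line."""
    fields = {}
    for line in text.splitlines():
        m = re.match(r"\s*(.+?):\s*(\d+) mAh\s*$", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields

def as_signed16(value):
    """Reinterpret an unsigned 16-bit value as signed (assumed field width)."""
    return value - 0x10000 if value >= 0x8000 else value

def looks_wrapped(design, full_charge, factor=2):
    """Heuristic (hypothetical threshold): a full charge capacity far above
    the design capacity is physically implausible and hints at a wrap."""
    return full_charge > factor * design

caps = parse_capacities(BATTERY_OUTPUT)
design = caps["Design Capacity"]
full = caps["Full Charge Capacity"]
print(looks_wrapped(design, full))   # True
print(as_signed16(full))             # -274
print(round(100 * full / design))    # 5371 (percent of design capacity)
```

Read as a signed quantity, 65262 is -274, which would fit a counter that kept decreasing past zero. The ratio of over 5000% is of the same order as the "Absolute charge" figures mentioned above.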
> >
> >> The output of the "cache" status command is also a bit strange. Here is
> >> the raw output of these status commands:
> >>
> >> # mfiutil cache mfid0
> >> mfi0 volume mfid0 cache settings:
> >> I/O caching: disabled
> >> write caching: write-back
> >> write cache with bad BBU: disabled
> >> read ahead: adaptive
> >> drive write cache: enabled
> >> Cache disabled due to dead battery or ongoing battery relearn
> >>
> >>
> >> # ./mfiutil show battery
> >> mfi0: Battery State:
> >> Manufacture Date: 3/18/2010
> >> Serial Number: 77
> >> Manufacturer: LS1111001A
> >> Model: 3598501
> >> Chemistry: LION
> >> Design Capacity: 1215 mAh
> >> Full Charge Capacity: 65262 mAh
> >> Current Capacity: 61543 mAh
> >> Charge Cycles: 120
> >> Current Charge: 94%
> >> Design Voltage: 3700 mV
> >> Current Voltage: 4081 mV
> >> Temperature: 23 C
> >> Autolearn period: 30 days
> >> Next learn time: Tue Nov 26 20:06:40 2013
> >> Learn delay interval: 0 hours
> >> Autolearn mode: enabled
> >> Status: LEARN_CYCLE_REQUESTED
> >>
> >>
> >> Why does cache status now say "Cache disabled due to dead battery or
> >> ongoing battery relearn"? Shouldn't this no longer be the case since
> >> I've run the "learn" operation? Does this indicate that the I/O caching
> >> is really disabled?
> > I believe so. You can try changing the write caching policy to write-back
> > with bad BBU and see if that re-enables the cache. If it does, that's
> > more evidence that the BBU is dead and needs to be replaced.
> >
> >> I'd appreciate any and all assistance. Here's a bit of other info that
> >> might be of interest:
> >>
> >> # mfiutil show adapter
> >> mfi0 Adapter:
> >> Product Name: Integrated Intel(R) RAID Controller SROMBSASMP2
> >> Serial Number:
> >> Firmware: 11.0.1-0036
> >> RAID Levels: JBOD, RAID0, RAID1, RAID5, RAID6, RAID10, RAID50
> >> Battery Backup: present
> >> NVRAM: 32K
> >> Onboard Memory: 512M
> >> Minimum Stripe: 8k
> >> Maximum Stripe: 1M
> >>
> >> # mfiutil show drives
> >> mfi0 Physical Drives:
> >> 1 ( 136G) ONLINE <SEAGATE ST9146852SS 0005 serial=6TB005JE> SAS E1:S0
> >> 2 ( 136G) ONLINE <SEAGATE ST9146852SS 0005 serial=6TB005JV> SAS E1:S1
> >> 3 ( 136G) ONLINE <SEAGATE ST9146852SS 0005 serial=6TB005KD> SAS E1:S4
> >> 4 ( 136G) ONLINE <SEAGATE ST9146852SS 0005 serial=6TB005BQ> SAS E1:S2
> >> 5 ( 136G) HOT SPARE <SEAGATE ST9146852SS 0005 serial=6TB005FJ> SAS E1:S3
> >>
> >> The storage volume is a 4-drive RAID10. System has 16GB RAM, dual Xeon
> >> E5530 CPUs, on an Intel S5520UR motherboard.
> > It might be useful to check the output of "mfiutil show events -c info".
> >
> >
>
> This is good info, thank you.
>
> The "show events" command tells us when the battery first was detected
> as "failed":
>
> 49336 (Sun Mar 3 21:53:40 UTC 2013/BATTERY/info) - Battery charge complete
> 49340 (boot + 4s/BATTERY/info) - Battery Present
> 49341 (boot + 4s/BATTERY/FATAL) - Battery has failed and cannot support data retention. Please replace the battery
> 49365 (boot + 45s/BATTERY/WARN) - BBU disabled; changing WB virtual disks to WT
> 49367 (Mon Mar 4 05:13:09 UTC 2013/BATTERY/info) - Battery temperature is normal
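
Picking the battery failure out of a longer event log can be scripted. This is a rough sketch that assumes every line has the "sequence (timestamp/class/severity) - message" layout seen in the excerpt above; the regular expression and the function name are illustrative, and only the severity labels that appear in this thread are used.

```python
import re

# Event lines copied from the "mfiutil show events" excerpt in this thread.
EVENTS = """\
49336 (Sun Mar 3 21:53:40 UTC 2013/BATTERY/info) - Battery charge complete
49340 (boot + 4s/BATTERY/info) - Battery Present
49341 (boot + 4s/BATTERY/FATAL) - Battery has failed and cannot support data retention. Please replace the battery
49365 (boot + 45s/BATTERY/WARN) - BBU disabled; changing WB virtual disks to WT
49367 (Mon Mar 4 05:13:09 UTC 2013/BATTERY/info) - Battery temperature is normal
"""

# sequence (timestamp/class/severity) - message
EVENT_RE = re.compile(r"^(\d+) \(([^/]*)/([^/]+)/([^)]+)\) - (.*)$")

def severe_events(text, event_class="BATTERY", levels=("WARN", "FATAL")):
    """Return (sequence, severity, message) for events of the given class
    whose severity is in `levels`; lines that do not match are skipped."""
    found = []
    for line in text.splitlines():
        m = EVENT_RE.match(line)
        if not m:
            continue
        seq, _when, cls, severity, message = m.groups()
        if cls == event_class and severity in levels:
            found.append((int(seq), severity, message))
    return found

for seq, severity, message in severe_events(EVENTS):
    print(seq, severity, message)
```

On the excerpt above this keeps events 49341 (FATAL) and 49365 (WARN), the two entries that show the controller declaring the battery failed and switching the volumes to write-through.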
>
>
>
> So, given this strong indication that the BBU is really dead, and that
> I'd still like to test the effects of write-caching, I used this
> command: mfiutil cache mfid0 bad-bbu-write-cache enable
>
> Now the "cache disabled" message is gone:
>
> # mfiutil cache mfid0
> mfi0 volume mfid0 cache settings:
> I/O caching: writes
> write caching: write-back
> write cache with bad BBU: enabled
> read ahead: adaptive
> drive write cache: enabled
>
>
> The obvious interpretation is that write-caching is now operational (in
> the preferred write-back mode). Strangely, though, my performance tests
> (with both pgbench and bonnie) still showed no meaningful effect from
> having the cache operational! I toggled between caching / no-caching
> with these commands:
>
> # mfiutil cache mfid0 writes
> Setting write cache policy to write-back
>
> # mfiutil cache mfid0 disable
> Disabling caching of I/O writes
>
>
> Again, no difference in performance was seen.
>
> On a whim, I also tried write-through mode, and to my surprise, bonnie
> showed significantly reduced performance! (consistent over multiple
> samples) This is really confusing. To me it suggests that there's some
> kind of disconnect between caching-status as seen with mfiutil and
> caching-status in reality. Chief exhibits being that write-caching
> appears to have still been happening even:
>
> * after the "cache mfid0 disable" command was issued, and
> * earlier, before the "cache mfid0 bad-bbu-write-cache enable" command
> was issued (when "mfiutil cache mfid0" still showed "Cache disabled
> due to dead battery or ongoing battery relearn").
>
> ** If this is the case then it suggests that the system before today was
> in a dangerous state... actively doing write-back caching with a bad BBU
> (despite what mfiutil claimed about the cache being disabled)! **
Yup. That's rather frightening. :(
>
> Your thoughts? Is there any other way to explain this?
Nothing that comes to mind. The reason I did some work to improve LSI BBU
reporting was because we were noticing intermittent performance problems
that turned out to be caused by the controller flipping to write-through
mode during BBU relearn cycles.
However, I've never bothered verifying that the cache is actually in
write-through mode when the battery is dead. I think there's a machine
in my lab which shows similar problems, so I will try to take a look at
it soon, do some write perf testing and see what MegaCli reports. It'll
take me a few days at least to get to this though.
I'm not sure how this might be fixed in the case that it turns out to be
another firmware bug.
-Mark
>
>
> Here is the data from bonnie:
>
> ***** write-through caching (2 samples)
>
> # bonnie -s 2000
> File './Bonnie.1351', size: 2097152000
> ...
> -------Sequential Output-------- ---Sequential Input-- --Random--
> -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
> Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
> 2000 61515 21.3 46388 4.3 57432 16.0 247823 99.9 1629696 100.0 55687.0 212.4
>
> Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
> 2000 60001 20.7 51828 4.9 51666 13.9 247501 100.0 1657454 100.0 53136.4 251.0
>
> ***** write-back caching (2 samples)
>
> -------Sequential Output-------- ---Sequential Input-- --Random--
> -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
> Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
> 2000 128564 44.6 90065 8.7 245325 47.8 248492 100.0 1558747 99.7 61967.5 179.1
>
> Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
> 2000 184059 64.0 141360 13.8 129801 22.2 246222 99.2 1556723 100.0 51728.4 159.7
>
> (and, again... same performance is seen after issuing "cache disable"
> command)
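
To put a number on the difference, the sequential-output columns of the four bonnie runs above can be averaged per mode. This is only a sketch over the figures quoted in this message; the dictionary layout and the function name are made up for illustration.

```python
from statistics import mean

# Sequential output throughput in K/sec, copied from the bonnie runs above:
# two samples per cache mode for each of the three output tests.
RESULTS = {
    "per-char": {"write-through": [61515, 60001], "write-back": [128564, 184059]},
    "block":    {"write-through": [46388, 51828], "write-back": [90065, 141360]},
    "rewrite":  {"write-through": [57432, 51666], "write-back": [245325, 129801]},
}

def speedup(samples):
    """Ratio of mean write-back throughput to mean write-through throughput."""
    return mean(samples["write-back"]) / mean(samples["write-through"])

for test, samples in RESULTS.items():
    print(f"{test}: write-back is {speedup(samples):.2f}x write-through")
```

With only two samples per mode, and write-back samples that differ widely from each other, these ratios are rough indications rather than measurements. They still show write-back running at more than twice the write-through rate in every output test.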
>
>
> Thanks much,
>
> Charles Owens
> Great Bay Software
>