[Bug 211381] L2ARC degraded, repeatedly, on Samsung SSD 950 Pro nvme
bugzilla-noreply at freebsd.org
Tue Jul 26 12:06:33 UTC 2016
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211381
Bug ID: 211381
Summary: L2ARC degraded, repeatedly, on Samsung SSD 950 Pro nvme
Product: Base System
Version: 10.3-STABLE
Hardware: amd64
OS: Any
Status: New
Severity: Affects Only Me
Priority: ---
Component: kern
Assignee: freebsd-bugs at FreeBSD.org
Reporter: braddeicide at hotmail.com
CC: freebsd-amd64 at FreeBSD.org
When placing an L2ARC on a Samsung SSD 950 Pro (nvme), it works for a while
(hours to a day?) and is then marked as degraded, with its capacity left
entirely unallocated. I'm using the same device for the log as well, and that
appears to be functioning fine (I know this isn't best practice).
Unlike the similar bug #198242, I don't get bad checksums or I/O errors.
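For completeness, the log and cache vdevs sit on geli-encrypted partitions of
the NVMe device and were attached with something along the lines of the
following (standard zpool syntax; the actual commands used for this pool
weren't captured, so treat this as a sketch, with device names taken from the
zpool status output further down):

# zpool add zpool1 log nvd0p2.eli
# zpool add zpool1 cache nvd0p3.eli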
# zfs-stats -L

------------------------------------------------------------------------
ZFS Subsystem Report                            Tue Jul 26 20:54:51 2016
------------------------------------------------------------------------

L2 ARC Summary: (DEGRADED)
        Passed Headroom:                1.05m
        Tried Lock Failures:            1.68m
        IO In Progress:                 268
        Low Memory Aborts:              114
        Free on Write:                  647.02k
        Writes While Full:              1.42m
        R/W Clashes:                    2
        Bad Checksums:                  0
        IO Errors:                      0
        SPA Mismatch:                   1.38k

L2 ARC Size: (Adaptive)                 298.83  MiB
        Header Size:                    0.00%   0

L2 ARC Evicts:
        Lock Retries:                   37
        Upon Reading:                   0

L2 ARC Breakdown:                       44.69m
        Hit Ratio:                      5.07%   2.27m
        Miss Ratio:                     94.93%  42.43m
        Feeds:                          1.70m

L2 ARC Buffer:
        Bytes Scanned:                  114.97  TiB
        Buffer Iterations:              1.70m
        List Iterations:                3.13m
        NULL List Iterations:           58.42k

L2 ARC Writes:
        Writes Sent:                    100.00% 1.54m
# sysctl -a | grep l2
kern.cam.ctl2cam.max_sense: 252
vfs.zfs.l2c_only_size: 0
vfs.zfs.l2arc_norw: 1
vfs.zfs.l2arc_feed_again: 1
vfs.zfs.l2arc_noprefetch: 0
vfs.zfs.l2arc_feed_min_ms: 200
vfs.zfs.l2arc_feed_secs: 1
vfs.zfs.l2arc_headroom: 2
vfs.zfs.l2arc_write_boost: 52428800
vfs.zfs.l2arc_write_max: 26214400
vfs.cache.numfullpathfail2: 0
kstat.zfs.misc.arcstats.l2_write_buffer_list_null_iter: 58416
kstat.zfs.misc.arcstats.l2_write_buffer_list_iter: 3128016
kstat.zfs.misc.arcstats.l2_write_buffer_bytes_scanned: 126433859595264
kstat.zfs.misc.arcstats.l2_write_pios: 1543758
kstat.zfs.misc.arcstats.l2_write_buffer_iter: 1705468
kstat.zfs.misc.arcstats.l2_write_full: 1424118
kstat.zfs.misc.arcstats.l2_write_not_cacheable: 1229880739
kstat.zfs.misc.arcstats.l2_write_io_in_progress: 268
kstat.zfs.misc.arcstats.l2_write_in_l2: 4397896253
kstat.zfs.misc.arcstats.l2_write_spa_mismatch: 1379
kstat.zfs.misc.arcstats.l2_write_passed_headroom: 1047813
kstat.zfs.misc.arcstats.l2_write_trylock_fail: 1676946
-> kstat.zfs.misc.arcstats.l2_compress_failures: 143662342
kstat.zfs.misc.arcstats.l2_compress_zeros: 11159
kstat.zfs.misc.arcstats.l2_compress_successes: 1038776676
kstat.zfs.misc.arcstats.l2_hdr_size: 0
kstat.zfs.misc.arcstats.l2_asize: 152834048
kstat.zfs.misc.arcstats.l2_size: 338656768
kstat.zfs.misc.arcstats.l2_io_error: 0
kstat.zfs.misc.arcstats.l2_cksum_bad: 0
kstat.zfs.misc.arcstats.l2_abort_lowmem: 114
kstat.zfs.misc.arcstats.l2_cdata_free_on_write: 4449
kstat.zfs.misc.arcstats.l2_free_on_write: 647030
kstat.zfs.misc.arcstats.l2_evict_l1cached: 3164010
kstat.zfs.misc.arcstats.l2_evict_reading: 0
kstat.zfs.misc.arcstats.l2_evict_lock_retry: 37
kstat.zfs.misc.arcstats.l2_writes_lock_retry: 2181
-> kstat.zfs.misc.arcstats.l2_writes_error: 1499766
kstat.zfs.misc.arcstats.l2_writes_done: 1543758
kstat.zfs.misc.arcstats.l2_writes_sent: 1543758
kstat.zfs.misc.arcstats.l2_write_bytes: 17569594126336
kstat.zfs.misc.arcstats.l2_read_bytes: 59325997056
kstat.zfs.misc.arcstats.l2_rw_clash: 2
kstat.zfs.misc.arcstats.l2_feeds: 1705468
kstat.zfs.misc.arcstats.l2_misses: 42457839
kstat.zfs.misc.arcstats.l2_hits: 2265463
kstat.zfs.misc.arcstats.evict_l2_skip: 712
kstat.zfs.misc.arcstats.evict_l2_ineligible: 214781087744
kstat.zfs.misc.arcstats.evict_l2_eligible: 1461307185152
kstat.zfs.misc.arcstats.evict_l2_cached: 250679670784
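The two counters marked with arrows above are the ones that stand out:
l2_writes_error is 1499766 against l2_writes_sent of 1543758, which suggests
that nearly every L2ARC write cycle is failing even though l2_io_error and
l2_cksum_bad stay at zero. A crude way to confirm the error counter is still
climbing (just a hypothetical monitoring loop, not part of the original
report):

# while true; do sysctl kstat.zfs.misc.arcstats.l2_writes_sent kstat.zfs.misc.arcstats.l2_writes_error; sleep 60; done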
# zpool status
  pool: zpool1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 30h28m with 0 errors on Wed Jul 20 04:06:39 2016
config:

        NAME            STATE     READ WRITE CKSUM
        zpool1          ONLINE       0     0     0
          mirror-0      ONLINE       0     0     0
            da0p2.eli   ONLINE       0     0     0
            da1p2.eli   ONLINE       0     0     0
          mirror-1      ONLINE       0     0     0
            da2p2.eli   ONLINE       0     0     0
            da3p2.eli   ONLINE       0     0     0
          mirror-3      ONLINE       0     0     0
            da6p2.eli   ONLINE       0     0     0
            da7p2.eli   ONLINE       0     0     0
          mirror-4      ONLINE       0     0     0
            da4p2.eli   ONLINE       0     0     0
            da5p2.eli   ONLINE       0     0     0
        logs
          nvd0p2.eli    ONLINE       0     0     0
        cache
          nvd0p3.eli    ONLINE       0 1.21G     0

errors: No known data errors
When I remove and re-add the device it works for a while, but then it gets
back into the state shown below, where its capacity is entirely free (the
remove/re-add steps are sketched after the iostat output).
# zpool iostat -v
                  capacity     operations    bandwidth
pool           alloc   free   read  write   read  write
-------------  -----  -----  -----  -----  -----  -----
zpool1         11.6T  2.91T    625    186  11.6M  3.58M
  mirror       2.70T   945G    137     51  2.76M   727K
    da0p2.eli      -      -     39     15  2.78M   730K
    da1p2.eli      -      -     39     15  2.78M   730K
  mirror       2.86T   784G    183     34  2.86M   515K
    da2p2.eli      -      -     43     12  2.89M   517K
    da3p2.eli      -      -     43     11  2.89M   517K
  mirror       3.23T   406G    217     28  3.17M   452K
    da6p2.eli      -      -     54     10  3.27M   455K
    da7p2.eli      -      -     53     10  3.27M   455K
  mirror       2.80T   846G     87     45  2.80M   691K
    da4p2.eli      -      -     31     13  2.80M   694K
    da5p2.eli      -      -     31     13  2.80M   694K
logs               -      -      -      -      -      -
  nvd0p2.eli   14.1M  6.92G      0     34      0  1.60M
cache              -      -      -      -      -      -
  nvd0p3.eli    146M   450G    126    263  1.05M  12.7M
-------------  -----  -----  -----  -----  -----  -----
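For reference, the remove/re-add cycle mentioned above is roughly the
following (standard zpool commands, shown only as a sketch of what is being
done, with the device name taken from the outputs above):

# zpool remove zpool1 nvd0p3.eli
# zpool add zpool1 cache nvd0p3.eli

In between, 'zpool clear zpool1 nvd0p3.eli' resets the accumulated write
errors on the cache vdev, as suggested by the zpool status action text above.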