Data corruption with checksum offloading enabled

Tue Jan 27 04:46:07 PST 2009

Hello,

Dmitry Marakasov <amdmi3 at amdmi3.ru> writes:

> For now I have two cases of corruption - in both cases it is single
> difference of one 128 byte block with file offsets 0x65F872 and
> 0x61A072.

I had a similar problem last April on a 7-stable box, which I reported
in the 'nfs-server silent data corruption' thread.

I found:

- in all failing cases just *one* byte is corrupted, with 4 or all 8
  bits set to zero, *and* the original value is one out of the limited
  subset {1, 8, 9} ....

  here is the output of `cmp -x $i/BIG $i/BIG2` for some failing
  cases I saved:

  03869a48 09 00
  05209d88 09 00
  01777148 09 00
  00f10f88 09 00
  01f4c4c8 11 00
  06c3d6c8 11 00
  0725ca48 18 00
  01608008 09 00
  00f3b888 18 00

  07aa45c8 29 20

Does your corruption match this pattern as well?
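
To make that comparison easier, here is a minimal sketch in Python that
classifies saved `cmp -x` lines by which bits were cleared. It assumes
each line is "offset original corrupted" in hex, as in the listing above.
The names `classify` and `classify_all` are made up for illustration, and
the offset modulo 64 is printed only because alignment may be worth
comparing between cases.

```python
# Hypothetical helper: classify saved `cmp -x` lines (hex offset, byte in
# the first file, byte in the second file) by which bits were cleared.
SAMPLE = """\
03869a48 09 00
05209d88 09 00
01777148 09 00
00f10f88 09 00
01f4c4c8 11 00
06c3d6c8 11 00
0725ca48 18 00
01608008 09 00
00f3b888 18 00
07aa45c8 29 20
"""


def classify(line):
    """Return (offset, good, bad, cleared_bits, kind) for one cmp -x line."""
    offset_s, good_s, bad_s = line.split()
    offset, good, bad = int(offset_s, 16), int(good_s, 16), int(bad_s, 16)
    cleared = good & ~bad & 0xFF    # bits that went from 1 to 0
    newly_set = bad & ~good & 0xFF  # bits that went from 0 to 1
    if newly_set:
        kind = "bits set"
    elif bad == 0:
        kind = "whole byte zeroed"
    elif bad == good & 0xF0:
        kind = "low nibble zeroed"
    elif bad == good & 0x0F:
        kind = "high nibble zeroed"
    else:
        kind = "other"
    return offset, good, bad, cleared, kind


def classify_all(text):
    """Classify every non-empty line of saved cmp -x output."""
    return [classify(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    for offset, good, bad, cleared, kind in classify_all(SAMPLE):
        print(f"{offset:08x} {good:02x}->{bad:02x} "
              f"cleared={cleared:08b} offset%64={offset % 64:2d} {kind}")
```

Replacing SAMPLE with your own saved `cmp -x` output should show at a
glance whether your cases have the same shape.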

> I was suggested by Andrzej Tobola to try disabling txcsum on a
> network interface. I've disabled both rxcsum and txcsum, and that
> solved a problem.
>
> Judging from that this helped Andrzej with sk(4) and me with ale(4)
> driver, that's not a single driver problem. Does this mean that we
> have global problems with checksum offloading?

I could reproduce it with nfe(4) and re(4) ...

interestingly enough, I could *not* reproduce it when disabling
cpu frequency control ...
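
On the offloading question, a small Python sketch of the 16-bit one's
complement Internet checksum (as described in RFC 1071) shows that a
single byte going from 09 to 00 always changes the checksum. So damage of
this shape should be caught wherever the checksum is computed and checked
over the same bytes. This is only an illustration under that assumption:
the payload below is arbitrary test data, and the sketch says nothing
about where in the path the corruption actually happens.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's complement checksum over data (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


# Arbitrary test payload (an assumption, not captured traffic);
# the byte at index 9 has the value 0x09.
payload = bytes(range(256)) * 4
damaged = bytearray(payload)
damaged[9] = 0x00  # mimic one "09 -> 00" corruption

if __name__ == "__main__":
    print(f"original: {internet_checksum(payload):04x}")
    print(f"damaged:  {internet_checksum(bytes(damaged)):04x}")
```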

for what it's worth

Best, Arno