SCSI tape data loss

Kern Sibbald kern at sibbald.com
Fri Jun 6 07:38:41 PDT 2003


Hello,

I have now completed a fairly extensive series of tests
on my Linux machine with a DDS-4 drive and on Dan's FreeBSD
machine with a DDS-1 drive.

Bottom line: There is a significant data loss (500KB to 2MB) 
at the EOM on Dan's drive.  There is no data loss on my drive.

The variation in the data loss seems to depend directly
on how compressible the data is (i.e., the more the data can be
compressed to fit in a fixed-size driver buffer, the more user
data is lost).

I ran three different kinds of tests and several variations of some
of those tests:

Tests:
1. Bacula saving a 1GB file containing random data.
2. Simulation of Bacula writing easily compressible, non-random data.
3. Raw write() of random data (same data each write except for
   first 32 bits).
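
For readers who want to try something like Test 3, here is a minimal
sketch of such a write loop. It is not the actual test program: the
names write_blocks and max_blocks, the use of rand() for the payload
and the fixed seed are all assumptions made for illustration. Only the
idea (identical data in every block except for the first 32 bits) comes
from the test description.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write fixed-size blocks to fd until write() fails,
 * returns fewer bytes than requested, or max_blocks is reached.
 * Every block holds the same pseudo-random payload; only the first
 * 32 bits differ and carry the block sequence number.
 * Returns the number of complete blocks written. *err receives errno
 * from a failing write(), or 0 if no write() failed. */
static long write_blocks(int fd, size_t block_size, long max_blocks, int *err)
{
    unsigned char *buf;
    long written = 0;

    *err = 0;
    if (block_size < sizeof(uint32_t)) {
        *err = EINVAL;
        return 0;
    }
    buf = malloc(block_size);
    if (buf == NULL) {
        *err = ENOMEM;
        return 0;
    }

    srand(12345);                       /* fixed seed (assumption) */
    for (size_t i = 0; i < block_size; i++)
        buf[i] = (unsigned char)(rand() & 0xff);

    while (written < max_blocks) {
        uint32_t seq = (uint32_t)written;
        ssize_t n;

        memcpy(buf, &seq, sizeof(seq)); /* stamp the sequence number */
        n = write(fd, buf, block_size);
        if (n < 0) {
            *err = errno;               /* e.g. ENOSPC at end of medium */
            break;
        }
        if ((size_t)n != block_size)
            break;                      /* zero-byte or short write */
        written++;
    }
    free(buf);
    return written;
}
```

Against a tape device one would pass a very large max_blocks and a
block size of 64,512 or 61,440 bytes, then note the returned count and
the errno value reported at the end.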

Variations:
1. Bacula stops writing before EOM is reached.
2. Test 2 above without drive hardware compression
3. Test 3 above without writing EOF but simply rewinding
4. Tests with and without using ioctl(MTIOCLRERROR).
5. Various tests with block size at 64,512 bytes, others with
   block size at 61,440 bytes.

Results:
1. All tests on my machine succeeded.
2. All tests (Test 1 Variation 1) not writing to EOM succeeded
   on both machines. (Previously we indicated that there
   was a loss when not writing to the EOM. I could not
   reproduce this and believe we had a misunderstanding
   somewhere).
3. All tests of all variations writing to EOM failed 
   on Dan's machine.
4. The number of buffers lost was quite consistent (1-2 buffer
   difference) for any given variation.
5. There was not much difference in the number of buffers
   lost with/without hardware compression when the data was
   random.
6. The number of buffers lost was 4 times greater with
   non-random data and drive compression enabled than
   with random data or with no drive compression.
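
One way to arrive at a "buffers lost" count like those above is to read
the medium back and compare the number of blocks read with the number
written. The sketch below is only an illustration under assumptions: it
supposes each block carries a sequence number counting up from 0 in its
first 32 bits, and the names count_blocks and lost_bytes are
hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: read fixed-size blocks from fd until EOF, an
 * error, or a short read, checking that the sequence number stored in
 * the first 32 bits of each block counts up from 0.
 * Returns the number of in-sequence blocks read, or -1 on a bad
 * block size or allocation failure. */
static long count_blocks(int fd, size_t block_size)
{
    unsigned char *buf;
    long good = 0;

    if (block_size < sizeof(uint32_t))
        return -1;
    buf = malloc(block_size);
    if (buf == NULL)
        return -1;

    for (;;) {
        uint32_t seq;
        ssize_t n = read(fd, buf, block_size);

        if (n != (ssize_t)block_size)
            break;                      /* EOF, error, or short read */
        memcpy(&seq, buf, sizeof(seq));
        if (seq != (uint32_t)good)
            break;                      /* block missing or out of order */
        good++;
    }
    free(buf);
    return good;
}

/* Convert a block shortfall into bytes of user data lost. */
static long lost_bytes(long blocks_written, long blocks_read,
                       size_t block_size)
{
    return (blocks_written - blocks_read) * (long)block_size;
}
```

With a block size of 64,512 bytes, a shortfall of 7 blocks corresponds
to 451,584 bytes of user data.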

Conclusions:
1. On Dan's machine, data is always lost at EOM.
2. The amount of data lost appears to be closely
   related to what is in the drive buffer (more buffers 
   are lost if the data is easily compressed).

Possible causes:
1. The hardware does not have an LEOM.
2. The driver is not signaling to the program when an LEOM
   occurs, so the buffered data is lost at the PEOM. The
   ONLY write() status I got in all the tests was -1 with
   errno=ENOSPC (write() never returned zero bytes written).
3. Some miscommunication between the hardware and the driver.
3. Some miscommunication between the hardware and the driver.
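
To make the distinction in cause 2 concrete, here is a small sketch of
how a program might classify the result of each write(). It assumes a
driver that reports the LEOM (the early warning before the physical end
of the medium) by returning zero bytes written, and that -1 with ENOSPC
means the PEOM was hit. That mapping is an assumption for illustration
and drivers differ; the enum and function names are hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical classification of a single write() result. */
enum eom_status {
    WRITE_OK,           /* full block written */
    WRITE_SHORT,        /* fewer bytes written than requested */
    EOM_EARLY_WARNING,  /* 0 bytes: assumed LEOM signal */
    EOM_PHYSICAL,       /* -1 with ENOSPC: assumed PEOM */
    WRITE_ERROR         /* -1 with any other errno */
};

/* n is the value returned by write(), requested is the byte count
 * passed to it, and err is errno saved right after the call. */
static enum eom_status classify_write(ssize_t n, size_t requested, int err)
{
    if (n == 0)
        return EOM_EARLY_WARNING;
    if (n < 0)
        return (err == ENOSPC) ? EOM_PHYSICAL : WRITE_ERROR;
    if ((size_t)n < requested)
        return WRITE_SHORT;
    return WRITE_OK;
}
```

In these terms, every test that wrote to EOM on Dan's machine ended
with the EOM_PHYSICAL case and never saw EOM_EARLY_WARNING.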

What next:
- Time for the SCSI guys to look at this.  The problem is easily 
  repeatable on Dan's machine -- just do a whole bunch
  of write()s, nothing else, and it is guaranteed
  to happen.

Perhaps all the above is not clear enough, in which case,
please ask, but if I write it out with all the reasoning, it will
be a monster essay, so I've tried to give the important test
results so that you can draw your own conclusions and then
compare them to mine.

Best regards,

Kern



More information about the freebsd-scsi mailing list