
Monday, March 19, 2012

Is a SQLIOStress failure a definitive problem?

Over the past few months I have used SQLIOStress extensively to validate
proposed hardware before it goes to production. In most cases the hardware
has failed to meet performance expectations, so the SQLIOStress results
have not mattered.
However, today we have a piece of hardware which seems to perform very
well, yet *fails* the SQLIOStress test. Should we care?
Research
--
I called the sales rep, and they talked to their technical reps about
it... no useful information, nothing more than I can find on the web.
I've contacted MSFT technical support, and they have yet to come back
with any specific information beyond what I can find in the MSFT KB.
I've searched the internet and have yet to find anything on whether
failures are definitive no-gos for hardware.
Failure details
--
We run SQLIOStress with /O to cause a reboot. (Note: /O reboots the
computer, not the storage, so it seems to date from a time when the
two were wired together. So we are really rebooting a Windows 2003 SP1
server attached to storage which remains powered.)
Command line:
sqliostress /fn:\stress.mdf /lm:\stress.ldf /S3072 /I11 /O10
Hardware: an HP MSA-500 G2 storage array and ProLiant DL585 internal
drives. The machine using this storage is the quad-Opteron HP box.
From the systems engineer who rebuilt the machine this week:
"The MSA500 G2 Firmware was upgraded from: Product ID 0x0E11E020,
Version 1.40 to 1.52
(http://h18023.www1.hp.com/support/files/server/us/download/23010.html).
We are also running the latest PSP and all of the other firmware is
being reported as up to date."
We are running SQLIOStress with SQL Server active in the background. We
have discovered that with SQL Server in the background (dedicated 7 GB
of its 8 GB or so of memory) we see problems more reliably.
We configured the HP array cache to 100% write.
We run SQLIOStress and it triggers a reboot. When we restart SQLIOStress
to validate the written disk image, it generates a log file with the
following error:
06/30/05 17:37:11 00003192 Verifying the integrity of the file.
06/30/05 17:39:05 00003192
06/30/05 17:39:05 00003192 ERROR: Byte 1033 supposed to be [0x41] but [0x42] was read
06/30/05 17:39:05 00003192 ERROR: Byte 1034 supposed to be [0x41] but [0x4F] was read
06/30/05 17:39:05 00003192 ERROR: Byte 1035 supposed to be [0x41] but [0x42] was read
06/30/05 17:39:05 00003192
06/30/05 17:39:05 00003192 ----
06/30/05 17:39:05 00003192 Found pattern [A] in file for page 154455.
06/30/05 17:39:05 00003192 Bytes read = 8192
06/30/05 17:39:05 00003192 Potential torn write, lost write or stale read may have been encountered
06/30/05 17:39:05 00003192 ----
06/30/05 17:39:05 00003192
06/30/05 17:39:05 00003192 Sector: 0 LSN: 3914551 Page: 154455 Address: 0xA3280000
06/30/05 17:39:05 00003192 [AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA]
06/30/05 17:39:05 00003192 Hex: [0x414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141]
We've also seen this with 50% read / 50% write cache (using earlier
firmware). (After reviewing another post, we had decided on 100% write
cache to better align with optimal performance and avoid possible
failures due to read caching. The failure still occurred, and it did not
take many attempts to cause it.)
Some people have suggested that a torn write is a fact of life: with
disks writing 512 bytes at a time, there is always the possibility that
only a portion of the page will be written. Is that true? Does that mean
that a SQLIOStress failure is not news?
(Note: we configured NTFS with 8 KB clusters and 64 KB stripes.)
Reasoning: when NT sends data to the hard drive, some of it may be
written to disk before NT receives an ACK that the data has been
written. With a partial write you have a "torn page". Hence, the error
we see (which may or may not be a torn page?) would indicate an OS issue
and not reflect on the hardware at all. Is this true? Or are there
caches which guarantee not to let any data flow out to disk until the
ACK is returned to NT? (In which case a battery-backed cache can ensure
no loss of data.)
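(To make the failure modes concrete, here is a little Python sketch. It
is not SQLIOStress itself, just my own illustration of how a verifier
that stamps every 512-byte sector of an 8 KB page can tell a torn write
from a lost write or a stale read after a crash. The page and sector
sizes match our setup; everything else is made up.)

# The drive commits 512-byte sectors independently, so a crash mid-page
# can leave a mix of old and new sectors inside one 8 KB page.

PAGE_SIZE = 8192                               # SQL Server page size
SECTOR_SIZE = 512                              # classic disk sector size
SECTORS_PER_PAGE = PAGE_SIZE // SECTOR_SIZE    # 16

def build_page(stamp: bytes) -> bytes:
    """Fill a whole page with one pattern byte, e.g. b'A' or b'B'."""
    assert len(stamp) == 1
    return stamp * PAGE_SIZE

def classify(page: bytes, expected: bytes, previous: bytes) -> str:
    """Classify a page read back after a crash, the way a verifier might."""
    stamps = {page[i * SECTOR_SIZE:i * SECTOR_SIZE + 1]
              for i in range(SECTORS_PER_PAGE)}
    if stamps == {expected}:
        return "ok"                        # every sector carries the new pattern
    if stamps == {previous}:
        return "lost write / stale read"   # the whole page is the *old* pattern
    if stamps <= {expected, previous}:
        return "torn write"                # a mix of old and new sectors
    return "corruption"                    # bytes that were never written at all

# Example: the crash landed after 5 of the 16 sectors of the 'B' write hit disk.
torn = build_page(b"B")[:5 * SECTOR_SIZE] + build_page(b"A")[5 * SECTOR_SIZE:]
print(classify(torn, expected=b"B", previous=b"A"))   # -> torn write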
Have other companies tested with SQLIOStress, seen failures, and still
deployed the hardware and felt good about it? Or should a SQLIOStress
failure be treated as the kiss of death?
Please advise.
Mark Andersen

Follow up:
A helpful MSFT engineer from the SQL Server group spent some time on
the phone with me today. Thank you!
Conclusions:
/O can indicate problems, but not definitively; /O simply causes a
reboot while data is being written.
The extensive testing we did at our company, pulling the power plug
during both log and checkpoint operations, gives a better sense of
whether the hardware works correctly.
MSFT technical support was unable to provide much help with SQLIOStress.
|||
Hi Mark
I read your post with interest, as I am getting very similar results
from SQLIOStress, but I am trying to prove that SQL Server is safe in a
production virtual environment under either Microsoft Virtual Server or
VMware ESX Server. I was concerned that there is something intrinsically
unsafe about virtual machines, but since you are getting the exact same
errors on real hardware, it doesn't worry me as much anymore.
I guess I am going to have to do the same kind of extensive testing as
you: pulling the plug on both the virtual guest and the host operating
system and checking the results. How did you go about deciding a good
time to power off, and how did you check for data errors?
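For my own test I am planning something along these lines (a rough
Python sketch rather than a real tool; the file names and page layout
are entirely made up): a writer stamps each 8 KB page with a sequence
number and a checksum and fsyncs before logging the write as
acknowledged, and a verifier run after power is restored flags any
acknowledged page that reads back torn or stale. As for timing, the plan
is simply to pull the plug while the writer loop is running flat out.

import os, struct, zlib

PAGE = 8192                    # SQL Server page size
NPAGES = 1024                  # an 8 MB test file, purely illustrative
PATH = "stress.dat"            # illustrative file on the storage under test
ACK_LOG = "acked.log"          # must live on a device you already trust

def make_page(page_no: int, seq: int) -> bytes:
    """Layout: page_no, seq, CRC32 of header+payload, 4 pad bytes, payload."""
    payload = os.urandom(PAGE - 16)
    hdr = struct.pack("<II", page_no, seq)
    crc = zlib.crc32(hdr + payload)
    return hdr + struct.pack("<I", crc) + b"\0" * 4 + payload

def writer():
    """Overwrite pages forever; pull the plug while this is running."""
    if not os.path.exists(PATH):
        with open(PATH, "wb") as g:
            g.truncate(NPAGES * PAGE)
    f = open(PATH, "r+b")
    log = open(ACK_LOG, "a")
    seq = 0
    while True:
        page_no = seq % NPAGES
        f.seek(page_no * PAGE)
        f.write(make_page(page_no, seq))
        f.flush()
        os.fsync(f.fileno())            # the write is now supposedly durable
        log.write(f"{page_no} {seq}\n") # record it as acknowledged...
        log.flush()
        os.fsync(log.fileno())          # ...and make the record durable too
        seq += 1

def verify():
    """After power-up: every acknowledged write must read back intact."""
    acked = {}                          # page -> highest acknowledged seq
    for line in open(ACK_LOG):
        p, s = map(int, line.split())
        acked[p] = max(acked.get(p, -1), s)
    f = open(PATH, "rb")
    for page_no, want in sorted(acked.items()):
        f.seek(page_no * PAGE)
        buf = f.read(PAGE)
        p, s = struct.unpack_from("<II", buf)
        crc, = struct.unpack_from("<I", buf, 8)
        if zlib.crc32(buf[:8] + buf[16:]) != crc or p != page_no:
            print(f"page {page_no}: torn write / corruption")
        elif s < want:
            print(f"page {page_no}: lost write or stale read (seq {s} < acked {want})")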
"markandersen@.evare.com" wrote:
> Follow up:
> A helpful MSFT engineer from the SQL Server group spent some time on
> the phone with me today. Thank you!
> Conclusions:
> /O can indicate problems, but not definitively. /O simply causes a
> reboot while writing data.
> Extensive testing we did at our company to pull the power plug during
> both log and checkpoint operations provides a better sense of whether
> the hardware works correctly.
> MSFT technical support was unable to provide much help with SQLIOStress.
|||
We have seen similar issues on an MSA 1000. In our case the problem was
resolved by a firmware upgrade to version 4.32 on the array controller.
From the release notes:
Fixes an issue found with SQL Server 2000 in which SQL may report the
use of stale cache data under the following conditions:
- Extremely heavy I/O load
- Some percentage of MSA controller cache allocated to READ
- Multiple small simultaneous writes to the same SCSI blocks
- Write cache is full at the time of the request
If all of the above criteria are met, SQL may report error IDs 605, 644,
and 823 when performing subsequent reads from the same SCSI blocks.
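Those trigger conditions are straightforward to approximate if you want
to retest after the upgrade. Here is a rough Python sketch (the file
name, sizes, and threading are all made up, and a real probe would need
unbuffered raw I/O, e.g. FILE_FLAG_NO_BUFFERING on Windows, so the OS
cache does not hide the array): one writer per 512-byte block keeps
rewriting version stamps into the same 8 KB region while a reader flags
any read that returns a version older than one already acknowledged for
that block.

import os, struct, threading, time

PATH = "stale_probe.dat"       # illustrative file on the array under test
BLOCK = 512                    # one SCSI block
NBLOCKS = 16                   # sixteen adjacent blocks = one 8 KB page
STOP = threading.Event()
acked = [0] * NBLOCKS          # last acknowledged version per block
lock = threading.Lock()

def writer(i: int):
    """Keep rewriting block i with an ever-increasing version stamp."""
    f = open(PATH, "r+b", buffering=0)
    v = 0
    while not STOP.is_set():
        v += 1
        f.seek(i * BLOCK)
        f.write(struct.pack("<I", v) * (BLOCK // 4))  # block filled with v
        os.fsync(f.fileno())
        with lock:
            acked[i] = v       # published only after the write completed

def reader():
    """On a correct array this prints nothing."""
    f = open(PATH, "rb", buffering=0)
    while not STOP.is_set():
        for i in range(NBLOCKS):
            with lock:
                floor = acked[i]           # acknowledged before we read
            f.seek(i * BLOCK)
            v, = struct.unpack_from("<I", f.read(BLOCK))
            if v < floor:
                print(f"block {i}: stale read (saw {v}, acked {floor})")

if __name__ == "__main__":
    with open(PATH, "wb") as g:
        g.truncate(NBLOCKS * BLOCK)
    threads = [threading.Thread(target=writer, args=(i,)) for i in range(NBLOCKS)]
    threads.append(threading.Thread(target=reader))
    for t in threads:
        t.start()
    time.sleep(30)             # hammer for a while, then stop
    STOP.set()
    for t in threads:
        t.join()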
"markandersen@.evare.com" wrote:
> Over the past months I have used SQLIOStress extensively to validate
> proposed hardware going to production. In most cases, the hardware has
> failed to meet performance expectations and the SQLIOStress results
> have not mattered.
> However, today we have a piece of hardware which seems to perform very
> well, yet *fails* the SQLIOStress test. Should we care?
> Research
> --
> I called the sales rep, and they talked to their technical reps about
> it...no useful information. Nothing more than I can search on the web.
> I've contacted MSFT technical support, and they have yet to come back
> with any specific information beyond what I can find in MSFT KB.
> I've searched on the internet. Yet to find comments on why failures
> are definitive no-go's for hardware
>
> Failure details
> --
> Run SQLIOStress with /O to cause a reboot. (Note: the /O reboots the
> computer not the storage, so it seems to be dated from a time when the
> two were wired together. So we're really rebooting a Windows 2003 SP 1
> server attached to storage which is still powered.)
> Command line:
> sqliostress
> /fn:\stress.mdf /lm:\stress.ldf /S3072 /I11 /O10
> Hardware: HP - MSA-500 G2 storage array and Proliant DL585 internal
> drives. Quad HP Opteron is the machine which is using this storage.
> >From the systems engineer who rebuilt the machine this week:
> "The MSA500 G2 Firmware was upgraded from: Product ID 0x0E11E020,
> Version 1.40 to 1.52
> (http://h18023.www1.hp.com/support/files/server/us/download/23010.html).
> We are also running the latest PSP and all of the other firmware is
> being reported as up to date."
> We are running SQLIOStress with SQLServer in the background. We have
> discovered that with SQLServer in the background (dedicated 7 Gigs of
> memory out of 8 or so) we more reliably see problems.
> Configured the HP to 100% write cache.
> Run SQLIOStress. It triggers a reboot. Restart SQLIOStress to
> validate the written disk image and it generated a log file with the
> following error:
> 06/30/05 17:37:11 00003192 Verifying the integrity of the file.
> 06/30/05 17:39:05 00003192
> 06/30/05 17:39:05 00003192 ERROR: Byte 1033 supposed to be [0x41] but
> [0x42] was read
> 06/30/05 17:39:05 00003192 ERROR: Byte 1034 supposed to be [0x41] but
> [0x4F] was read
> 06/30/05 17:39:05 00003192 ERROR: Byte 1035 supposed to be [0x41] but
> [0x42] was read
> 06/30/05 17:39:05 00003192
> 06/30/05 17:39:05
> 00003192 ----
> 06/30/05 17:39:05 00003192 Found pattern [A] in file for page 154455.
> 06/30/05 17:39:05 00003192 Bytes read = 8192
> 06/30/05 17:39:05 00003192 Potential torn write, lost write or stale
> read may have been encountered
> 06/30/05 17:39:05
> 00003192 ----
> 06/30/05 17:39:05 00003192
> 06/30/05 17:39:05 00003192 Sector: 0 LSN: 3914551 Page:
> 154455 Address: 0xA3280000
> 06/30/05 17:39:05
> 00003192 [AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA]
> 06/30/05 17:39:05 00003192 Hex:
>
[0x414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141]
> We've also seen this with 50% read/50% write cache (using earlier
> firmware). (Upon reviewing another post, we had decided to do 100%
> write cache to better align with optimal performance and avoid possible
> failures due to read cacheing. The failure still occurred. It did not
> take many attempts to cause it.)
> Some people have suggested that a torn write is a fact of life...that
> with disks writing 512 bytes at a time, there is always the possibility
> that a portion of the page will be written. Is that true? Does that
> mean that a sqliostress failure is not news?
> (Note: we configured NTFS with 8K blocks and 64K stripes)
> Reasoning: when NT sends data to the hard drive, some may be written to
> disk before NT receives an ACK that data has been written. With a
> partial data write you have a "torn page". Hence, the error we see
> (which may or may not be a torn page?) would indicate an OS issue and
> not reflect on the hardware at all. Is this true? Or are there caches
> which guarantee not to let any data flow out to disk until the ack is
> returned to NT? (In which case a battery backed cache can ensure no
> loss of data.)
> Have other companies tested with SQLIOStress, seen failures, and still
> deployed hardware and feel good about that? Or should a failure of
> SQLIOStress be treated as the kiss of death?
> Please advise.
> Mark Andersen
>