Possible instruction pipelining problem between HT's on the same die ?

Fri Jun 3 22:47:27 GMT 2005

:This is normal behaviour.
:Take a look at IA-32 Intel Developers ... Vol 3,  
:Section: 7.2.2 for details + solutions.
:
:Stephan

    Ok.. that section seems to indicate that speculative reads 
    can pass writes, but it also says that the pipeline sniffs the address
    within the processor and ensures proper ordering.  The latter part
    makes sense within the context of a single cpu, but the big question is: 
    Is that supposed to hold true for interactions with HT cpus (that share
    the pipeline) as well?  Or not ?  It seems not.

    Speculative reads creating out of order situations seems to be the
    biggest issue.  The AMD manual (Programmers manual volume 3 page
    186, MFENCE instruction) says this:

    "The MFENCE instruction is weakly-ordered with respect to data and
    instruction prefetches.  Speculative loads initiated by the processor,
    or specified explicitly using cache-prefetch instructions, can be 
    reordered around an MFENCE".

    This seems to be different than what the Intel manual says, and doesn't
    make much sense.  What's the point of having a fence instruction if it
    can't guarantee read/write ordering?  Is the AMD manual simply wrong ?

    Other than that, the Intel manual does indicate that speculative reads
    will not pass locked bus cycle instructions (the AMD manual says nothing
    about that, as far as I can see).  So, presumably, doing a dummy locked bus 
    cycle operation on e.g. the top of the stack, such as Linux does, would
    be sufficient to ensure read ordering.  Would you concur with that
    assessment?
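
    For reference, the dummy locked operation described above can be
    sketched as follows.  This is a minimal illustration assuming a
    GCC-compatible compiler; the names locked_fence() and fenced_read()
    are hypothetical and are not taken from any kernel source.  It only
    shows the instruction sequence.  Whether it is sufficient between HT
    siblings is exactly the open question here.

```c
/*
 * Sketch only: a dummy locked read-modify-write on the top of the stack,
 * used as an ordering point.  locked_fence() and fenced_read() are
 * hypothetical names chosen for this example.
 */
static inline void locked_fence(void)
{
#if defined(__x86_64__)
    /* Adds 0 to the word at the stack pointer; the value is unchanged. */
    __asm__ __volatile__("lock; addl $0,0(%%rsp)" : : : "memory", "cc");
#elif defined(__i386__)
    __asm__ __volatile__("lock; addl $0,0(%%esp)" : : : "memory", "cc");
#else
    /* Not x86: fall back to the compiler's full barrier builtin. */
    __sync_synchronize();
#endif
}

/* Issue the locked operation first, then perform the load. */
static inline int fenced_read(const volatile int *p)
{
    locked_fence();
    return *p;
}
```

    The "memory" clobber also keeps the compiler from moving loads or
    stores across the barrier, which is a separate concern from what the
    processor does.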

    What's really horrible here is that the 'old' value of the data being
    used is modified at location A something like 30 instructions prior to 
    the instruction that updates the index (B).   I think this is a 
    situation that can only occur in an HT configuration, and then only if
    the speculative read issued by the HT cpu is being held across
    30 instructions executed by the primary cpu before the HT cpu issues the
    read of B.

    cpu #0 			cpu #1 (HT cpu on same die as cpu #0)

				speculatively read A
    write A			(stalled)
    [30 instructions]		(stalled x 30)
    write B			(stalled)
				read B
				see that B has been updated
				read A (get old value for A instead of new)

    Is that even possible ?  Not only the 30 instruction latency, but also
    the fact that even with the shared pipeline you have a speculative read
    on the HT cpu surviving 30 instructions running on cpu #0 (but only one
    or two on the HT cpu)... even though they share the same pipeline.
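
    The access pattern in the diagram above (write A, then write B on one
    cpu; read B, then read A on the other) can be sketched in portable
    C11 as below.  This is an assumption-laden illustration, not the code
    in question: slot_a, index_b, publish() and try_consume() are
    hypothetical names, and C11 fences stand in for whatever barrier the
    real code would use.

```c
#include <stdatomic.h>

static int slot_a;           /* "A": the data                      */
static atomic_int index_b;   /* "B": the index that publishes A    */

/* cpu #0 side: write A, then write B. */
static inline void publish(int value)
{
    slot_a = value;
    /* Keep the write of A ordered before the write of B. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&index_b, 1, memory_order_relaxed);
}

/* cpu #1 side: read B, and only then read A.  Returns 1 on success. */
static inline int try_consume(int *out)
{
    if (atomic_load_explicit(&index_b, memory_order_relaxed) == 0)
        return 0;            /* B not updated yet, A is not read */
    /* Keep the read of A from being satisfied before the read of B. */
    atomic_thread_fence(memory_order_acquire);
    *out = slot_a;
    return 1;
}
```

    The acquire fence sits between the read of B and the read of A, which
    is the point where the stale value of A would otherwise be picked up
    in the scenario described.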

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>