Optimized, table-less YUV 4:2:0 conversion to RGB 05:5:5 method
===============================================================
by Erik Walthinsen <omega@cse.ogi.edu>
Copyright (c) 2000
Licensed under the OpenContent License v1.0 (opencontent.org/opl.shtml)

This document will describe the methods used in constructing a highly
MMX-optimized conversion from YUV 4:2:0 planar data (as is common in video
processing applications, such as DV and MPEG decode) to RGB 05:5:5 packed
format, as is used for most video cards.


Design decisions
================
At the outset, several design decisions have to be made.  The first
decision was to create a macroblock-sized conversion, i.e. take a 16x16
luma plane and two 8x8 chroma planes, yielding a 16x16 RGB plane.  This is
driven by the structure of the MPEG-2 decoder, which when optimized for a
number of caching issues will operate on data in a much more local fashion
than the naive implementation of processing the whole frame at once, per
operation.

A result of this decision is that cache polution is to be avoided, lest
the gains found in rearranging the MPEG decoder be lost as data must be
constantly reloaded from main memory to service the next iteration of the
decoder.  This means that lookup tables are not possible, unless they are
very small.  As there are 256 possible values for each of luma and chroma,
and both chroma samples impact two each of the R, G, and B values, we're
looking at 1280 table entries.  For precision, these must be at least 16
bits, yielding a table size of 2560 bytes at minimum.  Per iteration, as
many as 512 of these entries may be used, as that is the total number of
values calculated in a macroblock (256 luma, 64 each per chroma, 2 impacts
per chroma).  This means that a significant number of cache lines may be
touched per iteration, loading the whole cache line (32 bytes) into
memory, possibly displacing data that could be useful in the next round.

This leaves us with an algorithm that must utilize multiplication rather
heavily, with a possible loss of some precision depending on the scale of
those multipliers.  However, this actually gives us an advantage, if we're
clever enough to utilize it: the structure of Pentium MMX processors
allows MMX multiplies to be pipelined, such that even though a single
multiply will take 3 cycles for the result to be available, many
multiplies can be stacked one after another, so long as they violate any
other scheduling rules.  The idea would be to postpone mutliply operations
in order to prepare several, then run them through the pipeline all at
once.

Another issue with Pentium scheduling is a 1-cycle stall before copying
(MOVQ or MOVD) a register to anything but another register.  This means
that either we have to avoid memory access as much as possible (a good
strategy in general), or be prepared to do something else during that
stall.  This shouldn't be too much of an issue, but it something to think
about.  Regardless, an attempt will be made to do the entire conversion
within the available registers, avoiding the overhead of writing to
temporary memory locations.


Color-space conversion algorithm
================================
Color space conversion, in the case of YUV to RGB, consists of multiplying
luma and chroma samples by constants and summing to yield the R, G, and B
components.  The algorithm is as follows:

R = 1.164 (Y-16)                + 1.596(V-128)
G = 1.164 (Y-16) - 0.391(U-128) - 0.813(V-128)
B = 1.164 (Y-16) + 2.018(U-128)

Note the subtractions on each.  These normalize the values into the proper
ranges.  In the luma case, this is because the total brightness range that
YUV allows is larger than that of RGB.  While technically the luma range
is from 16 to 235, there are instances where so-called 'super-white' and
'over-black' pixels appear.  These values are actually supposed to be
clamped to the above range, which is something to worry about in
implementing a correct conversion.

In the luma case, the normalization actually brings the values into the
-112 to +112 range, which requires a signed number to store.  This means
that signed arithmetic must be used in any operations involving the chroma
samples, which happens to be almost all of them.

Of course, this algorithm must be applied to the correct data.  The key
here is that the chroma bandwidth is 1/4 that of the luma bandwidth.  This
is accomplished by colocating the two chroma samples with every other luma
sample, both horizontally and vertically:

YC Y  YC Y  YC		Y = luma sample
Y  Y  Y  Y  Y		C = chroma samples
YC Y  YC Y  YC
Y  Y  Y  Y  Y

This means that we have a lot less work to do.  We have to do the impact
calculations for chroma only once for every four output pixels.  Since the
work necessary to construct the luma impact is constant across each of R,
G, and B, this simplifies our work quite a bit.


MMX tricks
==========
[Un]Packing
-----------
The basic idea behind MMX is to operate on data in parallel, either 2, 4,
or 8 at a time depending on the width of the data in question.  Of course,
if it were nothing but that, it wouldn't be very interesting, since you'd
have to arrange your data very carefully into the right sequence in order
to use the instructions.  Instead, MMX provides a number of very useful
data manipliation instructions.

The fundamental operations here are packing and unpacking.  In our case,
we're going to be starting with data that's packed as bytes, but our
operations are going to require more precision than 8 bits can provide.
Unpacking is precisely what we need.  Unpacking will take the top or
bottom half of a register, and promote those values to the next higher bit
count.  We'll be using it to take the byte samples, 4 at a time, and
convert them into 16-bit word samples.

One thing to note is that unpacking actually takes two registers.  Bytes
or words from the same location will be taking from each register, and
interleaved to form the new wider values.  This means that if all you want
to do is widen the data already in a register, you have to provide a
register full of zeros in order to fill the top half of each of the
resulting values.  If you're dealing with signed data, things get much
hairier, since you'll have to unpack with a register filled with 1's,
which means using some conditional operations to deal with that.  Luckily,
none of the work we have to do will require us to unpack signed values.

The inverse of this, packing, is the process of taking two register's
contents and demoting every value to half it's bit count.  This means
taking two register that contain 4 words each and combining them into a
single register with 8 bytes.  The key behind this operation is the
saturation that occurs during packing.  Any value that is outside the
range of the destination width is automatically brought within that range.
That means that if we were to end up with any super-white pixels, where
one of the resulting RGB components is outside the 0 - 255 range, it will
automatically be clipped down to 255.  Whether we can take advantage of
this depends on whether or not we will be doing any packing.

High-low mutliplication
-----------------------
Any time you multiply two values, you end up with a value that's larger
than either of the original two.  In binary terms, the resulting register
must have enough bits to hold the sum of the maximum bit count in each
source register.  That means that the result of multiplying two 16-bit
numbers must be stored in a 32-bit register, lest data get lost if either
of the source values is 'too large'.

MMX's multiply routines are designed not to change the type of data stored
in a register, so multiplying registers with 16-bit words together will
always result in a register with 16-bit words.  "That can't work!" you
say.  Yes it can, if it takes two instructions to do the work, one for the
high result bits and one for low.

If you want to produce data in the wider format, it'll take another
register to unpack the two resulting registers' data into the proper
format.  However, if you're doing mutliplication that will always adhere
to some bit boundaries, you can be clever and retain precision while
dumping excess data, all in one operation.

Consider the case we'll run into, where we need to multiply an 8-bit value
and a 16-bit value, yet come out with a 16-bit value.  Normally this would
result in a 24-bit peak value:

   aaaaaaaa * bbbbbbbbbbbbbbbb
 =    cccccccccccccccccccccccc

When this is done with MMX multiply instructions, you'll end up with one 
register containing the top 8 bits and the other containing the bottom 16.
Since we only want the top 16 bits, we'd have to play some pretty ugly
games to get things back to the state we want them.

But what if one were to shift A over to the left 8 bits, to 'fill out' the
16-bit space?

   aaaaaaaa00000000 * bbbbbbbbbbbbbbbb
 =    cccccccccccccccccccccccc00000000

Remember, shifting has the same effect whether it's done before or after
the multiply, because it really is just another multiply, and you can
re-arrange those all you want.  Note now that when you mutliply this with
MMX, you get the top 16 bits of your answer in one register, and the
bottom 8 plus the 0's you added in the other.  Voila!  Since all you want
is the top 16 bits, you're done.


Packed Multiply and Add
-----------------------
This instruction puzzled me for quite some time, as I wondered what on
earth it was good for.  Eventually, in the process of pondering this
color-conversion process, it hit me.  It is perfect for several parts of
this process.

The idea behind Packed Multiply and Add is to take two registers
containing words, mutliply each coincident pair of words together (into
32-bit temporary space), then sum the results of each pair of multiplies:

  aaaaaaaa bbbbbbbb cccccccc dddddddd
      *        *        *        *
  eeeeeeee ffffffff gggggggg hhhhhhhh
  ======== ======== ======== ========
          +                 +
     ==========        ===========
     (A*E + B*F)       (C*G + D*H)

This is particularly useful when calculating the chroma impact for the
green component.  The key is to interleave paired U and V samples in one
register, and do the same to the impact mutlipliers.  Of course, this only
does two samples at a time, but it's vastly superior to the alternatives
(massive unpacking and many multiplies, or reduced precision).

The other major use for this instruction is in packing samples into RGB
05:5:5.  Intel has an appnote on this, which covers the concept rather
nicely.  Rather than duplicating it here, go there.

Shifting == Multiplying
-----------------------
In certain extreme cases, you may find yourself in a situation where
you've optimized the code as far as it will go, but you've got a set of
multiplies or shifts clustered together that you can't break up.  Since
there is only one MMX shifter, a cluster of MMX shifts will fill the U
pipeline but stall the V pipeline continually.  Assuming you don't have
dependency chains between registers to prevent this, remember that you can
dispatch a shift and a multiply in the same cycle.  The fact that a shift
is nothing more than a multiply by a power of 2, done more efficiently,
can be used (if you have the spare registers or cached constants) can be
used to alternate between shifts and multiplies, allowing tigher pairing.


Calculating chroma impact
=========================

Green chroma impact
-------------------
The first step is to calculate the impact of the chroma samples.  Since
green impact is the hardest, as well as the most register-hungry, we'll
start there.

The obvious first place to start is to actually load the pixels:

	movq_m2r(*u_block,mm0);
	movq_m2r(*v_block,mm1);

Of course, this gives us the first 8 pixels of each, which is more than we
want to deal with right off.  To solve this problem we can unpack the high
4 bytes of each, using a conviently zero'd register:

	pandn_r2r(mm7,mm7);
	punpckhbw_r2r(mm7,mm0);
	punpckhbw_r2r(mm7,mm1);

This gives us what we want, which is the first four pixels of each in mm0
and mm1, 16 bits wide each.  Now we have to do range correction, bringing
them into the proper -128 to +128 range:

	val128w = 0x0080008000800080L;
	psubw_m2r(val128w,mm0);
	psubw_m2r(val128w,mm1);

Now to pull the Packed Multiply and Add trick.  We need to get two pairs
of chroma samples arranged, which means using another unpack instruction:

	movq_r2r(mm1,mm2);
	punpckhwd_r2r(mm0,mm2);

The copy is to avoid having to load, unpack, and normalize the chroma data
in mm1 all over again.  Now that we have the first two chroma samples from
each of U and V interleaved in mm2, we can use Packed Multiply and Add to
turn them into impact values:

	green_impact = 0x1906340819063408L;
	pmaddwd_m2r(green_impact,mm2);
	psrad_i2r(8,mm2);

The impact values are chosen carefully to maximize precision while still
remaining in range.  Specifically, you'll note that the highest scalar in
the equation is 2.018.  This means that we have to have 2 bits of
headroom.  Since we're going to be using left-shifts (or their equivalent)
to do the work of making this act like a fraction, our impact multipliers
must be based on a power of two.  We have 16 bits to play with, but need 2
bits of headroom, thus we multiply all the scalars by 2^13, which is
8192:

	0.391 * 16384 = 6406 = 0x1906
	0.813 * 16384 = 13321 = 0x3408

The same method will be used all throughout to calculate impact
multipliers, so we don't have to worry about mismatches at various points.
14 bits of precision is quite good, since in the final stages it comes
down to only a few bits anyway.

The right-shift corrects for the fact that we've taken a set of 8-bit
values (in mm2) and multiplied them by a set of 16-bit values
(green_impact), yielding a maximum possible usage of 24 bits.  Since we're
going to have to pack this data down into 16-bit data in order to
effectively keep it in registers and use it to form real pixel data, we
simply strip off the bottom 8 bits.

We now have two pixel's worth of green impact data in mm2, we can repeat
the process to get the other two pixels:

	movq_r2r(mm1,mm3);
	punpcklwd_r2r(mm0,mm3);
	pmaddwd_m2r(green_impact,mm3);
	psrad_i2r(8,mm3);

In order to combine the results, we can use the pack instruction, which
will squash and concatenate the two register's contents, performing
saturation:

	packssdw_r2r(mm2,mm3);


Red and Blue chroma impact
--------------------------
This should be simpler, since only one chroma sample is involved in each
calculation.  We're going to be doing 4 samples at a time, and it just so
happens that mm0 and mm1 still contain exactly that after calculation of
green impact (it was those movq's, ya see?):

	blue_impact = 0x8127812781278127L;
	red_impact = 0x6625662566256625L;
	shift_left_8 = 0x0100010001000100L;
	pmullw_m2r(shift_left_8,mm0);
	pmullw_m2r(shift_left_8,mm1);
	pmulhw_m2r(blue_impact,mm0);
	pmulhw_m2r(red_impact,mm1);

Note the fist multiply, which has the effect of shifted left by 8 bits.
This is necessary because there is no aritmetic left shift instruction.
This is to move the resulting value over another 8 bits, into the top 16
bits of the result.  The Packed Multiply High gives us only these top
bits, so we have what we came for.

Performance-wise, this is one of the worst things you can do, short of
repeatedly multiplying a single register.  The first multiply goes into
the U pipeline, there's nothing to pair (except maybe loading a constant
into another register in advance to hide potential memory latencies).  
The second multiply goes into the U pipeline one cycle behind the first,
but the third must stall for 2 cycles while waiting for the first to
complete, as will the 4th.  The only hope here is to find a way to
interleave these instructions with other work with no cross-dependencies.


Calculating luma impact
-----------------------
We now have to calculate the impact of the luminance samples.  The key
issue is that every pair of samples has to be applied to the same chroma
impacts.  To accomplish this, we'll use some new tricks, combined with
mutlipliers, to get even and odd samples' impacts:

	upperbyte = 0xff00ff00ff00ff00L;
	movq_m2r(*y_block,mm4);
	movq_r2r(mm4,mm5);
	pand_m2r(upperbyte,mm4);

This gives us the even samples (0, 2, 4, 6), but they're shifted up into
the top half of each word.  This happens to be pretty much what we want,
since normally the multiplication of 8- and 16-bit values requires a
left-shift of 8 bits.  It's already done for us.  However, note that in
the equation at top there's a subtraction done on each Y sample.  This is
messy, since there's no sane way to do that in its current form.  Such a
subtraction would cause the values to go signed, and since we're shifted
over there's no way we can become signed, short of shifting down,
subtracting, then back up again.

The trick we'll use is to do the subtraction later, since:

	(Y - 16) * x == (Y * x) - (16 * x)

We know x, therefore we can simply have a constant handy that we subtract
from the result of the multiplication, keeping in mind that the value has 
shifted 8 bits in the process of multiplication:

	luma_impact = 0x4a7f4a7f4a7f4a7f;
	luma_impact_16 = 0x04a804a804a804a8;
	pmulhw_m2r(luma_impact,mm4);
	psubw_m2r(luma_impact_16,mm4);

We now have the luma impact for the even pixels, which has to be added to
each of the R, G, and B impact in order to get pixel values:

	movq_r2r(mm1,mm5);
	movq_r2r(mm0,mm6);
	paddw_r2r(mm4,mm5);
	paddw_r2r(mm4,mm6);

Note that we're only doing red and blue right away.  This is so we can use
the Packed Multiply and Add trick in a couple instructions.  In the
meantime, however, we need to do several things.  The first is to
arithmetically shift both registers left by 6 bits.  This undoes the
precision added into the impact multipliers (14 - 8 = 6), since it's no
longer useful:

	psraw_i2r(6,mm5);
	psraw_i2r(6,mm6);

To be correct, however, we must apply saturation to the output.  This is
because of the possibility of super-white and over-black samples.
Super-white values have the annoying habit of turning into all sorts of
nice primary and secondary colors when the RGB data is not saturated.
Unfortunately, we've got all the data in words, and no good reason to have
them in byte form.  We'll have to bite the bullet and forcibly pack and
unpack them to achieve saturation:

	packsswb_r2r(mm7,mm5);
	punpcklbw_r2r(mm7,mm5);
	packsswb_r2r(mm7,mm6);
	punpcklbw_r2r(mm7,mm6);

That hurts, since that will take 4 cycles.  There may be some
opportunities to pair them with other work later on, however.


Conversion to RGB 05:5:5
------------------------
We now have two registers with 4 pixels of valid R and B data each.  The
trick is to get these packed into the normal 05:5:5 representation.  Here
we can use Intel's trick, which is to use PMADD to both shift and join the
two color components.  This requires that we have the color components 
interleaved.  This can be achieved with some unpacking:

	movq_r2r(mm6,mm7);
	punpcklwd_r2r(mm5,mm6);
	punpckhwd_r2r(mm7,mm5);

This gives us two registers with interleaved red and blue components,
paired together.  We could PMADD them now, but still end up with more
bits than we need.  That means having to mask off what we need before
continuing:

	mask_5_bits = 0x001f001f001f001fL;
	pand_m2r(mask_5_bits,mm5);
	pand_m2r(mask_5_bits,mm6);

Now we can PMADD them:

	red_blue_pack = 0x1000000810000008L;
	pmadd_m2r(red_blue_pack,mm5);
	pmadd_m2r(red_blue_pack,mm6);