Optimized, table-less YUV 4:2:0 conversion to RGB 05:5:5 method =============================================================== by Erik Walthinsen Copyright (c) 2000 Licensed under the OpenContent License v1.0 (opencontent.org/opl.shtml) This document will describe the methods used in constructing a highly MMX-optimized conversion from YUV 4:2:0 planar data (as is common in video processing applications, such as DV and MPEG decode) to RGB 05:5:5 packed format, as is used for most video cards. Design decisions ================ At the outset, several design decisions have to be made. The first decision was to create a macroblock-sized conversion, i.e. take a 16x16 luma plane and two 8x8 chroma planes, yielding a 16x16 RGB plane. This is driven by the structure of the MPEG-2 decoder, which when optimized for a number of caching issues will operate on data in a much more local fashion than the naive implementation of processing the whole frame at once, per operation. A result of this decision is that cache polution is to be avoided, lest the gains found in rearranging the MPEG decoder be lost as data must be constantly reloaded from main memory to service the next iteration of the decoder. This means that lookup tables are not possible, unless they are very small. As there are 256 possible values for each of luma and chroma, and both chroma samples impact two each of the R, G, and B values, we're looking at 1280 table entries. For precision, these must be at least 16 bits, yielding a table size of 2560 bytes at minimum. Per iteration, as many as 512 of these entries may be used, as that is the total number of values calculated in a macroblock (256 luma, 64 each per chroma, 2 impacts per chroma). This means that a significant number of cache lines may be touched per iteration, loading the whole cache line (32 bytes) into memory, possibly displacing data that could be useful in the next round. This leaves us with an algorithm that must utilize multiplication rather heavily, with a possible loss of some precision depending on the scale of those multipliers. However, this actually gives us an advantage, if we're clever enough to utilize it: the structure of Pentium MMX processors allows MMX multiplies to be pipelined, such that even though a single multiply will take 3 cycles for the result to be available, many multiplies can be stacked one after another, so long as they violate any other scheduling rules. The idea would be to postpone mutliply operations in order to prepare several, then run them through the pipeline all at once. Another issue with Pentium scheduling is a 1-cycle stall before copying (MOVQ or MOVD) a register to anything but another register. This means that either we have to avoid memory access as much as possible (a good strategy in general), or be prepared to do something else during that stall. This shouldn't be too much of an issue, but it something to think about. Regardless, an attempt will be made to do the entire conversion within the available registers, avoiding the overhead of writing to temporary memory locations. Color-space conversion algorithm ================================ Color space conversion, in the case of YUV to RGB, consists of multiplying luma and chroma samples by constants and summing to yield the R, G, and B components. The algorithm is as follows: R = 1.164 (Y-16) + 1.596(V-128) G = 1.164 (Y-16) - 0.391(U-128) - 0.813(V-128) B = 1.164 (Y-16) + 2.018(U-128) Note the subtractions on each. These normalize the values into the proper ranges. In the luma case, this is because the total brightness range that YUV allows is larger than that of RGB. While technically the luma range is from 16 to 235, there are instances where so-called 'super-white' and 'over-black' pixels appear. These values are actually supposed to be clamped to the above range, which is something to worry about in implementing a correct conversion. In the luma case, the normalization actually brings the values into the -112 to +112 range, which requires a signed number to store. This means that signed arithmetic must be used in any operations involving the chroma samples, which happens to be almost all of them. Of course, this algorithm must be applied to the correct data. The key here is that the chroma bandwidth is 1/4 that of the luma bandwidth. This is accomplished by colocating the two chroma samples with every other luma sample, both horizontally and vertically: YC Y YC Y YC Y = luma sample Y Y Y Y Y C = chroma samples YC Y YC Y YC Y Y Y Y Y This means that we have a lot less work to do. We have to do the impact calculations for chroma only once for every four output pixels. Since the work necessary to construct the luma impact is constant across each of R, G, and B, this simplifies our work quite a bit. MMX tricks ========== [Un]Packing ----------- The basic idea behind MMX is to operate on data in parallel, either 2, 4, or 8 at a time depending on the width of the data in question. Of course, if it were nothing but that, it wouldn't be very interesting, since you'd have to arrange your data very carefully into the right sequence in order to use the instructions. Instead, MMX provides a number of very useful data manipliation instructions. The fundamental operations here are packing and unpacking. In our case, we're going to be starting with data that's packed as bytes, but our operations are going to require more precision than 8 bits can provide. Unpacking is precisely what we need. Unpacking will take the top or bottom half of a register, and promote those values to the next higher bit count. We'll be using it to take the byte samples, 4 at a time, and convert them into 16-bit word samples. One thing to note is that unpacking actually takes two registers. Bytes or words from the same location will be taking from each register, and interleaved to form the new wider values. This means that if all you want to do is widen the data already in a register, you have to provide a register full of zeros in order to fill the top half of each of the resulting values. If you're dealing with signed data, things get much hairier, since you'll have to unpack with a register filled with 1's, which means using some conditional operations to deal with that. Luckily, none of the work we have to do will require us to unpack signed values. The inverse of this, packing, is the process of taking two register's contents and demoting every value to half it's bit count. This means taking two register that contain 4 words each and combining them into a single register with 8 bytes. The key behind this operation is the saturation that occurs during packing. Any value that is outside the range of the destination width is automatically brought within that range. That means that if we were to end up with any super-white pixels, where one of the resulting RGB components is outside the 0 - 255 range, it will automatically be clipped down to 255. Whether we can take advantage of this depends on whether or not we will be doing any packing. High-low mutliplication ----------------------- Any time you multiply two values, you end up with a value that's larger than either of the original two. In binary terms, the resulting register must have enough bits to hold the sum of the maximum bit count in each source register. That means that the result of multiplying two 16-bit numbers must be stored in a 32-bit register, lest data get lost if either of the source values is 'too large'. MMX's multiply routines are designed not to change the type of data stored in a register, so multiplying registers with 16-bit words together will always result in a register with 16-bit words. "That can't work!" you say. Yes it can, if it takes two instructions to do the work, one for the high result bits and one for low. If you want to produce data in the wider format, it'll take another register to unpack the two resulting registers' data into the proper format. However, if you're doing mutliplication that will always adhere to some bit boundaries, you can be clever and retain precision while dumping excess data, all in one operation. Consider the case we'll run into, where we need to multiply an 8-bit value and a 16-bit value, yet come out with a 16-bit value. Normally this would result in a 24-bit peak value: aaaaaaaa * bbbbbbbbbbbbbbbb = cccccccccccccccccccccccc When this is done with MMX multiply instructions, you'll end up with one register containing the top 8 bits and the other containing the bottom 16. Since we only want the top 16 bits, we'd have to play some pretty ugly games to get things back to the state we want them. But what if one were to shift A over to the left 8 bits, to 'fill out' the 16-bit space? aaaaaaaa00000000 * bbbbbbbbbbbbbbbb = cccccccccccccccccccccccc00000000 Remember, shifting has the same effect whether it's done before or after the multiply, because it really is just another multiply, and you can re-arrange those all you want. Note now that when you mutliply this with MMX, you get the top 16 bits of your answer in one register, and the bottom 8 plus the 0's you added in the other. Voila! Since all you want is the top 16 bits, you're done. Packed Multiply and Add ----------------------- This instruction puzzled me for quite some time, as I wondered what on earth it was good for. Eventually, in the process of pondering this color-conversion process, it hit me. It is perfect for several parts of this process. The idea behind Packed Multiply and Add is to take two registers containing words, mutliply each coincident pair of words together (into 32-bit temporary space), then sum the results of each pair of multiplies: aaaaaaaa bbbbbbbb cccccccc dddddddd * * * * eeeeeeee ffffffff gggggggg hhhhhhhh ======== ======== ======== ======== + + ========== =========== (A*E + B*F) (C*G + D*H) This is particularly useful when calculating the chroma impact for the green component. The key is to interleave paired U and V samples in one register, and do the same to the impact mutlipliers. Of course, this only does two samples at a time, but it's vastly superior to the alternatives (massive unpacking and many multiplies, or reduced precision). The other major use for this instruction is in packing samples into RGB 05:5:5. Intel has an appnote on this, which covers the concept rather nicely. Rather than duplicating it here, go there. Shifting == Multiplying ----------------------- In certain extreme cases, you may find yourself in a situation where you've optimized the code as far as it will go, but you've got a set of multiplies or shifts clustered together that you can't break up. Since there is only one MMX shifter, a cluster of MMX shifts will fill the U pipeline but stall the V pipeline continually. Assuming you don't have dependency chains between registers to prevent this, remember that you can dispatch a shift and a multiply in the same cycle. The fact that a shift is nothing more than a multiply by a power of 2, done more efficiently, can be used (if you have the spare registers or cached constants) can be used to alternate between shifts and multiplies, allowing tigher pairing. Calculating chroma impact ========================= Green chroma impact ------------------- The first step is to calculate the impact of the chroma samples. Since green impact is the hardest, as well as the most register-hungry, we'll start there. The obvious first place to start is to actually load the pixels: movq_m2r(*u_block,mm0); movq_m2r(*v_block,mm1); Of course, this gives us the first 8 pixels of each, which is more than we want to deal with right off. To solve this problem we can unpack the high 4 bytes of each, using a conviently zero'd register: pandn_r2r(mm7,mm7); punpckhbw_r2r(mm7,mm0); punpckhbw_r2r(mm7,mm1); This gives us what we want, which is the first four pixels of each in mm0 and mm1, 16 bits wide each. Now we have to do range correction, bringing them into the proper -128 to +128 range: val128w = 0x0080008000800080L; psubw_m2r(val128w,mm0); psubw_m2r(val128w,mm1); Now to pull the Packed Multiply and Add trick. We need to get two pairs of chroma samples arranged, which means using another unpack instruction: movq_r2r(mm1,mm2); punpckhwd_r2r(mm0,mm2); The copy is to avoid having to load, unpack, and normalize the chroma data in mm1 all over again. Now that we have the first two chroma samples from each of U and V interleaved in mm2, we can use Packed Multiply and Add to turn them into impact values: green_impact = 0x1906340819063408L; pmaddwd_m2r(green_impact,mm2); psrad_i2r(8,mm2); The impact values are chosen carefully to maximize precision while still remaining in range. Specifically, you'll note that the highest scalar in the equation is 2.018. This means that we have to have 2 bits of headroom. Since we're going to be using left-shifts (or their equivalent) to do the work of making this act like a fraction, our impact multipliers must be based on a power of two. We have 16 bits to play with, but need 2 bits of headroom, thus we multiply all the scalars by 2^13, which is 8192: 0.391 * 16384 = 6406 = 0x1906 0.813 * 16384 = 13321 = 0x3408 The same method will be used all throughout to calculate impact multipliers, so we don't have to worry about mismatches at various points. 14 bits of precision is quite good, since in the final stages it comes down to only a few bits anyway. The right-shift corrects for the fact that we've taken a set of 8-bit values (in mm2) and multiplied them by a set of 16-bit values (green_impact), yielding a maximum possible usage of 24 bits. Since we're going to have to pack this data down into 16-bit data in order to effectively keep it in registers and use it to form real pixel data, we simply strip off the bottom 8 bits. We now have two pixel's worth of green impact data in mm2, we can repeat the process to get the other two pixels: movq_r2r(mm1,mm3); punpcklwd_r2r(mm0,mm3); pmaddwd_m2r(green_impact,mm3); psrad_i2r(8,mm3); In order to combine the results, we can use the pack instruction, which will squash and concatenate the two register's contents, performing saturation: packssdw_r2r(mm2,mm3); Red and Blue chroma impact -------------------------- This should be simpler, since only one chroma sample is involved in each calculation. We're going to be doing 4 samples at a time, and it just so happens that mm0 and mm1 still contain exactly that after calculation of green impact (it was those movq's, ya see?): blue_impact = 0x8127812781278127L; red_impact = 0x6625662566256625L; shift_left_8 = 0x0100010001000100L; pmullw_m2r(shift_left_8,mm0); pmullw_m2r(shift_left_8,mm1); pmulhw_m2r(blue_impact,mm0); pmulhw_m2r(red_impact,mm1); Note the fist multiply, which has the effect of shifted left by 8 bits. This is necessary because there is no aritmetic left shift instruction. This is to move the resulting value over another 8 bits, into the top 16 bits of the result. The Packed Multiply High gives us only these top bits, so we have what we came for. Performance-wise, this is one of the worst things you can do, short of repeatedly multiplying a single register. The first multiply goes into the U pipeline, there's nothing to pair (except maybe loading a constant into another register in advance to hide potential memory latencies). The second multiply goes into the U pipeline one cycle behind the first, but the third must stall for 2 cycles while waiting for the first to complete, as will the 4th. The only hope here is to find a way to interleave these instructions with other work with no cross-dependencies. Calculating luma impact ----------------------- We now have to calculate the impact of the luminance samples. The key issue is that every pair of samples has to be applied to the same chroma impacts. To accomplish this, we'll use some new tricks, combined with mutlipliers, to get even and odd samples' impacts: upperbyte = 0xff00ff00ff00ff00L; movq_m2r(*y_block,mm4); movq_r2r(mm4,mm5); pand_m2r(upperbyte,mm4); This gives us the even samples (0, 2, 4, 6), but they're shifted up into the top half of each word. This happens to be pretty much what we want, since normally the multiplication of 8- and 16-bit values requires a left-shift of 8 bits. It's already done for us. However, note that in the equation at top there's a subtraction done on each Y sample. This is messy, since there's no sane way to do that in its current form. Such a subtraction would cause the values to go signed, and since we're shifted over there's no way we can become signed, short of shifting down, subtracting, then back up again. The trick we'll use is to do the subtraction later, since: (Y - 16) * x == (Y * x) - (16 * x) We know x, therefore we can simply have a constant handy that we subtract from the result of the multiplication, keeping in mind that the value has shifted 8 bits in the process of multiplication: luma_impact = 0x4a7f4a7f4a7f4a7f; luma_impact_16 = 0x04a804a804a804a8; pmulhw_m2r(luma_impact,mm4); psubw_m2r(luma_impact_16,mm4); We now have the luma impact for the even pixels, which has to be added to each of the R, G, and B impact in order to get pixel values: movq_r2r(mm1,mm5); movq_r2r(mm0,mm6); paddw_r2r(mm4,mm5); paddw_r2r(mm4,mm6); Note that we're only doing red and blue right away. This is so we can use the Packed Multiply and Add trick in a couple instructions. In the meantime, however, we need to do several things. The first is to arithmetically shift both registers left by 6 bits. This undoes the precision added into the impact multipliers (14 - 8 = 6), since it's no longer useful: psraw_i2r(6,mm5); psraw_i2r(6,mm6); To be correct, however, we must apply saturation to the output. This is because of the possibility of super-white and over-black samples. Super-white values have the annoying habit of turning into all sorts of nice primary and secondary colors when the RGB data is not saturated. Unfortunately, we've got all the data in words, and no good reason to have them in byte form. We'll have to bite the bullet and forcibly pack and unpack them to achieve saturation: packsswb_r2r(mm7,mm5); punpcklbw_r2r(mm7,mm5); packsswb_r2r(mm7,mm6); punpcklbw_r2r(mm7,mm6); That hurts, since that will take 4 cycles. There may be some opportunities to pair them with other work later on, however. Conversion to RGB 05:5:5 ------------------------ We now have two registers with 4 pixels of valid R and B data each. The trick is to get these packed into the normal 05:5:5 representation. Here we can use Intel's trick, which is to use PMADD to both shift and join the two color components. This requires that we have the color components interleaved. This can be achieved with some unpacking: movq_r2r(mm6,mm7); punpcklwd_r2r(mm5,mm6); punpckhwd_r2r(mm7,mm5); This gives us two registers with interleaved red and blue components, paired together. We could PMADD them now, but still end up with more bits than we need. That means having to mask off what we need before continuing: mask_5_bits = 0x001f001f001f001fL; pand_m2r(mask_5_bits,mm5); pand_m2r(mask_5_bits,mm6); Now we can PMADD them: red_blue_pack = 0x1000000810000008L; pmadd_m2r(red_blue_pack,mm5); pmadd_m2r(red_blue_pack,mm6);