When one wants to send data, like a picture or a song, across the internet, one usually compresses the data first. The fewer bits and bytes you can use to accurately represent the picture or song, the less time you have to spend sending that compressed data across the web. (Remember: time is money.) In order to compress data well, we must first represent it in some fashion that lends itself to compression. Recall the audio samples I did in class - by deciding which coefficients would be 0, I effectively reduced the size of the sound. However, there were still enough coefficients left for me to recognize and understand the sound. ("Bring out your dead ....")

There are a number of ways to represent data so that one might easily compress it. One method is based on the work of Joseph Fourier (early 1800s). A variation of his method is used today in the JPEG (Joint Photographic Experts Group) compression method. So let's get started and see how to get to JPEG...

The following is one of the most important (and useful) inner product spaces of all time.


Let V = C[-Pi, Pi], the vector space of continuous functions defined on the interval [-Pi, Pi]. This vector space has an inner product: if f(t) and g(t) are two functions in V, then their inner product is defined to be the integral from -Pi to Pi of f(t)g(t) dt. (See Chapter 6, Section 8, for details.)

With respect to this inner product, an orthogonal basis for V is the infinite set {1, Cos(t), Sin(t), Cos(2t), Sin(2t), Cos(3t), Sin(3t), ...}. This means that any function f(t) in V can be written as a sum of sines and cosines. This sum, called the Fourier Series for f(t), can be determined by the good ol' Orthogonal Decomposition Theorem (page 390): each term in the sum is the projection of f(t) onto Cos(nt) (or Sin(nt), depending on which basis element you're projecting onto).
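
If you'd like to check these claims numerically, here's a little sketch in Python (NumPy assumed; the integral is approximated by a Riemann sum on a fine grid, and f(t) = t is just a convenient example function - none of this is part of the theory above):

    import numpy as np

    # Approximate the inner product <f, g> = integral from -Pi to Pi of f(t)g(t) dt
    # by a Riemann sum on a fine grid.
    t = np.linspace(-np.pi, np.pi, 100001)
    dt = t[1] - t[0]

    def inner(f, g):
        return np.sum(f(t) * g(t)) * dt

    # Different basis elements should be (nearly) orthogonal: both close to 0.
    print(inner(np.sin, np.cos))
    print(inner(lambda s: np.cos(s), lambda s: np.cos(2 * s)))

    # Fourier coefficients of f(t) = t: project f onto Sin(nt).
    for n in (1, 2, 3):
        sin_n = lambda s, n=n: np.sin(n * s)
        c = inner(lambda s: s, sin_n) / inner(sin_n, sin_n)
        print(n, c)   # should be close to 2*(-1)^(n+1)/n, i.e. 2, -1, 2/3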

Here are graphs of some of the basis elements:

[Plots of Sin(nt) and Cos(nt) for n = 1, 2, 3.]



So by integrating, I can represent a function as a sum of sines and cosines. Unfortunately, this doesn't get me very far in my compression problem. First of all, my data is not a function f(t). Rather, my data consists of a function f(t) sampled at different times (i.e. something like f(0), f(1), f(2), ..., f(N-1)). What can I do?

Well, as we learned in calculus, we can do a discrete approximation of the integral. (Does the Trapezoid Rule sound familiar?) And this is what's done on computers.

Ok, so suppose my data is a function sampled at N evenly spaced times, F = ( f(0), f(1), f(2), ..., f(N-1) ). This is a vector in N-dimensional Euclidean space. For reasons that are a little beyond our scope, a very good basis for our purposes is one whose vectors are sampled values of Cosines.

Hold N fixed, and let b(0) = Sqrt[1/N] and b(k) = Sqrt[2/N] for k not equal to 0. Let A be the square (N x N) matrix whose columns are these fabulous vectors: the k-th column of A (I'll be numbering my columns from 0 to N-1 and not 1 to N) will be

C_k = b(k)*( Cos(k*Pi/(2*N)), Cos(3*k*Pi/(2*N)), Cos(5*k*Pi/(2*N)),..., Cos((2*N-1)*k*Pi/(2*N))).

This basis is very closely related to the Cosines in the Fourier series (though I won't go into why). I need the funny b(k) function to make the columns of A orthonormal. Yes - the matrix A is an orthogonal matrix. Here are plots of the columns of A for the case of N = 4, from the first column (n = 0), to the fourth (n = 3):

[Plots of the four columns of A for N = 4: n = 0, 1, 2, 3.]
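
If you want to build A yourself and check that it really is orthogonal, here's a sketch in Python (NumPy assumed):

    import numpy as np

    def dct_matrix(N):
        """The N x N matrix A whose k-th column is C_k (columns numbered 0 to N-1)."""
        A = np.zeros((N, N))
        for k in range(N):
            b = np.sqrt(1.0 / N) if k == 0 else np.sqrt(2.0 / N)
            for n in range(N):
                # The n-th entry of C_k samples Cos at (2n+1)*k*Pi/(2N).
                A[n, k] = b * np.cos((2 * n + 1) * k * np.pi / (2 * N))
        return A

    A = dct_matrix(4)
    print(np.allclose(A.T @ A, np.eye(4)))   # True - the columns are orthonormal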

If I looked at the matrix A as a grey scale image (so the smaller a number is, the darker I colour it; the larger a number is, the lighter I colour it), here's what it would look like:


Here's the matrix A for N = 8:


And for N = 64:


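If you'd like to produce pictures like these yourself, here's one way (this assumes matplotlib for the plotting, plus the dct_matrix function from the sketch above - neither is part of the course material):

    import matplotlib.pyplot as plt

    for N in (8, 64):
        plt.figure()
        plt.imshow(dct_matrix(N), cmap='gray')   # small entries dark, large entries light
        plt.title('The matrix A for N = {}'.format(N))
    plt.show()
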

Given an orthonormal basis set {u_0, ..., u_(N-1)} for R^N, let P be the matrix whose columns are the u_i's. We know that it's easy to write any vector X as a linear combination of the basis vectors:

X = c_0 u_0 + ... + c_(N-1) u_(N-1)

where the weights/scalars c_i can be computed by multiplying transpose(P)*X. That is, the scalar c_i is the inner product of X and u_i. We can also think of this as a change of basis.

And that's what we do here. (Ok, back to the samples F = ( f(0), f(1), f(2), ..., f(N-1) ).) We want to write F as a linear combination of the columns of A. So we need to compute the inner products. That is, multiply transpose(A)*F = F-hat. In this case, for this special Cosine matrix A, transpose(A)*F is known as the Discrete Cosine Transform (DCT) of F. The vector F-hat holds the coefficients of F. Note that since A is an orthogonal matrix, we also have A * F-hat = F (so we can go from the coefficients back to the signal samples). The JPEG standard uses the DCT.
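
In code, the transform and its inverse are just matrix-vector products. Here's a sketch (using the dct_matrix function from the earlier snippet; the signal F is made up for illustration):

    import numpy as np

    A = dct_matrix(8)
    F = np.array([10., 12., 15., 11., 9., 8., 8., 7.])   # a made-up signal, N = 8

    F_hat = A.T @ F       # the DCT of F: F's coefficients in the Cosine basis
    F_back = A @ F_hat    # A is orthogonal, so this takes us back to the samples
    print(np.allclose(F_back, F))   # True
    # (For the curious: scipy.fft.dct(F, type=2, norm='ortho') computes the same F_hat.)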

Here are pictures of two examples for the case N = 8. In the first example, why is there only one non-zero coefficient?

[Plot of the first signal and its coefficients.]
[Plot of the second signal and its coefficients.]


Note in the second example that a lot of the coefficients are (relatively) close to 0. This is not entirely a coincidence. The DCT is designed so that the larger coefficients tend to occur first and the smaller coefficients later. This is one way to compress - set those smaller coefficients to 0. (JPEG does things a bit more cleverly.)
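
Here's what that crude compress-by-zeroing looks like in code, continuing with the F and A from the snippet above (the threshold 0.5 is an arbitrary choice):

    F_hat = A.T @ F
    F_hat[np.abs(F_hat) < 0.5] = 0.0      # throw away the small coefficients
    F_approx = A @ F_hat                  # reconstruct from the survivors
    print(np.max(np.abs(F_approx - F)))   # the reconstruction error is small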

Ok, so we can (sort of) see how to take the DCT of a one-dimensional signal. It's just a matrix times a column vector. But what about 2-dimensional signals? That is, what about images?

An image F can be thought of as an N x N matrix in the vector space of matrices M_(NxN). (So now F is a matrix and not a vector.) We want to find a good basis of M_(NxN), one that lends itself to compression (like in the 1-d case). Since dim M_(NxN) = N^2 (N-squared), our basis will have N^2 matrices. How do we find them? Easy - we use the basis from the above 1-d case! Here's how.

Remember the C_k's up above, those column vectors made up of sampled Cosines? Well, for j = 0, 1, ..., N-1 and k = 0, 1, ..., N-1, let

B_(j,k) = C_j * transpose(C_k)     (this is an NxN matrix)

These N^2 many matrices form an orthonormal basis of M_(NxN) (with respect to the inner product I mentioned in class: the inner product of two matrices D and E is computed by first multiplying together their corresponding entries and then adding those products together). That's right, not only are the B_(j,k)'s orthogonal, they're orthonormal. Neat, huh?
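
Here's a sketch that builds the B_(j,k)'s and spot-checks the orthonormality claim (again using dct_matrix from above; mat_inner is the entrywise inner product just described):

    import numpy as np

    N = 4
    A = dct_matrix(N)
    B = {(j, k): np.outer(A[:, j], A[:, k]) for j in range(N) for k in range(N)}

    def mat_inner(D, E):
        # Multiply corresponding entries, then add the products together.
        return np.sum(D * E)

    print(np.isclose(mat_inner(B[1, 2], B[1, 2]), 1.0))   # each B_(j,k) has unit length
    print(np.isclose(mat_inner(B[1, 2], B[0, 3]), 0.0))   # distinct B_(j,k)'s are orthogonal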

Here are pictures of the B_(j,k)'s for the cases N = 4 and N = 8 (the B_(j,k)'s are arranged in a single array, with the top row containing the matrices B_(0,k) for k = 0,..., N-1, the second row containing B_(1,k) for k = 0,..., N-1, and so on):

[The arrays of B_(j,k)'s for N = 4 and for N = 8.]



We define the 2-D DCT of an image F to be the matrix product

transpose(A) * F * A = F-hat

That's the same A as in the 1-D case. Basically, what the matrix product does is first take the DCT of all the rows of F, and then take the DCT of all the columns of that intermediate result. Or in other (more familiar) words, we're first projecting the rows of F onto the basis vectors C_k (that's the F*A), then projecting the columns of that intermediate result onto the C_k's as well (that's multiplying by transpose(A)).
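
As a sketch in code (dct_matrix as before; the 4x4 "image" F is made up for illustration):

    import numpy as np

    N = 4
    A = dct_matrix(N)
    F = np.array([[100.,  80.,  60.,  40.],
                  [ 80., 100.,  80.,  60.],
                  [ 60.,  80., 100.,  80.],
                  [ 40.,  60.,  80., 100.]])   # a made-up grayscale image

    F_hat = A.T @ F @ A       # the 2-D DCT of F
    F_back = A @ F_hat @ A.T  # A is orthogonal, so this recovers the image
    print(np.allclose(F_back, F))   # True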

Here's an example using a 4x4 image:

[The input image (i.e. F) and its coefficients (i.e. F-hat).]


Note that the coefficient array is brighter in the top left corner. As in the 1-d case, this isn't a coincidence - it's because we're using the cool Cosine basis. The 2-d DCT is designed to "put" the larger coefficients in the top left corner and the smaller ones elsewhere. That lends itself to compression.

So how do we interpret the coefficient array? This way. Let F-hat(i,j) denote the entry in the i-th row and j-th column of F-hat. Then the image/matrix F is equal to

F = F-hat(0,0) * B_(0,0) + F-hat(0,1) * B_(0,1) + ... + F-hat(N-1,N-1) * B_(N-1,N-1)

In a sense, as we compute first F-hat(0,0) * B_(0,0), then F-hat(0,0) * B_(0,0) + F-hat(0,1) * B_(0,1), and so on, we're adding more and more detail to the image. Eventually, after adding all N^2 many terms, we get the image back.
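
In code, building the image back up term by term looks like this (continuing with F, F_hat, A, and N from the 4x4 sketch above; here I'm adding the terms row by row rather than in the zigzag order discussed below):

    partial = np.zeros_like(F)
    for i in range(N):
        for j in range(N):
            # Add the (i,j) detail term: F-hat(i,j) * B_(i,j).
            partial = partial + F_hat[i, j] * np.outer(A[:, i], A[:, j])
    print(np.allclose(partial, F))   # after all N^2 terms, the image is back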

Here I have arranged the partial sums for the above 4x4 case as an array.



The 4x4 block in the lower right corner of the above array is the original image I started with. (Technical note: I'm doing the partial sums in a zigzag fashion. That is, I've ordered the coefficients F-hat(i,j) in the following fashion when doing the sums: F-hat(0,0), F-hat(0,1), F-hat(1,0), F-hat(2,0), F-hat(1,1), F-hat(0,2), F-hat(0,3), F-hat(1,2), F-hat(2,1), F-hat(3,0), F-hat(3,1), F-hat(2,2), F-hat(1,3), F-hat(2,3), F-hat(3,2), F-hat(3,3). Why? Well, as I'm adding terms, I'm adding details, and I want to add details in an "even" fashion. It's like I'm first adding horizontal features at a given level, followed by diagonal features at the same level, followed by vertical features at the same level. That's adding the terms along one of the matrix's "anti-diagonals". Next I go to the next anti-diagonal, which holds the details at the next level. And so on, and so on. JPEG does things in a zigzag way as well.)
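
Here's one way to generate that zigzag ordering for a general N x N array (a sketch; it walks the anti-diagonals, flipping direction on every other one):

    def zigzag_order(N):
        """The (i, j) index pairs of an N x N matrix, listed in zigzag order."""
        order = []
        for s in range(2 * N - 1):      # s indexes the anti-diagonals (i + j = s)
            diag = [(i, s - i) for i in range(N) if 0 <= s - i < N]
            if s % 2 == 0:
                diag.reverse()          # flip direction on every other diagonal
            order.extend(diag)
        return order

    print(zigzag_order(4))   # reproduces the 16-term ordering listed above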

Here's an example with an image of size 256x256, showing how the image gets more detailed as more terms are added:

[Reconstructions using 1, 10, 20, 50, 75, 100, 500, 1000, 2000, and all 65,536 terms.]




Given an NxN image, here's how JPEG works (roughly):
  • First the image is divided into 8x8 blocks. (So an image F of size 256x256 is divided into 1024 blocks each of size 8x8.)
  • Next, JPEG calculates the DCT of each of those 8x8 blocks (so if G denotes a particular 8x8 block of the 256x256 image and A is the 8x8 DCT matrix (the columns are the sampled Cosines), then JPEG computes transpose(A)*G*A = G-hat. JPEG does this for all blocks, getting 1024 different G-hats).
  • Then JPEG "massages" each G-hat: it divides each entry of G-hat by the corresponding entry of a special quantization matrix and rounds the result, which throws away (sets to 0) most of the smaller (i.e. unimportant) coefficients. This is why JPEG is known as a "lossy" compression scheme.
  • And finally, JPEG reconstructs the image. Enough coefficients have been kept so that the reconstructed image (i.e. A*modified(G-hat)*transpose(A)) looks pretty much like the original. (A rough sketch of these steps in code follows right after this list.)
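
Here's that rough sketch in Python (dct_matrix as before; the quantization matrix Q below is made up for illustration - the real JPEG standard uses carefully chosen tables, and also entropy-codes the results, which I'm glossing over):

    import numpy as np

    def jpeg_roundtrip(image, Q):
        """Blockwise DCT, quantize, dequantize, inverse DCT: a toy model of JPEG."""
        N = 8
        A = dct_matrix(N)
        out = np.zeros_like(image, dtype=float)
        for r in range(0, image.shape[0], N):
            for c in range(0, image.shape[1], N):
                G = image[r:r+N, c:c+N]
                G_hat = A.T @ G @ A                      # DCT of the 8x8 block
                G_q = np.round(G_hat / Q)                # quantize: small coefficients become 0
                out[r:r+N, c:c+N] = A @ (G_q * Q) @ A.T  # dequantize and reconstruct
        return out

    # A made-up quantization matrix, harsher on the high-frequency coefficients.
    i, j = np.indices((8, 8))
    Q = 1.0 + 4.0 * (i + j)

    image = np.random.rand(256, 256) * 255.0   # a stand-in for a real 256x256 image
    reconstructed = jpeg_roundtrip(image, Q)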


  • We can see this "lossiness" in the example below. JPEG was able to set nearly 73% of the image's 65,536 DCT coefficients to 0.

    [The original image, the JPEG'd image, and the difference between the two.]



    Here is another example of an image, showing both the partial sums and the JPEG versions.