A tale of two benchmarks
Good benchmarks are extremely useful tools for software developers. They tell you the weaknesses of your hardware and software, so you know what to optimize. Customers also care about them a lot, and higher benchmark scores mean more sales.
What is a benchmark anyway?
Usually, benchmarks focus on a specific area, like how many fish your web browser can draw per second, how many triangles your GPU can render, or how good your compiler is at optimization. Within that area, a good benchmark will have many different tests which represent various real-world tasks. It's really important that tests represent real-world tasks or you will end up optimizing something that doesn't matter. It's also important that tests cover many real-world scenarios so you don't miss optimizing something important.
Let's pretend we're creating a new benchmark. We'll assume we are measuring speed and that we've already written a suite of tests which represent a broad set of real-world scenarios. Our challenge is to use these tests to come up with information that is useful for comparing performance on different platforms.
There are two basic ways to measure speed: we could measure how long it takes to do a certain amount of work (i.e., how long it takes to run a test 10 times), or we could measure how much work we can get done in a certain amount of time (i.e., how many times can we run each test in 10 seconds). These are both good approaches, and they effectively measure the same thing (work per time is the reciprocal of time per work). I tend to prefer the second approach, since higher scores (more iterations) are better, which is intuitive. It also puts an upper bound on how long the benchmark will run, which is useful on slow platforms.
In both cases, you'll want to run each test for a minimum number of iterations and find the mean time per iteration. This will reduce the impact of "random" events like garbage collection pauses or OS task switching. You should also strive to make each iteration of a test as consistent as possible. For example, if you are trying to measure garbage collector performance, but the garbage collector only kicks in every 10 iterations, you would measure a large difference between a system that runs 19 iterations and a system that runs 20 iterations, even though the actual difference is very small.
Once you have average scores for all tests in your benchmark, you need a way to compare them with scores on other platforms. The best way to do this is to pick a "reference" score, and divide by that. This is called normalization, and it's really important. Let's say your company releases its first product, and it runs 200 iterations of FooTest in 10 seconds. When you release version 2.0 the following year, it can run 300 iterations of FooTest. You want to tell your customers "version 2.0 is 50% better at Foo!" You don't want to say "version 2.0 is 100 iterations better at FooTest." That's not meaningful since it depends on how the test was written; if you had written FooTest so it did twice as much work per iteration, you would be saying "50 iterations better", even though the same amount of work would be done per second.
You also need a way to combine scores from individual tests into one grand total score. If your test scores are normalized, i.e., you divided by a reference score, each score is a ratio. In order to get a total score, you must not add them together or use the arithmetic mean. This will give you meaningless results. Instead, use the geometric mean. The geometric mean is the nth root of the product of n ratios:
The geometric mean is intended for averaging ratios (normalized data). If you don't believe me on this, I recommend you read How not to lie with statistics: The correct way to summarize benchmark results [PDF] by Fleming Wallace. Note that it's not meaningful to use the geometric mean on raw, non-normalized data. Don't bother summarizing raw data without normalizing it.
V8 gets it right
There is one test in V8, which is unfortunately very inconsistent. Splay measures garbage collector performance by building a large data structure, then repeatedly adding and removing small parts of it. Usually, garbage collectors kick in after a certain amount of memory has been allocated, but this amount varies a lot. The amount of time taken by the garbage collector depends not only on the implementation (what we want to measure) but also the current state of the heap (which depends on earlier tests, what other tabs you have open, what you have in the cache, etc). The time taken for a collection cycle can vary by orders of magnitude depending on whether the whole heap is collected or only the young generation. To make matters worse, the garbage collector usually won't run at all on most iterations of Splay. Whether you get a good score on Splay is largely a matter of luck.
SunSpider gets it wrong
SunSpider also doesn't normalize scores for each test. It just runs each test for a fixed number of iterations, takes the average running time for each test, then adds the averages together. If you look at the results, it should be obvious why normalization is so important and why adding raw data together is wrong. On my desktop, bitops-3bits-in-byte runs in 0.8 ms on average, but access-fannkuch runs in 16.2 ms. That means, from an optimization perspective, that access-fannkuch is 20.25 times as important bitops-3bits-in-byte. Even if you made bitops-bits-in-byte run infinitely fast (in 0 ms), you would only reduce your SunSpider score by 0.5%. This is the wrong conclusion. Bit operations in general may be just as important as access operations, but the benchmark suite doesn't tell us that.
Benchmarks are a critical tool for measuring performance. Developers use them to determine what areas of a system need the most optimization. Because of they way benchmarks are used, it is crucial that they accurately reflect real world performance so that optimizations actually improve user experience.