Friday, February 18, 2000
Now that both AMD and Intel offer 800MHz processors, users and OEMs have a genuine choice when it comes to high performance PC platforms. But the contest is not merely between the two processors. Chip sets and DRAM can also add to or detract from platform throughput, as well as affect cost effectiveness.
This report is a performance evaluation of three hot new platforms for high performance graphics/desktop computing. At the processor level, Athlon and Coppermine will be compared under intense graphics loads and under synthetic bandwidth loads. At the platform level, VIA’s KX133 and Apollo Pro 133A chip sets with SDRAM will face off against Intel’s 820 with Rambus DRAM.
We are setting out to stress the platform to extremes in processor performance, DRAM performance and in 3D graphics. Clearly, of these three, the performance category that will scale the most in the future is 3D graphics. But how will it scale and how will it stress platform resources in the future?
Traditional Benchmarking Approach
The industry seems quite comfortable with the idea that the PC performance threshold is being driven today largely by games and by graphically oriented workstation applications. These two market segments perpetually demand increased computational horsepower for improved image quality, higher detail and image complexity, faster user response, etc.
There are numerous popular benchmarking techniques and tools to aid in evaluating component and system level performance for the graphics power user. As 3D benchmarks are used to evaluate CPU, chip set and DRAM performance, it has become common practice to crank down the screen resolution, color depth and image quality features to their minimum. On the surface, this may seem logical, but in effect, the hottest, fastest PCs on earth are being benchmarked at 1997 screen resolutions.
The idea is to prevent the accelerator from becoming a system performance bottleneck. Though this approach has merit, it does not reflect the way a power user would actually use such a system. Instead, games and applications should be tested in their most popular or perhaps most demanding modes in order to reflect how they are used today, or will be used in the future.
Future Performance Considerations
Recently, Intel has emphasized this same point in a slightly different context. Reacting to a lukewarm response to the performance of Intel’s 820 chip set, Intel has openly criticized current benchmarks as inadequate to demonstrate the true performance ‘headroom’ of the 820 with Rambus.
Intel claims that ‘current benchmarks reflect the past,’ and that future applications and benchmarks will somehow demonstrate RDRAM’s performance ‘headroom’. This is, of course, a very difficult claim to substantiate. Since Intel has not yet made a clear attempt to substantiate its claim, it is left to technology analysts to take their best shot. We are, in essence, in search of that ‘headroom’. We move forward with the notion that we need better benchmarks or better benchmarking techniques in order to demonstrate the benefit of high bandwidth potential.
We are not going to try to hypothetically benchmark the talking computer on Star Trek. When we finally get there, I doubt that we will find a P3 or 820 chip set under the hood. Also, I would like to debunk the notion that the Internet will become a huge MIPS drain for PCs attached to it. In fact, the opposite is true. The Internet is the best excuse I have ever seen to buy a low end PC. Even as we surpass connection data rates of 10Mbits per second, Internet content will still pale in comparison to computational loads generated locally.
The life expectancy of a hot machine in the hands of an enthusiast is perhaps 1 year (before it is put to pasture, upgraded, or handed off to a non-power user). So the practical outlook horizon for defining ‘future applications and benchmarks’ should be about a year or so. Beyond that point, the enthusiast will be busy re-investing in his 2001 platform. It is pointless to suggest that an enthusiast’s current PC should be ready for applications that won’t exist for several years.
Extreme Bandwidth Benchmarking
In one year, what will the most demanding applications look like? I think the answer is simple - bigger, hotter versions of today’s most popular and most computationally intense applications. On the critical 3D gaming front, they will require three things - higher fill rates, greater geometric detail, and bigger texture databases. These three attributes have already reached ‘freight train’ status - moving forward and growing with unstoppable inertia.
Beyond the ‘big three’, other exotic computational loads, such as artificial intelligence (AI) etc., will gradually be added to games to make the characters ‘come alive’. But all this AI stuff is really complex to develop. It would be a little optimistic to expect AI to permeate the computing landscape within the next year or two. Instead, over the next year or two, it is much safer to predict that users will continue to gain increased gratification from the three big evolutionary freight trains (as previously stated):
1. Huge Fill Rates
2. Highly Detailed Textures
3. Increased Geometric Detail and Lighting
So, our goal is to see how these platforms stack up using the ordinary ‘non-scaleable’ benchmarks, but also to use scaleable 3D game benchmarks to further stress the platforms, to identify bottlenecks, then to identify areas where additional performance headroom can or cannot be used.
Bottlenecks are Bad
First some practical commentary on ‘Bottlenecks vs. Headroom’.
Bottlenecks are usually easy to spot.
A bottleneck is something that immediately constrains performance in some way.
Headroom is something completely different.
Intel has recently used the term ‘Headroom’ with regard to RDRAM.
Headroom is the antithesis of a bottleneck.
Bottlenecks are urgent.
Headroom is a luxury.
If a bottleneck exists, it can be found and benchmarked.
If headroom exists, it may not be possible to detect it until one or several other bottlenecks are moved out of the way.
Thus, if you have been sold ‘headroom’ it is quite difficult to know if you actually got it.
Perhaps ‘headroom’ can exist only if it is inaccessible due to bottlenecks.
Is headroom a feature, or an excuse?
Any marketing person will tell you… Bottlenecks - Bad. Headroom - Good. You can always trust the marketing guy.
If you suspect that you have headroom, and want to know for sure, perhaps the only way to find out is to try to remove it (turn the feature off, or whatever).
When headroom disappears, should you notice?
If you can’t tell when something is turned off, it must definitely be headroom.
Should we be aspiring for more headroom when there are lots of genuine bottlenecks still in our path?
Perhaps we already have lots of headroom that we don’t even know about, but there is some bottleneck in the way preventing us from detecting it.
This is all so confusing… Maybe we should just focus on the bottlenecks.
Our test platforms are based on Intel’s 820 (Camino), VIA’s Apollo Pro 133A chip set, and VIA’s new KX133 chip set for Athlon. Other full-featured platform options exist today or will soon be available, including chip sets from ALI, SIS, AMD, ServerWorks (formerly RCC), Micron and others. We will search for future opportunities to evaluate these platforms.
In order to avoid the possibility of selecting an un-optimized board design, we have chosen to use the chip set vendor’s reference platform in each case.
Intel VC820 (Vancouver)
VIA 694X Reference Platform
VIA KX133 Reference Platform
We expect that other board designs based on these chip sets will deliver slightly different levels of performance.
All tests were conducted with 256MB main memory configurations. Each motherboard was updated with the latest BIOS and driver set available on the respective chip set vendor’s web site. The graphics accelerator choice was the Creative Labs Annihilator Pro based on nVidia’s GeForce with 32MB of DDR SDRAM using driver version 3.68. The choice of this accelerator is vital to the analysis. GeForce DDR offers unprecedented memory bandwidth in a soon-to-be mainstream accelerator, plus T&L acceleration and AGP 4X support. GeForce and other future accelerators with this much horsepower could profoundly alter the performance balance between graphics, the processor, the chip set and DRAM.
Memory Limited Benchmarks
As a warm up, let’s start off with some memory intensive benchmarks. The first is LinPack (Linear Algebra Pack). This is a well-respected double precision floating point benchmark commonly associated with scientific computing. This Windows benchmark performs its functions on data arrays of increasing size in order to determine platform computational efficiency on small data sets and on large data sets. Large data sets will exceed the cache size and begin to consume huge amounts of external memory bandwidth.
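The point at which a data set spills out of cache is simple arithmetic. The sketch below assumes square arrays of 8-byte double precision values and an illustrative 256KB cache; both the function names and the cache size are assumptions made for the example, not figures taken from the test systems.

```python
def matrix_bytes(n: int, bytes_per_element: int = 8) -> int:
    """Size in bytes of an n x n array of double precision values."""
    return n * n * bytes_per_element


def fits_in_cache(n: int, cache_bytes: int) -> bool:
    """True if the whole n x n array fits inside a cache of the given size."""
    return matrix_bytes(n) <= cache_bytes


# Assumed cache size, for illustration only (not a measured figure).
ASSUMED_CACHE_BYTES = 256 * 1024

for n in (50, 100, 200, 500):
    size_kb = matrix_bytes(n) / 1024
    where = "cache" if fits_in_cache(n, ASSUMED_CACHE_BYTES) else "DRAM"
    print(f"{n:>4} x {n:<4} {size_kb:8.1f} KB -> served mostly from {where}")
```

Under these assumptions a 100 x 100 array still fits in cache, while a 200 x 200 array does not, so the larger runs are the ones that expose the CPU bus and DRAM.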
The results in the chart below show a clear indication of a combination of floating point computational efficiency, CPU bus bandwidth and DRAM performance.
The Athlon VIA Virtual Channel combination produces the highest scores by a considerable margin, delivering nearly 2x the bandwidth of the lowest performing systems. For Coppermine, VIA and Virtual Channel also produce the winning scores, while Rambus and PC133 produce a nearly identical set of results.
Using StreamD, the different platforms stack up very similarly. The Athlon platforms take the honors, while the 820+Rambus begins to show competitive scores with VIA+Virtual Channel.
Athlon’s consistently high DRAM throughput is related to its CPU bus characteristics. Running at 200MHz, that bus is much more capable of extracting the untapped performance (headroom) of PC133 and Virtual Channel SDRAM.
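A rough way to see why bus speed matters is to compare theoretical peak transfer rates. The sketch below assumes 64-bit (8-byte) data paths throughout, which the text does not state, and the 133MHz CPU bus is a hypothetical comparison point. Peak figures are upper bounds, not measured throughput.

```python
def peak_bandwidth_mb_s(clock_mhz: float, bus_width_bits: int) -> float:
    """Theoretical peak in MB/s: million transfers per second x bytes per transfer."""
    return clock_mhz * bus_width_bits / 8


# Assumption: every bus below is 64 bits wide.
bus_200 = peak_bandwidth_mb_s(200, 64)   # the 200MHz CPU bus quoted in the text
bus_133 = peak_bandwidth_mb_s(133, 64)   # hypothetical 133MHz CPU bus
pc133 = peak_bandwidth_mb_s(133, 64)     # PC133 SDRAM at its rated clock

print(f"200MHz CPU bus : {bus_200:.0f} MB/s")
print(f"133MHz CPU bus : {bus_133:.0f} MB/s")
print(f"PC133 SDRAM    : {pc133:.0f} MB/s")
print(f"CPU bus limits DRAM at 200MHz? {bus_200 < pc133}")
```

With these assumed widths, the 200MHz bus has more peak capacity than PC133 memory can supply, so the bus is not the stage holding the memory back.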
3D Graphics Tests
Using the publicly available Quake3 Arena Test program (not the retail demo), we scaled through a broad performance range by extending resolution and other image quality settings to both high and low extremes. It should be noted that low resolution and image quality modes are not likely to be used by anyone buying an 800MHz platform, yet low quality modes are frequently selected by reviewers to evaluate high-end systems.
At low image quality modes, moderate performance differences can be observed between the different platforms (see frame rates in the table below). Athlon suffers a performance weakness here, probably due to its 1/3 speed L2 cache. The other systems are clustered together in a +/- 1.5% spread, with VIA+Virtual Channel on top. But considering the 1997 screen resolutions, these scores are almost meaningless.
The next step was to crank up all image quality attributes to the highest levels. We used Quake3’s ‘High Quality’ setting as a base, increasing texture quality to medium and high levels, and testing various screen resolutions between 800x600x32 and 1600x1200x32. For the purposes of this analysis, we can refer to this group of settings as ‘high quality’ or ‘high bandwidth.’ The following table contains all of the benchmark scores for high res and low res modes, plus the geometric mean for each group.
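For readers who want to reproduce the summary figure, here is a minimal sketch of a geometric mean. The frame rates in it are hypothetical placeholders, not scores from our tests.

```python
import math


def geometric_mean(values):
    """n-th root of the product of n positive values, computed via logarithms."""
    if not values:
        raise ValueError("need at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean is defined for positive values only")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical frame rates for one system across three display modes.
hypothetical_fps = [80.0, 45.0, 20.0]

print(f"geometric mean : {geometric_mean(hypothetical_fps):.1f} fps")
print(f"arithmetic mean: {sum(hypothetical_fps) / len(hypothetical_fps):.1f} fps")
```

The geometric mean is less dominated by one very high score than the arithmetic mean, which is why it suits a group of results that span a wide range of frame rates.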
A quick scan of the high-resolution section at the bottom of the table indicates no performance difference between systems. At the resolutions and quality levels that gamers want to use, performance is entirely choked by the accelerator. Faster processors can’t make a difference, nor would a slower processor.
This is clearly a case of ‘Headroom’ er--uh, ‘Bottleneck’.
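The behavior can be pictured with a deliberately simplified model in which the delivered frame rate is set by the slowest stage. This is an illustration under assumed numbers, not a description of how Quake3 or the GeForce actually schedule work, and all values in it are hypothetical.

```python
def delivered_fps(cpu_limit_fps: float, accelerator_limit_fps: float) -> float:
    """Frame rate of a two-stage pipeline: the slower stage sets the pace."""
    return min(cpu_limit_fps, accelerator_limit_fps)


# Hypothetical limits, chosen only to illustrate the shape of the result.
cpu_limit = 120.0          # frames/s the processor could prepare
accelerator_limit = 40.0   # frames/s the accelerator can draw at high resolution

baseline = delivered_fps(cpu_limit, accelerator_limit)
faster_cpu = delivered_fps(cpu_limit * 1.5, accelerator_limit)
slower_cpu = delivered_fps(cpu_limit * 0.5, accelerator_limit)

print(baseline, faster_cpu, slower_cpu)  # all three equal the accelerator limit
```

In this model the processor's spare capacity is pure headroom: it cannot be observed until the accelerator limit is raised.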
Re-Balancing The Platform
To illustrate, let’s leave everything about the platform the same except the processor speed. Let’s say we slash the processor speed in half, to 400MHz (this, by the way, would represent a system level purchase price savings of perhaps $700). We re-ran the high-resolution tests and observed a mean performance decline of less than 2% (dropping from 41.5 to 40.7, as seen in the table below). Who Would Notice?
The next step in the ‘bottleneck vs headroom’ search and destroy procedure is to overclock the accelerator. We slashed the CPU by 50% and saw almost no impact on the ‘playable’ resolutions. Now let’s overclock GeForce by a small margin and see what happens. We chose 17%. The graphics core clock moved up from 120MHz to 140MHz, and the DDR bus clock was raised from 301MHz to 351MHz.
The performance impact was astounding. Mean performance shot up from 40.7 to over 47. Not only did it recover the small performance loss from the processor speed change, but now our lowly 400MHz platform outperforms ALL of the 800MHz systems by nearly 15%!!!
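The arithmetic behind these figures is easy to check. The sketch below uses only the numbers quoted in the text; because the overclocked mean is given as ‘over 47’, 47.0 is used as a lower bound, so the two gains computed from it are lower bounds as well.

```python
def percent_change(old: float, new: float) -> float:
    """Signed change from old to new, as a percentage of old."""
    return (new - old) / old * 100


core_clock = percent_change(120, 140)      # graphics core, MHz
memory_clock = percent_change(301, 351)    # DDR bus, MHz
cpu_halved = percent_change(41.5, 40.7)    # mean fps, 800MHz -> 400MHz
overclocked = percent_change(40.7, 47.0)   # mean fps at 400MHz, stock -> overclocked
versus_800 = percent_change(41.5, 47.0)    # overclocked 400MHz vs stock 800MHz

print(f"core clock        : {core_clock:+.1f}%")
print(f"memory clock      : {memory_clock:+.1f}%")
print(f"halving the CPU   : {cpu_halved:+.1f}%")
print(f"overclocking gain : at least {overclocked:+.1f}%")
print(f"vs 800MHz systems : at least {versus_800:+.1f}%")
```

A clock increase of about 17% on the accelerator buys at least 15% more frame rate, while a 50% cut in processor clock costs under 2%. That asymmetry is the signature of an accelerator-bound workload.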
I did a little more experimenting. You would be amazed to see how good these numbers are with a 233MHz processor. But we don’t have to go that far. Besides, 3D game technology advancement does not cease with Quake3. Gradually, more of these computational resources will begin to be consumed.
These game tests are clearly accelerator fill rate limited. Fill rate is not the only performance dynamic of interest. We must also look at games and other benchmarks that make more intense use of geometry.
The DMZG demo/benchmark is distributed with the Creative Labs GeForce DDR board. It is a high geometry game demo that is intended to show off the T&L capabilities of the GeForce. Many scores were produced in the testing (at different resolutions). Below is a histogram of frame rates for each second during the run of the test. At the moderate resolution of 800x600x32, this benchmark is already showing signs of being accelerator limited.
Here the delta between 400 and 800MHz is a mere 10%. As resolutions grow to 1024x768x32 the difference shrinks to under 2%. At both resolutions, this is more evidence that processor speeds are not always the determining performance factor.
Geometry Limited Benchmarks – ViewPerf
While Quake3 and DMZG do contain reasonable geometric detail, it pales in comparison to workstation applications today. ViewPerf is a great example of a geometry, CPU and memory limited benchmark that is well grounded in reality. It makes intense use of lighting under OpenGL with high precision object models that place a heavy strain on the CPU and DRAM.
As a quick sanity check, we sample tested ViewPerf at 800x600x16 and at 1600x1200x32. Between these two resolutions, graphics controller bandwidth and memory capacity demand increase by somewhere between 2x and 8x. When Quake3 is raised from 800x600x16 to 1600x1200x32, frame rates take a hit of more than 75% (crashing from 93fps to 21fps). In ViewPerf, these same conditions bring about a frame rate drop of only 6%. Make no mistake: ViewPerf is not fill rate limited.
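The upper end of that range comes straight from the pixel arithmetic. The sketch below counts only the bytes of a single color buffer, which is a simplifying assumption; depth buffers and texture traffic are ignored.

```python
def frame_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Bytes needed for one color buffer at the given mode."""
    return width * height * bits_per_pixel // 8


low = frame_bytes(800, 600, 16)
high = frame_bytes(1600, 1200, 32)

pixel_ratio = (1600 * 1200) / (800 * 600)
byte_ratio = high / low

# Frame rates quoted in the text for Quake3 at the two modes.
quake3_drop = (93 - 21) / 93 * 100

print(f"pixels per frame : {pixel_ratio:.0f}x")
print(f"bytes per frame  : {byte_ratio:.0f}x")
print(f"Quake3 frame rate drop: {quake3_drop:.1f}%")
```

Four times the pixels at twice the color depth gives eight times the bytes per frame. Quake3 loses roughly three quarters of its frame rate under that load, while ViewPerf loses only 6%.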
In this test, the Athlon processor begins again to show its legs. Athlon’s fast floating point unit places it squarely in the lead. Virtual Channel also makes a strong showing.
The ‘currency’ of workstation performance analysis is not MIPS or FPS or MHz. We will call it ESG (Equivalent Speed Grades). Workstation buyers already assume that they will buy the fastest processor speed grade available – and they are happy to pay the price. They would pay more, if they could get a higher speed grade. But performance improvements are not only available in the form of faster processors.
To scope this out, let’s set a reference point based on the performance impact of the last processor speed upgrade from 733 to 800MHz (a 9% increase in clock speed). At 1600x1200x32 the 733-800MHz CPU speed upgrade yields a 2.4% increase in the overall ViewPerf score. This figure indicates that ViewPerf is not purely CPU performance limited – and it also provides us with a measuring stick.
Compared to our PC133 baseline system, either Virtual Channel SDRAM or Rambus delivers a performance boost equivalent to more than two CPU speed grades. Moving to Athlon with PC133 yields another CPU speed grade equivalent, and adding Virtual Channel to Athlon adds yet another. As an example, in order to make up for the two-speed-grade delta between the best Athlon configuration and the 820, it would be necessary to populate the 820 with a 933MHz processor.
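The ESG conversion can be sketched in a few lines. It assumes that score gains scale linearly with speed grades and that grades are spaced about 66.67MHz apart; both are simplifying assumptions. The 4.8% gain used below is a hypothetical input, not a measured result.

```python
# ViewPerf gain from the 733 -> 800MHz upgrade at 1600x1200x32, from the text.
GAIN_PER_SPEED_GRADE = 2.4   # percent

# Assumed spacing between processor speed grades (733, 800, 866, 933...).
ASSUMED_GRADE_STEP_MHZ = 66.67


def equivalent_speed_grades(gain_percent: float) -> float:
    """Express a ViewPerf gain as a number of CPU speed grades (linear assumption)."""
    return gain_percent / GAIN_PER_SPEED_GRADE


def equivalent_clock_mhz(base_mhz: float, grades: float) -> float:
    """Clock speed that would be needed to make up the given number of grades."""
    return base_mhz + grades * ASSUMED_GRADE_STEP_MHZ


hypothetical_gain = 4.8  # percent, illustrative only
grades = equivalent_speed_grades(hypothetical_gain)

print(f"{hypothetical_gain}% gain = {grades:.1f} speed grades")
print(f"equivalent clock from 800MHz: {equivalent_clock_mhz(800, grades):.0f}MHz")
```

Two speed grades above 800MHz lands at 933MHz, which matches the example in the text.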
Athlon is the clear winner when it comes to geometry performance.
This effort has produced fairly clear results in most cases. When it comes to end-to-end bandwidth between processor and memory, Virtual Channel SDRAM wins among memory types and Athlon wins among processor types. When it comes to high-end geometry limited workstation performance, Athlon also proves superior. In the chip set category, VIA must be congratulated for bringing two excellent chip sets to market at the right time, and for delivering winning performance in the process.
Finally, there appears to be no winner when it comes to popular game performance, particularly when these platforms and game titles are evaluated as they are used today and as they will be used in the future. There may not be a shortage of processing power even for the value PC to match the performance of the fastest PCs on earth (as long as that value PC has the right accelerator).
We should resist the temptation to benchmark around this issue by returning to 1997 display modes. Certainly PC Gaming is one of the applications that is touted to be able to absorb almost infinite amounts of host processing power. The truth of such an assumption has yet to be comprehensively tested.
By: Bert McComas, Inquest Inc.