Tuesday, April 18, 2000
An Exhaustive Performance Analysis of
Intel’s 840+RDRAM vs. Micron’s DDR Reference Platform.
Authored by Bert McComas
Since its introduction in September 1999, Direct Rambus has been cursed with high prices, low availability, technical troubles and questionable performance. As Intel’s primary RDRAM platform, the 820 chip set has similarly fallen under widespread criticism in the media and on the web. In the channel, system and board makers are reporting an inventory buildup of these hard-to-sell 820 platforms. With the 820 chip set on the ropes, PC133 enters the mainstream almost entirely uncontested. Even Intel’s contrived resistance to PC133 is scheduled to disappear shortly with the introduction of Solano (815).
In the wake of disappointment over the 820, Intel’s 840 dual channel Rambus platform assumes the role of Intel’s reference platform for RDRAM performance. As a result, the next battle over PC main memory standards and performance will have to be fought between the 840 (dual Rambus) and DDR. DDR for graphics is already in volume production at 300MHz, and volume production of PCs with DDR main memory will begin to show up in Q3’00. As such, these DDR design wins are occurring today and validation platforms are already available to many industry insiders.
Micron is leading the pack with its Samurai DDR chip set. VIA, AMD, ALI, nVidia, Transmeta, Serverworks, Intel and others are right behind with DDR chip sets for mainstream desktops, workstations, notebooks, info appliances, game PCs and servers. As yet, Rambus is supported by only two chip sets from one vendor, both aimed at a very narrow slice of the market.
In November and December of 1999, various independent sources reported very favorable hands-on benchmark results for DDR using first silicon of Micron’s Samurai DDR chip. After further optimization, Micron is validating production worthy silicon. We had the good fortune of being able to spend a week with this platform in our lab.
The board supports 64-bit and 32-bit PCI slots, dual processor slots and 4 DIMM slots using buffered or unbuffered DDR SDRAM. When used with unregistered memory at PC2100 (266MHz), only two or three DIMM slots may be occupied. The board also included a new south bridge chip capable of ATA-100 support, though we only used ATA-66 mode in order to avoid any unfair advantage. I configured the system with 256MB of unbuffered DDR SDRAM, an IBM 10G drive and nVidia GeForce accelerators. Intel’s OR840 motherboard was identically configured, using 256MB of PC800 RDRAM.
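For orientation, the theoretical peak bandwidth of the two memory configurations can be worked out from bus width and transfer rate. The sketch below is only back-of-the-envelope arithmetic. It assumes a single 64-bit DDR channel at 266 million transfers per second on the Micron board, and two 16-bit PC800 RDRAM channels at 800 million transfers per second on the 840. The function and variable names are illustrative and do not come from either vendor.

```python
def peak_mb_per_s(bus_bits: int, mega_transfers_per_s: int, channels: int = 1) -> int:
    """Theoretical peak bandwidth in MB/s (1 MB = 10**6 bytes)."""
    bytes_per_transfer = bus_bits // 8
    return bytes_per_transfer * mega_transfers_per_s * channels


# Assumption: one 64-bit DDR channel at 266 MT/s (PC2100).
ddr_pc2100 = peak_mb_per_s(bus_bits=64, mega_transfers_per_s=266)

# Assumption: two 16-bit Direct RDRAM channels at 800 MT/s (PC800).
rdram_840 = peak_mb_per_s(bus_bits=16, mega_transfers_per_s=800, channels=2)

print(f"DDR PC2100, one 64-bit channel:       {ddr_pc2100} MB/s")
print(f"840, two 16-bit PC800 RDRAM channels: {rdram_840} MB/s")
```

On paper the dual channel 840 has the higher peak figure, so any measured lead for DDR would have to come from something other than raw peak bandwidth, such as access latency.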
Both platforms were loaded with the latest ATA66 HD drivers and hard disks were defragmented for NT and 98. We used nVidia’s version 3.65 graphics driver with all driver performance attributes set to default (except Vsync). On the Micron platform, we installed their latest GART driver that enables AGP FastWrites, and also a specific nVidia adaptation of a DLL that allows AGP to operate under NT4 with Micron’s chip set.
All tests were done using P3 at 733MHz. By Q4 2000 standards, this is a midrange processor. As a general rule, faster processors tend to magnify the performance impact of DRAM. It is reasonable to assume that at speeds over a gigahertz, the winners and losers in these benchmarks should remain constant, but the magnitude of performance delta could increase.
As one of the oldest and most trusted benchmarks in existence, the Linpack MFLOPs benchmark evaluates memory limited double precision floating point performance. By varying the size of the data matrix, the performance impact of the L1, L2 and DRAM can be observed. Our charts exclude results that are dependent entirely on L1 and L2 performance, focusing instead on DRAM limited performance with dataset sizes ranging from 512KBytes to 1.5MBytes.
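To give a feel for what those dataset sizes mean, the sketch below converts a dataset size into the order of a square matrix of 8-byte double precision values. It assumes the dataset is a single n x n matrix and ignores the right-hand-side vector and any workspace; the helper name is hypothetical.

```python
import math

BYTES_PER_DOUBLE = 8
KB = 1024


def matrix_order_for(dataset_bytes: int) -> int:
    """Largest n such that an n x n matrix of doubles fits in dataset_bytes."""
    return math.isqrt(dataset_bytes // BYTES_PER_DOUBLE)


for size_kb in (512, 1024, 1536):
    n = matrix_order_for(size_kb * KB)
    print(f"{size_kb} KB dataset -> roughly a {n} x {n} matrix of doubles")
```

Under that assumption, the smallest charted dataset (512KBytes) is already twice the size of Coppermine's 256KB cache, which is why these results are DRAM limited.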
Linpack was run under Win98 and under NT4. In each case, the results were taken after a clean boot. DDR delivers an impressive performance advantage over the 840 - 16.4% in Win98 and 5.3% in WinNT4.
We ran Stream under DOS, Win98 and WinNT4. As with Linpack, results were recorded only after a clean boot to the OS. Under DOS, DDR delivers a DRAM performance advantage of nearly 20% on average.
Under Win98se, the performance delta increases to nearly 30% favoring DDR.
Under NT4 the performance delta shrinks to less than 4%, this time favoring the 840 chip set.
The NT4 results introduce a performance aberration (compared to the other versions of Stream) that I am not fully able to explain. After a bit more testing we confirmed that when the Micron platform was configured with four 128M registered DIMMs, its Wstream-NT performance increased, essentially eliminating the performance delta between the two systems. This indicates that Wstream under NT benefits from wide interleaving (4 way), and may not be as sensitive to latency as other applications and benchmarks. It is an interesting case. We should be careful to observe how other tests turn out when comparing Win98 vs. WinNT.
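For readers unfamiliar with Stream, it reports sustainable memory bandwidth for a few simple vector kernels (Copy, Scale, Add and Triad). The sketch below shows how a Triad-style bandwidth figure is derived: bytes moved divided by elapsed time. It is an illustration only and not the code of the DOS or Windows builds used here; the array size and function names are assumptions, and an interpreted loop like this measures interpreter overhead far more than DRAM.

```python
import time
from array import array


def triad(a, b, c, scalar):
    """STREAM-style Triad kernel: a[i] = b[i] + scalar * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]


def triad_mb_per_s(n: int, scalar: float = 3.0) -> float:
    """Time one Triad pass over n doubles and convert it to MB/s."""
    a = array("d", [0.0]) * n
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    start = time.perf_counter()
    triad(a, b, c, scalar)
    elapsed = time.perf_counter() - start
    # Two arrays are read and one is written: 3 * 8 bytes per element.
    bytes_moved = 3 * a.itemsize * n
    return bytes_moved / elapsed / 1e6


if __name__ == "__main__":
    print(f"Triad: {triad_mb_per_s(200_000):.1f} MB/s (interpreter bound)")
```

Because each element is touched once and never reused, a kernel of this kind streams through memory, which would be consistent with the observation above that interleaving matters more than latency for this test.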
WinTune Memory Bandwidth Test
Version 4 of WinTune was used to evaluate many aspects of DRAM performance under Win98 and NT. In this case there was no real difference in the results between the two operating systems, though under NT the results fluctuated less run to run compared to Win98.
The results show a remarkable advantage for DDR, particularly for Write and Copy activity. Reads showed essentially no difference, though the 840 did actually exceed DDR in 2Mbyte reads by 0.5%.
The actual numbers generated by the benchmark are listed for reference in the table below. As seen in the table, the average performance difference for Writes is 45.5% favoring DDR. For Copy transactions, DDR outperforms the 840 by 16.5%. On average for Reads, Writes and Copies, DDR outperforms the 840 by 22.6%.
The overall memory score quoted in the table above is produced by the benchmark itself. The difference in this score is 14.3%, lower than the 22.6% average generated for this table. WinTune generates its overall bandwidth score as an average of all other bandwidth scores measured in MB/s (including a processor centric 4Kbyte number that was excluded from this analysis).
WinTune’s method of averaging weighs heavily toward cache performance rather than DRAM performance. We have used a more evenhanded method to derive ratios, allowing DRAM performance differences to be more clearly observed. But no matter how you do the math, DDR delivers the winning performance.
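The effect of the two averaging methods can be illustrated with a small sketch. The MB/s figures below are HYPOTHETICAL, invented purely to show the arithmetic, and are not the measured WinTune results. The second function is one plausible reading of "deriving ratios" and may not match the exact method used for the table.

```python
from statistics import mean

# HYPOTHETICAL MB/s figures, invented only to illustrate the arithmetic.
# The small blocks fit in cache (identical on both platforms); the large
# blocks are served from DRAM.
ddr = {"4KB": 3000, "64KB": 1500, "2MB": 400, "8MB": 390}
i840 = {"4KB": 3000, "64KB": 1500, "2MB": 320, "8MB": 300}


def ratio_of_means(a, b):
    """Average the raw MB/s scores first, then compare the two averages."""
    return mean(a.values()) / mean(b.values())


def mean_of_ratios(a, b):
    """Compare each block size first, then average the per-size ratios."""
    return mean(a[size] / b[size] for size in a)


print(f"ratio of means: {100 * (ratio_of_means(ddr, i840) - 1):.1f}% advantage")
print(f"mean of ratios: {100 * (mean_of_ratios(ddr, i840) - 1):.1f}% advantage")
```

Because the cached block sizes produce the largest MB/s numbers, they dominate a plain average of raw scores and dilute the DRAM difference; averaging per-size ratios gives every block size equal weight.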
The most comprehensive and reliable business application benchmark in the industry is Sysmark 2000. It loads and runs a dozen leading applications for basic business productivity and for advanced content creation.
Compared to synthetic benchmarks, it is very challenging and significant for DRAM to introduce a performance delta of even two or three percent in one of these application benchmarks.
In three of the applications there is no appreciable difference in performance, namely Corel Draw, Excel 2000 and Elastic Reality (an image morphing application). But in more than half of the applications (seven), DDR exceeds dual channel RDRAM performance by a significant margin. Specifically, these applications are:
Bryce – a 3D scene rendering application
Naturally Speaking – Real time continuous speech recognition application
Netscape Communicator – Web page authoring package
Paradox – Database processing environment
Photoshop – Image processing software
PowerPoint 2000 – Presentation software
Word 2000 – Word Processing
DDR beats the 840 by an average of 2.4% in these applications, and by 1.4% overall.
The 840 outperforms DDR in only two applications - Premiere and Microsoft Media Encoder. Interestingly, these two applications are very closely related. Both perform batch oriented video file compression. These two applications fall into the category of professional or semi-professional content creation applications, along with Bryce and Elastic Reality. They are generally not intended for the casual user, the home user or the business PC. These software tools are in use by a relatively small user base of professional computer graphics artists, as compared to the vast number of users that rely on the other business and personal productivity applications tested in SysMark 2000.
The table below contains the precise best case run time results for each of the applications and configurations in SysMark 2000. The numbers shown are the fastest of three or four iterations of each program script.
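As a sketch of how a best case run time turns into a percentage delta, the snippet below picks the fastest iteration for each platform and compares the two. The run times are HYPOTHETICAL placeholders, not values from the table, and the formula is an assumption about how such deltas are typically computed rather than a statement of the exact method used here.

```python
def best_time(run_times_s):
    """Fastest (smallest) run time from repeated iterations of one script."""
    return min(run_times_s)


def percent_faster(time_a_s, time_b_s):
    """How much faster A is than B, in percent (negative if A is slower)."""
    return (time_b_s / time_a_s - 1.0) * 100.0


# HYPOTHETICAL run times in seconds for one application script.
ddr_runs = [101.2, 100.0, 100.6]
i840_runs = [103.6, 103.0, 103.3]

delta = percent_faster(best_time(ddr_runs), best_time(i840_runs))
print(f"DDR best {best_time(ddr_runs)}s vs 840 best {best_time(i840_runs)}s")
print(f"delta: {delta:.1f}%")
```

Taking the fastest of several iterations filters out runs disturbed by background activity, at the cost of reporting a best case rather than a typical case.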
As a key element of Intel’s ICOMP index, CPUmark has proven to be a reputable test to evaluate the processor’s integer performance and cached memory performance, independent of graphics or hard disk.
The 1.2% performance delta is quite respectable considering the small amount of DRAM activity produced by this benchmark. Unlike previous high end processors, with Coppermine’s reduced 256KB cache size, DRAM performance differences can be identified using this benchmark.
3D Game Performance – Expendable
Using the popular game demo Expendable, we tested at two different screen resolutions. It turns out that at either resolution, this game is still primarily CPU limited in its performance. As resolution increases from 640x480x16 to 1024x768x32, the overall accelerator fill rate demand increases by more than 5X. The GeForce DDR fill rate capacity is so high that there is only a very small frame rate delta between these two resolutions.
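The "more than 5X" figure follows from simple arithmetic on frame size and color depth. The sketch below assumes fill rate demand scales with pixels per frame times bytes per pixel at a constant frame rate, and it ignores Z-buffer traffic and overdraw; the function name is illustrative.

```python
def frame_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Bytes of color data written for one full frame."""
    return width * height * bits_per_pixel // 8


low = frame_bytes(640, 480, 16)     # 640x480x16
high = frame_bytes(1024, 768, 32)   # 1024x768x32

pixel_ratio = (1024 * 768) / (640 * 480)
byte_ratio = high / low

print(f"pixels per frame: {pixel_ratio:.2f}x")
print(f"bytes per frame:  {byte_ratio:.2f}x")
```

The pixel count alone grows by 2.56X, and doubling the color depth from 16 to 32 bits doubles that to 5.12X.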
When it comes to the CPU and DRAM limited performance evaluation, there is a consistent 2 – 2.3% performance delta between the 840 and DDR, with DDR in the lead in both cases. This is consistent with the performance delta seen in many other types of applications.
3D Game Performance - Quake3 Arena
Quake is clearly the most enduring game and perhaps the most credible game benchmark in the industry. The platforms were configured under Win98 with identical drivers using the nVidia Quadro accelerator with all performance features enabled - including AGP FastWrites. With all chip set register and Windows registry settings left in their default modes, the DDR platform was physically verified by Micron to properly run AGP FastWrite cycles. As such, Micron’s DDR platform is the only non-Intel platform that I know of to report FastWrite compatibility.
We ran both demo scripts contained in the retail demo of Quake 3 Arena. Demo 1 is used most frequently by the hardware sites. Demo 2 presents a slightly more complex load to the processor and DRAM. In both cases, DDR’s advantage is substantial.
As one would expect, DDR’s advantage is at its highest at lower resolutions (which are less fill rate limited) and also under Demo2 rather than Demo1 (because of the difference in CPU load described above). Overall, Micron’s chip set delivers a 6.6% – 8.6% advantage over the 840 in this test. As resolution increases, one would expect this delta to shrink as performance becomes almost entirely accelerator limited.
3D WinBench 2000
With the platforms identically configured as above (Quake3), but screen resolutions at 1024x768x16, ZD’s 3D WinBench 2000 also demonstrates a significant performance delta - again favoring DDR. In the floating point centric processor tests the performance delta is small, but in the accelerated game script tests DDR pulls ahead by over 4%.
MCAD Workstation Performance - Viewperf
Under NT4, we tested Viewperf using the Diamond FireGL1 accelerator at 1024x768x32. We took the best score of three runs, but saw very little variation from run to run. In this case, the 840 came out ahead by an average of 0.8%.
As seen in the table below, in the DX-05 test, the 840’s lead peaks at 2.7%, but in all other tests the performance differences are 1% or less.
ZD Serverbench Performance
ZD Serverbench measures sustained server throughput based on a varying number of simultaneously active client PCs. Using 100mbit Ethernet, up to 20 client sessions placed long-term continuous demand on the server. Both platforms demonstrated nearly equivalent performance, though some very small differential was observed.
At 12 clients and below, server throughput is processor limited. Using single processors the number of transactions per second is reduced by almost exactly half. TPS throughput declines significantly above 16 clients, as the benchmark becomes entirely hard disk bound. Performance above 16 clients is proven to change based almost entirely on the characteristics of the hard disk subsystem. In this test we used a two drive RAID system attached via the PCI bus.
DDR wins over the 840 by a small margin when the number of clients is low, sustaining an equal or better position up to 12 clients. At 16 and 20 clients, the 840 displays an advantage. If these results prove reproducible, this advantage might be attributed to the 840’s PCI implementation rather than to its DRAM performance characteristics.
Frankly, these two platforms are not meant to be servers. Still, small companies are likely to employ them as workgroup servers supporting perhaps 10-25 users each, with only a few users contending for server resources at any given time. As such, the 1-8 client Serverbench tests are an accurate representation of the practical performance demands for these platforms. Within the target performance spectrum, there appears to be no appreciabl