Document: discussions.txt
Author:   Shri Amit (amit@scl.ameslab.gov)
Date:     September 17 1997

Synopsis: The following document is a compilation of various e-discussions, mostly between John and various other people. It was salvaged from John's HINT mail folder, which contained layers and layers of sedimentary mail. These discussions have been compiled primarily for the benefit of future members of the HINT group or any person interested in HINT's evolution. They provide quite a bit of insight into the basic philosophy behind HINT. The email addresses of the relevant people have been removed. They can be obtained from John if needed. I will try my best to keep this updated with interesting discussions about HINT.

-----------------------------------------------------------------
Date: Tue, 25 Jul 95 09:47:05 -0500
From: Donald Petravick
To: gus@scl.ameslab.gov, snell@scl.ameslab.gov
Subject: The HINT benchmark

all,

Thanks for the brief conversation we had over the phone a few weeks ago. I finally have the time to compose the note I had promised.

My assignment was to develop a concept for a bid to purchase computing for the Sloan Digital Sky Survey at Fermilab. A large portion of this computing is to reduce data tapes of images we acquire at over 5 MByte/second. During one phase of data reduction, we consider fine images of 6 MBytes each.

I am new to the world of benchmarks, so I feared that many of the standard benchmarks do not exercise address spaces of this order of magnitude. In addition, I found your suite simple and easy to understand. It is simply more credible to work in a system that exposes hardware features, like the size of the secondary cache, than one which is so popular that systems engineers may very well be tuning for it.

Since I do this work for scientists, your tool presents its data in a way that they feel more comfortable with: it

  - Draws graphs
  - Shows real features of the total system

I did most of the work last fall.
We had a vendor named TYNE in, who (to my recollection) promised a 133MHz MIPS R4xxx machine with a 1/2 (or was it 1) MByte cache at a good price. It was running Windows NT. (Our system is UNIX based, but we were willing to port for the "right" price.) Quite surprisingly, HINT failed to show the effect of the secondary cache. After we sweated the salesman, he was forced to call engineering, who explained that they had built a secondary cache system around the "no cache" version of the R4xxx chip; that they had used slow S-RAMs to boot; that the machine was thought of as a file server; and that the purpose of the cache was to shield the processor from the effects of disk traffic on the machine's main bus.

------------------------------------

I am lobbying to get HINT included in some sort of Fermi benchmark suite. I've made a shell archive of the results, and a few simple plotting tools, and will send that immediately as well.

--Don

----------------------------------------------------------------------------
Date: Wed, 2 Nov 94 10:21:27 GMT
To: hint@scl.ameslab.gov
From: Dave Snelling
Subject: Notes on HINT and QUIPS

John, Quinn, and PARKBENCH,

I am making these comments (constructive I hope) openly in order to disseminate the ideas as quickly as possible. I believe that, whether right or wrong, QUIPS and HINT are likely to catch on. The problems below should therefore be rectified quickly.

Problem 1:

The document describing HINT and the source code claim that the list of intervals is sorted. Even with the well behaved function integrated in the HINT benchmark, the relative difference between the errors of consecutive intervals ranges from plus to *minus* ten percent by the 500 interval mark. I assume that this trend will continue.

Possible Solution:

Correct the documentation to indicate that the list is only partially sorted.

Problem 2:

The Net QUIPS computation is heavily weighted toward the speed at which the smallest case can be computed.
For example, by treating this case specially and returning the answer almost immediately (a very simple optimization not prohibited by the rules), the Net MQUIPS of a Sun SPARC station improved from 0.741 to 1.615. Possible Solution: Start the Net QUIPS at something more reasonable, say 32 intervals rather than two as it stands now. Problem 3: By tuning the parallel version on the KSR to select the number of processors actually used (including the special case from problem 2 above), it was possible to obtain similar variations in the apparent performance of the KSR. This tuning operation basically eliminates the low QUIPS ratings associated with start-up on small numbers of intervals in parallel configurations. Further improvements are also possible by varying the ADVANCE parameter. Possible Solution: Most of these problems can be eliminated by stating exactly how many intervals are to be used at each stage of the Net QUIPS computation. Problem 4: There is no direct verification of the answers. The value of the answer is of course included in the value of the QUIPS rating, but for small numbers of intervals, where the error is large anyway, random numbers returned very quickly could artificially improve results. Possible Solution: Check the actual results. Unlike many benchmark programs HINT's results can be checked to the last bit; why not do it? Problem 5: As it stands, memory use in parallel tests increases simultaneously on all processors. The result on the KSR is that, when testing configurations smaller than the whole system, one level of the memory hierarchy is ignored. On VSM systems like the KSR, the memory of other processors can be used as "paging" space instead of disk. A more gradual increase in memory use, for the parallel cases, would allow the benchmark to detect this level of the hierarchy. Possible Solution: In the parallel version, increase memory use in the same way as in the sequential case. 
Conclusions: I hope that these comments promote further discussion. I believe the quest for a unifying benchmark is futile, but I see no reason for not trying. Good luck. Take care: Dr. David F. Snelling ---------------------------------------------------------------------------- From: John Gustafson Subject: Re: Notes on HINT and QUIPS To: Dave Snelling Date: Mon, 28 Nov 1994 11:07:39 -0600 (CST) Cc: hint Dear Dr. Snelling: Now that the Supercomputing '94 conference is over, and we've had time to think about your comments, let me reply to them in full detail. But before I do, may I ask a favor? I'd really like to have a copy of your article, "How Long is a Megaflop?" and a reference such that I can cite it. It's the article I wanted to cite in the HINT paper, but couldn't find a full citation so I used the one that appears in that hardback collection of articles on benchmarking. My mail address is John Gustafson 236 Wilhelm, ISU Ames, IA 50011-3020 USA Thanks. Now, to your comments: > Problem 1: > > The document describing HINT and the source code claim that the list of > intervals is sorted. Even with the well behaved function integrated in the > HINT benchmark, the relative difference between the errors of consecutive > intervals ranges from plus to *minus* ten percent by the 500 interval mark. > I assume that this trend will continue. You are correct and perceptive. To make sure each iteration is O(1), we avoid a full sort. A sort of the last k items for k>2 gives diminishing returns for most functions, and experiments showed k=2 or k=3 was probably the best compromise to use in the distributed code. However, there is nothing in the rules that require a perfect sort... it's simply advisable to get the job done more efficiently. The parallel and vector machines sacrifice sorting accuracy by doing splittings in batches without proper selection by priority, thereby trading hardware speed for algorithmic optimality. 
The only problem is that we've specified the function, and it's hard to tell when the sorting method has made use of behavior specific to that function. One thing we contemplated was to also require integration of the function that is the 180-degree rotated version of (1-x)/(1+x), which is concave down and produces the opposite ordering of errors unless sorted. For simplicity, we decided to simply use inspection to see if anyone was avoiding the sort by using knowledge of the function. > Problem 2: > > The Net QUIPS computation is heavily weighted toward the speed at which the > smallest case can be computed. For example, by treating this case specially > and returning the answer almost immediately (a very simple optimization not > prohibited by the rules), the Net MQUIPS of a Sun SPARC station improved > from 0.741 to 1.615. > > Possible Solution: > > Start the Net QUIPS at something more reasonable, say 32 intervals rather > than two as it stands now. First of all, we labored to remove the "magic numbers" from the benchmark, such as a particular number of seconds, a maximum size at which to stop, a minimum size at which to start, skip distances, etc. I'd rather not pick a larger number, especially since the case of n=2 can be used to measure the subroutine overhead on serial machines. See next point. Second, we made a goof in dividing the quality Q by the elapsed time. It should have been Q-1, since there is a "head start" of Quality=1 before anything is computed, and we are interested in quality _improvement_. Hence, our first QUIPS number was too large by 2/1, the second by 3/2, the third by 4/3, etc. That's why the curves were starting so high and plummeting to a flat performance about half the case of n=2. We have fixed that in the 1.1 release now on the Internet, and also have a postprocessor for data files computed the old way. There's no need to re-run the benchmark. 
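As an illustration of the Q-1 correction just described, the arithmetic can be checked in a few lines. This is only a sketch and not the postprocessor shipped with release 1.1: the function names are hypothetical, and the sample qualities 2, 3, 4 are chosen simply because they reproduce the 2/1, 3/2, 4/3 factors quoted above. Because elapsed time cancels out of the ratio, an old QUIPS value can be rescaled from the recorded quality alone, which is consistent with the remark that the benchmark need not be re-run.

```python
from fractions import Fraction


def quips_old(quality, seconds):
    """QUIPS as first computed: quality Q divided by elapsed time."""
    return Fraction(quality) / Fraction(seconds)


def quips_corrected(quality, seconds):
    """QUIPS after the fix: quality improvement (Q - 1) over elapsed time."""
    return (Fraction(quality) - 1) / Fraction(seconds)


def overstatement(quality):
    """Factor by which the old figure exceeds the corrected one.

    Elapsed time cancels, so the factor depends on Q only: Q / (Q - 1).
    """
    return Fraction(quality) / (Fraction(quality) - 1)


def rescale_old_quips(old_quips, quality):
    """Convert an old-style QUIPS value to the corrected definition
    without re-timing anything (hypothetical helper for illustration)."""
    return Fraction(old_quips) / overstatement(quality)


if __name__ == "__main__":
    for q in (2, 3, 4, 100):
        print(f"Q = {q:3d}: old value too large by a factor of {overstatement(q)}")
```

For large Q the factor approaches 1, so the correction matters mostly for the first few points of the curve.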
What we see now is the case n=2 slightly lower QUIPS than for larger n within the first memory regime, clearly reflecting the cost of entering the loop after the timer is called. I suspect that will diminish the contribution from low n that you have observed. Our experience is that it only lowered Net QUIPS by about 6% on our SGI workstations. Finally, your "by treating that case specially" has me concerned. It sounds like it DOES violate the rules. How can you treat that case specially without preknowledge of the function? You cannot, for example, hard-code the midpoint between 0 and 1 as 0.5 even though it seems trivial. Every variable has to be computed at run time, not at compile time. So please explain what you mean by "treating that case specially." > Problem 3: > > By tuning the parallel version on the KSR to select the number of > processors actually used (including the special case from problem 2 above), > it was possible to obtain similar variations in the apparent performance of > the KSR. This tuning operation basically eliminates the low QUIPS ratings > associated with start-up on small numbers of intervals in parallel > configurations. Further improvements are also possible by varying the > ADVANCE parameter. > > Possible Solution: > > Most of these problems can be eliminated by stating exactly how many > intervals are to be used at each stage of the Net QUIPS computation. We considered doing this, because we also hated to see discarded performance on parallel computers for small n. It looked bad on the graphs, but had almost no effect on Net QUIPS. Fixing it complicated the parallel version of the code, especially for message-passing machines. Remember, the HINT program is not allowed to know what value of n one will stop at, but must be prepared to improve the answer until memory or precision are exhausted. 
So as n increases, at some point the scattered decomposition of responsibility for subintervals in [0,1] must be re-allocated to processors, producing a cliff-like drop in QUIPS. Maybe that cliff isn't too bad on the KSR, but it looks pretty scary on message-passers with latencies of 100 microseconds. As we say in the README file, you are free to tune a certain number of driver parameters like ADVANCE to obtain better Net QUIPS and smoother curves, limited only by your patience. > Problem 4: > > There is no direct verification of the answers. The value of the answer is > of course included in the value of the QUIPS rating, but for small numbers > of intervals, where the error is large anyway, random numbers returned very > quickly could artificially improve results. > > Possible Solution: > > Check the actual results. Unlike many benchmark programs HINT's results can > be checked to the last bit; why not do it? Thanks. We now do this using a literal value in the driver to a large number of decimals, checking that the low and high sums bracket the value. Returning random numbers is prohibited by the rules. The computation must be mathematically rigorous or it is not allowed. > Problem 5: > > As it stands, memory use in parallel tests increases simultaneously on all > processors. The result on the KSR is that, when testing configurations > smaller than the whole system, one level of the memory hierarchy is > ignored. On VSM systems like the KSR, the memory of other processors can > be used as "paging" space instead of disk. A more gradual increase in > memory use, for the parallel cases, would allow the benchmark to detect > this level of the hierarchy. > > Possible Solution: > > In the parallel version, increase memory use in the same way as in the > sequential case. You have more expertise than we do on the KSR. Quinn Snell, my co-author, tells me he disagrees on this point and doesn't understand what we're doing that ignores a memory regime. Please elaborate. 
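The bracketing check mentioned in the reply to Problem 4 can be sketched as follows. This is a simplified illustration, not the HINT driver: it uses equal-width subintervals and exact rational arithmetic rather than HINT's hierarchical splitting, and the names f, bounds, brackets and quality are hypothetical. It assumes the integrand is (1-x)/(1+x) on [0,1], whose integral is 2 ln 2 - 1, and it takes quality to be the reciprocal of the gap between the upper and lower bounds.

```python
import math
from fractions import Fraction

# Reference value of the integral of (1-x)/(1+x) over [0, 1]: 2*ln(2) - 1.
EXACT = 2.0 * math.log(2.0) - 1.0


def f(x):
    """Integrand; decreasing on [0, 1], with f(0) = 1 and f(1) = 0."""
    return (1 - x) / (1 + x)


def bounds(n):
    """Rigorous lower and upper sums over n equal subintervals.

    Because f is decreasing, right endpoints give a lower bound and
    left endpoints give an upper bound.
    """
    h = Fraction(1, n)
    lo = sum(f(h * (i + 1)) for i in range(n)) * h
    hi = sum(f(h * i) for i in range(n)) * h
    return lo, hi


def brackets(lo, hi, reference=EXACT):
    """True if the claimed bounds enclose the reference value."""
    return lo <= reference <= hi


def quality(lo, hi):
    """Reciprocal of the error, where error = upper bound - lower bound."""
    return 1 / (hi - lo)


if __name__ == "__main__":
    lo, hi = bounds(64)
    print(float(lo), EXACT, float(hi), brackets(lo, hi), quality(lo, hi))
```

A result that is returned quickly but is not mathematically rigorous, such as a pair of arbitrary numbers, fails the brackets test unless it happens to enclose the reference value.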
> Conclusions:
>
> I hope that these comments promote further discussion. I believe the quest
> for a unifying benchmark is futile, but I see no reason for not trying.
> Good luck.

I used to agree. Cynicism about benchmarking is rampant, but positive efforts to put it on solid ground are so rare. Now I have hope that benchmarks can be made independent of many architecture-specific features, and can predict actual performance on real applications with something like the accuracy we had back in the 1950s and 1960s with op counts and arithmetic rates on serial computers. Time will tell whether there are gaping holes in our reasoning, but so far things look good for HINT.

Thanks again for all your comments, and we hope to hear from you again soon.

-John Gustafson

----------------------------------------------------------------------------
Date: Tue, 22 Nov 1994 11:37:59 -0500 (EST)
From: Raghurama REDDY
To: hint@scl.ameslab.gov

Hi,

I have looked at the paper "HINT: A New Way To Measure Computer Performance," and it looks like a very useful tool to get a lot of useful information about the overall true achievable performance of a machine. But one thing that was not clear to me is how applicable it is for benchmarking vector machines.

For machines that are not vector processors, it gives a good clue about how well all the components (memory bandwidth, I/O bandwidth, CPU speed, etc.) are balanced. But I am not very clear if it is applicable for evaluating vector processors too. With the C90 for example, the memory bandwidth is not likely to be a bottleneck as long as you are doing scalar processing. Actually the correct way to say this would be that the memory bandwidth for scalar and vector processing on a machine such as the C90 can be very different. If you execute a lot of work in scalar mode, then the data access cannot be pipelined, and memory bandwidth appears to be low.
Whereas if you execute a loop in vector mode, then, at least as far as peak hardware bandwidth is concerned, it could be pretty high. Just considering data (i.e. ignoring instructions, which have a separate memory path to the CPU anyway), it could load four 64-bit words and store two 64-bit words in a clock cycle, which translates to about 12 GB/sec of memory-to-CPU bandwidth, and of course the CPU also can process the data at that speed (though not many applications can do that!)

I was wondering if you have already considered this issue. I do also agree that a lot of dusty deck codes are likely to be primarily scalar, so it could still make sense to use the benchmark, but I was wondering whether it truly benchmarks a vector processor or not. As a parallel benchmark, it does measure the global sum aspect of it, and that is indeed a good measure.

I appreciate any information you can provide on this. I wish I was there at SC 94 to talk to you personally!

Thanks.

R. Reddy

----------------------------------------------------------------------------
From: Quinn Snell
Subject: Re: HINT
To: Richard Marciano
Date: Fri, 4 Nov 1994 14:24:19 -0600 (CST)
Cc: hint

> Dear Quinn:
>
>
> A colleague of mine, from the National Supercomputing Center
> for Energy and the Environment, here at the UNLV campus, sent you
> mail recently inquiring about whether you had a vector version of
> HINT. I am assisting David, with running your benchmark on our
> local Cray YMP2/220. We will be glad to convey the serial results
> back to you as soon as we are done.

Thanks, we would appreciate it very much. I don't have much experience on Cray machines and my access to them is limited also.

>
> I wanted to follow up on David's question and ask you about
> the potential for parallelization of the hint C code. How well
> does it parallelize? Can you give me some hints on how you would
> go about it?

Here is what I was considering doing to the HINT kernel for vectorization.
The important part is that you don't eliminate any of the partial sorting that is being done by the serial version. A pseudo description follows:

1) Run enough iterations of the serial code to fill the queue for
   vectorized code. In the early stages of the benchmark, this will be
   the only section of the kernel that will run.

2) for (i = .................)
   {
      for (j = 0; j < VECTOR_SIZE; j++)
      {
         a) rect[ma] = next subinterval to be split
         b) divide the interval in half resulting in rect[io] and rect[jo]
         c) evaluate and fill rect[io] and rect[jo]
         e) record the indices (io and jo) in a temp array to be used
            later to evaluate the error values and partially sort the
            indices into the queue
         f) update sumhi and sumlo
      }
      g) Using the recorded indices in the temp array, evaluate the
         error value for each subinterval that was created.
      h) Perform pairwise comparison of the error values to place the
         indices into the queue
   }

3) Perform Global Sum Collapse if needed.
   (only when multiple processors are used)

4) Return sumhi - sumlo

>
> Also, I noticed you reported on results for the MasPar MP-1
> and MP-2. I currently have accounts on both of these machines at
> your center, and would be very interested in running hint on those
> platforms. Could you send me your SIMD implementations of HINT?
>

I will place a copy of the SIMD implementation of HINT on the ftp server this weekend. You should be able to download it on Monday.

> Thanks for your assistance. I am very interested in
> learning more about your benchmark.
>
> Sincerely,
>
> -Richard

----------------------------------------------------------------------------
From: Paul J. Hinker
Date: Tue, 18 Oct 1994 09:33:03 -0600
To: John Gustafson
Subject: Re: HINT

John Gustafson writes:
> I haven't heard from you regarding HINT. Did you do any more work on it?
> Did you use PVM, the "native" library for the T3D, or did you try more
> low-level routines? Some folks in Minnesota are thinking that might
> explain your somewhat disappointing numbers.

I used the shmem communications [native] stuff but thought the numbers I'm getting are pretty good. I'm seeing 102+ Quips on 64 nodes which is far and away better than anything else on the chart. Am I reading the chart [in your paper] wrong?

As for the CM5, TMC seems to think that I need to hand code the HINT benchmark to take advantage of vector units on the machine to be fair. I'm not going to spend that kind of time working on the code just because the TMC compilers are so weak. I didn't have to hand code to get at the multi-issue pipeline on the T3D alpha chips. It looks like I'm allowed to release the CM5 numbers but some mention needs to be made about the fact that the VU's aren't being applied to the problem.

> In other respects, how are things going?

Things are going well. We're ramping up for the dog and pony show at SuperComputing where we'll [hopefully] be showing several widely separated MPP's [Nersc T3D, Lanl T3D, JPL T3D, Lanl CM5] working on the same problem.

Paul

----------------------------------------------------------------------------
From: John Gustafson
Subject: Re: HINT Benchmarks on Sandia's 1840-Node Paragon
To: David R. Gardner
Date: Thu, 29 Sep 1994 09:27:01 -0600 (CDT)

> John,
>
> What were the results of the HINT benchmarks on Sandia's 1840-node Paragon?
>
> ...David Gardner
>

David,

They were pretty impressive, certainly the fastest by far of anything we've measured. They also made clear the superiority of SUNMOS over OSF. To give you a feel for our results, here's some "Net QUIPS" numbers:

No. of   Net MQUIPS   Op. Sys.     Compilation
PE's
------   ----------   --------     -----------
 1840      633.       SUNMOS       icc -O4 -knoieee -Mvect
  512      249.
   64       46.2
   32       25.7
   16       13.5
    8        7.07
    4        3.76
    2        2.03
    1        1.22

   32       12.6      OSF/1 1.04   cc -O3 -knoieee

In comparison, our 256-node nCUBE 2S got 35.8 Net MQUIPS; an 8-processor SGI Challenge L (R4400/150) got 17.5 Net MQUIPS; and most high-end personal computers or low-end workstations are getting about 0.5 to 2.0 Net MQUIPS.

Most of the difference between SUNMOS and OSF/1 is due to the superior latency of SUNMOS and the increased usable memory. I don't know how much you know about HINT, but it's very sensitive to both things (unlike most benchmarks). Look for our Tech Report... it's on the way to a number of people in 1400. (We've been delayed by setting up the mailing list on a Mac).

-John Gustafson

----------------------------------------------------------------------------
From: John Gustafson
Subject: Re: HINT output form and some questions
To: Arne H. Juul
Date: Mon, 18 Dec 1995 11:29:07 -0600 (CST)

Arne,

Thanks for sending your HINT results and questions. If Quinn Snell has time, he might want to add some things to the reply. I'm writing this prior to graphing your data, but I think I understand your questions anyway. You know we have an Onyx on our HINT Web site, don't you? You can compare your graph with ours.

> It's the university's main student mail- and fileserver. So far
> I've only run the default (DOUBLE data/int index?) test.

For the SGI machines, this is a safe bet as the best data type. Once 64-bit integers are supported (sometimes called "long long" in C), they might be worth a try. It's easy to change types if you're curious... see the README file. The other types run out of detail to discretize on machines that are this fast, and that inability to run for very long hurts more than the extra speed of, say, float or int, helps.

> I've appended the 'form' from the README file with data for the run,
> and also the HINT program output, the DOUBLE datafile produced, and
> my modifications to the source.

Thanks. I wish all of our correspondents were so thorough.
> Here's my interpretation of the graph produced: > - the first results arrive in 2e-6 seconds and has a 'quality' of 450k. > - in the range between 1e-5 seconds and 0.0003 seconds, there's a > 'quality' of ~550k (probably using first-level cache?) > - in the range roughly between 0.001 seconds and 0.1 seconds, there's > a 'quality' of ~400k (probably using second-level cache mostly?) > - in the range from 1 seconds upwards there's a "quality" around 240k, > probably partly going to RAM. Paging to disk was not tested at all, > because I couldn't raise my memoryuse limit high enough. Without looking at the graph, it sounds like you're referring to the vertical axis as "quality." It's actually the time derivative of quality, QUIPS (quality improvement per second). Quality should improve monotonically with time, since you keep finding out more and more about the bounds on the integral of the function. Other than that, your interpretations sound exactly right. It can sometimes be enlightening to use Gnuplot or similar tool to plot QUIPS as a function of _memory_ instead of time... that way, you see the sharp cliffs at places like 32 KB, 1 MB, 32 MB, etc. where you can easily confirm that those are the sizes of memory regimes. After 1 second you should definitely be in main RAM. > Does this sound about correct to you? Also, when plotting against > problem size, what does the 'problem size' refer to, bytes or number > of doubles, or something else? "Problem size" is the number of subintervals used to evaluate the integral. If precision loss were nonexistent, the quality would equal the problem size, so it's interesting to track the discrepancy between problem size and quality for various data precisions. By defining "problem size" in this way, it's more of a mathematical universal quantity (and hence in keeping with the design principles of HINT), whereas the bytes or number of doubles, etc. can vary all over the place depending on the computer. 
Not even a "byte" is a standard... it can be anywhere from 5 to 9 bits depending on the machine! You can also think of "problem size" as the number of iterations through the HINT loop. It's a single, simple loop. > Would it be possible to produce more datapoints? (This wasn't a problem > here, it's fairly smooth, but I would like to try on some PC's for > comparison and my first cut graphs looked pretty rough). Yes. I see that you reduced the 1.25 ratio to 1.15, which is the first thing to do. We've also seen that the graphs can be smoothed by increasing RUNTM, though we don't understand why we have to run more than 0.5 second when these computers are supposed to have such excellent timers. Until we understand this better, the easiest way to smooth the graphs is to increase RUNTM until you can't stand the length of the run. You might be able to economize some of the other overkill-type things to help. Ultimately, we'll put all of these adjustments in a GUI interface and probably discover we're doing something really stupid and slow. PC timers are often very poor-resolution; we have some experience with them and Quinn Snell should probably give you some tips. > Hope this is of some interest so I'm not bothering you. Bothering us? Hardly. Great to hear from you! Keep in touch. -John Gustafson ---------------------------------------------------------------------------- Date: Mon, 19 Feb 1996 12:57:44 -0500 From: R. Serota Subject: Re: QUIPS To: John Gustafson Dear John, Thank you for your detailed answer. I was thinking that because the goal is to get the closest to the right answer, one should integrate under the curve to find the total quality achieved after the run. I guess, the reason for misunderstanding is that it is not clear how your data relates to the execution of a particular application program, be it a C-compiler, Mathematica, or word processor. It is generally assumed that the performance of those roughly scales with standard benchmarks, e.g. megaflops. 
So, my confusion has to do with which part of the plot need to be considered to gauge an application performance. Sincerely, Slava Serota ---------------------------------------------------------------------------- >Professor Serota, > >You wrote: > >> We were slightly perplexed as to the meaning of QUIPS. Is there a >> discussion that we could find useful? In particular, what is the >> meaningful interpretation of graphs QUIPS vs. time: should we think >> about the area under the curve? - then, because of the log scale it >> appears that initial rise is meaningless... Please, advise. > >First, have you read the technical paper on HINT? It's on our Web >site. It might help. > >You ask good questions. As you probably know, QUIPS stands for >QUality Improvement Per Second. Quality is defined as the reciprocal >of the error, where error is the difference between the upper bound >and the lower bound, done very rigorously, of a decreasing continuous >function on [0,1]. The method is hierarchical, subdividing the source >of the largest remaining error (HINT = Hierarchical INTegration), so >quality should improve steadily until some machine resource (precision, >memory, etc.) is exhausted. As much as possible, we have tried to >find a numerator for the definition of computer speed (somethings per >second) that is on very firm mathematical ground; QUIPS is about as >close to "meters per second" as you're going to find in the computer >world. > >I'm a former physicist myself, so I appreciate your asking. When you >go from physics to computer science, it is appalling how lacking the >field is in decent metrics for anything. > >Now, about the log scale: First of all, computers need study at >every scale from nanoseconds to several seconds, suggesting the >scale compression afforded by log scales. Second of all, we do >a logarithmic integral of the HINT graph, Int(Quality/t)d(log t). 
>To get to a given quality in .1 second is clearly more important and >valuable than getting there in .2 second, and we say it is exactly >twice as important and valuable by using a log scale. It weights >the importance by 1/t, since d(log t) = 1/t dt. Otherwise, the area >under the HINT curve would always equal the last quality obtained, >sort of like rating a vehicle by how far it traveled without asking >how long it took to get there! An elegant side effect is that the >units on the logarithmic integral again have time in the denominator, >so the area under the log scale curve has units of QUIPS again. Which >feels very right. > >Attaching a whole lot of memory or disk to a slow processor will help >your Net QUIPS, but not all that much because it's way over on the right. >Having a quick processor with low subroutine overhead and fast register >or cache access helps by making a spike on the left, but only a spike. >The only way to "cheat" at HINT is to make the whole curve high, which >is not cheating at all and will benefit all applications commensurately. > >Why do you say "because of the log scale it appears that the initial >rise is meaningless"? The log scale increases the importance assigned >to the first few steps, not decreases it. Can you clarify your question >please? > >-John Gustafson ---------------------------------------------------------------------------- Date: Mon, 19 Feb 1996 10:22:28 -0800 (PST) From: eric@mote.ME.Berkeley.EDU To: John Gustafson Subject: Re: Hint Site I really appreciate the response. I think I'm still steeped in the old benchmark methods. The HINT numbers seem to combine both integer and floating point performance (i.e. no HINTint and HINTfp), correct? The numbers that seemed the most curious to me were the Pentium100 and Pentium120 numbers. These numbers show that a P120 is 1.5% faster than a P100. The P100 number was obtained using integer. 
However, if you publish the best numbers, why isn't the best P120 number also integer and 20% bigger than the P100? I would think that for the same chip class, the best numbers would all come from the some 'type' of test. I'd love to d/l the code and play with it myself but I don't have a native PPC compiler. I guess I could try it under emulation though. Again, thanks for the response. On Mon, 19 Feb 1996, John Gustafson wrote: > Eric, > > Thanks for your response to the HINT page. > > > I love your HINT site. It really shows how many PC's have caught up > > to workstations in performance. > > We see the main distinction between PC's and workstations to be their > operating systems. "Workstation" and "Unix" seem to go together, > regardless of price. We were stunned when we found a Mac 8500 beating > an SGI Indy. > > > I was wondering why all the > > Macintoshes have (Double) after them. I believe this indicates that > > the code was compiled in double-precision. It's hard to make valid > > comparisons of machine speed when some code is running in > > double-precision and others in single-precision. Is there a way for > > you to put some single-precision numbers up for the Macintoshes? > > Here's where you have to read the background material on HINT to > understand the performance data. HINT is not, repeat _NOT_, like > any benchmark you've ever seen. It fixes neither time nor problem > size, and allows any numerical precision you like. That way we can > do comparisons of wildly different computers, fairly. We typically > test data types of 16-bit int, 32-bit int, 64-bit int if the system > supports it, 32-bit float, and 64-bit float (which often is internally > computed to more bits, like 80 or 128). For our tables, we publish > the BEST number. If you think "single-precision" will help the > performance, you might be surprised. 
The quality of the answer is > hurt, especially for large problems (the right-hand part of the > HINT graph), so even if one single-precision op is faster than its > double-precision equivalent, you get reduced QUality Improvement > Per Second (QUIPS). Check out our paper, also on the Web! > > If a computer prefers int to float, it's probably designed for > text editing or financial applications etc. instead of scientific > computing or graphics. I think this happened in the Indy PC, versus > the Indy SC version of the same machine. So it's kind of a > "personality indicator." You can download our code and try > different precisions yourself, if you like. > > This approach represents about 15 years of trying to fix what is > wrong with most computer benchmarks. Please let me know if HINT > ever gives you misleading information, because that's how we learn > how to improve it. > > -John Gustafson > ---------------------------------------------------------------------------- Date: Thu, 22 Feb 1996 16:47:53 -0800 From: Guy Kawasaki Subject: HINT performance Remember my posting of the Ames Laboratory benchmark Web site? This is the URL: I just got a message from the people there: "Exciting news: The HINT numbers have jumped up for the Mac 9500 here, because of use of a newer version (7.0) of CodeWarrior that has instruction scheduling in the optimizer. It's now up to 5.9 MQUIPS, about twice a Silicon Graphics Indy! It's also about twice the high-end Pentiums we've tested. It's faster than a Pentium Pro." The bottom line, evangelistas, is that one great reason for buying a Macintosh is superior PERFORMANCE. And you can take that to the bank. Kick Butt! Guy ---------------------------------------------------------------------------- Date: Mon, 13 May 1996 14:19:41 -0500 (CDT) To: Allan Berger From: gus@ameslab.gov (John Gustafson) Subject: Re: HINT benchmarks Hello Allan, Sorry if we didn't reply promptly to your message. I remember reading it. 
I'll also ask Quinn to write a response... he has more specific answers to some of your questions. >>>A few months ago there was much discussion that the HINT benchmarks (when >>>first made available) were too slow on PowerPC Macintoshes. There were >>>claims that higher numbers were achieved, but the chart was not updated >>>at the time. Too slow? They are outrunning our Silicon Graphics Indy's, and have from the first time we measured a Mac 7100, I think. The 8500 really blew us away. And the chart is fairly up to date. >>>Is there any truth to these claims? Are the numbers posted now >>>(http://www.scl.ameslab.gov/HINT/) the most recent, most accurate >>>benchmarks? We're hoping to commercialize HINT, or maybe create a non-profit corporation to run it. Updating the table is a low-priority task for us right now. So who is making these claims? The PPC Macs look markedly more powerful than the fastest Pentium-based PC's, so I can't believe anyone's saying that HINT puts the Macs in a bad light. >>>Thanks for any info you can offer. It's hard to evaluate any benchmark, >>>but the claims that data may be out-of-date make it even more difficult. One thing... any rumors that HINT is sensitive to the compiler are false. And also probably my fault. I got a much higher reading with a new release of CodeWarrior, but it was because we'd also switched to measuring 32-bit integer speed instead of floating-point speed. HINT does not seem much affected by compiler tricks of any kind, which supports the theory of its construction. After two years, it still appears bombproof (cheatproof, that is). >>>PS Did I miss a PPro machine benchmark, or is that not yet posted? >>>Thanks again. That one's for Quinn. There are thousands of independent brands of computer out there (2000 Wintel machines alone), so we won't keep up with the onslaught until we have people updating the data full time. 
-John Gustafson ---------------------------------------------------------------------------- Date: Fri, 17 May 1996 09:40:54 -0500 (CDT) To: Peter Hahn From: gus@ameslab.gov (John Gustafson) Subject: Re: Runtime comparison for Intel i860 Dear Peter, Thanks for your comments and questions on HINT. >Is there by any chance an i860 processor among the machines listed? I need to >compare runtimes for an integer intensive program on the 40MHz i860 and a 75Mhz >SuperSPARC on a SPARC10. Even a guess on your part would be valuable. I have >been unable to figure it out. All I can find for the 40 MHz i860 is the >Dhrystone (MIPS) of 50.2(Ver2.1) and 64.9(Vers1.1) and a Stanford Integer MIPS >of 29.6. I know we measured a single processor of the i860 in the Intel Paragon, but for some strange reason I don't see it in our table. I'll see if we can add that. If you look at our Intel Paragon 32 nodes, you see 24.7 MQUIPS. A simple divide would say one node is 0.77 MQUIPS, but obviously a serial run with no parallel communication overhead would be faster. But not that much faster. Probably no current processor has been more deceptive in its specifications than the i860. Pay no attention to Dhrystone MIPS, matrix multiply, or LINPACK, especially if the numbers in any way come from Intel instead of an unbiased third party. Based on memory bandwidth, the peak rate supportable by the i860 is 6.67 MFLOPS. Compare this to the 80 MFLOPS in the Intel marketing literature and you begin to see why I urge caution... and perhaps why the i860 is now a discontinued processor line at Intel. Current Pentiums and Mac PPC are about three to four times faster than the i860 on balance. -John Gustafson ---------------------------------------------------------------------------- Date: Fri, 10 May 1996 11:08:16 -0400 From: John D. McCalpin Subject: benchmarks To: gus@ameslab.gov Hello John, I enjoyed reading about HINT.
There are two other reasonably important benchmarks that cover the same territory that I wanted to be sure that you were aware of: (1) STREAM STREAM is a simple synthetic test of main-memory bandwidth. Its advantages are that it is very simple to understand and simple to relate to the performance of some large codes. It is also nice that it upsets most workstation vendors (because their results are so bad). It is used by all major vendors (though most will not release results), and it was developed and continues to be promoted by me. (2) LMBENCH LMBENCH is a largish set of microkernel benchmarks measuring latency and bandwidth for memory, networks, and disks. This package was developed by Larry McVoy while he was at Sun, and is still supported and promoted now that Larry is at SGI. The memory read latency test provides results that are especially similar to the patterns shown by HINT. STREAM is popular because it is YASFOM (Yet Another Single Figure Of Merit), and therefore easy to understand. LMBENCH is far more comprehensive, but also harder to run and requires more work to interpret. I am just about to make the transition from academia to industry, but I suspect that I will be given enough freedom to continue the development and promotion of STREAM. My target is a simple benchmark that will return (latency,bandwidth) pairs as a function of problem size, but still maintain a single figure of merit for the main memory bandwidth. john ---------------------------------------------------------------------------- Date: Tue, 2 Jul 1996 09:43:34 +0100 To: Nathan K. Meyers From: gus@ameslab.gov (John Gustafson) Subject: Re: Fixed-Time Benchmarking Cc: gus Nathan, I've been remiss in responding to your question about fixed-time benchmarking for graphics performance. The short answer is that I know of no published work in this area. I've given it some thought, however, which I'll describe here... 
and I encourage you to develop your own benchmark method and publish it with the scaling ideas of SLALOM and HINT. First, one needs a _quality_ measure for graphic output. It may be possible to consolidate features such as pixels per inch field of view bits per pixel frames per second contrast range into a single figure of merit that measures distance between what a human eye observes and what it is _capable_ of observing. We know a lot about the limits of human perception. Current display systems are still pretty far from that limit. Whether displaying line/polygon graphics or photographs or console windows, one can (I think) compute quality as linear in all five dimensions listed above. By multiplying all of them together to find quality, any weakness in a graphics subsystem is exposed since the low factor will cripple the single figure of merit. Which is what one wants. But once we saturate the optic nerve bandwidth and the imaging system in the eye, the benchmark ceases to be linear. The fixed-time aspect is trickier. I don't see the HINT idea as being particularly applicable here. The persistence of vision provides the human time scale... about 1/50 second, I suppose. But when I try to quantify "What can the system do in 1/50 second," I get stuck. It's a very interesting question. We do plenty of work with graphics in my lab (we have a fast method of solving, really solving, Kajiya's Rendering Equation that runs in parallel, for example) and will soon be faced with this graphics performance issue in the papers we write. I'd be happy to continue this dialog if it is productive to you. -John Gustafson >Dr. Gustafson, > >I've recently encountered your papers on HINT and SLALOM. I work in the >graphics world, which is just as heavily populated with meaningless and >deceptive performance benchmarks as is general computation. Do you know >of any work that has been done in applying the principles of fixed-time >benchmarking to the area of graphics performance? 
> >Nathan Meyers ---------------------------------------------------------------------------- Date: Fri, 2 Aug 1996 10:23:02 +0100 To: Scott From: gus@ameslab.gov (John Gustafson) Subject: Re: HINT results... Cc: hint Hello Scott, Thanks for your note on HINT, and for the data file. I'll see if we can put it in our database, though I think we have an 8500 on there already. >My name is Scott Thompson. I am intrigued by computer speed! I have >always tried to get the fastest computer for the least amount of money >with the greatest amount of capability. You're our kind of guy. >I now have an Apple 8500/132 >(which I have accelerated to 150Mhz). You mean you replaced the clock chip, perhaps adding a fan to prevent overheating, and checked everything for correctness? Interesting; we have no data for that kind of experiment, and it would make very clear that the speedup was not a factor of 150/132 because of other limitations. >While this is not the fastest >computer (by far!), it seems to be a great one for the kind of work that >I do. 8500s are wonderful machines. I was blown away the first time we tested one, since it outran our Silicon Graphics Indy. Most of the current generation personal computers are now faster than the Unix workstations of slightly higher price. >I am a computer programmer essentially, although I would say I am >more of a system engineer. I produce multimedia mostly these days, but I >have done lots of scientific stuff over the last ten years. I have a >B.S. in Applied Math from the University of Idaho ('86) and a B.S. in >Computer Science ('86) as well. Anyway... My co-developer, Quinn Snell, is from Idaho. My background is Applied Math. Looking for a job, Scott? Things are tight here, but things can change. >I was checking out the HINT stuff. I ran the Hint.ppc code. I got the >results, but don't know how to graph them and/or interpret them. What do >I do next? Here are the data points: Let's see if I can remember the columns.
I think it's Time in Sec. QUIPS Quality Iterations Memory, bytes >0.6916666667 180468 124824 124941 10495044 >: >0.0000053674 372617 2 3 252 >0.0000030847 324178 1 2 168 And we usually use Gnuplot on the first two columns to make our HINT graphs. See the technical report that goes with our Web page. The graph will reveal the cache structure of your system, something no other benchmark can even begin to do. >and here is the output results: >: >Finished with 3420076.216025 net QUIPs >HINT has finished. Choose Quit from the File menu That's 3.42 net MQUIPS using 64-bit floating-point data types. Not bad! And that's the 150-MHz modification, right? >I will be anxious to see how HINT runs on the Power Computing machines >(see > > http://www.zdnet.com/macuser/mu_0996/reviews/review01.html > >for a interesting review of 225Mhz PPC 604e based machines). Yes, I saw that review. We're very interested in getting our hands on one of those, also. >Is there source code for the HINT.ppc program available? >Are the result tables on-line somewhere as well? We distribute a C version that you can compile yourself, and I think a Mac version that is really warmed-over Unix interface. But you are welcome to study the source code, and especially the README file that accompanies it. Any data you contribute will be credited to you in our database. We would very much appreciate anything you can send us that is measured in a careful way that others can confirm. We'd have to put your 150 MHz mod into our "hardware experiments" section, I think. Mostly we'd like to measure machines exactly as they are out of the box from the factory, since all the extensions and system mods tend to slow things down and clutter up the measurements with interrupts. Thanks again for writing us, and keep in touch. -John Gustafson ---------------------------------------------------------------------------- Date: Mon, 12 Aug 1996 13:49:24 -0700 (PDT) From: Gary E. 
Later Subject: Definition of QUIPS in html article To: gus@ameslab.gov I haven't looked at the code yet, but in reading your article, it appears from the plots that what is being calculated and plotted is the *mean* "quality improvement per second". Wouldn't it make more sense to plot dQ/dt rather than the average? This would certainly make the plots easier to interpret, since it would eliminate the linear tails that occur whenever there is a sharp change in performance such as reaching the end of cache or reaching the precision limit for quality improvement. What would be seen instead is a sharp dropoff to the new performance level. Similarly, NetQUIPS = Integral( (dQ/dt)*d(log t) ) = Integral( (1/t)*(dQ/dt)*dt ) would be the sum total of the actual QUIPS, not the total of a moment to moment average QUIPS. It seems that it would be no more difficult to calculate than the present definition of QUIPS, since the best measure of dQ/dt is simply the difference in quality from one iteration to the next. Thanks for listening; I was just curious. | Gary E. Later ---------------------------------------------------------------------------- Date: Mon, 12 Aug 1996 17:01:53 +0100 To: Gary E. Later From: gus@ameslab.gov (John Gustafson) Subject: Re: Definition of QUIPS in html article Cc: hint Gary, Thanks for your comment. Your idea is one we sweated over for a long time: >I haven't looked at the code yet, but in reading your article, it >appears from the plots that what is being calculated and plotted is the >*mean* "quality improvement per second". Wouldn't it make more sense to >plot dQ/dt rather than the average? Yes, in many ways it would. But have you looked at the HINT graphs? They have unremovable jitter plus removable noise in the data that is amplified by using instantaneous dQ/dt. We can get rid of this by fitting the data with an analytical model, and then the instantaneous dQ/dt becomes smooth and meaningful. We're working on that. 
I very much agree that it is the right way to view memory regimes. On the other hand, I also find it useful to think about the average speed including prior time in faster memory regimes. Imagine asking the question, "How fast is a human on foot?" for a wide range of time scales. >From sprinting to jogging to walking to walking-and-resting-overnight, you can understand the speed as a total distance divided by total time. And with a logarithmic scale, the effect of earlier speeds is quickly diluted by that of the larger time range. So for now, that's what we're doing. HINT produces five columns of numbers. There are many ways to look at that data, not just the one we typically use. The only thing set in stone is the task itself and the restrictions on how it can be modified and run. >: >Similarly, > NetQUIPS = Integral( (dQ/dt)*d(log t) ) > = Integral( (1/t)*(dQ/dt)*dt ) > >would be the sum total of the actual QUIPS, not the total of a moment to >moment average QUIPS. I agree. Like the above comment, this becomes doable with an analytical model that removes the noise. >It seems that it would be no more difficult to calculate than the >present definition of QUIPS, since the best measure of dQ/dt is simply >the difference in quality from one iteration to the next. > >Thanks for listening; I was just curious. We appreciate your input. You're the first person outside our research group to make this observation... you must be a very careful reader. Please let us know if you find HINT useful in your work or decision- making. John Gustafson ---------------------------------------------------------------------------- From: bonze Subject: Correction to earlier mail about HINT To: gus@ameslab.gov I believe that I mis-spoke slightly in my earlier e-mail. I think that I said the dQ/dt would be the difference between Quality for an itera- tion and its predecessor, when what I of course meant was that it would be that difference divided by the time it took to do that iteration. 
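As an illustration of the distinction under discussion, the following Python sketch contrasts the cumulative average Q/t with the per-iteration difference quotient, using the two consecutive rows quoted in Scott Thompson's message elsewhere in this compilation. It is only a sketch: the function names are hypothetical and not taken from the HINT source, and it assumes the Time and Quality columns mean what John's column list suggests.

```python
# Two consecutive rows from the quoted HINT output: (time in seconds, quality).
# Only these two columns are needed for the comparison.
rows = [
    (0.0000030847, 1),
    (0.0000053674, 2),
]


def mean_quips(t, q):
    """Cumulative average: total quality divided by total elapsed time."""
    return q / t


def instantaneous_quips(t_prev, q_prev, t, q):
    """Finite-difference estimate of dQ/dt between two consecutive rows."""
    return (q - q_prev) / (t - t_prev)


(t0, q0), (t1, q1) = rows

# Cumulative averages, which should agree with the QUIPS column up to rounding.
print("mean QUIPS at row 1:", round(mean_quips(t0, q0)))
print("mean QUIPS at row 2:", round(mean_quips(t1, q1)))

# Per-iteration rate between the two rows.
print("dQ/dt between rows :", round(instantaneous_quips(t0, q0, t1, q1)))
```

On these two rows the difference quotient comes out higher than the cumulative average at the second row, because the average still carries the slower first step. With real data the difference quotient also picks up timer jitter from each individual step.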
I still don't think that it should have any adverse effect on the benchmark, since you must be finding those times to do the average (Q(t) - Q(0))/(t-t0) that you appear to be plotting. So my question still stands. (Hopefully, you'll be able to find my earlier mail. My id at work is "gel1845", not "later".) ---------------------------------------------------------------------------- Date: Mon, 08 Jul 1996 10:05:46 -0700 From: Nathan K. Meyers Subject: Re: Fixed-Time Benchmarking To: gus@ameslab.gov (John Gustafson) John, Thanks for the reply... > I've been remiss in responding to your question about fixed-time > benchmarking for graphics performance. The short answer is that > I know of no published work in this area.... I've had some opportunity to explore the problem since my original letter. The particular problem I'm trying to solve involves simplification of geometric models for interactive graphics performance. The ethos behind the project is similar to that behind SLALOM and HINT: to allow the computer to handle bigger models, not just to spin the same old teapot faster. > First, one needs a _quality_ measure for graphic output. It may > be possible to consolidate features such as > > pixels per inch > field of view > bits per pixel > frames per second > contrast range I will have two images -- a "correct" full-detail image and a simplified one -- and a frame rate to synthesize into a figure of merit. My current thinking for an image quality measure is to identify significant features in the scene and cross-correlate them between the simplified image and the full-detail image. Correlation is an easy way to measure similarity between signals, but unfortunately does not account at all for how signals are perceived visually -- focusing on significant features rather than simply correlating the entire image should help to remedy that. > The fixed-time aspect is trickier. I don't see the HINT idea as > being particularly applicable here. 
The persistence of vision > provides the human time scale... about 1/50 second, I suppose. > But when I try to quantify "What can the system do in 1/50 second," > I get stuck. The best suggestion I've received to date for a scalable problem is 3D fractals: they are easily scaled in a well-controlled manner to any degree of complexity -- much like SLALOM's radiosity problem and HINT's numerical integration problem. The "fixed time" for this problem could be a fixed frame rate, with the curves of merit generated by measuring the degradation of the image as a function of model complexity. (I still need to better understand how SLALOM and HINT synthesize these curves from the raw performance data.) For this approach to be credible, it will be necessary to design a fractal model that stresses the system in ways similar to real models being employed by real graphics users -- just as SLALOM and HINT focused on stressing the system in ways characteristic of scientific computation. > I'd be happy to continue this dialog if it is productive to you. Consider this a continuation :-). Although there are no question marks in this letter, I welcome any comments you have on what I've said. Nathan Meyers ---------------------------------------------------------------------------- Date: Fri, 02 Aug 1996 10:51:24 -0600 Subject: Re: HINT results... To: gus@ameslab.gov (John Gustafson) Hello Again, >Hello Scott, > >Thanks for you note on HINT, and for the data file. I'll see if we >can put it in our database, though I think we have an 8500 on there >already. > >>My name is Scott Thompson. I am intrigued by computer speed! I have >>always tried to get the fastest computer for the least amount of money >>with the greatest amount of capability. > >You're our kind of guy. > >>I now have an Apple 8500/132 >>(which I have accelerated to 150Mhz). > >You mean you replaced the clock chip, perhaps adding a fan to prevent >overheating, and checked everything for correctness? 
Interesting; >we have no data for that kind of experiment, and it would make >very clear that the speedup was not a factor of 150/132 because of >other limitations. I used the PowerLogix clock accelerator. Essentially it is a small board that clips onto the processor card. It makes contact with the CPU crystal and there is a ribbon cable that connects to a small PC board with a small processor and two rotary switches. By rotating the switches you can adjust the speed of the clock chip in 0.5 MHz steps. Very elegant for a hardware hack. And it works! It takes about 5 minutes to install and ... SHAZAM! You have up to 175MHz of PPC speed. The system board, PPC CPU and cache cards must all be capable of the higher speeds for things to work out. At some point the timing of a component fails. So, you just back off the speed until things work well. My board seems to work great at 150.0 MHz. So, other than this modification, my machine is a 'stock' 8500/132. I have a 512K cache card and 64MB RAM. What prompted me to send you the note is that I am not sure my cache card is working properly and I wanted to do a benchmark to test it. Well, your benchmark seems to have distinct 'knees' where the caches saturate, so I thought I would try it. Plus, I wanted to compare this setup to a 9500 (which you had data for). I noticed in the 8500 data that no cache card was listed. My 512K cache is 256K over the baseline machine cache (256K, of course). > >>While this is not the fastest >>computer (by far!), it seems to be a great one for the kind of work that >>I do. > >8500s are wonderful machines. I was blown away the first time we >tested one, since it outran our Silicon Graphics Indy. Most of the >current generation personal computers are now faster than the Unix >workstations of slightly higher price. Yeah. Amazing what consumer markets can do to pricing, eh? > >>I am a computer programmer essentially, although I would say I am >>more of a system engineer.
I produce multimedia mostly these days, but I >>have done lots of scientific stuff over the last ten years. I have a >>B.S. in Applied Math from the University of Idaho ('86) and a B.S. in >>Computer Science ('86) as well. Anyway... > >My co-developer, Quinn Snell, is from Idaho. So, how strong would you say his background is? I think the Univ. of Idaho was a good school, but I don't really have much to compare it to - having never really worked in a 'Math-oriented' environment. > My background is Applied >Math. I always tell people - "It's all math." They just look at me like I'm crazy (of course), but then, they didn't do their math, now did they? I really liked my *APPLIED* math courses. The theory is great, too, but I like to plug numbers in and make things happen. (Like simulation and photo-realistic rendering using *REAL* physics routines in less than order(n^3) time!) >Looking for a job, Scott? Things are tight here, but things >can change. No, not really looking for a job. I have my own company and work out of my home. I like it a lot. I have too much work to do right now, but that is a good thing in this case. > >>I was checking out the HINT stuff. I ran the Hint.ppc code. I got the >>results, but don't know how to graph them and/or interpret them. What do >>I do next? Here are the data points: > >Let's see if I can remember the columns. I think it's > >Time in Sec. QUIPS Quality Iterations Memory, bytes > >>0.6916666667 180468 124824 124941 10495044 >>: >>0.0000053674 372617 2 3 252 >>0.0000030847 324178 1 2 168 > >And we usually use Gnuplot on the first two columns to make our HINT >graphs. See the technical report that goes with our Web page. Well, see, that's sort of a problem. It is a PostScript document and I don't have a PostScript printer/emulator that would digest it for me. :-) >The graph will reveal the cache structure of your system, something >no other benchmark can even begin to do. Exactly!
Speaking of which, why are the PowerMac 8500/9500 curves so spikey? While the RS 6000's are so smooth? What does that mean? Also, why does the HINT program only use 10MB? I have 64MB. Is there something in the theory of the test that uses 10MB as the memory size? Or is it configurable - but not on the Mac Hint.ppc program? So many questions... hope you don't mind! Don't feel too obligated to get back to me ASAP, I would imagine you have real work to do! This stuff interests me partly because I used to do high-speed system simulation at Texas Instruments and so now I want to analyze something (since I haven't done it for a couple of years). > >>and here is the output results: >>: >>Finished with 3420076.216025 net QUIPs >>HINT has finished. Choose Quit from the File menu > >That's 3.42 net MQUIPS using 64-bit floating-point data types. Not >bad! And that's the 150-MHz modification, right? Yep. 150MHz PPC 604 w/ 512K L2 cache, 64MB RAM. I did not reboot with system extensions off, I just ran the thing to see what would happen. Then I took a nap and came back and saw the results. Wouldn't it be nice if there was a cool graphical interface and frontend on that baby? (Hint: That's what I want to do with the code.) > >>I will be anxious to see how HINT runs on the Power Computing machines >>(see >> >> http://www.zdnet.com/macuser/mu_0996/reviews/review01.html >> >>for a interesting review of 225Mhz PPC 604e based machines). > >Yes, I saw that review. We're very interested in getting our hands >on one of those, also. I know the VP of sales, I may try to get him to let me run the Hint code, sound interesting to you? (You know, Power Computing is based here in Austin, Texas.) > >>Is there source code for the HINT.ppc program available? >>Are the result tables on-line somewhere as well? > >We distribute a C version that you can compile yourself, and I think >a Mac version that is really warmed-over Unix interface. 
But you are >welcome to study the source code, and especially the README file that >accompanies it. Where is it? Can you e-mail it to me? > >Any data you contribute will be credited to you in our database. We >would very much appreciate anything you can send us that is measured >in a careful way that others can confirm. Ahhhm, the scientific method. :-) >We'd have to put your 150 MHz >mod into our "hardware experiments" section, I think. Mostly we'd like >to measure machines exactly as they are out of the box from the factory, >since all the extensions and system mods tend to slow things down and >clutter up the measurements with interrupts. > >Thanks again for writing us, and keep in touch. > >-John Gustafson ---------------------------------------------------------------------------- Date: Wed, 14 Aug 1996 05:37:58 -0700 (PDT) From: Gary E. Later Subject: Definition of QUIPS in html article To: snell@ameslab.gov Cc: gus@ameslab.gov Well, I'll try sending this from home since it bounces when I send it from work! If you got my message afterthought, you can disregard, since I have fixed the text below. Re: your article at http://www.scl.ameslab.gov/Publications/HINT/ComputerPerformance.html I haven't looked at the code yet, but in reading your article, it appears from the plots that what is being calculated and plotted is the *mean* "quality improvement per second". Wouldn't it make more sense to plot dQ/dt rather than the average? This would certainly make the plots easier to interpret, since it would eliminate the linear tails that occur whenever there is a sharp change in performance such as reaching the end of cache or reaching the precision limit for quality improvement. What would be seen instead is a sharp dropoff to the new performance level. As it is, the tails have an exponential curve on the semi-log plots that is difficult to distinguish from performance results. 
Similarly, NetQUIPS = Integral( (dQ/dt)*d(log t) ) = Integral( (1/t)*(dQ/dt)*dt ) would be the sum total of the actual QUIPS, not the total of a moment to moment average QUIPS. It seems that it would be no more difficult to calculate than the present definition of QUIPS, since the best measure of dQ/dt is simply the difference in quality from one iteration to the next divided by the time taken to do that iteration: ( Q(t(n)) - Q(t(n-1)) ) / ( t(n) - t(n-1) ) Those times must already be available, since you seem to be dividing total change in quality by total time: ( Q(t)-Q(t0) ) / ( t-t0 ). Thanks for listening; I was just curious. | Gary E. Later ----------------------------------------------------------------------------
Date: Tue, 06 Aug 1996 17:10:11 -0600 Subject: O yeah... To: gus@ameslab.gov John, I forgot to mention... it is interesting to see that the 'power' available in a desktop computer is still steadily increasing while costs are steadily decreasing. I saw that Apple has formally announced the 8500/180, and the 9500/200. So, I am sure they are well on their way to the 250/300Mhz range by now. I can hardly wait! Of course, I have plenty of processing power now, but, if I could do a 3-D rendering in half the time, I would be very happy.
It can take as long as 1 hour per frame of video to render an animation at today's PowerMac speeds (i.e., 150-180 MHz-ish). Since many animations are geared for television broadcast, they are actually rendered at 60 frames per second (this allows for the interlacing nature of TV). That means it can take 60 hours of computer time for only one second of animation. So, a 30 second animation would take about 75 days to render (on a single machine, assuming 1 hour per frame). Sure, there are higher-end computers that are somewhat better suited for this task and there are 'farms' of computers that one can 'hire' to render their stuff, but for the hobbyist (like me), that is not so practical. But, if I can spend as little as $2800 and get this kind of power - Wow! If I can spend a little more (like 50% more) and get double that, then I am truly getting closer to having a real opportunity to be creative and afford it! Pretty soon it will only take a week to render that same 30-second animation at nearly the same price. Make sense? It is really amazing the kind of power an unsuspecting consumer has on his desk! ... So, if that new Daystar 4-processor 604 @ 200 MHz board comes down in price... well, maybe I should get a new (read 'cheaper') hobby before it's too late! Have fun, Scott ---------------------------------------------------------------------------- From: "Rajat K. Todi" Subject: Re: NEW VALUES: HINT values for IBM E30 PPC 604 133MHz To: greg stovall Hi Greg, Your IBM E30 results are very interesting. In fact we were trying to simulate your results with our Analytical HINT (AHINT). This letter is in response to your last two mails to us. >Notes: Only ran the data size up to 206MB. Even fiddling with the >paging memory size and the user allocations, I only got the size >used up to 206MB, which is certainly less than the physical memory >OR the paging memory. Any suggestions? We had a similar problem with other machines.
We were able to fix this by fiddling around with the system. All we are doing is asking for more memory, and the system is not giving it. We will get back to you if we find a solution, but for now you need to fiddle around with the shell and operating system. >Also, looking at the gnuplot curves, this system peaks higher than >the IBM 590, but quickly drops off to a sharply lower value, just >like the Motorola PowerStack did. We've been thinking about the >shape of the curve, and believe it is entirely due to the limitation >of only 64bit width memory accesses versus the 256bit width of the >IBM 590. This seems to hold true even in the large memory limit, >since the 133MHz and the 100MHz machine have similar performances >at the large memory, due to the fixed memory access speed. >Does this mesh with your evaluation? Yes, it meshes with our evaluation. In fact, you have made the right observation. In the large-memory regime, memory access rather than clock rate has a lot to do with how high the curve is. The sharp drop of the IBM E30 curve, compared with the higher IBM 570 curve, looks to be due to the difference in memory access rate (cross-checked with AHINT). Another observation we have made is as follows: unlike the Motorola PowerStack PPC 604 (double) and IBM RS 6000 590 (double), the graph of the result submitted by you does not show the presence of the secondary cache, which is 512 K according to the data sheet submitted by you. There are two possible explanations: either the access rate of the secondary cache is high (which seems unlikely for an IBM machine), or the secondary cache is not being utilized properly. Greg, do you know of a web site where we can get more information on the IBM E30? We would like to learn more about the system parameters of this machine. Thanks for your results. We appreciate your interest in HINT. We would like to hear more from you. Please feel free to give us your suggestions or ask any questions.
Thanks and Best Wishes, Rajat Todi ---------------------------------------------------------------------------- Date: Thu, 16 May 1996 11:06:07 -0700 (PDT) From: M. Edward Borasky Subject: Re: HINT benchmark >HINT is a distant relative of SLALOM. It's neither fixed-time NOR >fixed-size. Both vary, giving a performance curve instead of one >number (though we provide a single number, the integral >of the semilog curve, also). It's only about two pages of C for the >timed part, to keep it simple to convert to a variety of computer systems. >SLALOM had gotten to 8,000 lines, and became too expensive to use as >a benchmark... though we're still working on graphic rendering problems >as ends in themselves. > >>Other news: I now work for ADP Dealer Services in Portland, OR. I've been here >>since Election Day 1992 :-). ADP Dealer Services makes the computer systems >>that run many car and truck dealerships. Most of my job involves solving >>performance problems in UNIX on these systems, but we also do benchmarking of >>UNIX systems from time to time. Somehow, I doubt that HINT is anything >>like our >>application codes, but it might be useful in wringing out UNIX and hardware >>level performance problems. > >You might be surprised. HINT measures non-floating-point machines just as >well as floating-point ones. You can use any data type you want, any >precision, any word size. And because it taxes the memory system at each >level of the hierarchy, I assert that it is quite good for business >applications, personal computers, etc. Please download it and check it >out. I would value your critique. > >You once proposed that almost all benchmark numbers be supplied with a >variance as well as a mean (or a max, more often the case). You can >see noise in the data at a glance in the HINT curves, which is some >information in that direction. Giving people exactly the right amount >of information, no more and no less, is the hard part of benchmark >reporting. 
If we can get people to acquire a feel for HINT graphs, an >awful lot of the BS about system performance will be blown away, I'm >hoping. > >-John Gustafson >(515) 294-9294 I have run HINT on a few of our Motorola systems. One change that needs to be made is to provide alternative versions of the "gettimeofday" call in function "When". For Motorola SVR4 systems, I had to drop the second argument. It turns out that one of our other engineers, Jerry Nelson, is interested in using this benchmark on a Windows NT system. Do you know if anyone has ported it to Windows NT yet? It looked to me like the only thing that needed to change was the "gettimeofday" call, but Jerry is the NT expert. How would you run HINT on, say, a dual-processor machine used as a time-sharing system rather than as a parallel processor. I would think you'd simply run two copies started simultaneously and let them fight between each other for real memory. One of the things we're most interested in is performance measurement in client-server environments -- has anyone looked at extending HINT to that world? ---------------------------------------------------------------------------- From: greg stovall To: todi@scl.ameslab.gov Cc: hint@gemini.scl.ameslab.gov Subject: Re: NEW VALUES: HINT values for IBM E30 PPC 604 133MHz In message " NEW VALUES: HINT values for IBM E30 PPC 604 133MHz" sent on Sep17, todi@scl.ameslab.gov writes: >Another observations we have made is as follows: unlike the >Motoral Powerstack PPC 604 (double) and IBM RS 6000 590 (double), >the graph of the result submitted by you does not show the presense >of the secondary cache which is 512 K according to the data sheet >submitted by you. There can be two things which can be said - Either >access rate of the secondary cache is high (which seems unlikely >for IBM machine) or the secondary cache utilization is >not done properly. I am puzzled by your statement. 
I've gone back and looked at the data I submitted, and I don't see the effects you see on the Motorola data; it looked just like the IBM E30 data, except that the curve for the 100MHz machine is a little lower than the curve for the 133MHz machine. Also, an IBM 590 does not have a secondary cache, so it could not show any secondary cache effects in the first place. What effects should be visible if there is a working and useful secondary cache? I would expect some effect that would moderate the 1/x behavior due to the access speed of the main memory, but would like to verify that presumption. > >Greg, do you know of web-site where we can get more information >on IBM E30. We will like to get more information of the system >parameters of this IBM E30. The closest thing I can find so far is http://www.rs6000.ibm.com/cgi-bin/systems/pci.pl which is probably more safely accessed by following the links down from http://www.rs6000.ibm.com ---------------------------------------------------------------------------- Date: Thu, 7 Nov 1996 16:39:35 -0800 From: James Hodgson To: "Rajat K. Todi" Subject: Console is a VDM not Win32 application and therefore Win32 Subsystem direct Hi. I believe I have created a console version. It looks the same as the executable DOUBLE I downloaded off your site when executed. Here is the problem with that as I see it. According to "Support Fundamentals for Microsoft Windows NT 3.5", page 487, an MS-DOS application has an abstraction layer called a Virtual DOS Machine (with a single thread of execution) running between it and the Win32 Subsystem. This layer runs 16-bit emulated Virtual Device Drivers that themselves communicate with the corresponding Windows NT 32-bit Device Drivers. Now memory is a flat model even in the VDM and I believe Virtual Memory is being used transparently, so the only time the 16-bit virtual drivers would be used for HINT is when writing to the output file.
As long as the code is running 32-bit and is not accessing device drivers (which I can find no reason for during HINT program execution, assuming memory is not using a device driver!) the benchmark should be mostly unaffected (assuming the abstraction layer itself is very efficient!). Does that seem correct to you? It would seem to put you in a pickle for a multiprocessor type run, since it would have to be multi-threaded and be a Win32 application. Back to what I've been trying to do, that is, concisely, to run the HINT benchmark as a Win32 application. So far I'm not having any luck. If my hypothesis about not using virtual device drivers during HINT execution is correct, it won't make a significant difference. I've been running a FreeBSD/NT comparison and realized I was not interacting directly with the Win32 subsystem. Also note that the VDM uses an Instruction Execution Unit Intel emulator on non-Intel NT platforms. Surely this would degrade performance! HINT as a Win32 application should give the most QUIPS. ---------------------------------------------------------------------------- Date: Fri, 27 Dec 1996 11:27:13 -0600 To: Joe Ragosta From: gus@ameslab.gov (John Gustafson) Subject: Re: HINT benchmarks Cc: hint Joe, You wrote: >I've been studying cross platform benchmarks (PPC vs Intel, mostly) for >some time. From all of my comparisons, only SPEC and HINT indicate that >the P6 is comparable to the PPC 604e. Most other benchmarks show the PPC >to be 30-75% faster at equal clock speed. Sorry to take so long to reply to your message. I haven't looked lately, but I had thought the PPC looked superior on the benchmark overall. Guy Kawasaki points people to our site for this reason. >Do you have any ideas to account for this? Specifically, is the PPC 604 >alignment optimization turned on in your compilations? I'm not sure; perhaps my colleagues Quinn Snell or Rajat Todi can answer this one.
We generally pull out all the stops, and HINT is unusual in that it requires only a little intelligence from the compiler; massive effort at compilation trickery usually yields little improvement. >Any comments you could make would be appreciated. Beware of comparing processors to processors. HINT is designed to test the whole system, especially the memory system. Things like cache size and memory latency have much more impact than MHz rates. We've seen 486 systems outperform Pentium systems on both HINT and actual applications, even though traditional (I would say naive) benchmarks favor the Pentium, for exactly this reason. You're probably aware that the PPC is faster by a rather spectacular margin when doing the Fast Fourier Transform (FFT) operation... about ten times. Having enough registers pays off big time for some things. When I look at the Pentium Pro architecture, it amazes me that it does as well as it does. -John Gustafson PS: Please send comments and questions to hint@scl.ameslab.gov, and that way about three people are available to respond right away. ---------------------------------------------------------------------------- Notes on the rules of HINT ========================== Stewart Reddaway 15th Jan 1997 In some respects the rules are far from clear. We have done an implementation on the DAP (an SIMD machine) that will integrate any monotonically decreasing function in a variable range of quality vs. time. It performs very fast, but various changes are made compared with other codes. More changes could be made that would make it even faster, but it would be good to get some comments before proceeding further. The Gustavson and Snell paper talks of sorting the queue of intervals for subdivision according to the remaining error, but says `The subdivisions may be batched or selected less carefully'. 
Our current code, as is normal for an SIMD machine, subdivides a batch of intervals at a time, but puts all the Left daughter into one sheet' of memory and all the Right daughter into another `sheet'. It could (but does not currently) total the errors for each batch of daughters and `sort' the bigger error daughter batch ahead of the smaller error daughter in the queue, but it is not beneficial to do so. This is because the current code has no need to compute the remaining error at all. The only additional benefit to computing the error is to find out if the remaining error is zero, and the corresponding (batch of) interval(s) are complete. Even on a serial machine the latter is of dubious value for the specified function, as with dx = 2 (I think) there are no zero errors, and with dx = 1 no further division is possible (and all intervals have zero error). The alternatives on queuing are: 1. Each PE maintain its own `sorted' queue. Available codes do this only to the extent of putting the larger-error daughter ahead of the lower-error daughter. For SIMD machines this involves indirect addressing which is slower. The algorithmic benefit is not big, especially for smoothly varying functions. 2. A batch queue is maintained, so that the larger-sum-error daughter batch is ahead. This avoids indirect addressing, but involves additional summation and again the algorithmic benefit is not big and only applies when mcnt is not a power of 2. 3. Error testing is only done to detect zero-error batches. For the specified function this has no benefit. Even for significantly different functions a zero-error batch (other than for dx = 1) would be rare or non-existent. 4. No error testing is done. EG the left daughter is always placed ahead of the right daughter. For the specified function this happens to be the best choice, but if the reverse was done the worsening in QUIPS is not big. This means there is no need to compute the error. 5. 
With this approach it is possible to go further and manage without any bulk memory. The available time guides the problem partitioning. Our current code is 4. with 7 out of 9 memory-consuming array variables. A different area is precision. We have a lot of flexibility. I am surprised that all RECT variables are held at the same precision in available codes. ---------------------------------------------------------------------------- Date: Thu, 16 Jan 1997 16:18:31 -0600 To: Stewart.Reddaway From: gus@ameslab.gov (John Gustafson) Subject: Re: HINT rules Cc: hint Dear Stewart, >Thanks for your prompt response. As a result we will produce a HINT version >that measures `removable error' and `sorts' in terms of whole blocks >(Alternative 2. in my previous doc). This we will do first and I hope you >consider it within the rules. From what I understand, it is completely legitimate. >We might also produce an Alternative 1. version that involves indirect >addressing and would be both rather slow and more programming work. On the DAP especially, I can see the need to avoid the indirection. Again, I see nothing wrong with this. I presume what you are doing would not be helpful to typical serial computers. If it is, we'd like to know about it and perhaps incorporate it into our standard issue. >There are still several issues to discuss in terms of `what is allowed'. > >Even with sorting, there is scope to greatly reduce the storage needed at the >expense of extra computing. All the f** variables can be recomputed from >scratch each time (this trebles the f** computing), as can the alo and ahi >variables (this multiplies that work by 1.5). xr can be computed from xl >and dx, and, with the block organised Alternative 2, both xl and dx can be >stored on a once-per-block basis. (Other xl values are computed from `iproc' >and the ORIGINAL dx). That accounts for all 9 RECT variables, and could result >in hundreds of times less storage, but with extra computing.
errs and ixes are >once-per-block arrays in alternative 2. What produces the best net QUIPS >remains to be seen. Recomputing the variables instead of storing them is perfectly "legal." Most computers would not benefit, but having the alternative of doing so is interesting enough that we probably should have a "compute-intensive" version that does this. As you point out, some of the variables take very little work to recompute, whereas others should probably be stored. I think you are correct that there is no need to store the error in the array. We'll chew on this a while to see if we can think of a reason to store it once the ordering is known, but I suspect we will wind up removing it from the standard version. Thanks for pointing this out. >> First of all, we have a SIMD version in our collection (MasPar) that is >> on the Web. That should translate trivially to the DAP, and I'd be >> curious to know if you have found major shortcuts to that version. > >Unfortunately it is not trivial, especially with Fortran_Plus. We also have a >form of C++, but facilities are not the same. We would also prefer to avoid >indirect addressing, or at least keep it to a minimum. I understand. We would be interested in having your final version to put in our collection when you are done, if that is possible. By the way, is your effort on HINT in response to a sales opportunity or just as a general marketing effort to maintain benchmark data? >As said above, we will do a version that computes removable error and records >block totals in errs and uses them to sort blocks. It is still possible, as >discussed above, to produce a version with greatly reduced memory >requirements. >We will produce various versions and report clearly what we have done. That's great... but like I said, it sounds like your faster version will be allowed. We are working on a vector version, and have had to think about many of the same things (like indirection, memory savings, and branch simplification). 
>> >Our current code is 4. with 7 out of 9 memory-consuming array variables. This is a bit cryptic; could you elaborate, please? >One technique we have already used is to use Booleans in place of the 2 f*h >variables, which can be only 0 or 1 more than the corresponding f*l values. This is an excellent idea. On most machines, Booleans are not stored or computed economically, but on the DAP it would obviously pay off. -John ---------------------------------------------------------------------------- Date: Wed, 15 Jan 1997 14:38:16 -0600 To: (Stewart.Reddaway From: gus@ameslab.gov (John Gustafson) Subject: Re: HINT rules Cc: hint Dear Stewart, I just got your telephone message a minute ago and gather that you need a reply fairly soon. I hope e-mail is adequate response time. I'm not sure I remember meeting you; Dennis Parkinson once mentioned a fellow among the DAP crowd who could take any benchmark and find shortcuts that no one had ever thought of. I don't remember his name, but have always wondered if I'd meet him someday. I used to do the same kind of thing at Floating Point Systems. Do you suppose Dennis was referring to you? Anyway, regarding your questions: >In some respects the rules are far from clear. >We have done an implementation on the DAP (an SIMD machine) that will >integrate any monotonically decreasing function in a variable range of >quality vs. time. It performs very fast, but various changes are made >compared with other codes. More changes could be made that would make it >even faster, but it would be good to get some comments before proceeding >further. I've always wanted to know about HINT on the DAP because it's the only system I know with such flexibility regarding numeric precision. On other machines we can select 8-bit, 16-bit, 24-bit, 32-bit, 53-bit, or 64-bit whole number representation, but the older DAP hardware could do any size you wanted without waste. Maybe the new ones are restricted to multiples of 8-bits...? 
>The Gustavson and Snell paper talks of sorting the queue of intervals >for subdivision according to the remaining error, but says >`The subdivisions may be batched or selected less carefully'. Yes, and I don't have a copy of the paper handy, but I remember in our minds we kept in reserve the right to switch the integrated function from (1-x)/(1+x) to (2-2x)/(2-x) and see if it affects performance. If it does, use the worst case of the two. The (2-2x)/(2-x) has the same shape as (1-x)/(1+x) but produces the opposite "if" statement results during sorting. Just in case the branching has been removed and made deterministic, which one unfortunately can do for (1-x)/(1+x). It is against the rules to remove the branching conditions for exactly that reason. >Our current code, as is normal for an SIMD machine, subdivides a batch >of intervals at a time, but puts all the Left daughter into one sheet' >of memory and all the Right daughter into another `sheet'. It could (but >does not currently) total the errors for each batch of daughters and >`sort' the bigger error daughter batch ahead of the smaller error >daughter in the queue, but it is not beneficial to do so. This is >because the current code has no need to compute the remaining error at >all. The only additional benefit to computing the error is to find out >if the remaining error is zero, and the corresponding (batch of) >interval(s) are complete. Even on a serial machine the latter is of >dubious value for the specified function, as with dx = 2 (I think) there >are no zero errors, and with dx = 1 no further division is possible (and >all intervals have zero error). First of all, we have a SIMD version in our collection (MasPar) that is on the Web. That should translate trivially to the DAP, and I'd be curious to know if you have found major shortcuts to that version. The statements you make about error don't sound quite right to me, and I wonder if we mean the same thing when we say "error." 
For even the first split, x = 1/2, the true value of the function is 1/3 which cannot be represented as a binary fraction without error. There is definitely discretization error caused by rounding down for the lower bound and up for the upper bound, even when dx = 2 or 1. Have you read the short technical paper on HINT on the Web? It almost sounds like your approach has been to assume that (1-x)/(1+x) evaluates exactly and then the error is simply the size of the rectangle. That's not rigorous. Whatever method is used must work even if you only have 4 bits of precision, say! >The alternatives on queuing are: >1. Each PE maintain its own `sorted' queue. Available codes do this only >to the extent of putting the larger-error daughter ahead of the >lower-error daughter. For SIMD machines this involves indirect >addressing which is slower. The algorithmic benefit is not big, >especially for smoothly varying functions. No knowledge of the "smoothness" of the function is allowed. It could be discontinuous everywhere, so long as it is monotone! However, the sorting of the queue does not have to be perfect so long as whatever shortcuts used are not specific to properties of (1-x)/(1+x). >2. A batch queue is maintained, so that the larger-sum-error daughter >batch is ahead. This avoids indirect addressing, but involves additional >summation and again the algorithmic benefit is not big and only applies >when mcnt is not a power of 2. I'm not sure I understand. Certainly we expect batches of intervals to be split on vector, MIMD, or SIMD machines, possibly with less sorting than would be required on a serial or scalar machine. >3. Error testing is only done to detect zero-error batches. For the >specified function this has no benefit. Even for significantly different >functions a zero-error batch (other than for dx = 1) would be rare or >non-existent. Error testing tells how much of the remaining error is removable error as opposed to discretization error. 
Somewhere in our Web pages we show what happens with 8-bit precision, and I think you'd find that enlightening. >4. No error testing is done. EG the left daughter is always placed ahead >of the right daughter. For the specified function this happens to be the >best choice, but if the reverse was done the worsening in QUIPS is not >big. This means there is no need to compute the error. This is exactly what we were afraid some vendors might do; it is strictly against the rules to do anything that works "for the specified function." If you are allowed to know anything about the function beforehand, one might as well integrate it symbolically and simply compute the logarithm required! >5. With this approach it is possible to go further and manage without >any bulk memory. The available time guides the problem partitioning. We specifically require tracking of errors by the data structure. I don't think you can eliminate bulk memory and rigorously account for both discretization error and partition (removable) error, to the exact whole number. If you could, the benchmark would be completely cheatable using the method you describe. >Our current code is 4. with 7 out of 9 memory-consuming array variables. > >A different area is precision. We have a lot of flexibility. I am >surprised that all RECT variables are held at the same precision in >available codes. It'll make performance comparison more difficult, but you are certainly allowed to vary the precision of each variable to maximize the Net QUIPS. I'd be very interested in the results of that experiment. Please keep in touch. -John Gustafson ---------------------------------------------------------------------------- Date: Thu, 16 Jan 1997 16:12:48 GMT From: Stewart.Reddaway Subject: Re: HINT rules Dear John, Thanks for your prompt response. As a result we will produce a HINT version that measures `removable error' and `sorts' in terms of whole blocks (Alternative 2. in my previous doc). 
This we will do first and I hope you consider it within the rules. We might also produce an Alternative 1. version that involves indirect addressing and would be both rather slow and more programming work. There are still several issues to discuss in terms of `what is allowed'. Even with sorting, there is scope to greatly reduce the storage needed at the expense of extra computing. All the f** variables can be recomputed from scratch each time (this trebles the f** computing), as can the alo and ahi variables (this multiplies that work by 1.5). xr can be computed from xl and dx, and, with the block organised Alternative 2, both xl and dx can be stored on a once-per-block basis. (Other xl values are computed from `iproc' and the ORIGINAL dx). That accounts for all 9 RECT variables, and could result in hundreds of times less storage, but with extra computing. errs and ixes are once-per-block arrays in alternative 2. What produces the best net QUIPS remains to be seen. Other comments on your email are below. Stewart Reddaway ---------------------------------------------------------------------------- > From gus@ameslab.gov Wed Jan 15 20:44:54 1997 > > Dear Stewart, > > I just got your telephone message a minute ago and gather that you > need a reply fairly soon. I hope e-mail is adequate response time. > > I'm not sure I remember meeting you; Dennis Parkinson once mentioned > a fellow among the DAP crowd who could take any benchmark and find > shortcuts that no one had ever thought of. I don't remember his > name, but have always wondered if I'd meet him someday. I used to > do the same kind of thing at Floating Point Systems. Do you suppose > Dennis was referring to you? Anyway, regarding your questions: Dennis was probably referring to me. > > >In some respects the rules are far from clear. > >We have done an implementation on the DAP (an SIMD machine) that will > >integrate any monotonically decreasing function in a variable range of > >quality vs. time. 
It performs very fast, but various changes are made > >compared with other codes. More changes could be made that would make it > >even faster, but it would be good to get some comments before proceeding > >further. > > I've always wanted to know about HINT on the DAP because it's the > only system I know with such flexibility regarding numeric precision. > On other machines we can select 8-bit, 16-bit, 24-bit, 32-bit, 53-bit, > or 64-bit whole number representation, but the older DAP hardware > could do any size you wanted without waste. Maybe the new ones are > restricted to multiples of 8-bits...? High level language is 8-bit increments. Low level language is mix of 1 and 8. Boolean is also processed efficiently. > > >The Gustavson and Snell paper talks of sorting the queue of intervals > >for subdivision according to the remaining error, but says > >`The subdivisions may be batched or selected less carefully'. > > Yes, and I don't have a copy of the paper handy, but I remember in our > minds we kept in reserve the right to switch the integrated function > from (1-x)/(1+x) to (2-2x)/(2-x) and see if it affects performance. > If it does, use the worst case of the two. The (2-2x)/(2-x) has the > same shape as (1-x)/(1+x) but produces the opposite "if" statement > results during sorting. Just in case the branching has been removed > and made deterministic, which one unfortunately can do for (1-x)/(1+x). > It is against the rules to remove the branching conditions for exactly > that reason. With a deterministic queue I would be quite happy for different functions to be used, and the worst chosen. It will probably still be the highest QUIPS. > > >Our current code, as is normal for an SIMD machine, subdivides a batch > >of intervals at a time, but puts all the Left daughter into one sheet' > >of memory and all the Right daughter into another `sheet'. 
It could (but > >does not currently) total the errors for each batch of daughters and > >`sort' the bigger error daughter batch ahead of the smaller error > >daughter in the queue, but it is not beneficial to do so. This is > >because the current code has no need to compute the remaining error at > >all. The only additional benefit to computing the error is to find out > >if the remaining error is zero, and the corresponding (batch of) > >interval(s) are complete. Even on a serial machine the latter is of > >dubious value for the specified function, as with dx = 2 (I think) there > >are no zero errors, and with dx = 1 no further division is possible (and > >all intervals have zero error). > > First of all, we have a SIMD version in our collection (MasPar) that is > on the Web. That should translate trivially to the DAP, and I'd be > curious to know if you have found major shortcuts to that version. Unfortunately it is not trivial, especially with Fortran_Plus. We also have a form of C++, but facilities are not the same. We would also prefer to avoid indirect addressing, or at least keep it to a minimum. > > The statements you make about error don't sound quite right to me, and > I wonder if we mean the same thing when we say "error." For even the By `error' I meant what is in the code, ie `removable error'. > first split, x = 1/2, the true value of the function is 1/3 which > cannot be represented as a binary fraction without error. There is > definitely discretization error caused by rounding down for the lower > bound and up for the upper bound, even when dx = 2 or 1. Have you > read the short technical paper on HINT on the Web? It almost sounds like > your approach has been to assume that (1-x)/(1+x) evaluates exactly > and then the error is simply the size of the rectangle. That's > not rigorous. Whatever method is used must work even if you only > have 4 bits of precision, say! We are rigorous with discretisation error. 
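The rounding point raised in the quoted paragraph can be sketched with integer arithmetic. This is an illustration only: it assumes x = k / 2**bits and scales the function value by 2**bits, which may differ from the scaling used in the HINT source, and the name `bounds` is hypothetical.

```python
def bounds(k, bits):
    """Return (lower, upper) integer bounds on 2**bits * (1-x)/(1+x), x = k / 2**bits.

    The lower bound is rounded down and the upper bound is rounded up,
    so the true value always lies between them.
    """
    scale = 1 << bits
    numerator = scale * (scale - k)    # 2**bits * (1 - x), times scale
    denominator = scale + k            # (1 + x), times scale
    lower = numerator // denominator           # floor division
    upper = -(-numerator // denominator)       # ceiling via negated floor
    return lower, upper


# First split, x = 1/2, with only 4 bits of precision: the true value
# 16 * (1/3) = 5.33... falls strictly between the two bounds.
lo4, hi4 = bounds(8, 4)
```

Here `lo4` is 5 and `hi4` is 6. The two bounds coincide only when the scaled value is an exact integer, as at x = 0 and x = 1, and otherwise differ by exactly one unit.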
> > >The alternatives on queuing are: > >1. Each PE maintain its own `sorted' queue. Available codes do this only > >to the extent of putting the larger-error daughter ahead of the > >lower-error daughter. For SIMD machines this involves indirect > >addressing which is slower. The algorithmic benefit is not big, > >especially for smoothly varying functions. > > No knowledge of the "smoothness" of the function is allowed. It could > be discontinuous everywhere, so long as it is monotone! However, the > sorting of the queue does not have to be perfect so long as whatever > shortcuts used are not specific to properties of (1-x)/(1+x). I have not used any specific short cuts. I only refer to the specific function when discussing the effects. > > >2. A batch queue is maintained, so that the larger-sum-error daughter > >batch is ahead. This avoids indirect addressing, but involves additional > >summation and again the algorithmic benefit is not big and only applies > >when mcnt is not a power of 2. > > I'm not sure I understand. Certainly we expect batches of intervals > to be split on vector, MIMD, or SIMD machines, possibly with > less sorting than would be required on a serial or scalar machine. We would like to avoid indirect addressing, so it is helpful to deal with a `block' of, say, 4096 intervals as a whole, even if that means summing two `removable error' child totals across 4096 intervals every iteration. The block with the largest error total can then be sorted to be ahead of the other child block. > > >4. No error testing is done. EG the left daughter is always placed ahead > >of the right daughter. For the specified function this happens to be the > >best choice, but if the reverse was done the worsening in QUIPS is not > >big. This means there is no need to compute the error. > > This is exactly what we were afraid some vendors might do; it is strictly > against the rules to do anything that works "for the specified function." 
> If you are allowed to know anything about the function beforehand, one > might as well integrate it symbolically and simply compute the logarithm > required! Our predetermined queue code knows nothing about the function. I was merely pointing out the effect with the specified function. As I have said, I would be happy for the worst queuing decisions to be made; in most cases I believe, even then, the reduced computing would result in the best net QUIPS. > > >5. With this approach it is possible to go further and manage without > >any bulk memory. The available time guides the problem partitioning. > > We specifically require tracking of errors by the data structure. > I don't think you can eliminate bulk memory and rigorously account > for both discretization error and partition (removable) error, to > the exact whole number. If you could, the benchmark would be > completely cheatable using the method you describe. As said above, we will do a version that computes removable error and records block totals in errs and uses them to sort blocks. It is still possible, as discussed above, to produce a version with greatly reduced memory requirements. We will produce various versions and report clearly what we have done. > > >Our current code is 4. with 7 out of 9 memory-consuming array variables. > > > >A different area is precision. We have a lot of flexibility. I am > >surprised that all RECT variables are held at the same precision in > >available codes. > > It'll make performance comparison more difficult, but you are certainly > allowed to vary the precision of each variable to maximize the Net QUIPS. One technique we have already used is to use Booleans in place of the 2 f*h variables, which can be only 0 or 1 more than the corresponding f*l values. > I'd be very interested in the results of that experiment. Please > keep in touch. 
> > -John Gustafson > ---------------------------------------------------------------------------- Dear John, Thanks for your last response, I include a couple of comments below. The last 2 weeks I have had pressure of other work, but I have now put in a little more on HINT. Although I plan to evolve it considerably, I attach a version of my kernel that I think satisfies the rules. I would be interested in any comments about rule non-conformance. A few notes that may help understand the code: 1. I have used NCHNK = 1. I may move to NCHNK = 2 2. I have used a vector length of 8192 for all array work, but run the code on a DAP with 4096 PEs. I may change to a vector of 4096. 3. I have avoided saving the ALO and AHI by computing the bound changes. 4. I have used INTEGER*6 in the function evaluation and for the global bounds on the integrals. I have used lower precisions elsewhere. I plan to move to UNSIGNED to get one more factor of 2. I then plan to move to INTEGER*8 and/or floating point to extend to more accurate integration; this will require reducing the memory requirement by avoiding all big arrays, primarily by recomputing the f**. 5. dx is stored once per array, and xr is computed. I plan to store xl once per array, and f** can be recomputed. I will do a proper performance measurement after more improvements. I think the current code is over 100 net MQUIPS. Stewart Reddaway ---------------------------------------------------------------------------- Date: Mon, 3 Feb 1997 11:20:51 -0600 To: Stewart.Reddaway From: gus@ameslab.gov (John Gustafson) Subject: Re: HINT rules Cc: hint Dear Stewart, I'm delighted you've gotten so much performance out of the DAP. (That is still what it's called, isn't it?) From what I know of the DAP historically, the performance and price/performance on HINT are right in line with expectations. It's not the fastest machine in the world, but it is much faster than any single-processor architecture... 
and is probably the price/performance leader if one does not take into account the hourly cost of tuning by experts like yourself! >A few notes that may help understand the code: >1. I have used NCHNK = 1. I may move to NCHNK = 2 >: >5. dx is stored once per array, and xr is computed. I plan to store xl >once per > array, and f** can be recomputed. Everything sounds legitimate to me. We intentionally made the HINT rules liberal to permit this kind of optimization, which I hope you find refreshing after dealing with the arcane rules of SPEC or the Perfect Club or even LINPACK. >I will do a proper performance measurement after more improvements. I >think the >current code is over 100 net MQUIPS. Very impressive. If I were you, I'd write and submit a paper on this project. The aspect of finely adjustable precision in benchmarking has never been so thoroughly explored, and you have made a real contribution to the science of performance analysis that others should hear about... and not just through marketing brochures! Thanks for the code listing. Do we have your permission to put it on our Web site? -John ---------------------------------------------------------------------------- Date: Wed, 5 Feb 1997 14:45:24 -0600 To: Charles Grassl From: gus@ameslab.gov (John Gustafson) Subject: Re: Vectorized HINT Cc: hint Charles, I'm glad it's still possible to reach you by e-mail. With all the things going on with respect to Cray, Sun, and SGI, I didn't know if you were even still with the company! >As the HINT program is now vectorizable, can it now be interpreted >as a measurement of bandwidth for vector and scalar/cache systems? Yes, and that brings up another accomplishment of which you might not be aware: "Analytical HINT." Quinn Snell parameterized HINT performance to a very high degree of accuracy for serial and parallel computers.
By fitting the curve (very nonlinear, of course, and sometimes manual intervention is needed) one can derive the true effective bandwidth of any computer. To a large extent, it is possible to infer the bandwidth even if the computer is arithmetic bound (e.g. because it isn't vectorized). However, the error bars are bigger if one does that. Now that we have a vector version, we will need to extend the Analytical HINT model to account for vector performance, and then bandwidth will be experimentally determinable for each memory regime. >Also, is the communication separable from the problem or computation? By "communication" do you mean memory references or interprocessor messages? I guess the answer is "yes" in either case. We depend on having a FEW parameters, like clock speed or maybe even memory latency, as givens, perhaps from engineering specs. Without that we probably have an underdetermined system. We also have some trouble when the cache is used for both data and instructions, which gives rise to ill-defined regime size at the cache level. If Cray is still looking for a benchmark that can clarify both the need for bandwidth and the bandwidth of its products, I'll venture that HINT is the best thing out there. -John ---------------------------------------------------------------------------- Date: Fri, 7 Feb 1997 15:07:19 -0600 From: gus@ameslab.gov (John Gustafson) Subject: HINT Cc: hint Jim, I'm glad to see that word about HINT has spread to the UK. How did you hear about it? Do you know Stewart Reddaway? He's just now finishing a study of HINT on the DAP. >I have just completed an initial read into your work in HINT and >performance analysis/prediction. This is good stuff and I'm not aware >of anything like it elsewhere, including UK. I have some questions on >the subject which I hope you may find time to consider........ Certainly. What exactly did you read? 
Have you found our Web site, or are you reading the paper copy we sent out around 1995 and in the HICSS'95 proceedings? >1) Have you done any work in thinking about how HINT relates to a >multi-threaded application architecture? A little. For example, we've tried comparing performance of a single run of HINT parallelized across a shared memory computer with multiple instances of HINT running independently on the same computer, with job parallelism only. That seems to give a measure of memory traffic interference, and some things about multi-threaded application performance can be inferred from the results. If you are more specific I might be able to provide a better answer. I notice that all of you have "linux" in your e-mail addresses... we use Linux on our Pentium Pro cluster here and have some HINT results for that, I believe. >2) What would be the effect of RAID/disk-striping in the outer regions >of the HINT characteristic? > >3) Are you planning to add characterisation of i/o performance? We have experimented with versions that do nothing but the "outer regions" of HINT, including explicit calls to read from and write to disk (as opposed to automatic paging as recognized by malloc calls in C). We needed those results when we were trying to decide which large computer to buy a couple of years ago. We have not carried that idea as far as we might. As you seem to realize from your questions, HINT is perfectly capable of measuring the effects of RAID and other I/O without alteration of the program or the rules for running it... but we can certainly find some way of presenting that part of the result in a more enlightening way, and we can force the driver to get better resolution in that time range. I think one problem we had was that the benchmarking takes a long time when measuring that part of the curve. There are some ways to make it run faster, however, that we aren't currently using.
>4) Can you point me at the paper&pen version of HINT to make it easier >to explain/demonstrate to my colleagues. Your question has set in motion a project to put that version on the Web. (We're at http://www.scl.ameslab.gov/HINT if you haven't found it yet). >5) Is there source code available to peruse? Lots. It's all on the Web site. Help yourself. And let me know what target architecture you have in mind if you have trouble finding a close fit... we're collecting all the variations we can to save time, but it is usually VERY easy to make it run in parallel on any system. -John Gustafson ---------------------------------------------------------------------------- Date: Mon, 24 Feb 1997 17:26:48 -0600 To: Steve Koons From: gus@ameslab.gov (John Gustafson) Subject: Re: HINT Cc: hint Steve, I'm glad you sent a note to me about HINT; I have some late-breaking news you should know. In particular, we've greatly improved the measurement of Cray-type computers by vectorizing HINT. It made it a whopping seven times faster, proving that compilers still can't do all vectorizing automatically. This gives a fairer assessment of what one is likely to get out of a vector Cray, NEC, Fujitsu, or Hitachi. >I recently stumbled across your pages about HINT and note that >development continues. I also read that this may become a commercial >product? Yes, we'd like to see some of this done as a commercial product. The problem is that the principals aren't ready to make the commitment of time and money yet, and so far no one has offered to simply pay for the license and do all the work. But it might still happen. >You will note that my address is at Lewis Research Center. Is HINT >available at no cost for Government entities? If that's your worry, relax. You can download anything on our Web pages and use it at no charge. I presume you found the code on the Web site, right?
We run it much the same way Dongarra runs LINPACK, offering the code to everyone right now and hoping that some people will send us results so we can fatten up our database. >Now my real question. Given that HINT will provide an assessment of the >performance for various computers, what performance minimums are >acceptable in a general environment? This may take a longer answer than I can supply by e-mail. HINT exposes shortcomings in certain architectures that may or may not affect you, depending on your application. I strongly recommend looking at the graphs and spending a little time to learn how to read them, and we are putting up a guide on the Web to help people do this. Until that guide is ready, I'll happily answer more specific questions via e-mail. If all you want to do is look at one number, then you should know that single processor summaries (Net MQUIPS) range from about 5 for good personal computers to 30 for one node of a C-90. The range is narrowing, as we have all observed! Parallel supercomputers are in the 100 to 600 range. Some of the massively parallel computers suffer on HINT by being so fast that they run out of hierarchical subdivisions using IEEE 64-bit numbers; one wants more than a 53-bit mantissa when speeds exceed about 10 GFLOPS, and few systems provide the option of extra precision without software calls. >As you might suspect, at NASA as >elsewhere, we are undergoing changes in the computing environment. >These changes cost $$, however, and we are looking at what is sufficient >in the way of a general Office Automation environment. If you have any >insight or can recommend someone who might, that would be very helpful. I think HINT can be very helpful in this respect; it is very broad spectrum, does not require floating-point arithmetic, and should be able to compare office automation computers across very different operating systems and vendors, etc. 
We have also done some benchmarking of heterogeneous environments using HINT, and have specific tools for measuring network performance that are similarly unforgiving of design flaws. >Thanks, >Steve Koons Have we met? Your name is very familiar. Supercomputing '9X conferences, perhaps? -John Gustafson ---------------------------------------------------------------------------- Date: Mon, 3 Mar 1997 17:00:54 -0600 To: John R Graham From: gus@ameslab.gov (John Gustafson) Subject: Re: HINT Question Cc: hint John, I appreciate your use of the HINT benchmark in your seminar. Whenever I present it or any other benchmark I've designed, I try to stress the caveat that no one benchmark can serve as a perfect predictor of performance on an application... other than the target application itself, of course. Benchmarks are for people who can't obtain performance on the task mix they have in mind, because (for instance) they haven't obtained the computer yet. So of COURSE benchmarks suffer from being unrepresentative. The trick is to make a benchmark as enlightening as possible with as little effort (like program conversion, execution time) as possible. We think HINT is pretty successful in that respect. >1. HINT seems to be a 'machine' benchmark and not an 'application' benchmark. >People have asked why should I use the HINT results. I am interested in how >fast "my application" runs on various machines and HINT may or may not give >a correct indication. Since HINT is an arithmetic problem, for example how >would that scale to a string or pattern matching program, such as a compiler? The name "HINT," besides its acronym origin, is intended to remind people that all any benchmark can provide is a hint and not predictive certainty. Think of a HINT graph as a "response curve" of a computer to different demands.
Applications might emphasize the left part of the graph that shows in-cache performance on small loops, or the right part that sweeps through the entire RAM or could pick out pieces of the address space with low repeatability. The other half, characterization of the application, we call the "Application Signature." That looks like a distribution function that tells you which part of the HINT curve to weight and by how much. HINT involves a mix of operations that is quite representative of a broad spectrum of computing. It is NOT especially arithmetic-intensive, and doesn't even require floating-point math if you want to use integers. There's enough integer comparison and branching to be fairly representative of compiler activity, I think. If you find a pair of computers for which the relative compiling speed seems unrelated to any part of the HINT curve, I'd like to hear about it. Finally, HINT is a small but real application. It actually computes the answer to a problem, unlike LINPACK or Whetstones. The structure very closely resembles that of simulations that use mesh refinement or hierarchical methods or Monte Carlo methods. What it does not do a very good job of representing is huge problems that tax the instruction cache, since the loop of HINT easily fits in the instruction cache on modern computers of all sizes. >2. I ran the code on a SPARCstation 4 with the -O2 option for the compiler >and without the flag. As expected the Net MQUIPS was higher for the optimized >version. In fact it was much higher (~70%). I explained the correct way is not >to use the Net MQUIPS but to examine the graph and look at the peak values >for interpretation. Is that correct? Ideally, one does not boil the graph down to a single number unless one has to, like for list comparison or a procurement criterion. There is so much more one can learn from the graph than any summary value!
I would guard against use of peak values and use the area (Net MQUIPS) since there is very little chance that any particular application spends all of its time in that maximum performance region. We always use all the compilation tricks that we can find, figuring that any application would do the same. Perhaps I don't understand the connection you are making between compiler optimization and the peak values of the graph. Compiler optimization probably lifts the entire curve up, not just one part of the curve. Perhaps you are thinking that optimization lifts low parts of the curve up to the maximum? That's not what happens, since most of the variation is because of the different speeds of different memory regimes. Compilation with optimization improves performance on HINT, generally, as a step function. A massive effort in compilation tricks usually provides no gain beyond that of a modest effort. Some day I'd really like to see your seminar. Are there any materials on the Web or is there anything else you can share? If you are interested in collaborating with us to study performance analysis and prediction methods, we might be able to work something out. -John Gustafson ---------------------------------------------------------------------------- Date: Fri, 14 Mar 1997 14:40:57 -0600 To: hint From: gus@ameslab.gov (John Gustafson) Subject: New idea Hello HINT gang, I had an idea we might need if we are ever going to get the "Application Signatures" to work. Maybe we should try plotting the reciprocal of the QUIPS graph... seconds per quality improvement. The reason is that speeds add harmonically, not arithmetically. To apply the HINT graph to an application signature means _dividing_ by the HINT graph, not multiplying by it. Except we can't do that, because it isn't defined for small times. If you do it right, the reciprocal graph is defined everywhere... it's not just 1/QUIPS(t). 
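A small worked sketch of why speeds combine harmonically, assuming an application signature can be approximated as the fraction of work falling in each memory regime. The function name and the numbers are hypothetical and are not HINT measurements. Time is work divided by speed, so the times add and the combined speed is total work over total time.

```python
def effective_speed(weights, speeds):
    """Combine per-regime speeds using work fractions as weights.

    weights: fraction of the work done in each regime (hypothetical signature)
    speeds:  speed in each regime, e.g. MQUIPS read off a HINT curve
    """
    total_work = sum(weights)
    # Dividing each share of work by its speed gives the time spent there.
    total_time = sum(w / s for w, s in zip(weights, speeds))
    return total_work / total_time


if __name__ == "__main__":
    # Half the work in cache at 10 MQUIPS, half in main memory at 2 MQUIPS.
    harmonic = effective_speed([0.5, 0.5], [10.0, 2.0])
    arithmetic = 0.5 * 10.0 + 0.5 * 2.0
    print(harmonic, arithmetic)
```

With these made-up numbers the harmonic result is about 3.3 while the arithmetic average is 6.0, so multiplying the signature by the speed curve would overstate the performance considerably.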
I know everyone's busy with other projects right now, but when we come back to the work promised in the WAS, we'll have to do something about this. -John ---------------------------------------------------------------------------- Date: Fri, 14 Mar 1997 17:47:16 -0500 From: Narayan Venkatsubramanian Subject: HINT benchmark To: gus@ameslab.gov Dear Dr. Gustafson, I work for the High Performance Computing group at SUN Microsystems Inc. I am implementing a shared memory version of the HINT benchmark for the SUN SMP machines. I was using the various compilation options that are provided as part of the SUN CC compiler to get the best performance. I remember that there are certain restrictions on the kind of optimizations that can be done as part of the benchmark. One of the optimizations that I use is the '-fns' option for the SUN CC compiler which turns on the nonstandard floating-point mode. This deviates from the strict conformance to the IEEE 754 standard. The question I have is whether the HINT benchmark requires the codes to be in strict conformance to the IEEE 754 standard. I will appreciate it if you can get back to me about this at your earliest convenience. Thanks -Narayan ---------------------------------------------------------------------------- Date: Fri, 14 Mar 1997 21:23:51 -0600 To: Narayan Venkatsubramanian From: gus@ameslab.gov (John Gustafson) Subject: Re: HINT benchmark Cc: hint Dear Narayan, You asked a question about HINT: >I work for the High Performance Computing group at SUN Microsystems Inc. I am >implementing a shared memory version of the HINT benchmark for the SUN SMP >machines. I was using the various compilation options that are provided as >part >of the SUN CC compiler to get the best performance. I remember that there are >certain restrictions on the kind of optimizations that can be done as part of >the benchmark. One of the optimizations that I use is the '-fns' option >for the >SUN CC compiler which turns on the nonstandard floating-point mode.
This >deviates from the strict conformance to the IEEE 754 standard. The question I >have is whether the HINT benchmark requires the codes to be in strict >conformance >to the IEEE 754 standard. I will appreciate it if you can get back to me about >this at your earliest convenience. You'll be glad to know that ANYTHING GOES. There are no restrictions on compilation options... where did you hear that? You simply have to get right answers. Optimization should never change the answer, only the speed, because there are no roundoff errors in HINT. I know that sounds impossible, but only whole numbers are used, even when they are represented using floating point storage. That's why we allow integer or floating point data types, whichever gives higher performance in Net QUIPS! I hope you will send us the results of your study. Others have worked on SUN SMP machines for HINT, and perhaps one of our graduate students, Rajat Todi, can share info that will help you. Todi and I (and a few others) get everything you send to "hint@ameslab.gov" so just use that for future correspondence. By the way, the UltraSPARC is one of the most impressive uniprocessors we've ever measured with HINT. We'd like to get more data on it; we benchmarked it briefly at a computing trade show where one was on display but were not allowed to keep the results file. -John Gustafson ---------------------------------------------------------------------------- To: hint@scl.ameslab.gov Subject: HINT on SX-4 Date: Wed, 26 Mar 1997 19:16:09 +0900 From: Takeo Fujimori Hello, This is Takeo Fujimori NEC Tokyo. I'll try to execute HINT benchmark on NEC SX-4. Could you tell me the benchmark which is suitable for vector-parallel supercomputer? These benchmarks are only written in C language? Are there in use F77 or F90 ?
---------------------------------------------------------------------------- Date: Wed, 26 Mar 1997 10:04:03 -0600 To: Takeo Fujimori From: gus@ameslab.gov (John Gustafson) Subject: Re: HINT on SX-4 Cc: hint Hello Mr. Fujimori, I am delighted by your interest in HINT. There are few computers higher on our "wish list" than the NEC SX-4, but we did not know how to get access to one to test. >Hello, This is Takeo Fujimori NEC Tokyo. >I'll try to execute HINT benchmark on NEC SX-4. > >Could you tell me the benchmark which is suitable for >vector-parallel supercomputer? My research assistant, Rajat Todi, has just completed a version for vector-parallel supercomputers, and has been testing it on the Cray C90. Vectorization is very important for HINT; it must be done by code alteration instead of automatic compiler methods, it appears. We saw a 7X increase in speed on the Cray after vectorizing the code manually. I'm sure Rajat Todi will send you an e-mail in a few hours explaining how you can get access to his vector-parallel version. I'm not sure it has been put on the Web page at this time. Are you accessing the Web page, http://www.scl.ameslab.gov/HINT ? It has various versions of the algorithm to use as a starting point, and we continue to add versions. >These benchmarks are only written in C language? >Are there in use F77 or F90 ? I believe we support a Fortran 77 version. We have not seen a performance difference, however, as is the case for other kinds of applications. The performance also benefits from elementary optimization, but little further gains are to be found from intensive optimization, suggesting that HINT is not a test of compiler cleverness but of hardware. And I expect the SX-4 to do extremely well. It will probably be the fastest computer on a per-processor basis on our list, and of course we will publicize that through our Web database if you can provide us full documentation. 
Thanks, John Gustafson ---------------------------------------------------------------------------- Date: Thu, 3 Apr 1997 16:59:03 -0600 To: Van Wilkinson From: gus@ameslab.gov (John Gustafson) Subject: Re: SLALOM Cc: hint This was a real bolt from the blue: >Have any schools (school districts) used this as a buying criterion? No, I don't think so. And I hope they never do, because it was never intended to measure general computing capability for things like text editing and graphics. It was meant to measure capability for scientific and engineering computing, simulations, computational science, that kind of thing. We stopped work on SLALOM around 1993, having discovered something much better: HINT. Check out our Web page, http://www.scl.ameslab.gov/HINT to see what happened to our ideas about broad-spectrum benchmarking. HINT would work for K-12 school districts; it does a great job of measuring Macs versus Wintel computers, for example. I can readily defend HINT being used for this purpose, as part (and only PART) of a procurement criterion. We have quite a few personal computer systems in our Web database, and can easily add more. I'd like to know more about what motivated your question, and what your situation is. I wish school districts would take into account the cost of people to maintain the systems, especially the networking aspect. If you buy a bunch of Wintel systems then you almost have to hire a full-time expert to configure and maintain the network, and a lot of the apparent price advantage disappears... That's the kind of thing no benchmark can measure. Thanks for your interest in our performance measurement efforts. John Gustafson ---------------------------------------------------------------------------- Date: Mon, 12 May 1997 09:10:51 -0600 To: hint From: gus@ameslab.gov (John Gustafson) Subject: Power Mac results I presume everyone in the HINT mail list saw the latest results from Carsten Meyer.
I just want to make sure you all noticed the bottom line... something like 11 MQUIPS net for integer, 7 MQUIPS for double. This is about twice what Macs were the last time I looked. Todi, please enter this stuff on our database... is this faster than the Pentium Pro, or have the Pentium Pro numbers been in this range for some time? If it really is faster, we should let Guy Kawasaki know. -John ----------------------------------------------------------------------------