A minimal amount of performance assessment has been performed on Armpit Scheme using test code from the Gambit-C Scheme Benchmarks (available here) discussed (among others) by Pierard and Feeley (2007) in the context of Mobit (a Portable and Mobile Scheme Interpreter). Three functions were used in the present assessment to estimate Armpit Scheme's performance in relation to the Gambit-C Scheme compiler and to GNU Guile: tak.scm, ctak.scm and mazefun.scm. The first test computes the Takeuchi function recursively: (tak 18 12 6), the second test computes the same function in continuation-passing style: (ctak 18 12 6), and the third test generates a 11x11 cell maze in a purely functional way: (make-maze 11 11). The test results included with the benchmark source from the University of Montreal were obtained on a 2GHz Intel Core MacBook Pro. Tests with GNU Guile version 1.7.1 were performed on a 500MHz Sparc IIe SUN Blade 100 (development machine). Armpit Scheme tests were performed on 60MHz Tiny2106, LPC-H2214 and LPC-H2888 (arm7tdmi), 167MHz CS-E9302 (arm920t) and 50MHz LM3S1968 (cortex-m3). The LPC-H2888 and CS-E9302 had their instruction and data caches enabled. To obtain a common base for comparison, results obtained for multiple benchmark iterations and at speeds other than 60 MHz were converted to the time, t1/60, that it would take to perform 1 iteration of the benchmark if the CPU clock was 60MHz and the system performance scaled linearly: (time-for-n-iterations / n) * (CPU-clock-speed / 60MHz). Typically, times were measured only to the nearest second on Guile and Armpit Scheme.
For (tak 18 12 6) the t1/60 are 0.025s for Gambit-C, 5s for Guile and 4s, 5s, 6s, 6s and 10s for Armpit Scheme on the LM3S1968, Tiny2106, LPC-H2888, CS-EP9302 and LPC-H2214, respectively. For this test, the interpreters (Armpit Scheme and Guile) are 200 times slower than the compiled code of Gambit-C. Armpit Scheme's performance is similar to that of Guile. The cortex-m3 MCU is slightly faster than arm7tdmi and arm920t. ARM systems where the heap is stored in external RAM are two times slower than when on-chip RAM is used but the data cache (where present) nearly eliminates memory access penalties (LPC-H2888 and CS-EP9302).
An ArmSchembly implementation (see Program Examples) of the Takeuchi function was developed to assess whether a "hand-compiled" version of the code could rival Gambit-C's performance. The function code and 200-iteration test loop are listed below. The test was run on the EP-9302 and the t1/60 was 0.11s which is 50 times faster than the interpreted version above. This is also 1/4 of the speed of Gambit-C, or half the speed of Bigloo and MzScheme reported in the Gambit-C benchmarks. The slower performance of Armpit Scheme in this case is likely due to less aggressive pipelining and branch prediction on the ARM core relative to the Intel Core (the ARM system would however probably win in performance per Watt).
For (ctak 18 12 6) the t1/60 are 0.6s for Gambit-C, 36s for Guile and 8s, 9s, 9s, 10s and 18s for Armpit Scheme on the LM3S1968, Tiny2106, LPC-H2888, CS-EP9302 and LPC-H2214, respectively. In this case, the interpreted Armpit Scheme is faster than Guile and approximately 15 times slower than compiled Gambit-C. This performance ratio is better than for the non-continuation-based tak because Armpit Scheme continuations are short tagged lists that point into the data stack (itself a list) which is a simpler implementation than in most other Schemes (see callcc: in the source code). Common stack operations are slower but continuations are most efficiently obtained and used. As a result, ctak is just half as fast as tak in Armpit Scheme whereas it is 10 to 20 times slower than tak in several other implementations.
For (make-maze 11 11) the t1/60 are 0.05s for Gambit-C, 6s for Guile and 15s, 21s, 11s, 11s and 22s for Armpit Scheme on the LM3S1968, Tiny2106, LPC-H2888, CS-EP9302 and LPC-H2214, respectively. In this case, Armpit Scheme is 2 to 4 times slower than Guile and approximately 200 times slower than compiled Gambit-C (as was the case for tak). Results also indicate that large external memory and the presence of a data cache provide substantial performance gains in this test example, and the cortex-m3 core performs better than arm7tdmi for an equal amount of on-chip memory (64kB on the LM3S1968 and Tiny2106).
; define and install atak (Takeuchi function in ArmSchembly) (define atak (install (assemble 'var 3 ; sv1 <- x, sv2 <- y, sv3 <- z '(takin ; [internal entry] (cmp sv2 sv1) ; done? (pl set! sv1 sv3) ; if so, sv1 <- z, result (pl set! pc cnt) ; if so, return (save cnt) ; dts <- (cnt ...) (save sv3) ; dts <- (z cnt ...) (save sv2) ; dts <- (y z cnt ...) (save sv1) ; dts <- (x y z cnt ...) (sub sv1 sv1 4) ; sv1 <- (- x 1) (call takin) ; sv1 <- xnew = (tak sv1 sv2 sv3) (snoc! sv3 sv4 dts) ; sv3 <- x, sv4 <- (y z cnt ...) (snoc! sv4 sv5 sv4) ; sv4 <- y, sv5 <- (z cnt ...) (car sv2 sv5) ; sv2 <- z (save sv1) ; dts <- (xnew x y z cnt ...) (sub sv1 sv4 4) ; sv1 <- (- y 1) (call takin) ; sv1 <- ynew = (tak sv1 sv2 sv3) (cdr sv4 dts) ; sv4 <- (x y z cnt ...) (snoc! sv2 sv4 sv4) ; sv2 <- x, sv4 <- (y z cnt ...) (snoc! sv3 sv4 sv4) ; sv3 <- y, sv4 <- (z cnt ...) (car sv4 sv4) ; sv4 <- z (save sv1) ; dts <- (ynew xnew x y z cnt ...) (sub sv1 sv4 4) ; sv1 <- (- z 1) (call takin) ; sv1 <- znew = (tak sv1 sv2 sv3) (set! sv3 sv1) ; sv3 <- znew (restore sv2) ; sv2 <- ynew, dts <- (xnew x y z cnt ...) (restore sv1) ; sv1 <- xnew, dts <- (x y z cnt ...) (cddr dts dts) ; dts <- (z cnt ...) (cdr dts dts) ; dts <- (cnt ...) (restore cnt) ; cnt <- cnt, dts <- (...) (b takin))))) ; jump to compute (tak sv1 sv2 sv3) ; check (atak 18 12 6) ; run 200 iterations (let loop ((n 200)) (if (= n 0) #t (begin (atak 18 12 6) (loop (- n 1)))))