A Scheme Interpreter for ARM Microcontrollers:
Performance of Version 00.0160

Executive Summary:

Armpit Scheme, a minimalist interpreter, performs at a decent clip, similar to Guile, a full-fledged interpreter.

Performance Details:

A minimal amount of performance assessment has been performed on Armpit Scheme using test code from the Gambit-C Scheme Benchmarks (available here) discussed (among others) by Pierard and Feeley (2007) in the context of Mobit (a Portable and Mobile Scheme Interpreter).

Three functions were used in the present assessment to estimate Armpit Scheme's performance in relation to the Gambit-C Scheme compiler and to GNU Guile: tak.scm, ctak.scm and mazefun.scm. The first test computes the Takeuchi function recursively: (tak 18 12 6), the second test computes the same function in continuation-passing style: (ctak 18 12 6), and the third test generates a 11x11 cell maze in a purely functional way: (make-maze 11 11). On Armpit Scheme, all tests were started with a clear heap (system reset).

The test results included with the benchmark source from the University of Montreal were obtained on a 2GHz Intel Core MacBook Pro. Tests with GNU Guile version 1.7.1 were performed on a 500MHz Sparc IIe SUN Blade 100 (development machine). Armpit Scheme tests were performed on 60MHz Tiny2106, LPC-H2214 and LPC-H2888 (arm7tdmi), 166MHz CS-E9302 (arm920t), 202MHz TCT-Hammer (arm920t) and 50MHz LM3S1968 (cortex-m3). The LPC-H2888, CS-E9302 and S3C2410 (TCT-Hammer) had their instruction and data caches enabled (performed in the Armpit Scheme initialization code).

To obtain a common base for comparison, results obtained for multiple benchmark iterations and at speeds other than 60 MHz were converted to the time, t1/60, that it would take to perform 1 iteration of the benchmark if the CPU clock was 60MHz and the system performance scaled linearly:

     t1/60 = (time-for-n-iterations / n) * (CPU-clock-speed / 60MHz).

Times were typically measured only to the nearest second on Guile and Armpit Scheme. The results (t1/60) were as follows:

    benchmark  Gambit-C  Guile LM3S1968 Tiny2106 LPC-H2888 CS-EP9302 TCT-Hammer LPC-H2214
    ---------  --------  ----- -------- -------- --------- --------- ---------- ---------
       tak       0.025s    5s     3s        4s        5s        5s        4s        8s
      atak        --       --     --     0.06s     0.19s     0.11s     0.08s     0.27s
      ctak       0.60s    36s     6s        7s        8s        8s        7s       12s
    mazefun      0.05s     6s    10s       12s        7s        6s        6s       10s

For (tak 18 12 6) Armpit Scheme's performance is similar to that of Guile and these interpreters (Armpit Scheme and Guile) are approximately 200 times slower than the compiled code of Gambit-C (the range is from 120 to 320 times slower). The cortex-m3 MCU is faster than the arm7tdmi and arm920t by approximately 25%. The ARM system where the heap is stored in external RAM without cache (LPC-H2214) is half as fast as when on-chip RAM is used but systems with external RAM and a data cache (LPC-H2888, CS-EP9302, TCT-Hammer) perform nearly as well as those using internal RAM. The t1/60 are also found to be 1 to 2 seconds shorter than in snapshot 00.0152 representing a speed improvement of the order of 25%.

An ARMSchembly implementation (see Program Examples) of the Takeuchi function, atak, was developed to assess whether a "hand-compiled" version of the code could rival Gambit-C's performance. The function atak was ARMSchembled into an installable code vector (vtak) on the TCT-Hammer and the code vector was then installed on the ARMv4T MCUs to run the benchmark (after a reset on each). The atak ARMSchembly code, install code and 200-iteration test loop are listed below. Only the EP-9302 was tested in snapshot 00.0152 and its performance in 00.0160 remains the same. The performance of other MCUs varies from nearly twice as fast as the CS-EP9302 (Tiny2106) to half as fast (LPC-H2214). The improvement over interpreted code ranges from 26 to 67 times faster. These computations of atak are performed 2.4 to 7.7 times slower than tak on Gambit-C (or approximately 1/2 to 1/8 the speed of Gambit-C). The speed of the fastest computations (Tiny2106) is comparable to that of Bigloo and MzScheme reported in the Gambit-C benchmarks. The slower performance of Armpit Scheme relative to Gambit-C in this case is likely due in part to less aggressive pipelining and branch prediction along with in-order execution on the ARM core relative to the Intel Core. Additionally, the use of a list structure for the Scheme stack (rather than a faster system-type stack) makes Armpit Scheme generally slower than the reference implementations.

For (ctak 18 12 6) the interpreted Armpit Scheme is 3 to 6 times faster than Guile and 10 to 20 times slower than compiled Gambit-C. This performance ratio is better than for the non-continuation-based tak because Armpit Scheme continuations are short tagged lists that point into the Scheme stack (itself a list) which is a simpler implementation than in most other Schemes (see callcc: in the source code). Common stack operations are slower as noted above but continuations are most efficiently obtained and used. As a result, ctak is just half as fast as tak in Armpit Scheme whereas it is 10 to 20 times slower than tak in several other implementations. The t1/60 for this test are 1 to 6 seconds shorter than in snapshot 00.0152 representing a speed improvement of 10% to 33%.

For (make-maze 11 11) the performance of Armpit Scheme ranges from equal to Guile to half as fast as Guile, which is also 120 to 240 times slower than compiled Gambit-C (similar to the case for tak). Results indicate that a large external memory combined with a data cache provide substantial performance gains in this test example, and the cortex-m3 core performs roughly 20% better than arm7tdmi for an equal amount of on-chip memory (64kB on the LM3S1968 and Tiny2106). The performance improvements over snapshot 00.0152 range from 4 to 12 seconds for a speedup of 50% to 120%.



; define vtak (install-vector for Takeuchi function in ArmSchembly)
(define vtak
   (assemble
    'var 3                ; sv1 <- x, sv2 <- y, sv3 <- z
    '(takin               ; [internal entry]
      (cmp sv2 sv1)       ; done?
      (pl set! sv1 sv3)   ;    if so,  sv1 <- z, result
      (pl set! pc  cnt)   ;    if so,  return
      (save cnt)          ; dts <- (cnt ...)
      (save sv3)          ; dts <- (z cnt ...)
      (save sv2)          ; dts <- (y z cnt ...)
      (save sv1)          ; dts <- (x y z cnt ...)
      (sub sv1 sv1 4)     ; sv1 <- (- x 1)
      (call takin)        ; sv1 <- xnew = (tak sv1 sv2 sv3)
      (snoc! sv3 sv4 dts) ; sv3 <- x,    sv4 <- (y z cnt ...)
      (snoc! sv4 sv5 sv4) ; sv4 <- y,    sv5 <- (z cnt ...)
      (car sv2 sv5)       ; sv2 <- z
      (save sv1)          ; dts <- (xnew x y z cnt ...)
      (sub sv1 sv4 4)     ; sv1 <- (- y 1)
      (call takin)        ; sv1 <- ynew = (tak sv1 sv2 sv3)
      (cdr sv4 dts)       ; sv4 <- (x y z cnt ...)
      (snoc! sv2 sv4 sv4) ; sv2 <- x,    sv4 <- (y z cnt ...)
      (snoc! sv3 sv4 sv4) ; sv3 <- y,    sv4 <- (z cnt ...)
      (car sv4 sv4)       ; sv4 <- z
      (save sv1)          ; dts <- (ynew xnew x y z cnt ...)
      (sub sv1 sv4 4)     ; sv1 <- (- z 1)
      (call takin)        ; sv1 <- znew = (tak sv1 sv2 sv3)
      (set! sv3 sv1)      ; sv3 <- znew
      (restore sv2)       ; sv2 <- ynew,  dts <- (xnew x y z cnt ...)
      (restore sv1)       ; sv1 <- xnew,  dts <- (x y z cnt ...)
      (cddr dts dts)      ; dts <- (z cnt ...)
      (cdr dts dts)       ; dts <- (cnt ...)
      (restore cnt)       ; cnt <- cnt,   dts <- (...)
      (b takin))))        ; jump to compute (tak sv1 sv2 sv3)

; display vtak
vtak ; -> #(39     0  1003     0     4 57685 16390 20896 61441 20896    34 60160  4096 58762
     ;      32 60160 24576 58762    30 60160 20480 58762    28 60160 16384 58762 16388 57924
     ;    4111 57760 65521 60159   192 59546   384 59543 20480 58776    20 60160 16384 58762
     ;   16388 57927  4111 57760 65513 60159 28676 58778   160 59543   192 59543 28672 58775
     ;      11 60160 16384 58762 16388 57927  4111 57760 65504 60159 24580 57760  1056 59546
     ;    1040 59546 40964 58778 40964 58778 40964 58778  1026 59546 65496 60159 61444 58655
     ;    2672     0)

; install the contents of vtak as atak, on a clear heap (post-reset)
(define atak
  (install
    '#(39     0  1003     0     4 57685 16390 20896 61441 20896    34 60160  4096 58762
       32 60160 24576 58762    30 60160 20480 58762    28 60160 16384 58762 16388 57924
     4111 57760 65521 60159   192 59546   384 59543 20480 58776    20 60160 16384 58762
    16388 57927  4111 57760 65513 60159 28676 58778   160 59543   192 59543 28672 58775
       11 60160 16384 58762 16388 57927  4111 57760 65504 60159 24580 57760  1056 59546
     1040 59546 40964 58778 40964 58778 40964 58778  1026 59546 65496 60159 61444 58655
     2672     0 )))

; check
(atak 18 12 6) ; -> 7

; run 200 iterations
(let loop ((n 200))
  (if (= n 0) #t
      (begin
	(atak 18 12 6)
	(loop (- n 1)))))

Last updated February 6, 2009

bioe-hubert-at-sourceforge.net

A Scheme Interpreter for ARM Microcontrollers:Performance of Version 00.0160

Executive Summary:

Performance Details:

Last updated February 6, 2009

A Scheme Interpreter for ARM Microcontrollers:
Performance of Version 00.0160