A Scheme Interpreter for ARM Microcontrollers:
Performance of Version 080

Performance Details:

A limited performance assessment of Armpit Scheme was carried out using test code from the Gambit-C Scheme Benchmarks (available here), which are discussed (among others) by Pierard and Feeley (2007) in the context of Mobit (a Portable and Mobile Scheme Interpreter).

Results reported for Armpit Scheme, below, were obtained during the development of version 080 (in May 2017) but are expected to be representative of the performance of the released code. They may be updated with more recent results when time allows.

Three test programs were used to estimate Armpit Scheme's performance relative to other (reference) interpreters and compilers: tak.scm, ctak.scm and mazefun.scm. The first test computes the Takeuchi function recursively: (tak 18 12 6). The second computes the same function in continuation-passing style: (ctak 18 12 6). The third generates an 11x11 cell maze in a purely functional way: (make-maze 11 11). On Armpit Scheme, all tests were started with a clear heap (system reset). The specific form of tak used is shown below (it is named gtak):


  (define gtak
    (lambda (x y z)
      (if (>= y x)
          z
          (gtak (gtak (- x 1) y z)
                (gtak (- y 1) z x)
                (gtak (- z 1) x y)))))
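
The source of ctak is not listed on this page. For orientation, the sketch below shows the form ctak usually takes in the classic Scheme benchmark suites, where every call captures its continuation with call-with-current-continuation. It is an assumption, not something confirmed here, that the benchmark file used for these measurements matches it exactly.

```scheme
;; Sketch of the classic ctak benchmark (assumed form, see text above).
;; Each call captures a continuation k and delivers its result through it.
(define (ctak x y z)
  (call-with-current-continuation
   (lambda (k) (ctak-aux k x y z))))

(define (ctak-aux k x y z)
  (if (not (< y x))
      (k z)                               ; base case: pass z to the continuation
      (call-with-current-continuation
       (lambda (k)
         (ctak-aux
          k
          (call-with-current-continuation
           (lambda (k) (ctak-aux k (- x 1) y z)))
          (call-with-current-continuation
           (lambda (k) (ctak-aux k (- y 1) z x)))
          (call-with-current-continuation
           (lambda (k) (ctak-aux k (- z 1) x y))))))))
```

(ctak 18 12 6) returns the same value as (gtak 18 12 6), namely 7, but creates a continuation at every call, which is why the ctak/gtak ratio is a useful indicator of the cost of continuation capture.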

Reference results were obtained on an Intel Core i3 (2.3 GHz, 2012) computer. To obtain a common basis for comparison, results obtained for multiple benchmark iterations and at clock speeds other than 60 MHz were converted to the time, t1/60, that it would take to perform 1 iteration of the benchmark if the CPU clock were 60 MHz and system performance scaled linearly with clock speed:

     t1/60 = (time-for-n-iterations / n) * (CPU-clock-speed / 60MHz).
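
As a minimal sketch, the conversion above can be written as a one-line Scheme procedure. The procedure and parameter names are illustrative, and the sample measurement is hypothetical rather than taken from the tables.

```scheme
;; t1/60: seconds for one iteration, rescaled to a 60 MHz clock,
;; assuming run time scales linearly with clock frequency.
(define (t1/60 total-seconds iterations clock-mhz)
  (* (/ total-seconds iterations)
     (/ clock-mhz 60)))

;; Hypothetical measurement: 100 iterations taking 17 s in total
;; on a part clocked at 300 MHz.
(t1/60 17 100 300)   ; 17/20 with exact rationals, i.e. 0.85 s
```

A faster clock yields a larger multiplier, so a board that needs the same wall-clock time at a higher frequency is reported as slower on this clock-for-clock scale.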

The reference Scheme implementations used and the resulting t1/60 were as follows (times in seconds; the last two columns are dimensionless ratios):

                               -----  -----  -------  ---------  ---------
                               gtak   maze    ctak    maze/gtak  ctak/gtak
                               t1/60  t1/60   t1/60    (ratio)    (ratio)
                               -----  -----  -------  ---------  ---------
    Chez Scheme (compiled)     0.013  0.023    0.077     1.8          5.9
    gsc 4.6.6   (compiled)     0.019  0.031    0.34      1.6         18.
    petite Chez Scheme 9.4.1   0.13   0.27     0.32      2.1          2.5
    chibi                      0.14   0.73   211.        5.2       1500.
    Guile 2.0.5                0.15   0.27    18.        1.8        120.
    Scheme48 1.9               0.43   0.96     3.3       2.2          7.7
    gsc 4.6.6   (interactive)  0.88   1.3      2.6       1.5          3.0
    tinyScheme 1.41            9.6    N/A     15.0       N/A          1.6
                               -----  -----  -------  ---------  ---------

The Chez Scheme compiler (which is now FOSS) is the fastest of the tested implementations. Together with Gambit Scheme (compiled), it produces code that is, on gtak and maze, on the order of 10 times faster than the bytecode interpreters. Petite Chez Scheme is the fastest bytecode interpreter on the 3 test programs and has the lowest ctak/gtak ratio among the bytecode interpreters, which indicates balanced optimization that does not overly favor regular code over continuation-oriented code. Scheme48 and Gambit (interactive) are slower than chibi and Guile on gtak but much faster on ctak, which shows how optimizing for one coding style can lead to performance surprises with other types of code. TinyScheme is the only pure interpreter tested here. It is slower than the bytecode interpreters and compilers on gtak, but it has the lowest ctak/gtak ratio of all the implementations above (1.6), suggesting that it may slightly favor continuation-oriented code (on ctak, TinyScheme is faster than Guile and chibi).
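
The ratio comparison discussed above can be checked with a few lines of Scheme. This is only an illustrative sketch: the gtak and ctak values are copied from the reference table, while the list layout and procedure names are made up for the example.

```scheme
;; Rows are (name gtak-t1/60 ctak-t1/60), values copied from the table above.
(define reference-results
  '(("Chez Scheme (compiled)"    0.013  0.077)
    ("gsc 4.6.6 (compiled)"      0.019  0.34)
    ("petite Chez Scheme 9.4.1"  0.13   0.32)
    ("chibi"                     0.14   211.)
    ("Guile 2.0.5"               0.15   18.)
    ("Scheme48 1.9"              0.43   3.3)
    ("gsc 4.6.6 (interactive)"   0.88   2.6)
    ("tinyScheme 1.41"           9.6    15.0)))

;; ctak/gtak ratio of one row.
(define (ctak/gtak row)
  (/ (caddr row) (cadr row)))

;; Name of the row with the smallest ctak/gtak ratio.
(define (lowest-ctak/gtak rows)
  (let loop ((best (car rows)) (rest (cdr rows)))
    (cond ((null? rest) (car best))
          ((< (ctak/gtak (car rest)) (ctak/gtak best))
           (loop (car rest) (cdr rest)))
          (else (loop best (cdr rest))))))

(lowest-ctak/gtak reference-results)   ; => "tinyScheme 1.41"
```

With the tinyScheme row removed from the list, the same procedure returns "petite Chez Scheme 9.4.1".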

The t1/60 results obtained with Armpit Scheme 080 (during development) were (sorted by speed on the gtak test):

                                    gtak        maze        ctak
                                  (13MB/it)   (19MB/it)   (30MB/it)
                                 ----------- ----------- -----------   maze  ctak
  BOARD      CPU   MB  MHz #iter   #gc t1/60   #gc t1/60   #gc t1/60  /gtak /gtak
  ---------- --- ---- ---- ----- ----- ----- ----- ----- ----- -----  ----- -----
  SAME70      M7    2  300   100   538   .85   639   1.3  1837   2.1    1.5   2.5
  Parallela   A9  512  667   200    10   .86    13   1.2    22   1.9    1.4   2.2
  NanoPC-T3  A53  512  800   400    41   .90    58   1.3    94   2.0    1.4   2.2
  Duovero     A9  512  900   200    10   .90    13   1.4    22   2.2    1.5   2.4
  STM32F746   M7    8  216   100   145   .94   164   1.6   454   2.4    1.7   2.6
  ---------- --- ---- ---- ----- ----- ----- ----- ----- ----- -----  ----- -----
  Beagle-XM   A8  512 1000   200    10   1.3    13   1.9    22   2.8    1.5   2.2
  SAMA5D4     A5  512  600   100     5   1.3     7   1.9    12   2.9    1.5   2.2
  LPC4330_Xp M4F 0.12  204   100 11465   1.3 11198   2.4 16233   2.9    1.8   2.2
  BeagleBBk   A8  512  900   100     5   1.5     6   2.1    12   3.1    1.4   2.1
  BeagleBrd   A8  128  600   100    22   1.5    32   2.1    52   3.3    1.4   2.2
  TM4C1294   M4F 0.25  120    20  1978   1.9  3580   3.4  5579   4.1    1.8   2.2
  ---------- --- ---- ---- ----- ----- ----- ----- ----- ----- -----  ----- -----
  STM32F4_Di M4F    8  168               2.4         3.2         5.3    1.3   2.2
  LPC4357_Xp M4F   32  204    20     6   3.4     7   4.4    19   7.3    1.3   2.1
  ---------- --- ---- ---- ----- ----- ----- ----- ----- ----- -----  ----- -----

The first result of note is that the four fastest boards perform similarly to Gambit Scheme in interactive (interpreted) mode on a clock-for-clock basis. ARM core generations also visibly affect performance: the Cortex-M7 is faster than the Cortex-M4F, the low-power Cortex-A5 is similar to the Cortex-A8, and the Cortex-A9 is faster than both. The Cortex-A53, which is the "little" 64-bit core of ARM's big.LITTLE pairing concept and is more recent than the other Cortex-A cores considered here, is near the top in these tests. It will be interesting, in the future, to see how Armpit Scheme performs on a Cortex-A7x (i.e. "big") 64-bit core.

Performance is better than in version 060 for 2 of the 3 boards that carry over from that previous performance analysis (LPC4330_Xp and Beagle-XM). In particular, for the LPC4330_Xp, the largest improvement is in mazefun (t1/60 decreased from 3.4s to 2.4s), which results from the memory allocation nursery used on Cortex-M MCUs in this release. The performance of the third board, STM32F4_Di, is lower for gtak and ctak in 080 because the heap is now stored in SDRAM (larger, but slower) rather than in on-chip RAM. However, its t1/60 for mazefun did go down, from 4.3s in 060 to 3.2s in 080, thanks to the nursery.
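
Because t1/60 is a clock-for-clock figure, it does not directly show how long a benchmark iteration takes on a given board. A rough estimate can be recovered by inverting the normalization, as in the sketch below. The procedure name is illustrative, and the result inherits the same linear-scaling assumption.

```scheme
;; Inverse of the t1/60 normalization: estimated seconds per iteration
;; at the board's actual clock, assuming linear scaling with clock speed.
(define (seconds-per-iteration t1/60 clock-mhz)
  (* t1/60 (/ 60 clock-mhz)))

;; Table values: SAME70, gtak, t1/60 = .85 at 300 MHz.
(seconds-per-iteration .85 300)    ; about 0.17 s

;; Table values: Parallela, gtak, t1/60 = .86 at 667 MHz.
(seconds-per-iteration .86 667)    ; about 0.077 s
```

On this estimate, boards with similar t1/60 values can still differ by a factor of two or more in actual run time because of their different clock speeds.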

Last updated July 12, 2018

bioe-hubert-at-sourceforge.net
