A limited performance assessment of Armpit Scheme has been carried out using test code from the Gambit-C Scheme Benchmarks (available here), discussed (among others) by Pierard and Feeley (2007) in the context of Mobit (a Portable and Mobile Scheme Interpreter).
Results reported below for Armpit Scheme were obtained during the development of version 080 (in May 2017) but are expected to be representative of the performance of the released code. They may be updated with more recent results when time allows.
Three functions were used to estimate Armpit Scheme's performance relative to other (reference) interpreters and compilers: tak.scm, ctak.scm and mazefun.scm. The first test computes the Takeuchi function recursively: (tak 18 12 6); the second computes the same function in continuation-passing style: (ctak 18 12 6); and the third generates an 11x11 cell maze in a purely functional way: (make-maze 11 11). On Armpit Scheme, all tests were started with a clear heap (system reset). The specific form of tak used is shown below (it is named gtak), followed by a sketch of the ctak formulation:
(define gtak
  (lambda (x y z)
    (if (>= y x)
        z
        (gtak (gtak (- x 1) y z)
              (gtak (- y 1) z x)
              (gtak (- z 1) x y)))))
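The ctak test wraps the same recursion in continuations. Its essence, written here in the same style as gtak for reference, is sketched below; the exact benchmark source may differ in naming and minor details:

(define ctak
  (lambda (x y z)
    (call-with-current-continuation
      (lambda (k) (ctak-aux k x y z)))))

(define ctak-aux
  (lambda (k x y z)
    (if (>= y x)
        (k z)
        (ctak-aux k
                  (call-with-current-continuation
                    (lambda (k) (ctak-aux k (- x 1) y z)))
                  (call-with-current-continuation
                    (lambda (k) (ctak-aux k (- y 1) z x)))
                  (call-with-current-continuation
                    (lambda (k) (ctak-aux k (- z 1) x y)))))))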
Reference results were obtained on an Intel Core i3 (2.3 GHz, 2012) computer. To obtain a common basis for comparison, results obtained for multiple benchmark iterations and at clock speeds other than 60 MHz were converted to the time, t1/60, that it would take to perform one iteration of the benchmark if the CPU clock were 60 MHz and the system performance scaled linearly:
t1/60 = (time-for-n-iterations / n) * (CPU-clock-speed / 60 MHz)

The reference Scheme implementations used and the resulting t1/60 values were as follows (t1/60 values in seconds):
                           gtak    maze    ctak   maze/gtak  ctak/gtak
                           t1/60   t1/60   t1/60   (ratio)    (ratio)
------------------------   -----   -----   -----  ---------  ---------
Chez Scheme (compiled)     0.013   0.023   0.077     1.8        5.9
gsc 4.6.6 (compiled)       0.019   0.031   0.34      1.6       18.
petite Chez Scheme 9.4.1   0.13    0.27    0.32      2.1        2.5
chibi                      0.14    0.73    211.      5.2       1500.
Guile 2.0.5                0.15    0.27    18.       1.8       120.
Scheme48 1.9               0.43    0.96    3.3       2.2        7.7
gsc 4.6.6 (interactive)    0.88    1.3     2.6       1.5        3.0
tinyScheme 1.41            9.6     N/A     15.0      N/A        1.6
------------------------   -----   -----   -----  ---------  ---------
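For illustration, the t1/60 normalization can be written as a small Scheme procedure (the procedure and the sample numbers below are purely illustrative and not part of the benchmark code):

(define t1/60
  (lambda (time-for-n-iterations n cpu-clock-mhz)
    (* (/ time-for-n-iterations n)      ; seconds per iteration at the actual clock
       (/ cpu-clock-mhz 60))))          ; scale to a 60 MHz clock

;; example: 100 iterations measured in 12 seconds at 300 MHz
;; (t1/60 12.0 100 300) => 0.6 seconds per iteration at 60 MHz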
The Chez Scheme compiler (which is now FOSS) is the fastest of the tested implementations. Along with Gambit Scheme (gsc, compiled), it produces code that is roughly 10 times faster than the bytecode interpreters. Petite Chez Scheme is the fastest bytecode interpreter on the 3 test programs and has the lowest ctak/gtak ratio among the bytecode interpreters, indicating a balanced optimization that does not overly favor regular code over continuation-oriented code. Scheme48 and Gambit (interactive) are slower than chibi and Guile on gtak but much faster on ctak, which highlights the potential impact of implementation decisions: optimizing for one coding style can lead to surprising performance on other types of code. TinyScheme is the only pure interpreter tested here; while it is slower than the bytecode interpreters and compilers, it has the lowest ctak/gtak ratio of all the implementations above, suggesting that it may slightly favor continuation-oriented code (on ctak, TinyScheme is faster than Guile and chibi).
The t1/60 results obtained with Armpit Scheme 080 (during development) were as follows (sorted by speed on the gtak test):
                                        gtak (13MB/it)   maze (19MB/it)   ctak (30MB/it)
                                        --------------   --------------   --------------   maze   ctak
BOARD       CPU   RAM(MB)  MHz  #iter     #gc    t1/60     #gc    t1/60     #gc    t1/60   /gtak  /gtak
----------  ---   -------  ----  -----   -----   -----    -----   -----    -----   -----   -----  -----
SAME70      M7       2      300    100     538    .85       639    1.3      1837    2.1     1.5    2.5
Parallela   A9     512      667    200      10    .86        13    1.2        22    1.9     1.4    2.2
NanoPC-T3   A53    512      800    400      41    .90        58    1.3        94    2.0     1.4    2.2
Duovero     A9     512      900    200      10    .90        13    1.4        22    2.2     1.5    2.4
STM32F746   M7       8      216    100     145    .94       164    1.6       454    2.4     1.7    2.6
----------  ---   -------  ----  -----   -----   -----    -----   -----    -----   -----   -----  -----
Beagle-XM   A8     512     1000    200      10    1.3        13    1.9        22    2.8     1.5    2.2
SAMA5D4     A5     512      600    100       5    1.3         7    1.9        12    2.9     1.5    2.2
LPC4330_Xp  M4F      0.12   204    100   11465    1.3     11198    2.4     16233    2.9     1.8    2.2
BeagleBBk   A8     512      900    100       5    1.5         6    2.1        12    3.1     1.4    2.1
BeagleBrd   A8     128      600    100      22    1.5        32    2.1        52    3.3     1.4    2.2
TM4C1294    M4F      0.25   120     20    1978    1.9      3580    3.4      5579    4.1     1.8    2.2
----------  ---   -------  ----  -----   -----   -----    -----   -----    -----   -----   -----  -----
STM32F4_Di  M4F      8      168                   2.4              3.2              5.3     1.3    2.2
LPC4357_Xp  M4F     32      204     20       6    3.4         7    4.4        19    7.3     1.3    2.1
----------  ---   -------  ----  -----   -----   -----    -----   -----    -----   -----   -----  -----
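The t1/60 values above were derived from the total time for #iter repetitions of each test. A minimal driver for such a measurement might look like the sketch below, where run-time is a hypothetical primitive returning elapsed seconds (Armpit Scheme's actual timing facility may differ):

(define bench
  (lambda (thunk n)
    (let ((t0 (run-time)))              ; run-time: hypothetical elapsed-time primitive (seconds)
      (let loop ((i 0))
        (if (< i n)
            (begin (thunk) (loop (+ i 1)))
            (/ (- (run-time) t0) n))))))  ; average seconds per iteration

;; example: (bench (lambda () (gtak 18 12 6)) 100)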
The first result of note is that the four fastest boards show performance similar to Gambit Scheme (interactive) on a clock-for-clock basis. The effect of ARM core generations is also apparent: the Cortex-M7 is visibly faster than the Cortex-M4F, the low-power Cortex-A5 is comparable to the Cortex-A8, and the Cortex-A9 is faster than both. The "little" 64-bit Cortex-A53 (of ARM's big.LITTLE pairing concept), which is more recent than the other Cortex-A cores considered here, is near the top of the performance range for these tests. It will be interesting, in the future, to see how Armpit Scheme performs on a "big" 64-bit Cortex-A7x core. Performance is better than in version 060 for 2 of the 3 boards that carry over from that previous performance analysis (LPC4330_Xp and Beagle-XM). In particular, for the LPC4330_Xp, the largest performance increase is in mazefun (t1/60 decreased from 3.4 s to 2.4 s), a result of the memory-allocation nursery used on Cortex-M MCUs in this release. The performance of the third board, STM32F4_Di, is lower for gtak and ctak in 080 because the heap is now stored in SDRAM (larger, but slower) rather than in on-chip RAM. However, its t1/60 for mazefun did go down, from 4.3 s in 060 to 3.2 s in 080, thanks to the nursery.