tclThreadAlloc vs. CPU-cache (fragmentation/granularity/etc)...

Issue: the longer Tcl runs, the slower it gets.

The simplest example illustrating this:

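  # calibrate the timerate measurement overhead using incr
  proc test {} {timerate -calibrate {incr i} 10000 10000000}; test
  # fill a proc-local array (set), then read the same keys back (get); args: time limit in ms, max iteration count
  proc test args {puts "set: [timerate { set a([incr i]) _ } {*}$args]"; lset args 1 $i; set i 0; puts "get: [timerate { set a([incr i]) } {*}$args]" }
  time { test 10000 1000000 } 50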

* 1x-threaded:
    set: 0.153809 µs/# 1000000 # 6501570 #/sec 153.809 net-ms
    get: 0.049532 µs/# 1000000 # 20188968 #/sec 49.532 net-ms
    ... 100 times ...
    set: 0.598031 µs/# 1000000 # 1672154 #/sec 598.031 net-ms
    get: 0.331693 µs/# 1000000 # 3014836 #/sec 331.693 net-ms
* 16x-threaded:
    set * 16x: 0.573478 µs/# 4000000 # 291248 #/sec 2293.913 net-ms
    get * 16x: 0.077514 µs/# 4000000 # 14923264 #/sec 310.059 net-ms
    ... 20 times ...
    set * 16x: 3.539208 µs/# 4000000 # 304646 #/sec 14156.834 net-ms
    get * 16x: 0.519618 µs/# 4000000 # 2048606 #/sec 2078.474 net-ms

I fixed this "wrong" behaviour in my own threaded-alloc module, which prefers moving whole free pages rather than single objects:

* 1x-threaded:
    set : 0.140644 µs/# 250000 # 7110150 #/sec 35.161 net-ms
    get : 0.018572 µs/# 250000 # 53844497 #/sec 4.643 net-ms
    ... 100 times ...
    set : 0.083600 µs/# 250000 # 11961722 #/sec 20.900 net-ms
    get : 0.017704 µs/# 250000 # 56484410 #/sec 4.426 net-ms
* 16x-threaded:
    set * 16x: 0.810806 µs/# 4000000 # 129652 #/sec 3243.227 net-ms
    get * 16x: 0.556715 µs/# 4000000 # 246766 #/sec 2226.860 net-ms
    ... 20 times ...
    set * 16x: 0.665778 µs/# 4000000 # 1671624 #/sec 2663.113 net-ms
    get * 16x: 0.343585 µs/# 4000000 # 3428775 #/sec 1374.340 net-ms
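
To make the "whole free pages rather than single objects" idea concrete, here is a toy, single-threaded sketch in C. It is not the code of tclThreadAlloc or of the module measured above; every name in it (`Page`, `Pool`, `Cache`, `CacheAlloc`, `CacheFree`, `SLOTS_PER_PAGE`) is hypothetical, and locking and cross-thread frees are left out. The only point it illustrates is that objects are handed out consecutively from one page, and memory goes back to the shared pool only as a complete page, so neighbouring objects stay neighbours in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SLOTS_PER_PAGE 64            /* hypothetical page granularity */

typedef struct Slot {
    struct Page *page;               /* back-pointer to the owning page */
    long payload;                    /* stand-in for the object body */
} Slot;

typedef struct Page {
    struct Page *next;               /* link in the shared pool's list */
    size_t used;                     /* live objects on this page */
    size_t nextSlot;                 /* bump index of the next unused slot */
    Slot slots[SLOTS_PER_PAGE];
} Page;

typedef struct Pool {                /* shared by all threads; a real one needs a mutex */
    Page *freePages;
    size_t count;                    /* number of completely free pages held */
} Pool;

typedef struct Cache {               /* one per thread */
    Page *current;                   /* page that allocations are served from */
    Pool *shared;
} Cache;

/* Take a completely free page from the pool, or get a new one from malloc. */
static Page *PoolGetPage(Pool *pool)
{
    Page *page = pool->freePages;
    if (page != NULL) {
        pool->freePages = page->next;
        pool->count--;
    } else {
        page = malloc(sizeof(Page));
        if (page == NULL) return NULL;
    }
    page->next = NULL;
    page->used = 0;
    page->nextSlot = 0;
    return page;
}

/* Return a completely free page to the pool as one unit. */
static void PoolPutPage(Pool *pool, Page *page)
{
    page->next = pool->freePages;
    pool->freePages = page;
    pool->count++;
}

/* Allocate one object: consecutive calls return adjacent slots of one page. */
long *CacheAlloc(Cache *cache)
{
    Page *page = cache->current;
    if (page == NULL || page->nextSlot == SLOTS_PER_PAGE) {
        page = PoolGetPage(cache->shared);
        if (page == NULL) return NULL;
        cache->current = page;
    }
    Slot *slot = &page->slots[page->nextSlot++];
    slot->page = page;
    page->used++;
    return &slot->payload;
}

/* Free one object: nothing moves until the whole page is free. */
void CacheFree(Cache *cache, long *obj)
{
    Slot *slot = (Slot *)((char *)obj - offsetof(Slot, payload));
    Page *page = slot->page;
    if (--page->used > 0) return;    /* page still holds live objects */
    if (page == cache->current) {
        page->nextSlot = 0;          /* reuse the current page in place */
    } else {
        PoolPutPage(cache->shared, page);  /* hand over the whole page */
    }
}
```

In this simplified model a retired page's slots are not reused until the page is entirely free, which trades some memory for locality; a production allocator would also need atomic counters or locks for frees coming from other threads.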