This is one of those things that people in the know already know, so it is not a new topic at all, but here is an explanation, with a simple working example, of the following tweet.
Tail-recursion optimization in Scala
- involves no boxing
- is compiled into the equivalent of a loop, so no method calls get interposed
and because of this it can, in principle, be faster than calling other, more abstract methods
(I say "in principle" because, if the JVM's runtime optimization is smart enough, the difference may effectively disappear) https://t.co/2u20MtMsHY
— Kenji Yoshida (@xuwei_k) September 18, 2023
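To make the boxing bullet concrete before getting to the reversal example, here is a small sketch of my own (summation rather than reversal; the names sumWithFoldLeft and sumTailrec are mine). foldLeft's signature is generic, roughly def foldLeft[B](z: B)(op: (B, A) => B): B, so inside foldLeft the accumulator is handled as an erased reference and an Int accumulator gets boxed and unboxed on each step, whereas in a tail-recursive version the accumulator is an Int parameter that the compiled loop can keep as a primitive local.

// foldLeft goes through a generic Function2, so the Int accumulator is
// boxed/unboxed on every iteration inside foldLeft.
def sumWithFoldLeft(xs: List[Int]): Int =
  xs.foldLeft(0)((acc, x) => acc + x)

// The tail-recursive version keeps acc as a plain Int parameter, which the
// compiler can leave as a primitive local in the generated loop.
@annotation.tailrec
def sumTailrec(xs: List[Int], acc: Int): Int =
  xs match {
    case x :: rest => sumTailrec(rest, acc + x)
    case Nil       => acc
  }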
For example, take reversing a List. If you deliberately do a very rough measurement in the REPL, as below, you get results suggesting that the tail-recursive version is clearly tens of times faster than writing it with foldLeft, even though the actual processing is almost the same.
https://github.com/scala/scala/blob/v2.13.12/src/library/scala/collection/LinearSeq.scala#L179-L187
Welcome to Scala 3.3.1 (11.0.20, Java OpenJDK 64-Bit Server VM).
Type in expressions for evaluation. Or try :help.

scala> def time[A](a: => A): Unit = { val start = System.currentTimeMillis; a ; println(System.currentTimeMillis - start) }
def time[A](a: => A): Unit

scala> lazy val list = (1 to 1000_000).toList
lazy val list: List[Int]

scala> time { list.foldLeft(List.empty[Int])((a, b) => b :: a) }
391

scala> time { list.foldLeft(List.empty[Int])((a, b) => b :: a) }
377

scala> time { list.foldLeft(List.empty[Int])((a, b) => b :: a) }
439

scala> time { list.foldLeft(List.empty[Int])((a, b) => b :: a) }
272

scala> time { list.foldLeft(List.empty[Int])((a, b) => b :: a) }
123

scala> @annotation.tailrec
     | def loop[A](src: List[A], acc: List[A]): List[A] =
     |   src match {
     |     case x :: xs =>
     |       loop(src = xs, acc = x :: acc)
     |     case _ =>
     |       acc
     |   }
def loop[A](src: List[A], acc: List[A]): List[A]

scala> time { loop(list, Nil) }
25

scala> time { loop(list, Nil) }
6

scala> time { loop(list, Nil) }
22

scala> time { loop(list, Nil) }
4

scala> time { loop(list, Nil) }
4
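As an aside on the "compiled into the equivalent of a loop" point: the @tailrec loop above is rewritten by the compiler into a jump within a single stack frame, so there is no new stack frame and no extra method call per element. A hand-written sketch of roughly what that looks like (my own approximation, not the actual compiler output):

// Roughly what the compiler turns the tail-recursive loop into:
// the recursive call becomes reassignment of the parameters plus a jump.
def loopAsWhile[A](src: List[A], acc: List[A]): List[A] = {
  var rest: List[A]   = src
  var result: List[A] = acc
  var done            = false
  while (!done) {
    rest match {
      case x :: xs =>
        result = x :: result
        rest = xs
      case _ =>
        done = true
    }
  }
  result
}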
However, about the part that says "the difference may effectively disappear if the JVM's runtime optimization is smart enough": when I measured a bit more seriously using a tool called JMH, the results were as follows. At least with this experimental code there is still some difference, but nothing like the tens-of-times gap seen in the REPL; it is at most about 1.7x.
(ops/s is the number of times the benchmark could run per unit of time, so a larger number means better performance.)
https://github.com/sbt/sbt-jmh
By "a bit more seriously" I mean that even this still has plenty of rough edges; measuring correctly on the JVM with JMH is a deep topic, so if you are curious, try varying things further yourself (see the sketch below for some of the knobs).
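For instance, JMH lets you fix the measurement parameters with annotations instead of command-line flags. The following is only a sketch of commonly adjusted knobs, not what I actually ran; the class name TunedMain and the chosen values are mine.

package example

import java.util.concurrent.TimeUnit
import org.openjdk.jmh.annotations._

// These annotations override the command-line defaults; the values are examples.
@BenchmarkMode(Array(Mode.Throughput))
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 5, time = 10)
@Measurement(iterations = 5, time = 10)
@Fork(1)
@State(Scope.Benchmark)
class TunedMain {
  @Benchmark
  def foldLeft(): List[Int] =
    Main.list.foldLeft(List.empty[Int])((xs, x) => x :: xs)
}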
Even that 1.7x feels like it may come less from the difference between foldLeft and tail recursion and more from possible extra boxing and unboxing.
When I removed the generics from the loop part and fixed it to Int, the tailrec version actually became just as slow, so I figured that to measure this seriously, boxing included, I should properly benchmark both a pattern that uses no primitive types and a pattern where boxing and unboxing always happen on both sides, but it started to feel like too much of a hassle.
— Kenji Yoshida (@xuwei_k) September 19, 2023
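For reference, the "remove the generics and fix it to Int" variant mentioned in that tweet would look roughly like this (a sketch on my part; it is not included in the JMH run below):

// Same loop, but with the type parameter removed and fixed to Int.
// Each element is now unboxed from the List[Int] at the pattern match and
// re-boxed when prepended with ::, which may explain why this version
// measured just as slow in the tweet above.
@annotation.tailrec
def loopInt(src: List[Int], acc: List[Int]): List[Int] =
  src match {
    case x :: xs =>
      loopInt(src = xs, acc = x :: acc)
    case _ =>
      acc
  }

The full setup and the raw output of the JMH run follow.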
project/plugins.sbt
addSbtPlugin("pl.project13.scala" % "sbt-jmh" % "0.4.6")
build.sbt
scalaVersion := "3.3.1"
enablePlugins(JmhPlugin)
Main.scala
package example

import org.openjdk.jmh.annotations.Benchmark
import scala.annotation.tailrec

object Main {
  val list: List[Int] = (1 to 1000_000).toList
}

class Main {
  @Benchmark
  def foldLeft(): List[Int] = {
    Main.list.foldLeft(List.empty[Int])((xs, x) => x :: xs)
  }

  @Benchmark
  def tailrecLoop(): List[Int] = {
    @tailrec
    def loop[A](src: List[A], acc: List[A]): List[A] =
      src match {
        case x :: xs =>
          loop(src = xs, acc = x :: acc)
        case _ =>
          acc
      }

    loop(src = Main.list, acc = Nil)
  }
}
Jmh/run -i 5 -wi 5 -f1 -t1
The results of running the above command:
[info] running (fork) org.openjdk.jmh.Main -i 5 -wi 5 -f1 -t1
[info] # JMH version: 1.37
[info] # VM version: JDK 11.0.20, OpenJDK 64-Bit Server VM, 11.0.20+8-LTS
[info] # VM invoker: /Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/Home/bin/java
[info] # VM options: <none>
[info] # Blackhole mode: full + dont-inline hint (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
[info] # Warmup: 5 iterations, 10 s each
[info] # Measurement: 5 iterations, 10 s each
[info] # Timeout: 10 min per iteration
[info] # Threads: 1 thread, will synchronize iterations
[info] # Benchmark mode: Throughput, ops/time
[info] # Benchmark: example.Main.foldLeft
[info] # Run progress: 0.00% complete, ETA 00:03:20
[info] # Fork: 1 of 1
[info] # Warmup Iteration 1: 103.134 ops/s
[info] # Warmup Iteration 2: 211.510 ops/s
[info] # Warmup Iteration 3: 182.879 ops/s
[info] # Warmup Iteration 4: 206.715 ops/s
[info] # Warmup Iteration 5: 203.103 ops/s
[info] Iteration 1: 187.781 ops/s
[info] Iteration 2: 190.669 ops/s
[info] Iteration 3: 183.167 ops/s
[info] Iteration 4: 186.122 ops/s
[info] Iteration 5: 189.595 ops/s
[info] Result "example.Main.foldLeft":
[info]   187.467 ±(99.9%) 11.420 ops/s [Average]
[info]   (min, avg, max) = (183.167, 187.467, 190.669), stdev = 2.966
[info]   CI (99.9%): [176.047, 198.887] (assumes normal distribution)
[info] # JMH version: 1.37
[info] # VM version: JDK 11.0.20, OpenJDK 64-Bit Server VM, 11.0.20+8-LTS
[info] # VM invoker: /Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/Home/bin/java
[info] # VM options: <none>
[info] # Blackhole mode: full + dont-inline hint (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
[info] # Warmup: 5 iterations, 10 s each
[info] # Measurement: 5 iterations, 10 s each
[info] # Timeout: 10 min per iteration
[info] # Threads: 1 thread, will synchronize iterations
[info] # Benchmark mode: Throughput, ops/time
[info] # Benchmark: example.Main.tailrecLoop
[info] # Run progress: 50.00% complete, ETA 00:01:41
[info] # Fork: 1 of 1
[info] # Warmup Iteration 1: 300.062 ops/s
[info] # Warmup Iteration 2: 326.647 ops/s
[info] # Warmup Iteration 3: 324.159 ops/s
[info] # Warmup Iteration 4: 324.243 ops/s
[info] # Warmup Iteration 5: 323.696 ops/s
[info] Iteration 1: 324.353 ops/s
[info] Iteration 2: 324.258 ops/s
[info] Iteration 3: 322.606 ops/s
[info] Iteration 4: 324.562 ops/s
[info] Iteration 5: 323.758 ops/s
[info] Result "example.Main.tailrecLoop":
[info]   323.907 ±(99.9%) 3.023 ops/s [Average]
[info]   (min, avg, max) = (322.606, 323.907, 324.562), stdev = 0.785
[info]   CI (99.9%): [320.884, 326.931] (assumes normal distribution)
[info] # Run complete. Total time: 00:03:21
[info] REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
[info] why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
[info] experiments, perform baseline and negative tests that provide experimental control, make sure
[info] the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
[info] Do not assume the numbers tell you what you want them to tell.
[info] Benchmark          Mode  Cnt    Score    Error  Units
[info] Main.foldLeft     thrpt    5  187.467 ± 11.420  ops/s
[info] Main.tailrecLoop  thrpt    5  323.907 ±  3.023  ops/s