Saturday, December 22, 2018

When performance guarantees hurt performance - std::visit

The performance of std::visit came up in a discussion, and my first thought was that, from reading generated assembly code, it's a jump table, so it should be pretty fast. But then it dawned on me: all the examples I'd seen were jumps to functions. The compiler had not been able to inline the visitor. This is, of course, of no consequence if your visitor does any work worth speaking of, but if the visitor does trivial work, like incrementing the given value, it may matter in a hot path.
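
To make the kind of workload concrete, here's a minimal sketch of such a trivial visitor in a hot loop; the variant's alternative types and the function name are mine and chosen only for illustration:

#include <cstddef>
#include <variant>
#include <vector>

// Counts how many elements currently hold a value equal to 1.
// The visitor does almost no work, so the cost of the dispatch itself dominates.
std::size_t count_ones(const std::vector<std::variant<int, long, double>>& data)
{
    std::size_t count = 0;
    for (const auto& v : data) {
        std::visit([&count](const auto& value) { count += (value == 1); }, v);
    }
    return count;
}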

Looking at the (draft) standard, I found this on std::visit(visitor, variant...):

[Image: the complexity requirement for std::visit in the (draft) standard, readable on eel.is: for n <= 1, the invocation of the callable object is implemented in constant time, i.e. it does not depend on the number of alternative types; for n > 1 there are no complexity requirements.]
n, here, is the number of variants in the call to std::visit.

I believe the only way to achieve this is to implement visit more or less like this:

#include <type_traits>
#include <utility>
#include <variant>

namespace detail {
template <std::size_t ... I, typename F, typename V>
auto visit(std::index_sequence<I...>, F&& f, V&& v)
{
    using ftype = void (*)(F&&, V&&);
    // One function pointer per alternative type, each calling the visitor
    // with the corresponding std::get<I>.
    static constexpr ftype funcs [] = {
        [](F&& f, V&& v){ return f(std::get<I>(std::forward<V>(v)));}...
    };
    if (v.index() == std::variant_npos) throw std::bad_variant_access{};
    return funcs[v.index()](std::forward<F>(f), std::forward<V>(v));
}
}
template <typename F, typename V>
auto visit(F&& f, V&& v)
{
    using Vb = std::remove_cv_t<std::remove_reference_t<V>>;
    constexpr auto size = std::variant_size_v<Vb>;
    return detail::visit(std::make_index_sequence<size>{}, std::forward<F>(f), std::forward<V>(v));
}

A table of function pointers, one per alternative type, is created at compile time. Before the call, a check is made for whether the variant is valueless by exception, and if it is not, the call is made via the jump table, indexed by the variant's current index. In this case I've simplified the implementation to only work with visitors that do not return any value.
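
As a usage sketch (the variant type and the printing visitor are placeholders of my choosing, and the visitor must not return a value, per the simplification above), a call to this hand-rolled visit looks like a normal visit call but dispatches through the funcs table:

#include <cstdio>

void example(const std::variant<int, char>& v)
{
    // Qualified to call the hand-rolled visit above rather than std::visit via ADL.
    ::visit([](auto x) { std::printf("%d\n", static_cast<int>(x)); }, v);
}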

Unfortunately, it is extremely difficult for the compiler to inline the calls that go via the table. Look at the example on the compiler explorer godbolt.org. It is also a possible pessimization to have to check for the valueless by exception state first.

I began to wonder if it would be better to ignore the O(1) requirement and let the compiler generate trivial if-else tests, since it can probably inline them. How many alternative types can the variant hold before the jump table becomes faster?

My idea was to use templates to generate code that compares the value of .index() on each variant with the possible indexes, and makes the call when they match. So, if we imagine a call to visit using two variants, variant<int,char> v1 and variant<bool,void*,char> v2, then the index space is a 2x3 rectangle, since v1 has two valid indexes and v2 has 3. The desired generated code should be something like this:


template <typename F>
auto visit(F f, V1 v1, V2 v2)
{
  if (       v1.index() == 0 && v2.index() == 0) {
    return      f(std::get<0>(v1),   std::get<0>(v2));
  } else if (v1.index() == 1 && v2.index() == 0) {
    return      f(std::get<1>(v1),   std::get<0>(v2));
  } else if (v1.index() == 0 && v2.index() == 1) {
    return      f(std::get<0>(v1),   std::get<1>(v2));
  } else if (v1.index() == 1 && v2.index() == 1) {
    return      f(std::get<1>(v1),   std::get<1>(v2));
  } else if (v1.index() == 0 && v2.index() == 2) {
    return      f(std::get<0>(v1),   std::get<2>(v2));
  } else if (v1.index() == 1 && v2.index() == 2) {
    return      f(std::get<1>(v1),   std::get<2>(v2));
  } else {
    throw std::bad_variant_access{};
  }
}

Doing this requires some helpers dealing with index sequences:

template<std::size_t I, std::size_t ... Is>
constexpr
auto
prepend(std::index_sequence<Is...>) {
  return std::index_sequence<I, Is...>{};
}

constexpr
std::index_sequence<>
next_seq(std::index_sequence<>, std::index_sequence<>) {
  return {};
}

template<
  std::size_t I, std::size_t ... Is,
  std::size_t J, std::size_t ... Js
>
constexpr
auto
next_seq(std::index_sequence<I, Is...>, std::index_sequence<J, Js...>) {
  if constexpr (I + 1 == J) {
    return prepend<0>(next_seq(std::index_sequence<Is...>{},
                               std::index_sequence<Js...>{}));
  } else {
    return std::index_sequence<I + 1, Is...>{};
  }
}

The type function next_seq takes a current index sequence and an upper limit for each index. Example calls evaluate to:

next_seq(<0,0>, <2,3>) -> <1,0>
next_seq(<1,0>, <2,3>) -> <0,1>
next_seq(<0,1>, <2,3>) -> <1,1>
next_seq(<1,1>, <2,3>) -> <0,2>
next_seq(<0,2>, <2,3>) -> <1,2>
next_seq(<1,2>, <2,3>) -> <0,0>
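
A quick compile-time check of the wrap-around case (assuming the helpers above are visible in the current namespace):

#include <type_traits>

static_assert(std::is_same_v<
    decltype(next_seq(std::index_sequence<1, 2>{}, std::index_sequence<2, 3>{})),
    std::index_sequence<0, 0>>,
    "the last valid index pair wraps around to <0,0>");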

The meat of the visit implementation, which lives in namespace detail so the wrapper below can refer to it, can now be written:


template<std::size_t ... I>
static
constexpr
std::size_t
sum(std::index_sequence<I...>) { return (I + ...); }

template<
  std::size_t ... Is,
  std::size_t ... Ms,
  typename F,
  typename ... Vs
>
constexpr
auto
visit(
  std::index_sequence<Is...> i,
  std::index_sequence<Ms...> m,
  F &&f,
  Vs &&... vs)
{
  constexpr auto n = next_seq(i, m);
  if (std::tuple(vs.index()...) == std::tuple(Is...)) {
    return f(std::get<Is>(std::forward<Vs>(vs))...);
  }
  if constexpr (sum(n) > 0) {
    return visit(n, m, std::forward<F>(f), std::forward<Vs>(vs)...);
  } else {
    throw std::bad_variant_access{};
  }
}

It's a compile-time recursion over all possible index combinations. The if statement checks whether the variants' current indexes match the indexes of this step of the recursion; if they do, the call to the visitor function is made. The sum function is a convenience that makes the recursion condition easier to express: if the sum of all indexes in the next sequence is 0, the sequence has wrapped around and the recursion stops. If no match was found anywhere, at least one of the variants must have been valueless by exception, so an exception is thrown.

The real visit function is a wrapper that provides the index sequences:


template<typename>
static constexpr std::size_t zero = 0;

template <typename T>
using remove_cv_ref_t = std::remove_cv_t<std::remove_reference_t<T>>;

template <typename F, typename ... Vs>
auto visit(F&& f, Vs&& ... vs)
{
  return detail::visit(
    std::index_sequence<zero<Vs>...>{},
    std::index_sequence<std::variant_size_v<remove_cv_ref_t<Vs>>...>{},
    std::forward<F>(f),
    std::forward<Vs>(vs)...);
}
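
As a usage sketch with two variants, matching the v1/v2 example from earlier (the do-nothing visitor is only for illustration, and the call is qualified to avoid picking up std::visit via argument-dependent lookup):

void example(const std::variant<int, char>& v1,
             const std::variant<bool, void*, char>& v2)
{
  // Expands, at compile time, to the 2x3 chain of if-else tests sketched above.
  ::visit([](auto a, auto b) { (void)a; (void)b; }, v1, v2);
}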

cppinsights.io does a reasonable job of showing how the compiler sees the code after template expansion.

Using this version of visit on the first example, where the visitor does nothing, generates a function that does nothing, because now the compiler can inline the visitor. Here's a link to the godbolt.org compiler explorer.
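
For reference, the do-nothing case from that example looks roughly like this (the function name is mine); with the if-else visit the compiler can see through the whole call:

void nop(const std::variant<int, char, bool>& v)
{
  // With the inlinable visit, this whole function optimizes down to (nearly) nothing.
  ::visit([](auto&&) {}, v);
}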

So, how fast is this when trivial work is actually done?

The graph above shows the time in µs to count ones in a vector of 5000 elements. The X-axis is the number of alternative types in the variant. The measurement is made with clang-7 using the flags '-O3 -std=c++17 -stdlib=libc++' on this machine:

Intel(R) Core(TM) i7 CPU       X 980  @ 3.33GHz
CPU Caches:
  L1 Data 32K (x6)
  L1 Instruction 32K (x6)
  L2 Unified 256K (x6)
  L3 Unified 12288K (x1)

Other compilers and standard library implementations show similar results.

I think there are two interesting observations to be made from the graph. One is that the standard library implementation isn't, strictly speaking, O(1): it is substantially faster for very small numbers of types in the variant. The other is that for very simple visitors, it really pays to let the compiler inline the call.

If you want to play with this, or shoot it down, the implementation and benchmark are available on github.com.