Saturday, December 22, 2018

When performance guarantees hurts performance - std::visit

The performance of std::visit came up in a discussion, and my first thought was that from reading generated assembly code, it's a jump table, so it should be pretty fast. But then it dawned on me, all examples I've seen have been jumps to functions. The compiler has not been able to inline the visitor. This is, of course, of no consequence if your visitor does any work worth speaking of, but if the visitor does trivial work, like incrementing the given variable, then it may be important if done in a hot path.

Looking at the (draft) standard, I found this on std::visit(visitor, variant...):

Click the image to read the (draft) std on eel.is
n, here, is the number of variants in the call to std::visit.

I believe the only way to achieve this is to implement visit more or less like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
namespace detail {
template <std::size_t ... I, typename F, typename V>
auto visit(std::index_sequence<I...>, F&& f, V&& v)
{
    using ftype = void (*)(F&&, V&&);
    static constexpr ftype funcs [] = {
        [](F&& f, V&& v){ return f(std::get<I>(std::forward<V>(v)));}...
    };
    if (v.index() == std::variant_npos) throw std::bad_variant_access{};
    return funcs[v.index()](std::forward<F>(f), std::forward<V>(v));
}
}
template <typename F, typename V>
auto visit(F&& f, V&& v)
{
    using Vb = std::remove_cv_t<std::remove_reference_t<V>>;
    constexpr auto size = std::variant_size_v<Vb>;
    return detail::visit(std::make_index_sequence<size>{},std::forward<F>(f), std::forward<V>(v));
}

A table of functions is created at compile time, lines 5-8. At line 9 a check is made if the variant is invalid by exception, and if not, at line 10, the call is made via the jump table. In this case I've simplified the implementation to only work with visitors that do not return any value.

Unfortunately, it is extremely difficult for the compiler to inline the calls that go via the table. Look at the example on the compiler explorer godbolt.org. It is also a possible pessimization to have to check for the valueless by exception state first.

I began to wonder if it would be better if I ignore the O(1) requirement and let the compiler generate trivial if-else tests, since it can probably inline them. How many alternative types can the variant hold before the jump table becomes faster?

My idea was to use templates to generate code that compares the value of .index() on each variant, with the possible indexes, and make the call when it matches. So, if we imagine a call to visit, using two variants variant<int,char> v1 and variant<bool,void*,char> v2, then the index space is a 2x3 rectangle, since v1 has two valid indexes, and v3 has 3. The desired generated code should be something like this:


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
template <typename F>
auto visit(F f, V1 v1, V2 v2)
{
  if (       v1.index() == 0 && v2.index() == 0) {
    return      f(std::get<0>(v1),   std::get<0>(v2));
  } else if (v1.index() == 1 && v2.index() == 0) {
    return      f(std::get<1>(v1),   std::get<0>(v2));
  } else if (v1.index() == 0 && v2.index() == 1) {
    return      f(std::get<0>(v1),   std::get<1>(v2));
  } else if (v1.index() == 1 && v2.index() == 1) {
    return      f(std::get<1>(v1),   std::get<1>(v2));
  } else if (v1.index() == 0 && v2.index() == 2) {
    return      f(std::get<0>(v1),   std::get<2>(v2));
  } else if (v1.index() == 1 && v2.index() == 2) {
    return      f(std::get<1>(v1),   std::get<2>(v2));
  } else {
    throw std::bad_variant_access{};
  }
}

Doing this requires some helpers dealing with index sequences:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
template<std::size_t I, std::size_t ... Is>
constexpr
auto
prepend(std::index_sequence<Is...>) {
  return std::index_sequence<I, Is...>{};
}

constexpr
std::index_sequence<>
next_seq(std::index_sequence<>, std::index_sequence<>) {
  return {};
}

template<
  std::size_t I, std::size_t ... Is,
  std::size_t J, std::size_t ... Js
>
constexpr
auto
next_seq(std::index_sequence<I, Is...>, std::index_sequence<J, Js...>) {
  if constexpr (I + 1 == J) {
    return prepend<0>(next_seq(std::index_sequence<Is...>{},
                               std::index_sequence<Js...>{}));
  } else {
    return std::index_sequence<I + 1, Is...>{};
  }
}

The type function next_seq takes a current index sequence and an upper limit for each index. Calls evaluates to:

next_seq(<0,0>,<2,3>)  ->  <1,0>
next_seq(<1,0>,<2,3>) -> <0,1>
next_seq(<0,1>,<2,3>) -> <1,1>
next_seq(<1,1>,<2,3>) -> <0,2>
next_seq(<0,2>,<2,3>) -> <1,2>
next_seq(<1,2>,<2,3>) -> <0,0>

The meat of the visit implementation can now be written:


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
template<std::size_t ... I>
static
constexpr
std::size_t
sum(std::index_sequence<I...>) { return (I + ...); }

template<
  std::size_t ... Is,
  std::size_t ... Ms,
  typename F,
  typename ... Vs
>
constexpr
auto
visit(
  std::index_sequence<Is...> i,
  std::index_sequence<Ms...> m,
  F &&f,
  Vs &&... vs)
{
  constexpr auto n = next_seq(i, m);
  if (std::tuple(vs.index()...) == std::tuple(Is...)) {
    return f(std::get<Is>(std::forward<Vs>(vs))...);
  }
  if constexpr (sum(n) > 0) {
    return visit(n, m, std::forward<F>(f), std::forward<Vs>(vs)...);
  } else {
    throw std::bad_variant_access{};
  }
}

It's a compile time recursion over all possible indexes. Line 22 checks if the tuples current indexes matches the indexes of the recursion. If they do the call to the visitor function is made on line 23. The sum function, on lines 1-5 are a convenience to make the condition for recursion easier (line 25. If the sum of all indexes in the next sequence is 0, then it has wrapped.) If no match was found anywhere, at least one of the variants must have been invalid by exception, so an exception is thrown.

The real visit function is a wrapper that provides the index sequences:


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
template<typename>
static constexpr std::size_t zero = 0;

template <typename T>
using remove_cv_ref_t = std::remove_const_t<std::remove_reference_t<T>>;

template <typename F, typename ... Vs>
auto visit(F&& f, Vs&& ... vs)
{
  return detail::visit(
    std::index_sequence<zero<Vs>...>{},
    std::index_sequence<std::variant_size_v<remove_cv_ref_t<Vs>>...>{},
    std::forward<F>(f),
    std::forward<Vs>(vs)...);
}

cppinsights.io does a reasonable job of showing how the compiler sees the code after template expansion.

Using this version of visit on the first example where the visitor does nothing, generates a function that does nothing, because now the compiler can inline the visitor. Here's a link to the godbolt.org compiler explorer.

So, how fast is this when trivial work is actually done?

The graph above shows the time in µs to count ones in a vector of 5000 elements. The X-axis is the number of types is the variant. The measurement is made with clang-7 using the flags '-O3 -std=c++17 -stdlib=libc++' on this machine:

Intel(R) Core(TM) i7 CPU       X 980  @ 3.33GHz
CPU Caches:
  L1 Data 32K (x6)
  L1 Instruction 32K (x6)
  L2 Unified 256K (x6)
  L3 Unified 12288K (x1)

Other compilers and standard library implementations shows similar results.

I think there are two interesting observations to be made from the graph. One is that the standard library implementation isn't strictly speaking O(1). It is substantially faster for very small numbers of types in the variant. The other is that for very simple visitors, it really pays to let the compiler be able to inline the call.

If you want to play with this, or shoot it down, the implementation and benchmark is available from github.com.

Thursday, November 29, 2018

How to speak at a conference

A former colleague of mine recently described the steps to speak at a conference as:

* Write a proposal and (optionally) a talk outline.
* Get accepted by the conference organisers.
* Write the talk.
* Deliver the talk.

Cool!

This is not wrong, by the way.

But... how do you get accepted? I'm sorry, but I don't have any great advice here, but I can tell you what little I know from my limited experience.

First, and very important thing to keep in mind: You're not being rejected, you're just not being accepted. Not being accepted means that competition is fierce, and other talks, on your chosen topic, were deemed more interesting than yours, or that your chosen topic did not quite match the conference. Not being accepted doesn't mean you suck. It means other talks were considered more interesting, and there's no shame in that. It obviously doesn't feel good if your proposal wasn't accepted, but keep at it, refine your proposal, and try again. Maybe with the same topic from another angle, or perhaps another topic, perhaps another conference?

Mind that not all conferences are alike, so there are differences, but the broad pictures is likely more or less the same.

Let's start with two examples.

First the abstract. This is what the committee decides from, and also what's visible to the attendees to the conference. The first example (so far two Nay and no Yay):


The Curiously Recurring Coupled Types Pattern.
Why can pointers be subtracted but not added? What do raw C pointers, STL
iterators, std::chrono types, and 2D/3D geometric primitives have in
common? In this talk we will present some curiously coupled data types that
frequently occur in your programs, together forming notions that you are
already intuitively familiar with. We will shine a light on the
mathematical notion of Affine Spaces, and how they guide stronger design.
We will review the properties of affine spaces and show how they improve
program semantics, stronger type safety and compile time enforcement of
these semantics. By showing motivational examples, we will introduce you to
the mathematical notion of affine spaces. The main focus will then be on
how affine space types and their well defined semantics shape expressive
APIs. We will give examples and guidelines for creating your own affine
types.


And the accompanying outline, as envisioned by the time of the submission. This outline is preliminary, and you won't be held accountable for it. It's an aid for the conference committee to decide. The conference committee knows that the talk is not written yet; that this is an idea for a talk. I find that writing the outline also helps with figuring out a structure for the talk:
Show familiar examples of situations of affine space semantics - pointer
arithmetic - iterators - chrono - coordinate systems 

  • Mathematical definitions/properties 
  • Describe properties of affine space types - operators and relations - [show a Concept for affine space types, tbd] 
  • Show how to write a simple affine space strong type template. 
  • Parallels to unit systems


Here's the abstract for another talk that was accepted:

Programming with Contracts in C++20 
Design by Contract is a technique to clearly express which parts of a
program has which responsibilities. In the case of bugs, contracts can
mercilessly point a finger at the part that violated the contract.
Contracts are coming as a language feature in C++20. I will show how they
can make your interfaces clearer with regards to use, but also point out
pitfalls and oddities in how contracts are handled in C++. Writing good
contracts can be difficult. I intend to give you guidance in how to think
when formulating contracts, and how to design your interfaces such that
good contracts can be written, because contracts have effects on interface
design. Warning: Parts of the contents are personal opinion.

and its outline:

Brief intro to the ideas behind Design by Contract Show what the current
draft standard supports, including strengths, weaknesses and missing
features. Propose rules of thumb for "best practices" with contracts in
C++20. Show some gotchas to look out for.
Is there a take away message from this?

I think there are two take away messages.

  1. It pays to think about how you want your talk to look like in the conference programme. This is difficult, and (at least for me) takes a disproportionate amount of time. It's only 100 or so words, after all, but expressing an idea very briefly is very hard.
  2. It also takes luck. It's not your fault if you're not lucky. A talk one conference didn't accept, another one might, and vice versa. Keep trying (and if not offered, ask for why the proposal wasn't accepted - chances are there's valuable information there.)
If you get email saying your talk has been accepted, then congratulations, it is time for the big work to begin. Think about how you best get your ideas across. Who is your audience? How knowledgeable are they about your topic? Watch a number of presentations you have liked, and study how the presenter does it. There are many different techniques. Shamelessly steal techniques you think works well, and note what's problematic so that you can avoid it. Writing the presentation material does (for me) take a huge amount of time, and I keep revisiting it over and over, polishing for better narrative, fixing bugs, improving visual style. One difficult thing to estimate is how long it takes to deliver the talk. Aim for filling your slot reasonably well. It's not nice to overshoot, but it's also rude to your audience and the conference organiser if you use up considerably less time than has been set aside for you. The only way to learn how long it takes is to do it (after you have done it a few times, you get a feel for your slide-rate, but for your first talk, you obviously have no idea.) Dry run the talk for yourself. Leave space for audience interaction. Practice on your colleagues. If it's a conference talk, try do do a practice run for your local user group. Solicit feedback, and improve your presentation even more. Some speakers like to rehearse the talk very much, others prefer to improvise more. Find out what works for you.

And, finally... you have things to say. You have experiences that are worth sharing. Your experiences will valuable to others, but only if you share them. Speaking at conferences or local user groups isn't the only way to share experiences, but don't dismiss it, OK?

Share your experiences.

Saturday, July 14, 2018

DRY multicomparisons

Now and then, I find myself writing something like if (x == a || x == b || x == c) ..., and every time the repetition of x == annoys me.

A number of people, me included, have reduced the repetition by writing code like:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
template <typename T>
class is
{
public:
    constexpr is(T t_) : t(std::move(t_)) {}
    template <typename ... U>
    constexpr auto any_of(U const& ... u) const { return ((t == u) || ...); }
private:
  T t;
};

If you're not familiar with the C++17 constructs, I'll walk you through it.

To begin with, C++17 has a class template argument type deduction that works for constructors. This works automatically when the type used in the constructor is enough to deduce the template types, which is the case with the constructor on line 5, so is{3} is automatically deduced to be is<int>{3}, since the literal 3 is an int.

The member function template any_of(), on line 7, uses a variadic template parameter pack ...u, to accept any number of arguments of any type. This is then passed to a fold expression which will be expanded such that the pattern (t == u) is repeated, where u takes on each and every of the parameters called with, in the ....

Here's an example use of the construction:

1
if (is{x}.any_of(a,b,c)) ...

It will construct an is<int>{x}, and the call to any_of(a,b,c) will in effect return the result of ((x == a) || (x == b) || (x == c)).

This takes care of much of the repetition, but it doesn't read very nicely. It's awkward.

Enter ranges. Ranges have been in the works for C++ for some time now, and is available as open source libraries if you want to try them out. Chances are they will be a part of C++20, but that decision is not made yet.

With ranges, the algorithms from <algorithm> and <numeric> can work on whole ranges of values at once, so you don't have to fiddle with iterator pairs. Collections counts as ranges, including std::initializer_list<>. In my examples I'm using the range-v3 library, available from github.com. In the range-v3 library, the algorithms are in namespace ranges::v3.

So we can instead write:


1
if (ranges::v3::any_of({a,b,c}, [x](auto y){ return x == y; }) ...

It takes care of repetition, but at the cost of boiler plate to the degree it smells of obfuscation. It's not easy to read.


A trivial higher order function improves it:

1
2
3
4
5
template <typename T>
inline auto equals(T t)
{
    return [=](auto&& x) { return x == t;};
}

The call can now be made as:

1
if (ranges::v3::any_of({a,b,c}, equals(x)) ...

The call to equals(x) gives us the lambda that makes the test for equality. I guess that if you read it out, except for the namespace and punctuations, it kind of reads like normal English.

But there is a difference here. A subtle one, and one I didn't think of for a very long time. What if a, b and c, are not all of the same type?

Just a couple of days ago, I had a different idea. What if I make any_of to be a class template, that holds the values I want to compare, and implements an operator==? This would be sort of like the first example solution, but inside out.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
template <typename ... T>
class any_of : std::tuple<T...>
{
public:
    using std::tuple<T...>::tuple;
    template <typename U>
    constexpr bool operator==(const U& u) const {
        return std::apply([&](const auto& ... a) { return ((a == u) || ...);},
                          get());
    }
    template <typename U>
    friend constexpr bool operator==(const U& u, const any_of& a)
    {
        return a == u;
    }
private:
    constexpr const std::tuple<T...> get() const { return *this;}
};

template <typename ... T>
any_of(T&& ...) -> any_of<T...>;

I'll walk you through the code.

First, any_of<T...> inherits privately from std::tuple<T...>, and on line 5 we say that we have all the constructors that tuple does.

Lines 6-10 implements operator==, that can be called with any type. std::apply, on line 8, calls a function (first parameter) with the individual values from a tuple (2nd parameter). The function is a lambda that can take any number of arguments, and compare them with the reference captured u, as explained above for the is<T>::any_of() function template, and the 2nd parameter is the tuple that any_of inherits from (as given by the private get() member function on line 17.)

The one thing missing now is that it is not obvious for the compiler what the types T are. It cannot deduce them automatically, so we need to help it. The deduction guide on lines 20-21 takes care of that. It say if we call any_of(a,b,c), it will deduce the Ts from the types of a, b and c. l-values will be deduced to l-value references, and r-values will be moved into the tuple.

This is enough to make it work. A call can now be made as:

1
if (any_of{a,b,c} == x)...

This is nice! A little bit Yoda-like grammar, but the friend operator== on lines 11-15 allows comparison with the arguments in the reversed order, like:

1
if (x == any_of{a,b,c})...

This is all and well. But since this can be used with any types for which the comparisons make sense, why not expand it to other operators too, and implement all_of and none_of? Why not allow an assertion like the following?:

1
assert(all_of(a,b,c,d) > " ");

This of course requires that a, b, c and d are all greater-than comparable with a c-string.

I think this is neat. It reads nicely, and is not very difficult to understand.

If you want to have a look at code generation, please toy with this link to gcc.godbolt.org. Spoiler - the compilers see through it all and generates amazingly efficient code.

If you want to try it in your own code, clone it from github.com.

Monday, August 7, 2017

Performance of flat maps

A flat map is a data structure that maps a key to a value, and that stores the data in a compact flat memory structure to get better cache hit rate. They can be substantially faster than hash tables and trees, for provided the data sets are small, but how small? There's also more than one way to implement a flat map, and what are the pros and cons of the alternatives?

I set out to compare 4 alternatives, and added C++ std::map<key, value> (which is a balanced binary tree,) and std::unordered_map<key, value> (which is a hash table) into the performance measurements. Note that most of what's mentioned here is language agnostic. C++ happens to be the language of choice for this post, but any language that allows access to contiguous blocks of memory would be equally suitable.

The contestants

My alternatives are the unordered_flatmap<key, value>, flatmap<key, value>, unordered_split_flatmap<key, value>, and split_flatmap<key, value>.

unordered_flatmap<key, value> 

This is the simplest of them all.

This is a map from integer to string, where key and value are stored next to each other. The coloured blocks in the background are illustrations of possible cache lines. In this illustration, each cache line holds two key,value pairs. A key is 1/8 of a cache line in size, and a value is 3/8 of a cache line in size.

Searching for a key in this structure is simply a matter of trying the keys one by one from the beginning.
An obvious disadvantage with this structure is the cost of searching for a key that isn't there.
All elements must be searched, as there is no way to know when to stop. Another less obvious disadvantage of this layout is that it is quite wasteful of the cache when it comes to lookup. In this case, as it traverses the elements to (possibly) find the desired key, it only takes advantage of 1/4 of the data read into a cache line. The memory occupied by the uninteresting values is wasted.

On the other hand, since there are no guarantees of order of elements in this structure, removal of an element, once found, can be made very efficient, by simply moving the last element into the spot no longer occupied by the removed element.
Above, key=8 is removed from the flat map.

Insertion is also fairly efficient, just add the new element last.
The insertion, however, requires first a failed find to ensure that the key is unique.

These linear searches may not be as bad as they first seem. Modern CPUs quickly spot a linear access pattern and speculatively pre fetch data into the cache.

ordered_flatmap<key, value>

This one has the same storage structure as before, but it always stores the keys in order.
Since the keys are stored in a sorted order, lookup can be made using binary search
This scales much better, since the number of accesses is log2(N), where N is the number of elements stored. It is especially good when it comes to lookup of a key that isn't there.
However, the disadvantage is that this access pattern kills the pre fetch of data into the cache since the CPU cannot predict where the next access will be.

To make matters worse, maintaining the sort order when doing insertion and deletion carries a non-trivial cost.
 Removing key=8 above, and inserting it again below.
There are a lot of move constructors involved above, but again, the access pattern for these are very predictable meaning that the move constructors almost certainly operate in hot caches.

Both of these maps above are really good when you want to iterate over all stored elements. Easily predictable pattern, and key/value are stored together.

unordered_split_flatmap<key, value>

This version addresses the waste of precious cache when searching, by storing keys and values in separate arrays, linked by the same index.
A lookup example makes the advantage obvious
There is no wasted cache space at all. Even though a failed lookup shares the characteristics of the unordered_flatmap<key,value>, the cache waste cost is 1/4, which can be a substantial saving.
Insertion and deletion works as with unordered_flatmap<key, value>, except it works on the two arrays instead of one.

Iteration is troubling in this one. There are two parallel sequential accesses, which is fine, but at least in C++, the precedent set by the standard library mandates that an iterator gives a std::pair<key, value> (or a reference to one,) and that doesn't work here since key and value are stored in completely different locations. You need proxy iterators, and those are not free from problems. Eric Niebler wrote about this in his 2015 blog post "To be, or not to be (an iterator)."

ordered_split_flatmap<key, value>

Here's the same as above, but with the keys, and thus values, in sorted order.
The idea is to combine the reduced cache waste in the lookup, while taking advantage of a logarithmic algorithm.

It is not obvious if this offers an advantage over the ordered_flatmap<key, value>, since pre fetch of data into the cache is killed anyway.

Other alternatives

There are many possible implementations of flat maps, and I felt like I had to limit myself. Alternatives I find interesting to explore are sorted with linear search, giving on the average twice as fast lookup in the failure case, and a hybrid ordered_split_flatmap<key, value> where only the keys are sorted, and contain an index to the value, which is stored in whichever location is free, offering fast lookup at the cost of non-linear access for iterations.

Observations on keys

Integer keys are always nice. They consume little memory space and all comparison operations on them are fast. Strings are interesting. One observation of note is that equality comparison is often fast when strings are unequal, since any sane implementation checks the length first, and if they are different there is no need to check the string contents, whereas a less-than comparison for a string is always expensive. Another interesting aspect of strings as keys, is that all modern C++ library implementations use a short string optimisation, where short strings are stored in the string object itself and long strings are stored on the heap and the string object points to it. So short strings provide great locality of reference whereas long strings are likely to result in cache misses.

The shoot out

All benchmarks are made on an Intel i7-4500U CPU @ 3GHz. It has 32K of L1-data cache, 32K of L1-instruction cache, 256K of L2 cache and 4M of L3 cache. The benchmark program is compiled with clang++-4.0 -O3 -DNDEBUG, -std=gnu++1z and libc++ where sizeof(std::string) is 24 bytes and short string optimisation allows for strings up to 22 characters of length without using the heap. sizeof(int) is 4 bytes.

The implementations are not super-tweaked for performance. They are implemented using std::vector<> and the std::find_if() algorithm from the standard library. I believe tweaking could improve the results a little bit, but the patterns would look the same because performance is mainly limited by data accesses and comparison operations.

In all measurements the keys are integers, short strings and long strings. The strings are a snapshot of the file names, presented in a random order. The long strings are the full paths, the short strings are the base name. A few of the short strings are too long for the short string optimisation, whereas most fit, just as a few of the long strings do in fact fit in the short string optimisation whereas most do not.

All measurements are also made with cold L1 and L2 caches.  The way this is done is that all elements in an array of 256K size is traversed after setup and before each measurement iteration. This is to see how much one cache hit helps another access. I intended to cool down the L3 cache as well, but 12h to run the benchmark with cold L2 cache was long enough. It would have taken days with the L3 cache cooled down.

The source code for the implementations and the benchmark can be found on github.com.

Lookup of known keys

click for larger image
The main surprise for me, here, is how good std:map<int, string> is for up to about 100 items, and how poor std::unordered_map<> generally is, unless you have around 100 items or more.

The ordered versions perform poorly for string keys until you have a few hundred of them. I believe this is not primarily due to cache effects, but rather because determining the string ordering relation is always an expensive operation whereas equality comparison is often cheap when unequal (since comparing size alone suffices.)

Lookup of unknown keys

click for larger image
I found it surprising to see how well the unordered versions performed for strings, as long as the number of elements are a few tens. For integer keys the ordered versions outperform std::unordered_map<> until somewhere between 256 and 512 elements.

Insertion of elements

click for larger image
The Y-scale is the total time taken to populate a map to that number of elements, and then divided by the number of elements, so it's an average of the time to populate the map.

Again, std::unordered_map<> needs a few hundred elements, and for integer keys it's performing really poorly with smaller data sets.

ordered_split_flatmap holds up until surprisingly large number of elements, probably due to good enough locality of reference, cheap comparisons and good algorithmic complexity, despite the cost of moving elements to make place for the new one. For strings it's still the unordered flatmaps that rule the performance graph until about 100 or so elements.

Removing elements

click for larger image
The Y-axis is the average time taken to one by one, in random order, remove every element in a map, starting with the size on the X-axis.

This one really surprised me at first, seeing how very poorly the ordered versions performed compared to how reasonably well they performed with insertion. But, with insertion in flat maps, the cost of occasionally reallocating the vector(s) to make space, is amortised over the elements. With erase, there is no such cost, so what is seen is the cost of maintaining the sort order. In every case, this seems to over shadow the algorithmic advantage of quickly finding the element.

I was also surprised to see how well the unordered versions held up to std::map<> and std::unordered_map<> with integer keys. For string keys, the advantage of unordered flat maps over std::unordered_map<> is minor except for very small data sets.

Iteration

click for larger image
This is a measurement of the time taken, per element, to iterate over all elements in a map.

It is no surprise that the flat maps outperform std::map<> and std::unordered_map<>. The flat maps have simple straight access patterns that are exceptionally cache friendly. Both std::map<> and std::unordered_map<> must chase pointers to find the elements. As the data set grows, this becomes an increasingly large burden.

I did not expect to see the split flat map versions outperform the co-located flat maps, especially given the nasty proxy iterators, but perhaps the CPU can offer an advantage with two parallel iterations?

Conclusion

Flat maps can outperform the maps from the standard library quite substantially, providing the data sets are small. There are several ways to implement them, of which 4 versions are measured here, and each of them have different pros and cons. Which ones to choose depends on your CPU, how much data you have, the comparison characteristics of your key type, and which operations are the most important.

This study is no where near exhaustive, but perhaps it can provide some pointers to which data structure you should prefer in your situation?

Saturday, June 3, 2017

constexpr quicksort in C++17

So I've written about compile time quick sort twice before (2011 and 2015,) but now when C++17 support is becoming available, I thought I'd try it again.

I'll be taking advantage of std::integral_constant<type, value> a lot. It has the advantage that it encodes the value in the type directly, while still allowing arithmetics as if it was a constant of the type. Unfortunately, it's rather much to write, so to save typing, a convenient variable template is introduced:

1
2
template <auto N>
std::integral_constant<decltype(N), N> c;

This brings a convenient short hand notation c<3U> meaning an lvalue of type std::integral_constant<unsigned int, 3U>. Line 1 shows template <auto>, which is a very handy new feature in C++17 for non-type template parameters, where the type is deduced. You can see example usage in the Compiler Explorer (thanks @mattgodbolt.)

Before we can go on to sorting, we need a way to express a range to sort, and to operate on those ranges. I've used a simple type_list template:

1
2
3
4
5
template <typename ... T>
struct type_list
{
    constexpr type_list(T...) {}
};

The constructor is there to take advantage of another new C++17 feature: automatic template parameter deduction. It's possible to write type_list{c<3>, c<8>} to construct an instance of type_list<std::integral_constant<int, 3>, std::integral_constant<int, 8>>. Here, (line 4 above) the actual values aren't used in the type_list, it's the types alone that are interesting. The values are just used to guide the compiler to deduce the types correctly. The same technique can be used in more conventional programming, for example std::pair{3, 'c'} constructs a std::pair<int, char> which holds the values 3 and 'c'.

Now we need a few functions to operate on the type lists:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
template <typename T, typename ... Ts>
constexpr auto head(type_list<T, Ts...>)
{
    return T{};
}

template <typename T, typename ... Ts>
constexpr auto tail(type_list<T, Ts...>)
{
    return type_list{Ts{}...};
}

template <typename ... Ts, typename ... Us>
constexpr auto operator|(type_list<Ts...>, type_list<Us...>)
{
    return type_list{Ts{}..., Us{}...};
}

Both tail() and the concatenating operator|() use the automatic template parameter deduction to construct the returned type_list. Here's a compiler explorer to play around a bit with them.

Now comes the matter of partitioning a type_list into two lists based on comparison with a pivot element. The easiest way to do this is a classic recursion:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
template <typename Compare, typename P, typename ... Ts>
constexpr auto partition(Compare compare, P pivot, type_list<Ts...> tl)
{
    if constexpr (sizeof...(Ts) == 0)
    {
        return std::pair(type_list{}, type_list{});
    }
    else
    {
        constexpr auto h = head(tl);
        constexpr auto r = partition(compare, pivot, tail(tl));
        if constexpr (compare(h, pivot))
        {
            return std::pair(type_list{h} | r.first, r.second);
        }
        else
        {
            return std::pair(r.first, type_list{h} | r.second);
        }
    }
}

if constexpr is a new C++17 construction (often referred to as constexpr-if,) that is a major blessing when doing template programming. Unlike an ordinary if statement, if constexpr only generates code for the branch selected by the constexpr condition.

Above, the else branch doesn't compile for an empty type_list<> tl, so an ordinary if statement would give a compilation error. In C++14 and earlier, it would be necessary to write two separate partition functions, one that matches the empty list, and one for the general case.

So, given a compare function, a pivot value, and a type_list<Ts...>, the function partition returns a pair of type lists. The first containing the Ts that are true for compare(pivot, t), and the second containing the other Ts. The compare function can be an ordinary lambda. In C++17, lambdas are constexpr (when possible.)

Checkout this Compiler Explorer to play with it.

Now all bits necessary for doing quick sort are in place, and it's an almost embarrassingly simple textbook-style recursive solution:


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
template <typename Compare, typename ... T>
constexpr auto sort(type_list<T...> tl, Compare compare)
{
    if constexpr (sizeof...(T) == 0)
    {
        return type_list{};
    }
    else
    {
        constexpr auto pivot = head(tl);
        constexpr auto r = partition(compare, pivot, tail(tl));
        return sort(r.first, compare) | type_list{pivot} | sort(r.second, compare);
    }
}

Here's a compiler explorer that converts the sorted type_list<> into a std::array<>, just to visualise the data generated. You may notice that the optimisation level chosen is -O0, and yet no code is generated to produce the sorted array.

As before, the usefulness of this is very limited, but it is kind of cool, isn't it?