Monday, August 7, 2017

Performance of flat maps

A flat map is a data structure that maps a key to a value, and that stores the data in a compact flat memory structure to get better cache hit rate. They can be substantially faster than hash tables and trees, for provided the data sets are small, but how small? There's also more than one way to implement a flat map, and what are the pros and cons of the alternatives?

I set out to compare 4 alternatives, and added C++ std::map<key, value> (which is a balanced binary tree,) and std::unordered_map<key, value> (which is a hash table) into the performance measurements. Note that most of what's mentioned here is language agnostic. C++ happens to be the language of choice for this post, but any language that allows access to contiguous blocks of memory would be equally suitable.

The contestants

My alternatives are the unordered_flatmap<key, value>, flatmap<key, value>, unordered_split_flatmap<key, value>, and split_flatmap<key, value>.

unordered_flatmap<key, value> 

This is the simplest of them all.

This is a map from integer to string, where key and value are stored next to each other. The coloured blocks in the background are illustrations of possible cache lines. In this illustration, each cache line holds two key,value pairs. A key is 1/8 of a cache line in size, and a value is 3/8 of a cache line in size.

Searching for a key in this structure is simply a matter of trying the keys one by one from the beginning.
An obvious disadvantage with this structure is the cost of searching for a key that isn't there.
All elements must be searched, as there is no way to know when to stop. Another less obvious disadvantage of this layout is that it is quite wasteful of the cache when it comes to lookup. In this case, as it traverses the elements to (possibly) find the desired key, it only takes advantage of 1/4 of the data read into a cache line. The memory occupied by the uninteresting values is wasted.

On the other hand, since there are no guarantees of order of elements in this structure, removal of an element, once found, can be made very efficient, by simply moving the last element into the spot no longer occupied by the removed element.
Above, key=8 is removed from the flat map.

Insertion is also fairly efficient, just add the new element last.
The insertion, however, requires first a failed find to ensure that the key is unique.

These linear searches may not be as bad as they first seem. Modern CPUs quickly spot a linear access pattern and speculatively pre fetch data into the cache.

ordered_flatmap<key, value>

This one has the same storage structure as before, but it always stores the keys in order.
Since the keys are stored in a sorted order, lookup can be made using binary search
This scales much better, since the number of accesses is log2(N), where N is the number of elements stored. It is especially good when it comes to lookup of a key that isn't there.
However, the disadvantage is that this access pattern kills the pre fetch of data into the cache since the CPU cannot predict where the next access will be.

To make matters worse, maintaining the sort order when doing insertion and deletion carries a non-trivial cost.
 Removing key=8 above, and inserting it again below.
There are a lot of move constructors involved above, but again, the access pattern for these are very predictable meaning that the move constructors almost certainly operate in hot caches.

Both of these maps above are really good when you want to iterate over all stored elements. Easily predictable pattern, and key/value are stored together.

unordered_split_flatmap<key, value>

This version addresses the waste of precious cache when searching, by storing keys and values in separate arrays, linked by the same index.
A lookup example makes the advantage obvious
There is no wasted cache space at all. Even though a failed lookup shares the characteristics of the unordered_flatmap<key,value>, the cache waste cost is 1/4, which can be a substantial saving.
Insertion and deletion works as with unordered_flatmap<key, value>, except it works on the two arrays instead of one.

Iteration is troubling in this one. There are two parallel sequential accesses, which is fine, but at least in C++, the precedent set by the standard library mandates that an iterator gives a std::pair<key, value> (or a reference to one,) and that doesn't work here since key and value are stored in completely different locations. You need proxy iterators, and those are not free from problems. Eric Niebler wrote about this in his 2015 blog post "To be, or not to be (an iterator)."

ordered_split_flatmap<key, value>

Here's the same as above, but with the keys, and thus values, in sorted order.
The idea is to combine the reduced cache waste in the lookup, while taking advantage of a logarithmic algorithm.

It is not obvious if this offers an advantage over the ordered_flatmap<key, value>, since pre fetch of data into the cache is killed anyway.

Other alternatives

There are many possible implementations of flat maps, and I felt like I had to limit myself. Alternatives I find interesting to explore are sorted with linear search, giving on the average twice as fast lookup in the failure case, and a hybrid ordered_split_flatmap<key, value> where only the keys are sorted, and contain an index to the value, which is stored in whichever location is free, offering fast lookup at the cost of non-linear access for iterations.

Observations on keys

Integer keys are always nice. They consume little memory space and all comparison operations on them are fast. Strings are interesting. One observation of note is that equality comparison is often fast when strings are unequal, since any sane implementation checks the length first, and if they are different there is no need to check the string contents, whereas a less-than comparison for a string is always expensive. Another interesting aspect of strings as keys, is that all modern C++ library implementations use a short string optimisation, where short strings are stored in the string object itself and long strings are stored on the heap and the string object points to it. So short strings provide great locality of reference whereas long strings are likely to result in cache misses.

The shoot out

All benchmarks are made on an Intel i7-4500U CPU @ 3GHz. It has 32K of L1-data cache, 32K of L1-instruction cache, 256K of L2 cache and 4M of L3 cache. The benchmark program is compiled with clang++-4.0 -O3 -DNDEBUG, -std=gnu++1z and libc++ where sizeof(std::string) is 24 bytes and short string optimisation allows for strings up to 22 characters of length without using the heap. sizeof(int) is 4 bytes.

The implementations are not super-tweaked for performance. They are implemented using std::vector<> and the std::find_if() algorithm from the standard library. I believe tweaking could improve the results a little bit, but the patterns would look the same because performance is mainly limited by data accesses and comparison operations.

In all measurements the keys are integers, short strings and long strings. The strings are a snapshot of the file names, presented in a random order. The long strings are the full paths, the short strings are the base name. A few of the short strings are too long for the short string optimisation, whereas most fit, just as a few of the long strings do in fact fit in the short string optimisation whereas most do not.

All measurements are also made with cold L1 and L2 caches.  The way this is done is that all elements in an array of 256K size is traversed after setup and before each measurement iteration. This is to see how much one cache hit helps another access. I intended to cool down the L3 cache as well, but 12h to run the benchmark with cold L2 cache was long enough. It would have taken days with the L3 cache cooled down.

The source code for the implementations and the benchmark can be found on github.com.

Lookup of known keys

click for larger image
The main surprise for me, here, is how good std:map<int, string> is for up to about 100 items, and how poor std::unordered_map<> generally is, unless you have around 100 items or more.

The ordered versions perform poorly for string keys until you have a few hundred of them. I believe this is not primarily due to cache effects, but rather because determining the string ordering relation is always an expensive operation whereas equality comparison is often cheap when unequal (since comparing size alone suffices.)

Lookup of unknown keys

click for larger image
I found it surprising to see how well the unordered versions performed for strings, as long as the number of elements are a few tens. For integer keys the ordered versions outperform std::unordered_map<> until somewhere between 256 and 512 elements.

Insertion of elements

click for larger image
The Y-scale is the total time taken to populate a map to that number of elements, and then divided by the number of elements, so it's an average of the time to populate the map.

Again, std::unordered_map<> needs a few hundred elements, and for integer keys it's performing really poorly with smaller data sets.

ordered_split_flatmap holds up until surprisingly large number of elements, probably due to good enough locality of reference, cheap comparisons and good algorithmic complexity, despite the cost of moving elements to make place for the new one. For strings it's still the unordered flatmaps that rule the performance graph until about 100 or so elements.

Removing elements

click for larger image
The Y-axis is the average time taken to one by one, in random order, remove every element in a map, starting with the size on the X-axis.

This one really surprised me at first, seeing how very poorly the ordered versions performed compared to how reasonably well they performed with insertion. But, with insertion in flat maps, the cost of occasionally reallocating the vector(s) to make space, is amortised over the elements. With erase, there is no such cost, so what is seen is the cost of maintaining the sort order. In every case, this seems to over shadow the algorithmic advantage of quickly finding the element.

I was also surprised to see how well the unordered versions held up to std::map<> and std::unordered_map<> with integer keys. For string keys, the advantage of unordered flat maps over std::unordered_map<> is minor except for very small data sets.

Iteration

click for larger image
This is a measurement of the time taken, per element, to iterate over all elements in a map.

It is no surprise that the flat maps outperform std::map<> and std::unordered_map<>. The flat maps have simple straight access patterns that are exceptionally cache friendly. Both std::map<> and std::unordered_map<> must chase pointers to find the elements. As the data set grows, this becomes an increasingly large burden.

I did not expect to see the split flat map versions outperform the co-located flat maps, especially given the nasty proxy iterators, but perhaps the CPU can offer an advantage with two parallel iterations?

Conclusion

Flat maps can outperform the maps from the standard library quite substantially, providing the data sets are small. There are several ways to implement them, of which 4 versions are measured here, and each of them have different pros and cons. Which ones to choose depends on your CPU, how much data you have, the comparison characteristics of your key type, and which operations are the most important.

This study is no where near exhaustive, but perhaps it can provide some pointers to which data structure you should prefer in your situation?

Saturday, June 3, 2017

constexpr quicksort in C++17

So I've written about compile time quick sort twice before (2011 and 2015,) but now when C++17 support is becoming available, I thought I'd try it again.

I'll be taking advantage of std::integral_constant<type, value> a lot. It has the advantage that it encodes the value in the type directly, while still allowing arithmetics as if it was a constant of the type. Unfortunately, it's rather much to write, so to save typing, a convenient variable template is introduced:

1
2
template <auto N>
std::integral_constant<decltype(N), N> c;

This brings a convenient short hand notation c<3U> meaning an lvalue of type std::integral_constant<unsigned int, 3U>. Line 1 shows template <auto>, which is a very handy new feature in C++17 for non-type template parameters, where the type is deduced. You can see example usage in the Compiler Explorer (thanks @mattgodbolt.)

Before we can go on to sorting, we need a way to express a range to sort, and to operate on those ranges. I've used a simple type_list template:

1
2
3
4
5
template <typename ... T>
struct type_list
{
    constexpr type_list(T...) {}
};

The constructor is there to take advantage of another new C++17 feature: automatic template parameter deduction. It's possible to write type_list{c<3>, c<8>} to construct an instance of type_list<std::integral_constant<int, 3>, std::integral_constant<int, 8>>. Here, (line 4 above) the actual values aren't used in the type_list, it's the types alone that are interesting. The values are just used to guide the compiler to deduce the types correctly. The same technique can be used in more conventional programming, for example std::pair{3, 'c'} constructs a std::pair<int, char> which holds the values 3 and 'c'.

Now we need a few functions to operate on the type lists:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
template <typename T, typename ... Ts>
constexpr auto head(type_list<T, Ts...>)
{
    return T{};
}

template <typename T, typename ... Ts>
constexpr auto tail(type_list<T, Ts...>)
{
    return type_list{Ts{}...};
}

template <typename ... Ts, typename ... Us>
constexpr auto operator|(type_list<Ts...>, type_list<Us...>)
{
    return type_list{Ts{}..., Us{}...};
}

Both tail() and the concatenating operator|() use the automatic template parameter deduction to construct the returned type_list. Here's a compiler explorer to play around a bit with them.

Now comes the matter of partitioning a type_list into two lists based on comparison with a pivot element. The easiest way to do this is a classic recursion:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
template <typename Compare, typename P, typename ... Ts>
constexpr auto partition(Compare compare, P pivot, type_list<Ts...> tl)
{
    if constexpr (sizeof...(Ts) == 0)
    {
        return std::pair(type_list{}, type_list{});
    }
    else
    {
        constexpr auto h = head(tl);
        constexpr auto r = partition(compare, pivot, tail(tl));
        if constexpr (compare(h, pivot))
        {
            return std::pair(type_list{h} | r.first, r.second);
        }
        else
        {
            return std::pair(r.first, type_list{h} | r.second);
        }
    }
}

if constexpr is a new C++17 construction (often referred to as constexpr-if,) that is a major blessing when doing template programming. Unlike an ordinary if statement, if constexpr only generates code for the branch selected by the constexpr condition.

Above, the else branch doesn't compile for an empty type_list<> tl, so an ordinary if statement would give a compilation error. In C++14 and earlier, it would be necessary to write two separate partition functions, one that matches the empty list, and one for the general case.

So, given a compare function, a pivot value, and a type_list<Ts...>, the function partition returns a pair of type lists. The first containing the Ts that are true for compare(pivot, t), and the second containing the other Ts. The compare function can be an ordinary lambda. In C++17, lambdas are constexpr (when possible.)

Checkout this Compiler Explorer to play with it.

Now all bits necessary for doing quick sort are in place, and it's an almost embarrassingly simple textbook-style recursive solution:


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
template <typename Compare, typename ... T>
constexpr auto sort(type_list<T...> tl, Compare compare)
{
    if constexpr (sizeof...(T) == 0)
    {
        return type_list{};
    }
    else
    {
        constexpr auto pivot = head(tl);
        constexpr auto r = partition(compare, pivot, tail(tl));
        return sort(r.first, compare) | type_list{pivot} | sort(r.second, compare);
    }
}

Here's a compiler explorer that converts the sorted type_list<> into a std::array<>, just to visualise the data generated. You may notice that the optimisation level chosen is -O0, and yet no code is generated to produce the sorted array.

As before, the usefulness of this is very limited, but it is kind of cool, isn't it?

Sunday, January 8, 2017

Higher order functions as an enabler for lazy evaluation

Yesterdays post about Generating lambdas for clarity and performance showed how to make use of higher order functions to improve clarity while giving the optimiser a chance to improve performance, but the example used retained the original inflexible design.

Here's a short recap. Given a struct Employee, there is a function that filters out developers based on their salary. The original code looked like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
struct Employee
{
  using Number = unsigned;
  Number id;
  unsigned salary;
  std::string title;
  std::string name;
};

using staff = std::vector<Employee>;

auto devs_who_make_at_least(const staff& s, unsigned floor)
{
  std::vector<staff::value_type const*> rv;
  for (auto& e : s)
  {
    if (e.salary >= floor && e.title == "Developer")
    {
      rv.push_back(&e);
    }
  }
  return rv;
}

The code is by no means difficult to understand, but it lacks in flexibility, and loops like this spread over many places in the code base becomes a maintenance liability.

Enter a kind of filter function template, that like the devs_who_make_at_least() function, returns a vector of pointers to the elements given, but using a provided predicate instead of a hard coded condition:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
template <typename Container, typename Pred>
auto elements_matching(Container const& c, Pred&& p)
{
  std::vector<typename Container::value_type const*> rv;
  for (auto& e : c)
  {
    if (p(e))
    {
      rv.push_back(&e);
    }
  }
  return rv;
}

This is more generic. It works on any STL-like container and accepts any predicate that works on the value type. A few sample higher order functions were introduced to select elements just as in the original example:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
auto earns_at_least(unsigned floor)
{
  return [floor](const Employee& e) { return e.salary >= floor; };
}

auto has_title(const std::string& s)
{
  return [s](const Employee& e) { return e.title == s; };
}

template <typename T, typename U>
auto pred_and(T&& t, U&& u)
{
  return [t,u](auto&& x) { return t(x) && u(x); };
}

Here pred_and() is used to create a new predicate from two others, returning true only if both contained predicates return true. Creating a vector of pointers to all developers that earn 1M or more simply becomes:

1
2
3
auto megadevs = elements_matching(employees,
                                  pred_and(earns_at_least(1000000), 
                                           has_title("Developer")));

This was also shown to give performance gains over the original code.

So far so good, but what if the desired result is not a vector of pointers? What if we just want to calculate the accumulated salary of those selected? It doesn't make sense to pay the cost of populating an array for that. What if not all developers are needed, but you just need to browse through the selected ones until some other criteria is met? Then it is unnecessary to have populated the vector with all of them.

Enter lazy evaluation. The idea is to create yet another level of indirection, and defer calling the predicate until the actual iteration is done.

What is wanted is something like this:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
auto megadevs = filter(employees,
                       pred_and(earns_at_least(1000000), 
                                has_title("Developer")));

// No iteration over employees yet
// No vector created.

uint64_t sum = 0;
for (auto i = megadevs.begin();// calls pred and increments until true or end
     i != megadevs.end();      // simple comparison
     ++i)                      // calls pred on incremented until true or end
{
  sum += i->salary;            // refers to element in employees
}

From this a design emerges. The type of megadevs, returned by the call to filter(), must have member functions begin() and end(). The iterators returned must have access to the iterators into employees, and must have access to the predicate.

Here's an outline for the type returned by filter():

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
template <typename Container, typename Pred>
class filter_t
{
  using CI = typename Container::const_iterator;
 public:
  class iterator;

  template <typename P>
  filter_t(Container& c_, P&& p_)
   : c(c_), p(std::forward<P>(p_))  { }

  iterator begin() {
    auto i = c.begin();
    auto e = c.end();
    while (i != e && !p(*i))
      ++i;
    return { i, e, p };
  }

  iterator end() { return { c.end(), c.end(), p }; }
private:
  const Container& c;
  Pred             p;
};

Each instance must refer to the original container, and must access the predicate, so these types are template parameters. The alias CI is just there for typographical purposes, making blog code shorter. The class iterator is deferred for a while. I'll get back to it soon. The interesting part is the member function begin(). We want it to create an iterator that refers to the first element in the container that fulfils the predicate, or end if there is no such element. It does this by checking the predicate and incrementing until true. The returned filter iterator must have the found iterator, the end, and the predicate, otherwise it cannot implement its increment operators.
The natural implementation would be to use std::find_if() instead of incrementing and checking manually, but this was unexpectedly slow.
Now we can look at the iterator class:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
template <typename Container, typename Pred>
class filter_t<Container, Pred>::iterator
{
public:
  using value_type = typename std::iterator_traits<CI>::value_type;
  using reference = typename std::iterator_traits<CI>::reference;;
  using pointer = typename std::iterator_traits<CI>::pointer;
  using difference_type = typename std::iterator_traits<CI>::difference_type;
  using iterator_category = std::forward_iterator_tag;

  iterator(CI b, CI e, Pred& p_) : iter(b), iend(e), p(p_) { }

  reference operator*() const { return *iter; }

  iterator& operator++()
  {
    ++iter;
    while (iter != iend && !p(*iter))
      ++iter;
    return *this;
  }

  iterator operator++(int)
  {
    iterator rv(*this);
    ++*this;
    return rv;
  }

  bool operator==(const iterator& rh) const
  {
    return iter == rh.iter;
  }

  bool operator!=(const iterator& rh) const
  {
    return iter != rh.iter;
  }
private:
  CI       iter;
  CI const iend;
  Pred&    p;
};

It's quite a code block, but most is obvious. The aliases on lines 5-9 are there to make the iterator cooperate with algorithms in the standard library. See std::iterator_traits<> for details.

The other interesting part is operator++() on lines 15-21. As when created from the begin() member function of filter_t<>, it searches for the next iterator in the referenced container until the predicate matches or the end is reached. Also here, a hand written loop is used, after performance measurements showed an unexpected need.

With this in place, the filter() function template becomes nearly trivial:

1
2
3
4
5
6
template <typename Container, typename Predicate>
inline filter_t<Container, std::remove_reference_t<Predicate>>
filter(Container& c, Predicate&& p)
{
  return { c, std::forward<Predicate>(p) };
}

Here's a good time to warn that this is a proof of concept code, intended to show the idea. For real world use, you would for example need variants with mutable access and variants that take ownership of the container if it is an rvalue, otherwise you cannot safely chain operations. These variants change nothing in performance, but they do make the implementation rather more complex.

Now is a good time to see if this did any good. A slight modification is made from the sample program of yesterday. The program populates a vector of 5000 employees. 1/4 of them with title = "Developer", and an unrealistically uniform random distribution salary in the range 50000 - 250000. With those the program did 500000 loops, each filtering out the developers that make 150000 or more and calculates the accumulated salary of the first 500.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
auto devs = filter(employees,
                   pred_and(earns_at_least(150000), 
                            has_title("Developer")));

int count = 500;
uint64_t sum = 0;
for (auto & d : devs)
{
  sum += d.salary;
  if (--count == 0) break;
}

Using g++ 6.2 and -O3, perf yields this result:

 Performance counter stats for './a.out':

   10 814 982 186      branch-instructions                                         
      752 594 101      branch-misses             #    6,96% of all branches         
   30 224 548 004      cpu-cycles                                                  
   29 445 937 781      instructions              #    0,97  insn per cycle         

     10,177892229 seconds time elapsed

Slightly rewriting the test program to add using the vector version from yesterday as:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
auto devs = elements_matching(employees,
                              pred_and(earns_at_least(150000), 
                                       has_title("Developer")));

int count = 500;
uint64_t sum = 0;
for (auto & d : devs)
{
  sum += d->salary;
  if (--count == 0) break;
}

gives the result:

 Performance counter stats for './a.out':

   14 881 795 600      branch-instructions                                         
    1 061 800 168      branch-misses             #    7,13% of all branches         
   45 275 693 140      cpu-cycles                                                  
   47 380 071 935      instructions              #    1,05  insn per cycle         

     15,215398962 seconds time elapsed

Here the advantage of lazy evaluation becomes obvious. We're only interested in the first 500 matches, so not populating a vector with the remaining is a gain. However, in all fairness, it should be said that there is a cost to this generality. The filter iterator shows much worse branch prediction results, and for large data sets, this can be noticeable.

There it is. Higher order functions are by no means a necessity for lazy evaluation, but when you have them, it's not a huge job to implement, and using it becomes natural. The lazy evaluation can give large performance gains, but there is an added complexity which may back fire.

Expanding further on this sooner or later leads to a range library, but that is another story.

Saturday, January 7, 2017

Generate lambdas for clarity and performance

Higher order functions, functions that operate on other functions or returns functions, are familiar to those who have had some experience with functional programming, but they often seems magical to those who have not. Some of those with experience of using higher order functions have a gut feeling that they are expensive to use and prefer to avoid them.

[ edited 2017-01-08: added performance numbers for gcc at the bottom of the post ]
[ edited 2017-01-14: Perfect forwarding of functions in higher order function ]

Take this somewhat silly traditional C++ code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
struct Employee
{
  using Number = unsigned;
  Number id;
  unsigned salary;
  std::string title;
  std::string name;
};

using staff = std::vector<Employee>;

auto devs_who_make_at_least(const staff& s, unsigned floor)
{
  std::vector<staff::value_type const*> rv;
  for (auto& e : s)
  {
    if (e.salary >= floor && e.title == "Developer")
    {
      rv.push_back(&e);
    }
  }
  return rv;
}

It's not very complex, and most C++ developers understand what it does in a few seconds. If this is the only place in the entire code base that elements are selected from a vector of Employees, then all is well. But what if there are several places? The code becomes cluttered with almost identical copies of the same loop. This screams algorithm. None in the standard library is a perfect fit, though. Let's hand roll a simple one:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
template <typename Container, typename Pred>
auto elements_matching(Container const& c, Pred&& p)
{
  std::vector<typename Container::value_type const*> rv;
  for (auto& e : c)
  {
    if (p(e))
    {
      rv.push_back(&e);
    }
  }
  return rv;
}


It's not ideal, but it works with any container type and returns a vector with pointers to the elements in the container that matches the predicate.

With this, the devs_who_make_at_least() function becomes simple:


1
2
3
4
5
6
7
auto devs_who_make_at_least(const staff& s, unsigned floor)
{
  auto pred = [floor](const Employee& e) {
    return e.salary >= floor && e.title == "Developer";
  };
  return elements_matching(s,pred);
}

This is cool, but what if there's more than one place in the code where you want to select employees based on salary or title?

C++14 introduced the auto return type, which is great for writing functions that generates lambdas.

Take for example:

1
2
3
4
5
6
7
8
9
auto earns_at_least(unsigned floor)
{
  return [floor](const Employee& e) { return e.salary >= floor; };
}

auto has_title(const std::string& s)
{
  return [s](const Employee& e) { return e.title == s; };
}

Each of the above functions returns a lambda that has captured the function parameter. These can be quite expressively used with the elements_matching algorithm introduced earlier:

1
2
3
4
staff employees;

devs = elements_matching(employees, has_title("Developer"));
megaearners = elements_matching(employees, earns_at_least(1000000));

This is powerful on its own right. The code is easy to read. Most, even very inexperienced C++ developers, grasp the intent directly, even if the mechanism may be unclear.

But what if we want more than one criteria, like selecting developers with a certain salary? Let's introduce an and higher order function, that returns true if two contained predicates both return true.

1
2
3
4
5
6
template <typename T, typename U>
auto pred_and(T&& t, U&& u)
{
  return [t = std::forward<T>(t),u = std::forward<U>(u)](auto&& x)
         { return t(x) && u(x); };
}

This function template pred_and(), creates a new predicate using two predicates. It will be true if and only if both t(x) and u(x) evaluates to true. It naturally short circuits the logic, so that if t(x) is false, u(x) is never evaluated.

Now finding the super well paid developers who make more that 1M becomes so easy it doesn't even need a separate function anymore.

1
2
3
auto megadevs = elements_matching(employees,
                                  pred_and(earns_at_least(1000000), 
                                           has_title("Developer")));

So what about performance, then? Surely this magic has a cost in confusing the optimiser?

I created a small test program that populates a vector of 5000 employees. 1/4 of them with title = "Developer", and an unrealistically uniform random distribution salary in the range 50000 - 250000. With those the program did 500000 loops, each filtering out the developers that make 150000 or more. This was build with clang++ 3.9.

The output from the linux tool perf are,

first from the hand made loop at the top of this post:

 Performance counter stats for './a.out':

   31 078 321 937      branch-instructions                                  
    2 055 192 607      branch-misses             #    6,61% of all branches         
   97 906 147 917      cpu-cycles                                           
  111 202 137 490      instructions              #    1,14  insn per cycle  

     32,974801101 seconds time elapsed

Then from using the algorithm and higher order function:

 Performance counter stats for './a.out':

   14 258 216 181      branch-instructions                                  
    1 513 180 390      branch-misses             #   10,61% of all branches         
   54 898 012 739      cpu-cycles                                           
   42 665 314 382      instructions              #    0,78  insn per cycle  

     18,747075752 seconds time elapsed

I must admit I was puzzled by this huge performance advantage of using higher order functions, especially when there has been no attempt to optimise it at all. It turns out there's a small mistake in the hand written loop. It compares the member .title, which is a std::string with the string literal "Developer". This means it must make character by character comparison. The higher order function has_title() captures a std::string, and equal comparison of std::string begins with checking their lengths. If the lengths are different there's no reason to look at the characters. That is the only reason I've seen for the huge gain.

Changing the hand written loop to compare title with a std::string gives this result:

 Performance counter stats for './a.out':

   13 971 299 475      branch-instructions                                  
    1 449 744 816      branch-misses             #   10,38% of all branches         
   51 964 400 535      cpu-cycles                                           
   39 992 503 929      instructions              #    0,77  insn per cycle  

     17,448599597 seconds time elapsed

So, it is better performing. Not much. But it is. However, the hand written loop gets copied all over the code base, the algorithm and higher order functions do not. The performance bug could've been made in the has_title() function as well, but it would be one place to fix, not many. Likewise, the elements_matching() algorithm could be optimised, for example with partial loop unrolling to do several comparisons per revolution. That too would be an effort spent once in one place, and not all over the code.

That was the results with clang++-3.9. Let's see how g++ 6.2 fares in comparison. First the higher order function version:

 Performance counter stats for './a.out':

   14 006 122 557      branch-instructions                                      
    1 027 464 337      branch-misses             #    7,34% of all branches         
   43 066 722 628      cpu-cycles                                               
   44 394 114 000      instructions              #    1,03  insn per cycle      

     14,505986613 seconds time elapsed

And then the hand written loop, with the fix to compare with std::string:

 Performance counter stats for './a.out':

   14 039 057 224      branch-instructions                                       
    1 852 376 974      branch-misses             #   13,19% of all branches         
   58 797 998 124      cpu-cycles                                                
   40 858 411 471      instructions              #    0,69  insn per cycle       

     19,743206041 seconds time elapsed

Here it's clear that gcc does a slightly worse job than clang with optimising the hand written loop, but does a substantially better job at optimising the higher order function version.

As can be seen, the exact result does vary between compilers, but the chances of losing any performance to speak of when using higher order functions are small, and there is the potential for very noticeable performance gains.

So, use algorithms and use higher order functions, for code clarity and for performance.