Monday, August 3, 2015

Cache optimizing a priority queue


I must begin by saying that if you found this because you have a performance problem, you should almost certainly look elsewhere. It is highly unlikely that your performance problem is caused by your priority queue. If, however, you are curious, or you have done careful profiling and found out that the cache characteristics of your priority queue are causing your performance problem, and you cannot fix that by altering your design, by all means read on.

A priority queue is typically implemented as a binary heap. The std::priority_queue<> class template in the C++ standard library is such an example. There is a reasonably good explanation for how they work on Wikipedia, but I'll go through some operations anyway since it leads naturally to the optimization process.

The heap is a partially sorted tree-like structure. Below is a heap with the letters 'a'-'j'. A parent node always has higher priority than its children. In this example, 'a' has the highest priority. There is no order between the children, though. Either can have higher priority.


The numbers next to the nodes are the node indexes. An interesting property of the binary heap is that the index of a parent is always half of the index of a child (rounded down.) Another interesting property is that the heap is always packed towards lower indexes. There are never any holes in the structure.

If a new node is added, space is made for index 11. Then the new value is compared with that of the parent, i.e. index 5. If the new value has higher priority, the parent is moved to index 11, and the new value is compared with the parent of index 5, i.e. index 2. If index 2 has higher priority, the new value is inserted at index 5 and the insertion is done.
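To make the procedure concrete, here is a minimal sketch of such an insertion (often called a sift-up), assuming a 1-based std::vector<T> where index 0 is an unused dummy slot and where a smaller value means higher priority. The names are mine, not taken from any particular implementation.

#include <cstddef>
#include <utility>
#include <vector>

template <typename T>
void heap_push(std::vector<T>& heap, T value)
{
  heap.push_back(value);                  // make room at the new last index
  std::size_t idx = heap.size() - 1;
  while (idx > 1 && value < heap[idx / 2])
  {
    heap[idx] = std::move(heap[idx / 2]); // move the lower priority parent down
    idx /= 2;
  }
  heap[idx] = std::move(value);           // insert where the search stopped
}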

Removing the highest priority element is more involved. When 'a' is removed above, index 1 becomes free, which is forbidden, so the child with the highest priority is moved into it. So 'b' is moved to index 1 and index 3 becomes free. The child of index 3 with the highest priority is 'c', which is moved to index 3, making index 6 free. Since index 6 has no children, the phase shifts to insertion of the last element at index 6. The last element is compared with the parent of index 6, i.e. index 3. Since index 3 (which now holds 'c') has higher priority than 'j', the job is done and 'j' is moved from index 10 to index 6. Index 10 becomes free, but since it is the last element, that is legal; the heap simply shrinks so that index 9 becomes the last element.
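Continuing the sketch above, a pop along the lines just described could look as follows: first the hole left by the root is moved down by promoting the higher priority child, then the previously last element is sifted up into the hole. Again a sketch under the same assumptions, not the code benchmarked below.

// precondition: the heap is not empty (heap.size() > 1)
template <typename T>
T heap_pop(std::vector<T>& heap)
{
  T top = std::move(heap[1]);
  if (heap.size() == 2)                    // only the root was left
  {
    heap.pop_back();
    return top;
  }
  T last = std::move(heap.back());
  heap.pop_back();
  std::size_t size = heap.size() - 1;      // number of remaining elements
  std::size_t idx = 1;
  while (2 * idx <= size)                  // phase 1: move the hole to a leaf
  {
    std::size_t child = 2 * idx;
    if (child + 1 <= size && heap[child + 1] < heap[child]) ++child;
    heap[idx] = std::move(heap[child]);    // promote the higher priority child
    idx = child;
  }
  while (idx > 1 && last < heap[idx / 2])  // phase 2: sift the last element up
  {
    heap[idx] = std::move(heap[idx / 2]);
    idx /= 2;
  }
  heap[idx] = std::move(last);
  return top;
}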

Both insertion and popping the top element are O(log2(n)).

The clever thing about this is that it fits very well in an array. The above heap could as well have been drawn as below.


Traversing the nodes by doing division or multiplication by 2 on indexes is very fast compared to following pointers in a traditional tree structure.
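In code, the navigation is nothing but arithmetic on the index (a tiny sketch, again with 1-based indexing):

#include <cstddef>

constexpr std::size_t parent_of(std::size_t i)      { return i / 2; }
constexpr std::size_t left_child_of(std::size_t i)  { return 2 * i; }
constexpr std::size_t right_child_of(std::size_t i) { return 2 * i + 1; }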

A drawback, however, is that when the number of nodes becomes large, the distance between parent and child becomes such that only the ones nearest the root share a cache line. Assume for the sake of illustration that each cache line can hold 8 nodes. With a large heap, the result will become as illustrated below, where each color is one cache line in size.


The top 3 levels are fine, as they are all in one cache line, but after that each step towards a child or towards a parent is on a new cache line. This can become a performance problem since the caches are severely underutilized. You read a whole cache line only to access 1/8 of it. If you pop the highest priority element, the first cache-line will get accesses to at least 5/8 of its data, which is good, but the rest will get at most 2/8 accessed.

This leads to the idea of another arrangement, a heap of mini-heaps, or B-heap as it is also called. There is a brief explanation on Wikipedia.

If the mini-heaps hold 7 elements, the cache graph above instead becomes like this.


The grey triangles each represent a 7 element mini-heap. This is better, but not that good, and the index calculations become very complicated. Within each mini-heap, you do the multiplication or division by two, but the index of the parent of the top of a mini-heap and the indexes of the children in the last row of a mini-heap are calculated very differently. Both involve modulo 7 calculations, which are not very fast. I implemented this and ran some benchmarks, but it was slower than std::priority_queue<>, despite having substantially fewer cache misses, which was rather disappointing.

Another idea is to deliberately waste space: although each mini-heap still holds only 7 values, it consumes the space of 8 values and thus a whole cache line. The result is illustrated below.


This is much better. The lower image illustrates the layout in an array, where the numbers to the left are the array offsets to the beginning of each mini-heap. Offset 0 in each mini-heap is not used, which makes the parent/child calculations within a mini-heap multiplication or division by 2. With a mini-heap size of 8 (or any power of two) the checks for mini-heap root or last child in a mini-heap become simple bit masking operations, which are very cheap. Calculating hops between mini-heaps is more elaborate, but is limited to multiplications or divisions by 8, and those are fast.
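A sketch of the kind of index tests a power-of-two mini-heap size allows, assuming 8-slot blocks where offset 0 of every block is unused. The helper names are mine and the actual implementation may differ in the details.

#include <cstddef>

constexpr std::size_t block_size = 8;             // slots per mini-heap, a power of two

constexpr bool is_block_root(std::size_t i)       // local offset 1
{ return (i & (block_size - 1)) == 1; }

constexpr bool is_block_bottom_row(std::size_t i) // local offset 4..7
{ return (i & (block_size / 2)) != 0; }

constexpr std::size_t block_number(std::size_t i) // a division by 8
{ return i / block_size; }

constexpr std::size_t parent_in_block(std::size_t i) // valid below the block root
{ return (i & ~(block_size - 1)) + (i & (block_size - 1)) / 2; }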

In the above case, inserting a new element is guaranteed to touch at most 2 cache lines, even if the added element will replace the root. Compare with the first binary heap layout which would touch 4 cache lines to replace the root. Also, the more elements in the B-heap, the better it becomes, because of the huge fan-out. Each mini-heap has 8 mini-heap children, so the number of cache lines to touch when traversing between leaf and root is log8(n).

After implementation, I ran a number of performance measurements. The graphs presented here are generated on an Intel i7-4500U CPU running Ubuntu 15.04. The test programs are compiled with g++ 4.9.3 using the flags -O3 and -std=c++14. The CPU has a 32K L1d cache, and each cache line is 64 bytes.

Of the different measurements made, the most telling one is where a heap is filled to a certain number of elements, then operated on. In pseudo-code:

for (size) queue.push() 
for (cycles) {
  for (n) queue.push()
  for (n) queue.pop() 
}
I have rather arbitrarily chosen 'n' to be 320. It's large enough to exercise the memory a bit. Pushing is done from a large array pre-populated with random numbers.

The first measurement is a priority queue of ints. There is room for 16 elements in each cache line, so measurements are made with mini-heaps of 8, 16, 32 and 64 elements. The graph displays the relative speed of the B-heaps with different mini-heap sizes over std::priority_queue<int>.

Two interesting observations here. It's always faster than std::priority_queue<int>, and the size of the mini-heaps does not seem to matter much.

However, a priority queue of just integers is almost always uninteresting. Usually the priority is associated with something, a message for example. A reasonable implementation is to let the priority queue hold a std::pair<int, std::unique_ptr<msg>>, and prioritize on the int.


This is wild and difficult to say anything about. There are even examples of worse performance than std::priority_queue<>. I do not understand what causes this.

This leads to another idea, though. The priority queue could hold two data structures: one the B-heap of priorities, the other an array holding the pointers. They are connected via the indexes. There are two advantages to this. One is that it keeps the B-heap of priorities tight. Popping the top object always involves moving the highest priority child up, so reading the value of the lesser prioritized child is a wasted read and can evict more valuable data from the L1 cache. With a tighter B-heap structure that waste is reduced. Also, the structure holding the pointers is only touched when actual moves are made; it is not touched during the priority tests. Together these are likely to give better performance.
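A sketch of the split (the type and member names are mine; the real implementation differs):

#include <vector>

template <typename Priority, typename Payload>
struct split_b_heap
{
  std::vector<Priority> priorities;  // tight B-heap layout, read on every comparison
  std::vector<Payload>  payloads;    // parallel array, written only when elements move
};

Both vectors use the same indexing, so whenever a priority moves from one slot to another, the payload in the corresponding slot moves with it.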


Well, it is still all over the place, but at least it's now consistently outperforming std::priority_queue<>. We can also see that a mini-heap size of only 8 elements is consistently bad. I expected a mini-heap size of 16 elements to be the best performer since 16 ints take up exactly one cache line, but this is obviously not the case. The bigger mini-heaps outperform it most of the time. I do not have an answer for why this is so.

Yet another idea is to continue with the separate storage for the values, but keep the unique_ptr<> in an array outside the priority queue, and use the index into that array as the data associated with the priority. This has two advantages in addition to the advantages outlined above - it further shrinks the data set being operated on, and ints are very cheap to move, relative to unique_ptr<>.
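In sketch form the arrangement could look something like the fragment below; msg is the message type from the earlier example, and prio_queue with its push() member is a hypothetical stand-in for the real queue interface, not the actual API.

#include <memory>
#include <utility>
#include <vector>

std::vector<std::unique_ptr<msg>> slots;    // the pointers stay put out here

void send(prio_queue<int, int>& q, int priority, std::unique_ptr<msg> m)
{
  slots.push_back(std::move(m));            // store the pointer in its slot
  q.push(priority, int(slots.size() - 1));  // the queue only moves cheap ints
}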


This is consistently much faster than std::priority_queue<std::pair<int,int>>. It is still difficult to say much about the preferred mini-heap size, except that 8 is too small and suffers, and that my idea of 16 being the optimum was wrong.

Source code for the priority queue, including the benchmark program, is available from github.

Friday, June 5, 2015

Performance observations on a C++ vector of lambdas

[ edit 2015-Jun-7 : The source code is available on github ]

When writing unit tests, you typically don't care much about execution speed, but compile time performance is important. After all, if building your unit test program takes 3 minutes, Test Driven Development becomes so painful it requires superhuman determination, but if the build takes 3 seconds, TDD becomes a joy.

I became interested in this in part from observations when making the Trompeloeil header only C++14 mocking framework, and also from the interesting blog post "Template Code Bloat Revisited: A Smaller make_shared" by Jason Turner (@lefticus.) Here's hoping that the information presented can aid other developers of C++ unit test tools in shortening the test program build times.

With the set up in place, I decided to measure run time performance as well, since the extra effort was very minor and makes the findings interesting to a much wider audience than just makers of unit test tools.

Setup


This simple lambda is repeated 1000 times in the source.

[](int n) { return n + __COUNTER__; }

__COUNTER__ is not needed, but it makes sure each function body differs a bit, making it more difficult for the compiler to be clever and just make one lambda type.

The lambdas are stored in a std::vector<>. There are two representations of the vector. The simple and straightforward representation is:
std::vector<std::function<int(int)>> v;
The less straightforward representation uses an interface class:
template <typename sig>
struct func;

template <typename R, typename ... A>
struct func<R(A...)>
{
  virtual ~func() = default;
  virtual R call(A...) = 0;
};

template <typename Lambda, typename R, typename ... A>
struct func_n : public func<R(A...)>
{
  template <typename U>
  func_n(U&& u) : t(std::forward<U>(u)) {}
  R call(A... a) override { return t(a...); }
  Lambda t;
};

std::vector<std::unique_ptr<func<int(int)>>> v;
This version has two variants. The first is to create a std::unique_ptr<func<int(int)>> and populate the vector with it. The second is to create a std::unique_ptr<func_n<Lambda, int, int>> and let the converting constructor of std::unique_ptr<> do the work. The second version is inspired by the blog post by Jason Turner (@lefticus,) where the idea is that the more specialized std::unique_ptr<> is costly to create and convert from, even though it is never needed.
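In sketch form, building on the func and func_n templates above (the variable names here are mine), the two variants differ only in the static type of the temporary std::unique_ptr<>:

auto lambda = [](int n) { return n + 1; };
using L = decltype(lambda);

// variant 1: create the pointer with the interface type directly
v.push_back(std::unique_ptr<func<int(int)>>(new func_n<L, int, int>(lambda)));

// variant 2: create a pointer to the concrete type and let the converting
// constructor of std::unique_ptr<> perform the upcast
v.push_back(std::unique_ptr<func_n<L, int, int>>(new func_n<L, int, int>(lambda)));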

To complicate things further, there are separate measurements of populating the vector with .push_back() and with .emplace_back(). The idea is that the potential run time performance gain with .emplace_back() comes with a compile time penalty.
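For the std::function<> representation the difference boils down to where the conversion from lambda to std::function<int(int)> happens (a minimal illustration, with v here being the std::vector<std::function<int(int)>> from the first representation):

// push_back: a std::function<int(int)> temporary is created from the lambda
// in the calling code, then moved into the vector
v.push_back([](int n) { return n + 1; });

// emplace_back: the lambda is forwarded and the std::function<int(int)> is
// constructed in place inside the vector
v.emplace_back([](int n) { return n + 2; });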

The vector is not pre-reserved to capacity, which should not matter for compile time or evaluation time performance, but almost certainly impacts population time performance.

The contenders in this shootout are g++4.9, g++5.1 and clang++3.6. clang++ uses libstdc++ from g++4.9. I have measurements with clang++3.6 using libc++, but they came out so exceptionally unflattering that I suspect my libc++ installation is flawed somehow.

The machine the measurements are made on is an Intel X980 i7 @ 3.33 GHz with 12 GB RAM, running x86_64 Gentoo.

Compile time measurements are from a single run of each type since they take several seconds. The run time measurements are repeated 20 times, and the shortest time found is used, hoping that it has the least disturbance from other activities on the computer.

Compilation performance


Here the time (in seconds) required to compile the single source file program is measured.


Unoptimized builds


All builds are with -std=c++14 -ggdb

push_back
It is clear that compilation time with std::function<sig> is roughly twice as long as with the hand crafted function class template, regardless of compiler. The expected difference between creating a std::unique_ptr<> to the interface and creating a std::unique_ptr<> to the derived template and converting is nowhere to be seen.

emplace_back
Here it is obvious that std::function<sig> suffers even more when emplace_back() is used, now taking more than 3 times as long to compile as the unique_ptr<> to hand crafted interface alternative. For the latter, the compile times don't seem affected.

Optimized builds


All builds are with -std=c++14 -O

push_back
This one is weird. For g++4.9 and g++5.1, compilation times are slightly shorter with optimized builds using std::function<> than with unoptimized builds. clang++3.6 on the other hand needs quite a long time to compile the source. Again the expected difference between std::unique_ptr<> to the interface and std::unique_ptr<> to the derived class is not seen.

emplace_back
Clang++3.6 works very hard with optimized builds using std::function<>. It looks bad at a glance, but read the numbers - it shoots out above the graph, an order of magnitude above that of the g++ compilers, requiring a whopping 338 seconds to compile the source. The std::unique_ptr<> alternatives seem unaffected by optimization. For g++ there is extra suffering compared to push_back() when using std::function<>, even more so than with unoptimized builds.

Populate time


Here the time needed to populate the vector with the 1000 lambdas is measured. The program is run 20 times, and the time from the fastest run is shown. Only optimized builds are measured.

push_back (µs)
It is not obvious if the differences are in the compilers or in the library versions, but given that clang++ uses libstdc++ from g++4.9 and shows very similar performance, I'm leaning towards a library improvement for std::function<> in g++5.1 being shown.


emplace_back (µs)
The expected performance boost of emplace_back() does turn out to be a disappointing performance pessimization when using g++. I guess item 42 in Effective Modern C++ is correct that while emplacement can improve performance, it might not always do so. With clang++3.6 a performance improvement can be seen, but it is very minor. However, given that the vector capacity was not pre-reserved, the move constructor called when growing the vector is likely to dominate. Perhaps the improvement with clang++3.6 would be greater with pre-reserved capacity.

Evaluation performance


Here the time is measured to call every lambda in the vector. Again, only optimized builds are measured, and the time shown is from the fastest run of 20.
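The loop being timed is, in essence, something like this (a sketch; the exact arguments used in the real benchmark may differ):

int sum = 0;
for (auto& f : v) sum += f(sum);          // std::function<int(int)> version
// for (auto& f : v) sum += f->call(sum); // std::unique_ptr<func<int(int)>> version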

evaluate all (µs)
Here std::function<> shines. I have not tried to find the cause for the performance difference, but I am guessing that the improvement comes from better locality of reference if these small lambdas can be stored in the std::function<> objects themselves, instead of being spread all over the heap as is the case with the std::unique_ptr<> implementations.

Conclusions


Estimating performance is difficult; there is no other way to put it. That enabling optimizations can reduce compilation time, as is shown with g++ and push_back() of std::function<>, is baffling. That emplacement slows down compilation time is perhaps not surprising, but neither is the size of the cost obvious. std::function<> is very fast to call, at least in this example, but it is obviously costly to create both in run time and compilation time. Clang++ choking completely on optimized builds with std::function<> is rather unexpected, and that it doesn't show any run time performance advantage for it is disappointing.

The expected conversion cost from unique_ptr<derived> to unique_ptr<base> is not seen at all, but perhaps the situation would be different if the source was split into several files.

The only things clear are that std::function<> is slow to compile and fast to call.

Sunday, January 18, 2015

Compile time quicksort in idiomatic modern C++

A contender for the most useless program ever written just got a much needed overhaul. In 2011 I wrote about compile time quick sort as a challenge to myself, and as an exercise in the then newfangled variadic templates. Now, having worked with C++11 for a few years, and lately also C++14, I see that the code I wrote then is unnecessarily clumsy and highly unidiomatic.

In C++14 the data representation is easy. std::integer_sequence<T, ts...> fits perfectly. It really is the same as I used in 2011, but this time it's provided by the standard library.

Output

In order to show sequences, an output streaming operator is handy:
template <typename T, T ... ts>
std::ostream& operator<<(std::ostream& os,
                         const std::integer_sequence<T, ts...>&)
{
  os << "{ ";
  static const T values[] { ts... };
  std::copy(std::begin(values),
            std::end(values),
            std::ostream_iterator<T>(os, " "));
  return os << "}";
}
Note that the value of the integer sequence is not used. All information is in the type.  That's why filling the values in a static array makes sense - the output stream operator function will be instantiated for each different sequence being output by a program.

Concatenating sequences

An important problem to solve is that of concatenating sequences. It will be needed in quick sort itself, since it concatenates the sorted sub-sequences.

First an introduction of the concat template, accompanied by an alias for making life easier.
template <typename ... T>
struct concat;

template <typename ... T>
using concat_t = typename concat<T...>::type;
The alias concat_t<T...> is a convenience. Instead of having to write using a = typename concat<T...>::type, you just write using a = concat_t<T...>.

In some ways I dislike the naming convention (I'd like it reversed,) but it follows that of the C++14 standard library, and that is not unimportant. Regardless, I argue that the latter is not just less typing, but is easier to read too.

Now for the partial specializations.

As so often in template meta programming, the solution is recursive. First concatenating a single integer sequence is the trivial base case.
template <typename T, T ... ts>
struct concat<std::integer_sequence<T, ts...> >
{
  using type = std::integer_sequence<T, ts...>;
};
In a way this is even unnecessarily complicated. You can say that concatenating one type (any type) is that type itself. That would work, but I choose to make partial specializations for std::integer_sequence<> only, since it makes it slightly easier to catch stupid programming mistakes. It also future-proofs concat, should I ever feel a need to concatenate other types.

The general case, concatenating two integer sequences and some unknown tail, is the same as concatenating the combination of the two integer sequences with the tail.
template <typename T, T... s1, T... s2, typename ... Tail>
struct concat<std::integer_sequence<T, s1...>,
              std::integer_sequence<T, s2...>,
              Tail...>
{
  using type = concat_t<std::integer_sequence<T, s1..., s2...>, Tail...>;
};
Note how the convenience alias concat_t is useful even in the implementation of concat itself.

Let's see if this works, using a simple program:
int main()
{
  using s1 = std::integer_sequence<int, 1, 2, 3>;
  using s2 = std::integer_sequence<int, 4, 5, 6>;
  using s3 = std::integer_sequence<int, 7, 8, 9>;
  std::cout << concat_t<s1, s2, s3>{} << '\n';
}
Note that even though the output stream operator for std::integer_sequence<> doesn't use the value itself, a value is nonetheless needed, so an instance is created.

The output of running the program is:
{ 1 2 3 4 5 6 7 8 9 }
so that is good.

Partitioning

Quick sort works by partitioning the sequence into a sequence of the values less than a selected pivot element, and the rest, and then recursively sorting those partitions. So partitioning a sequence into two sequences based on a predicate is important.

Here too, the solution is recursive. The base case of partitioning an empty sequence is trivial:
template <template <typename V, V> class pred, typename T>
struct partition;

template <template <typename V, V> class pred, typename T>
struct partition<pred, std::integer_sequence<T>>
{
  using incl_type = std::integer_sequence<T>;
  using excl_type = std::integer_sequence<T>;
};
After all, the number of elements selected by the predicate from the empty sequence is zero, and likewise for those rejected by it.

The general case is not much more difficult:
template <template <typename V, V> class pred, typename T, T t, T ... ts>
struct partition<pred, std::integer_sequence<T, t, ts...>>
{
  using incl = std::integer_sequence<T, t>;
  using excl = std::integer_sequence<T>;

  using tail = partition<pred, std::integer_sequence<T, ts...> >;

  static const bool outcome = pred<T, t>::value;
  using incl_type = concat_t<std::conditional_t<outcome, incl, excl>,
                             typename tail::incl_type>;
  using excl_type = concat_t<std::conditional_t<outcome, excl, incl>,
                             typename tail::excl_type>;
};
As always in C++ template meta programming, using alias names for things is a must to keep the code anywhere near readable.

Above, tail is the result of partitioning the rest of the values. std::conditional_t<selector, TrueType, FalseType> results in either TrueType or FalseType depending on the boolean value of the selector.

If we assume that the outcome of the predicate on t (the first element in the sequence) is true, then incl_type becomes the concatenation of std::integer_sequence<T, t> and incl_type from tail, otherwise it becomes the concatenation of the empty sequence and incl_type from tail.

Since this implementation builds both the included sequence and the excluded sequence in parallel, the convention of a convenience alias partition_t<> is of little use. It would be possible to make a filter template instead, that just provides a sequence of selected elements, but then filtering would have to be done twice, the second time with the inverted predicate, and even though we're in compile time computation, we care about performance and avoid two-pass implementations if we can, right?

Let's see how this partitioning works, with another simple program:
1 template <typename T, T v>
2 struct is_odd : std::integral_constant<bool, v & 1> {};
3
4 int main()
5 {
6   using seq = std::integer_sequence<int, 1, 2, 3, 4, 5, 6>;
7   using elems = partition<is_odd, seq>;
8   std::cout << elems::incl_type{} << ' ' << elems::excl_type{} << '\n';
9 }
Lines 1-2 define the simple predicate that tells whether an integral value is odd. Line 7 partitions the provided sequence into the elements selected by the predicate and those rejected by it.

The result of running the program is:
{ 1 3 5 } { 2 4 6 }
so again, it looks OK.

Quick sort

Given the availability of concat and partition, quicksort becomes rather simple to implement. First the typical introduction of the template:
template <template <typename U, U, U> class compare, typename T>
struct quicksort;

template <template <typename U, U, U> class compare, typename T>
using quicksort_t = typename quicksort<compare, T>::type;
Since quicksort only has one result, the convenience alias quicksort_t becomes handy. compare is a binary predicate accepting two values of a given type.

As so often, the base case is trivial.
template <template <typename U, U, U> class compare, typename T>
struct quicksort<compare, std::integer_sequence<T>>
{
  using type = std::integer_sequence<T>;
};
After all, the result of sorting an empty sequence is always an empty sequence.

The general case, a partial specialization of quicksort with a binary predicate compare and an std::integer_sequence<> of at least one element, is surprisingly simple too:
 1 template <template <typename U, U, U> class compare,
 2           typename T, T t, T ... ts>
 3 struct quicksort<compare, std::integer_sequence<T, t, ts...>>
 4 {
 5   template <typename V, V v>
 6   using pred = compare<V, v, t>;
 7
 8   using partitions = partition<pred, std::integer_sequence<T, ts...>>;
 9
10   using incl_seq = typename partitions::incl_type;
11   using excl_seq = typename partitions::excl_type;
12   using type = concat_t<quicksort_t<compare, incl_seq>,
13                         std::integer_sequence<T, t>,
14                         quicksort_t<compare, excl_seq> >;
15 };
Lines 5-6 define an interesting template alias. It makes a unary predicate from the binary predicate compare and the first element of the sequence, t. So the first element becomes the pivot element to partition the rest of the elements around, and partition requires a unary predicate as its selection criterion. As an example, if compare is "less than", and the first element t is 8, then pred becomes "is less than 8".

Line 8 is where this partitioning is done.

Lines 10-11 are just shorthand conveniences to make lines 12-14 readable.

Lines 12-14 are straight from the textbook: the result of quicksort is the concatenation of quicksort on the partition of elements included by the predicate, the pivot element, and quicksort on the partition of elements excluded by the predicate.

Another simple program shows the result:

template <typename T, T lh, T rh>
struct less : std::integral_constant<bool, (lh < rh)> {};

int main()
{
  using seq = std::integer_sequence<int, 32, 8, 4, 1, 7, 3, 99, 101, 5>;
  std::cout << quicksort_t<less, seq>{} << '\n';
}
The output from running this program is unsurprisingly:
{ 1 3 4 5 7 8 32 99 101 }

A word on performance

Everyone who has studied algorithms knows that it's poor practice to partition the elements around the first element for quick sort, since it gives sorting an already sorted sequence O(n²) complexity. For any deterministic element selection, there is a vulnerable family of sequences, so to protect against this, the selection of the pivot element for partitioning should be random. Just two days ago, a solution to this was shown by Matt Bierner in his blog post "Compile Time Pseudo-Random Number Generator".  I leave it as an exercise for someone else to improve on the quick sort implementation.

Wrap up

The readability of this code is vastly improved over the 2011 original, and it's shorter too. Using alias templates does not just save typing, but also removes the need for many transformation templates. Performance is also improved over the 2011 original, which used two-pass partitioning using a filter.

The utter uselessness of compile time quicksort, however, is not threatened.

Tuesday, January 6, 2015

Sequence Control with the Trompeloeil C++ Mocking Framework


As previously introduced, the Trompeloeil C++ framework is a new mocking framework for C++14. If you're not at all familiar with Trompeloeil, you may want to read the introductory post first.


Quick Recap

Before diving in to the topic of sequence control, here comes a brief recapitulation of Trompeloeil.

You need to define a mock class, and you do that as you define any class, but with mock implementations of the interesting functions.

class DataReader
{
public:
  virtual ~DataReader() = default;
  virtual bool available() const = 0;
  virtual size_t appendReadBytes(size_t desiredLength, std::vector<uint8_t>& data) = 0;
};

class ReaderMock : public DataReader
{
public:
  MAKE_CONST_MOCK0(available, bool());
  MAKE_MOCK2(appendReadBytes, size_t(size_t, std::vector<uint8_t>&));
};
Mock classes typically inherit from interfaces but they are not required to, and they typically do not have data members, but they are allowed to. MAKE_MOCKn(name, sig) creates a mock implementation of a member function with n parameters. MAKE_CONST_MOCKn is the same, but for const member functions. The first parameter to both is the name of the member function, and the second is the signature.

In tests, you let your test subject interact with the mock instances, which you place expectations on.

In this post, I will use the term expectation to mean any of REQUIRE_CALL(), ALLOW_CALL(), FORBID_CALL(), NAMED_REQUIRE_CALL(), NAMED_ALLOW_CALL() and NAMED_FORBID_CALL().


Setting up the Scene

The example used in this post is a MessageDispatcher that reads messages from a DataReader and calls observers with the received messages. The observers can subscribe and unsubscribe to receiving messages.
class MessageDispatcher
{
public:
  class subscriber_handle;
  MessageDispatcher(DataReader& reader);
  void readMessage();
  subscriber_handle subscribe(std::function<void(const std::vector<uint8_t>&)>);
  void unsubscribe(subscriber_handle);
};
In this example, messages are constructed as 1 header byte containing the length of the message payload. Subscribers expect only the message payload. For various reasons, it is possible that not all of a message is available for reading, in which case the MessageDispatcher must retry reading the tail when data becomes available. The MessageDispatcher is expected to continue reading messages as long as DataReader says data is available.

The observers are also made as simple mocks:

class ObserverMock
{
public:
  MAKE_CONST_MOCK1(message, void(const std::vector<uint8_t>&));
};


Default Sequencing

By default, expectations on different mock objects and different member functions are unrelated and any order permutation is equally good, as long as all expectations are met by the end of their surrounding scope. Several expectations on the same member function for the same mock object are tried in reverse order of creation until a match is found. This is often useful, since you can state a default in a wide scope, and place specific expectations in narrow local scopes.

Here's a first simple test for receiving one message and forwarding it to its observer.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
TEST(a_received_message_is_forwarded_to_a_single_subscriber)
{
  using trompeloeil::_;

  // the players
  ReaderMock        reader;
  MessageDispatcher dispatcher(reader);
  ObserverMock      obs;

  // setting up the scene
  dispatcher.subscribe([&](auto& m) { obs.message(m); });

  const std::vector<uint8_t> msg{'c', 'd', 'e'};

  REQUIRE_CALL(reader, appendReadBytes(1U, _))
    .WITH(_2.empty())
    .LR_SIDE_EFFECT(_2.push_back(msg.size()))
    .RETURN(1U);
  REQUIRE_CALL(reader, appendReadBytes(msg.size(), _))
    .WITH(_2.empty())
    .LR_SIDE_EFFECT(_2 = msg)
    .LR_RETURN(msg.size());
  REQUIRE_CALL(reader, available())
    .RETURN(false);
  REQUIRE_CALL(obs, message(_))
    .LR_WITH(_1 == msg);

  // action!
  dispatcher.readMessage();
}
In this example, there is little sequence control. There are two expectations on reader.appendReadBytes() on lines 15 and 19,  meaning that when a call arrives, the one on line 19 is tried first and if it does not match the parameters, the one on line 15 is tried.

A reasonable implementation of MessageDispatcher will pass, but so will a number of faulty ones too. What we want is to ensure that the MessageDispatcher reads the length (the expectation on line 15), and then continues with reading the payload, and finishes with checking if more data is available. The expectation on the observer is not really a problem since the implementation would be unlikely to call it with the correct message unless it has read it properly. The order between the call to the observer and reader.available() is irrelevant.


Constrained Sequences

A fixed version with constraints on the matching sequence of expectations could look like:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
TEST(a_received_message_is_forwarded_to_a_single_subscriber)
{
  using trompeloeil::_;

  // the players
  ReaderMock        reader;
  MessageDispatcher dispatcher(reader);
  ObserverMock      obs;

  // setting up the scene
  dispatcher.subscribe([&](auto& m) { obs.message(m); });

  const std::vector<uint8_t> msg{'c', 'd', 'e'};

  trompeloeil::sequence seq;
  REQUIRE_CALL(reader, appendReadBytes(1U, _))
    .WITH(_2.empty())
    .IN_SEQUENCE(seq)
    .LR_SIDE_EFFECT(_2.push_back(msg.size()))
    .RETURN(1U);
  REQUIRE_CALL(reader, appendReadBytes(msg.size(), _))
    .WITH(_2.empty())
    .IN_SEQUENCE(seq)
    .LR_SIDE_EFFECT(_2 = msg)
    .LR_RETURN(msg.size());
  REQUIRE_CALL(reader, available())
    .IN_SEQUENCE(seq)
    .RETURN(false);
  REQUIRE_CALL(obs, message(_))
    .LR_WITH(_1 == msg);

  // action!
  dispatcher.readMessage();
}
The sequence object created on line 15 is used to impose a strict order on the matched expectations (lines 18, 23 and 27.) Now only the expectation for obs.message() on line 29 is unconstrained, the other three must occur in the stated order, or a violation is reported.


Interleaved Sequences

Let's up the ante and verify that several messages can be read in one go, and that an observer can unsubscribe in the message member function. This is considerably longer, but despair not, an explanation follows.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
TEST(an_observer_can_unsubscribe_in_the_callback_mid_message_processing)
{
  using trompeloeil::_;

  // the players
  ReaderMock        reader;
  MessageDispatcher dispatcher(reader);
  ObserverMock      obs1, obs2, obs3;

  // setting up the scene
  dispatcher.subscribe([&](auto &m){ obs1.message(m); });
  auto i2 = dispatcher.subscribe([&](auto &m){ obs2.message(m);});
  dispatcher.subscribe([&](auto &m){ obs3.message(m); });

  trompeloeil::sequence read_seq, obs1_seq, obs2_seq, obs3_seq;

  // first message
  const std::vector<uint8_t> msg1{'c', 'd', 'e'};

  REQUIRE_CALL(reader, appendReadBytes(1U, _))
    .WITH(_2.empty())
    .IN_SEQUENCE(read_seq)
    .LR_SIDE_EFFECT(_2.push_back(msg1.size()))
    .RETURN(1U);
  REQUIRE_CALL(reader, appendReadBytes(msg1.size(), _))
    .WITH(_2.empty())
    .IN_SEQUENCE(read_seq, obs1_seq, obs2_seq, obs3_seq)
    .LR_SIDE_EFFECT(_2 = msg1)
    .LR_RETURN(msg1.size());
  REQUIRE_CALL(obs1, message(_))
    .LR_WITH(_1 == msg1)
    .IN_SEQUENCE(obs1_seq);
  REQUIRE_CALL(obs2, message(_))
    .LR_WITH(_1 == msg1)
    .IN_SEQUENCE(obs2_seq)
    .LR_SIDE_EFFECT(dispatcher.unsubscribe(i2));
  REQUIRE_CALL(obs3, message(_))
    .LR_WITH(_1 == msg1)
    .IN_SEQUENCE(obs3_seq);
  REQUIRE_CALL(reader, available())
    .IN_SEQUENCE(read_seq, obs1_seq, obs2_seq, obs3_seq)
    .RETURN(true); // continue reading messages

  // second message
  const std::vector<uint8_t> msg2{'f'};

  REQUIRE_CALL(reader, appendReadBytes(1U, _))
    .WITH(_2.empty())
    .IN_SEQUENCE(read_seq)
    .LR_SIDE_EFFECT(_2.push_back(msg2.size()))
    .RETURN(1U);
  REQUIRE_CALL(reader, appendReadBytes(msg2.size(), _))
    .WITH(_2.empty())
    .IN_SEQUENCE(read_seq, obs1_seq, obs2_seq, obs3_seq)
    .LR_SIDE_EFFECT(_2 = msg2)
    .LR_RETURN(msg2.size());
  REQUIRE_CALL(obs1, message(_))
    .LR_WITH(_1 == msg2)
    .IN_SEQUENCE(obs1_seq);
  REQUIRE_CALL(obs3, message(_))
    .LR_WITH(_1 == msg2)
    .IN_SEQUENCE(obs3_seq);
  REQUIRE_CALL(reader, available())
    .IN_SEQUENCE(read_seq, obs1_seq, obs2_seq, obs3_seq)
    .RETURN(false); // no more data

  // action!
  dispatcher.readMessage();
}
The first interesting thing is the many sequence objects defined on line 15. While we still don't want to impose any specific order between the different subscribers, we want to ensure that the two messages are delivered to the observers in order, and immediately after having been read.

Lines 30-39 say that each observer is expected to be called, but since they have different sequence objects, any order between them is OK. However, the expectation for appendReadBytes() also refers to each of the sequence objects obs1_seq, obs2_seq and obs3_seq, meaning that the observers may not be called until after the message has been read.

Line 41 says that reader.available() must be called after the messages have been read, since it refers to read_seq, which the expectations on appendReadBytes() on lines 20 and 25 also do. Line 41 also refers to each of obs1_seq, obs2_seq and obs3_seq, meaning that although there is no restriction on which observer gets called in which order, all of them must have been called before reader.available() is called.

The second section on lines 44-65 is more or less the same, but since obs2 unsubscribes on line 36, the only remaining references to obs2_seq are in the read expectations.

A graphic representation of allowed sequences may be viewed as follows:


Wrap up

By default calls to unrelated mock objects and unrelated member functions on a mock object have no restrictions on the order of matched calls, while expectations to the same member function of the same mock object are tried in reverse order of creation, until a match is found.

Sequence objects are a powerful tool, both for ensuring that calls to unrelated objects happen in a required sequence, and for imposing a more readable sequencing of calls matching several expectations to the same member function on the same mock object.

Using several sequence objects you can describe interleaved sequences that may, but need not, have common points. This allows you to restrict the test exactly as much as necessary, but also to relax it as much as desired, to allow for flexibility in implementation (for example, you typically don't want to impose an order on objects visited from a hash-table.)