A very common use case in Best Execution is to compare data across multiple groups to assess relative performance. For example, a buy-side firm may want to understand how its brokers performed in each period. If one broker is particularly good, that broker should receive more trading flow as a reward in the next period. Other comparisons could be made to determine which internal strategy is performing best, or how a change to an existing trade implementation strategy affects performance.

The quantitative framework for Best Execution is Transaction Cost Analysis (TCA). TCA works by taking client order data and putting it into context with the prevailing market conditions at the time each order was executed. Over time, many orders can be analyzed by recording the market prices at various times during the life of the order and then archiving those results. This data set can then be used to make comparisons by grouping the orders according to a dimension of interest. For example, we can compute the value-weighted performance of each broker. We could also compare the performance of our buys versus our sells. We can group by any attribute that we have recorded on the orders, and subsequently make comparisons.
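The grouping step can be sketched in a few lines of Python. This is a minimal illustration, not a production TCA calculation: the order records, the field names (`broker`, `side`, `value`, `perf_bps`) and the numbers are all hypothetical.

```python
from collections import defaultdict

# Hypothetical archived TCA results: one record per order.
# perf_bps is arrival-price performance in basis points (assumed field name).
orders = [
    {"broker": "A", "side": "buy",  "value": 1_000_000, "perf_bps": -4.0},
    {"broker": "A", "side": "sell", "value": 3_000_000, "perf_bps": -8.0},
    {"broker": "B", "side": "buy",  "value": 2_000_000, "perf_bps": -2.0},
    {"broker": "B", "side": "sell", "value": 2_000_000, "perf_bps": -6.0},
]

def value_weighted_performance(orders, key):
    """Group orders by `key` and return sum(value * perf) / sum(value) per group."""
    weighted = defaultdict(float)
    total = defaultdict(float)
    for o in orders:
        weighted[o[key]] += o["value"] * o["perf_bps"]
        total[o[key]] += o["value"]
    return {group: weighted[group] / total[group] for group in total}

by_broker = value_weighted_performance(orders, "broker")  # compare brokers
by_side = value_weighted_performance(orders, "side")      # compare buys vs. sells
print(by_broker)
print(by_side)
```

Any other recorded attribute can be passed as `key` in the same way.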

This sounds simple, but there are a couple of catches. First, we need to ensure that we are making ‘apples-to-apples’ comparisons. It is important to ‘control’ for factors such as the type of security we are trading and the size of our orders relative to the total volume transacted in the market, which means we will likely need to remove some orders from the analysis. We do not want to compare the performance of a broker executing small-cap securities over multiple days with that of brokers executing mega-cap securities over 10 minutes. So we must be wary of what is called ‘selection bias’: the groups of orders we compare should all share ‘similar’ characteristics.
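As a rough sketch of what such a control might look like in code, the filter below keeps only orders that fall inside one comparable bucket. The attribute names (`cap_bucket`, `pct_adv`, `duration_min`) and the thresholds are illustrative assumptions, not recommended values.

```python
# Hypothetical order attributes; every threshold below is an assumption.
orders = [
    {"id": 1, "cap_bucket": "large", "pct_adv": 0.02,  "duration_min": 30},
    {"id": 2, "cap_bucket": "small", "pct_adv": 0.40,  "duration_min": 2880},
    {"id": 3, "cap_bucket": "large", "pct_adv": 0.04,  "duration_min": 45},
    {"id": 4, "cap_bucket": "mega",  "pct_adv": 0.001, "duration_min": 10},
]

def in_control_group(order, cap_bucket="large", max_pct_adv=0.05, max_duration_min=390):
    """Keep orders with the same cap bucket, a small share of average daily
    volume, and a duration within one (assumed 390-minute) trading session."""
    return (
        order["cap_bucket"] == cap_bucket
        and order["pct_adv"] <= max_pct_adv
        and order["duration_min"] <= max_duration_min
    )

comparable = [o for o in orders if in_control_group(o)]
kept_ids = [o["id"] for o in comparable]
print(f"kept {len(comparable)} of {len(orders)} orders: {kept_ids}")
```

Note how quickly the sample shrinks: tighter filters give cleaner comparisons but leave fewer orders to compare.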

The second issue comes from the need to ensure that any differences we measure are ‘significant’. If you flip a fair two-sided coin an odd number of times and measure the fraction of heads, the observed fraction can never equal the true probability. For a fair coin the true probability of heads is exactly 50%, but with an odd number of flips the observed fraction of heads can never be exactly 50%. The good news is that the more flips we observe, the closer the observed fraction tends to get to the true value. What we need is lots of flips to determine if the coin is fair. In TCA, we need lots of orders.
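To put numbers on the coin example, the sketch below computes two quantities for a fair coin: the smallest possible gap between the observed fraction of heads and 50%, and the standard error of the observed fraction, sqrt(p(1 - p)/n). The function names are our own; the formula is the usual binomial one.

```python
import math

def standard_error(p, n):
    """Standard error of the observed fraction of heads after n flips."""
    return math.sqrt(p * (1 - p) / n)

def closest_gap_to_half(n):
    """Smallest achievable |heads / n - 0.5| over all possible head counts."""
    return min(abs(k / n - 0.5) for k in range(n + 1))

# With an odd number of flips the gap is never zero, but both the best-case
# gap and the typical sampling error shrink as the number of flips grows.
for n in (11, 101, 1001, 10001):
    print(n, round(closest_gap_to_half(n), 5), round(standard_error(0.5, n), 5))
```

The standard error falls with the square root of the sample size, so halving the uncertainty takes roughly four times as many flips (or orders).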

If we only have a few orders, we cannot make meaningful distinctions. As the number of observations goes up, so does our ability to tell whether there is a ‘statistically significant’ difference between the groups of results. There are statistical techniques that we can use to tell whether we should consider the observed differences between groups to be significant. We can apply these tests to our Best Execution questions and create tools such as a ‘Broker Report Card’, which not only ranks brokers by performance but also runs statistical tests to determine whether the difference in performance is large enough to be significant. If the differences are not significant, there should be ties in the ranking (see Figure 1 for an example).
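One such technique is a two-sample test for a difference in means. The sketch below uses a Welch-style statistic with a normal approximation, which is reasonable only for large samples; it is an illustration of the idea, not necessarily the test behind any particular report card. The data are synthetic and deliberately stylized: half of each broker's orders land 50 bps above that broker's mean and half land 50 bps below.

```python
import math
from statistics import NormalDist, mean, variance

def welch_z_test(a, b):
    """Two-sided test for a difference in means with unequal variances.
    Uses a normal approximation, so it is only appropriate for large samples."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

def stylized_outcomes(mean_bps, n, spread_bps=50.0):
    """Synthetic per-order performance: alternating mean +/- spread (n even)."""
    return [mean_bps + spread_bps, mean_bps - spread_bps] * (n // 2)

# Same 5 bps true difference between two hypothetical brokers,
# measured with few orders and then with many.
z_small, p_small = welch_z_test(stylized_outcomes(-5.0, 20), stylized_outcomes(-10.0, 20))
z_large, p_large = welch_z_test(stylized_outcomes(-5.0, 2400), stylized_outcomes(-10.0, 2400))

print(f"20 orders each:   z = {z_small:.2f}, p = {p_small:.4f}")
print(f"2400 orders each: z = {z_large:.2f}, p = {p_large:.4f}")
```

With 20 orders per broker the 5 bps gap is indistinguishable from noise at the conventional 5% level, while with 2,400 orders per broker the same gap is clearly significant. When many brokers are compared pairwise, a multiple-comparison adjustment is commonly considered as well.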

*Figure 1. A comparison of performance across several brokers. We show boxplots of the distributions of outcomes for arrival price performance using approximately 2,400 orders for each broker. We see from the stats that Broker 3 has the best performance and is significantly better than any other broker. Brokers 1, 2, and 4 are all tied, while Broker 5, with Rank 5, is significantly worse than that group. The more orders we have, the better our ability to resolve significant differences.*

Now we must face a bit of a reality check. To address the first consideration, ensuring we are comparing apples with apples, we often need to remove some orders from our group. To address the second consideration, measuring statistically significant differences between groups, we want as many orders as possible. The two goals pull in opposite directions.

There are a few approaches we can take. First, we can execute more orders. This is problematic in most cases: implementing orders is part of the overall investment process and generally adds implementation costs. A second approach is to accrue orders over time, building up a larger and larger sample. This is definitely viable but can take time. A third approach is to participate in a peer group analysis where orders are pooled across many firms.

Fortunately, there is a fourth approach, which involves modifying traditional benchmarks to include the liquidity environment in which the order was implemented so that we can better compare performance across groups. In general, buy-side firms trade many different securities. Each security has its own liquidity characteristics, and those characteristics also change over time. This means we must try to observe what happens with our orders in these various scenarios.

In the fourth approach, we develop analytics that measure performance relative to prevailing liquidity conditions, which means we can loosen the filters on the control groups to admit more orders into our statistical analysis. (We covered relative performance benchmarks in an earlier article.)

In machine learning (ML), this fourth approach is sometimes called feature engineering. We apply domain knowledge to adjust for the difficulty of the order by observing the liquidity conditions that were prevalent during the execution of the order, and then compare these relative benchmarks across the groups of interest. This gives a better balance between selection bias and statistical error.
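One simple way to illustrate a difficulty adjustment is to regress raw cost on a liquidity feature and compare brokers on the residuals. The sketch below is an assumption-laden toy, not the relative benchmark from the earlier article: the single feature (participation rate, in percent of market volume), the straight-line fit, and the six orders are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical orders: (broker, participation rate in %, raw arrival cost in bps).
# Broker B was handed much larger orders relative to market volume.
orders = [
    ("A", 1.0, 4.0), ("A", 2.0, 5.0), ("A", 3.0, 9.0),
    ("B", 9.0, 20.0), ("B", 10.0, 22.0), ("B", 11.0, 21.0),
]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

intercept, slope = fit_line([o[1] for o in orders], [o[2] for o in orders])

raw, adjusted = defaultdict(list), defaultdict(list)
for broker, participation, cost in orders:
    raw[broker].append(cost)
    # Cost relative to what the fitted line expects for an order this difficult.
    adjusted[broker].append(cost - (intercept + slope * participation))

raw_mean = {b: mean(v) for b, v in raw.items()}
adjusted_mean = {b: mean(v) for b, v in adjusted.items()}
print("raw mean cost (bps):     ", raw_mean)
print("adjusted mean cost (bps):", adjusted_mean)
```

On raw cost Broker B looks far worse, but almost all of that gap disappears once order difficulty is taken into account, so orders of very different sizes can stay in the same comparison. One caveat: when order difficulty is strongly tied to the broker, as it is in this toy data, a fit like this can also absorb genuine broker effects, so the choice of features and model deserves care.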

Once we can make statistically significant comparisons, we can apply many ML and AI techniques to leverage the value of our proprietary order data. The whole process is akin to developing a ‘gut’ feel. While people develop a gut feel from subconscious memories of past experiences, AI models can use previous results from TCA to make predictions and recommendations. For example, with enough TCA orders, an AI recommendation system can suggest which strategy would best meet the implementation goals as expressed in a firm’s Best Execution policy.

We have moved well beyond arithmetic TCA into the realm of statistical TCA, which allows us to leverage ML approaches that can inform Best Execution policies and procedures.

That is significant.
