Parallel Sort and Merge Join - Watch out for Unpredictability in Performance

When SQL Server sorts or does merge join in parallel the query performance highly depends on other concurrent CPU intensive activities taking place on the server.

In the below example you will see that when we execute a query with MAXDOP 0 it completes in less than a second. When we have a CPU intensive query executing over a CPU core, the above query with MAXDOP 0 takes 42 seconds. In reality it need not just be 42 seconds, it could be many minutes or hours depending on data volume. Also possible solutions are discussed in webcast 3.

Enough explanation, let’s see a real world example.

Let’s execute the query with MAXDOP 0:

The reason I am using convert is to trick the optimizer to grant additional memory. The memory estimated by the optimizer without convert is not enough to perform an in-memory sort. The sort spills to tempdb without the convert leading to non optimal performance. You can learn more about this from my webcasts or watching my SQL Bits V session at or reading the MSSQL Tip based on our webcast, (written by Matteo Lorini).

--Example provided by

--Query takes less than a second


Now let’s execute a CPU intensive query over another session:

--Example provided by
--Execute this CPU intensive query on a different session



Now let’s execute the above query with MAXDOP 0:

--Example provided by

--Query takes about 42 second

--How many seconds did it take on your server ?


The reason for the performance issue is the way SQL Server implements merging or order preserving exchanges and SQL Server’s CPU/thread scheduling architecture.

Don't miss our SQL Server Performance Tuning Hands-on WorkshopRegister

SQL Server Performance Tuning Consulting & Health Check

sp_whopro - SQL Server Activity Monitoring and Logging Stored Procedure - Free Download

SQLTest - SQL Server Performance, Load, Stress and Unit Test Tool - Free Download

SQLVideo - SQL Server Performance Tuning and Troubleshooting - Free Youtube Videos, now also available on MSDN Channel 9!

You can see our ‘Gather Streams’ has ‘Order By’, the sorted output of the parallel thread has to be merged by ‘Gather Streams’ maintaining the sorted order.


Every parallel thread supplies a page of sorted rows (packet) and the ‘Gather Streams’ merges these pages. Once ‘Gather Streams’ runs out of rows from a thread it waits for the next page of sorted rows from that thread. If that thread (which has to supply the next page of sorted rows) is executing on a CPU core that is busy executing a CPU intensive query, then the thread has to wait up to 4ms to supply the next page of sorted rows.

Duration depends on how many CPU cores you have, with more cores, each thread sorts less rows.

4ms is the scheduling quantum in SQL Server. Refer to the article

I wrote a while ago and the webcasts for additional details.


Some of you might think well in that case I will not execute queries in parallel, will set MAXDOP to 1. There are cases where you have to use all the CPU cores (return on investment (ROI), Maximizing your existing hardware CPU, memory and disk); you cannot tell your customers ‘well everyone has to wait while I execute this report using 1 CPU core keeping the rest of the CPU cores idle for the next hour’. You have to find creative ways to resolve these issues. One way to solve the issue and bring back query performance predictability is to partition the workload; I discuss this in my webcast (

This issue is just not limited to Sort; below you can see SQL Server chooses to execute the query in parallel using merge join.

The query executes faster with merge join when the CPU intensive query is not executing:


The query does parallel merge join:


When we force hash join, the query performance is similar:


The query does parallel hash join:

Let's execute the CPU intensive query on a CPU core:



Now the Merge join query is very slow, takes in my case 19 seconds:


In the execution plan you can see merge join:


Now when we force a hash join, the query is fast.


In the execution plan you can see hash join:


This is because hash join does not suffer from CPU/thread scheduling architecture like merge join. Parallel hash join suffers from memory pressure due to uneven row distribution, similar to parallel sort, which I explain in my webcasts.

In a situation like above, when the cost of parallel merge and parallel hash are close, one can choose hash (with an hint) as it gives performance predictability. And again the other possibility is to partition the workload (Soft-NUMA) like I discuss in webcast 3.

I explain these concepts with detailed examples in my webcasts (, I recommend you to watch them. The best way to learn is to practice. To create the above tables and reproduce the behavior,  subscribe to our Newsletter and I will send you the relevant SQL Scripts.

This and similar topics are covered in my workshops.

Read some more SQL Server Performance Tuning Articles

This article is for informational purposes only; you use any of the suggestions given here entirely at your own risk.

© 2010-2018 All rights reserved.

Back to Top