Edited by: Satrajit S. Ghosh, Massachusetts Institute of Technology, USA

Reviewed by: Saad Jbabdi, University of Oxford, UK; Nicholas J. Tustison, University of Virginia, USA; Anastasia Yendiki, Massachusetts General Hospital, USA

*Correspondence: Eleftherios Garyfallidis, Computer Science Department, 2500 University Boulevard, Sherbrooke, QC J1K 2R1, Canada. e-mail:

This article was submitted to Frontiers in Brain Imaging Methods, a specialty of Frontiers in Neuroscience.

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and subject to any copyright notices concerning any third-party graphics etc.

Diffusion MR data sets produce large numbers of streamlines which are hard to visualize, interact with, and interpret in a clinically acceptable time scale, despite numerous proposed approaches. As a solution we present a simple, compact, tailor-made clustering algorithm, QuickBundles (QB), that overcomes the complexity of these large data sets and provides informative clusters in seconds. Each QB cluster can be represented by a single centroid streamline; collectively these centroid streamlines can be taken as an effective representation of the tractography. We provide a number of tests to show that the QB reduction is consistent and robust. We also show how the QB reduction can help in the search for similarities across several subjects.

Following the acquisition of diffusion MR scans, processes of reconstruction and integration are performed to create a tractography, a data set of streamlines. A tractography can contain a very large number of streamlines (of the order of 10^{6}), depending principally on the number of seed points used to generate the tractography but also on how the tractography propagation algorithm handles voxels with underlying fiber crossings.

The size of these tractographies makes them difficult to interpret and visualize. A clustering of some kind seems to be an obvious route to simplify the complexity of these data sets and provide a useful segmentation. As a result, during the last 10 years there have been numerous efforts by many researchers to address both unsupervised and supervised learning problems of brain tractography. As far as we know all these methods suffer from low time efficiency, i.e., they are very slow when used in practice.

In the tractography literature we can find approaches which use unsupervised and/or supervised learning algorithms to create bundles, i.e., groupings of streamlines with similar spatial and shape characteristics. In supervised learning the data sets are divided into a training and a test set. For the training set, experts will have provided anatomical labels for a set of manually segmented streamline bundles. Those bundles will now correspond to tracts, e.g., the corticospinal tract or the arcuate fasciculus. The task then will be to identify similar structures amongst the unlabeled streamlines in the test set.

In unsupervised learning the focus is on creating a partitioning of the streamlines without knowing any labels. In this work we used unsupervised learning to reduce in a simple and efficient way the number of streamlines and make manual segmentation of bundles and tractography exploration less time consuming tasks. By the term

We believe that a complete unsupervised method cannot directly create anatomically valid bundles without extensive prior information preferably from experts or atlases. This is because anatomical bundles differ considerably both in lengths and in shape (see Schmahmann and Pandya,

Most clustering (unsupervised learning) algorithms are in the best case of complexity

Other clustering methods have also been proposed that use graph theoretic approaches (see Brun et al.,

In fact tractographies have high levels of redundancy with many similar streamlines. Our approach is to take advantage of this to reduce the size and dimensionality of the data sets as a precursor for higher complexity classification and/or input from experts. To address these key issues of time and space we present a stable, generally linear time clustering algorithm that can generate meaningful clusters of streamlines in seconds with minimum memory consumption. Our approach is straightforward and we do not need to calculate all pairwise distances unlike most existing methods. Furthermore we can update our clustering online, i.e., as and when new data points become available. In this way we can overcome the previous barriers of space and time.

We show that QuickBundles can generate these clusters many times faster than any other available method, and that it can be used to cluster from a few hundred to many millions of streamlines.

We think that there is no current unsupervised anatomical segmentation method that can have general usability without expert knowledge integration. Nonetheless, neuroanatomists often disagree on definition of major structures or on which streamlines correspond to actual tracts (Catani et al.,

A wide range of approaches have been taken in the literature for representing or coding for tractographies (Chung et al.,

For these reasons we have opted to use a rather simpler symmetric distance function (Garyfallidis et al.,

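This is the minimum average direct-flip (MDF) distance, $\mathrm{MDF}(s, t) = \min(d_{\mathrm{direct}}(s, t), d_{\mathrm{flipped}}(s, t))$, a symmetric distance function that can deal with the streamline bi-directionality problem; it works on streamlines which have the same number of points. A second symmetric distance averages the two directed mean distances between a pair of streamlines: $\mathrm{MAM}_{\mathrm{mean}}(s, t) = (d_{\mathrm{mean}}(s, t) + d_{\mathrm{mean}}(t, s))/2$.

As it has no preferred orientation, a streamline can be represented equally well by the ordered sequence of points $s = [s_1, s_2, \ldots, s_K]$, with $s_i \in \mathbb{R}^3$, or by its flipped version $s^F = [s_K, s_{K-1}, \ldots, s_1]$. With this notation the direct, flipped and MDF distances are defined as follows:

$$d_{\mathrm{direct}}(s, t) = d(s, t) = \frac{1}{K} \sum_{i=1}^{K} |s_i - t_i|,$$

$$d_{\mathrm{flipped}}(s, t) = d(s, t^F) = d(s^F, t),$$

$$\mathrm{MDF}(s, t) = \min(d_{\mathrm{direct}}(s, t), d_{\mathrm{flipped}}(s, t)).$$

Here $|x - y|$ denotes the Euclidean distance between two points $x$ and $y$, so that $d_{\mathrm{direct}}(s, t)$ is the mean Euclidean distance between corresponding points of $s$ and $t$.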

The MDF distance is in fact a metric on the space of streamlines. MDF distances are non-negative, symmetric, and zero if and only if the two streamlines are identical up to orientation. The triangle inequality follows from the triangle inequality for the direct distance, together with the fact that flipping both streamlines leaves the direct distance unchanged.
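One way to spell out the triangle inequality is sketched below. It is a reconstruction under stated assumptions rather than the authors' own derivation: $d(s, t)$ is taken to be the direct distance, i.e., the mean Euclidean distance between corresponding points of two streamlines with $K$ points each, and $s^{a}$ denotes either $s$ itself or its flipped version $s^F$ (the symbols $a$, $b$ and $u$ are introduced here only for illustration).

```latex
% Assumption: d(s,t) = (1/K) \sum_i |s_i - t_i| is the direct distance.
% It satisfies the triangle inequality term by term, and flipping both
% arguments only reorders the sum, so
%   d(s^F, t^F) = d(s, t)   and therefore   d(s, t^F) = d(s^F, t).
\begin{align*}
\mathrm{MDF}(s,t) &= \min\bigl(d(s,t),\, d(s^F,t)\bigr) = d(s^{a}, t)
  && \text{for some } a \in \{\mathrm{id}, F\},\\
\mathrm{MDF}(t,u) &= \min\bigl(d(t,u),\, d(t,u^F)\bigr) = d(t, u^{b})
  && \text{for some } b \in \{\mathrm{id}, F\},\\
\mathrm{MDF}(s,t) + \mathrm{MDF}(t,u) &= d(s^{a},t) + d(t,u^{b})
  \;\ge\; d(s^{a},u^{b})\\
  &\ge \min\bigl(d(s,u),\, d(s,u^F)\bigr) = \mathrm{MDF}(s,u).
\end{align*}
% The last step holds because d(s^a, u^b) always equals either
% d(s,u) or d(s,u^F), whichever combination of flips was chosen.
```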

The main advantages of the MDF distance are that it is fast to compute, it takes account of streamline direction issues through consideration of both direct and flipped streamlines, and that its behavior is easy to understand (see Figure
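The MDF computation can be illustrated with a short NumPy sketch. This is a minimal illustration rather than the released QuickBundles code: the function names `direct_distance` and `mdf` are hypothetical, and each streamline is assumed to be stored as a K × 3 array with the same K for both inputs.

```python
import numpy as np


def direct_distance(s, t):
    """Mean Euclidean distance between corresponding points of two K x 3 arrays."""
    return float(np.mean(np.linalg.norm(s - t, axis=1)))


def mdf(s, t):
    """Minimum average direct-flip distance between two equal-length streamlines."""
    d_direct = direct_distance(s, t)
    d_flipped = direct_distance(s, t[::-1])  # compare against the flipped copy of t
    return min(d_direct, d_flipped)


# Two parallel straight streamlines, 2 mm apart, stored in opposite orientations.
x = np.linspace(0.0, 10.0, 12)
s = np.column_stack([x, np.zeros(12), np.zeros(12)])
t = np.column_stack([x, np.full(12, 2.0), np.zeros(12)])[::-1]

direct_value = direct_distance(s, t)  # inflated by the orientation mismatch
mdf_value = mdf(s, t)                 # recovers the 2 mm separation
```

Because only the direct and the flipped comparison are evaluated, the cost per pair is proportional to K, which is what makes the distance cheap to compute.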

Another advantage of the MDF distance function is that it separates short streamlines from long streamlines; a streamline

A further important advantage of having streamlines with the same number of points is that we can easily do pairwise calculations on them; for example add two or more streamlines together to create a new average streamline. We will see in the next section how streamline addition is a key property that we exploit in the QB clustering algorithm.

Care needs to be given to choosing the number of points required in a streamline (streamline discretization). We always keep the endpoints intact and then discretize into segments of equal lengths. One consequence of short streamlines having the same number of points as long streamlines is that more of the curvature information from the long streamlines is lost relative to the short streamlines, i.e., the short streamlines will have higher resolution. We found empirically that this is not an important issue and that for clustering purposes even discretizing to only
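The discretization step can be sketched as follows. This is one plausible implementation, not necessarily the one used by the authors: the helper name `resample_streamline` is hypothetical, the new points are placed at equal arc-length intervals along the original polyline by linear interpolation, and the default of 12 points is only an example value.

```python
import numpy as np


def resample_streamline(points, n_points=12):
    """Resample a polyline (M x 3) to n_points equally spaced along its arc length.

    The first and last points are kept intact; interior points are obtained
    by linear interpolation along the original segments.
    """
    points = np.asarray(points, dtype=float)
    seg_lengths = np.linalg.norm(np.diff(points, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg_lengths)])  # cumulative length
    targets = np.linspace(0.0, arc[-1], n_points)          # equal arc-length steps
    return np.column_stack(
        [np.interp(targets, arc, points[:, dim]) for dim in range(3)]
    )


# An unevenly sampled straight streamline of length 10 mm.
raw = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
resampled = resample_streamline(raw, n_points=5)
```

After this step every streamline is a fixed-size array, so distances and sums between streamlines reduce to element-wise array operations.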

In some later stages in the analysis of tractographies, e.g., for merging clusters, we find a use for Hausdorff-type distance functions which for simplicity we denote as MAM distances – short for Minimum (or Maximum, or Mean) Average Minimum distance (MAM). (In this nomenclature the classical Hausdorff distance is the Maximum Average Minimum distance.) We mostly use the Mean version of this family, see equation (

where the number of points _{min}, MAM_{max}, and MAM_{mean} will give different results. For example, MAM_{min} will bring together more short streamlines with long streamlines than MAM_{max}, and MAM_{mean} will have an in-between effect. Other distances than _{i},
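The behavior of the three MAM variants can be illustrated with the sketch below. The definitions used here are assumptions made for illustration: the directed distance from one streamline to another is taken as the mean, over the points of the first, of the distance to the nearest point of the second, and the min, max and mean variants combine the two directed values. The function names `directed_mean_min` and `mam` are hypothetical.

```python
import numpy as np


def directed_mean_min(s, t):
    """Mean over points of s of the distance to the nearest point of t."""
    pairwise = np.linalg.norm(s[:, None, :] - t[None, :, :], axis=2)
    return float(pairwise.min(axis=1).mean())


def mam(s, t, kind="mean"):
    """Combine the two directed distances (assumed MAM_min / MAM_max / MAM_mean)."""
    a, b = directed_mean_min(s, t), directed_mean_min(t, s)
    if kind == "min":
        return min(a, b)
    if kind == "max":
        return max(a, b)
    if kind == "mean":
        return 0.5 * (a + b)
    raise ValueError("kind must be 'min', 'max' or 'mean'")


# A short streamline lying on top of the first 2 mm of a long one.
# The two streamlines deliberately have different numbers of points.
short = np.column_stack([np.linspace(0.0, 2.0, 5), np.zeros(5), np.zeros(5)])
long_ = np.column_stack([np.linspace(0.0, 10.0, 21), np.zeros(21), np.zeros(21)])

mam_min = mam(short, long_, "min")
mam_mean = mam(short, long_, "mean")
mam_max = mam(short, long_, "max")
```

In this example the minimum variant is zero because every point of the short streamline lies on the long one, whereas the maximum variant is large because most of the long streamline is far from the short one. This matches the observation that MAM_{min} brings short and long streamlines together more readily than MAM_{max}.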

QuickBundles (QB) is a surprisingly simple and very fast algorithm which can reduce tractography representation to an accessible structure in a time that is linear in the number of streamlines

In QB each item, a streamline, is a fixed-length ordered sequence of points in ℝ^{3}, and QB uses comparison functions and amalgamations which take account of and preserve this structure. Moreover each item is either added to an existing cluster or used to start a new one, on the basis of the distances between the descriptor of the item and the descriptors of the current list of clusters. Clusters are held in a list which is extended according to need. Unlike amalgamation clustering algorithms such as

A clustering algorithm needs a measure of distance between two streamlines, and QB uses a particular distance measure that we call minimum average direct flip (MDF). The MDF measure requires that each streamline be resampled to have

QuickBundles stores information about clusters in _{i}_{i}

An example of the QB centroid is presented in Figure

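Algorithm 1: QuickBundles

    Input: T = {s_1, ..., s_i, ..., s_N}, clustering threshold θ
    Output: C = [c_1, ..., c_k, ..., c_M]

    # create first cluster
    c_1 ← ([1], s_1, 1)
    C ← [c_1]
    M ← 1
    for i = 2 to N do
        t ← s_i
        alld ← infinity(M)        # distance buffer
        flip ← zeros(M)           # flipping check buffer
        for k = 1 to M do
            v ← c_k.h / c_k.n     # centroid of cluster k
            d ← d_direct(t, v)
            f ← d_flipped(t, v)
            if f < d then
                d ← f
                flip_k ← 1
            end
            alld_k ← d
        end
        m ← min(alld)
        l ← argmin(alld)
        if m < θ then
            # append to current cluster
            if flip_l = 1 then
                c_l.h ← c_l.h + t^F
            else
                c_l.h ← c_l.h + t
            end
            c_l.n ← c_l.n + 1
            append(c_l.I, i)
        else
            # create new cluster
            c_{M+1} ← ([i], t, 1)
            append(C, c_{M+1})
            M ← M + 1
        end
    end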

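The algorithm proceeds as follows. Each cluster $c = (I, h, n)$ holds the list $I$ of indices of its streamlines, their sum $h$, and their number $n$. At any one step in the algorithm we have $M$ clusters. Select the first streamline $s_1$ and place it in the first cluster $c_1 \leftarrow ([1], s_1, 1)$; $M = 1$ at this point. For each remaining streamline $s_i$ in turn, $i = 2, \ldots, N$: (i) calculate the MDF distance $m_e$ between $s_i$ and the centroid streamline $v_e$ of each current cluster $c_e$, $e = 1, \ldots, M$, where the centroid is computed on the fly as $v_e = h_e/n_e$; (ii) if the smallest of the $m_e$ is below the clustering threshold $\theta$, add streamline $i$ to the cluster $e$ that attains it, updating $c_e \leftarrow (\mathrm{append}(I_e, i), h_e + s_i, n_e + 1)$; otherwise create a new cluster $c_{M+1} \leftarrow ([i], s_i, 1)$ and set $M \leftarrow M + 1$.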

Choice of orientation can become an issue when adding streamlines together, because streamlines can equivalently have their points ordered 1, …,

The complete QB algorithm is described in formal detail in Algorithm 1. One of the reasons why QB has on average linear time complexity derives from the structure of the cluster node: we only save the sum of current streamlines
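The clustering loop described in this section can be sketched in a few lines of Python. This is a simplified illustration and not the authors' released implementation: the function name `quickbundles`, the dictionary keys `indices`, `h` and `n`, and the 0-based indexing are choices made here, and every streamline is assumed to have been resampled to the same number of points.

```python
import numpy as np


def quickbundles(streamlines, threshold):
    """Single-pass clustering; each cluster keeps indices, streamline sum h, count n."""
    clusters = []
    for i, s in enumerate(streamlines):
        s = np.asarray(s, dtype=float)
        best_d, best_k, best_flip = np.inf, -1, False
        for k, c in enumerate(clusters):
            v = c["h"] / c["n"]  # centroid computed on the fly
            d = np.mean(np.linalg.norm(s - v, axis=1))        # direct distance
            f = np.mean(np.linalg.norm(s[::-1] - v, axis=1))  # flipped distance
            flip = f < d
            if flip:
                d = f
            if d < best_d:
                best_d, best_k, best_flip = d, k, flip
        if best_d < threshold:
            # Add to the closest cluster, in the orientation that matched best.
            c = clusters[best_k]
            c["h"] = c["h"] + (s[::-1] if best_flip else s)
            c["n"] += 1
            c["indices"].append(i)
        else:
            # No centroid is close enough: start a new cluster.
            clusters.append({"indices": [i], "h": s.copy(), "n": 1})
    return clusters


x = np.linspace(0.0, 10.0, 12)


def line(y):
    """Straight 12-point streamline along x at height y (toy data)."""
    return np.column_stack([x, np.full(12, y), np.zeros(12)])


# Two well-separated groups; the second streamline is stored reversed.
streamlines = [line(0.0), line(1.0)[::-1], line(50.0), line(2.0), line(51.0)]
clusters = quickbundles(streamlines, threshold=10.0)
centroids = [c["h"] / c["n"] for c in clusters]
```

Because each cluster stores only the running sum and the count, adding a streamline costs a single array addition, and each new streamline is compared with the current centroids rather than with every streamline seen so far.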

One of the major benefits of applying QB to tractographies is that it can provide meaningful simplifications and find features that were previously invisible or difficult to locate because of the high density of the tractography. For example we used QB to cluster part of the corticospinal tract (CST). This bundle was labeled in the data sets provided by PBC (2.5) and it was selected by an expert. The QB representation is clearly shown in Figure

Another interesting feature of QB is that it can be used to merge or split different structures by changing the clustering threshold. This is shown in Figure

We can see similar effects with real streamlines, for instance those of the fornix shown at the left panel of Figure

In order to quantify the dimensionality reduction it achieves we applied QB clustering to the 10 human subject data sets (2.5). The mean data compression (ratio of tractography size to number of QB clusters) was 34.4:1 with a 10 mm threshold and 230.4:1 with a 20 mm threshold.

We have found rather few systematic ways available in the literature to directly compare different clustering results for tractographies, beyond that of Moberts et al. (_{ij}_{i}_{j}_{ij}_{i}_{j}

We will use OMA to compare the different clusterings that arise when the streamlines in the tractography are shuffled. However, this statistic has its limitations. Not only are there considerable computational overheads in calculating the cross-classification matrix, there is also a fundamental disadvantage: it cannot be applied to clusterings of different tractographies. Being able to compare results of clusterings across brains is crucial for creating stable brain imaging procedures, and therefore it is necessary to develop a way to compare different tractography clusterings on different sets of streamlines from the same or different subjects.

Although we recognize that these are difficult problems, we introduce three novel comparison functions which we call

Let

We define the

Coverage and overlap measure how well one set approximates another. In order to compare two reductions of possibly different data sets we define the symmetric measure

BA ranges between 0, when no streamlines of S or T have neighbors in the other set, and 1 when they all do.
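The three comparison functions can be sketched as below. The definitions coded here are assumptions made for illustration, chosen to be consistent with the description of BA above: a streamline is taken to be adjacent to a set if at least one member of the set lies within an MDF distance θ of it, coverage is the fraction of streamlines in one set that are adjacent to the other, overlap is the mean number of neighbors among covered streamlines, and BA is the average of the two coverages. All function names are hypothetical.

```python
import numpy as np


def mdf(s, t):
    """Minimum average direct-flip distance between two K x 3 arrays."""
    direct = np.mean(np.linalg.norm(s - t, axis=1))
    flipped = np.mean(np.linalg.norm(s - t[::-1], axis=1))
    return float(min(direct, flipped))


def neighbor_counts(A, B, theta):
    """For each streamline in A, the number of streamlines in B with MDF <= theta."""
    return np.array([sum(mdf(a, b) <= theta for b in B) for a in A])


def coverage(A, B, theta):
    """Fraction of A that has at least one neighbor in B (assumed definition)."""
    return float(np.mean(neighbor_counts(A, B, theta) > 0))


def overlap(A, B, theta):
    """Mean number of neighbors in B among covered streamlines of A (assumed)."""
    counts = neighbor_counts(A, B, theta)
    covered = counts[counts > 0]
    return float(covered.mean()) if covered.size else 0.0


def bundle_adjacency(S, T, theta):
    """Symmetric average of the two coverages (assumed definition of BA)."""
    return 0.5 * (coverage(S, T, theta) + coverage(T, S, theta))


x = np.linspace(0.0, 10.0, 12)


def line(y):
    """Straight 12-point streamline along x at height y (toy data)."""
    return np.column_stack([x, np.full(12, y), np.zeros(12)])


S = [line(0.0), line(4.0)]
T = [line(1.0), line(100.0)]
ba = bundle_adjacency(S, T, theta=2.0)
cov_t = coverage(T, S, theta=2.0)
ovl_t = overlap(T, S, theta=2.0)
```

In this toy example only one streamline in each set has a neighbor in the other set, so both coverages are 0.5 and BA is 0.5. The brute-force double loop is quadratic in the set sizes, which is acceptable for small sets of centroids but not for full tractographies.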

We applied QuickBundles to a variety of data sets: simulations, 10 human tractographies collected and processed by ourselves, and one tractography with segmented bundles which was available online.

We generated 3 different bundles of streamlines from parametric paths sampled at 200 points. The streamlines were made from different combinations of sinusoidal and helicoidal functions. Each bundle contained 150 streamlines. For the red bundle in Figure

We collected data from 10 healthy subjects at the Medical Research Council’s Cognition and Brain Sciences Unit with a 3 T scanner (TIM Trio, Siemens), using a Siemens advanced diffusion work-in-progress sequence, and STEAM (Merboldt et al., ^{2}, matrix size 96 × 96, and slice thickness 2.5 mm (no gap). Fifty five slices were acquired to achieve full brain coverage, and the voxel resolution was 2.5 × 2.5 × 2.5 mm^{3}. A 102-point half grid acquisition (Yeh et al., ^{2} was used. The total acquisition time was 14′21″ with TR = 8200 ms and TE = 69 ms. The experiment was approved by the Cambridge Psychology Research Ethics Committee (CPREC).

For the reconstruction of the 10 human data sets we used Generalized Q-Sampling (Yeh et al.,

We also used labeled data sets by experts (see Figures ^{1}

In this section we justify our claims about the speed and linear complexity of QB (3.1). Next we demonstrate the robustness of QB as a method for clustering tractographies (3.2). In Section

The execution time of QB is affected by the following parameters:

The complexity of QB is in the best case linear time

As a further test we compared QB (with 12-point streamlines and a clustering threshold of 10 mm) with timings reported for the fastest state-of-the-art methods found in the literature. These methods have different goals from those of QB; however, we think that it is useful to show the speedup that QB offers for the same number of streamlines. With 1000 streamlines Wang et al.’s (

One of the disadvantages of most clustering algorithms is that they give different results with different initial conditions; for example this is recognized with k-means, expectation maximization (Dempster et al.,

[A] First we look at the stability of the number of clusters with respect to random permutations. [B] Next we will use optimized matching agreement (OMA) to establish how well the detailed content of QB clustering is preserved under random permutations. [C] Next we will show, using the coverage and overlap metrics, that the QB centroids are a better reduction of the tractography dataset than a random selection of the same number of streamlines. [D] Finally we will show how well QB clustering on a subset of a tractography dataset serves as an approximation to the remainder of the dataset.

[A] We recorded the numbers of QB clusters in 25 different random permutations of the tractographies of 10 human subjects acquired as described in Section

By contrast the within-subject variability of the number of clusters across random permutations is rather small, with mean SD 12.7 (min. 7.3; max. 17.4). The standard error of the individual subject means above is (worst case) ±3.9 which gives strong assurance that 25 random permutations are adequate to get reliable subject level estimates and that there is minimal fluctuation across these permutations. This suggests a good level of consistency in the data reduction achieved by QB within each tractography.

[B] Next we investigated how consistent QB clusterings are when datasets are permuted. Sixteen different random permutations were generated for each of 10 tractographies and the corresponding QB clusterings were computed with a clustering threshold of 10 mm. For each subject the 120 pairings of QB clusterings were compared using the optimized matching agreement (OMA) index and then averaged. Across subjects the average OMA (see Section

To motivate our understanding of worst and best case scenarios when the clusterings in question faithfully capture the structure of the underlying dataset, we consider what happens when the dataset consists of parallel lines of uniform spacing. The result of QB clustering is an approximate partitioning into equally spaced pieces. There will typically be an offset between two such partitionings, and the OMA between them will range between 100% when they coincide and 50% when they are most out of phase.

[C] Recognizing that large tractography datasets present a computational challenge, some authors (e.g., O’Donnell and Westin,

Thresholds | Comparison | Coverage % (SD) | Overlap (SD) |
---|---|---|---|
10 mm/10 mm | QB centroids | 99.96 (0.007) | 2.44 (0.08) |
10 mm/10 mm | Random | 90.49 (0.41) | 6.16 (0.55) |
20 mm/20 mm | QB centroids | 99.99 (0.004) | 3.54 (0.18) |
20 mm/20 mm | Random | 95.86 (0.62) | 6.81 (0.93) |

We conclude from this that the QB centroids have near perfect coverage, and the typical streamline is adjacent to 2–4 centroids, depending on threshold. By comparison the random subsets have rather lower coverage, failing to approximate between roughly 4 and 10% of the tractography depending on the choice of threshold. Moreover the overlap rises strikingly to between 6 and 7. Therefore QB has overall superior performance to a random set.

[D] The final check on the effectiveness of QB clustering centroids is to see how well they approximate a dataset from which they were not derived. For this purpose the coverage and overlap statistics for the QB centroids were compared between the first half of the tractographies from which they were derived, and the second half. The results are presented in Table

Thresholds | Comparison | Coverage % (SD) | Overlap (SD) |
---|---|---|---|
10 mm/10 mm | First half | 99.96 (0.007) | 2.44 (0.08) |
10 mm/10 mm | Second half | 99.31 (0.08) | 2.44 (0.08) |
20 mm/20 mm | First half | 99.99 (0.004) | 3.54 (0.18) |
20 mm/20 mm | Second half | 99.91 (0.007) | 3.54 (0.18) |

For each threshold, the first row repeats that of the previous table, while the second row shows that there is only a small fall-off in coverage, and that the overlap is unchanged. The QB centroids can therefore be taken as a valid reduction of the other halves of the datasets.

We warped 10 tractographies each belonging to a different healthy subject (see Section ^{2}

For every subject we only considered the biggest 100 QB clusters, i.e., the clusters which contained the highest number of streamlines. The purpose of this experiment was to identify a similarity measure between the streamlines of the different subjects.

In Figure

Further insights into the kind of correspondences that QB establishes are shown in Figure

The mean total number of streamlines in the 100 biggest clusters was 4,818.6 (±794.4). These clusters covered on average 16.18% (±1.4%) of the total number of streamlines. We proceeded to use these centroids to study the variability between the streamlines across different subjects.

For this purpose we evaluated the bundle adjacency statistic (BA; see Section

For BA10 the most dissimilar subjects were subjects 4 and 6 with BA = 38.5%. The most similar subjects were 4 and 5 with BA = 59.5%. The mean BA10 was 48% (±4.9%). With BA20 the most dissimilar subjects were subjects 7 and 10 with BA = 72%. The most similar subjects were, in agreement with BA10, 4 and 5 with BA = 88.5%. The mean BA20 was 80% (±3.2%).

In this experiment there was great variability in centroid lengths (mean = 73.6 ± 43.9 mm). If we suppose that shorter streamlines are more likely to be noise artifacts, we would expect that concentrating on longer streamlines would give a more robust similarity measure for tractography comparison. We propose to follow this up in future work by studying how the length of the big clusters affects BA.

In general taking short streamlines into account is less valid because (a) the longer streamlines have greater potential to be useful landmarks when comparing or registering different subjects, as they are more likely to be present in most subjects, (b) removing short streamlines facilitates the usage of distance based clustering (no need for manually setting the clustering threshold) and interaction with the tractography, and (c) typically one first wants to see the overall representation of the tractography and later go to the details. MDF distance often separates shorter from longer neighboring streamlines which is both a strength and a limitation according to application. Nonetheless, after having clustered the longer streamlines there are many ways to assign the shorter clusters to their closest longer clusters. For this purpose we recommend using a different distance from MDF for example the minimum version of MAM referred to as MAM_{min} in equation (

Here we discuss two simple strategies for clustering short streamlines. The first is an unsupervised technique and the second is supervised.

Cluster the long streamlines using QB with clustering threshold at 10 mm and then cluster the short streamlines (<100 mm) to a lower threshold and assign them to their closest long streamline bundle from the first clustering using the MAM_{min} distance.

Cluster the tractography of a subject and pick a centroid streamline. Find the streamlines closest to that centroid using MDF and cluster them. Then, for each of the resulting new centroid streamlines, find the closest streamlines using the MAM_{min} distance. We should now have an amalgamation of shorter and longer streamlines in one cluster.

An example of this second strategy is shown in Figure _{min} distance. In this way we managed to bring together in a semi-automatic fashion an entire bundle consisting both of long and short streamlines by just selecting initially a single representative streamline.

_{min} distance) from the centroid streamlines in

We have presented a novel and powerful algorithm – QuickBundles (QB). This algorithm provides simplifications to the problem of revealing the detailed anatomy of the densely packed white matter which has recently attracted much scientific attention; and it is recommended when large data sets are involved. QB can be used with all types of diffusion MRI tractographies which generate streamlines (e.g., probabilistic or deterministic) and it is independent of the reconstruction model. QB is supported by a distance function MDF on the space of streamlines which makes it a metric space. QB can achieve compression ratios of the order of 200:1 depending on the clustering threshold while preserving characteristic information about the tractography.

In common with mainstream clustering algorithms such as

Other algorithms previously too slow to be used on the entire tractography can now be used efficiently too if they start their clustering on the output of QB rather than the initial full tractography.

We saw that QB is a linear time clustering method based on streamline distances, which is on average linear time

Additionally, QB can be used to explore multiple tractographies and find correspondences or similarities between different tractographies. This can be facilitated by the use of Bundle Adjacency (BA), a new similarity measure introduced in this paper.

The reduction in dimensionality of the data achieved by QB means that BOIs (bundles of interest) can be selected as an alternative to ROIs for interrogating or labeling the data sets. Our experience with ROI-based white matter atlases (WMAs) is that they cannot differentiate fiber directions, i.e., several different bundles could cross an ROI. Therefore, ROIs constructed with a WMA do not lead to anatomical bundles and typically lead to large sprawling sets of streamlines. BOIs seem to be a solution to this problem and BOI creation can be facilitated by QB. Furthermore, we showed that QB can be used to find obscured streamlines not visible to the user at first instance. Therefore, QB opens up the road to create rapid tools for exploring tractographies of any size.

In the future we would like to investigate different ways to merge QB clusters by integrating prior information from neuroanatomists. We are currently working on developing interactive tools which exploit the simplification that QB provides (see Garyfallidis et al.,

We have shown results with data from simulations, single and multiple real subjects. The code for QuickBundles is freely available at

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

We gratefully acknowledge valuable discussions with Arno Klein and John Griffiths on various aspects of this work.
