Taskonomy: Disentangling Task Tr
4. Experiments
With 26 tasks in the dictionary (4 source-only tasks), our approach leads to training 26 fully supervised task-specific networks,

order, from which we sample according to the procedure in Sec. 3. The total number of transfer functions trained for the taxonomy was ∼3,000, which took 47,886 GPU hours on the cloud.

Out of 26 tasks, we usually use the following 4 as source-only tasks (described in Sec. 3) in the experiments: colorization, jigsaw puzzle, in-painting, random projection. However, the method is applicable to an arbitrary partitioning of the dictionary into T and S. The interactive solver website allows the user to specify any desired partition.

Table 1: Task-Specific Networks’ Sanity: Win rates vs. random (Gaussian) network representation readout and statistically informed guess avg.
Network Architectures: We preserved the architectural and training details across tasks as homogeneously as possible to avoid injecting any bias. The encoder architecture is identical across all task-specific networks and is a fully convolutional ResNet-50 without pooling. All transfer functions include identical shallow networks with 2 conv layers (concatenated channel-wise if higher-order). The loss and the decoder’s architecture, though, have to depend on the task, as the output structures of different tasks vary: for all pixel-to-pixel tasks, e.g. normal estimation, the decoder is a 15-layer fully convolutional network; for low-dimensional tasks, e.g. vanishing points, it consists of 2-3 FC layers. All networks are trained using the same hyperparameters regardless of task and on exactly the same input images. Tasks with more than one input, e.g. relative camera pose, share weights between the encoder towers. Transfer networks are all trained using the same hyperparameters as the task-specific networks, except that we anneal the learning rate earlier since they train much faster. Detailed definitions of architectures, training process, and experiments with different encoders can be found in the supplementary material.
Data Splits: Our dataset includes 4 million images. We made publicly available the models trained on the full dataset, but for the experiments reported in the main paper, we used a subset of the dataset, as the extracted structure stabilized and did not change when using more data (explained in Sec. 5.2). The used subset is partitioned into training (120k), validation (16k), and test (17k) images, each from non-overlapping sets of buildings. Our task-specific networks are trained on the training set, and the transfer networks are trained on a subset of the validation set, ranging from 1k images to 16k, in order to model the transfer patterns under different data regimes. In the main paper, we report all results under the 16k transfer supervision regime (∼10% of the split) and defer the additional sizes to the supplementary material and website (see Sec. 5.2). Transfer functions are evaluated on the test set.
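As an illustration of what "non-overlapping sets of buildings" implies for the splits, the following is a minimal sketch that assigns whole buildings, rather than individual images, to a split. The function name `split_by_building`, the split fractions, and the toy image and building identifiers are all hypothetical and are not taken from our pipeline.

```python
import random
from collections import defaultdict


def split_by_building(image_to_building, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign every building to exactly one of train/val/test.

    image_to_building maps an image id to the id of the building it was
    captured in. The fractions are illustrative assumptions only.
    """
    buildings = sorted(set(image_to_building.values()))
    rng = random.Random(seed)
    rng.shuffle(buildings)

    n_train = int(fractions[0] * len(buildings))
    n_val = int(fractions[1] * len(buildings))

    # Decide the split at the building level, so that no building can
    # contribute images to two different splits.
    split_of = {}
    for i, building in enumerate(buildings):
        if i < n_train:
            split_of[building] = "train"
        elif i < n_train + n_val:
            split_of[building] = "val"
        else:
            split_of[building] = "test"

    splits = defaultdict(list)
    for image, building in image_to_building.items():
        splits[split_of[building]].append(image)
    return dict(splits)


# Toy data: 10 hypothetical buildings with 3 images each.
toy = {f"img_{b}_{k}": f"building_{b}" for b in range(10) for k in range(3)}
splits = split_by_building(toy)
```

Splitting at the building level means that test images never show a space that was seen during training, which is the property the partition above relies on.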
How good are the trained task-specific networks? Win rate (%) is the proportion of test set images for which a baseline is beaten. Table 1 provides win rates of the task-specific networks vs. two baselines. Visual outputs for a random test sample are in Fig. 3. The high win rates in Table 1 and the qualitative results show the networks are well trained and stable and can be relied upon for modeling the task space. Results of applying the networks frame-by-frame to a YouTube video, as well as a live demo for user-uploaded queries, are available online.

Figure 8: Computed taxonomies for solving 22 tasks given various supervision budgets (x-axes), and maximum allowed transfer orders (y-axes). One is magnified for better visibility. Nodes with incoming edges are target tasks, and the number of their incoming edges is the order of their chosen transfer function. Still transferring to some targets when the budget is 26 (full budget) means certain transfers started performing better than their fully supervised task-specific counterpart. See the interactive solver website for color coding of the nodes by Gain and Quality metrics. Dimmed nodes are the source-only tasks, and thus only participate in the taxonomy if the BIP optimization finds them worthwhile as one of the sources.
To get a sense of the quality of our networks vs. state-of-the-art task-specific methods, we compared our depth estimator with the released models of [53]; ours outperformed [53] with a win rate of 88% and losses of 0.35 vs. 0.47 (further details in the supplementary material). In general, we found the task-specific networks to perform on par with or better than the state of the art for many of the tasks, though we do not formally benchmark or claim this.
4.1. Evaluation of Computed Taxonomies
Fig. 8 shows the computed taxonomies optimized to solve the full dictionary, i.e. all tasks are placed in T and S (except for 4 source-only tasks that are in S only). This was done for various supervision budgets (columns) and maximum allowed order (rows) constraints. Still seeing transfers to some targets when the budget is 26 (full dictionary) means certain transfers became better than their fully supervised task-specific counterpart.
While Fig. 8 shows the structure and connectivity, Fig. 9 quantifies the results of the taxonomy-recommended transfer policies by two metrics, Gain and Quality, defined as:
Gain: win rate (%) against a network trained from scratch using the same training data as the transfer networks. That is, the best that could be done if transfer learning was not utilized. This quantifies the value gained by transferring.
Quality: win rate (%) against a fully supervised network trained with 120k images (gold standard).
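To make the two metrics concrete, the following is a minimal sketch of how a win rate could be computed from per-image test losses. The helper `win_rate` and the loss values are hypothetical; in particular, counting a tie as a non-win is an assumption of this sketch, not something specified above.

```python
def win_rate(method_losses, reference_losses):
    """Percentage of test images on which the method beats the reference.

    Both arguments are per-image losses on the same test images, in the
    same order. A lower loss wins; ties count as a non-win (assumption).
    """
    if not method_losses or len(method_losses) != len(reference_losses):
        raise ValueError("need two non-empty loss lists of equal length")
    wins = sum(m < r for m, r in zip(method_losses, reference_losses))
    return 100.0 * wins / len(method_losses)


# Made-up per-image losses on four test images.
transfer = [0.30, 0.42, 0.25, 0.50]          # taxonomy transfer policy
scratch = [0.45, 0.40, 0.35, 0.60]           # trained from scratch on the same data
fully_supervised = [0.28, 0.41, 0.20, 0.55]  # gold standard (120k images)

gain = win_rate(transfer, scratch)              # 75.0
quality = win_rate(transfer, fully_supervised)  # 25.0
```

In this toy case the transfer policy beats the scratch network on three of four images (Gain of 75%) and the gold standard on one of four (Quality of 25%).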

Figure 9: Evaluation of taxonomy computed for solving the full task dictionary. Gain (left) and Quality (right) values for each task using the policy suggested by the computed taxonomy, as the supervision budget increases (→). Shown for transfer orders 1 and 4.
Red (0) and Blue (1) represent outperforming the reference method on none and all of the test set images, respectively, so the transition Red→White→Blue is desirable. White (.5) represents equal performance to the reference.

Figure 10: Generalization to Novel Tasks. Each row shows a novel test task. Left: Gain and Quality values using the devised “all-for-one” transfer policies for novel tasks for orders 1-4. Right: Win rates (%) of the transfer policy over various self-supervised methods, ImageNet features, and scratch are shown in the colored rows. Note the large margin of win by taxonomy. The uncolored rows show corresponding loss values.
Each column in Fig. 9 shows a supervision budget. Good results can be achieved even when the supervision budget is notably smaller than the number of solved tasks, and, as expected, results improve as the budget increases. Results are shown for 2 maximum allowed orders.
4.2. Generalization to Novel Tasks
The taxonomies in Sec. 4.1 were optimized for solving all tasks in the dictionary. In many situations, a practitioner is interested in a single task which may not even be in the dictionary. Here we evaluate how the taxonomy transfers to a novel out-of-dictionary task with little data.
This is done in an all-for-one scenario where we put one task in T and all others in S. The task in T is target-only and has no task-specific network. Its limited data (16k) is used to train small transfer networks to sources. This basically localizes where the target would be in the taxonomy. Fig. 10 (left) shows the Gain and Quality of the transfer policy found by the BIP for each task. Fig. 10 (right) compares the taxonomy suggested policy against some of the best existing self-supervised methods [96, 103, 68, 100, 1], ImageNet FC7 features [51], training from scratch, and a fully supervised network (gold standard).
The results in Fig. 10 (right) are noteworthy. The large win margin for the taxonomy shows that carefully selecting transfer policies depending on the target is superior to fixed transfers, such as the ones employed by self-supervised methods. ImageNet features, which are the most popular off-the-shelf features in vision, are also outperformed by those policies. Additionally, though the taxonomy transfer policies lose to fully supervised networks (gold standard) in most cases, the results often get close, with win rates in the 40% range. These observations suggest the space has a rather predictable and strong structure. For graph visualization of the all-for-one taxonomy policies, please see the supplementary material. The solver website allows generating the taxonomy for arbitrary sets of target-only tasks.

Figure 11: Structure Significance. Our taxonomy compared with random transfer policies (random feasible taxonomies that use the maximum allowable supervision budget). Y-axis shows Quality or Gain, and X-axis is the supervision budget. Green and gray represent our taxonomy and random connectivities, respectively. Error bars denote 5th–95th percentiles.
5. Significance Test of the Structure
The previous evaluations showed good transfer results in terms of Quality and Gain, but how crucial is it to use our taxonomy to choose smart transfers over just choosing any transfer? In other words, how significant/strong is the discovered structure of task space? Fig. 11 quantifies this by showing the performance of our taxonomy versus a large set of taxonomies with random connectivities. Our taxonomy outperformed all other connectivities by a large margin, signifying both the existence of a strong structure in the space and a good modeling of it by our approach. Complete experimental details are available in the supplementary material.
5.1. Evaluation on MIT Places & ImageNet
To what extent are our findings dataset dependent, and would the taxonomy change if done on another dataset? We examined this by finding the ranking of all tasks for transferring to two target tasks of object classification and scene classification on our dataset. We then fine-tuned our task-specific networks on other datasets (MIT Places [104] for scene classification, ImageNet [78] for object classification) and evaluated them on their respective test sets and metrics. Fig. 12 shows how the results correlate with the taxonomy’s ranking from our dataset. The Spearman’s rho between the taxonomy ranking and the Top-1 ranking is 0.857 on Places and 0.823 on ImageNet, showing a notable correlation. See the supplementary material for complete experimental details.
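For readers unfamiliar with the statistic, the following is a minimal sketch of Spearman's rho, computed as the Pearson correlation of the two rank vectors (with average ranks for ties). The function names and the five toy values are hypothetical and are not the numbers behind Fig. 12.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) to n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up example: predicted transfer performance of five source tasks on
# our dataset vs. their Top-1 accuracy after fine-tuning on an external one.
taxonomy_score = [0.9, 0.8, 0.7, 0.6, 0.5]
external_top1 = [55.0, 52.0, 53.0, 48.0, 45.0]
rho = spearman_rho(taxonomy_score, external_top1)
```

Here only the second and third sources swap places between the two rankings, which gives a rho of 0.9; identical orderings would give 1.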
5.2. Universality of the Structure
We employed a computational approach with various design choices. It is important to investigate how specific the discovered structure is to those choices. We did stability tests by computing the variance in our output when making changes in one of the following system choices: I. architecture of task-specific networks, II. architecture of transfer function networks, III. amount of data available for training transfer networks, IV. datasets, V. data splits, VI. choice of dictionary. Overall, despite injecting large changes (e.g. varying the size of training data of transfer functions by 16x, and the size and architecture of task-specific networks and transfer networks by 4x), we found the outputs to be remarkably stable, leading to almost no change in the output taxonomy computed on top. Detailed results and the experimental setup of each test are reported in the supplementary material.
Figure 12: Evaluating the discovered structure on other datasets: ImageNet [78] (left) for object classification and MIT Places [104] (right) for scene classification. Y-axis shows accuracy on the external benchmark while bars on x-axis are ordered by taxonomy’s predicted performance based on our dataset. A monotonically decreasing plot corresponds to preserving identical orders and perfect generalization.
5.3. Task Similarity Tree
Thus far we showed the task space has a structure, measured this structure, and presented its utility for transfer learning via devising transfer policies. This structure can be presented in other manners as well, e.g. via a metric of similarity across tasks. Figure 13 shows a similarity tree for the tasks in our dictionary. This is acquired from agglomerative clustering of the tasks based on their transferring-out behavior, i.e. using columns of the normalized affinity matrix P as feature vectors for tasks. The tree shows how tasks would be hierarchically positioned w.r.t. each other when measured based on providing information for solving other tasks; the closer two tasks, the more similar their role in transferring to other tasks. Notice that the 3D, 2D, low-dimensional geometric, and semantic tasks are found to cluster together using a fully computational approach, which matches the intuitive expectations from the structure of task space. The transfer taxonomies devised by BIP are consistent with this tree, as BIP picks the sources in a way that all of these modes are quantitatively best covered, subject to the given budget and desired target set.
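The clustering step can be sketched as follows, under stated assumptions: rows of P index targets and columns index sources, distances between column vectors are Euclidean, and clusters are merged by average linkage. The distance, the linkage, the function names, and the toy 4x4 matrix are illustrative choices of this sketch and are not taken from our experiments.

```python
def column_features(P):
    """Column j of P (how source j transfers out to each target) as task j's features."""
    return [list(col) for col in zip(*P)]


def euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def agglomerate(names, features):
    """Average-linkage agglomerative clustering; returns the merges in order."""
    clusters = {i: [i] for i in range(len(names))}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for pos, a in enumerate(keys):
            for b in keys[pos + 1:]:
                # Average pairwise distance between the members of a and b.
                total = sum(euclidean(features[i], features[j])
                            for i in clusters[a] for j in clusters[b])
                dist = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((sorted(names[i] for i in clusters[a]),
                       sorted(names[i] for i in clusters[b]),
                       dist))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges


# Made-up normalized affinities (rows: targets, columns: sources).
names = ["depth", "normals", "object_cls", "scene_cls"]
P = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.1, 0.2],
    [0.2, 0.1, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
merges = agglomerate(names, column_features(P))
```

On this toy matrix the two 3D tasks merge first and the two semantic tasks merge second, before the two groups are joined at the root, which mirrors the kind of grouping described above.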
6. Limitations and Discussion
We presented a method for modeling the space of visual tasks by way of transfer learning and showed its utility in reducing the need for supervision. The space of tasks is an interesting object of study in its own right and we have only scratched the surface in this regard. We also made a number of assumptions in the framework which should be noted.

Figure 13: Task Similarity Tree. Agglomerative clustering of tasks based on their transferring-out patterns (i.e. using columns of normalized affinity matrix as task features). 3D, 2D, low dimensional geometric, and semantic tasks clustered together using a fully computational approach.
Model Dependence: We used a computational approach and adopted neural networks as our function class. Though we validated the stability of the findings w.r.t. various architectures and datasets, it should be noted that the results are in principle model and data specific. The current model also does not include a principled mechanism for handling uncertainty or probabilistic reasoning.
Compositionality: We performed the modeling via a set of common human-defined visual tasks. It is natural to consider a further compositional approach in which such common tasks are viewed as observed samples which are composed of computationally found latent (sub)tasks.
Space Regularity: We performed modeling of a dense space via a sampled dictionary. Though we showed a good tolerance w.r.t. the choice of dictionary and transferring to out-of-dictionary tasks, this outcome holds upon a proper sampling of the space as a function of its regularity. More formal studies on the properties of the computed space are required for this to be provably guaranteed in the general case.
Transferring to Non-visual and Robotic Tasks: Given the structure of the space of visual tasks and the demonstrated transferabilities to novel tasks, it is worthwhile to question how this can be employed to develop a perception module for solving downstream tasks which are not entirely visual, e.g. robotic manipulation, but entail solving a set of (a priori unknown) visual tasks.
Lifelong Learning: We performed the modeling in one go. In many cases, e.g. lifelong learning, the system is evolving and the number of mastered tasks constantly increases. Such scenarios require augmentation of the structure with expansion mechanisms based on new beliefs.
Acknowledgement: We acknowledge the support of NSF (DMS-1521608), MURI (1186514-1-TBCJE), ONR MURI (N00014-14-1-0671), Toyota (1191689-1-UDAWF), ONR MURI (N00014-13-1-0341), Nvidia, Tencent, a gift by Amazon Web Services, and a Google Focused Research Award.
