Note: This article was first published on Towards Data Science Medium.
In the column “Structuring Machine Learning Concepts”, I am trying to take concepts from the Machine Learning (ML) space and cast them into new, potentially unusual frameworks to provide novel perspectives. The content is meant for people in the data science community, as well as tech-savvy individuals who are interested in the field of ML.
The last installment of “Structuring Machine Learning Concepts” was dedicated to introducing Hybrid Processing and mapping it together with Active Learning and Transductive & Inductive Semi-Supervised Learning in a novel framework.
The trigger for writing this piece was the ubiquity of Transfer Learning these days, branching out in many directions. It comes in various shapes and colors, but the methodology lacks a higher-level framing. Let’s expand on that.
Transfer Learning (TL) has probably been one of the most important developments in making Deep Learning practical for real-world applications. Many will remember the “ImageNet moment”, when AlexNet crushed the ImageNet competition and made neural networks the standard for Computer Vision challenges.
However, there was one problem — you needed a lot of data for this to work, which was often not available.
The solution to this came with the introduction of TL. This allowed us to take a Convolutional Neural Network (CNN) pre-trained on ImageNet, freeze the first layers, and only re-train its head on a smaller dataset, bringing CNNs into industry mass adoption.
In 2018, this “ImageNet moment” finally arrived for Natural Language Processing (NLP). For the first time, we moved from re-using static word embeddings to sharing complete language models, which have shown a remarkable capacity for capturing a range of linguistic information. Amid that development, Sebastian Ruder published his thesis on Neural TL for NLP, which already laid out a tree-breakdown of four different TL concepts.
This got me thinking: what are the different ways of using insights from one or two datasets to learn one or many tasks? These are the dimensions I could think of:
- Task: is the task being learned the same or a different one?
- Domain: does the data come from the same or a different domain?
- Order: are the tasks learned sequentially or in parallel?
I also considered adding “Importance” as a dimension to be able to include auxiliary tasks, but let’s not complicate things too much. In the end, I arrived at dimensions similar to those Ruder used for NLP. Let’s map out all eight combinations resulting from the three binary dimensions.
I will make some opinionated decisions about the terms, which are not backed by extensive literature — bear with me.
Using the dimensions Task, Domain, and Order, we end up with this 2x2x2 matrix, mapping out the concepts introduced before. For a 2D visualization, I put two dimensions on the x-axis and doubled the binary entries, ending up with 8 distinct cells (e.g., the upper left one would be same Domain, same Task, and sequential Order).
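The eight cells of this matrix can be enumerated mechanically. As a quick illustration (the value labels are taken from the dimensions above, not from any code in the original article):

```python
# Enumerate the 2x2x2 matrix spanned by the three binary dimensions
# Task, Domain, and Order described in the article.
from itertools import product

dimensions = {
    "Task": ["same", "different"],
    "Domain": ["same", "different"],
    "Order": ["sequential", "parallel"],
}

# Cartesian product of the three binary dimensions -> 8 distinct cells.
cells = list(product(*dimensions.values()))

for task, domain, order in cells:
    print(f"Task: {task:9s} | Domain: {domain:9s} | Order: {order}")
```

Each printed line corresponds to one cell of the matrix, e.g. “same Task, same Domain, sequential Order” for the upper-left cell mentioned above.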
In this post, we have used the dimensions of Task, Domain, and Order to structure the ways we can perform TL. I enjoyed indulging in my consulting past by expanding them into a larger matrix, which got me thinking about completely new scenarios while trying to fill the empty fields. The results range from some rather obvious cases (e.g., “Dataset Merging” and “Parallel Training”) all the way to known procedures that did not yet have a commonly used name (e.g., “Task Fine-Tuning”).
Stay tuned for the next articles.
Sebastian graduated top of his class with an MSc in electrical engineering from TU Munich and the CDTM. During his second MSc at Stanford he focused on management science and machine learning. He worked as a consultant at McKinsey before returning to engineering at Intel and deep tech startups.