Note: This article was first published in Towards Data Science on Medium.
In the column “Structuring Machine Learning Concepts”, I am trying to take concepts from the Machine Learning (ML) space and cast them into new, potentially unusual frameworks to provide novel perspectives. The content is meant for people in the data science community, as well as tech-savvy individuals who are interested in the field of ML.
The last installment of “Structuring Machine Learning Concepts” was dedicated to presenting a new framework to map Supervised, Self-Supervised, Unsupervised, and Reinforcement Learning. I claim that those are the actual “pure” learning styles we should consider in ML.
The trigger for writing this piece was my company’s work in automating human labeling tasks in real-world scenarios. I was always eager to match what we are trying to do with the lingo of research but was never truly satisfied with the concepts presented there. Let’s expand on that.
When I talk about human labeling tasks, I am referring to business processes in which humans solve a Supervised Learning (SL) problem. This can be content moderation on images in media companies (e.g., deciding between “safe for publishing” and “not safe for publishing”), routing incoming emails and documents through the organization (“department 1”, “department 2”, …), or extracting information from incoming PDF orders (“name”, “IBAN”, ...). Many of these tasks are handled by a human-only process today and could benefit from automation.
Ideally, you don’t shoot for a 1-to-1 replacement; instead, you start by automating the obvious cases with algorithms and leave the rest to the human. At my company Luminovo, we have been thinking a lot about how to structure an ML system that truly lives up to the promise of continuous learning when used to automate a human-only SL process step by step. We call it Hybrid Processing since we use a human-AI hybrid to pursue the goal of processing data.
We normally start either with zero knowledge or with a pre-trained base model. In the beginning, our model is not confident enough to automate anything, so all incoming data points are labeled by the human. In doing so, she not only completes the task but also provides feedback to the model in the form of a new input-output pair, which can be used for re-training.
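To make the loop described above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the system we run in production: `CountingModel`, `hybrid_process`, `ask_human`, and the `threshold` value are all hypothetical names and choices, and the toy model simply counts labels per input instead of training a real classifier.

```python
from collections import Counter, defaultdict


class CountingModel:
    """Toy stand-in for a real classifier (assumption for illustration):
    predicts the label seen most often for an input."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def predict(self, x):
        """Return (label, confidence); confidence is 0.0 for unseen inputs."""
        seen = self.counts.get(x)
        if not seen:
            return None, 0.0
        total = sum(seen.values())
        label, hits = seen.most_common(1)[0]
        # The +1 keeps a single example from producing full confidence.
        return label, hits / (total + 1)

    def update(self, x, y):
        """Store one human-provided input-output pair ("re-training")."""
        self.counts[x][y] += 1


def hybrid_process(stream, ask_human, model, threshold=0.7):
    """Label every item in the stream; automate only the confident cases."""
    results = []
    for x in stream:
        label, confidence = model.predict(x)
        if confidence >= threshold:
            results.append((x, label, "model"))  # automated
        else:
            label = ask_human(x)                 # human completes the task
            model.update(x, label)               # and gives feedback
            results.append((x, label, "human"))
    return results


# Usage: the same kind of document arrives five times.
model = CountingModel()
out = hybrid_process(["invoice"] * 5, lambda x: "finance", model)
print([source for _, _, source in out])
```

With these toy settings, the first three items go to the human and the remaining two are automated, which mirrors the cold start described above.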
When I explain this concept to friends in the ML community, they often compare it to Active Learning (AL). Putting aside the “online nature” of most automation tasks (as opposed to consuming data in batches in a normal AL setup), I never completely warmed up to the comparison, since the overarching goal of AL is to create “as good a model” as possible with “as little data” as possible. Hybrid Processing, on the other hand, does not care about the quality of the model, at least not as its primary objective. The goal instead is to self-label as many data points as possible and send only the uncertain ones to the human.
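One way to express this difference in objectives is to measure how much of the data is self-labeled rather than how good the model is overall. The sketch below is only an assumption about how such a measurement could look: the names `automation_report`, `automation_rate`, and `automated_error_rate` are hypothetical and not established terminology.

```python
def automation_report(records, threshold):
    """Summarize a Hybrid Processing run.

    records: list of (confidence, predicted_label, true_label) tuples.
    Items with confidence >= threshold count as self-labeled by the model;
    everything else would be sent to the human.
    """
    automated = [(pred, true) for conf, pred, true in records if conf >= threshold]
    total = len(records)
    errors = sum(1 for pred, true in automated if pred != true)
    return {
        # Share of data points the model labeled on its own.
        "automation_rate": len(automated) / total if total else 0.0,
        # Share of self-labeled points that were labeled incorrectly.
        "automated_error_rate": errors / len(automated) if automated else 0.0,
    }


# Usage with four hypothetical predictions.
records = [(0.9, "a", "a"), (0.8, "a", "b"), (0.4, "b", "b"), (0.3, "a", "b")]
print(automation_report(records, threshold=0.75))
print(automation_report(records, threshold=0.85))
```

Raising the threshold sends more items to the human but lowers the error rate among the automated ones, which is the trade-off a Hybrid Processing system has to manage.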
While trying to generalize this, I realized that the whole concept of Semi-Supervised Learning (SemiSL), as explained in the last part of the “Structuring Machine Learning Concepts” series, fits in quite nicely with those considerations. Remember, in SemiSL we are trying to combine an (often small) amount of labeled data with a large amount of unlabeled data during training. This is similar to what we want to achieve in AL and Hybrid Processing. However, it is missing an important element: the human-in-the-loop. In SemiSL, we do not have access to an “oracle” but are stuck with the labeled data we are given. Also, the process is not “online”, meaning that time is not a key driver.
When looking into the SemiSL theory, I did find exactly the split I was searching for: SemiSL can be either transductive or inductive. For Transductive SemiSL, the goal is to infer the correct labels for the unlabeled data; for Inductive SemiSL, we want to infer the correct mapping from X to Y, or put differently: build the best model we can.
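The difference between the two flavors shows up mainly in what gets returned. The following sketch uses a deliberately simple nearest-neighbor self-labeling heuristic on 1-D inputs, chosen by me for illustration and not a reference SemiSL algorithm. The function names are hypothetical, and the unlabeled values are assumed to be distinct.

```python
def nearest_label(x, labeled):
    """Label of the labeled point closest to x (1-D inputs for simplicity)."""
    return min(labeled, key=lambda pair: abs(pair[0] - x))[1]


def transductive_labels(labeled, unlabeled):
    """Transductive view: return labels for exactly the given unlabeled points.

    Repeatedly labels the unlabeled point closest to the already-known set,
    so labels spread outward through the unlabeled data.
    """
    known = list(labeled)
    remaining = list(unlabeled)
    inferred = {}
    while remaining:
        x = min(remaining, key=lambda u: min(abs(u - k[0]) for k in known))
        y = nearest_label(x, known)
        inferred[x] = y
        known.append((x, y))
        remaining.remove(x)
    return inferred


def inductive_model(labeled, unlabeled):
    """Inductive view: return a mapping from X to Y usable on unseen inputs."""
    inferred = transductive_labels(labeled, unlabeled)
    training_set = list(labeled) + list(inferred.items())
    return lambda x: nearest_label(x, training_set)


labeled = [(0.0, "low"), (10.0, "high")]
unlabeled = [1.0, 2.0, 3.0, 4.0, 5.5, 9.0]

print(transductive_labels(labeled, unlabeled))  # labels for the six points only
model = inductive_model(labeled, unlabeled)
print(model(6.0), model(8.0))                   # predictions for new inputs
```

Note that 5.5 is closer to the labeled point at 10.0 than to the one at 0.0, yet the chain of unlabeled points pulls it to "low". That is the kind of benefit the unlabeled data is supposed to bring in either setting.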
2x2 Matrix for Hybrid Processing, AL, and Transductive & Inductive SemiSL. Created by Author.
Once again, I am proposing a simple 2x2 matrix with the following dimensions:
Goal: Is the ultimate goal to process data, i.e., to perform transduction and assign labels to all my unlabeled points? Or do we only care about improving the model, i.e., finding the true mapping from input to output?
Oracle: Do we have access to an “oracle” that can provide us with labels? Or are we left with the information we have already gathered?
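Read as a lookup table, the two dimensions map onto the four paradigms as sketched below. The `Goal` enum and the `paradigm` function are hypothetical names used only for illustration.

```python
from enum import Enum


class Goal(Enum):
    PROCESS_DATA = "process data"    # transduction: label the points at hand
    IMPROVE_MODEL = "improve model"  # induction: learn the input-output mapping


def paradigm(goal, has_oracle):
    """Return the quadrant of the 2x2 matrix for a goal and oracle access."""
    matrix = {
        (Goal.PROCESS_DATA, True): "Hybrid Processing",
        (Goal.IMPROVE_MODEL, True): "Active Learning",
        (Goal.PROCESS_DATA, False): "Transductive SemiSL",
        (Goal.IMPROVE_MODEL, False): "Inductive SemiSL",
    }
    return matrix[(goal, has_oracle)]


print(paradigm(Goal.PROCESS_DATA, has_oracle=True))
print(paradigm(Goal.IMPROVE_MODEL, has_oracle=False))
```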
There is a lot more you can write about all of these quadrants, especially about AL and its scenarios and query strategies. I’ll leave this for another blog post.
In this post, I finally found personal closure on the topic of how AL translates to some real-world scenarios. I introduced Hybrid Processing as a new paradigm in which our goal is to solve human labeling tasks as efficiently as possible. Together with the known concepts of AL and Transductive & Inductive SemiSL, we were able to set up a 2x2 matrix whose dimensions ask whether we aim to improve the underlying model and whether we have access to a human-in-the-loop.
Stay tuned for the next article.
Thanks to Timon Ruban, Pranay Modukuru, Lukas Krumholz, and Aljoscha von Bismarck.
Sebastian graduated top of his class with an MSc in electrical engineering from TU Munich and the CDTM. During his second MSc at Stanford he focused on management science and machine learning. He worked as a consultant at McKinsey before returning to engineering at Intel and deep tech startups.