I am confused: What is the role of labeled data in Snorkel? Thanks!
October 10, 2019 at 7:03pm
When I first started reading about Snorkel I had the impression that it could learn to label without any labeled data whatsoever, which seems amazing, if not impossible -- almost like a perpetual motion machine: "We show...we can recover accuracies...without any labeled data...". Then in the spam tutorial I got the impression that labeled data is useful for testing the accuracy of your labeling functions, but still not strictly necessary. But then thinking about a group-think scenario, where you have workers who mean well but where the majority is incorrect and outvote a minority, it seems the only way you can correctly get the proper accuracy for the labeling function is by validation data. And in the crowd sourcing scenario it seems you did need labeled examples ("Fortunately, our label model learns weights for LFs based on their outputs on the training set.").
So, wouldn't it be more correct to say that Snorkel learns a labeling function from a relatively small amount of labeled validation data, so you can label and leverage a much larger amount of unlabeled data? Or am I missing something? Or am I looking at thinking that has evolved over the past few years?
Again, I'm confused and would really like to know as I'm giving a talk tomorrow on Snorkel and would really like to know: Does Snorkel learn to label unlabeled data and the accuracy of its labels totally without labeled data, or how much labeled data does it require? (e.g., 70% training, 10% validation for the generative labeling function to learn accuracies, and then 20% for testing -- is that reasonable?)
Thanks again! -- Bill
October 10, 2019 at 10:36pm
Hey. I'm not a Snorkel expert myself, but you're entirely correct: weak supervision generally refers to the case of using a small number of high-precision labels to ultimately provide soft/hard labels for a much larger set of unlabeled data points. A bit more exposition below:
Snorkel isn't attempting to do unsupervised learning, which is where we infer class labels from the structure of unlabeled data points (e.g., going from x_i => y_i). Instead, it supports weak supervision, which is where we use a bunch of smaller labeling functions [L_1, L_2, ..., L_k] to give our model a set of proxies for the true label. So you've gone from having a set of features that make up X = [x_1, x_2, ..., x_i] to a set of labeling functions L = [L_1, L_2, ..., L_k] that approximate the true label. So now what?
Well, the idea is that we can train a model specifically designed to go from a matrix of labeling function outputs L to a matrix of possible labels Y. The nitty-gritty details are in the Snorkel paper here: https://arxiv.org/abs/1711.10160. You can also reference the (in my opinion) more accessible blog post here: https://hazyresearch.github.io/snorkel/blog/wsblog_post.html. But generally, weak supervision hinges on an (unfortunate) truth about the world: high-precision label acquisition is generally very costly, since we usually need to have some sort of human-in-the-loop/expert labeling our data. For example, for cancer classification datasets, it's really expensive to get labels since you need to hire a radiologist (or buy their data) to label your data for you. But data quality tends to follow the law of diminishing returns: it (generally) costs disproportionately more to get your labels from "clean" to "very clean" than from "decent" to "clean". Noisy labeling functions sort of exploit this reality: you can craft a bunch of "decent" labeling functions in the same amount of time it would take you to design a single "clean" labeling function. You then outsource the de-noising (that is, going from "decent" to "clean" labels) to Snorkel, and get probabilistic labels that work very well.
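To make the [L_1, L_2, ..., L_k] idea a bit more concrete, here is a minimal sketch in plain Python (not the Snorkel API). The three labeling functions, the label constants, and the example comments are all made up for illustration, loosely in the spirit of the spam tutorial. Each function either votes for a class or abstains, and stacking the votes gives the label matrix L with one row per data point and one column per labeling function.

```python
# Hypothetical labeling functions for a toy spam task.
# Names, rules, and constants are illustrative assumptions, not Snorkel code.
ABSTAIN, HAM, SPAM = -1, 0, 1

def lf_contains_link(text):
    # Assumed heuristic: comments with URLs are often spam.
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_says_subscribe(text):
    # Assumed heuristic: asking for subscriptions is often spam.
    return SPAM if "subscribe" in text.lower() else ABSTAIN

def lf_short_comment(text):
    # Assumed heuristic: very short comments tend to be genuine reactions.
    return HAM if len(text.split()) < 5 else ABSTAIN

LFS = [lf_contains_link, lf_says_subscribe, lf_short_comment]

def apply_lfs(texts, lfs=LFS):
    """Build the label matrix: one row per data point, one column per LF."""
    return [[lf(t) for lf in lfs] for t in texts]

unlabeled = [
    "Subscribe to my channel at http://example.com",
    "love this song",
    "Please subscribe, I post new videos every week",
    "This reminds me of the summer I spent driving across the country",
]

L_matrix = apply_lfs(unlabeled)
for text, row in zip(unlabeled, L_matrix):
    print(row, text)
```

Note that the last comment gets no vote at all: labeling functions can be noisy, can conflict, and can leave points uncovered, which is exactly what the label model then has to sort out.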
But then thinking about a group-think scenario, where you have workers who mean well but where the majority is incorrect and outvote a minority, it seems the only way you can correctly get the proper accuracy for the labeling function is by validation data.
Snorkel is still performing weak supervision, where – all things considered – it's better to have a larger, more diverse set of "gold" labels and training data than not. This is generally seen across all applications of supervised learning, since more data means you can better understand the underlying data distribution that your data collection is sampling from.
So, wouldn't it be more correct to say that Snorkel learns a labeling function from a relatively small amount of labeled validation data, so you can label and leverage a much larger amount of unlabeled data?
I'd definitely say so!
Does Snorkel learn to label unlabeled data and the accuracy of its labels totally without labeled data, or how much labeled data does it require? (e.g., 70% training, 10% validation for the generative labeling function to learn accuracies, and then 20% for testing -- is that reasonable?)
I think you can use Snorkel without any gold-label data, but without a validation set you won't be able to properly evaluate your weak supervisor's ability to generalize from a small (labeled) dataset to a much larger (unlabeled) dataset; you're kind of just hoping it works out. Re split sizes: this depends on the individual ML problem, and I'm not sure there's a definitive answer. Generally, you'll be limited by the number of high-quality labels you can actually obtain (that is, you won't have to ask this question since you'll only have so many labeled points). That said, like the spam tutorial notes (https://www.snorkel.org/use-cases/01-spam-tutorial#data-splits-in-snorkel), you'll want to make sure that your labeled sets (if you have the luxury of that many labels) are identically distributed, since you'll be using those to: develop labeling functions (dev set), tune the label model's hyper-parameters (val set), evaluate performance on never-before-seen-data (test set), and actually generate labels for your unlabeled points (train set).
This was a bit of a wall of text, but I hope this helps! Best of luck with your presentation.
Thanks very much! Yes, that helps a lot! I think I can see now how you could use Snorkel solely with labeling functions and unlabeled data to first label the data and then train a model on the newly labeled data -- it's just that it wouldn't be a very good idea at all as you wouldn't have any idea of the quality of the labeling functions or of your results. Normally, without Snorkel, we just are concerned with developing and then testing a classifier but now it seems we have a similar step for Snorkel itself. So we want to develop and test the quality of Snorkel's labeling function -- I think you're calling that the weak supervisor -- and the labeling functions that it uses, and then after that we have yet more steps to develop and test the classifier that uses that now completely labeled data. In other words, the weak supervision adds further steps in iteration but now into the labeling process. Hope I've got that right. Phew! Thanks for all your help, any further corrections appreciated! -- Bill
Actually, I thought I had it, but now I'm watching Chris Re's XLDB-2019 video at 14:27, where the title is "Problem: Labeling functions output labels that are noisy, conflicting, and correlated" and at the bottom it says "Solution: Recover labeling function accuracies and correlations without labeled data using data programming." and I'm confused again...because my understanding is we do need labeled data to evaluate how accurate LF 1...5 are in that diagram to understand how well that overall labeling function is working. This is a challenge!
Hey thanks for the question! Chipping in quickly here, but please do check out our various materials that we've put online over the years about Snorkel (and do let us know if there are specific points that were confusing).
The idea in Snorkel is to entirely replace hand-labeling training data with writing labeling functions (and other operators) to programmatically label (and transform and slice) training data. That is, Snorkel does not need any hand-labeled data for training. Just (A) unlabeled data and (B) labeling functions.
In the ML community, we often call this kind of approach weak supervision, since we are using "weaker" or less reliable labels created by the labeling functions, instead of manually created ground truth labels. (Note: the setting where you have a small set of hand-labeled data and a larger set of unlabeled data is generally referred to as semi-supervised learning, and can be nicely complementary to weak supervision approaches, but that's for another post...)
Now, you might ask: how do I check and evaluate the performance of a model trained with Snorkel? You could do this however you want, but we always recommend relying on a small hand-labeled test or evaluation dataset, i.e. doing it the old fashioned way. That is, we don't need any hand-labeled data for training in Snorkel, but we do often use some for testing. The key is that a test set can usually be orders of magnitude smaller than a training set!
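As a rough illustration of checking quality against a small hand-labeled test set, here is a plain Python sketch. It uses a simple majority vote over labeling function outputs as a stand-in for the label model (this is a baseline, not what Snorkel's label model does), and the six rows of votes and the hand labels are invented for the example.

```python
from collections import Counter

ABSTAIN = -1

def majority_vote(row):
    """Return the most common non-abstain vote, or ABSTAIN on no votes or a tie."""
    votes = Counter(v for v in row if v != ABSTAIN)
    if not votes:
        return ABSTAIN
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

def evaluate(L_test, y_test):
    """Accuracy on the points that received a label, plus coverage (fraction labeled)."""
    preds = [majority_vote(row) for row in L_test]
    labeled = [(p, y) for p, y in zip(preds, y_test) if p != ABSTAIN]
    coverage = len(labeled) / len(y_test)
    accuracy = sum(p == y for p, y in labeled) / len(labeled) if labeled else 0.0
    return accuracy, coverage

# Made-up LF outputs and hand labels for a six-point test set (0 = ham, 1 = spam).
L_test = [
    [1, 1, -1],
    [-1, -1, 0],
    [1, -1, 0],
    [-1, 1, -1],
    [-1, -1, -1],
    [1, 1, 0],
]
y_test = [1, 0, 0, 1, 0, 0]

accuracy, coverage = evaluate(L_test, y_test)
print(f"accuracy on labeled points: {accuracy:.2f}, coverage: {coverage:.2f}")
```

The hand labels only appear in `evaluate`; nothing upstream of that call ever sees them, which is the sense in which labeled data is used for testing but not for training.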
Finally, you might ask: how do I come up with ideas for labeling functions? Often it's best to look at some data (which we can assume is then implicitly labeled). In some tutorials we refer to this as a development set: a small set of hand-labeled data which we assume is generated in the course of looking at / exploring some data while coming up with LF ideas. However, this dataset is not strictly needed.
In summary: Snorkel does not need any hand-labeled training data to train a model, just unlabeled data and labeling functions! However, we often practically rely on one or two tiny hand-labeled datasets during development (the dev set) and testing (the test set). Hope this helps!
And to clarify - you do not need labeled data to learn the accuracies, correlations, and other parameters of the labeling functions! The surprising fact that you can in fact do this has been the focus of several ML papers we've published over the years. It sounds shocking (suspicious even!), but it's actually building on old ideas and theory. The intuition: we look at the agreement and disagreements between the labeling functions, and learn their accuracies (which ones to trust) based on that data alone!
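For anyone who wants to see the agreement/disagreement intuition in action, here is a toy sketch in plain Python. It is not Snorkel's label model; it is a simplified moment-matching estimate under strong assumptions that I am adding for illustration: exactly three labeling functions, binary labels coded as -1/+1, no abstains, errors that are conditionally independent given the true label, and every labeling function better than random. Under those assumptions the expected product of two votes equals the product of their "edge" over chance, so each accuracy can be solved for from pairwise agreement rates. The hidden labels are used only to simulate votes and never reach the estimator.

```python
import math
import random

def simulate_votes(n, accuracies, seed=0):
    """Draw hidden labels y in {-1, +1} and LF votes that match y with the given probability."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = rng.choice((-1, 1))  # the true label; never shown to the estimator
        rows.append([y if rng.random() < acc else -y for acc in accuracies])
    return rows

def agreement(rows, i, j):
    """Average of vote_i * vote_j: +1 when the two LFs agree, -1 when they disagree."""
    return sum(r[i] * r[j] for r in rows) / len(rows)

def estimate_accuracies(rows):
    """Recover each LF's accuracy from pairwise agreement rates alone.

    Assumes exactly three LFs that are conditionally independent given the true
    label and each better than random. Then E[v_i * v_j] = a_i * a_j with
    a_i = 2 * acc_i - 1, so a_i = sqrt(E_ij * E_ik / E_jk).
    """
    e01 = agreement(rows, 0, 1)
    e02 = agreement(rows, 0, 2)
    e12 = agreement(rows, 1, 2)
    a = [
        math.sqrt(max(e01 * e02 / e12, 0.0)),
        math.sqrt(max(e01 * e12 / e02, 0.0)),
        math.sqrt(max(e02 * e12 / e01, 0.0)),
    ]
    return [(1 + a_i) / 2 for a_i in a]

true_accuracies = [0.9, 0.8, 0.7]
votes = simulate_votes(50000, true_accuracies)
estimated = estimate_accuracies(votes)
for true, est in zip(true_accuracies, estimated):
    print(f"true accuracy {true:.2f}, estimated {est:.3f}")
```

The "better than random" assumption is what picks the positive square root. In the group-think scenario raised earlier in the thread, where most sources are wrong in the same way, that assumption (or the independence one) fails, and agreement rates alone cannot reveal it, which is one practical reason to keep a small hand-labeled set around for sanity checks.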
October 11, 2019 at 5:01pm