Training with ground-truth labeled and Snorkel-labeled data
January 21, 2021 at 2:44am
Background: I'm writing a conference paper on some Snorkel work, so I want to make sure I understand why Snorkel is considered a weak supervision approach, but isn't considered a semi-supervised learning approach, even though many tutorials leverage some ground truth data in addition to unlabeled data.
I understand that Snorkel was designed so that no ground truth is required. In a semi-supervised situation where we have a small number of correctly labeled samples (e.g., 50), a larger number of unlabeled samples (e.g., 300), and LFs covering all samples, my mental model is that Snorkel looks only at agreements and disagreements among the LFs: it cannot leverage the separate bank of ground-truth labels, at least not internally, to improve how it reconciles those agreements and disagreements or how it algorithmically constructs its Label Model.
However, I recall from an earlier post (June 9th) that one way to use labeled data is to use ground truth labels "where they exist and Snorkel-generated ones otherwise." If we do that, isn't the approach technically semi-supervised learning, even if Snorkel's internal Label Model construction is not? Has there been any comparison of how well this combined approach (ground-truth labels plus Snorkel labels for the rest of the data) performs against other semi-supervised learning methods?
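To make the "ground truth where it exists, Snorkel-generated otherwise" scheme concrete, here is a minimal sketch. It is illustrative only: a simple majority vote stands in for Snorkel's actual Label Model, and the data, function names, and `ABSTAIN` convention (Snorkel uses -1 for abstains) are assumptions for the example.

```python
# Hypothetical sketch: combine gold labels (where available) with
# LF-derived labels (everywhere else). Majority vote is a stand-in
# for Snorkel's LabelModel, not its real algorithm.
from collections import Counter

ABSTAIN = -1  # Snorkel's convention for an LF that abstains


def majority_vote(lf_votes):
    """Resolve one sample's labeling-function votes, ignoring abstains."""
    votes = [v for v in lf_votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN  # no LF fired; leave the sample unlabeled
    return Counter(votes).most_common(1)[0][0]


def combine_labels(L, gold):
    """L: per-sample lists of LF votes; gold: {sample index: true label}.

    Gold labels override LF-derived labels wherever they exist.
    """
    return [gold.get(i, majority_vote(row)) for i, row in enumerate(L)]


# Three LFs over five samples; gold labels known only for samples 0 and 3.
L = [
    [1, 1, ABSTAIN],
    [0, 1, 0],
    [ABSTAIN, ABSTAIN, ABSTAIN],
    [1, 0, 0],
    [1, 1, 1],
]
gold = {0: 0, 3: 1}
print(combine_labels(L, gold))  # → [0, 0, -1, 1, 1]
```

Note that sample 0's gold label (0) overrides the LFs' unanimous vote for 1, which is exactly what makes the resulting training set a mix of strong and weak labels.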
January 28, 2021 at 10:25pm
Hi (wimurr), great question! You could certainly draw those connections, and indeed some Snorkel papers have compared directly against traditional semi-supervised learning (SSL) approaches (e.g., http://www.vldb.org/pvldb/vol12/p223-varma.pdf for one). I would say that SSL is an approach where some labeled data and a larger amount of unlabeled data are used, generally in a domain/problem-agnostic way, whereas weak supervision (WS) weakly labels unlabeled data using domain expertise. The fact that some labeled data is often helpful in WS approaches certainly blurs the lines and makes comparisons between the two fair, but they remain functionally distinct. As for papers on using both labeled and unlabeled data, for one recent example see: https://mayeechen.github.io/files/Value_of_Data_Full.pdf. Hope this helps!