How to use Snorkel on numeric data?
April 14, 2021 at 4:50pm (Edited 3 weeks ago)
Hi - I'm trying to see if I can use Snorkel to label my numeric training dataset (and then either train a classifier on the labels or convert the labels to scores and do regression).
My question is whether there are coding examples anywhere that show how to use Snorkel with purely numeric training data. I'm from a biology/non-CS background, so I'm not sure how to get started with Snorkel in this case. I have tried to repurpose the text-based/NLP examples Snorkel provides, but I haven't made much progress.
For reference, my training data is rows of genes, each with columns of numeric data measuring different biological qualities of the genes. I also have some labeling rules based on that data, drawn from my domain knowledge, that I would be very interested in giving to Snorkel so it can apply labels. I've also posted a question about this problem on Stack Overflow with the example code I'm trying to use (https://stackoverflow.com/questions/67105512/how-to-apply-snorkel-to-numeric-data)
April 17, 2021 at 9:07pm
Hi (hnicholls), yes! Snorkel has been used on lots of datasets and applications with numeric data. Fundamentally, very little changes: you still express your domain knowledge in functions that either output a label or abstain. The only thing that changes is what the logic in those labeling functions looks like. Instead of using regular expressions, for example, you might do numeric comparisons. E.g., if your data points have attributes for height and width, you could say something like the following (in pseudocode):
    def lf(x):
        if abs(x.height - x.width) < 50:
            return 1  # Assuming class 1 is the class that makes sense for this rule
        return -1  # Otherwise abstain (-1 is Snorkel's convention for "no label")
And the calculations you perform can be arbitrarily simple or complex; they'll just be operating on numeric properties instead of text-based ones. Another approach I've seen used before is plotting your data along different axes to visualize where different clusters are, and then writing LFs to essentially label a cluster at a time.
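For a gene table like the one you describe, a minimal sketch might look like the following. The column names (`expression`, `conservation`), the thresholds, and the class constants are all made up for illustration, so swap in your own. To keep it runnable without installing anything, it uses plain Python and a simple majority vote instead of Snorkel's own applier and label model; with Snorkel itself you would decorate each function with `@labeling_function()` and apply them with `PandasLFApplier`.

```python
from collections import Counter
from types import SimpleNamespace

# Integer class labels; -1 follows Snorkel's abstain convention.
ABSTAIN = -1
NEGATIVE = 0
POSITIVE = 1

# Hypothetical rules: thresholds and attribute names are placeholders.
def lf_high_expression(x):
    return POSITIVE if x.expression > 8.0 else ABSTAIN

def lf_low_conservation(x):
    return NEGATIVE if x.conservation < 0.2 else ABSTAIN

def lf_combined_signal(x):
    # Rules can combine several numeric columns.
    return POSITIVE if x.expression * x.conservation > 5.0 else ABSTAIN

LFS = [lf_high_expression, lf_low_conservation, lf_combined_signal]

def apply_lfs(rows, lfs=LFS):
    """Build a label matrix: one row per data point, one column per LF."""
    return [[lf(row) for lf in lfs] for row in rows]

def majority_vote(votes):
    """Stand-in for a label model: most common non-abstain vote, abstain on ties."""
    counted = Counter(v for v in votes if v != ABSTAIN)
    if not counted:
        return ABSTAIN
    (top, top_n), *rest = counted.most_common()
    if rest and rest[0][1] == top_n:
        return ABSTAIN
    return top

genes = [
    SimpleNamespace(name="geneA", expression=9.5, conservation=0.9),
    SimpleNamespace(name="geneB", expression=2.0, conservation=0.1),
    SimpleNamespace(name="geneC", expression=5.0, conservation=0.5),
]

label_matrix = apply_lfs(genes)
labels = [majority_vote(votes) for votes in label_matrix]
```

Here geneC stays unlabeled because every rule abstains on it, which is expected: each LF only needs to cover the part of the data it is confident about.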
More specifically, looking at the example code you shared on Stack Overflow, the issue seems to be that your LFs are outputting floats rather than integers. The library doesn't currently support regression problems, only classification, so each LF needs to return an integer class label (or abstain).
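If a rule naturally produces a score, one possible workaround is to bucket that score into integer classes inside the LF. This is only a sketch: `score_gene`, the dictionary keys, and the 0.3/0.7 cutoffs are hypothetical placeholders, not something taken from your Stack Overflow code.

```python
# Integer class labels; -1 follows Snorkel's abstain convention.
ABSTAIN = -1
LOW = 0
HIGH = 1

def score_gene(x):
    """Hypothetical float-valued score combining two numeric columns."""
    return 0.5 * x["expression"] + 0.5 * x["conservation"]

def lf_score_bucket(x):
    """Turn the float score into an integer label instead of returning it."""
    score = score_gene(x)
    if score >= 0.7:   # assumed cutoff for the HIGH class
        return HIGH
    if score <= 0.3:   # assumed cutoff for the LOW class
        return LOW
    return ABSTAIN     # middle range: no confident label
```

The middle band abstains on purpose, so ambiguous scores are left for other LFs to label rather than being forced into a class.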