Guidelines for using Snorkel for very imbalanced dataset
December 12, 2019 at 3:54pm
I'm using Snorkel to label a very imbalanced dataset (roughly 1 positive for every 500 negatives). I found this paper on the Snorkel website, and it is useful: https://ajratner.github.io/assets/papers/Osprey_DEEM.pdf. The authors build a balanced 'synthetic' training set by over-sampling certain categories of their data.
Unfortunately, I don't have an obvious way to oversample my minority class. Are there other guidelines which the Snorkel team proposes for working with imbalanced data? Are there specific parameters which I should try tuning?
Can you point me to other example use cases where a class imbalance problem was successfully overcome?
December 13, 2019 at 12:26am
Hi, you can check out this paper as an example of data with high class imbalance: https://www.nature.com/articles/s41467-019-11012-3. For tuning, you can adjust the class_balance and prec_init parameters when training the LabelModel.
December 13, 2019 at 10:47am
Thanks for the paper, it's very relevant. From my understanding, they oversampled the positive class after labelling with Snorkel (to train the discriminative models). I couldn't find any mention of how they dealt with the imbalance when generating the label probabilities. Did I miss something?
December 21, 2019 at 4:19pm
Hi, if you know the class balance, you can try incorporating it here: https://github.com/snorkel-team/snorkel/blob/4c361335c43305fd3ba2991f40b243f76b863503/snorkel/labeling/model/label_model.py#L812. We also describe an approach for estimating the class balance in https://arxiv.org/abs/1810.02840. Hope this helps!
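
For readers following along, here is a minimal sketch of what passing a known class balance to the LabelModel might look like, assuming Snorkel 0.9.x and a binary task where class 0 is negative and class 1 is positive. The helper names class_balance_from_ratio and fit_label_model are hypothetical, the 1:500 ratio comes from the question above, and the n_epochs and seed values are arbitrary placeholders rather than recommendations.

```python
def class_balance_from_ratio(negatives_per_positive):
    """Turn an 'N negatives per positive' ratio into [P(Y=0), P(Y=1)]."""
    p_positive = 1.0 / (1.0 + negatives_per_positive)
    return [1.0 - p_positive, p_positive]


def fit_label_model(L_train, class_balance, prec_init=0.7, seed=123):
    """Fit a binary LabelModel with an explicit class prior (hypothetical helper).

    L_train is the label matrix produced by applying the labeling functions.
    """
    # Imported lazily so the ratio helper above can be used without Snorkel.
    from snorkel.labeling.model import LabelModel

    label_model = LabelModel(cardinality=2, verbose=False)
    label_model.fit(
        L_train,
        class_balance=class_balance,  # prior over classes, ordered by class index
        prec_init=prec_init,          # initial guess for labeling-function precision
        n_epochs=500,                 # placeholder value, tune as needed
        seed=seed,
    )
    return label_model


# Assumed ratio from the question: about 1 positive for every 500 negatives.
balance = class_balance_from_ratio(500)
```

With a label matrix in hand, fit_label_model(L_train, balance).predict_proba(L_train) would then return the probabilistic labels. Trying a few prec_init values and comparing against a small labeled dev set is one way to approach the tuning suggested earlier in the thread.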