Shortlisting a large number of labelling functions

January 14, 2021 at 4:23am

I am trying to use Snorkel in the context of credit risk modelling (binary classification). I have written lots of labeling functions, and I have some ground truth available that I'd like to use to inform them. My first question: which ground-truth metrics are recommended for evaluating labeling functions (balanced accuracy, F-score)? Secondly, should I strictly remove any function that does not pass a threshold? I should also mention that the positive class is rare in my dataset (prevalence < 4%). Thanks.

January 18, 2021 at 11:42pm
When writing labeling functions in the binary setting, two metrics may be helpful:
  • Coverage (# non-abstain votes / # total votes)
  • Class-specific precision (assuming unipolar LFs)
It's hard to say exactly how to balance these metrics without a closer look at your application! Pruning based on empirical precision is one option, or you might consider trading off some coverage for higher precision; see the sketch below.
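For concreteness, here is a rough sketch of how both metrics can be read off a labeled dev set with Snorkel's LFAnalysis. The example LFs, the feature names (credit_utilization, years_no_delinquency), the toy dev data, and the 0.7 precision threshold are all made up for illustration; only the LFAnalysis/lf_summary workflow is the actual Snorkel API:

```python
import numpy as np
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier, LFAnalysis

ABSTAIN, NOT_DEFAULT, DEFAULT = -1, 0, 1

@labeling_function()
def lf_high_utilization(x):
    # Hypothetical rule: very high credit utilization suggests default risk.
    return DEFAULT if x.credit_utilization > 0.9 else ABSTAIN

@labeling_function()
def lf_long_clean_history(x):
    # Hypothetical rule: a long delinquency-free history suggests no default.
    return NOT_DEFAULT if x.years_no_delinquency >= 5 else ABSTAIN

lfs = [lf_high_utilization, lf_long_clean_history]

# Tiny made-up dev set with gold labels, just to make the sketch runnable.
df_dev = pd.DataFrame({
    "credit_utilization":   [0.95, 0.20, 0.92, 0.50],
    "years_no_delinquency": [0,    7,    1,    6],
})
Y_dev = np.array([DEFAULT, NOT_DEFAULT, NOT_DEFAULT, NOT_DEFAULT])

# Apply the LFs to get the label matrix, then summarize against gold labels.
L_dev = PandasLFApplier(lfs=lfs).apply(df=df_dev)
summary = LFAnalysis(L=L_dev, lfs=lfs).lf_summary(Y=Y_dev)

# For unipolar LFs, "Emp. Acc." (accuracy over non-abstain votes) is exactly
# the class-specific precision; "Coverage" is the fraction of points labeled.
print(summary[["Polarity", "Coverage", "Emp. Acc."]])

# One pruning option: drop LFs whose empirical precision misses a threshold.
PRECISION_THRESHOLD = 0.7  # illustrative; tune for your prevalence/cost tradeoff
kept = summary[summary["Emp. Acc."] >= PRECISION_THRESHOLD].index.tolist()
print("LFs kept:", kept)
```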