
neptune-community

A place where Neptune.ml users and developers come together to make things work


Neptune in shared environment and computational resources

August 4, 2019 at 7:47pm

August 4, 2019 at 7:49pm

Let me quote it here, @jasperl:

I think a lot of people work in an environment where computational resources are shared and accessed through a job system like Slurm. To keep track of experiments in such a system, it is vital to be able to easily continue experiments (say, in case my job was suspended to give way to a higher-priority job). I really like the dashboard on neptune.ml currently, but this could be a deal breaker for me in the future. So better support for continuing experiments (logging stdout, stderr, and system info, and updating the status of resumed experiments) would be a big improvement for me (and I think for many other users as well).

On a general level, I see that Neptune needs to provide a better experience for users who work on shared resources, in particular by allowing them to re-open or continue a suspended or stopped experiment.


We already have some support. For example, your stdout, stderr, and git info are tracked automatically. Check this example experiment: https://ui.neptune.ml/o/neptune-ml/org/credit-default-prediction/e/CRED-123/monitoring


Let me pose a question: would you prefer to re-open / continue an experiment, or to link multiple experiments together? The latter idea is that if your experiment runs in multiple chunks, you have one experiment per chunk, but all of them are linked together, so you can analyse them as a single super-experiment.
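
To make the "linked chunks" idea concrete, here is a rough sketch in plain Python. It does not use the Neptune API; the `Chunk` class, the `group_id` key, and `merge_chunks` are hypothetical names made up for illustration. It assumes each chunk records the step at which it started, so the per-chunk metrics can be stitched into one series:

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Chunk:
    """One experiment per job chunk (hypothetical structure, not a Neptune type)."""
    experiment_id: str          # ID of this chunk's own experiment
    group_id: str               # shared key linking chunks of the same logical run
    start_step: int             # training step at which this chunk started
    metrics: Dict[int, float] = field(default_factory=dict)  # step -> metric value


def merge_chunks(chunks: List[Chunk]) -> Dict[str, List[Tuple[int, float]]]:
    """Stitch linked chunks into one metric series per group ("super-experiment")."""
    groups: Dict[str, List[Chunk]] = defaultdict(list)
    for chunk in chunks:
        groups[chunk.group_id].append(chunk)

    merged = {}
    for group_id, members in groups.items():
        series: Dict[int, float] = {}
        # Process chunks in start order so that a resumed chunk overwrites
        # any steps it re-ran after restarting from an earlier checkpoint.
        for chunk in sorted(members, key=lambda c: c.start_step):
            series.update(chunk.metrics)
        merged[group_id] = sorted(series.items())
    return merged


chunks = [
    Chunk("EXP-2", "run-A", start_step=2, metrics={2: 0.7, 3: 0.5}),
    Chunk("EXP-1", "run-A", start_step=0, metrics={0: 1.0, 1: 0.8, 2: 0.75}),
]
merged = merge_chunks(chunks)
```

In this sketch the second chunk restarted from step 2, so its value for that step replaces the one logged before the job was suspended.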


Regarding the system information needed: right now Neptune tracks hostname and hardware utilization metrics. What additional information do you need to have in Neptune?
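
As a starting point for that discussion, here is a minimal sketch of the kind of extra information a job on a Slurm cluster could collect itself, using only the Python standard library. The function name `collect_system_info` is hypothetical and this is not something Neptune does today; the `SLURM_*` names are environment variables that Slurm sets inside a job, and they are simply skipped when absent:

```python
import os
import platform
import socket


def collect_system_info(env=None):
    """Gather basic host info plus Slurm job details, if running under Slurm."""
    env = os.environ if env is None else env
    info = {
        "hostname": socket.gethostname(),
        "platform": platform.platform(),
        "python_version": platform.python_version(),
        "cpu_count": os.cpu_count(),
    }
    # Slurm exports these inside a job; the restart count is only present
    # when the job has been requeued or restarted.
    for key in ("SLURM_JOB_ID", "SLURM_JOB_NODELIST", "SLURM_RESTART_COUNT"):
        if key in env:
            info[key.lower()] = env[key]
    return info


info = collect_system_info({"SLURM_JOB_ID": "42"})
```

The resulting dictionary could then be attached to an experiment as properties or text, so that a resumed run can be matched to the job that produced it.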
