Neptune in shared environment and computational resources
August 4, 2019 at 7:47pm (Edited 1 month ago)
August 4, 2019 at 7:49pm
Your original request: https://spectrum.chat/thread/6d1d5ca7-5a72-4234-b748-d5cb68713dd0?m=MTU2NDY2Njg0MTQxMg==
Let me quote it here, @jasperl:
I think a lot of people work in an environment where computational resources are shared and interfaced through a job system like slurm. To keep track of experiments in such a system it is vital to be able to easily continue experiments (say in case my job was suspended to give way for a higher priority job). I really like the dashboard on neptune.ml currently, but this could be a deal breaker for me in the future. So better support for continuing experiments: logging sysout, stderr, system info and updating the status of resumed experiments would be a big improvement for me (and I think for many other users as well).
On a general level, I see that in Neptune we need to provide a better experience for users who work on shared resources, particularly by allowing them to re-open or continue a suspended or stopped experiment.
We already have some support. For example, your stdout, stderr and git info are tracked automatically. Check this example experiment: https://ui.neptune.ml/o/neptune-ml/org/credit-default-prediction/e/CRED-123/monitoring
Let me pose a question: would you prefer to re-open / continue an experiment, or to link multiple experiments together? The latter idea is that if your experiment runs in multiple chunks, you have one experiment per chunk, but all of them are linked together, so you can analyse them as a single super-experiment.
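To make the "linked chunks" idea concrete, here is a rough sketch in plain Python. It is not Neptune's API or a planned design; `Chunk`, `group_id`, `start_step` and `merge_chunks` are hypothetical names made up for illustration. Each chunk carries the id of the group it belongs to and the step it resumed from, and the chunks are stitched into one metric series per group.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Chunk:
    """One job-sized piece of a longer experiment (hypothetical structure)."""
    chunk_id: str    # e.g. the id of the job that produced this chunk
    group_id: str    # shared id that links chunks into one super-experiment
    start_step: int  # training step at which this chunk started or resumed
    losses: list     # metric values logged during this chunk, one per step


def merge_chunks(chunks):
    """Group chunks by group_id and concatenate their metrics in step order."""
    groups = defaultdict(list)
    for chunk in chunks:
        groups[chunk.group_id].append(chunk)

    merged = {}
    for group_id, members in groups.items():
        # Order the chunks by the step they resumed from, not by arrival order.
        members.sort(key=lambda c: c.start_step)
        series = []
        for chunk in members:
            series.extend(
                (chunk.start_step + i, value)
                for i, value in enumerate(chunk.losses)
            )
        merged[group_id] = series
    return merged


# Two chunks of the same run, deliberately listed out of order.
chunks = [
    Chunk("job-2", "run-A", 3, [0.6, 0.5]),
    Chunk("job-1", "run-A", 0, [0.9, 0.8, 0.7]),
]
merged = merge_chunks(chunks)
```

With this layout, a suspended job only needs to remember its `group_id` and last step to start a new chunk, and the analysis side can still view the whole run as one series.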
Regarding the system information needed: right now Neptune tracks hostname and hardware utilization metrics. What additional information do you need to have in Neptune?
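As a starting point for that discussion, here is a minimal sketch of the kind of information a job could collect itself using only the Python standard library. This is an assumption about what might be useful, not a list of what Neptune records; `collect_system_info` is a hypothetical helper. The `SLURM_*` names are environment variables that Slurm sets inside a job, and they are simply skipped when absent.

```python
import os
import platform
import socket


def collect_system_info(env=None):
    """Return a dict of basic host and scheduler details (illustrative only)."""
    env = os.environ if env is None else env
    info = {
        "hostname": socket.gethostname(),
        "platform": platform.platform(),
        "python_version": platform.python_version(),
        "cpu_count": os.cpu_count(),
    }
    # Slurm exports these inside a job; outside Slurm they are not set.
    for key in ("SLURM_JOB_ID", "SLURM_JOB_NODELIST", "SLURM_RESTART_COUNT"):
        if key in env:
            info[key.lower()] = env[key]
    return info


# Example with a fake environment, so the sketch also runs outside Slurm.
info = collect_system_info({"SLURM_JOB_ID": "42"})
```

A value like the job id could then be attached to the experiment as a property or tag, which would also help match a resumed job to the experiment it continues.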