gigantum

Community for help, feedback, and all things Gigantum

Loading remote data?

November 18, 2019 at 5:02pm

I am running Gigantum on a remote machine. However, when I try to add a file in input data, it brings up my local filesystem. Seeing as the data I want to load is way too big to store on my local machine, how can I load data sitting on the remote machine?

November 18, 2019 at 5:33pm
- So this isn't supported super well at the moment, but we have been working on a feature for this problem. For now, the workaround would be to copy your data into place manually, and then trigger a version to be created.
  • SSH into your remote machine
  • If you want to track and version the files, you'd copy to ~/gigantum/<username>/<username>/labbooks/<project name>/input/
  • If you DO NOT want to track and version the files (which sounds like your case because they are big), at the moment you should copy to ~/gigantum/<username>/<username>/labbooks/<project name>/output/untracked. This is a bit of an anti-pattern, but at the moment it's the only place that is not tracked, synced, or versioned by default. You can probably also symlink your data instead, which avoids a copy.
  • Start your project container and then stop it. This guarantees everything is clean and versioned properly.
  • Finally refresh the page and you should see your data.
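For anyone who wants to script the copy step, here is a minimal Python sketch of the workaround above. It assumes the directory layout quoted in this reply and that the project's output directory already exists; `untracked_dir` and `place_data` are hypothetical helper names, not part of Gigantum. It copies by default, because a symlink whose target sits outside the mounted Project directory may not resolve inside the container.

```python
import os
import shutil
from pathlib import Path


def untracked_dir(username: str, project: str, home: Path) -> Path:
    """Build <home>/gigantum/<username>/<username>/labbooks/<project>/output/untracked."""
    return (home / "gigantum" / username / username / "labbooks"
            / project / "output" / "untracked")


def place_data(src: Path, dest_dir: Path, symlink: bool = False) -> Path:
    """Copy (default) or symlink src into dest_dir and return the new path."""
    # Only the final 'untracked' directory is created if missing; the project's
    # output/ directory is assumed to exist already (assumption, not Gigantum API).
    dest_dir.mkdir(exist_ok=True)
    target = dest_dir / src.name
    if symlink:
        # Link to the absolute source path so the link does not depend on cwd.
        os.symlink(src.resolve(), target, target_is_directory=src.is_dir())
    elif src.is_dir():
        shutil.copytree(src, target)
    else:
        shutil.copy2(src, target)
    return target
```

On the remote machine this could be called as `place_data(Path("/data/run1"), untracked_dir("alice", "my-project", Path.home()))`, where the source path, username, and project name are placeholders. Starting and then stopping the project container afterwards is still needed, as described above.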
If your data is large and you DO want to version it and sync it, you can use a Dataset. If you want to go this route let me know and I can provide a similar workaround for Datasets.
Finally, as I mentioned, we are working on a feature for this use case. It is a Dataset type that maps to local storage only. It tracks data, letting you know if things have changed, but does not fully version contents. It also prevents moving data around (which is good if the data is large or sensitive), but if the data is available locally it will link and mount files as needed at runtime automatically.
Does this sound like it would meet your needs? How big is "big" in your case? Do you ever want to version and sync your data, or are you happy just having access to it locally?
Thank you so much! Data versioning is not as important to us, so we can live with symlinking or copying over the necessary data to the input directory.
Also, our data is not that big (think the 10 GB to 100 GB range), but it is usually large enough that we avoid storing it on our desktops. All our data is generally stored on remote machines and backed up properly anyway, so none of us ever go through the trouble of storing it on our personal machines.
OK great. So yeah, for now it sounds like you want to go the route of symlinking or copying into the untracked directory, but this local Dataset type that is under development will work well for you. Thanks for the feedback!

November 28, 2019 at 2:54am
Hopefully it's OK if I add to this question. I haven't used Docker before, so I'm a bit fuzzy on the technicalities of what parts, if any, of your local file system are accessible. Basically my question is: can I just provide an absolute path to a file on my desktop system, which is running the Client and the Project, and open that file, or equivalently a file on my company's network file system? Or does it have to be within the Project's Docker file system only (e.g., as per the /input folder)?

November 28, 2019 at 2:19pm
- Always ok to ask questions! It can be a bit confusing where things are because so much is happening automatically behind the scenes. Everything lives on your local file system and is mounted into Docker containers at runtime as needed. So if you go to your user directory, you should see a gigantum directory. In here is all of your data. When the Client starts, this entire directory is mounted into the Client container. When a Project starts, just the individual Project directory is mounted into the Project container.
Right now to access files in a Project they do need to either be IN the Project (so added to the input, code, or output sections) or part of a Dataset that is linked to the Project.
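To make the mounting rule concrete, here is a small Python sketch, under the assumption that a Project container can only see files whose real location is under the Project directory on the host. `visible_in_project` is a hypothetical helper for illustration, not a Gigantum function; it checks host-side paths only and says nothing about where the directory appears inside the container.

```python
from pathlib import Path


def visible_in_project(path: Path, project_dir: Path) -> bool:
    """Return True if path, after resolving symlinks, lies inside project_dir."""
    resolved = path.resolve()
    root = project_dir.resolve()
    # A symlink stored inside the project but pointing elsewhere resolves to a
    # location outside root, so it is reported as not visible.
    return resolved == root or root in resolved.parents
```

Under this assumption, a file copied into the input section passes the check, while an absolute path elsewhere on the desktop or on a network share does not, and neither does a symlink in the Project that points outside it.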
We have heard this request to access data without full versioning and also things on a networked file system frequently. We are working on a new Dataset type that lets you point to folders that you can access on your local file system. It will track when files change, letting you know, but it won't fully version. This means certain features, such as rollback, won't be supported. We think this will meet the needs of most people in this scenario.
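To illustrate the difference between tracking changes and fully versioning, here is a rough Python sketch. It is not how the new Dataset type is implemented; `file_digest`, `snapshot`, and `diff_snapshots` are hypothetical names, and hashing file contents is just one possible way to notice changes. Only digests are stored, never old file contents, which is why something like rollback would not be possible with this approach.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large files are not loaded at once."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot(root: Path) -> dict:
    """Map each file's path (relative to root) to a digest of its contents."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_snapshots(old: dict, new: dict) -> dict:
    """Report which files were added, removed, or modified between snapshots."""
    common = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(k for k in common if old[k] != new[k]),
    }
```

Taking a snapshot before and after a run and passing both to `diff_snapshots` tells you what changed, but nothing here can restore an earlier state of a file.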
Let me know if that doesn't make sense or if you have any other questions.

November 28, 2019 at 8:45pm
- Great, that all makes sense. Yes, your new Dataset type sounds great. Looking forward to it!

December 9, 2019 at 6:35pm
I tried something like this in a much earlier version of Gigantum (pre-Datasets) and had an issue where I couldn't follow symlinks outside the mounted filesystem from within the container. Excited to see that this may actually work and be supported now!