Transferring Large Data

Astronomers and other geeks sometimes need to transfer big chunks of data. This document aims to establish best practices for moving large data sets around, particularly for users of the CAL network.

distributing data within CAL

The easiest way to distribute data to colleagues within CAL is simply to use the CAL home directories. Make sure that the permissions are set properly on your data (and on the path leading to your data). Users of any of the Configured Workstations can then access the files directly. Note that data in a /scratch directory is only visible to the local machine, while data in your home directory is visible from all Configured Workstations.
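A minimal sketch of opening up read access (the ~/shared-data directory here is just a placeholder for wherever your data actually lives):

    # give others permission to traverse your home directory
    chmod o+x ~
    # make the data directory and everything in it world-readable
    # (the capital X adds search permission to directories only)
    chmod -R o+rX ~/shared-data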

distributing public data outside of CAL

Your best bet for largish data sets that can be viewed publicly is to drop them in your user account's web directory, which will be served by our web server. If your CAL account is foo, and you create a world-readable directory called ~/www/, it will be visible to your colleagues (and others!) around the world as http://www.astro.columbia.edu/~foo/
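For example, if your data were in a (hypothetical) file called bigdata.tar.gz:

    # create the web directory and make it reachable and readable
    mkdir -p ~/www
    chmod o+x ~
    chmod o+rx ~/www
    # drop the data in and make it world-readable
    cp bigdata.tar.gz ~/www/
    chmod o+r ~/www/bigdata.tar.gz

It should then be fetchable as http://www.astro.columbia.edu/~foo/bigdata.tar.gz (assuming the account foo from above).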

CAL currently does not offer FTP service, though you may of course connect to external FTP servers.

distributing private data outside of CAL

If your data must not be seen by the public, you may wish to encrypt the data before making it available on the web. gpg is a good tool for this, particularly if you already have your public keys in place. If you and the remote party both know a secret, you can also use gpg to encrypt your data with the shared secret, using the --symmetric option.
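A sketch of both approaches, using the same placeholder file name as above and a placeholder key ID for your colleague:

    # public-key encryption: only the holder of the matching private key can decrypt
    gpg --encrypt --recipient colleague@example.org bigdata.tar.gz

    # symmetric encryption: gpg prompts for a passphrase that you share
    # with the recipient through some other channel
    gpg --symmetric bigdata.tar.gz

Either command produces bigdata.tar.gz.gpg, which can then be posted in ~/www/ as described above; the recipient decrypts it with gpg (e.g. gpg bigdata.tar.gz.gpg).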

ticket #230 also suggests using an .htaccess file to require a password before downloading data. While that ticket remains open, this feature is not available on the CAL web servers. It is also not cryptographically sound: any machine on the various networks between the web server and the recipient can intercept that data and copy it (and, for that matter, can intercept the password used in an unencrypted HTTP session). This authentication could be a useful additional layer of security for ultra-top-secret data, however.
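For reference, the sort of setup that ticket asks for would look roughly like the following on a typical Apache server. This is only a sketch: it does not currently work on the CAL web servers, and the names and paths are placeholders.

    # contents of a .htaccess file in the directory to be protected;
    # the password file itself would be created with the htpasswd utility
    AuthType Basic
    AuthName "private data"
    AuthUserFile /home/foo/.htpasswd
    Require valid-user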

If you and the intended recipient have shell accounts (via ssh) on a common remote machine, you could also use scp or sftp to transfer the data to that remote machine for the other person to pick up. This protects the data while they're in transit. Note that you'll need to set the permissions properly on the files once they arrive on the remote machine.
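A sketch, with placeholder file, account, and host names:

    # copy the data to the shared machine over ssh
    scp bigdata.tar.gz you@remote.example.org:/tmp/
    # log in and make the file readable by your colleague
    ssh you@remote.example.org 'chmod o+r /tmp/bigdata.tar.gz'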

receiving data from external sources

Your best bet is to ask the sender of the data to post it to a web or FTP server, and then use a standard tool (e.g. a web browser, wget, curl, or lftp) to retrieve the data from there.
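For example, given a hypothetical URL supplied by the sender:

    # fetch over HTTP; -c resumes an interrupted download
    wget -c http://data.example.org/pub/bigdata.tar.gz
    # or the FTP equivalent with curl; -O keeps the remote file name
    curl -O ftp://ftp.example.org/pub/bigdata.tar.gz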

If the sender of the data does not have access to a server like this, but does have shell access on a remote machine that you also have access to, you could use sftp or scp to effect the transfer: have the sender use scp or sftp to move the data to the remote machine, and then use the same tools yourself to transfer them back to CAL.
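Sketched with the same placeholder names as above:

    # the sender pushes the data to the shared machine...
    scp bigdata.tar.gz sender@remote.example.org:/tmp/
    # ...then, from a CAL workstation, you pull it back home
    scp you@remote.example.org:/tmp/bigdata.tar.gz ~/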

why using E-mail for binary or large data is a bad idea

E-mail is generally an inappropriate method of transfer for data larger than a megabyte or two, or for non-text data. There are several reasons for this:

  • MIME-encoding transfer inflation: base64 encoding (your best bet for sending arbitrary binary data via e-mail) inflates the size of your data by an additional 33%. Yes, transfer is relatively cheap these days, but there's no reason to pad everything we do with an extra third, especially for larger data sets. Does that 600MB data set really need to become an 800MB transfer? (See the quick demonstration after this list.)
  • old protocols: SMTP was really not designed for more than text messages. The MIME extensions are widely implemented, but they are subtly broken in lots of different e-mail software, MTAs and MUAs alike. Newer protocols (such as HTTP), or older protocols designed to handle arbitrary data (such as FTP), are more likely to get your data transferred all in one piece.
  • too many cooks in the kitchen: often, several different MTAs will handle a single piece of mail before it gets to its destination. The subtle breakages mentioned above are compounded by the fact that SMTP tends to rely on multiple hops to get the data from sender to receiver. It's possible for your MUA and MTA to do everything right initially, only to find that some MTA closer to your recipient has garbled your data somehow.
  • recipient-pays: with SMTP, even if you don't want that huge message from someone, it chews up space in your inbox. Depending on your choice of MUA and your connectivity to the internet, it may even cost you download time or transfer allowance just to fetch the message and decide to delete it. With larger messages, your inbox quota can be exhausted by fewer messages, even if you didn't want those messages in the first place.
  • sender-pays: sending an e-mail with a large attachment to someone who doesn't want it is a waste of time, space, and transfer allowance for the sender (and all intermediate MTAs), since it will eventually be deleted on the other side anyway. Posting a data set in some public web space lets people who want the data fetch it at their leisure, and doesn't cost you anything if the recipients decide they don't need it after all.
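A quick way to see the base64 inflation for yourself (assuming a placeholder file and the standard base64 utility):

    # raw size in bytes
    wc -c bigdata.tar.gz
    # size after base64 encoding: roughly 4 bytes of output for every
    # 3 bytes of input, plus a little more for line breaks
    base64 bigdata.tar.gz | wc -c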