JetS3t Synchronize

Synchronize is a console (text mode) Java application for synchronizing directories on a computer with an Amazon S3 account.

It is freely available as part of the JetS3t project, which provides applications and a Java toolkit for Amazon's S3.

Synchronize offers the following capabilities:

  • *AWS credentials and the cryptographic password can now be provided via prompts on the command line, rather than only through a properties file
  • Upload a directory and all its contents to S3
  • Download a directory and all its contents from S3
  • Specify an S3 path that includes a prefix component to download only the objects that match the prefix (see Examples)
  • Option to move rather than copy files to or from S3. In this case, the source file or object is deleted after it has been transferred
  • Automatically compress (gzip) and/or encrypt files sent to S3
  • Sophisticated file comparisons used to determine whether files have changed, so only new or changed files are transferred
  • Upload any number of files and/or directories at one time
  • Option to upload or download files in batches of 1,000 to reduce the memory requirements when synching large numbers of objects
  • Option to skip object metadata comparisons to speed up synchronization of large buckets
  • Access Control List permissions of uploaded files can be set to PRIVATE, PUBLIC_READ or PUBLIC_READ_WRITE
  • When uploading files, specific file/directory paths can be ignored using .jets3t-ignore settings files
  • Control the level of detail in the application's reporting
  • *The --credentials option allows AWS credentials to be loaded from an encrypted file rather than an insecure properties file. The encrypted file can be created with the AWSCredentials API or the Cockpit application
  • *Prompts for HTTP Proxy login credentials if they are required but not provided in the jets3t.properties file
  • *Improved handling of uploads of many files where the files must first be transformed (compressed or encrypted). See the upload.transformed-files-batch-size setting.
Items marked with a star (*) are new or updated as of JetS3t version 0.7.0.

Getting Started

Synchronize can be run from the command line using the scripts included in the bin directory of the JetS3t distribution.

For Windows computers use the script synchronize.bat.

For Unix-like computers, use the script synchronize.sh.

Files are copied to S3 with an UP(load) operation, and copied from S3 with a DOWN(load). By default, only new or changed files are transferred.

Preview Your Actions

Because Synchronize commands can delete or replace your files, you should always test your commands using the --noaction option before you run them for real. The --noaction option lets you preview the actions that Synchronize will take, so you can avoid performing a command you will regret.
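
The preview-first habit described above can be wrapped in a small shell function. This is only a sketch: sync_with_preview is a hypothetical helper that is not part of JetS3t, and the SYNCHRONIZE variable assumes the run script is at ./synchronize.sh unless you set it otherwise. It relies only on the documented --noaction option.

```shell
# Path to the JetS3t run script; adjust for your installation (assumed location).
SYNCHRONIZE="${SYNCHRONIZE:-./synchronize.sh}"

# sync_with_preview is a hypothetical helper, not part of JetS3t.
# It runs the given Synchronize command with --noaction first, then asks
# for confirmation before running the same command for real.
sync_with_preview() {
    # Preview: no files are changed locally or in S3.
    "$SYNCHRONIZE" --noaction "$@" || return 1

    printf 'Run this command for real? [y/N] '
    # Treat end-of-input as "no" so the real run never starts by accident.
    read -r answer || answer=n
    case "$answer" in
        y|Y) "$SYNCHRONIZE" "$@" ;;
        *)   echo "Aborted; nothing was transferred." ;;
    esac
}
```

With the function loaded into your shell, "sync_with_preview UP MyBackups Documents Reports" would print the --noaction report and wait for a y before transferring anything.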

Properties File*

Synchronize looks for a Java properties file named synchronize.properties in the classpath. This file will often include the properties accesskey and secretkey, which define the credentials for your Amazon Web Services account. If you don't wish to record these in a text file, you can instead type them in when prompted on the command line.

Here is the default synchronize.properties file, which includes a brief explanation of the main settings.

####################################
# Synchronize application properties
#
# This file should be available on the 
# classpath when Synchronize is run
####################################

# AWS Access Key (if commented-out, Synchronize will ask at the prompt)
accesskey=<YourAWSAccessKey>

# AWS Secret Key (if commented-out, Synchronize will ask at the prompt)
secretkey=<YourAWSSecretKey>

# Access Control List setting to apply to uploads, must be one of: 
# PRIVATE, PUBLIC_READ, PUBLIC_READ_WRITE
# The ACL setting defaults to PRIVATE if this setting is missing.
acl=PRIVATE

# Password used when encrypting/decrypting files, applicable only with the --crypto option.
# password=

# If "upload.ignoreMissingPaths" is set to true, Synchronize will perform an upload despite missing  
# or unreadable source files. If set to false, Synchronize will halt if files or paths are missing. 
# WARNING: Be careful enabling this option, as it could cause legitimate objects in S3 to be
# deleted if the corresponding local files cannot be found or read.
#upload.ignoreMissingPaths=true

# Maximum number of files to transform and upload at a time, when file transformation is
# required (eg. when files are gzipped or encrypted during synchronization).
# When commented out, no batching takes place.  
#upload.transformed-files-batch-size=1000
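
If you would rather not store your keys on disk, one possible approach is to generate a properties file that leaves out accesskey and secretkey and to pass it with the --properties option, so that Synchronize asks for the keys at the prompt. The sketch below assumes a POSIX shell; write_sync_properties and the file name my-sync.properties are hypothetical and not part of JetS3t.

```shell
# write_sync_properties is a hypothetical helper, not part of JetS3t.
# It writes a properties file that omits accesskey and secretkey, so that
# Synchronize asks for them at the prompt instead of reading them from disk.
write_sync_properties() {
    props_file="$1"
    (
        umask 077   # create the file readable by its owner only
        cat > "$props_file" <<'EOF'
# accesskey and secretkey intentionally left out: enter them when prompted
acl=PRIVATE
upload.transformed-files-batch-size=1000
EOF
    )
}

# Example (file name is illustrative):
#   write_sync_properties my-sync.properties
#   synchronize.sh --properties my-sync.properties UP MyBackups Documents
```

The umask call runs in a subshell so that it affects only the new file and not the rest of your shell session.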

Usage Instructions

To view Synchronize's usage instructions, run synchronize.sh --help. These instructions describe the command-line parameters required by Synchronize and the options available.

Options marked with a star (*) are new as of JetS3t version 0.7.0.

To see some example commands, refer to the Examples section below.

Usage: Synchronize [options] UP <S3Path> <File/Directory> (<File/Directory>...)
   or: Synchronize [options] DOWN <S3Path> <DownloadDirectory>

UP      : Synchronize the contents of the Local Directory with S3.
DOWN    : Synchronize the contents of S3 with the Local Directory
S3Path  : A path to the resource in S3. This must include at least the
          bucket name, but may also specify a path inside the bucket.
          E.g. <bucketName>/Backups/Documents/20060623
File/Directory : A file or directory on your computer to upload
DownloadDirectory : A directory on your computer where downloaded files
          will be stored

Required properties can be provided via: a file named 'synchronize.properties'
in the classpath, a file specified with the --properties option, or by typing
them in when prompted on the command line. Required properties are:
          accesskey : Your AWS Access Key (Required)
          secretkey : Your AWS Secret Key (Required)
          password  : Encryption password (only required when using crypto)
Properties specified in this file will override those in jets3t.properties.

Options
-------
-h | --help
   Displays this help message.

-n | --noaction
   No action taken. No files will be changed locally or on S3, instead
   a report will be generated showing what will happen if the command
   is run without the -n option.

-q | --quiet
   Runs quietly, without reporting on each action performed or displaying
   progress messages. The summary is still displayed.

-p | --noprogress
   Runs somewhat quietly, without displaying progress messages.
   The action report and overall summary are still displayed.

-f | --force
   Force tool to perform synchronization even when files are up-to-date.
   This may be useful if you need to update metadata or timestamps in S3.

-k | --keepfiles
   Keep outdated files on destination instead of reverting/removing them.
   This option cannot be used with --nodelete.

-d | --nodelete
   Keep files on destination that have been removed from the source. This
   option is similar to --keepfiles except that files may be reverted.
   This option cannot be used with --keepfiles.

-m | --move
   Move items rather than merely copying them. Files on the local computer will
   be deleted after they have been uploaded to S3, or objects will be deleted
   from S3 after they have been downloaded. Be *very* careful with this option.
   This option cannot be used with --keepfiles.

-b | --batch
   Download or upload files in batches, rather than all at once. Enabling this
   option will reduce the memory required to synchronize large buckets, and will
   ensure file transfers commence as soon as possible. When this option is
   enabled, the progress status lines refer only to the progress of a single batch.

-s | --skipmetadata
   Skip the retrieval of object metadata information from S3. This will make the
   synch process much faster for large buckets, but it will leave Synchronize
   with less information to make decisions. If this option is enabled, empty
   files or directories will not be synchronized reliably.
   This option cannot be used with the --gzip or --crypto options.

-g | --gzip
   Compress (GZip) files when backing up and Decompress gzipped files
   when restoring.

-c | --crypto
   Encrypt files when backing up and decrypt encrypted files when restoring. If
   this option is specified the properties must contain a password.

--properties <filename>
   Load the Synchronize app properties from the given file rather than from
   a synchronize.properties file in the classpath.

--credentials <filename>*
   Load your AWS credentials from an encrypted file, rather than from the
   synchronize.properties file. This encrypted file can be created using
   the Cockpit application, or the JetS3t API library.

--acl <ACL string>
   Specifies the Access Control List setting to apply. This value must be one
   of: PRIVATE, PUBLIC_READ, PUBLIC_READ_WRITE. This setting will override any
   acl property specified in the synchronize.properties file

--reportlevel <Level>
   A number that specifies how much report information will be printed:
   0 - no report items will be printed (the summary will still be printed)
   1 - only actions are reported          [Prefixes N, U, D, R, F, M]
   2 - differences & actions are reported [Prefixes N, U, D, R, F, M, d, r]
   3 - DEFAULT: all items are reported    [Prefixes N, U, D, R, F, M, d, r, -]

Report
------
Report items are printed on a single line with an action flag followed by
the relative path of the file or S3 object. The report legend follows:

N: A new file/object will be created
U: An existing file/object has changed and will be updated
D: A file/object existing on the target does not exist on the source and
   will be deleted.
d: A file/object existing on the target does not exist on the source but
   because the --keepfiles or --nodelete option was set it was not deleted.
R: An existing file/object has changed more recently on the target than on the
   source. The target version will be reverted to the older source version
r: An existing file/object has changed more recently on the target than on the
   source but because the --keepfiles option was set it was not reverted.
-: A file is identical between the local system and S3, no action is necessary.
F: A file identical locally and in S3 was updated due to the Force option.
M: The file/object will be moved (deleted after it has been copied to/from S3).

WARNING: Be very careful when restoring files from S3 to a directory that already contains files. By default Synchronize will delete any files in the target directory that are not present in S3, in order to synchronize the contents of the directory on your computer with the contents of S3.
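
Before a restore, it can help to pick out just the report lines that would delete or revert something. The following is a rough sketch rather than a supported feature: destructive_actions is a hypothetical helper, and it assumes each report line begins with the one-letter action flag from the legend above, followed by a space or a colon. Check the pattern against the output of your own Synchronize version before relying on it.

```shell
# destructive_actions is a hypothetical helper, not part of JetS3t.
# It reads a Synchronize report on standard input and prints only the lines
# flagged D (delete) or R (revert).
# Assumption: each report line starts with the one-letter action flag,
# followed by a space or a colon; adjust the pattern to your actual output.
destructive_actions() {
    # "|| true" keeps the exit status at zero when nothing matches.
    grep -E '^[DR][: ]' || true
}

# Example:
#   synchronize.sh --noaction DOWN MyBackups RestoreDirectory | destructive_actions
```

If the filter prints nothing, the previewed command would neither delete nor revert any files under the stated assumption about the report format.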

Examples

The best way to get the hang of Synchronize is to experiment with the commands on some test files you don't care about, and use the JetS3t Cockpit application to see how uploads are stored in S3.

Before you start, modify the sample properties text file called synchronize.properties located in the configs directory to include your own S3 Access Key and Secret Key settings (see Getting Started above). Use the synchronize run scripts provided in the bin directory to run Synchronize.

Backing up files to S3

Let's say you have two directories containing important files (e.g. Documents and Reports) and you want to back them up to an S3 bucket called MyBackups (note that you should really use a more distinctive bucket name, like <MyAWSAccessKey>.MyBackups):

synchronize.sh UP MyBackups Documents Reports

After you have run this command once on your own computer and uploaded some files, try running the same command again. Synchronize will look at the contents of your S3 account and work out that it already contains your files, so it will not have any work to do.

Now add a file or two to your Documents directory, and perhaps change one of the files as well. This time when you run the command, Synchronize will upload the new and changed files. Alternatively, you can run the command with the --noaction option to make Synchronize tell you what it would do without actually doing the work:

synchronize.sh --noaction UP MyBackups Documents Reports

Restoring files from S3

There are a few cases where you might want to restore a directory from S3. The simplest cases are when none of the files exist on your computer, for example if you have deleted the whole directory by mistake (oops!), or you want to download a copy of this directory to a second computer.

Let's simulate this simple case by downloading all the files in your MyBackups bucket to a new directory:

mkdir RestoreDirectory
synchronize.sh DOWN MyBackups RestoreDirectory

Synchronize will download all the contents of the S3 directory to the new directory. As with the UP example above, re-running this command will do nothing, because Synchronize will detect that your RestoreDirectory directory has the same contents as the S3 directory.

This same command can restore missing files, such as files that have accidentally been deleted. To see how this works, delete one of the files in RestoreDirectory and re-run the command to restore it.

Selectively Restoring Files

If you have many files backed up in S3, you may wish to download only a subset of these files. Synchronize offers two closely related techniques to selectively download files from your S3 account. First, you can add a subdirectory path to a command's S3Path argument to download only the files in that subdirectory. In our upload example we backed up the contents of two directories in S3: Documents and Reports. To download only the contents of the Documents subdirectory in S3, you could run the following command:

synchronize.sh DOWN MyBackups/Documents/ Documents

The second technique to download a subset of files is to use a prefix constraint. A prefix constraint is a portion of text that matches the beginning of specific objects in your S3 account. This approach is similar to using a subdirectory name as we did above, but with a subtle difference: the prefix constraint does not need to match an entire subdirectory or file name. For example, if your Documents directory contained some files whose names started with "TaxReport" and other files whose names started with "Projections", you could download only the tax documents with the following command:

synchronize.sh DOWN MyBackups/Documents/Tax Documents
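
To see why a prefix need not line up with a whole file or directory name, the matching rule can be mimicked locally in the shell. This is an illustration only: matches_prefix is a hypothetical helper, and the object names are made up for the example, on the assumption that objects are stored under paths such as Documents/TaxReport2006.pdf.

```shell
# matches_prefix is a hypothetical helper that mimics prefix matching locally:
# it succeeds when the object key begins with the given prefix.
matches_prefix() {
    key="$1"
    prefix="$2"
    case "$key" in
        "$prefix"*) return 0 ;;
        *)          return 1 ;;
    esac
}

# Object keys as they might appear inside the MyBackups bucket (illustrative names).
for key in Documents/TaxReport2006.pdf Documents/TaxReport2007.pdf \
           Documents/Projections2007.xls Reports/Summary.txt; do
    if matches_prefix "$key" "Documents/Tax"; then
        echo "download: $key"
    fi
done
```

Only the two TaxReport keys are printed. With the prefix Documents/ instead, every key under Documents would match, which corresponds to the subdirectory technique shown earlier.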

Changed files

Things can get more complicated when you already have files that are in S3, but the files' contents do not match. For example, let's say that you backed up your documents earlier but after doing this one of your documents somehow got corrupted. In this case the default Synchronize DOWN command above will restore the file by reverting it to the backed-up version from S3.

Warning! To repeat: by default, Synchronize will revert changed files when downloading. That is, it will replace changed files on your computer with older versions from S3. So be careful!

If you want to keep files you have updated and only download files not already on your computer, you can prevent Synchronize from reverting files with the --keepfiles option. Let's say that you have changed a number of files in your NewDocumentsDirectory directory but one of them has been corrupted. You want to revert the one corrupt file but keep the others. To do this, delete the corrupted file, then run the following Synchronize command:

synchronize.sh --keepfiles DOWN MyBackups NewDocumentsDirectory

With the --keepfiles option, Synchronize will replace any missing files (i.e. the corrupted one you deleted) but leave the changed files alone.

Options, Options

This section describes some of the Synchronize options in more detail.

noaction: Synchronize will not perform any action, and will not upload or download any files, but it will print reports and summaries as if it were run normally. This option is very useful for checking what actions a Synchronize command will perform before running the command for real.

quiet: Synchronize will only print out a summary of its actions instead of a line-by-line report describing the action taken for each file.

force: Synchronize will upload/download files even when it thinks they have not been changed. You might want to do this if you're worried that Synchronize isn't correctly identifying which files have changed, and you want to force it to update every file.

keepfiles: This option tells Synchronize to keep files that it would otherwise revert or delete. With this option set, files that have been updated on the destination compared to the source directory, which would normally be reverted, will be left alone. Also, files on the destination that have been deleted in the source directory will be left in place, rather than deleted.

Note: The keepfiles option can sometimes be convenient but isn't intended for regular use. In effect it prevents Synchronize from doing its main job, which is to maintain an identical directory structure between your computer and S3. If you have to use it regularly, Synchronize probably isn't the right tool for what you're trying to do.

gzip: Files are compressed to gzip files prior to being uploaded, and are decompressed when being downloaded. Note: If gzipped files are downloaded without this option they will not be decompressed, and will not have any file extension (like .gz) to indicate that they are gzip files. It will be your responsibility to decompress these files.

crypto: Files are encrypted with the password specified in the Properties File's password setting prior to being uploaded, or are decrypted with this password when being downloaded. Note: If encrypted files are downloaded without this option they will not be decrypted, and will not have any file extension to indicate that they are encrypted files. It will be your responsibility to decrypt these files.
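
If you do end up with gzipped files that were downloaded without the gzip option, they can be decompressed by hand even though they lack a .gz extension. The sketch below assumes the standard gzip command-line tool is installed and that the file holds ordinary gzip data; gunzip_in_place is a hypothetical helper, not part of JetS3t. It does not apply to encrypted files.

```shell
# gunzip_in_place is a hypothetical helper, not part of JetS3t.
# It decompresses a gzip file that has no .gz extension, replacing the
# original file with the decompressed contents.
gunzip_in_place() {
    file="$1"
    # Leave the file alone unless it really is valid gzip data.
    if ! gzip -t < "$file" 2>/dev/null; then
        echo "not gzip data, skipped: $file" >&2
        return 1
    fi
    # Reading from standard input avoids gzip's check for a .gz suffix.
    gzip -dc < "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Example (file name is illustrative):
#   gunzip_in_place RestoreDirectory/Documents/report.txt
```

The integrity check comes first so that a file that was never compressed is left untouched rather than overwritten.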

Notes

Compressing/encrypting uploads: Synchronize will create temporary files when used with any upload options that change the contents of uploaded files, such as compressing or encrypting them. This means that you will need up to twice as much free space in your default temp directory as taken by the files you intend to upload.
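
A quick check along these lines can be run before a large compressed or encrypted upload. This is a sketch under assumptions: enough_temp_space is a hypothetical helper, and it takes the temp directory to be $TMPDIR, or /tmp when TMPDIR is unset, which may not be the directory your Java runtime actually uses.

```shell
# enough_temp_space is a hypothetical helper, not part of JetS3t.
# It succeeds when the temp directory has at least twice as much free space
# as the given files and directories occupy.
# Assumption: the temp directory is $TMPDIR, or /tmp when TMPDIR is unset;
# the directory your Java runtime actually uses may differ.
enough_temp_space() {
    temp_dir="${TMPDIR:-/tmp}"
    # Twice the total size of the upload sources, in kilobytes.
    needed_kb=$(du -sk "$@" | awk '{ total += $1 } END { printf "%.0f\n", total * 2 }')
    # Free space on the file system holding the temp directory, in kilobytes.
    free_kb=$(df -Pk "$temp_dir" | awk 'NR == 2 { print $4 }')
    echo "need up to ${needed_kb} KB, ${free_kb} KB free in ${temp_dir}"
    [ "$free_kb" -ge "$needed_kb" ]
}

# Example:
#   enough_temp_space Documents Reports && synchronize.sh --gzip UP MyBackups Documents Reports
```

The factor of two is the upper bound mentioned above, so the check is deliberately conservative.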