JetS3t Synchronize

Synchronize is a console (text mode) Java application for synchronizing directories on a computer with an Amazon S3 or Google Storage account.

Synchronize offers the following capabilities:

Items with a red star (*) are new or updated since JetS3t version 0.8.1

  • Improved batching algorithm to reduce memory usage when working with large numbers of files*
  • Improved file comparison logic so object metadata is only retrieved from a service when it is required*
  • Added support for the Amazon S3 service's Multipart Upload which allows large files to be uploaded in smaller parts for improved transfer performance or to upload files larger than 5 GB. Part size is configurable via upload.max-part-size in synchronize.properties*
  • Removed support for the --skipmetadata option which was made obsolete by the above changes. This option is now ignored.*
  • AWS credentials and the cryptographic password can now be provided via prompts on the command-line, rather than merely through a properties file
  • Upload a directory and all its contents to an online service
  • Download a directory and all its contents from an online service
  • Specify a path that includes a prefix component to download only the objects that match the prefix (see Examples)
  • Option to move rather than copy files to or from the storage service. In this case, the source file or object is deleted after it has been transferred
  • Automatically compress (gzip) and/or encrypt files sent to the service
  • Sophisticated file comparisons used to determine whether files have changed, so only new or changed files are transferred
  • Upload any number of files and/or directories at one time
  • Option to upload or download files in batches of 1,000 to reduce the memory requirements when synching large numbers of objects
  • Option to skip object metadata comparisons to speed up synchronization of large buckets
  • Access Control List permissions of uploaded files can be set to PRIVATE, PUBLIC_READ or PUBLIC_READ_WRITE
  • When uploading files, specific file/directory paths can be ignored using .jets3t-ignore settings files
  • Control the level of detail in the application's reporting
  • The --credentials option allows AWS credentials to be loaded from an encrypted file rather than an insecure properties file. The encrypted file can be created with the AWSCredentials API or the Cockpit application
  • Prompts for HTTP Proxy login credentials if they are required but not provided in the jets3t.properties file
  • Improved handling of uploads of many files where the files must first be transformed (compressed or encrypted). See the upload.transformed-files-batch-size setting.
  • Set custom metadata information when uploading files with upload.metadata.NAME=VALUE properties. See the synchronize.properties file for examples
  • Ignore missing or unreadable local files or folders when uploading, instead of failing with an error, by setting the upload.ignoreMissingPaths property to true.

Getting Started

Synchronize can be run from the command line using the scripts included in the bin directory of the JetS3t distribution.

For Windows computers use the script synchronize.bat.

For Unix-like computers, use the script synchronize.sh.

Files are copied to the Amazon S3 or Google Storage service with an UP(load) operation, and copied from the service to your computer with a DOWN(load). By default, only new or changed files are transferred.

Preview Your Actions

Because Synchronize commands can delete or replace your files, you should always test your commands with the --noaction option before you run them for real. This option lets you preview the actions that Synchronize will take, so you can avoid performing a command you will regret.
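One way to make previewing a habit is to wrap the run script so that every command is first executed with --noaction and only repeated for real after you confirm. The following is a minimal sketch and not part of JetS3t: the function name sync_with_preview and the SYNCHRONIZE_CMD variable are hypothetical, and it assumes the run script is reachable as ./synchronize.sh.

```shell
# Hypothetical wrapper: preview a Synchronize command, then run it only
# after confirmation. SYNCHRONIZE_CMD is an assumed variable that points
# at the run script from the bin directory.
SYNCHRONIZE_CMD="${SYNCHRONIZE_CMD:-./synchronize.sh}"

sync_with_preview() {
    # First pass: --noaction reports what would happen without changing files.
    "$SYNCHRONIZE_CMD" --noaction "$@" || return 1

    printf 'Run this command for real? [y/N] '
    read -r answer
    case "$answer" in
        y|Y) "$SYNCHRONIZE_CMD" "$@" ;;
        *)   echo "Aborted; nothing was changed." ;;
    esac
}
```

For example, sync_with_preview UP MyBackups Documents Reports would print the --noaction report and then wait for a "y" before uploading anything.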

Properties File: synchronize.properties*

Synchronize looks for a Java properties file named synchronize.properties in the classpath. This properties file will often include the properties accesskey and secretkey which define the credentials for your Amazon Web Services account, though these can also be provided on the command line or when prompted if you don't wish to record them in a text file.

Here is the default synchronize.properties file, which includes a brief explanation of the main settings.

######################################
# Synchronize application properties
#
# This file should be available on the 
# classpath when Synchronize is run
######################################

# Service Access Key (if commented-out, Synchronize will ask at the prompt)
#accesskey=<YourServiceAccessKey>

# Service Secret Key (if commented-out, Synchronize will ask at the prompt)
#secretkey=<YourServiceSecretKey>

# Access Control List setting to apply to uploads, must be one of: 
# PRIVATE, PUBLIC_READ, PUBLIC_READ_WRITE
# The ACL setting defaults to PRIVATE if this setting is missing.
acl=PRIVATE

# Password used when encrypting/decrypting files, applicable only with the --crypto option.
# password=

# If "upload.max-part-size" is set, files larger than this value will be split into 
# smaller parts no larger than the value and uploaded as Multipart Uploads.
# 5 GB is used as a default value if this property is not set, since this is the largest
# object size supported by services. 
# NOTE: The Multipart Upload feature is currently only available in the Amazon S3 service. 
#upload.max-part-size=5368709120

# If "upload.ignoreMissingPaths" is set to true, Synchronize will perform an upload despite missing  
# or unreadable source files. If set to false, Synchronize will halt if files or paths are missing. 
# WARNING: Be careful enabling this option, as it could cause legitimate objects in S3 to be
# deleted if the corresponding local files cannot be found or read.
#upload.ignoreMissingPaths=true

# Maximum number of files to transform and upload at a time, when file transformation is
# required (eg. when files are gzipped or encrypted during synchronization).
# When commented out, no batching takes place.  
#upload.transformed-files-batch-size=1000

# Custom metadata to apply when uploading new files to S3. Use the prefix "upload.metadata."
# followed by the metadata item name, an equals sign, and the metadata value. For example:
#upload.metadata.Cache-Control=max-age=300
#upload.metadata.Expires=Thu, 01 Dec 1994 16:00:00 GMT
#upload.metadata.my-metadata-item=This is the value for my metadata item 

Usage Instructions

To view Synchronize's usage instructions run synchronize.sh --help. These instructions describe the command-line parameters required by Synchronize and the options available.

To see some example commands and usage scenarios refer to the Examples section below.

Usage: Synchronize [options] UP <Path> <File/Directory> (<File/Directory>...)
   or: Synchronize [options] DOWN <Path> <DownloadDirectory>

UP      : Synchronize the contents of the Local Directory with a service.
DOWN    : Synchronize the contents of a service with the Local Directory
Path    : A path to the resource. This must include at least the
          bucket name, but may also specify a path inside the bucket.
          E.g. <bucketName>/Backups/Documents/20060623
File/Directory : A file or directory on your computer to upload
DownloadDirectory : A directory on your computer where downloaded files
          will be stored

Required properties can be provided via: a file named 'synchronize.properties'
in the classpath, a file specified with the --properties option, or by typing
them in when prompted on the command line. Required properties are:
          accesskey : Your Access Key (Required)
          secretkey : Your Secret Key (Required)
          password  : Encryption password (only required when using crypto)
Properties specified in this file will override those in jets3t.properties.

Options
-------
-h | --help
   Displays this help message.

--provider <provider id>
   Service provider, either 'S3' for Amazon S3 or 'GS' for Google Storage

-n | --noaction
   No action taken. No files will be changed locally or on service, instead
   a report will be generated showing what will happen if the command
   is run without the -n option.

-q | --quiet
   Runs quietly, without reporting on each action performed or displaying
   progress messages. The summary is still displayed.

-p | --noprogress
   Runs somewhat quietly, without displaying progress messages.
   The action report and overall summary are still displayed.

-f | --force
   Force tool to perform synchronization even when files are up-to-date.
   This may be useful if you need to update metadata or timestamps online.

-k | --keepfiles
   Keep outdated files on destination instead of reverting/removing them.
   This option cannot be used with --nodelete.

-d | --nodelete
   Keep files on destination that have been removed from the source. This
   option is similar to --keepfiles except that files may be reverted.
   This option cannot be used with --keepfiles.

-m | --move
   Move items rather than merely copying them. Files on the local computer will
   be deleted after they have been uploaded to service, or objects will be deleted
   from service after they have been downloaded. Be *very* careful with this option.
   This option cannot be used with --keepfiles.

-b | --batch
   Download or upload files in batches, rather than all at once. Enabling this
   option will reduce the memory required to synchronize large buckets, and will
   ensure file transfers commence as soon as possible. When this option is
   enabled, the progress status lines refer only to the progress of a single batch.

-g | --gzip
   Compress (GZip) files when backing up and Decompress gzipped files
   when restoring.

-c | --crypto
   Encrypt files when backing up and decrypt encrypted files when restoring. If
   this option is specified the properties must contain a password.

--properties <filename>
   Load the synchronizer app properties from the given file rather than from
   a synchronize.properties file in the classpath.

--credentials <filename>
   Load your service credentials from an encrypted file, rather than from the
   synchronize.properties file. This encrypted file can be created using
   the Cockpit application, or the JetS3t API library.

--acl <ACL string>
   Specifies the Access Control List setting to apply. This value must be one
   of: PRIVATE, PUBLIC_READ, PUBLIC_READ_WRITE. This setting will override any
   acl property specified in the synchronize.properties file

--reportlevel <Level>
   A number that specifies how much report information will be printed:
   0 - no report items will be printed (the summary will still be printed)
   1 - only actions are reported          [Prefixes N, U, D, R, F, M]
   2 - differences & actions are reported [Prefixes N, U, D, R, F, M, d, r]
   3 - DEFAULT: all items are reported    [Prefixes N, U, D, R, F, M, d, r, -]

Report
------
Report items are printed on a single line with an action flag followed by
the relative path of the file or object. The report legend follows:

N: A new file/object will be created
U: An existing file/object has changed and will be updated
D: A file/object existing on the target does not exist on the source and
   will be deleted.
d: A file/object existing on the target does not exist on the source but
   because the --keepfiles or --nodelete option was set it was not deleted.
R: An existing file/object has changed more recently on the target than on the
   source. The target version will be reverted to the older source version
r: An existing file/object has changed more recently on the target than on the
   source but because the --keepfiles option was set it was not reverted.
-: A file is identical between the local system and service, no action is necessary.
F: A file identical locally and in service was updated due to the Force option.
M: The file/object will be moved (deleted after it has been copied to/from service).

WARNING: Be very careful when restoring files from a service to a directory that already contains files. By default Synchronize will delete any files in the target directory that are not present in the service.

Examples

The best way to get the hang of Synchronize is to experiment with the commands on some test files you don't care about, and use the JetS3t Cockpit application to see how uploads are stored online.

Before you start, modify the sample properties text file called synchronize.properties located in the configs directory to include your own Access Key and Secret Key settings (see Getting Started above). Use the synchronize run scripts provided in the bin directory to run Synchronize.

Backing up files to Online Service

Let's say you have two directories containing important files (e.g. Documents and Reports) and you want to back them up to a bucket called MyBackups (note that bucket names are shared between all users of a service, so you should really use a more distinctive name, like <MyAWSAccessKey>.MyBackups):

synchronize.sh UP MyBackups Documents Reports

After you have run this command once on your own computer and uploaded some files, try running the same command again. Synchronize will look at the contents of your online account and work out that it already contains your files, so it will not have any work to do.

Now add a file or two to your Documents directory, and perhaps change one of the files as well. This time when you run the command, Synchronize will upload the new and changed files. Alternatively, you can run the command with the --noaction option to make Synchronize tell you what it would do without actually doing the work:

synchronize.sh --noaction UP MyBackups Documents Reports

Restoring files from Online Service

There are a few cases where you might want to restore a directory from online. The simplest cases are when none of the files exist on your computer, for example if you have deleted the whole directory by mistake (oops!), or you want to download a copy of this directory to a second computer.

Let's simulate this simple case by downloading all the files in your online account to a new directory:

mkdir RestoreDirectory
synchronize.sh DOWN MyBackups RestoreDirectory

Synchronize will download all the contents of the online directory into the new directory. As with the UP example above, re-running this command will do nothing, because Synchronize will detect that RestoreDirectory has the same contents as the online directory.

This same command can restore missing files, such as files that have accidentally been deleted. To see how this works, delete one of the files in RestoreDirectory and re-run the command to restore it.

Selectively Restoring Files

If you have many files backed up in your online account, you may wish to download only a subset of these files. Synchronize offers two closely-related techniques to selectively download files from your account. First, you can add a subdirectory path to a command's Path argument to download only the files in that subdirectory. In our upload example we backed up the contents of two directories: Documents and Reports. To download only the contents of the Documents subdirectory you could run the following command:

synchronize.sh DOWN MyBackups/Documents/ Documents

The second technique to download a subset of files is to use a prefix constraint. A prefix constraint is a portion of text that matches the beginning of specific objects in your online account. This approach is similar to using a subdirectory name as we did above, but with a subtle difference: the prefix constraint does not need to match an entire subdirectory or file name. For example, if your Documents directory contained some files whose names started with "TaxReport" and other files whose names started with "Projections", you could download only the tax documents with the following command:

synchronize.sh DOWN MyBackups/Documents/Tax Documents
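
The difference between a subdirectory path and a prefix constraint can be illustrated with plain string matching. The sketch below only simulates the matching rule locally on a list of object names, shown relative to the bucket. It does not contact any service, it is not how Synchronize is implemented, and the function name and file names are made up for illustration.

```shell
# Print every object key that begins with the given prefix.
# Illustration only: the keys below are hypothetical and nothing is downloaded.
keys_matching_prefix() {
    prefix=$1
    shift
    for key in "$@"; do
        case "$key" in
            "$prefix"*) printf '%s\n' "$key" ;;
        esac
    done
}

# The prefix "Documents/Tax" selects the tax files but not the projections.
keys_matching_prefix "Documents/Tax" \
    "Documents/TaxReport-2005.doc" \
    "Documents/TaxReport-2006.doc" \
    "Documents/Projections-2006.xls"
```

Run as written, this prints the two TaxReport keys only, even though "Tax" is neither a complete directory name nor a complete file name.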

Changed files

Things can get more complicated when files on your computer are also stored in an online service, but the contents of the two copies do not match. For example, let's say that you backed up your documents earlier, but afterwards one of your documents somehow got corrupted. In this case the default Synchronize DOWN command above will restore the file by reverting it to the backed-up version from online.

Warning! To repeat: by default, Synchronize will revert changed files when downloading. That is, it will replace changed files on your computer with older versions from online. So be careful!

If you want to keep files you have updated and only download files not already on your computer, you can prevent Synchronize from reverting files with the --keepfiles option. Let's say that you have changed a number of files in your NewDocumentsDirectory directory but one of them has been corrupted. You want to revert the one corrupt file but keep the others. To do this, delete the corrupted file then run the following Synchronize command:

synchronize.sh --keepfiles DOWN MyBackups NewDocumentsDirectory

With the --keepfiles option, Synchronize will replace any missing files (i.e. the corrupted one you deleted) but leave the changed files alone.

Faster Synchronization, Especially for Large Filesets

Synchronize is designed to use as much information as possible to ensure that files can be synchronized to or from Amazon S3 or Google Storage safely and reliably. However, if you need to synchronize many files or the process just takes too much time you may want to try the following options to make the process as fast as possible.

  • Increase the number of HTTP connections and execution threads that are available to Synchronize by changing the values of the httpclient.max-connections and storage-service.admin-max-thread-count properties in your jets3t.properties or synchronize.properties file.
    You should set these values much higher than the default values. A value of 100 for each of these properties is a good starting point if you have a reasonable broadband Internet connection.
    You could also try increasing the value of the storage-service.max-thread-count setting which controls how many file uploads or downloads can occur at one time. Be careful raising this too high because you may cause I/O errors if you transfer many large files.
  • Speed up file content comparisons by generating and re-using MD5 hash files. Synchronize uses MD5 hash comparisons to determine when a local file differs from a version online and normally the hash value of local files is computed on-demand. You can gain a significant speed-up by saving these computed values into files that can be referenced again later.
    To save MD5 hash values into <filename>.md5 files for re-use, set the following properties in your jets3t.properties or synchronize.properties file: filecomparer.generate-md5-files=true, filecomparer.use-md5-files=true and filecomparer.skip-upload-of-md5-files=true (assuming you don't want the <filename>.md5 files to be synchronized to your online account along with your data files).
    If you want to avoid cluttering up your data directories with <filename>.md5 files you can set the filecomparer.md5-files-root-dir property to point to a separate directory in which all the MD5 files will be stored.
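
As a rough sketch, the settings discussed above could be appended to a properties file with a small shell function. The property names and the value of 100 come from the text above; the function name append_tuning_properties is hypothetical, and the values are starting points to adjust rather than recommendations for every connection.

```shell
# Append the performance-related settings to the properties file given as $1
# (for example your jets3t.properties or synchronize.properties file).
append_tuning_properties() {
    cat >> "$1" <<'EOF'

# More HTTP connections and admin threads than the defaults
httpclient.max-connections=100
storage-service.admin-max-thread-count=100

# Save and re-use MD5 hashes of local files, without uploading the .md5 files
filecomparer.generate-md5-files=true
filecomparer.use-md5-files=true
filecomparer.skip-upload-of-md5-files=true
EOF
}
```

Call the function once per file, for example append_tuning_properties synchronize.properties. Running it repeatedly would add duplicate entries.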

Options, Options

This section describes some of the Synchronize options in more detail.

noaction: Synchronize will not perform any action, and will not upload or download any files, but it will print reports and summaries as if it was run normally. This option is very useful for checking what actions a Synchronize command will perform before running the command for real.

quiet: Synchronize will only print out a summary of its actions instead of a line-by-line report describing the action taken for each file.

force: Synchronize will upload/download files even when it thinks they have not been changed. You might want to do this if you're worried that Synchronize isn't correctly identifying which files have changed, and you want to force it to update every file.

keepfiles: This option tells Synchronize to keep files that it would otherwise revert or delete. With this option set, files that have been updated on the destination compared to the source directory, which would normally be reverted, will be left alone. Also, files on the destination that have been deleted in the source directory will be left in place, rather than deleted.

Note: The keepfiles option can sometimes be convenient but isn't intended for regular use. In effect it prevents Synchronize from doing its main job, which is to maintain an identical directory structure between your computer and your online account. If you have to use it regularly, Synchronize probably isn't the right tool for what you're trying to do.

gzip: Files are compressed to gzip files prior to being uploaded, and are decompressed when being downloaded. Note: if gzipped files are downloaded without this option they will not be decompressed, and will not have any file extension (like .gz) to indicate that they are gzip files. It will be your responsibility to decompress these files.

crypto: Files are encrypted with the password specified in the Properties File's password setting prior to being uploaded, or are decrypted with this password when being downloaded. Note: if encrypted files are downloaded without this option they will not be decrypted, and will not have any file extension to indicate that they are encrypted files. It will be your responsibility to decrypt these files.

Notes

Compressing/encrypting uploads: Synchronize will create temporary files when used with any upload options that change the contents of uploaded files, such as compressing or encrypting them. This means that you will need up to twice as much free space in your default temp directory as taken by the files you intend to upload.
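
A quick check before a large compressed or encrypted upload is to compare the size of the source files with the free space in the temp directory. The sketch below is a helper built on assumptions and is not part of JetS3t: the function names are hypothetical, and it assumes the temp directory is ${TMPDIR:-/tmp}, which may differ from the directory your Java runtime actually uses (the java.io.tmpdir system property).

```shell
# Total size, in kilobytes, of the files and directories given as arguments.
upload_size_kb() {
    du -sk "$@" | awk '{ total += $1 } END { print total + 0 }'
}

# Free space, in kilobytes, on the filesystem holding the given directory.
free_space_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeeds when the assumed temp directory has at least twice as much free
# space as the files to upload, the upper bound mentioned in the note above.
has_temp_space_for() {
    needed=$(upload_size_kb "$@")
    available=$(free_space_kb "${TMPDIR:-/tmp}")
    [ "$available" -ge $((needed * 2)) ]
}
```

A typical use would be has_temp_space_for Documents Reports || echo "Not enough temp space", run before a synchronize.sh command that uses --gzip or --crypto.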