Using S3 buckets as file storage
Available options
In the default configuration, asset files (storage items) are downloaded from the server and saved to the local filesystem of each satellite. This is quite inefficient, especially in a cluster setup with two or more satellites, where each file takes up two or more times the storage space in total.
A good alternative is S3 buckets. We recommend using S3 or any S3-compatible object storage (for example: MinIO, Google Cloud Storage) whenever possible. However, there are two different ways to use and configure S3:
- a shared filesystem: using existing S3 buckets that are already used by the censhare Server;
- using dedicated S3 buckets, the so-called s3-push configuration.
Both configurations are stored in the DataStore configuration assets, as explained below in more detail. Each approach has its advantages and disadvantages, which are discussed in this article.
One basic principle, however, is common to both of them: all files are always written by the censhare Server, and satellites then read them whenever needed. This means that both systems can and should use different access roles; satellites should have read-only access and no write privileges. For details on how to set up the permissions properly, please consult the official AWS documentation.
Using dedicated DataStore buckets (aka s3-push)
Purpose, advantages and disadvantages
In this setup, a new S3 bucket is created and dedicated to a certain DataStore. The bucket always contains only files from the assets in this DataStore (i.e., assets marked with the appropriate Output Channel feature) and no other files. These files are created or deleted by the Server as part of the DataStore synchronization. This synchronization happens continuously as long as at least one satellite with this DataStore is connected to the Server.
Note that the Server never reads or uses these files for any purpose other than the regular DataStore synchronization. The Server has its own data storage, defined by the filesystem element in the DataStore configuration asset. In some cases, such a filesystem may also use an S3 bucket, but it is always a completely different S3 bucket.
When and why use s3-push / dedicated buckets?
- When the Server does not use S3 to store its own files, this is the only option for using S3.
- This is the case for the majority of censhare installations, hosted either on-premises or in the censhare datacenter.
- The dedicated S3 bucket contains only the minimum necessary files, minimizing risks of data leaks.
- This is especially important when the satellite is in some kind of DMZ or cloud environment, as is often the case.
- The dedicated S3 buckets can be created in an appropriate geographic region / datacenter, improving the latency of serving these files to users around the globe.
Why not use s3-push / dedicated buckets?
Files in these dedicated S3 buckets are copies of the original ones, which are still stored somewhere else. This means additional costs for storage space. Depending on the specific situation, this may be negligible or critical; make sure to do a proper analysis.
Configuration
Configuration can be done directly in the DataStore configuration asset or in certain cases using the HCMS CLI tool.
DataStore configuration asset
Note: The configuration is stored in assets of the type module.satellite.osgi.configuration. They are always related to a parent asset of the type module.satellite.configuration to form a full satellite configuration group.
The s3-push configuration has its own dedicated configuration element <s3-push> at the main level, i.e., directly in the config element.
A typical configuration with a single dedicated bucket in a single region looks like this:
<s3-push>
    <satellites default-bucket-region="${S3_DATA_REGION}" default-bucket-name="${S3_DATA_NAME}"
                default-access-key="${S3_DATA_KEY1}" default-secret-key="${S3_DATA_SECRET1}">
    </satellites>
    <server>
        <bucket name="${S3_DATA_NAME}" region="${S3_DATA_REGION}" access-key="${S3_DATA_KEY2}" secret-key="${S3_DATA_SECRET2}"/>
    </server>
</s3-push>
The configuration is actually split into two parts: one for the satellites and one for the Server. The bucket name and region are the same, but the credentials (the access key and the secret key) are usually different, since the satellite credentials should not grant write/delete permissions.
The *-key attributes are actually optional and should be used only if necessary. If the satellites or even the Server run in a full AWS EC2/ECS/EKS environment, the correct way to provide permissions is to grant them directly to the EC2 instance, ECS service, etc. For details, please refer to the official AWS documentation.
Note: In the most common case, the satellites are hosted in AWS and the Server is hosted on-premises. In this setup, the <server> part must use access and secret keys, while the <satellites> part should omit them.
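For illustration, the following is a minimal sketch of such a setup, reusing the placeholder variables from the example above; the key attributes are simply omitted from the <satellites> element so that the satellites rely on the permissions of their AWS environment instead:
<s3-push>
    <!-- Satellites run in AWS and inherit their permissions from the EC2 instance / ECS service / EKS role,
         so no access or secret keys are configured here. -->
    <satellites default-bucket-region="${S3_DATA_REGION}" default-bucket-name="${S3_DATA_NAME}">
    </satellites>
    <!-- The on-premises Server cannot use AWS roles and therefore needs explicit credentials
         that grant write/delete permissions. -->
    <server>
        <bucket name="${S3_DATA_NAME}" region="${S3_DATA_REGION}" access-key="${S3_DATA_KEY2}" secret-key="${S3_DATA_SECRET2}"/>
    </server>
</s3-push>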
More complex configurations, e.g., multiple buckets in multiple regions, non-AWS S3-compatible object storage, etc., are out of scope for this article. Please refer to the appropriate XML schema.
Simple configuration using HCMS configuration tool (HCMS CLI)
Using hcms commands, you can read, create, or change simple one-region s3-push configurations. The tool edits the XML configuration (see above) in a more convenient way.
For details, see the appropriate documentation or follow the instructions provided here.
Using S3 buckets shared with the censhare Server
Purpose, advantages and disadvantages
In many cases, the censhare Server already uses S3 buckets (or an S3-compatible object storage) as file storage, especially when it is hosted on AWS.
HCMS and Online Channel satellites can also use these buckets directly, without the need to create new ones.
Why use the same, shared buckets?
- Cost saving: each file is stored only once and thus paid for only once.
- Better latency: when an asset is marked for deployment to a satellite, i.e., assigned an Output Channel, its files are already there; there is no need to upload them.
Why not use the same, shared buckets?
- Potentially, this option is less secure. The S3 bucket(s) contain(s) all files from all assets, including those that are not shared with the satellite. These files might contain sensitive data which should not be accessible from the DMZ or cloud environment where satellites often reside.
- This option might not be available at all, for instance, when the Server does not use S3 already, or when it uses a private S3-compatible object storage that is not accessible to the satellites, e.g., the Server is in a private cloud while the satellites are in a public cloud or a DMZ.
Configuration
Unlike the s3-push/dedicated buckets option, this one is not supported by the HCMS configuration tool and can only be set up by editing the configuration XML in the DataStore asset.
DataStore configuration asset
Note: The configuration is stored in assets of the type module.satellite.osgi.configuration. They are always related to a parent asset of the type module.satellite.configuration to form a full satellite configuration group.
Each shared filesystem is represented by an <s3-filesystem/> sub-element in the <filesystems> element. Each shared filesystem needs to have the following attributes:
- name: the filesystem name (id), the same one that is used for the Server
- s3-region: the correct AWS region
- s3-bucketName: name of the bucket
- s3-accessKey and s3-secretKey: AWS credentials
- If the satellite is running in an EC2/ECS/EKS environment where it automatically gains access (provided the access is set up correctly), these credentials are optional.
Important
- Do not use the same credentials as the ones you use for the Server.
- Read-only access is sufficient for satellites!
- Set s3-streaming="true" (optional but highly recommended)
- Without this option, files are actually downloaded to the local filesystem before being served, which is not practical.
Example configuration for S3 buckets shared with the Server
Below is an example configuration.
<filesystems>
    <!-- s3 file systems the Online Channel synchronizes with on its own instead of streaming the storage item file from the app server -->
    <!-- e.g.: <s3-filesystem name="name-on-server" s3-region="eu-west-1" s3-bucketName="bucket" s3-accessKey="ak" s3-secretKey="sk" /> -->
    <s3-filesystem name="assets-s3-bucket-1"
                   s3-streaming="true"
                   s3-region="eu-central-1"
                   s3-bucketName="os-187-test-bucket-1"
                   s3-accessKey="AKIA57HXKG2FVSZ2DJVC"
                   s3-secretKey="<key>"
    />
    <!-- This is actually a very uncommon example of multiple filesystems. -->
    <s3-filesystem name="assets-s3-bucket-2"
                   s3-streaming="true"
                   s3-region="eu-central-1"
                   s3-bucketName="os-187-test-bucket-2"
                   s3-accessKey="AKIA57HXKG2FVSZ2DJVC"
                   s3-secretKey="<key>"
    />
    <s3-filesystem name="assets-s3-bucket-3"
                   s3-streaming="true"
                   s3-region="ap-southeast-1"
                   s3-bucketName="os-187-test-bucket-3"
                   s3-accessKey="AKIA57HXKG2FVSZ2DJVC"
                   s3-secretKey="<key>"
    />
</filesystems>