File management
Introduction
File management is a complex challenge for any media asset management system. In censhare, the basic concept of data storage is that metadata and physical files are kept apart: all asset metadata is stored in the Core Database, while physical documents (such as layout, image, and text documents) are stored in classical filesystems. While it would be technically possible to store the files in the database, there are numerous reasons not to do so. The main reasons for censhare to keep documents in classical filesystems are the need to distribute storage locations across multiple sites and the need to provide emergency file access in case of a database failure.
In the following, we describe the building blocks that constitute file management in censhare, as well as the history of their development. This helps to explain the design decisions made by censhare. Note that users never come into contact with file management details, as all file manipulation takes place on lower levels of the system architecture. From the user perspective, the only access to files is through their asset placeholders in the client applications.
To understand the basics of censhare file management, one has to be aware that assets can be linked to a physical document but do not have to be. A calendar asset, for example, consists only of metadata in the database and has no file associated with it. An image asset usually represents a file, namely the image document. For those assets that represent a file, there is exactly one Master File. This is the file that is accessed when the asset is opened. The censhare system automatically produces additional variations of the Master File that serve certain tasks, for example previews and thumbnails, PDFs, XML extracts, and the like. All these files are associated with the same asset, although they are not directly visible to users. In censhare terminology, all files associated with one asset are called Storage Items. Each Storage Item has its own location in the filesystem and carries a key that describes its purpose. The main file, the one that users are aware of, has the key "master", the preview image file has the key "preview", and so on.
A common misconception among new censhare users is the belief that each Storage Item is represented by its own asset and that non-master Storage Items are child assets of a "master asset". That is not the case. There is only one asset representing a Master File. All the additional helper files that are automatically generated from the Master File belong to the metadata of this very same asset. This means that one censhare asset actually maintains a number of files, and this number can grow very large, considering that preview and thumbnail Storage Items are produced per page and all Storage Items are subject to versioning. A multi-page layout asset can easily produce hundreds or thousands of Storage Items, although users are only aware of the one layout document that constitutes the Master File of the current version.
For this reason, censhare decided to name files in the filesystem with consecutive numbers instead of descriptive filenames derived from the asset names.
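The relationship between one asset and its Storage Items can be pictured with a small data model. The following is a minimal sketch in Python, not censhare's actual schema: the class names, field names, and paths are hypothetical, and the "thumbnail" key is an assumption (the text above only names "master" and "preview").

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StorageItem:
    key: str                    # purpose of the file, e.g. "master" or "preview"
    version: int                # Storage Items are subject to versioning
    relative_path: str          # numbered location in the file system (made up here)
    page: Optional[int] = None  # previews and thumbnails are produced per page


@dataclass
class Asset:
    name: str
    storage_items: List[StorageItem] = field(default_factory=list)

    def master_file(self, version: int) -> Optional[StorageItem]:
        """Return the Master File of a version, or None for metadata-only assets."""
        for item in self.storage_items:
            if item.key == "master" and item.version == version:
                return item
        return None


# A calendar asset consists of metadata only, so it has no Storage Items.
calendar = Asset(name="Production calendar")

# A two-page layout in its first version: one master plus per-page helper files.
layout = Asset(name="Spring brochure")
layout.storage_items.append(StorageItem("master", 1, "07/700"))
for page in (1, 2):
    layout.storage_items.append(StorageItem("preview", 1, f"07/70{page}", page))
    layout.storage_items.append(StorageItem("thumbnail", 1, f"07/71{page}", page))

print(len(layout.storage_items))        # five files, still a single asset
print(calendar.master_file(1) is None)  # metadata-only asset has no Master File
```

The helper files live in a list on the asset itself rather than in child assets, which mirrors the point made above.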
Storage location
So, where do files actually go? censhare lets you define one or more physical locations to store files. Any filesystem that the Application Server(s) can access over the network can be used for the censhare repository. Providing the physical file systems and making them available to the censhare Application Server(s) is up to the customer and considered a technical installation topic. censhare supports numerous standard file system protocols, like Samba, NFS, AFP, Helios Ethershare, and others. Once available, logical file systems are set up in the censhare administration, where each logical file system represents one volume in one of the physical file systems. The logical file system description contains the path to the volume, a name, the purpose of the file system, and a domain node association. The domain node association is responsible for the distribution of files: when a file is stored, censhare looks up the nodes of the corresponding asset in the two domain trees. If no file system is associated with a node, the system follows the domain path upwards to the first node that has a file system associated with it and stores the master file there. All other Storage Items of the same asset go into the same location as the master file. The actual file names and locations (which are numbers, as described before) are stored in the metadata of the asset. This information is split into a reference to the logical censhare file system and the relative path inside the file system. This split makes it easy to migrate physical file systems if required.
In the volumes that are assigned to censhare, the system takes full control of creating, naming, and deleting files and folders (files are never moved). Asset files are stored in dynamically numbered folders so that no directory ever holds more than 100 files. This structure is fully automatic: the system creates new folders on the fly as needed.
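The upward lookup through the domain path can be sketched as follows. This is a simplified illustration under stated assumptions, not censhare code: it models a single domain tree only, and the node names, file system names, and the function `resolve_filesystem` are all hypothetical.

```python
from typing import Dict, Optional

# Hypothetical domain tree: each node maps to its parent (the root has none).
PARENT: Dict[str, Optional[str]] = {
    "root.": None,
    "root.berlin.": "root.",
    "root.berlin.magazine.": "root.berlin.",
    "root.munich.": "root.",
}

# Hypothetical association of domain nodes with logical file systems.
FILESYSTEM_OF_NODE: Dict[str, str] = {
    "root.": "assets-main",
    "root.berlin.": "assets-berlin",
}


def resolve_filesystem(node: str) -> str:
    """Walk from the asset's domain node towards the root until a node has a file system."""
    current: Optional[str] = node
    while current is not None:
        if current in FILESYSTEM_OF_NODE:
            return FILESYSTEM_OF_NODE[current]
        current = PARENT[current]
    raise LookupError(f"no file system configured on the path of {node!r}")


print(resolve_filesystem("root.berlin.magazine."))  # assets-berlin
print(resolve_filesystem("root.munich."))           # assets-main
```

In this sketch an asset in the magazine domain inherits the Berlin file system, while Munich falls back to the one configured at the root.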
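One way to keep every directory at or below 100 files is to derive the folder path from the consecutive file number itself. The exact scheme censhare uses is not described here, so the following is only an assumed layout for illustration; the function name `relative_path` and the two-digit folder names are hypothetical.

```python
from pathlib import PurePosixPath

FILES_PER_DIRECTORY = 100  # limit stated in the text


def relative_path(file_number: int) -> PurePosixPath:
    """Map a consecutive file number to a nested, numbered folder path."""
    if file_number < 0:
        raise ValueError("file numbers are non-negative")
    # All numbers that share the same bucket end up in the same directory.
    bucket = file_number // FILES_PER_DIRECTORY
    # Split the bucket number into two-digit folder names, least significant first.
    parts = []
    while True:
        bucket, part = divmod(bucket, 100)
        parts.append(f"{part:02d}")
        if bucket == 0:
            break
    return PurePosixPath(*reversed(parts), str(file_number))


print(relative_path(1234567))  # 01/23/45/1234567
print(relative_path(42))       # 00/42
```

Because the directory is determined by the file number divided by 100, each directory receives at most 100 files, and new folders only appear when the counter crosses into a new bucket.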
File access
One of the biggest challenges is file access. In the very first versions of censhare, the filesystems had to be mounted on the client computers, and the censhare client application would read from and write to the mounted filesystems directly. This procedure was common at the time but came with some problems:
First, it was possible for the user to interact directly with the mounted volume, bypassing censhare, and causing potential disaster by renaming, modifying, or deleting files.
censhare developed a procedure in which the asset volume was write-protected for the users. The client application was still able to open files, but the user was not able to interfere with the repository. To get modified files back into the system, a second volume, called the asset temp volume, had to be created and mounted on the client computers alongside the asset volume. Now, whenever a file had to be written back to the system, the censhare client wrote the file into the temp volume (which had write permission for the users) under a randomized name and informed the Application Server about the new file. The Application Server would move the file into the correct new location in the repository and inform the client whether the migration had succeeded. The entire process was organized as a transaction so that any failure on the way would roll back the save command.
Second, preparing each client computer so that the volumes (asset and temp) mount automatically at startup can be a hassle. There were also issues if a volume was unmounted due to network problems, which usually resulted in the user restarting the computer.
censhare solved this problem by providing an alternative way to access files. Instead of having the client application access the file through the operating system, censhare developed a streaming feature that transmits a requested file directly from the Application Server to the client within their regular communication. The client application then creates the file in a local folder on the client computer. Benchmark tests showed that this procedure yields the same speeds as direct access, so censhare's file streaming was implemented as the default file access method. It was no longer necessary to mount the file systems on the client computers. There was one exception, though: all layout applications like InDesign and XPress need valid, accessible paths to the storage locations of placed elements like images. For this reason, it was necessary to mount the asset volume on layout workstations, even though the client on the same machine was able to access files with file streaming without direct access to the volumes.
Finally, InDesign CS4 introduced the possibility to provide alternative path information for placed images. Instead of a file system path, it was now possible to give URIs as file locations. Consequently, censhare developed an internal web service that gives InDesign access to the files. Since then, it has no longer been necessary to mount volumes on client computers.
Note, however, that the procedure of moving new files first to the asset temp volume and from there to the asset volume is still maintained internally, so both volumes must be provided for each physical filesystem that is used.
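The two-step save through the temp volume can be illustrated with a short sketch. It is not censhare's implementation: the function names `client_write` and `server_commit` are hypothetical, the rollback is reduced to discarding the temporary file, and the sketch assumes that both volumes reside on the same physical file system so that the move is a simple rename.

```python
import os
import tempfile
import uuid
from pathlib import Path


def client_write(temp_volume: Path, data: bytes) -> Path:
    """Client side: write the new file into the temp volume under a randomized name."""
    temp_file = temp_volume / uuid.uuid4().hex
    temp_file.write_bytes(data)
    return temp_file


def server_commit(temp_file: Path, asset_volume: Path, relative_path: str) -> bool:
    """Server side: move the file to its final location and report success or failure."""
    target = asset_volume / relative_path
    try:
        target.parent.mkdir(parents=True, exist_ok=True)
        os.replace(temp_file, target)  # a rename; fails across different file systems
        return True
    except OSError:
        # Roll back: discard the temporary file so no half-saved state remains.
        temp_file.unlink(missing_ok=True)
        return False


# Demonstration with two throwaway directories standing in for the volumes.
with tempfile.TemporaryDirectory() as root:
    temp_volume = Path(root, "asset-temp")
    asset_volume = Path(root, "assets")
    temp_volume.mkdir()
    asset_volume.mkdir()
    uploaded = client_write(temp_volume, b"layout data")
    succeeded = server_commit(uploaded, asset_volume, "00/42")
    print(succeeded, (asset_volume / "00" / "42").exists())
```

Only the server-side step touches the asset volume, which is why that volume can stay write-protected for users.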
Distributed filesystems
The number of filesystems needed depends on the installation. As a basic rule, the files needed at a site have to be available in its local network. Files can grow very large, and even with good bandwidth between production sites, remote file access can cause delays. In our experience, there are at least as many file systems for the censhare repository as there are production sites or local networks. We discussed earlier how the domain trees are used to decide which file goes where, so multiple production sites are typically reflected in the domain structure. The decision is made when an asset is checked in.
It is often desirable to move files between filesystems independently of manual check-in actions. censhare provides a collection of tools for this purpose. They transfer files triggered by events, at certain times or intervals, or based on rules that include domains, asset types, workflow steps, and the like. As a general rule, these tools work transparently for users, who do not have to deal with technical details such as the File Transfer Protocol (FTP).
Archiving and file replication
censhare also includes two internal file replication features. One is for archiving: besides the asset and temp volumes, which together make up a logical censhare file system, it is possible to define an additional physical file system location for archiving purposes. The archiving file system is typically a less reliable but cheaper device. Storage Items of assets that are flagged for archiving (manually or automatically based on rules) go automatically into this location, under the same name and relative path that they had in the asset volume.
The other feature replicates any file-writing process to any number of additional file systems. It is used for backup purposes (covering only files that belong to the censhare system!) or for duplicating repositories to other sites, which allows dynamic switching of production sites based on load.
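Conceptually, write replication means that every file written to the main volume is also written to each additional file system. The sketch below illustrates this idea only: the function name `write_replicated` is hypothetical, and reusing the same relative path in every replica is an assumption borrowed from the archiving behaviour described above.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def write_replicated(primary: Path, replicas: Iterable[Path],
                     relative_path: str, data: bytes) -> None:
    """Write a file to the primary volume and repeat the write on every replica."""
    for volume in (primary, *replicas):
        target = volume / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)


# Demonstration with throwaway directories standing in for the file systems.
with tempfile.TemporaryDirectory() as root:
    primary = Path(root, "assets")
    backup = Path(root, "backup")
    remote_site = Path(root, "remote-site")
    write_replicated(primary, [backup, remote_site], "01/23/12345", b"image data")
    print((backup / "01/23/12345").read_bytes() == b"image data")
```

Because the relative path is identical everywhere, a replica can stand in for the primary volume by changing only the volume root.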
Offline database
Finally, censhare provides a safety feature for the case that the censhare database becomes unavailable for a critical period during production. In that case, we would face the unpleasant scenario that all the files necessary to continue the current production are available in the unaffected file system(s), but we cannot find them because of the non-descriptive numbered file names. To avoid such a situation, censhare logs all activities in the file system in a set of HTML files. These HTML files are organized by customizable criteria, typically based on issue structure and recent activities. The HTML files can be accessed either directly or through a web service of the Application Server, and they provide means of browsing assets by their most essential metadata, relation hierarchies, and previews. All Storage Items of an asset are listed, and with a click on the respective link, the files can be copied from the repository to the local hard disk for emergency production.
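The idea of such an HTML log can be shown with a very small generator that lists assets by name and links each Storage Item to its numbered file. This is only an illustrative sketch: the record layout, the function name `render_index`, and the resulting markup are assumptions and do not reflect the format censhare actually writes.

```python
from html import escape
from typing import Dict, List


def render_index(assets: List[Dict]) -> str:
    """Render a browsable list that maps asset metadata to numbered file paths."""
    rows = []
    for asset in assets:
        links = ", ".join(
            f'<a href="{escape(path, quote=True)}">{escape(key)}</a>'
            for key, path in asset["storage_items"].items()
        )
        rows.append(f"<li>{escape(asset['name'])} ({escape(asset['type'])}): {links}</li>")
    return "<ul>\n" + "\n".join(rows) + "\n</ul>"


# Hypothetical records; names, types, and paths are made up for the example.
page = render_index([
    {
        "name": "Cover & spine",
        "type": "layout",
        "storage_items": {"master": "01/23/12345", "preview": "01/23/12346"},
    },
])
print(page)
```

With a page like this, a numbered file such as 01/23/12345 can be found through the asset name even when the database cannot be queried.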