Persistence Managers

The underlying data storage of CRX is composed of

  • one or more workspace stores,
  • the version storage,
  • the data store, and
  • the journal.

Each workspace in the repository can be separately configured to store its data through a specific persistence manager (the class that manages the reading and writing of the data). Similarly, the repository-wide version store can also be independently configured to use a particular persistence manager. A number of different persistence managers are available, capable of storing data in a variety of file formats or relational databases.

CRX also employs a dedicated data store which is optimized for storing large binaries. This can also be configured to store its data in the file system or in a database.

Finally, CRX uses a journal, which logs each change made to the workspace and version stores. The journal can also be configured to use either file-based or database storage.

Configuration Files

The configuration information for CRX storage can be found in the following locations:

  1. crx-quickstart/server/webapps/crx-explorer_crx.war!/WEB-INF/repository-template.xml
  2. crx-quickstart/repository/repository.xml
  3. crx-quickstart/repository/workspaces/<workspace-name>/workspace.xml

The repository-template.xml file is packed inside the file crx-explorer_crx.war (indicated by the "explode" character, !, in the path above). As its name indicates, repository-template.xml serves as the template for repository.xml. The template is only used when there is no repository present, which in most cases means a fresh CRX install. In that case, the repository-template.xml file is unpacked and copied to repository.xml, thus establishing the default configuration for the new CRX instance.

Within the repository.xml the <Workspace> section is used as a template when creating new workspaces. Every time a new workspace is created within the repository, the corresponding directory is created at crx-quickstart/repository/workspaces/<workspace-name> and the <Workspace> section of the repository.xml is copied into that directory as the workspace.xml file.

In the following sections, the changes to be made to the CRX configuration are described generically and can be applied to the repository-template.xml file, the repository.xml file or any one of the workspace.xml files.

However, changes made to these files have differing effects depending on what stage of the instance's lifecycle you are in. In general you should keep the following in mind:

  • When a change is made, the repository must be restarted to apply that change.
  • A change to an existing workspace.xml file will apply only to that particular workspace. Note that changing to an entirely different persistence manager (PM), or making another major alteration such as changing the location of the stored data, will result in the current content being orphaned and made inaccessible from within CRX (the actual data files or database storage will still be intact, but the workspace will not be able to access them). Minor parameter changes can be made without orphaning content.
  • A change to the <VersionStorage> section of the repository.xml will apply to the version storage only. As with workspace storage, the results of changing these parameters depend upon the specific PM in use, and the magnitude of the changes.
  • Changing the storage location in the <DataStore> section of the repository.xml will orphan the existing data store.
  • Similarly, changing the shared path parameter in the <Journal> section will break clustering (See Clustering).
  • A change to the <Workspace> section of the repository.xml will not apply to any already existing workspaces (since they are configured in their respective workspace.xml files). It will, however, apply to any subsequently created workspaces.
  • In general, changes to persistence managers or data stores should be made before installation of an instance, by altering the repository-template.xml. However, in specialized cases involving migration from one persistence manager to another, changes to repository.xml or workspace.xml may be required (see Migration).

Changing repository-template.xml

To make changes to repository-template.xml, perform the following steps:

  1. Ensure that the CRX instance is shut down.
  2. Extract the file WEB-INF/repository-template.xml from crx-quickstart/server/webapps/crx-explorer_crx.war with:
    jar xvf crx-explorer_crx.war WEB-INF/repository-template.xml
  3. Make the desired changes to the xml file.
  4. After saving your changes, place the altered xml file into the war file with:
    jar uvf crx-explorer_crx.war WEB-INF/repository-template.xml
  5. To apply the configuration changes, restart CRX from the command line using the start script found in crx-quickstart/server/.
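
Before editing, it can be useful to inspect the packed template without extracting it. The following is a minimal sketch, not part of CRX: it relies only on the fact that a war file is a zip archive, and the function name read_repository_template and the UTF-8 decoding are assumptions made for illustration.

```python
import zipfile

# Path of the template inside the war file, as given in the steps above.
TEMPLATE_ENTRY = "WEB-INF/repository-template.xml"


def read_repository_template(war_path):
    """Return the text of repository-template.xml stored inside the war file.

    The archive is opened read-only, so the war file itself is not modified.
    Decoding as UTF-8 is an assumption of this sketch.
    """
    with zipfile.ZipFile(war_path) as war:
        return war.read(TEMPLATE_ENTRY).decode("utf-8")
```

For example, read_repository_template("crx-quickstart/server/webapps/crx-explorer_crx.war") returns the XML text of the template. Writing the altered file back is still best done with the jar command shown in step 4.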

Changing repository.xml

To make changes to repository.xml, perform the following steps:

  1. Ensure that the CRX instance is shut down.
  2. Make the desired changes to the crx-quickstart/repository/repository.xml.
  3. After saving your changes, restart CRX from the command line using the start script found in crx-quickstart/server/.

Changing workspace.xml

To make changes to workspace.xml, perform the following steps:

  1. Ensure that the CRX instance is shut down.
  2. Make the desired changes to the crx-quickstart/repository/workspaces/<workspace-name>/workspace.xml.
  3. After saving your changes, restart CRX from the command line using the start script found in crx-quickstart/server/.

Mixing and Matching Storage Mechanisms

In theory, each element in the CRX storage system can be independently configured to store its data in a file system or in one of a number of supported databases. In practice, mixing and matching storage techniques is rarely done. In most cases the default configuration is acceptable: the Tar Persistence Manager for all workspaces and the version storage, the File Data Store, and the File Journal. In cases where a customer has a reason to prefer database storage, all three mechanisms are usually configured to write to the same database (see Configuring Database PMs, Configuring the Data Store and Configuring the Journal).

Persistence Managers

Each workspace in a repository has its own persistent store which holds all the content of that workspace except for large binaries (these reside in the data store). In addition, the version storage for the entire repository (common to all workspaces) also has a dedicated persistent store.

The implementation of a persistent store depends on the particular persistence manager configured for that workspace or version storage. The default persistence manager is the Tar Persistence Manager (TarPM).

Note

The persistence managers documented in this section are those that ship with CRX. In addition, CRX (2.0 and later) also supports the use of Apache Jackrabbit persistence managers, including custom-made ones.

Prior to CRX 1.4, persistence manager class names started with com.day.crx.*. Beginning with CRX 1.4, all persistence manager class names start with org.apache.jackrabbit.*, with the exception of the Tar persistence manager, which remains com.day.crx.persistence.tar.TarPersistenceManager.

Configuration Syntax

Persistence manager configuration is done in the <PersistenceManager> element. Depending on the scope of effect desired (see above), this element may be configured in the following locations:

To change the default used if no repository is present:

crx-quickstart/server/webapps/crx-explorer_crx.war!/WEB-INF/repository-template.xml
  <Repository>
    <Workspace>
      <PersistenceManager>

To change the default for all future workspaces in an already installed CRX instance:

crx-quickstart/repository/repository.xml
  <Repository>
    <Workspace>
      <PersistenceManager>

To change the configuration of a specific, already created workspace in an already installed CRX instance:

crx-quickstart/repository/workspaces/<workspace-name>/workspace.xml
  <Workspace>
    <PersistenceManager>

The general pattern of the persistence manager configuration in the repository-template.xml, repository.xml and workspace.xml is:

<PersistenceManager class="myPersistenceManager">
  <param name="parameterOne" value="valueOne"/>
  <param name="parameterTwo" value="valueTwo"/>
  <param name="parameterThree" value="valueThree"/>
</PersistenceManager>
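
To check which class and parameters a configuration file actually declares, the pattern above can be read with a standard XML parser. This is a minimal sketch using Python's standard library; the function name parse_persistence_manager is hypothetical and the sample text simply repeats the generic pattern.

```python
import xml.etree.ElementTree as ET


def parse_persistence_manager(xml_text):
    """Return (class name, {parameter name: value}) for a <PersistenceManager> element."""
    element = ET.fromstring(xml_text)
    if element.tag != "PersistenceManager":
        raise ValueError("expected a <PersistenceManager> element")
    # Each <param> child contributes one name/value pair.
    params = {p.get("name"): p.get("value") for p in element.findall("param")}
    return element.get("class"), params


EXAMPLE = """
<PersistenceManager class="myPersistenceManager">
  <param name="parameterOne" value="valueOne"/>
  <param name="parameterTwo" value="valueTwo"/>
</PersistenceManager>
"""

pm_class, pm_params = parse_persistence_manager(EXAMPLE)
```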

Details of the parameters applicable to a particular persistence manager can be found in the sections that follow. For persistence managers not covered in this section, consult the Javadocs for that persistence manager class.

Tar Persistence Manager

The default persistence manager used by CRX for all workspaces and the version store is the Tar Persistence Manager (TarPM). This persistence manager stores data in the file system in standard Unix-style tar files.

TarPM Improvements

The following improvements have been made to the TarPM in CRX since it was introduced in CRX 1.3:

  1. Better scalability.
  2. More configuration options for fine-tuning.
  3. The integrity of the data is better protected by using checksums in the data and index.
  4. Improved recovery after power failures.
  5. Clustering is supported.

Caution

The TarPM in CRX 1.4 and later is not compatible with the original TarPM introduced in CRX 1.3. However, the name "TarPM" and the class name com.day.crx.persistence.tar.TarPersistenceManager have been retained.

TarPM versus a RDBMS-based PM

Both TarPM and database-based PMs support transactions, any file system, and optimization at runtime or in batch mode.

TarPM is a new technology and has the following advantages over using an RDBMS-based PM:

  • Tar files are append-only.
  • Tar files can be backed up easily online.
  • Tar is a standard file format, accessible via known tools, such as tar, WinZip, and so on.
  • Tar is a platform-independent format.
  • Low cost of ownership and license.
  • TarPM is specifically designed for JCR repositories.
  • TarPM is faster than RDBMS-based persistence managers for the JCR use case.
  • The TarPM takes advantage of the very simple key-value pair data structure of CRX.

Note

If you receive a "too many open files" error upon CRX installation, increase the number of open files per process in the operating system (using, for example, the command ulimit). A common configuration is ulimit -n 8192.
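
The limit that applies to a process can also be checked programmatically before starting CRX. The following is a small sketch for Unix-like systems using Python's standard resource module; the helper names and the use of 8192 as the threshold (taken from the note above) are choices made for this example, not a CRX requirement.

```python
import resource  # available on Unix-like systems only


def open_file_limit():
    """Return the (soft, hard) limit on open files for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def limit_is_sufficient(minimum=8192):
    """Report whether the soft limit is unlimited or at least `minimum`."""
    soft, _hard = open_file_limit()
    return soft == resource.RLIM_INFINITY or soft >= minimum
```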

Configuring the TarPM

<PersistenceManager class="com.day.crx.persistence.tar.TarPersistenceManager">
    <param name="maxFileSize" value="256"/>
    <param name="autoOptimizeAt" value="2:00-5:00"/>
    <param name="bindAddress" value=""/>    
    <param name="portList" value=""/>    
    <param name="preferredMaster" value="false"/>
    <param name="lockClass" value="com.day.crx.util.NativeFileLock"/>
    <param name="lockTimeout" value="0"/>
    <param name="fileMode" value="rw"/>
    <param name="optimizeSleep" value="1"/>
    <param name="maxIndexBuffer" value="32"/>
</PersistenceManager>
Parameter Description
maxFileSize
Optional. The default is 64 for CRX 1.4.x and 256 for CRX 2.x. If the current data file grows larger than this number (in megabytes), a new data file is created. Because entries are not split among files, a data file can be much bigger than this value if its last entry is very large. The maximum file size is 1024 (1 GB). Data files are kept open at runtime, so depending on the amount of data stored in the TarPM, either this value needs to be increased or the limit of open files per process needs to be adjusted. If this value is changed when tar files already exist, new tar files will grow up to this size (existing files are not changed).
autoOptimizeAt
Optional. The default is 2:00-5:00. The time at which optimization is run automatically. For example, 2:00 starts optimization every morning at two. The index files are merged as well if required. To disable automatic optimization, set the value to "-0" (which actually means 'stop optimization at midnight').
bindAddress
Set this if synchronization between cluster nodes should be done over a specific network interface. Default: empty (use all interfaces).
portList
The list of ports to use in master mode. By default any free port is used. When using a firewall, open ports must be listed. One port per workspace is required. A list of ports or ranges is supported, for example: 9100-9110 or 9100-9110,9210-9220. Default: 0 (any port).
preferredMaster
Only applicable in a clustering environment. If enabled, this cluster node will try to become the master even if another cluster node was started before. Default: false (not enabled).
lockClass
The name of the class to use for locking. The supported classes are com.day.crx.util.NativeFileLock and com.day.crx.util.CooperativeFileLock. When using a file system that does not support file locking (for example, some older versions of NFS), the cooperative locking class should be used. Default: com.day.crx.util.NativeFileLock.
lockTimeout
When clustering is used, the maximum time (in milliseconds) to wait to lock the shared files. Default: 0 for no limit.
fileMode
The mode in which the data files are opened. Options are "rw" (read-write), "r" (read-only), "rwd" (read-write, content is written synchronously), and "rws" (read-write, content and metadata changes are written synchronously). Optionally, a + can be appended to call fsync after writing (however, this will slow down writes a lot). Default: "rw" for read-write.
optimizeSleep
The number of milliseconds to wait after optimizing a transaction. Floating point precision is supported. This setting is optional; the default is 1.
maxIndexBuffer
After an abnormal termination, at most this much data (in megabytes) needs to be scanned in order to re-create the tar entry index. This setting is optional; the default is 32.
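
Because portList needs one port per workspace when ports are restricted, it can help to expand a value and count the ports before opening them in a firewall. The sketch below is illustrative only: the function names are hypothetical, and treating both an empty value and 0 as "any free port" is an assumption based on the description above.

```python
def parse_port_list(value):
    """Expand a portList value such as "9100-9110,9210-9220" into a list of ports.

    An empty value or "0" is treated as "any free port" and yields an empty list
    (an assumption of this sketch).
    """
    value = value.strip()
    if value in ("", "0"):
        return []
    ports = []
    for part in value.split(","):
        part = part.strip()
        if "-" in part:
            low, high = (int(x) for x in part.split("-"))
            if low > high:
                raise ValueError("invalid port range: " + part)
            ports.extend(range(low, high + 1))  # ranges are inclusive
        else:
            ports.append(int(part))
    return ports


def enough_ports(value, workspace_count):
    """Check that an explicit port list offers at least one port per workspace."""
    ports = parse_port_list(value)
    return not ports or len(ports) >= workspace_count
```

For example, the value 9100-9110 expands to eleven ports, which under this reading is enough for up to eleven workspaces.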

Note

Do not manually change data in the tar files as most data is stored in a binary format.

Note

If you change the configuration after a workspace has already been created, you need to change both the repository.xml and workspace.xml files. This ensures that both the existing workspace and any future ones will have the new settings.

CRX Default Data Storage File Structure

crx-quickstart/                     Created automatically when the self-extracting
|                                   crx-quickstart.jar is run. Contains the entire
|                                   installation.
|
|--repository/                      The repository.
   |
   |--workspaces/                   Content of all workspaces.
   |  |
   |  |--<workspace-name>/          Each workspace has its own directory.
   |     |
   |     |--blobs/                  Deprecated storage location for large binaries.
   |     |                          These are now stored by the Data Store in
   |     |                          shared/repository/datastore/
   |     |
   |     |--copy/                   Local copy of the central persistent storage
   |     |                          (found in shared/workspaces/<workspace-name>)
   |     |                          for this workspace. Contains tar files created
   |     |                          by the TarPM.
   |     |
   |     |--index/                  Workspace search index. Each cluster instance
   |     |                          maintains an index for each of its workspaces.
   |     |
   |     |--workspace.xml           Configuration file for this workspace.
   |     |
   |     |--locks                   Records the locks currently held in this
   |                                workspace.
   |
   |--version/                      Version storage.
   |  |
   |  |--copy/                      Local copy of the central version persistent
   |                                storage (found in shared/version).
   |                                Contains tar files created by the TarPM.
   |
   |--shared/                       The central persistent storage for the cluster.
      |                             In a cluster of size greater than 1 this
      |                             directory is accessible from all instances, but
      |                             each also maintains a copy in its local copy/
      |                             directory. In a standalone instance the central
      |                             storage is still in this shared directory, but
      |                             this directory happens to be on the same
      |                             machine as the installation.
      |
      |--workspaces/                Shared content of all workspaces.
      |  |
      |  |--<workspace-name>/       Each workspace has its own directory. Contains
      |                             the tar files created by the TarPM.
      |
      |--repository/
      |  |
      |  |--datastore/              The data store for large binaries.
      |
      |--journal/                   The journal records all changes to the
      |                             repository content. Used to synchronize cluster
      |                             instances.
      |
      |--version/                   Version storage. Contains tar files holding
      |                             versioning data created by the TarPM.
      |
      |--namespaces/                Namespace registry.
      |
      |--nodetypes/                 Nodetype registry.

Optimizing Tar Files

As data is never overwritten in a tar file, the disk usage increases even when only updating existing data. When optimizing, the TarPM copies data that is still used from old tar files into new tar files and deletes the old tar files that contain only old or redundant data. If there is only one file, optimization will have no effect.
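
The effect of optimization on an append-only store can be illustrated with a simplified in-memory model. This is a sketch of the general idea only, not the TarPM implementation: each "file" is modeled as a list of (key, value) entries, the most recent entry for a key is the one still in use, and the last file is assumed to be the one currently being written and is left untouched.

```python
def latest_positions(files):
    """Map each key to the (file index, entry index) of its most recent entry."""
    latest = {}
    for file_index, entries in enumerate(files):
        for entry_index, (key, _value) in enumerate(entries):
            latest[key] = (file_index, entry_index)
    return latest


def optimize(files):
    """Copy entries that are still in use out of the old files, then drop the old files.

    With a single file there is nothing to do, matching the note above.
    """
    if len(files) <= 1:
        return files
    latest = latest_positions(files)
    old_files, current = files[:-1], files[-1]
    survivors = [
        (key, value)
        for file_index, entries in enumerate(old_files)
        for entry_index, (key, value) in enumerate(entries)
        if latest[key] == (file_index, entry_index)
    ]
    # The old files are replaced by one new file holding only live entries.
    return ([survivors] if survivors else []) + [current]


files = [[("a", 1), ("b", 1)], [("a", 2)], [("b", 2), ("c", 1)]]
optimized = optimize(files)
```

In this model the three files shrink to two: the outdated entries for a and b in the first file are discarded, and only the current value of a is copied forward.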

Note

If you are optimizing tar files in a cluster, you need to ensure that the Tar optimization times are set to the same value on all cluster nodes. For example, <param name="autoOptimizeAt" value="1:00-4:00"/>

Manually optimizing tar files using the CRX Console

To optimize tar files using the CRX console:

  1. In the CRX Console, log in as administrator.

  2. Click Repository Configuration.

  3. Select Tar Persistence Manager Optimization and click Start Optimization.

  4. To stop optimization, click Stop Optimization.

Automatically optimizing tar files

By default optimization is automatically run each night between 2 am and 5 am. See the option autoOptimizeAt in the TarPM configuration. We recommend that you optimize the tar file when the current system usage is low.
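
When choosing a window for autoOptimizeAt, it can help to check which times of day a value covers. The sketch below handles only the start-end form (for example 2:00-5:00); how TarPM itself parses the value, including the special "-0" setting, is not modeled here, and the function names are hypothetical.

```python
from datetime import time


def parse_window(value):
    """Split a value such as "2:00-5:00" into (start, end) times of day."""
    start_text, end_text = value.split("-")

    def to_time(text):
        hours, minutes = text.split(":")
        return time(int(hours), int(minutes))

    return to_time(start_text), to_time(end_text)


def in_window(now, value):
    """Report whether the time of day `now` falls inside the window.

    Windows that pass midnight (for example 22:00-1:00) are supported.
    """
    start, end = parse_window(value)
    if start <= end:
        return start <= now < end
    return now >= start or now < end
```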

Manually optimizing tar files at runtime

You can start optimizing the tar file manually at runtime by placing a specially named file optimize.tar in the folder where the tar files are. This file can be empty.

When optimization starts, this file is automatically renamed to optimizeNow.tar. If you need to stop optimization, you can do so by deleting this file. The file is automatically deleted when the optimization run ends.
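
The marker-file mechanism described above can be scripted. The following is a minimal sketch in Python; the helper names are hypothetical, and the directory passed in is whichever folder holds the tar files for the workspace or version store in question.

```python
from pathlib import Path


def request_optimization(tar_dir):
    """Create an empty optimize.tar in the tar file folder to request optimization."""
    marker = Path(tar_dir) / "optimize.tar"
    marker.touch()
    return marker


def stop_optimization(tar_dir):
    """Delete optimizeNow.tar to stop a running optimization.

    Returns True if the file existed and was removed, False otherwise.
    """
    running = Path(tar_dir) / "optimizeNow.tar"
    if running.exists():
        running.unlink()
        return True
    return False
```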

Manually Merging Tar Index Files

If many entries are stored in the tar files, the number of index files may grow. The index files are automatically merged before and after the scheduled Tar PM optimization. To reduce the number of index files at other times, you can merge these index files using the CRX Console. You can merge tar index files while the repository is running.

Manually merging tar index files using the CRX console

To merge tar index files using the CRX console:

  1. In the CRX console, click Repository Configuration.

  2. Select Tar Persistence Manager Index Merge.

  3. Click Start Index Merge. CRX indicates when it completes the merge.

Note

Fragmented tar index files may have a negative impact on performance. In case of performance problems, merging the tar index files is recommended.

Consistency Checking and Fixing

The Tar PM can check repository consistency and fix consistency problems at startup.

Caution

When running a consistency fix in a clustered environment, only run it on one cluster node or the consistency check/fix will not update the cache on the other cluster nodes. Do not start other cluster nodes while the consistency fix is running. After the consistency fix is finished, the other cluster nodes can be started.

To enable consistency checking and automatically fix problems, set the following options in the <PersistenceManager> section of repository.xml and workspace.xml, restart CRX, and monitor crx/error.log for any relevant messages:

<param name="consistencyCheck" value="true"/>
<param name="consistencyFix" value="true"/>
To fix consistency problems, the consistency check setting must be enabled as well. After the consistency check has finished, disable the relevant settings, otherwise the consistency check always runs when starting up CRX.
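
Because the two options have to be switched on together and switched off again afterwards, the change can be applied to a parsed <PersistenceManager> element with a small helper. This is a sketch using Python's standard library; the function names are hypothetical, and writing the value false to disable the options (rather than removing the parameters) is an assumption.

```python
import xml.etree.ElementTree as ET


def set_param(pm, name, value):
    """Set a <param> on the element, updating it if present and adding it otherwise."""
    for param in pm.findall("param"):
        if param.get("name") == name:
            param.set("value", value)
            return
    ET.SubElement(pm, "param", {"name": name, "value": value})


def set_consistency_options(pm, enabled):
    """Switch consistencyCheck and consistencyFix on or off together."""
    flag = "true" if enabled else "false"
    set_param(pm, "consistencyCheck", flag)
    set_param(pm, "consistencyFix", flag)


pm = ET.fromstring(
    '<PersistenceManager class="com.day.crx.persistence.tar.TarPersistenceManager"/>'
)
set_consistency_options(pm, True)
```

After the check has run, calling set_consistency_options(pm, False) and saving the file prevents the check from running again at the next startup.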

Migrating from a Regular to a Clustered or a Clustered to a Regular Environment

You can use TarPM in a regular or clustered environment. For information on setting up TarPM in a clustered environment, see CRX Clustering.

The easiest way to migrate from a regular environment to a clustered environment or from a clustered environment to a regular environment is to export the data, change the configuration, and then import the data.

The DB2 PersistenceManager

The DB2 persistence manager stores the workspace content in an IBM DB2 database. CRX does not have a direct replacement for the DB2 PersistenceManager, so it can still be used. As an alternative, use org.apache.jackrabbit.core.persistence.bundle.BundleDbPersistenceManager with the schema parameter set to db2, as in:

<PersistenceManager
class="org.apache.jackrabbit.core.persistence.bundle.BundleDbPersistenceManager">
   <param name="schema" value="db2"/>
   <param name="driver" value="COM.ibm.db2.jdbc.net.DB2Driver"/>
   <param name="url" value="jdbc:db2://localhost/crx"/>
   <param name="user" value="root"/>
   <param name="password" value="crx"/>
</PersistenceManager>

The Oracle 9 Persistence Manager

The Oracle 9 persistence manager stores the repository data in an Oracle 9 database.

Note

Oracle 9i allows only 30 characters for table names. Therefore, in the Oracle persistence manager configuration, schemaObjectPrefix must be chosen so that the resulting table names are a minimum of seven and a maximum of 30 characters long. For example: schemaObjectPrefix: crxop -> table name CRXOP_DEFAULT_CRXOPBINVAL.

The Oracle 9 persistence manager is configured as follows:
<PersistenceManager
class="org.apache.jackrabbit.core.persistence.bundle.Oracle9PersistenceManager">
    <param name="driver" value="oracle.jdbc.OracleDriver"/>
    <param name="url" value="jdbc:oracle:thin:@localhost:1521:crx"/>
    <param name="user" value="root"/>
    <param name="password" value="crx"/>
</PersistenceManager>

Note

In the PersistenceManager class name, org.apache.jackrabbit.core.persistence.bundle.Oracle9PersistenceManager, note the number nine (9) between Oracle and PersistenceManager.

driver (optional) If not set, default setting is oracle.jdbc.OracleDriver.
url The URL and database of your Oracle 9 server. The above entry is for an Oracle 9 database that runs on port 1521 on your computer. The database that stores the content of the workspace is named crx.
user The user name with which you can connect to the database. The user needs full access on the database.
password The password for the user name.

The Oracle Persistence Manager

The Oracle persistence manager stores the repository data in an Oracle database. It supports Oracle databases in version 10 or higher.
The Oracle persistence manager is configured as follows:
<PersistenceManager
class="org.apache.jackrabbit.core.persistence.bundle.OraclePersistenceManager">
    <param name="driver" value="oracle.jdbc.OracleDriver"/>
    <param name="url" value="jdbc:oracle:thin:@localhost:1521:crx"/>
    <param name="user" value="root"/>
    <param name="password" value="crx"/>
</PersistenceManager>
driver (optional) If not set, default setting is oracle.jdbc.OracleDriver.
url The URL and database of your Oracle server. The above entry is for an Oracle database that runs on port 1521 on your computer. The database that stores the content of the workspace is named crx.
user The user name with which you can connect to the database. The user needs full access on the database.
password The password for the user name.

The Microsoft SQL Server Persistence Manager

The Microsoft SQL Server persistence manager stores the repository data in a Microsoft SQL Server database.
The Microsoft SQL Server persistence manager is configured as follows:
<PersistenceManager
class="org.apache.jackrabbit.core.persistence.bundle.MSSqlPersistenceManager">
    <param name="url" value="jdbc:sqlserver://localhost;database=test"/>
    <param name="user" value="root"/>
    <param name="password" value="crx"/>
</PersistenceManager>
driver (optional) If not set, default setting is com.microsoft.sqlserver.jdbc.SQLServerDriver.
url The URL and database of your Microsoft SQL Server. The previous entry is for a Microsoft SQL Server database that runs on your computer. The database that stores the content of the workspace is named test.
schema Database schema, in which content is stored.
user The user name with which you can connect to the database. The user needs full access on the database.
password The password for the user name.

The MySQL Persistence Manager

The MySQL Persistence Manager stores the workspace content in a MySQL database. You need to install MySQL separately and provide an empty database where CRX can store the repository. Initially, the account through which CRX accesses the database must have sufficient privileges to create tables. If desired, these privileges can be revoked after the initial startup of the CRX instance.

Note

In the MySQL configuration file (my.ini or my.cnf), set the parameter max_allowed_packet to a value of at least 256M. Failure to make this change will result in a failed installation.

The MySQL persistence manager is configured as follows:
<PersistenceManager
class="org.apache.jackrabbit.core.persistence.bundle.MySqlPersistenceManager">
    <param name="driver" value="com.mysql.jdbc.Driver"/>
    <param name="url" value="jdbc:mysql://localhost:3306/crx"/>
    <param name="user" value="root"/>
    <param name="password" value="crx"/>
</PersistenceManager>
driver (optional) If not set, default setting is com.mysql.jdbc.Driver.
url The URL and database of your MySQL server. The above entry is for a MySQL database that runs on port 3306 on your computer. The database that stores the content of the workspace is named crx.
user The user name with which you can connect to the database. The user needs full access on the database.
password The password for the user name.

Note

Using a database as the persistence manager does not expose the system to SQL Injection flaws. Any persistence manager (Tar or RDBMS PM) can be used as CRX accesses data through the JCR interface where the access control is handled on the repository itself. Every call is also authenticated and authorized before data can be accessed. It is unlike a relational database where the access control is not handled on the data layer. Therefore, there is no vulnerability for SQL Injection in CRX's repository.

Using MySQL to Store the Journal

When using MySQL to store the journal, you need to add the parameter databaseType with a value of mysql in the journal configuration (located in repository-template.xml). See the following code snippet:

<Journal class="org.apache.jackrabbit.core.journal.DatabaseJournal">
<param name="revision" value="${rep.home}/revision.log"/>
<param name="driver" value="com.mysql.jdbc.Driver"/>
<param name="url" value="jdbc:mysql://192.168.180.67:3306/SHINE_WCMS_5_3"/>
<param name="user" value="root"/>
<param name="password" value=""/>
<param name="databaseType" value="mysql"/>
</Journal>

Upgrading CQ/CRX on a MySQL Persistence Manager in a Cluster

If you are upgrading CQ or CRX on a MySQL persistence manager in a cluster, clone the instance first and then clean up the files related to the cluster node ID so that CRX regenerates an ID for the cloned instance.

See the Clustering documentation for more information.

The Native Persistence Manager

Note

The Native persistence manager is deprecated. It has been replaced as the default PM by the Tar persistence manager.

The native persistence manager stores all repository data in its own, bundled database. The database files are stored in the repository folder in the CRX installation folder.

Note

The native persistence manager uses the HSQLDB database. There is no persistence manager that works with HSQLDB in Jackrabbit. Instead, migrate data to the Tar persistence manager, or another database persistence manager, for example, org.apache.jackrabbit.core.persistence.bundle.H2PersistenceManager.

To use the native persistence manager, use the following code for the <PersistenceManager> section:
<PersistenceManager class="com.day.crx.persistence.NativePersistenceManager"/>

Note

The native persistence manager supports up to 8 GB of repository content, not including large files, which are stored separately. This is enough for most applications, and can support roughly 8 million nodes (depending on their size and the repository structure).

Data Store

The data store holds large binaries. On write, these are streamed directly to the data store and only an identifier referencing the binary is written to the PM store. By providing this level of indirection, the data store ensures that large binaries are only stored once, even if they appear in multiple locations within the content in the PM store. In effect, the data store is an implementation detail of the PM store. Like the PM, the data store can be configured to store its data in a file system (the default) or in a database.

CRX 2.0 and higher uses the Data Store instead of the blob store to store large binaries. The Data Store is configured via the DataStore tag in the repository.xml configuration file, and is enabled by default.

    <DataStore class="org.apache.jackrabbit.core.data.FileDataStore">
        <param name="minRecordLength" value="4096"/>
    </DataStore>
minRecordLength The minimum object length. The default is 100 bytes; smaller objects are stored inline (not in the data store). The maximum value is 32000 because Java does not support strings longer than 64 KB in writeUTF.
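
The combination of inline storage for small objects and single-copy storage for large ones can be sketched as a content-addressed store. This is an illustrative in-memory model, not the Jackrabbit implementation: the class name SketchDataStore, the use of SHA-256 to derive the identifier, and the strict "shorter than minRecordLength" comparison are all assumptions of the example.

```python
import hashlib


class SketchDataStore:
    """In-memory model: small values stay inline, large values are stored once."""

    def __init__(self, min_record_length=100):
        self.min_record_length = min_record_length
        self.records = {}  # identifier -> binary content

    def add(self, data):
        """Return what the persistence manager would keep for this binary."""
        if len(data) < self.min_record_length:
            return ("inline", data)
        # The identifier depends only on the content, so identical binaries
        # map to the same record and are stored a single time.
        identifier = hashlib.sha256(data).hexdigest()
        self.records.setdefault(identifier, data)
        return ("ref", identifier)

    def get(self, value):
        """Resolve an inline value or an identifier back to the binary."""
        kind, payload = value
        return payload if kind == "inline" else self.records[payload]
```

Adding the same large binary twice returns the same identifier and leaves a single record in the store, which is the space-saving behavior described above.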

Note

The following section is derived from the Apache Open Source project, see http://wiki.apache.org/jackrabbit/DataStore.

Overview

The main features of the data store are as follows:

  • Space saving: Only one copy per unique object is kept.
  • Fast copy: Only the identifier is copied.
  • Storing and reading does not block others.
  • Objects in the data store are immutable.
  • Garbage collection is used to purge unused objects.
  • Hot backup is supported.

Advantages of Data Store

The main advantages of the data store over the blob store are as follows:

  • Unlike the blob store, the data store keeps only one copy per object, even if it is used multiple times.
  • The data store detects if the same object is already stored and only stores a link to the existing object.
  • The data store can be shared across multiple workspaces, and even across multiple repositories, if required.
  • Data store operations (read and write) do not block other users because they are performed outside the persistence manager.
  • Multiple data store operations can be performed at the same time.

File Data Store

To use the file-based Data Store for clustering, the Data Store path must be set to a shared directory used by all cluster instances. When a cluster is configured through the GUI (See CRX Clustering) this parameter is automatically configured properly.

    <DataStore class="org.apache.jackrabbit.core.data.FileDataStore">
        <param name="path" value="<sharedDirectory>/datastore"/>
        <param name="minRecordLength" value="4096"/>
    </DataStore>
path The path to a shared directory used by all cluster nodes.
minRecordLength The maximum value for minRecordLength is approximately 32000.

Database Data Store

Instead of the file-based data store, a database can be used. It is configured as follows:

<DataStore class="org.apache.jackrabbit.core.data.db.DbDataStore">
<param name="url" value="jdbc:postgresql:test"/>
<param name="user" value="sa"/>
<param name="password" value="sa"/>
<param name="databaseType" value="postgresql"/>
<param name="driver" value="org.postgresql.Driver"/>
<param name="minRecordLength" value="1024"/>
<param name="maxConnections" value="3"/>
<param name="copyWhenReading" value="true"/>
<param name="tablePrefix" value=""/>
<param name="schemaObjectPrefix" value=""/>
</DataStore>


url
The database URL (required).
user
The database user name (required).
password
The database password (required).
databaseType

The database type. By default the sub-protocol of the JDBC database URL is used if it is not set.

It must match the resource file <databaseType>.properties, for example, mysql.properties.

Currently supported are: db2, derby, h2, mssql, mysql, oracle, sqlserver.

driver
The JDBC driver class name. By default the default driver of the configured database type is used.
minRecordLength
The minimum record length. The default is 1024.
maxConnections
The maximum number of concurrent connections in the pool. At least 3 connections are required if the garbage collection process is used.
copyWhenReading
The copy setting, enabled by default. If enabled, a stream is always copied to a temporary file when it is read, so that reads can be concurrent. If disabled, reads are serialized.
tablePrefix
The table name prefix. The default is empty. Can be used to select a non-default schema or catalog. The table name is constructed like this: ${tablePrefix}${schemaObjectPrefix}${tableName}. Before CRX 2.0, this setting was case sensitive (it had to be lowercase for PostgreSQL and MySQL, and uppercase for other databases). For CRX 2.0 and later, this setting is no longer case sensitive.
schemaObjectPrefix
The schema object prefix. The default is empty. Before CRX 2.0, this setting was case sensitive (it had to be lowercase for PostgreSQL and MySQL, and uppercase for other databases). For CRX 2.0 and later, this setting is no longer case sensitive.
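
The table name rule given for tablePrefix is a plain concatenation, which the following sketch spells out. The helper full_table_name and the sample values (myschema., crx_ and the table name DATASTORE) are hypothetical and chosen only for illustration; check the actual table name used by your installation.

```shell
#!/bin/sh
# Illustrative only: mirrors ${tablePrefix}${schemaObjectPrefix}${tableName}.
# All values passed below are hypothetical examples.

# full_table_name TABLE_PREFIX SCHEMA_OBJECT_PREFIX TABLE_NAME
# Prints the three parts joined without any separator.
full_table_name() {
    printf '%s%s%s\n' "$1" "$2" "$3"
}

# A schema-qualifying prefix has to carry its own trailing dot.
full_table_name "myschema." "crx_" "DATASTORE"
```

Because nothing is inserted between the parts, a tablePrefix that selects a schema or catalog must end with the separator the database expects.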

Note

Although the Data Store is enabled by default, the blob store is still available for backward compatibility. The blob store was used in earlier versions of CRX with DB persistence to store binaries directly in the database as a part of the persistence manager storage.

Blob parameters are still supported, but they are not relevant for new installations: when the Data Store is used, all new large entries are stored there instead of the blob store.

Note

MySQL does not support streaming very large binaries from client to server (writing). This may cause problems when using the DbDataStore with MySQL. See:

Running Garbage Collection

Use garbage collection to remove any unused files in the Data Store.  

To run garbage collection:

  1. In the CRX console, click Repository Configuration.

  2. Click Data Store Garbage Collection.

  3. Select one or more of the following options:

    Option Description

    Run memory garbage collection first

    To run the garbage collection of the main memory first (also known as heap garbage collection). This process evicts objects that are still in the main memory, but that are no longer referenced. The data store garbage collection only reclaims items that are no longer in the main memory.

    Delete unused items

    Selecting this option means that any unused files are deleted from the Data Store. If this option is disabled, only the last modified date of the used items is updated, but no files are deleted. If multiple repositories share the same data store, this option should not be enabled; instead, old items should be removed manually or by using a script (for example, by deleting files older than one week).

    Use a persistence manager scan

    When this option is enabled, the process uses a low-level persistence manager scan if the persistence manager supports this option. Selecting this option speeds up the garbage collection process but may slow down concurrent operations. If the option is disabled, a higher level node traversal algorithm is used.

    Delay after scanning 10 nodes

    Enter the time in milliseconds that you want garbage collection to wait after scanning 10 nodes. The default and recommended value is 10 milliseconds. Adding a delay means that garbage collection may take slightly longer but there is less impact on production server performance.

  4. Click Run. CRX runs the garbage collection and indicates when it has completed.
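
When several repositories share one data store, the description of Delete unused items suggests removing old items with a script instead. The following is a minimal sketch of such a script, assuming that every repository sharing the store has recently completed a garbage collection run with Delete unused items disabled, so that all items still in use carry a fresh last modified date. The function names and the default of 7 days are assumptions for illustration; the data store directory is passed in by the caller.

```shell
#!/bin/sh
# Illustrative only: age-based cleanup for a shared file data store.
# Run it only after ALL repositories sharing the store have refreshed the
# timestamps of their used items; otherwise live data may be removed.

# list_stale_objects DATASTORE_DIR [DAYS]
# Prints files whose last modification lies more than DAYS full days in the
# past (default 7). Nothing is deleted.
list_stale_objects() {
    find "$1" -type f -mtime +"${2:-7}" -print
}

# delete_stale_objects DATASTORE_DIR [DAYS]
# Deletes the same set of files that list_stale_objects would print.
delete_stale_objects() {
    find "$1" -type f -mtime +"${2:-7}" -exec rm -f {} +
}
```

Running list_stale_objects first and reviewing its output is a cautious way to confirm that only unreferenced items would be removed before calling delete_stale_objects.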


Automating Garbage Collection

If possible, run garbage collection when there is little load on the system, for example in the early morning. By default, the Tar PM optimization runs between 2 am and 5 am and also slows down the system, so 5 am is a good time to run garbage collection, provided that no backups are running at that time.

Garbage collection can be automated using the wget or curl HTTP clients. The following is an example of how to automate garbage collection by using curl:

Caution

In the following example curl commands, various parameters might need to be adjusted for your instance; for example, the hostname (localhost), port (7402), admin password (xyz) and the parameters for the actual garbage collection.

  1. Log in to CRX:

    curl -c login.txt "http://localhost:7402/crx/login.jsp?UserId=admin&Password=xyz&Workspace=crx.default"
    
  2. Run the garbage collection; for example:

    curl -b login.txt -f -o progress.txt "http://localhost:7402/crx/config/datastore_gc.jsp?memGc=checked&delete=checked&pmScan=checked&sleep=10&action=run"

    The curl command returns when the garbage collection is completed on the server.

    The example above selects all options by passing them as parameters to the command (omit any that are not required):

    Parameter (Option) Description

    memGc=checked

    (Run memory garbage collection first)

    To run the garbage collection of the main memory first (also known as heap garbage collection). This process evicts objects that are still in the main memory, but that are no longer referenced. The data store garbage collection only reclaims items that are no longer in the main memory.

    delete=checked

    (Delete unused items)

    Selecting this option means that any unused files are deleted from the Data Store. If this option is disabled, only the last modified date of the used items is updated, but no files are deleted. If multiple repositories share the same data store, this option should not be enabled; instead, old items should be removed manually or by using a script (for example, by deleting files older than one week).

    pmScan=checked

    (Use a persistence manager scan)

    When this option is enabled, the process uses a low-level persistence manager scan if the persistence manager supports this option. Selecting this option speeds up the garbage collection process but may slow down concurrent operations. If the option is disabled, a higher level node traversal algorithm is used.

    sleep=10

    (Delay after scanning 10 nodes)

    Enter the time in milliseconds that you want garbage collection to wait after scanning 10 nodes. The default and recommended value is 10 milliseconds. Adding a delay means that garbage collection may take slightly longer but there is less impact on production server performance.
  3. As the -o parameter has been specified in the curl command, the file:

        progress.txt

    will be created in your current directory. View this to check status information and statistics related to the garbage collection, for example:

      www.day.com  
    Content Repository Extreme 2.1
    JSR-283 Compliant Repository
     
    Main ConsoleContent LoaderContent ZipperRepository Configuration
    UserID: admin | Workspace: crx.default | Log Out | Switch Workspace | Impersonate
    Data Store Garbage Collection
    Memory GC finished.
    Directory: /CRX/crx-quickstart/repository/shared/repository/datastore
    Using persistence manager scan.
    Delay after scanning 10 node: 10 ms
    Scan started...
    Initial scan completed (333 nodes).
    Scanning node #334
    Final scan completed.
    Deleted 0 unused objects.
    Data store garbage collection successfully finished in 7 second(s); 333 nodes.

      Global Content Management     Copyright © 2010 Day Management AG     www.day.com  
  4. Remove the login cookie:

    rm login.txt

    Remove the progress file (unless required):

    rm progress.txt
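
The four steps above can be combined into one script for scheduled runs. The following is a sketch under stated assumptions rather than a supported tool: the function names, the --run switch, the script path in the crontab comment and the success check (searching the saved page for the "successfully finished" line shown in the sample output) are all hypothetical, and the host, port and credentials are the placeholder values used above.

```shell
#!/bin/sh
# Illustrative wrapper around the curl steps above; adjust every value.

# gc_url HOST PORT SLEEP_MS
# Prints the garbage collection URL with all options enabled.
gc_url() {
    printf 'http://%s:%s/crx/config/datastore_gc.jsp?memGc=checked&delete=checked&pmScan=checked&sleep=%s&action=run\n' "$1" "$2" "$3"
}

# gc_succeeded FILE
# Succeeds if the saved result page reports a successful run.
gc_succeeded() {
    grep -q 'Data store garbage collection successfully finished' "$1"
}

# run_gc HOST PORT USER PASSWORD
# Logs in, runs garbage collection, checks the result, then removes the
# cookie and progress files. Returns 0 only if the run reported success.
run_gc() {
    host=$1 port=$2 user=$3 password=$4
    cookies=$(mktemp) progress=$(mktemp)
    curl -s -c "$cookies" -o /dev/null \
        "http://$host:$port/crx/login.jsp?UserId=$user&Password=$password&Workspace=crx.default" &&
    curl -s -b "$cookies" -f -o "$progress" "$(gc_url "$host" "$port" 10)"
    status=$?
    if [ "$status" -eq 0 ] && gc_succeeded "$progress"; then
        result=0
    else
        result=1
    fi
    rm -f "$cookies" "$progress"
    return "$result"
}

# Example crontab entry (assumed script path) for a daily run at 5 am:
# 0 5 * * * /opt/scripts/crx-datastore-gc.sh --run

# Nothing is contacted unless the script is started with --run.
if [ "${1:-}" = "--run" ]; then
    run_gc localhost 7402 admin xyz
fi
```

Keeping the URL construction and the success check in separate functions makes it easy to drop options that are not required and to adapt the check if your CRX version words the result differently.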

Journal

Whenever CRX writes data it first records the intended change in the journal. Maintaining the journal helps ensure data consistency and helps the system to recover quickly from crashes. As with the PM and data stores, the journal can be stored in a file system (the default) or in a database.

For information on configuring the journal, see Manual Configuration.