In a human body the heart and the brain are the 2 most important organs, if those are not performing well, nothing else is. The Alfresco database and filesystem where the content store resides are the brain and heart of Alfresco. Those are the 2 most important layers of every Alfresco architecture.
Get to know the Alfresco Database throughput
If your project will have lots of concurrent users and operations or the number/estimate number of documents is very big (> 1M)) you need to be informed about your database throughput.
Most common throughput factor of the database is transactions per second.
DB performance in a transactional system are usually the underlying database files and the log file. Both are factors because they require disk I/O, which is slow relative to other system resources such as CPU.
In the worst-case scenario, in big alfresco databases with a big number of documents ):
Database access is truly random and the database is too large for any significant percentage of it to fit into the cache, resulting in a single I/O per requested key/data pair.
Both the database and the log are on a single disk. This means that for each transaction, the AlfrescoDB is potentially performing several filesystem operations:
- Disk seek to database file
- Database file read
- Disk seek to log file
- Log file write
- Flush log file information to disk
- Disk seek to update log file metadata (for example, inode information)
- Log metadata write
- Flush log file metadata to disk
Faster Disks normally can help on such sittuations but there are lots of ways (scale up, scale out) to increase transactional throughput.
In Alfresco the default RDBMS configurations are normally not suitable for large repositories deployments and may result into:
- I/O bottlenecks in the RDBMS throughput
- Excessive queue for transactions due to overload of connections
- On active-active cluster configurations, excessive latency
Alfresco Database treads pool.
Most Java application servers have higher default settings for concurrent access, and this, coupled with other threads in Alfresco (non-HTTP protocol threads, background jobs, etc.) can quickly result in excessive contention for database connections within Alfresco, manifesting as poor performance for users.
If tomcat is being considerer this value is normally 275. The setting is called db.pool.max and should be added to your alfresco-global.properties (db.pool.max=275).
http://docs.oracle.com/cd/E17076_04/html/programmer_reference/transapp_throughput.html
How to calculate the size of your Alfresco database
All operations in Alfresco require a database connection, the database performance plays a crucial role on your Alfresco environment. It’s vital to have the database properly sized and tuned for your specific use case.
To size your alfresco database in terms of space we’ve done a series of tests by creating content (and metadata) on an empty repository and analyzing the database growth.
Be aware that:
- Content is not stored in the database but is directly stored on the disk
- Database size is un-affected by size of the documents or the document’s content
- Database size is affected by the number/type of metadata fields of the document
Hi all, back with another Alfresco related post, this time to show you how to size your alfresco database in terms of space.
The following factors are relevant to calculate the approximate size for an Alfresco database
- Number of meta data fields
- Permissions
- Number of folders
- Number of documents
- Number of versions
- Number of users
I’ve made a series of tests where i could verify how the Alfresco database grows.
I’ve made a bulk import with the following data.
Document creation method | In-place bulk upload |
Number of Documents Ingested | 148 |
Total Size of Documents | 929.14 MB |
Number of metadata fields per document | 13 fields |
Total number of metadata fields | 1924 |
The table below shows the types of documents and its average sizes
Document Type | Extension | Average Size (KB) |
MS Word Document | .doc | 1024 |
Excell Sheet | .xls | 800 |
Pdf document | 10240 | |
PowerPoint presentation | .ppt | 5120 |
Jpg image | .jpg | 2048 |
Checking the diagram below we can see that the database indexes grow more than the data itself. By observing the growth of the database size we’ve concluded that the average metadata field occupation on the Alfresco database is approximately 5.5 K per metadata field
Also interesting is to verify the tables that grow in size(KB) after the content ingestion. Note that we are not applying any permission.
To size your database appropriately you must ask the right questions whose answers will help you to determine the database sizing.
- Estimated number of users in Alfresco
- Estimated number of groups in Alfresco
- Estimated number of documents on the first year
- Documents growth rate
- Average number of versions per document
- Average number of meta-data fields per document
- Estimated number of folders
- Average number of meta-data fields per folder
- Estimated number of concurrent users
- Folder based permissions (inherited to child documents)?
Database sizing formulas
Consider to following figures to determine your approximate database size.
- DV = Average number of document versions
- F = Estimated number of folders
- FA = Estimated number of folder metadata fields (standard + custom)
- D = Number of Documents * DV – Estimated number of documents including the versions
- DA =Estimated number of documents metadata fields (standard + custom)
The number of records on specific alfresco tables is calculated as follows:
- Number of records on alf_node (TN = F + D * DV)
- Number of records records on node_properties (TNP = F * FA + D * DA)
- Number of records records on node_status = (TNS = F + D)
- Number of records records on alf_acl_member= (TP = D) assuming permission will be set at the folder level and inherited
The approx. number of records in the database will be TRDB = TN + TNP + TNS + TP
The following formula is based on the number of database records. On our benchmarks we’ve observed that each database record takes about 4.5k of db space.
Formula #1 Database size = TRDB * 4.5K
Alternatively, we can base our calculations on the number of metadata fields of the documents, considering 5.5k for each metadata field and use the following formula.
Formula #2 Database size = (D * DA + F * FA) * 5.5K
The 2 formulas provided are only approximations on the size that your database will need and are based on benchmarks executed against a vanilla Alfresco version 4.2.2.
If we wish to consider users and groups add consider 2k for each user and 5k for each group.
Note that the formulas are not taking in consideration additional space for logging, rollback, redolog, etc.
Tuning your Alfresco database
In Alfresco the default RDBMS configurations are normally not suitable for large repositories deployments and may result into:
• Wrong or improper support for ACID transaction properties
• I/O bottlenecks in the RDBMS throughput
• Excessive queue for transactions due to overload of connections
• On active-active cluster configurations, excessive latency
Considering that your database layer will be used under concurrent load I’ve come up with a set of hints that will contribute to maximize your Alfresco database performance.
Database Thread pool configuration
A default Alfresco instance is configured to use up to a maximum of forty (40) database connections. Because all operations in Alfresco require a database connection, this places a hard upper limit on the amount of concurrent requests a single Alfresco instance can service (i.e. 40), from all protocols.
Most Java application servers have higher default settings for concurrent access, and this, coupled with other threads in Alfresco (non-HTTP protocol threads, background jobs, etc.) can quickly result in excessive contention for database connections within Alfresco, manifesting as poor performance for users.
It’s recommended to increase the maximum size of the database connection pool to at least [number of application server worker threads] + 75. If tomcat is being considerer this value is normally 275. The setting is called db.pool.max and should be added to your alfresco-global.properties (db.pool.max=275).
After increasing the size of the Alfresco database connection pool, you must also increase the number of concurrent connections your database can handle, to at least the size of the Alfresco connection pool. Alfresco recommends configuring at least 10 more connections to the database than is configured into the Alfresco connection pool, to ensure that you can still connect to the database even if Alfresco saturates its connection pool.
Database Validation query
By default Alfresco does not periodically validate each database connection retrieved from the database connection pool. Validating connections is, however, very important for long running Alfresco servers, since there are various ways database connections can unexpectedly be closed (for example by transient network glitches and database server timeouts). Enabling periodic validation of database connections involves adding the db.pool.validate.query property to alfresco-global.properties and the query is specific for your database type.
Database | Value for db.pool.validate.query |
MySQL[1] | SELECT 1 |
PostgreSQL | SELECT VERSION() |
Oracle | SELECT 1 FROM DUAL |
Database Scaling
Alfresco relies largely on a fast and highly transactional interaction with the RDBMS, so the health of the underlying system is vital. Considering our existing customers, the biggest running repositories are under Oracle (most of them RAC).
If your project will have lots of concurrent users and operations, consider an active-active database cluster with at least 2 machines. This can be achieved using Oracle RAC or a Mysql based solution with haproxy[1] (opensource solution) or a commercial solution like MariaDB[2] or Percona[3].
You can use either solution, depending also you the knowledge that you have in-house. The golden rule is that the response time from the DB in general, should be around 4ms or lower. At this layer i don’t recommend to use any virtualization technology. The database servers should be physical servers.
The high availability and scalability for database is vendor dependent and should be addressed with the chosen vendor to achieve the maximum performance possible.
Database Monitoring
Monitoring your database performance is very important as it can detect some possible performance problems or scaling needs.
I’ve identified the following targets that should be monitored and analysed on a regular base.
- Transactions
- Number of Connections
- Slow Queries
- Query Plans
- Critical DM database queries ( # documents of each mime type, … )
- Database server health (cpu, memory, IO, Network)
- Database sizing statistics (growth, etc)
- Peak Period of resource usage
- Indexes Size and Health
I hope this post can help you to understand the importance of the Alfresco database and that you can make use of it on your sizing exercise. Stay tuned for more Alfresco related posts.
All the best, One love,
Luis