
Alfresco Repository Caches Unfolded


During my consulting practice on various customer accounts I have faced several challenges regarding the tuning of the repository caches for increased performance. This post is a result of those personal experiences and attempts to explain the way Alfresco makes use of the repository caches.

Note: The information in this post applies only to Alfresco versions 4.1.X.

DISCLAIMER: You should not change the cache values unless you have performance issues and you are absolutely sure about what you are doing. Before changing the cache values you should first gather all the relevant information that will support and justify your changes. The repository caches can make a big positive difference to the performance of your Alfresco repository, but they consume Java heap memory. Tuning the caches in a wrong way can make your system unresponsive and may lead to Out of Memory issues. The optimal settings to use on the caches depend on your particular use case and the amount of memory available to your Alfresco server.

The Repository Caches

The Alfresco repository features various in-memory caches; they are transaction safe and clusterable. There are 2 levels of caching involved.

Level 1 Cache (cache-context.xml) – The in-transaction caches

Sizes in cache-context.xml are for the in-transaction caches (Level 1 cache), i.e. before data gets flushed to permanent storage. TransactionalCache has to hold any values (read, updated, new, removed) local to the transaction. On commit, the values are flushed to the Level 2 cache (EHCache in 4.1), which makes the data available to other transactions. Of course, a rollback means the values get thrown away and the shared caches are not modified. This gives Alfresco the power of repeatable read as far as cache values are concerned.

So, if there are a lot of transactions that pull in a lot of cache entries, the transaction-local cache can get full, which is  bad for performance as the only way to guarantee correctness of the shared cache is to clear it. When it comes to site membership evaluation, a large number of ACLs are accessed within single transactions, which is why the transactional cache sizes are larger relative to the shared cache sizes.

Level 2 Cache (ehcache-custom.xml)

The Level 2 (L2) cache provides out-of-transaction caching of Java objects inside the Alfresco system. Alfresco provides support for EHCache. Using EHCache does not restrict the Alfresco system to any particular application server, so it is completely portable.
The L2 cache objects are stored in memory attached to the application scope of the server. Sticky sessions must be used so that a user who has established a session on one server stays on that server for the entire session. By default, the cache replication makes use of RMI to replicate changes to all nodes in the cluster using the Peer Cache Replicator. Each replicated cache member notifies all other cache instances when its content has changed.
Level 2 cache is a technology to speed up database access. When the application makes a database query, it does not have to do a (costly) SQL request if the object is already present in the Level 2 cache. For debugging purposes, you can disable the L2 cache. The database will keep working, but at a slower rate.

If you have issues with the replication of information in clustered systems (that is, the cache cluster test fails), you can confirm this by setting the following properties to true in the alfresco-global.properties file:

system.cache.disableMutableSharedCaches=true
system.cache.disableImmutableSharedCaches=true

Default Values for the Caches

Currently, out of the box, a vanilla Alfresco comes set up for approximately 512MB of cache heap memory, which is the recommended default for a Java heap size of 1GB. These days we have much bigger heaps (I've seen heaps from 4GB up to 54GB) and also much bigger numbers in terms of users, concurrency and repository sizes. This means that the cache defaults in ehcache.xml are designed for dev environments (1GB heap) and I personally think they can be tuned in every production environment.

All default cache settings are available in the <configRoot>\alfresco\ehcache-default.xml file, but you should not directly modify this file.

Individual Cache Settings for L2 cache

Each cache is configured in an XML block similar to this:
<cache
name="org.alfresco.cache.node.rootNodesCache"
maxElementsInMemory="500"
eternal="true"
overflowToDisk="false"
statistics="false"
/>

name
The name attribute is the name of the cache and generally indicates the type of objects being cached.
maxElementsInMemory
The maxElementsInMemory controls the maximum size of the cache. This value can be changed to tune the size of the cache for your system. Ehcache caches are implemented using a linked-map system, which means that memory is only required for objects that are actually in memory. If you set the maxElementsInMemory to a high value, it will not automatically allocate that number of slots. Instead, they are added to the linked list as required. When maxElementsInMemory is reached, the cache discards the oldest objects before adding new objects.
The cached objects will be garbage collected by means of weak referencing should memory become a problem. Note that some object references are effectively shared by the caches, so the amount of memory used is generally not as high as the approximate value may suggest – but it's best to err on the side of caution.
timeToIdleSeconds 
timeToIdleSeconds and timeToLiveSeconds control the automatic timeout of cached objects.
overflowToDisk
overflowToDisk controls whether the cache should overflow to disk rather than discarding old objects.
statistics
When set to true, and provided the rest of the tracing mechanism is enabled, the Alfresco logs will contain usage statistics for this specific cache.

How to Wisely tune the repository caches for your use-case

There are 2 main files that you can edit/enable to tune your repository caches: ehcache-custom.xml and cache-context.xml. You'll find the ehcache-custom.xml.sample file in the shared/classes/alfresco/extension directory under your application server installation folder. To change the default values you need to rename this file to ehcache-custom.xml, perform your changes and restart your application server. When you decide that you need to tune a specific cache, you should do it in both the ehcache-custom.xml and cache-context.xml files.
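For example, assuming a standard Tomcat layout (the exact path depends on your installation), enabling the sample file looks like this:

# cd <TOMCAT_HOME>/shared/classes/alfresco/extension
# cp ehcache-custom.xml.sample ehcache-custom.xml
# ... edit ehcache-custom.xml, then restart the application server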

We strongly advise you not to tune the caches without tracing their current usage first.

The best way to predict and determine the optimal values for your caches is to use a tracing mechanism that helps you determine which caches fill up quickly for your particular server use-case. Take a look below at how to enable the cache tracing mechanism.

TRACING YOUR CACHE SIZES

1 – log4j.properties

Edit your log4j.properties and set the following logging category  to DEBUG to output detailed Ehcache usage information.

org.alfresco.repo.cache.EhCacheTracerJob=DEBUG

To target specific caches, you can even append the cache name or package:

org.alfresco.repo.cache.EhCacheTracerJob.org.alfresco

2 – Override the ehCacheTracerJob bean

The configuration file <configRoot>/alfresco/scheduled-jobs-context.xml contains the ehCacheTracerJob bean configuration. You need to override this bean to change the trigger schedule, enabling the scheduler property to activate the trigger. To override this bean, create a <yourname>-context.xml file in the extension root folder and provide the bean override as per the instructions below.

<!-- enable DEBUG for 'org.alfresco.repo.cache.EhCacheTracerJob' and enable scheduler property to activate -->
<bean id="ehCacheTracerJob">
<property name="jobDetail">
<bean id="ehCacheTracerJobDetail">
<property name="jobClass">
<value>org.alfresco.repo.cache.EhCacheTracerJob</value>
</property>
</bean>
</property>
  <!-- enable this to activate the bean
        <property name="scheduler">
            <ref bean="schedulerFactory" />
        </property>
        -->
<!-- start after an hour and repeat hourly -->
<property name="startDelayMinutes">
<value>60</value>
</property>
<property name="repeatIntervalMinutes">
<value>60</value>
</property>
</bean>

When triggered, the job will collect detailed cache usage statistics and output them to the log/console, depending on how logging has been configured for the server.

3 – Set caches to use statistics

In your ehcache-custom.xml, choose the caches you want to monitor and set the statistics property=true.

<cache
name="org.alfresco.cache.node.rootNodesCache"
maxElementsInMemory="250000"
eternal="true"
overflowToDisk="false"
statistics="true"
/>

After making those 3 changes and restarting your server, you should start seeing a detailed output with relevant information in regards to your usage of the repository caches. You should get log traces similar to the ones on the example below.

The following example is from a test of the Alfresco Repository running a simple 150 concurrent user test scenario. Randomly selected from a pool of 1000 test users, a user logs in, views their home space, uploads a small file to the repository and logs out. This test ensures that new objects are continually added to the caches as the new files are added by random users.

Some objects are shared between the caches, so the reported sizes are an overestimate in some cases. Nevertheless, they serve as a useful indication of relative sizes. Note the last statement that gets logged, which clearly indicates the estimated share of your heap that is being consumed by the caches.

09:09:34,458 DEBUG [org.alfresco.repo.cache.EhCacheTracerJob]    Analyzing EHCache:
===>  org.alfresco.repo.domain.hibernate.NodeImpl.sourceNodeAssocs
Hit Count:                        56245 hits            |         Miss Count:         20102 misses
Deep Size:                       19.62 MB              |         Current Count:       5000 entries
Percentage used:            100.00 percent     |         Max Count:           5000 entries
Estimated maximum size:      19.62 MB

09:10:06,099 DEBUG [org.alfresco.repo.cache.EhCacheTracerJob]    Analyzing EHCache:
===>  org.alfresco.repo.domain.hibernate.NodeImpl.targetNodeAssocs
Hit Count:                            56253 hits            |         Miss Count:         20114 misses
Deep Size:                           19.62 MB              |         Current Count:       5000 entries
Percentage used:                100.00 percent     |         Max Count:           5000 entries
Estimated maximum size:      19.62 MB

09:10:06,099 DEBUG [org.alfresco.repo.cache.EhCacheTracerJob]    Analyzing EHCache:
===>  org.alfresco.repo.domain.hibernate.VersionCountImpl
Hit Count:                          0 hits                   |         Miss Count:             0 misses
Deep Size:                         0.00 MB               |         Current Count:          0 entries
Percentage used:              0.00 percent        |         Max Count:            100 entries
Estimated maximum size:        NaN MB

09:10:06,115 DEBUG [org.alfresco.repo.cache.EhCacheTracerJob]    Analyzing EHCache:
===>  org.alfresco.repo.domain.hibernate.NodeAssocImpl
Hit Count:                          0 hits                |         Miss Count:             0 misses
Deep Size:                         0.00 MB            |         Current Count:          0 entries
Percentage used:              0.00 percent     |         Max Count:           1000 entries
Estimated maximum size:        NaN MB


09:10:31,428 DEBUG [org.alfresco.repo.cache.EhCacheTracerJob]    Analyzing EHCache:
===>  permissionsAccessCache
Hit Count:                         2610635 hits        |         Miss Count:       6148423 misses
Deep Size:                        6.02 MB                |         Current Count:      12165 entries
Percentage used:             24.33 percent       |         Max Count:          50000 entries
Estimated maximum size:      24.75 MB

09:10:31,615 DEBUG [org.alfresco.repo.cache.EhCacheTracerJob]    Analyzing EHCache:
===>  PermissionCache
Hit Count:                         9035796 hits        |         Miss Count:      19266775 misses
Deep Size:                        2.55 MB                |         Current Count:       3519 entries
Percentage used:             35.19 percent       |         Max Count:          10000 entries
Estimated maximum size:       7.23 MB

09:10:31,615 DEBUG [org.alfresco.repo.cache.EhCacheTracerJob] EHCaches currently consume 421.84 MB or 28.20% VM size
The criteria to check in the cache tracing logs are:

  • (MissCount – CurrentCount) must be as low as possible.
  • (HitCount/MissCount) must be as high as possible.

Estimated maximum size affects the permanent memory taken up by the cache. If the caches grow too large, they may crowd out transient session memory and slow down the system. It is useful to have this running, on occasion, to identify the caches with a low HitCount/MissCount ratio.
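As a quick worked example using the sample output above: permissionsAccessCache shows 2,610,635 hits against 6,148,423 misses, and PermissionCache shows 9,035,796 hits against 19,266,775 misses, so in both cases the hit/miss ratio is below 0.5:

permissionsAccessCache:  2610635 / 6148423   ≈ 0.42
PermissionCache:         9035796 / 19266775  ≈ 0.47

A ratio this low, combined with a high (MissCount – CurrentCount) value, makes these caches candidates for an increase, provided enough heap is available.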

An important indicator that you need to increase your caches is when you see messages like the ones below on your alfresco.log file indicating that some specific caches are full.

13:25:12,901 WARN [cache.node.nodesTransactionalCache] Transactional update cache 'org.alfresco.cache.node.nodesTransactionalCache' is full (125000).
13:25:14,182 WARN [cache.node.aspectsTransactionalCache] Transactional update cache 'org.alfresco.cache.node.aspectsTransactionalCache' is full (65000).
13:25:14,214 WARN [cache.node.propertiesTransactionalCache] Transactional update cache 'org.alfresco.cache.node.propertiesTransactionalCache' is full (65000).

After analysing your tracing results and your Alfresco log file, if you decide to perform repository cache tuning, you should enable both the ehcache-custom.xml and cache-context.xml files for editing.

I strongly advise you not to change the original cache-context.xml file directly, but to create a new context file where you override the beans that you need. You can find the original cache-context.xml inside the alfresco.war webapp, under /classes/alfresco/cache-context.xml. The tuning of a specific cache should occur in both the cache-context file and the ehcache-custom.xml file. For your reference, I'm attaching to this post 2 sample cache configuration files with settings that were required to handle an environment with a large group structure.

custom-cache-context.xml ehcache-custom.xml
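As an illustrative sketch only (the bean id and property names below are assumptions based on a stock 4.1 cache-context.xml; copy the real bean definition from your own version rather than typing it from here), an override raising the in-transaction limit that produced the nodesTransactionalCache warning above could look like this, with the matching maxElementsInMemory entry also raised in ehcache-custom.xml:

<!-- custom-cache-context.xml (extension folder): illustrative override, verify bean id and properties against your version -->
<bean name="node.nodesCache" class="org.alfresco.repo.cache.TransactionalCache">
    <property name="sharedCache">
        <ref bean="node.nodesSharedCache" />
    </property>
    <property name="name">
        <value>org.alfresco.cache.node.nodesTransactionalCache</value>
    </property>
    <!-- raise the limit reported as full (125000) in the warning above -->
    <property name="maxCacheSize">
        <value>250000</value>
    </property>
    <property name="mutable">
        <value>true</value>
    </property>
</bean>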

Individual Caches Information

This section shows details on the most important repository caches with a brief definition on their purpose.

Default Cache
defaultCache This cache is for when someone forgets to write a cache sizing snippet.
Store Caches
rootNodesCache Primary root nodes by store.  Increase for Multi Tenancy
allRootNodesCache All root nodes by store.  Increase for Multi Tenancy
Nodes Caches
nodesCache Bi-directional mapping between NodeRef and ID.
aspectsCache Node aspects by node ID.  Size relative to nodesCache
propertiesCache Node properties by node ID.  Size relative to nodesCache
childByNameCache Child node IDs cached by parent node ID and cm:name.  Used for direct name lookups and especially useful for heavy CIFS use.
contentDataCache cm:content ContentData storage.  Size according to content in system.
Permissions and ACLs: (Mostly used in Alfresco Share)
authorityToChildAuthorityCache Size according to total number of users accessing system
authorityToChildAuthorityCache Members of each group. Size according to number of groups, incl. site groups
zoneToAuthorityCache Authorities in each zone. Size according to zones.
authenticationCache Scale it according to the number of users, not concurrency, because Share sites will force all of them to be hit regardless of who logs in.
authorityCache NodeRef of the authority.  Size according to total authorities
permissionsAccessCache Size according to total authorities
readersCache Who can read a node, Size according to total authorities.
readersDeniedCache Who can’t read a node,  Size according to total authorities.
nodeOwnerCache cm:ownable mapping, needed during permission checks
personCache Size according to the number of people accessing the system
aclCache ACL mapped by Access control list properties
aclEntityCache DAO level cache bi-directional ID-ACL mapping.
Other Caches 
propertyValueCache Caches values stored for auditing.  Size according to frequency and size of audit queries and audit value generation
immutableEntityCache QNames, etc.  Size according to static model sizes
tagscopeSummaryCache Size according to tag use.  Stores rollups of tags

AVM Caches are related to a deprecated part of Alfresco so you shouldn’t need to tune the following caches :

  • avmEntityCache
  • avmVersionRootEntityCache
  • avmNodeCache
  • avmStoreCache
  • avmNodeAspectsCache

Conclusion

Repository caches play an important role in repository performance and they should be tuned wisely. Always use the tracing method before you decide to tune your caches, and don't forget to check your logs for signs that will help you decide whether tuning is needed.

I hope you enjoyed this article, stay tuned for more information from the field.

 


Alfresco Boxes – State of the Art Automation to create your alfresco environment

Hi folks, my second post on my new Alfresco blog could not be about a better topic. Let me start by telling you that Alfresco Boxes rocks, and it makes me proud to be a personal friend and colleague of its creator, Maurizio Pillitu. It has never been so easy to set up and run a fully featured Alfresco environment.

Alfresco Boxes is a community-driven effort, currently in an experimental phase, not supported and not guaranteed by Alfresco. As usual, if you choose to use this technology and you run into problems, please let us know so we can quickly change the project name and pretend it is someone else's fault. :-)

Jokes aside, if you want to create your Alfresco environments in a fast, intelligent, reusable, automated way, then Alfresco Boxes is what you're looking for. Let's start with some boring but necessary theory. I promise I won't take it very far and you'll be able to start experimenting with all of this really quickly.

On the GitHub project page (https://github.com/maoo/alfresco-boxes) Maurizio defines Alfresco Boxes as:

“A collection of utilities for Alfresco VM provisioning and deployment that works with either Packer, Vagrant or Docker; the deployment and installation logic is provided by chef-alfresco.”

Packer is a free (open-source) tool for creating identical machine images for multiple platforms from a single source configuration. It's easy to use and it automates the creation of any type of machine image. It embraces modern configuration management by encouraging you to use automated scripts to install and configure the software within your Packer-made images. Packer brings machine images into the modern age, unlocking untapped potential and opening new opportunities. Out of the box, Packer comes with support to build images for Amazon EC2, DigitalOcean, VirtualBox, and VMware.

Vagrant does the same as Packer, only using a specific file extension (.box) that contains all the files needed by a local provider (VMware or VirtualBox) to run the VM; check vagrantcloud and vagrantbox.es to get an idea of the pre-packaged boxes available that you can re-use in your own projects.

Docker builds on Linux kernel features to provide virtualization "super-powers". Docker introduces the concept of a container, which is a virtualization of the host operating system resources. Docker is a server that can start, stop and kill containers, so it's basically a virtualization server with a new approach to the virtualization paradigm. As with Vagrant, Docker delivers an Image Index (Alfresco images soon to come).

In a nutshell you can use any one of the 3 technologies above (depending on what you want to achieve) to automate your Alfresco deployment strategy and the creation of your Alfresco environments. In this blog post I will focus exclusively on Packer but, from a technical standpoint, the 3 approaches have one thing in common: "Chef Alfresco".

Chef Alfresco is a Chef cookbook (a collection of build tasks) that defines the installation steps needed to deliver a fully working Alfresco instance on a given VM. If you're not familiar with Chef, it is a build automation tool that uses an agent-based client/server architecture to process and execute build tasks, so-called recipes.

Chef Alfresco depends on other cookbooks, such as artifact-deployer, that fetches artifacts from remote Apache Maven repositories and defines default values (i.e. Maven artifact coordinates) for all artifacts (WARs, ZIPs, JARs) involved in the Alfresco deployment process; it also depends on other third-party recipes that install the DB (MySQL), Servlet Container (Tomcat) and transformation tools (ImageMagick, LibreOffice, swftools). If you want to check the full list of dependencies, check the Berkshelf file of alfresco-boxes.

 ! OK, IT'S TIME TO GET BUSY, LET'S TRY IT !

Now that you know the basics of the theory, it's time to have a real taste of the Alfresco Boxes technology. What follows is a list of step-by-step actions and a how-to video that will guide you on how to use all of this in a simple way. The first thing you need to do is check if you have the prerequisites necessary to proceed.

MacOS users: to have Ruby installed, the best way is to install Xcode, preferably the latest version.

Now that the prerequisites are in place, let's install Alfresco Boxes and build our first Alfresco VM in an automated way. The following instructions focus on the Packer approach. You can find detailed documentation on the other variants, such as Docker or Vagrant, on the GitHub Alfresco Boxes project.

Installing Alfresco Boxes

1 – Download and install Packer.

  • Go to http://www.packer.io/downloads.html and download the appropriate version for your operating system.

  • Unzip the packer package and make sure the unzipped location is part of your local path.

    • For Linux/MacOs : export PATH=<your_path_to_packer>:$PATH

  • Test if packer is installed and available on your system by typing the following command :

    • # packer -v

2 – Download and install virtualBox

3 – Checkout the Alfresco Boxes git project.

  • Check out alfresco-boxes by running the following command in a terminal (you need to have git installed; if you don't know about git, now is a good time to learn it and start using it)

  • # git clone -b alfresco-boxes-0.5.1 https://github.com/maoo/alfresco-boxes.git

The git clone command creates an alfresco-boxes folder on your system and downloads all of the project's content for the specific branch being checked out.

4 – Local configuration adjustments

4.1 Change directory to alfresco-boxes/packer/

# cd alfresco-boxes/packer/

4.2 Edit file precise-alf421.json to choose an IP that can be bridged to one of your host Network Interfaces:

{
  "type": "shell",
  "execute_command": "echo 'vagrant' | sudo -S sh '{{ .Path }}'",
  "inline": ["/tmp/static-ip.sh <your_ip_range_here>.223 192.168.1.1 255.255.255.0"]
}

4.3 Edit the vbox-precise-alf421/data_bags/maven_repos/private.json 

Edit this file to set your access credentials to artifacts.alfresco.com ( access can be requested by Alfresco Customers via the Alfresco Support Portal).

P.S. - If you don't have credentials for artifacts.alfresco.com you can still test alfresco-boxes using the Community edition: change the alfresco-allinone.json version attribute from 4.2.1 to 4.2.f

{
  "id":"private",
  "url": "https://artifacts.alfresco.com/nexus/content/groups/private",
  "username":"your_user",
  "password":"your_password"
}
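Regarding the Community edition note above, a purely hypothetical sketch of that version change follows (the attribute may be nested differently in your alfresco-allinone.json, so adjust to the real structure of the file):

{
  "alfresco": {
    "version": "4.2.f"
  }
}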

You can optionally use your Maven encrypted password and set your Maven master password in precise-alf421.json:

"maven": {
  "master_password":"{your_mvn_master_password}"
}

4.4 Generate the Virtual Machine box:

cd alfresco-boxes/packer/vbox-precise-421
packer build -only virtualbox-iso precise-alf421.json

This will create an output-virtualbox-iso/.ovf and output-virtualbox-iso/.vmdk, ready to be imported into VirtualBox. You should now have a fully functional version of Alfresco with everything installed and ready to run. :)

The user/password to login (and check the local IP - ifconfig – that is assigned by DHCP) is vagrant/vagrant.

5 – Virtual Machine (Ubuntu) Notes

Open a terminal with a ssh connection to your virtual machine.

# ssh vagrant@<your_vm_ip>

5.1 – Start by changing/setting the root password

#sudo sh

# passwd root

5.2 Change tomcat7 password

# su -

# passwd tomcat7

5.3 Take a note of your usernames and passwords :

Make notes of your usernames and passwords, you will need them later.

root | <your_root_password>
tomcat7 | <your_tomcat7_password>

5.4 Tomcat 7 locations

CATALINA_BASE=/var/lib/tomcat7
CATALINA_HOME=/usr/share/tomcat7

5.5 To start | stop tomcat ( use sudo and tomcat7)

# sudo service tomcat7 start|stop|restart|status

5.6 Tomcat logs directory

/var/log/tomcat7

5.7 Alfresco and Alfresco Share logs directory

/var/lib/tomcat7

5.8 Alfresco shared dir location

/var/lib/tomcat7/shared

5.9 Alfresco admin user 

( admin | admin )

Hope you enjoyed the article in its current state. I will be posting a step-by-step video to illustrate the most relevant steps for your guidance in the near future, so stay tuned.

Love, Passion, Unity and OpenSource can take us further. We’re together, thanks for reading.

 

Solr Tuning – Maximizing your Solr Performance


Solr Tuning Tips

Hi folks, another useful post for your Alfresco-related knowledge, this time dedicated to Solr tuning tips. Using this information wisely can contribute heavily to increased performance in Solr-related areas such as searching and indexing content.

Solr, when properly tuned, is extremely fast, easy to use, and easy to scale. I wanted to share some lessons learned from my field experience while using Alfresco with Solr.

Solr is a search indexer built on top of Lucene. There are two main disciplines to consider for Solr:

* Indexing data (writing/committing)
* Search for data (reading/querying)

Each of those disciplines has different characteristics and considerations while addressing performance.

It's important to mention that there is no rule of thumb that enables you to maximize your Solr performance for every project. Solr tuning is an exercise that depends highly on the specific project use cases, architecture and business scenarios. Depending on the particularities of your project, the actions to perform may vary in terms of what needs to be done to achieve the best Solr performance. This post includes procedures and methodologies that will help you to understand how Solr performance is driven.

Let's first analyse how search and indexing work together with Alfresco.

2 main Solr Cores (Live content and archived content)

By default the Solr that comes with Alfresco contains 2 cores: the live content core (workspace:SpacesStore) and the archived content core (archive:SpacesStore). Each of these cores contains the indexes for its particular set of content.

(Diagram: communication between Solr and the Alfresco repository)

Alfresco Search (after a user searches for content, what happens behind the scenes?)

Alfresco sends a secure GET request (https) to the Solr cores and Solr responds with a stream, formatted in JSON or XML, containing the response to the search request.
This is then interpreted by Alfresco and the results are presented in a user-friendly format.

Solr Indexing new items (tracking Requests)

This tracking occurs by default every 15 seconds (it can be configured); Solr asks Alfresco for changes in content and newly created documents in order to index those changes in its cores. It also asks for changes to the content models and to the ACLs on documents.

In summary, Solr updates its indexes by looking at the number of transactions that have been committed since it last talked to Alfresco, a bit like index tracking in a cluster. In the diagram above you see several HTTP requests going from Solr to Alfresco; those requests are explained below:

  1. New models and model changes https://localhost:8443/alfresco/service/api/solr/model
    1. Solr keeps track of new custom content models that have been deployed and download them to be able to index the properties in these models.
  2. ACLs changes https://localhost:8443/alfresco/service/api/solr/aclchangesets
      1. Any changes on permission settings will also be downloaded by Solr so it can do query time permission filtering.
  3. Document Content changes
      1.  https://localhost:8443/alfresco/service/api/solr/textContent
  4. New transactions (create, delete, update or any other action that triggers a transaction)
      1. https://localhost:8443/alfresco/service/api/solr/transactions

     

Brief analysis of a new document indexing scenario

Let's check what happens in Solr when we create a new document and Solr executes its tracking, detecting that a new document has been created.

  1. First Solr requests a list of ids of all new transactions on that document (create, update, delete, … ) https://localhost:8443/alfresco/service/api/solr/transactions
  2. Transactions and ACL changesets are indexed in parallel, and for each transactionId, Solr requests, in this order:
    1. Document metadata
    2. Document Content

Solr architecture variations

There are 3 different architecture variations that can be considered while using Solr with Alfresco in a cluster. For the scope of this post I will only be addressing cluster-based configurations, which include the following advantages:

Alfresco -> Solr search load balancing

This is the most obvious use case for scalability purposes. Search requests are directed from Alfresco to a pool of Solr instances, each of which contains a full copy of the index and is able to service requests in a purely stateless fashion.

Solr -> Alfresco index tracking balancing

In the other direction, Solr nodes use a load balancer to redirect their index tracking requests to one or multiple dedicated/shared Alfresco nodes. This is useful in case of large indexing load, due to a heavy concurrent write/update scenario.

Option 1 – Solr on the same machine as Alfresco, non-dedicated tracking

(Architecture diagram: Option 1)

In this architecture we have a Solr instance deployed in the same application server as both the Alfresco and Share web applications.

Advantages

  • Easy to maintain / backup.

Disadvantages

  • Shared JVM, if Solr crashes both Alfresco and Share become unavailable.
  • Shared hardware, Memory is shared between all layers of the application
  • When Solr downloads content, there is transformation load to generate the indexing text file (CPU/Memory intensive) on the Alfresco side, having everything on the same box impacts both search and indexing as all the applications are on the same application server sharing its resources like the connection pools, threads, etc.
  • Only possible to scale vertically (Only possible to add more CPU and Memory)

Option 2 – Solr separated from alfresco – Non-Dedicated tracking

(Architecture diagram: Option 2)

In this architecture variation we have the Solr instances deployed on separate machines and application servers from Alfresco and Share.

Advantages

  • Simple upgrade, administration and maintenance of Solr server’s farm
  • Allows for vertical and horizontal scalability
  • Introduces the ability to load balance the queries
  • Ready for Future Solr sharding feature
    • It's expected that Alfresco will support, in the near future, the ability to split the index across different Solr instances, which will lead to increased performance; this architecture is ready to implement that feature.

Disadvantages

Remote tracking can stress the network; if network problems occur, Solr performance is affected.

Option 3 – Solr server with a dedicated tracking Alfresco instance

(Architecture diagram: Option 3)

In this architecture variation we use dedicated Alfresco instances on the Solr servers that are only used for indexing and do not receive or process any user requests. These local Alfresco instances take care of important CPU/Memory intensive operations such as the transformations and the overall tracking and indexing actions. With this scenario the repository tier is relieved of those operations, resulting in an overall performance gain. These Alfresco instances are not part of the cluster and do not require ehcache configuration.

Note: When Solr asks Alfresco for the content to be indexed, it's the Alfresco server that is responsible for performing the content transformation into a plain text file; only then is the content sent to Solr for indexing. This is an IO/CPU/Memory intensive operation that can decrease the overall Alfresco performance.

Advantages

  • Indexing operations offloaded from repository and client tier
  • Dedicated Alfresco for transformation operations
    • Allow for specific transformations tuning on the index tier and on the repository tier considering the use cases. Transformation for previews and thumbnails (share related) and transformations for Solr indexing.
  • Allows for Vertical and horizontal scalability
  • General performance increase

Disadvantages

  • None, in my opinion this is the best option :)

Solr Indexing Best practices

Now let's discuss the Solr indexing best practices. If your problem is with indexing performance, this is the juicy part of the post for you.

General Indexing Golden Rules ( Solr Indexing Tuning )

  • Have local indexes (don't use shared folders or NFS) and use fast hardware (RAID, SSD, ...).
  • When using an Alfresco version prior to 4.1.4 you should reduce your caches, as the default cache configuration may lead to OOM when Solr is under heavy load.
  • Manage your RAM buffer size (ramBufferSizeMB) wisely in solrconfig.xml; this is set to 32 MB by default, but generally increasing it to 64 or even 128 has proven to increase performance. This depends on the amount of free memory you have available.
    • ramBufferSizeMB sets the amount of RAM that may be used by Solr indexing for buffering added documents and deletions before they are flushed to disk.
  • Tune the mergeFactor, 25 is ideal for indexing, while 2 is ideal for search. To maximize indexing performance use a mergeFactor of 25.
  • During the indexing, plug in a monitoring tool (YourKit) to check the repository health during the indexing. Sometimes, during the indexing process, the repository layer executes heavy and IO/CPU/Memory intensive operations like transformation of content to text in order to send it to Solr for indexing. This can become a bottleneck when for example the transformations are not working properly or the GC cycles are taking a lot of time.
  • Monitor closely the JVM health of both Solr and Alfresco (GC, Heap usage)
  • Solr operations are memory intensive so tuning the Garbage collector is an important step to achieve good performance.
  • Consider whether you really need tracking to happen every 15 seconds (the default). This can be configured in the Solr configuration files via the cron frequency property: alfresco.cron=0/15 * * * * ? *
  • This property can heavily affect performance, for example during bulk injection of documents or during a Lucene to Solr migration. You can change it to 30 seconds or more when you are re-indexing. This will allow more time for the indexing threads to perform their actions before they get more work in their queue.
  • Increase your index batch counts to get more results on your indexing webscript on the repository side. In each core solrcore.properties, raise the batch count to 2000 or more alfresco.batch.count=2000
  • In solrconfig.xml of each core configure the ramBufferSize to be at least 64 Mb , you can even use 128 if you have enough memory .<ramBufferSizeMB>64</ramBufferSizeMB>
  • In solrconfig.xml of each core configure the mergeFactor to 25, this is the ideal value for indexing. <mergeFactor>25</mergeFactor>
  • Disable full text indexing on the archive:SpacesStore Solr core; this is done by adding the property alfresco.index.transformContent=false. Alfresco never searches for content inside files that are deleted/archived. This saves disk space, memory on Solr, CPU during indexing and overall resources.
  • Tune the transformations that occur on the repository side, set a transformation timeout.
  • Important questions your project must answer:
    • Is SSL really needed? If you are inside the intranet, you should disable it to reduce complexity.
    • Is full text indexing really necessary? Some customers do full text indexing but they don't actually use it.
    • Is an archive core really necessary for indexing? If you are not making use of this indexing core, it would be beneficial to disable it.

For index updates, Solr relies on fast bulk reads and writes. One way to satisfy these requirements is to ensure that a large disk cache is available. Use local indexes and the fastest disks possible. In a nutshell, you want to have enough memory available in the OS disk cache so that the important parts of your index, or ideally your entire index, will fit into the cache. Let’s say that you have a Solr index size of 8GB. If your OS, Solr’s Java heap, and all other running programs require 4GB of memory, then an ideal memory size for that server is at least 12GB. You might be able to make it work with 8GB total memory (leaving 4GB for disk cache), but that also might NOT be enough.

Solr Indexing Troubleshooting techniques

Troubleshooting Solr indexing performance means finding the bottleneck that is delaying the overall indexing process. Since this is a process that involves at least 2 layers of your application architecture, the best way to troubleshoot is through a dedicated series of tests measuring performance and comparing results.

The first thing to discover is where the bottleneck is occurring; it can be in:

  • Repository layer
    • Database – If it’s a database performance issue, normally adding more connections to the connection pool will increase performance.
    • I/O – IO problems typically occur when using virtualized environments; you should use hdparm to check read/write disk speed performance if you are running on a Linux-based system (there are also some variations for Windows). Find the example below:

                  sudo hdparm -Tt /dev/sda

Timing cached reads: 12540 MB in 2.00 seconds = 6277.67 MB/sec  
Timing buffered disk reads: 234 MB in 3.00 seconds = 77.98 MB/sec
  • JVM – Jvm configuration can impact the performance on the repository layer indexing activities.
  • Cpu and Memory usage – monitor the usage of the CPU and Memory on this layer and check for unusual usage of this two components.
  • Transformations – Set a timeout for the transformations that occur on the repository layer. There is no timeout set by default and sometimes, when there's a transformation issue, the threads are frozen waiting for the transformations to occur.
  • SOLR Indexing layer
    • Number of threads for indexing – You can add more threads to the indexing processes if you detect that indexing is slow on the Solr side.
    •  Solr caches  – There are several caches that you can configure to increase indexing performance.
    • JVM – Jvm configuration can impact the performance on the Solr layer indexing activities. Focus your efforts on analyzing and tuning the Garbage collector, check for big GC pauses by analyzing the gc logs.
    • Hardware scalability – If none of the above actions improve your performance you may need to increase memory and CPU power on the Solr layer. Also consider horizontal scaling when appropriate.

The rule for troubleshooting involves testing and measuring initial performance, applying some tuning and parameter changes, then retesting and measuring again until you reach the necessary performance. I strongly advise you to plug a profiling tool such as YourKit into both the repository and Solr servers to help with the troubleshooting.

Solr Search Best practices

This section is about tuning search performance while using Solr. In general it will be sufficient to follow the golden rules below; if applying those does not solve your problem you might need to scale your architecture.

General Search Golden Rules

  • Use local folders for the indexes (don’t use shared folders, NFS)
  • Use Fast hardware (RAID, SSD,..)
  • Tune the mergeFactor, a mergeFactor of 2 is ideal for search.
  • Decrease the Solr caches, especially when running an Alfresco version prior to 4.1.4.
  • Increase your query caches and the RAMBuffer.
  • Avoid path search queries; those are known to be slow.
  • Avoid using sort, you can sort your results on the client side using js or any client side framework of your choice.
  • Avoid * search, avoid ALL search
  • Tune your Garbage collector policies and JVM memory settings.
  • Consider lowering your memory on the JVM if the total heap that you are assigning is not being used. Big JVM heap sizes lead to bigger GC pauses.
  • Get the fastest CPU you can; search is CPU intensive rather than RAM intensive.
  • Separate search and indexing tiers. If you can have 2 separate solr server farms, you can dedicate one to the indexing and the other to search. This will increase your global performance ( Only available since alfresco 4.2.X )
  • Optimize your ACL policy, re-use your permissions, use inherit and use groups. Don’t setup specific permissions for users or groups at a folder level. Try to re-use your Acls.
  • Upgrade your Alfresco release with the latest service packs and hotfixes. Those contain the latest Solr improvements and bug fixes that can have great impact on the overall search performance.
  • Make sure you are using only one transformation subsystem. Check the alfresco-global.properties and see whether you are using either OOoDirect or JodConverter; never enable both subsystems (see the sketch after this list).
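A minimal alfresco-global.properties sketch for that last point, keeping only JodConverter active (the property names are the usual 4.x ones, but the office path shown is just an assumption for a typical Linux install):

# keep exactly one OpenOffice-based transformation subsystem enabled
ooo.enabled=false
jodconverter.enabled=true
# adjust to where LibreOffice/OpenOffice actually lives on your server
jodconverter.officeHome=/usr/lib/libreoffice
jodconverter.portNumbers=8100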

Typical issues with Searching

It can happen that you are searching and indexing at the same time; this causes concurrent accesses to the indexes, which is known to cause performance issues. There are some workarounds for this situation. To start, you should plug in a profiler and search for commit issues (I/O locks); this will allow you to check if you are facing this problem.

Solr Search Troubleshooting techniques

To troubleshoot your Solr search problems you should start by choosing a load testing tool such as SolrMeter or JMeter and design your testing scenario with one of those tools. You can also choose to use the default Alfresco benchmark scenario. The second step is to attach a profiler like YourKit, or another Java profiler of your choice, and record search performance snapshots for analysis.

Apply the tunings suggested in this document (especially in the golden rules sections) and retest until you reach the necessary performance.

Solr usage in Share

If your project relies on the Share client offered by Alfresco, you should know that tuning your Solr indexing and search performance will positively affect the overall Share performance.

Share relies on Solr in the following situations:

  • Full Text Search (search field in top right corner)
  • Advanced Search
  • Filters
  • Tags
  • Categories (implemented as facets)
  • Dashlets such as the Recently Modified Documents
  • Wildcard searches for People, Groups, Sites (uses database search if not wildcard)

Overall Best Practices Technical Details

This section contains important technical details that will allow you to implement the various best practices mentioned previously in this post.

1 – Turn on Logging During Search

If you want to have a look at the queries that Alfresco runs against Solr when you click around in Alfresco Share, you can enable debug logging as follows in log4j.properties:

log4j.logger.org.alfresco.repo.search.impl.solr.SolrQueryHTTPClient=debug

A log for a full text search on “Alfresco” looks like this:

2014-01-17 08:21:15,696  DEBUG [impl.solr.SolrQueryHTTPClient] [http-8080-26] Sent :/solr/alfresco/afts?q=%28%28PATH%3A%22%2Fapp%3Acompany_home%2Fst%3Asites%2Fcm%3Atest2%2F*%2F%2F*%22+AND+%28Alfresco++AND+%28%2BTYPE%3A%22cm%3Acontent%22+%2BTYPE%3A%22cm%3Afolder%22%29%29+%29+AND+-TYPE%3A%22cm%3Athumbnail%22+AND+-TYPE%3A%22cm%3AfailedThumbnail%22+AND+-TYPE%3A%22cm%3Arating%22%29+AND+NOT+ASPECT%3A%22sys%3Ahidden%22&wt=json&fl=*%2Cscore&rows=502&df=keywords&start=0&locale=en_GB&fq=%7B%21afts%7DAUTHORITY_FILTER_FROM_JSON&fq=%7B%21afts%7DTENANT_FILTER_FROM_JSON

How to disable SSL communication between Solr and Alfresco

By default, the communication between Solr and Alfresco is encrypted. If you don't need this encryption, it's a good idea to disable it in order to reduce complexity, which can contribute to increased performance.

On the Alfresco server, edit the alfresco-global.properties and set:

  • solr.secureComms=none
  • On the alfresco webapp deployment descriptor web.xml, comment out the security constraint.

<!--

<security-constraint>
<web-resource-collection>
<url-pattern>/service/api/solr/*</url-pattern>
</web-resource-collection>
<auth-constraint>
<role-name>repoclient</role-name>
</auth-constraint>
<user-data-constraint>
<transport-guarantee>CONFIDENTIAL</transport-guarantee>
</user-data-constraint>
</security-constraint>
<login-config>
<auth-method>CLIENT-CERT</auth-method>
<realm-name>Repository</realm-name>
</login-config>
<security-role>
<role-name>repoclient</role-name>
</security-role>

-->

  • For every Solr core that you have configured, set alfresco.secureComms=none in the solrcore.properties file.
  • In the Alfresco Solr deployment descriptor web.xml, or in solr.xml under Catalina/conf/localhost/solr.xml, comment out the security constraint as previously shown.

Detailed information can be found in the Alfresco customer Portal

How to set a transformation Timeout on Alfresco

Setting a timeout limit (none is set by default) can help you with your tuning and troubleshooting activities.

Timeout (ms) – Use this limit to set a timeout on reading data from the source file to be transformed. This limit works with transformers that don't bulk-read their source data, as it is enforced by a modified InputStream that either throws an exception or returns an End of file (EOF) early. The property associated with this transformation limit is timeoutMs.

You can set this property on your alfresco-global.properties as the following example:

content.transformer.default.timeoutMs=180000

How to set transformation limits on Alfresco

Setting appropriate transformation limits can help you to fine-tune your transformations and to improve indexing performance.

In Alfresco 4.2d much of the configuration of transformers is done using Alfresco global properties. In the case of the Enterprise edition these may be changed dynamically via JMX without stopping the server. Prior to this it was possible to control content transformer limits, to a more limited extent, using Spring XML and a few linked Alfresco global properties.
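For instance, alongside the timeout shown earlier, a default source-size limit can be expressed as global properties (treat the exact property names and values as a sketch to verify against your release's transformation limits documentation):

# alfresco-global.properties
content.transformer.default.timeoutMs=180000
# skip transformations of sources larger than ~50 MB (value in KBytes)
content.transformer.default.maxSourceSizeKBytes=51200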

You can find detailed information on transformation limits at http://wiki.alfresco.com/wiki/Content_Transformation_Limits#Introduction

How to Rebuild Solr Indexes

One useful action that is sometimes required is to rebuild the indexes from scratch. In order to rebuild the Solr indexes, proceed as follows:

1.      Stop Tomcat that runs Solr web application

2.      Remove index data of archive core at alf_data/solr/archive/SpacesStore

3.      Remove index data of workspace core at alf_data/solr/workspace/SpacesStore

4.      Remove cached content model info of archive core at alf_data/solr/archive-SpacesStore/alfrescoModels/*

5.      Remove cached content model info of workspace core at alf_data/solr/workspace-SpacesStore/alfrescoModels/*

6.      Restart Tomcat that runs Solr web application

7.      Wait a bit for Solr to start catching up…

Note : index.recovery.mode=FULL is not used by Solr – only by Lucene
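Put together as a shell sketch (assuming the default alf_data layout and a Tomcat service dedicated to Solr; adjust paths and service names to your installation):

# service tomcat-solr stop
# rm -rf alf_data/solr/archive/SpacesStore/*
# rm -rf alf_data/solr/workspace/SpacesStore/*
# rm -rf alf_data/solr/archive-SpacesStore/alfrescoModels/*
# rm -rf alf_data/solr/workspace-SpacesStore/alfrescoModels/*
# service tomcat-solr start
# ... then wait for Solr tracking to rebuild the indexes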

About Sizing on the Solr servers

Sizing your Solr servers depends a lot on the specific search requirements of your project. The important factors you need to consider are :

  • Search ratio: to get the search ratio you should divide the typical usage of the system into Read/Write/Search. Start with 100% and give a percentage to each of the operations.
  • Number of Documents in the repository
  • Group Hierarchy
  • Number of Acls
  • Amount of CPU cores you have available (the more the better :))

Solr can have high memory requirements. You can use a formula to calculate the memory needed for the Alfresco internal data structures used in Solr for PATH queries and read permission enforcement. By default, there are two cores in Solr: WorkspaceSpacesStore and ArchiveSpacesStore. Normally, each core has one searcher but can have a maximum of two searchers.

Alfresco provides a formula that helps you to calculate the amount of memory needed on your Solr servers; check the following URL for guidance: http://docs.alfresco.com/community/concepts/solrnodes-memory.html

Below you will find an Excel file that will help you with the calculations (you need to rename the extension from .txt to .xlsx):

Calculate_Memory_Solr Beta 0.2_xlsx

I hope you enjoyed this post. I've surely enjoyed writing it and I hope it can help you with your projects. More interesting posts from my field experience are coming to the blog, so stay tuned.

“The greatest ideas are opensource, together we are Stronger”

 

 

 

The Science of Capacity Planning


Sizing and Architecture – The Science of Capacity Planning

Hi folks, I decided to write an article about capacity planning because it has always been one of my passions. I have 14 years of consulting experience across major accounts in EMEA and I've been involved in hundreds of ECM-related IT projects. I've found adequate capacity planning mechanisms in just a few of those, and by "magic" those were/are the most successful and long-lasting projects.

What is Capacity Planning in an ECM context

Capacity planning is the science and art of estimating the space, computer hardware, software and connection infrastructure resources that will be needed over some future period of time. It's a means to predict the types, quantities, and timing of critical resource capacities that are needed within an infrastructure to meet accurately forecasted workloads.

Capacity planning is critical to ensure success of any ECM implementation. The predicting and sizing of a system is impossible without a good understanding of user behavior as well as an understanding of the complete deployment architecture including network topology and database configuration.

A high-level description of my concept of capacity planning is shown below. Basically, a good capacity planning mechanism/application implements each one of the phases outlined below, considering a customized peak period methodology, also explained below.

The capacity planning approach that I'm referring to is done after general deployment, so prior to this capacity planning you'll need a good sizing exercise to define the initial architectural requirements for your ECM platform.

In this article we are assuming that we have a fully deployed production environment where we will focus our capacity planning efforts.

Peak Period Methodology

I consider the peak period methodology the most efficient way to implement a capacity planning strategy, as it gathers vital performance information when the system is under the most load/stress. In its genesis, the peak period methodology collects and analyzes data during a configurable peak period. This allows the application to estimate the number of CPUs, the memory and the cluster nodes on different layers of the application required to support a given expected load.

The peak period may be an hour, a day, 15 minutes or any other period that is used to collect utilization statistics. Assumptions may be estimated based on business requirements or specific benchmarks of a similar implementation.

In my personal approach to ECM capacity planning implementation, I focus my efforts on 6 key layers, obtaining specific metrics during a defined peak period:

  • Web Servers machines ( Apaches/WebServer for static content )

HTTP hits/Sec – Useful for measuring the load on the Web servers.

  • Application Servers machines Holding the client application

Page Views / Second – Understand the throughput of client applications

  • Application Servers machines holding the Ecm Server ( Alfresco/FileNet/Documentum/SharePoint )

Transactions / Second – Understand the throughput of the ECM server

  • LDAP Servers machines ( LDAP )

Activities ( reads ) / Second – Understand the throughput of LDAPS

  • Database Servers machines ( Oracle )

Database Transactions/Sec – Measuring the load on the database servers.

  • Network

KB/Sec – A measure of the raw volume of data received and sent by the application. Useful for measuring the load on the network and machine network card.

On top of that I also collect a very important metric on the main application client: the response time (the time taken for a client application to respond to a user request). The values I take into consideration for capacity are:

A.R.T – Average response time

M.R.T – Maximum response time

How to implement capacity Planning ?

I normally use a collector agent that collects the necessary data from the various sources during the defined peak period. The collector runs daily and stores its data in Elasticsearch for peak period analysis. The more data gets into Elasticsearch over the application life cycle, the more accurate the capacity predictions become, because they represent the "real" application usage during the defined peak period.

The collector agent uses ZooKeeper to store important information and definitions regarding repositories, machines, the peak period definition, URLs and other environment-related constants.

To minimize the impact on overall system performance, the collector executes every day at a chosen period (outside business hours). This is configured at the OS level (using the crontab functionality or similar).
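For example, a crontab entry running the agent nightly at 02:00 could look like the following (the collector script path is hypothetical):

# m h dom mon dow  command
0 2 * * * /opt/capacity/collector-agent.sh >> /var/log/capacity-collector.log 2>&1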

Integration Capacity Planning with monitoring systems

This approach is designed to integrate with most existing system monitoring software, such as HP OpenView or JavaMelody. I'm currently implementing this approach to perform capacity planning on Alfresco installations and I'm integrating it with our existing open source monitoring stack (great job by Miguel Rodriguez from Alfresco Support).

Capacity Planning Troubleshooting

Gathering this data plays an important role in troubleshooting; in the capacity planning implementations I've seen, analysing capacity data plays a crucial role while troubleshooting an application.

Data Analysis to predict architecture changes

By performing regular analysis of our capacity planning data, we know exactly when and how we need to scale our architecture; this plays a very important role when modeling and sizing our architecture for future business requirements.

What’s next

Next September and October I will be speaking at the Alfresco Summit in San Francisco and London on how to appropriately size an Alfresco installation. The presentation will also include relevant information about capacity planning and the implementation of this approach in a real life scenario. Consider yourself invited to join the Alfresco Summit and to attend my presentation: http://summit.alfresco.com/2014-speakers/luis-cabaceira

Until then, all the best. One Love.

Luis

Sizing and tuning your Alfresco Database


In a human body the heart and the brain are the 2 most important organs; if those are not performing well, nothing else is. The Alfresco database and the filesystem where the content store resides are the brain and heart of Alfresco. Those are the 2 most important layers of every Alfresco architecture.

Get to know the Alfresco Database throughput

If your project will have lots of concurrent users and operations, or the number (or estimated number) of documents is very big (> 1M), you need to be informed about your database throughput.


The most common throughput factor of the database is transactions per second.
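A rough way to sample this on MySQL (assuming MySQL is the backing database; the user name below is just an example) is to read the commit counter twice over a known interval and divide the delta by the number of seconds:

# mysql -u alfresco -p -e "SHOW GLOBAL STATUS LIKE 'Com_commit'"
# sleep 60
# mysql -u alfresco -p -e "SHOW GLOBAL STATUS LIKE 'Com_commit'"
# ... (second value - first value) / 60 ≈ committed transactions per second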

The main factors in DB performance in a transactional system are usually the underlying database files and the log file. Both are factors because they require disk I/O, which is slow relative to other system resources such as CPU.

In the worst-case scenario (big Alfresco databases with a large number of documents):

Database access is truly random and the database is too large for any significant percentage of it to fit into the cache, resulting in a single I/O per requested key/data pair.

Both the database and the log are on a single disk. This means that for each transaction, the AlfrescoDB is potentially performing several filesystem operations:

  • Disk seek to database file
  • Database file read
  • Disk seek to log file
  • Log file write
  • Flush log file information to disk
  • Disk seek to update log file metadata (for example, inode information)
  • Log metadata write
  • Flush log file metadata to disk

Faster disks can normally help in such situations, but there are lots of other ways (scale up, scale out) to increase transactional throughput.

In Alfresco, the default RDBMS configurations are normally not suitable for large repository deployments and may result in:

  • I/O bottlenecks in the RDBMS throughput
  • Excessive queue for transactions due to overload of connections
  • On active-active cluster configurations, excessive latency

Alfresco Database thread pool

Most Java application servers have higher default settings for concurrent access, and this, coupled with other threads in Alfresco (non-HTTP protocol threads, background jobs, etc.) can quickly result in excessive contention for database connections within Alfresco, manifesting as poor performance for users.

If Tomcat is being considered, this value is normally 275. The setting is called db.pool.max and should be added to your alfresco-global.properties (db.pool.max=275).

http://docs.oracle.com/cd/E17076_04/html/programmer_reference/transapp_throughput.html

 

How to calculate the size of your Alfresco database

All operations in Alfresco require a database connection, the database performance plays a crucial role on your Alfresco environment. It’s vital to have the database properly sized and tuned for your specific use case.

To size your alfresco database in terms of space we’ve done a series of tests by creating content (and metadata) on an empty repository and analyzing the database growth.

Be aware that:

  • Content is not stored in the database but is directly stored on the disk
  • Database size is unaffected by the size of the documents or the documents' content
  • Database size is affected by the number/type of metadata fields of the document


The following factors are relevant when calculating the approximate size of an Alfresco database:

  • Number of meta data fields
  • Permissions
  • Number of folders
  • Number of documents
  • Number of versions
  • Number of users

I’ve made a series of tests where i could verify how the Alfresco database grows.

I’ve made a bulk import with the following data.

  • Document creation method: in-place bulk upload
  • Number of documents ingested: 148
  • Total size of documents: 929.14 MB
  • Number of metadata fields per document: 13
  • Total number of metadata fields: 1924

The table below shows the types of documents and their average sizes

Document Type               Extension   Average Size (KB)
MS Word document            .doc        1024
Excel sheet                 .xls        800
PDF document                .pdf        10240
PowerPoint presentation     .ppt        5120
JPEG image                  .jpg        2048

Looking at the chart below we can see that the database indexes grow more than the data itself. By observing the growth of the database size we've concluded that the average occupation on the Alfresco database is approximately 5.5k per metadata field.

[Chart: database data vs. index growth after content ingestion]

 

It is also interesting to check which tables grow in size (KB) after the content ingestion. Note that we are not applying any permissions.

[Chart: table growth (KB) after content ingestion]

 

To size your database appropriately you must ask the right questions whose answers will help you to determine the database sizing.

  1. Estimated number of users in Alfresco
  2. Estimated number of groups in Alfresco
  3. Estimated number of documents on the first year
  4. Documents growth rate
  5. Average number of versions per document
  6. Average number of meta-data fields per document
  7. Estimated number of folders
  8. Average number of meta-data fields per folder
  9. Estimated number of concurrent users
  10. Folder based permissions (inherited to child documents)?

Database sizing formulas

Consider the following figures to determine your approximate database size.

  • DV = average number of versions per document
  • F = estimated number of folders
  • FA = estimated number of folder metadata fields (standard + custom)
  • D = estimated number of documents (excluding versions)
  • DA = estimated number of document metadata fields (standard + custom)

The number of records on specific alfresco tables is calculated as follows:

  • Number of records on alf_node (TN = F + D * DV)
  • Number of records on node_properties (TNP = F * FA + D * DA)
  • Number of records on node_status (TNS = F + D)
  • Number of records on alf_acl_member (TP = D), assuming permissions will be set at the folder level and inherited

The approx. number of records in the database will be TRDB = TN + TNP + TNS + TP

The following formula is based on the number of database records. On our benchmarks we’ve observed that each database record takes about 4.5k of db space.

Formula #1 Database size = TRDB * 4.5K

Alternatively, we can base our calculations on the number of metadata fields of the documents,  considering 5.5k for each metadata field and use the following formula.

Formula #2 Database size = (D * DA + F * FA) * 5.5K

The 2 formulas provided are only approximations on the size that your database will need and are based on benchmarks executed against a vanilla Alfresco version 4.2.2.

If we wish to consider users and groups, add 2k for each user and 5k for each group.

Note that the formulas do not take into consideration additional space for logging, rollback, redo logs, etc.
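To make the arithmetic concrete, the two formulas can be wired into a small throw-away calculation. The sketch below is only illustrative: the class name and the example figures (document, folder and field counts) are mine, not values from any benchmark.

/** Rough Alfresco database size estimate based on formulas #1 and #2 above (illustrative only). */
public class DbSizeEstimate {

    public static void main(String[] args) {
        long d = 1_000_000;  // D  - estimated number of documents (excluding versions), example figure
        double dv = 2;       // DV - average number of versions per document, example figure
        long f = 50_000;     // F  - estimated number of folders, example figure
        int da = 13;         // DA - metadata fields per document, example figure
        int fa = 8;          // FA - metadata fields per folder, example figure

        double tn  = f + d * dv;       // records on alf_node
        double tnp = f * fa + d * da;  // records on node_properties
        double tns = f + d;            // records on node_status
        double tp  = d;                // records on alf_acl_member (folder-level inherited permissions)
        double trdb = tn + tnp + tns + tp;

        double formula1Kb = trdb * 4.5;              // ~4.5k of db space per record
        double formula2Kb = (d * da + f * fa) * 5.5; // ~5.5k per metadata field

        System.out.printf("Formula #1: %.1f GB%n", formula1Kb / (1024 * 1024));
        System.out.printf("Formula #2: %.1f GB%n", formula2Kb / (1024 * 1024));
    }
}

Run it with any JDK (javac DbSizeEstimate.java && java DbSizeEstimate) and swap in your own estimates to get a first-cut figure, before adding headroom for logging, rollback and redo logs.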

Tuning your Alfresco database

In Alfresco, the default RDBMS configurations are normally not suitable for large repository deployments and may result in:

  • Wrong or improper support for ACID transaction properties
  • I/O bottlenecks in the RDBMS throughput
  • Excessive queue for transactions due to overload of connections
  • On active-active cluster configurations, excessive latency

Considering that your database layer will be used under concurrent load I’ve come up with a set of hints that will contribute to maximize your Alfresco database performance.

Database Thread pool configuration

A default Alfresco instance is configured to use up to a maximum of forty (40) database connections.  Because all operations in Alfresco require a database connection, this places a hard upper limit on the amount of concurrent requests a single Alfresco instance can service (i.e. 40), from all protocols.

Most Java application servers have higher default settings for concurrent access, and this, coupled with other threads in Alfresco (non-HTTP protocol threads, background jobs, etc.) can quickly result in excessive contention for database connections within Alfresco, manifesting as poor performance for users.

It's recommended to increase the maximum size of the database connection pool to at least [number of application server worker threads] + 75. If Tomcat is being considered, this value is normally 275. The setting is called db.pool.max and should be added to your alfresco-global.properties (db.pool.max=275).

After increasing the size of the Alfresco database connection pool, you must also increase the number of concurrent connections your database can handle, to at least the size of the Alfresco connection pool. Alfresco recommends configuring at least 10 more connections to the database than is configured into the Alfresco connection pool, to ensure that you can still connect to the database even if Alfresco saturates its connection pool.

Database Validation query

By default Alfresco does not periodically validate each database connection retrieved from the database connection pool. Validating connections is, however, very important for long running Alfresco servers, since there are various ways database connections can unexpectedly be closed (for example transient network glitches and database server timeouts). Enabling periodic validation of database connections involves adding the db.pool.validate.query property to alfresco-global.properties; the query is specific to your database type.

Database       Value for db.pool.validate.query
MySQL          SELECT 1
PostgreSQL     SELECT VERSION()
Oracle         SELECT 1 FROM DUAL
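As a reference, the two pool settings discussed in this section end up side by side in alfresco-global.properties. The values below are just the examples used in this post (a Tomcat worker pool and a PostgreSQL database); adjust them to your own stack.

# Database connection pool: [application server worker threads] + 75
db.pool.max=275

# Periodic validation of pooled connections (query is database specific, PostgreSQL shown here)
db.pool.validate.query=SELECT VERSION()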

Database Scaling

Alfresco relies largely on a fast and highly transactional interaction with the RDBMS, so the health of the underlying system is vital. Considering our existing customers, the biggest running repositories are under Oracle (most of them RAC).

If your project will have lots of concurrent users and operations, consider an active-active database cluster with at least 2 machines. This can be achieved using Oracle RAC, a MySQL-based solution with haproxy[1] (open source), or a solution with commercial support like MariaDB[2] or Percona[3].

[1]http://haproxy.1wt.eu

[2]https://mariadb.org

[3]http://www.percona.com

You can use either solution, depending also on the knowledge that you have in-house. The golden rule is that the response time from the DB should, in general, be around 4 ms or lower. At this layer I don't recommend using any virtualization technology; the database servers should be physical servers.

The high availability and scalability for database is vendor dependent and should be addressed with the chosen vendor to achieve the maximum performance possible.

Database Monitoring

Monitoring your database performance is very important as it can detect some possible performance problems or scaling needs.

I've identified the following targets that should be monitored and analysed on a regular basis.

  • Transactions
  • Number of Connections
  • Slow Queries
  • Query Plans
  • Critical DM database queries ( # documents of each mime type, … )
  • Database server health (cpu, memory, IO, Network)
  • Database sizing statistics (growth, etc)
  • Peak Period of resource usage
  • Indexes Size and Health

I hope this post can help you to understand the importance of the Alfresco database and that you can make use of it on your sizing exercise. Stay tuned for more Alfresco related posts.

All the best, One love,

Luis

Monitoring your Alfresco solution


Hi folks, this post follows my previous post about capacity planning and it provides you with the tools (and a VMware image ready to run) for you to implement it.

I would like to start with a huge thank you message to Miguel Rodriguez (Alfresco Support Engineer). He is the creator of this monitoring solution and also the person responsible for setting up the VMware image with all the tools, scripts, etc. My hero list just got bigger, Miguel got a place just after Spider-Man and the Silver Surfer :)

Monitoring Alfresco with OpenSource tools

Monitoring your Alfresco architecture is a known best practice. It allows you to track and store all relevant system metrics and events, which can help you to:

  • Troubleshooting possible problems
  • Verify system health
  • Check user behavior
  • Build a robust historical data warehouse for later analysis and capacity planning

This post explains a typical monitoring scenario for an Alfresco deployment, using only open source tools.

I'm proposing a fully open source stack of monitoring tools that together build the global monitoring solution. The solution makes use of the open source products described below.

The solution will monitor all layers of the application, producing valuable data on all critical aspects of the infrastructure. This allows for pro-active system administration, as opposed to a reactive way of facing possible problems: predicting problems before they happen and taking the necessary measures to keep the system healthy on all layers.

I see this approach as both a monitoring and a capacity planning system, providing "near" real-time information updates, customizable reporting and a custom search mechanism over the collected data.

The diagram below shows how the different components of the solution integrate. Note that we centralize data from all nodes and the various layers of the application in a single location.

[Diagram: monitoring solution components and data flow]

The sample architecture being monitored consists of a cluster of two Alfresco/Share nodes for serving user requests and two Alfresco/Solr nodes for indexing/searching content.

Consider 3 major components of the monitoring solution

  • Logstash file tailing to monitor Alfresco log files and logstash command execution to monitor specific components i.e. processes, memory, disk,java stack traces, etc.
  • JavaMelody to monitor applications running in a JVM and other system resources.
  • Icinga to send jmx requests to Alfresco servers.

Dedicated Monitoring Server Download

All software components of the monitoring server are installed on a VMware image that we offer for free (within the open source spirit :)).

You can download your copy of this monitoring server in  http://eu.dl.alfresco.com.s3.amazonaws.com/release/Support/AlfrescoMonitoringVirtualServer/v1.0/AlfrescoMonitoringVirtualServer-1.0.tar 

The image contains the ElasticSearch server that collects all the logs from the various components of the application and hosts the graphical user interfaces (Kibana and Grafana) used to view the monitoring data.

About JavaMelody

JavaMelody is used to monitor Java or Java EE application servers in QA and production environments. It is a tool to measure and calculate statistics on real operation of an application depending on the usage of the application by users. Very easy to integrate in most applications and is lightweight with mostly no impact to target systems.

This tool is mainly based on statistics of requests and on evolution charts; for that reason it is an important add-on to our benchmarking project, as it allows us to see in real time the evolution charts of the most important aspects of our application.

It includes summary charts showing the evolution over time of the following indicators:

  • Number of executions, mean execution times and percentage of errors of http requests, sql requests, jsp pages or methods of business façades (if EJB3, Spring or Guice)
  • Java memory
  • Java CPU
  • Number of user sessions
  • Number of jdbc connections

These charts can be viewed on the current day, week, month, year or custom period.

You can have detailed information about javamelody at https://code.google.com/p/javamelody/

Installing JavaMelody

It's really easy to attach the JavaMelody monitor to all Alfresco applications (alfresco.war or share.war) and to every other web application deployed on your application server.

Step 1

Configure JavaMelody monitoring on the Alfresco Tomcat by copying itextpdf-5.5.2.jar, javamelody.jar and jrobin-1.5.9.1.jar to the Tomcat shared lib folder under <tomcat_install_dir>\shared\lib, or to your application server's global classloader location (if not Tomcat).

Step 2

Edit the global Tomcat web.xml file (e.g. D:\alfresco\tomcat\conf\web.xml) to enable JavaMelody monitoring on every application. Add the following filter:

<filter>
<filter-name>monitoring</filter-name>
<filter-class>net.bull.javamelody.MonitoringFilter</filter-class>
</filter>
<filter-mapping>
<filter-name>monitoring</filter-name>
<url-pattern>/*</url-pattern>
</filter-mapping>
<listener>
<listener-class>net.bull.javamelody.SessionListener</listener-class>
</listener>

And that's about it; after restarting, you can access the monitoring of every application at http://<your_host>:<server_port>/<web-app-context>/monitoring, for example http://localhost:8080/alfresco/monitoring

[Screenshot: JavaMelody monitoring page]

Monitoring Stages Breakdown

Stage 1 – Data Capturing(Logstash)

[Diagram: data capturing stage]

We capture monitoring data using different procedures.

  • Scheduled Jobs (Db queries, Alfresco jmx Beans queries, OS level commands)
  • Log indexing with Logstash. We use Logstash to collect logs, parse them, and send them to ElasticSearch, where they are stored for later use (for example, for searching)
  • The Alfresco audit log (when configured) is also parsed and indexed by ElasticSearch, providing all the enabled audit statistics.
  • Metrics with JavaMelody

Stage 2 – Monitoring Data Archiving(ElasticSearch)

[Diagram: data archiving flow with Logstash, Redis and ElasticSearch]

On the diagram above we can see the flow of data capturing using logstash and Elastic Search. Let’s see some details on each of the boxes on the diagram.

Logstash is a tool for managing events and logs. You can use it to collect logs, parse them, and send them to ElasticSearch, where they are stored for later use (for example, for searching).

Redis is a logs data broker, receiving data from log “shippers” and sending the data to a log “indexer”

ElasticSearch is a distributable, RESTful, free Lucene powered search engine/server

Kibana3 is a tool for displaying and interacting with your data.
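As an illustration of the Logstash, Redis and ElasticSearch flow described above, a minimal log shipper configuration could look like the sketch below. The log path, the monitoring host name and the Redis key are assumptions for this example and are not values taken from the VMware image.

input {
  file {
    # Tail the Alfresco/Tomcat log file (example path, adjust to your installation)
    path => "/opt/alfresco/tomcat/logs/catalina.out"
    type => "alfresco"
  }
}
output {
  redis {
    # Ship events to the Redis broker on the monitoring server (example host and key)
    host => "monitoring-server"
    data_type => "list"
    key => "logstash"
  }
}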

Stage 3 – Trending and Analysis (Kibana,Grafana)

[Diagram: trending and analysis with Kibana and Grafana]

To analyze the data and the trends we install 2 different GUIs on the monitoring server (Kibana and Grafana).

Kibana allows us to check the indexed logs with metadata, and to troubleshoot specific log traces. It provides a very robust search mechanism on top of the ElasticSearch indexes, and delivers strategic technical insights with a global overview of all layers of the platform, providing actionable insights in real time from almost any type of structured or unstructured data source.

On the flow above we can see how the information and statistics get to Grafana.

Grafana is a beautiful dashboard for displaying various Graphite metrics through a web browser. It has enormous potential, it’s easy to setup and to customize for different business needs.

Let’s have a closer look on the remaining components on the flow diagram.

Statsd is a network daemon that listens for statistics, like counters and timers sent over UDP and sends them to Carbon.

Carbon accepts metrics over various protocols and caches them in RAM as they are received, flushing them to disk on an interval using the underlying whisper library.

Whisper provides fast, reliable storage of numeric data over time

Grafana is an easy to use and feature rich Graphite dashboard

Stage 4 – Monitoring

[Screenshot: monitoring dashboards]

We use scheduled commands and index the resulting data in ElasticSearch, checking the following monitoring information from the Alfresco and Solr servers.

  • JVM Memory Usage
  • Server Memory
  • Alfresco Cpu utilization
  • Overall Server Cpu utilization
  • Solr Indexing Information
  • Number of documents on Alfresco “live” store
  • Number of documents on Alfresco “archive” store
  • Number of concurrent users on Alfresco repository
  • Alfresco Database pool occupation
  • Number of active sessions on Alfresco Share
  • Number of active sessions on Alfresco Workdesk
  • Number of busy tomcat threads
  • Number of current tomcat threads
  • Number of maximum tomcat threads

These can be extended at any time, adding monitoring for any target relevant to your use case.

Stage 5 – Troubleshooting

[Diagram: troubleshooting stage]

While troubleshooting we use Kibana/Grafana and JavaMelody.

Kibana allows us to check the "indexed" logs with metadata and verify exactly which classes are related to the problem, as well as the number of occurrences and the root cause of the exceptions.

Grafana shows us what/how/when server resources are being affected by the problem.

JavaMelody provides detailed information on crucial sections of the application. The goal of JavaMelody is to monitor Java or Java EE application servers in QA and production environments.

It produces graphs for Memory, CPU, HTTP Sessions, Threads, GC, JDBC Connections, SQL Hits, Open Files, Disk Space, Network I/O, Statistics for HTTP traffic, Statistics for SQL queries, Thread dumps, JMX Beans information and overall System Information. Java Melody has a Web interface to report on data statistics.

Using these 3 tools, troubleshooting a possible problem becomes a friendly task and the speed of investigations is boosted; gathering all the information needed to get to the root cause of an issue would otherwise normally take ages.

Stage 6 – Notification and Reporting

[Diagram: notification and reporting stage]

We use Icinga in order to notify the delegated alfresco administrator (email) when there is some problem with the Alfresco system. Icinga is an enterprise grade open source monitoring system that keeps watch over networks and resources, notifies the user of errors and recoveries and generates performance data for reporting.

Icinga Web is highly dynamic and laid out as a dashboard with tabs which allow the user to flip between different views that they need at any one time

[Screenshot: Icinga Web dashboard]

Stage 7 – Sizing Adjustments

Sizing will be a human action on top of the capacity and monitoring solution. By performing regular analysis of the monitoring/capacity planning data, we will know exactly when and how we need to scale our architecture.

The more data gets into Elastic Search along the application life cycle, the more accurate the capacity predictions become, because they represent the "real" application usage during the defined period.

This plays a very important role when modeling and sizing the architecture for future business requirements.

7.1 – Peak Period Methodology

The peak period methodology is the most efficient way to implement a capacity planning strategy, as it allows you to analyze vital performance information when the system is under the most load/stress. In essence, the peak period methodology collects and analyzes data during a configurable peak period. This allows us to estimate the number of CPUs, the memory and the number of cluster nodes required on the different layers of the application to support a given expected load.

The peak period may be an hour, a day, 15 minutes or any other period that is used to analyze the collected utilization statistics. Assumptions may be estimated based on business requirements or specific benchmarks of a similar implementation.

Your monitoring targets on an Alfresco installation

I've identified the following targets as candidates to participate in the monitoring system and have their data indexed and stored in Elastic Search.

Database

  • Transactions
  • Number of Connections
  • Slow Queries
  • Query Plans
  • Critical DM database queries ( # documents of each mime type, … )
  • Database server health (CPU, memory, IO, network)
  • Database statistics integration
  • Database sizing statistics ( growth, etc)
  • Peak Period

Application Servers (Tomcats)

  • Request and response times
  • Access logs (number of concurrent requests, number of concurrent users, etc)
  • Cpu
  • Io
  • Memory
  • Disk Space Usage
  • Peak period
  • Longest Request
  • Threads ( Concurrent Threads, Busy Threads )

Application JVM

  • Jvm Settings Analysis
  • GC Analysis
  • Log analysis (errors, exceptions, warnings, class segmentation (authorization, permissions, authentication))
  • Auditing Enabling and Analysis (Logins, Reads, Writes, Changed Permissions, Workflows Running, Workflows States)
  • Caches Monitoring (Caches usage, invalidation, cache sizes )
  • Protocol analysis (FTP, CMIS, SharePoint, WebDAV, IMAP, CIFS)
  • Architecture analysis

Search Subsystem(Solr)

  • Jmx Beans Monitorization
  • Caches ( Configuration, Utilization, Tuning, Inserts, Evictions and Hits )
  • Indexes Health
  • Jvm Settings Analysis
  • Jvm Health Analysis
  • Garbage collection Analysis
  • Query Debug (Response times, query analysis, slow queries, Peak periods)
  • Search and Index Memory Usage

Network

  • Input/Output
  • High availability
  • Tcp Errors / Network errors at Network protocol level
  • Security Analysis ( Ports open, Firewalls, network topology , proxies, encryption )

Shared File Systems

  • Networking to clients hosts
  • Storage Type ( SAN, NAS )
  • I/O

Clustering

  • Cluster members subscription analysis
  • Cluster cache invalidation strategy and shared caches performance
  • Cluster load balancing algorithm performance (cluster nodes load distribution)

The Alfresco Audit Trail

The monitoring solution also uses and indexes the Alfresco audit trail log, when audit is enabled. Alfresco audit should be used with caution as auditing too many events may have a negative impact on performance.

Alfresco has the option of enabling and configuring an audit trail log. It stores specific user actions (configurable) on a dedicated log file (audit trail).

Building on the auditing architecture, the data producer org.alfresco.repo.audit.access.AccessAuditor gathers lower-level events together into user-recognizable events. For example, the download or preview of content is recorded as a single read. Similarly, the upload of a new version of a document is recorded as a single create version. By contrast, the AuditMethodInterceptor data producer would typically record multiple events.

A default audit configuration file located at <alfresco.war>/WEB-INF/classes/alfresco/audit/alfresco-audit-access.xml is provided that persists audit data for general use. This may be enhanced to extract additional data of interest to specific installations. For ease of use, login success, login failure and logout events are also persisted by the default configuration.

Default audit filter settings are also provided for the AccessAuditor data producer, so that internal events are not reported. These settings may be customized (by setting global properties) to include or exclude auditing of specific areas of the repository, users or some other value included in the audit data created by AccessAuditor.
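For reference, switching this audit data producer on is done through global properties. A minimal sketch for alfresco-global.properties could look like the following; these two property names are the standard ones for Alfresco 4.x, and any filter properties you layer on top of them depend on your use case.

# Enable the audit subsystem and the alfresco-access data producer
audit.enabled=true
audit.alfresco-access.enabled=true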

No additional functionality is provided for the retrieval of persisted audit data, as all data is stored in the standard way and so is accessible via the AuditService search, the audit web scripts, database queries and the Alfresco Explorer show_audit.ftl preview.

Detailed information on Audit possibilities available at:

And that’s about it folks, i hope you liked this article and that it can help you on monitoring your projects. More articles with relevant information from the field are coming up, so stay tuned.

All the best, One Love,

Luis

 

A formula for your Alfresco throughput


Hi all, back with another interesting topic, this time with a methodology to help you determine the throughput of your Alfresco repository server. I call it the CAR :)


C.A.R. = Capacity of an Alfresco Repository server

To know the capacity of an Alfresco repository server we can introduce the concept of "transactions per second", where a "transaction" is considered to be a basic repository operation (create, browse, download, update or delete).


The “C.A.R.” methodology

The aim is to define a common, standard figure that can empirically describe the capacity of a repository instance.

The name (C.A.R.) stands for “Capacity of an Alfresco Repository”

The C.A.R. methodology is based on the following sentence, and it is expressed in transactions per second.

The capacity of an Alfresco repository server is determined by the number of transactions it can handle in a single second before degrading the expected performance.


To create a formula that reflects that sentence we need to introduce the following figures:

EC = The expected concurrency represented in number of users.

TT = user think time, expressed in seconds. It means that, on average, over a given period of time (the "think time") the system will receive requests from N different users.

ERT = Expected response times object.

Decreasing the ERT generally means the capacity of the Alfresco repository server has to increase.

This is a complex object, represented as key-value pairs with the types of response times being considered, the weight of each type and the corresponding value in seconds. It takes expected user behavior as its arguments.

Sample ERT

Operation       Value        Weight
Download        2 seconds    20
Write/Upload    2 seconds    10
Search          5 seconds    10
Browse/Read     5 seconds    60

When we decrease our ERT arguments values we normally will need to scale (up/out) our Alfresco and database Servers.

We can say that C.A.R.of an Alfresco repository server is :

The number of transactions that the server can handle in one second under the expected concurrency (EC), with the agreed think time (TT), while ensuring the expected response times (ERT).

We can further refine our definitions by considering that "create/update" operations are more resource-consuming than "read" and "delete" operations, on both the repository and the database server.

Shape shifting – a flexible formula approach

The C.A.R. formula is not deterministic, as it cannot reflect the variables of every use case. It is dynamic and specific to each use case, and it is built on a system of attributes, values, weights and affected areas.

To have a definition of the formula that really represents the capacity of the Alfresco servers on your infrastructure, you need to consider one or more ERT (expected response times) objects, representing the use case's expected response times for its specific operations. Those objects act as increasers (+) and reducers (-) of the server throughput.

The formula can shift and may be adapted with more ERT Objects that define the use case for fine tuned predictions.

Obtaining the current C.A.R. on Alfresco

The easiest way is to enable the audit trail and parse it, counting the transactions occurring on the repository. With the new reporting and analysis features coming up in Alfresco One 5.0 it will be even simpler to get access to this information.

Some initial lab tests

We've executed some simple lab tests, configured with one server running Alfresco and another running the database, and observed the capacity of a single server.

Alfresco Server Details

  • Processor: 64-bit Intel Xeon 3.3Ghz (Quad-Core)
  • Memory: 8GB RAM
  • JVM: 64-bit Sun Java 7 (JDK 1.7)
  • Operating System: 64-bit Red Hat Linux

Test Details

  • ERT = The sample ERT values shown on this post
  • Think Time = 30 seconds.
  • EC = 150 users

The C.A.R. of the server was between 10-15 TPS during usage peaks. Through JVM tuning along with network and database optimizations, this number can rise over 30 TPS.
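A back-of-the-envelope way to relate these figures: if each of the EC users issues, on average, one repository transaction per think time TT, the sustained load the server has to absorb is roughly EC / TT transactions per second, which can then be compared against the measured C.A.R. The sketch below only illustrates that reasoning with the figures from this test; the class name and the weighting calculation are mine and are not part of the methodology's formal definition.

/** Back-of-the-envelope comparison of expected load against a measured C.A.R. (illustrative only). */
public class CarEstimate {

    public static void main(String[] args) {
        int ec = 150;              // EC - expected concurrency (users), from the test details above
        double tt = 30.0;          // TT - think time in seconds, from the test details above
        double measuredCar = 12.0; // measured C.A.R. in TPS (mid-point of the 10-15 TPS observed)

        // If each of the EC users issues one transaction per think time on average,
        // the sustained load is roughly EC / TT transactions per second.
        double requiredTps = ec / tt;

        // Weighted average of the sample ERT (value in seconds, weight as a percentage).
        double[][] ert = { {2, 20}, {2, 10}, {5, 10}, {5, 60} }; // download, write/upload, search, browse/read
        double weightedErt = 0;
        for (double[] op : ert) {
            weightedErt += op[0] * op[1] / 100.0;
        }

        System.out.printf("Required load: %.1f TPS, measured C.A.R.: %.1f TPS%n", requiredTps, measuredCar);
        System.out.printf("Weighted expected response time: %.1f s%n", weightedErt);
    }
}

With these example figures the required load (around 5 TPS) sits comfortably below the measured 10-15 TPS, which is consistent with the lab results above.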

I think this is a very reliable way of defining the capacity of an Alfresco repository, and it can be used to support a sizing exercise. What do you think? Opinions are welcome and highly appreciated, use the comments area on the blog to add yours!

“One Love”,

Luis

Super Sizing your Alfresco Repository


Hi everyone, i’m back to share with you a very interesting tool that can help you on your tests and benchmarks.

“have you ever wanted to test/benchmark your alfresco project implementation with millions  of documents before you deliver it for its go-live stage ?”

I'm sure you have. It's normally not that easy to create a big number of dummy documents, with corresponding metadata fields, that can emulate what will be present in production. To solve this problem we've created a tool (open source, as always :) ) that can help you to do just that. Many thanks to Alex Strachan from Alfresco support, who wrote the user interface for this tool.

The tool is named “SuperSizeMyRepo” and its available at https://github.com/lcabaceira/supersizemyrepo. It’s a multi-thread tool that enables you to create a huge amount (Millions) of (bulk-import-ready) content and metadata for your alfresco repository.

Types of documents created

  • MS Word Documents (.doc) with an average size of 1024k
  • MS Excel Documents(.xls) with  average size of 800k
  • Pdf documents(.pdf) with average size of 10MB
  • MS PowerPoint Presentation Documents(.ppt) with average size of 5MB
  • Jpeg images(.jpg) with average size of 2MB

All documents are created with their correspondent meta-data xml properties file.

[Screenshot: SuperSizeMyRepo user interface]

Configuring the Documents meta-data

As you can see from the UI screenshot above, you can manually configure the values of the metadata fields that will be created for the documents, and, even more interesting, you can inject aspects directly into the document creation.

Injecting Aspects

You can also edit the field names, meaning that if you specify custom aspects, you can configure the remaining fields to have the properties names of the attributes present on your custom aspects.

How about Indexing ?

We’ve also thought about testing the search sub-system (Solr or Lucene) with big amounts of data. For this reason the documents are created with lots of random words that will get indexed into Solr or Lucene. This way you can test both the repository and the search layer.

What are the images for ?

To have the documents created with the announced sizes we also needed to include some random images. We provide a set of images that you can download and use as your local library for document creation; this set of random images is available here. You can also use your own set of images, as long as they are all JPGs and they are present in the root of the images folder.

What is the deployment folder ?

The deployment folder is where your documents will be created; normally this is a place inside your contentstore, so that you can perform an in-place bulk import, one of the fastest ways to inject lots of content into your Alfresco repository. You can specify any folder for the document creation.

Maximum number of files per folder

When you import the documents, the folder structure (if any) will also be imported. According to Alfresco best practices, having a huge number of documents in the same folder can lead to performance degradation, mainly because of the ACL permission checking that happens when a user browses a folder: Alfresco needs to determine which documents can be shown to the user, and for that it needs to verify the permissions of each item in that directory. To reduce this overhead, we've introduced the option to specify a maximum number of documents that the tool can create in a single folder; when this number is reached the tool creates new folders and the new documents are created in those folders.

JumpStart with the compiled version

If you wish to run the compiled version (available in the uiJars folder) there are no pre-requirements apart from having java installed on your server to be able to execute a jar file.

Download the jar file for your OS, currently the UI is released for 3 different OS.

Note for MacOs users :  To execute the jar you should open a terminal and run it with the -XstartOnFirstThread option like the example below :

java -XstartOnFirstThread -jar ./ssmr-ui-1.0.3-osx-jar-with-dependencies.jar

Want to take the Deep dive approach ?

Great, i would like to take this opportunity to invite you to participate on this project and to contribute with new features and your own ideas. This section provides guidance for you to download the source code and build it yourself.

1 – Software requirements

  • JDK 1.7
  • Apache Maven 3.0.4+

2 – Configuration requirements

During the installation of Maven, a new file named settings.xml was created. This file is the entry point to your local Maven settings configuration, including the remote Maven repositories. Edit your settings.xml file and update the servers section, including the Alfresco server ids and your credentials.

Note that the root pom.xml references 2 different repositories: alfresco-public and alfresco-public-snapshots. The id of each repository must match a server id in your settings.xml (where you specify your credentials for that server).

Section from configuration settings.xml

 
        <server>
            <id>alfresco-public</id>
            <username>YOUR_USERNAME</username>
            <password>YOUR_PASSWORD</password>
        </server>
        <server>
            <id>alfresco-public-snapshots</id>
            <username>YOUR_USERNAME</username>
            <password>YOUR_PASSWORD</password>
        </server>

Section from pom.xml

 <repository>
            <id>alfresco-public</id>
            <url>https://artifacts.alfresco.com/nexus/content/groups/public</url>
 </repository>
  <repository>
            <id>alfresco-public-snapshots</id>
            <url>https://artifacts.alfresco.com/nexus/content/groups/public-snapshots</url>
   </repository>

3 – Location/Path Where to create the files

Edit the src/main/java/super-size-my-repo.properties and configure your deployment location and the images location.

files_deployment_location : Should be a place inside your contentstore. This will be the root for the in-place bulk import.

images_location : The tool randomly chooses from a folder of local images to include in the various document types. You need to set images_location to a folder where you have jpg images. You can use the sample images by pointing images_location to your /images folder. The bigger your images are, the bigger your target documents will be; for the document sizes considered here we expect jpg images of approximately 1.5 MB.

Tool Configuration files and options

You can find the tool configuration file under src/main/java/super-size-my-repo.properties. This configuration file contains the following self-explanatory properties.

files_deployment_location=<PATH_WHERE_THE_FILES_WILL_BE_CREATED>
images_location=<DEFAULT_LOCATION_FOR_BASE_IMAGES>
num_Threads=<NUMBER_OF_THREADS_TO_EXECUTE>
threadPoolSize=<SIZE_OF_THE_THREAD_POOL>
max_files_per_folder=<NUMBER_OF_MAX_FILES_IN_A_SINGLE_FOLDER>

The only 2 properties that are mandatory to adjust are files_deployment_location and images_location. All of the other properties have default running values.

How to run with maven ?

Issue the following maven command to generate the targets (executable jar) from the project root.

P.S. – Don’t forget to configure your properties file.

# mvn clean install

This will build and generate the executable jar on the target directory.

To run this jar, just type :

java -jar super-size-my-repo-<YOUR_VERSION>-SNAPSHOT-jar-with-dependencies.jar

Next Steps ?

After running the tool, you will have lots of documents to import using the Alfresco bulk importer. To perform an in-place import, you need to set files_deployment_location to a location inside your contentstore.

Now you can execute the in-place bulk import action to add all the documents and their corresponding metadata to the target Alfresco repository.

The Streaming bulk import url on your alfresco is : http://localhost:8080/alfresco/service/bulkfsimport

The in-place bulk import url on your alfresco is : http://localhost:8080/alfresco/service/bulkfsimport/inplace

Note that you may need to adjust localhost and the 8080 port to your server details if you are not running Alfresco locally or you're not running Alfresco on the default 8080 port.

Check http://wiki.alfresco.com/wiki/Bulk_Importer for more details.

And that is it folks; if you would like to contribute to the evolution of this tool, send me an email and I will add you as a contributor with commit rights to the GitHub repository.

I hope you enjoyed this article as much as i enjoyed writing it. I wish you can make use of this nice tool. Stay tuned for more Alfresco related articles and don’t forget to support open-source projects.

OpenSource – Together we are stronger, One Love

Luis


Alfresco Behaviours and Policies


The power of Alfresco Behaviours and Policies

Hi all, back with another post after some busy months in the field during which I just could not find the time to share a post or two. Since I truly believe that technology know-how is to be shared, this time I will write about a very powerful feature of Alfresco: Behaviours and Policies.

I believe that an effective ECM project should be more focused on how creative we are while interacting with the technology than on the technology itself. It may sound weird at first, but the idea behind that thought is "Trust the technology. Use it creatively". This invites us to redirect most of our implementation focus to how creatively and effectively we use the technology while implementing our business requirements and facing our project challenges.

Because the Alfresco technology was designed with extensibility and integration in mind, there are lots of different ways to reach the same business goals. Choosing which component should implement a specific goal can be hard. Over my years of consulting practice I've seen lots of "killing flies with machine guns" scenarios that could have been implemented with a simpler (sometimes much cheaper) approach. When I joined Alfresco, after a deep dive into the technology, I felt like a kid at a party with a huge table full of delicious flavours who just can't make up his mind on what to eat.

If you think about it, the success factors of a project are not just "how good and optimised is my code, how fast are my servers performing, how effective are my processes". With technologies such as Alfresco we must also factor in "how creative and effective were my choices within my chosen technology". I could write a very long post on this topic alone, but you are actually reading a post about Alfresco Behaviours and Policies, a very powerful Alfresco feature that is sometimes forgotten or not considered in important implementation decisions.

Imagine a business requirement defining that specific mime-types (such as, for example, big video files and audio files) should automatically be stored on a different (cheaper) disk than all the other content. How can we accomplish this? Hopefully I will be able to explain it to you (and provide you with the code for it) in this post.

Alfresco allows you to fire automated actions over content in its repository. There are many ways of automating those actions on content (rules, scheduled tasks, policies…). In this post we will focus only on behaviours, which are pieces of business logic bound to repository policies and events.


There is a set of policies that are called from the Alfresco services. For example, the policies available in NodeService are listed in the table below. Note the self-explanatory inner interface names, which help deduce the events that trigger them.

Interface Inner Interface
NodeServicePolicies BeforeCreateStorePolicy
OnCreateStorePolicy
BeforeCreateNodePolicy
OnCreateNodePolicy
BeforeMoveNodePolicy
OnMoveNodePolicy
BeforeUpdateNodePolicy
OnUpdateNodePolicy
OnUpdatePropertiesPolicy
BeforeDeleteNodePolicy
OnDeleteNodePolicy
BeforeAddAspectPolicy
OnAddAspectPolicy
BeforeRemoveAspectPolicy
OnRemoveAspectPolicy
OnRestoreNodePolicy
BeforeCreateNodeAssociationPolicy
OnCreateNodeAssociationPolicy
OnCreateChildAssociationPolicy
BeforeDeleteChildAssociationPolicy
OnDeleteChildAssociationPolicy
OnCreateAssociationPolicy
OnDeleteAssociationPolicy
BeforeSetNodeTypePolicy
OnSetNodeTypePolicy

There are also Policies for  ContentService,  CopyService and VersionService and some others. If you search the Alfresco source code using “*Policies” pattern you will also find policies like CheckOutCheckInServicePolicies, LockServicePolicies, NodeServicePolicies, TransferServicePolicies, StoreSelectorPolicies , AsynchronousActionExecutionQueuePolicies and RecordsManagementPolicies

An Alfresco behaviour is simply a Java class that implements one of the policy interfaces. One advantage of using behaviours over rules is that behaviours are applied globally to the repository, while rules can be disabled by configuration (such as in bulk import scenarios). Compared to a scheduled task, a behaviour is applied in real time, while a scheduled task executes at a configured timestamp.

In a nutshell :

  • The Alfresco repository lets you inject behaviour into your content.
  • Custom behaviours serve as method handlers for node events (policies), they can make the repository react to changes
  • Behaviours can be bound to an event (policy) for a particular class (type, aspect, association)
  • Policies can extend Alfresco beyond content models and make extensions smarter by encapsulating features and business logic.


At the end of this post i will provide deeper technical details on Policy and Behaviours, but now lets focus on our practical and usable example that you can actually use on your projects.

A practical example

Let’s get back to our business requirement that says specific mime-types (such as for example big video files and audio files) should be stored on a different (cheaper) disk than all the other contents.

Step 1 – The content store selector facade

Since we will have more than one content store, we will make use of the well-known content store selector facade. Alfresco manages the storage of binaries through one or more content stores, each of which is associated with a single location on a file system accessible to Alfresco, e.g. //data/alfresco_content. The Alfresco content store selector allows content to be directed to specific physical stores based upon the appropriate criteria (folder rules or policies).

Full documentation of the Content Store Selector can be found below with the appropriate configuration examples: http://docs.alfresco.com/4.2/concepts/store-manage-content.html

1.1 – Creating of a new content store

Let's start by creating a new content store for the media files. Create a new directory under your <alf_data> directory, or mount a file system that you want to use for the media files. We will call this store mediastore. Now we need to make Alfresco aware of the new store by adding some configuration.

In <tomcat_dir>/shared/classes/alfresco/extension create a new Spring context file and name it content-store-selector-context.xml.
Paste the following configuration XML:
<?xml version='1.0' encoding='UTF-8'?>
 <!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
 <beans>
 <bean id="mediaSharedFileContentStore" class="org.alfresco.repo.content.filestore.FileContentStore">
 <constructor-arg>
 <value><PATH_TO_YOUR_MEDIA_STORE></value>
 </constructor-arg>
 </bean>
 <bean id="storeSelectorContentStore" parent="storeSelectorContentStoreBase">
 <property name="defaultStoreName">
 <value>default</value>
 </property>
 <property name="storesByName">
 <map>
 <entry key="default">
 <ref bean="fileContentStore" />
 </entry>
 <entry key="mediastore">
 <ref bean="mediaSharedFileContentStore" />
 </entry>
 </map>
 </property>
 </bean>
 <!-- Point the ContentService to the 'selector' store -->
 <bean id="contentService" parent="baseContentService">
 <property name="store">
 <ref bean="storeSelectorContentStore" />
 </property>
 </bean>
 <!-- Add the other stores to the list of stores for cleaning -->
 <bean id="eagerContentStoreCleaner" class="org.alfresco.repo.content.cleanup.EagerContentStoreCleaner" init-method="init">
 <property name="eagerOrphanCleanup" >
 <value>${system.content.eagerOrphanCleanup}</value>
 </property>
 <property name="stores" >
 <list>
 <ref bean="fileContentStore" />
 <ref bean="mediaSharedFileContentStore" />
 </list>
 </property>
 <property name="listeners" >
 <ref bean="deletedContentBackupListeners" />
 </property>
 </bean>
 </beans>

1.2 – Configuration of the cm:storeSelector aspect

Now we need to make the Alfresco clients aware of the multiple content stores; to do that we need to configure the cm:storeSelector aspect.
For the Alfresco Explorer client, in <tomcat_dir>/shared/classes/alfresco/web-extension rename the Spring context file web-client-config-custom.xml.sample to web-client-config-custom.xml and configure the cm:storeSelector aspect as follows:

<!-- Configuring in the cm:storeSelector aspect -->
<config evaluator="aspect-name" condition="cm:storeSelector">
 <property-sheet>
 <show-property name="cm:storeName" component-generator="StoreSelectorGenerator" />
 </property-sheet>
</config>
<config evaluator="string-compare" condition="Action Wizards">
 <aspects>
 <aspect name="cm:storeSelector"/>
 </aspects>
</config>

Next we need to merge the following XML snippet into our share-config-custom.xml file to make the content store selector aspect visible in Share.

<!-- Configuring in the cm:storeSelector aspect -->
 <config evaluator="node-type" condition="cm:content">
 <forms>
 <form>
 <field-visibility>
 <!-- aspect: cm:storeSelector -->
 <show id="cm:storeName" />
 </field-visibility>
 <appearance>
 <!-- Store Selector -->
 <field id="cm:storeName" label="Store Name" description="Content Store Name" />
 </appearance>
 </form>
 </forms>
 </config>
 <config evaluator="string-compare" condition="DocumentLibrary" replace="true">
 <aspects>
 <!-- Aspects that a user can see -->
 <visible>
 <aspect name="cm:storeSelector" />
 </visible>
 </aspects>
 </config>

1.3 – Some Simple ways of using the new content store

The new content store is set using the cm:storeName property. The cm:storeName property can be set in a number of ways:

  • Manually, by exposing this property so its value can be set by either Explorer or Share
  • Running a script action that sets the cm:storeName property value within the script
  • Using a rule that runs a script action to set the property
  • Using a Behaviour that automates the choice of the store based on the mime-type of the content (we will see how during this post)

The default behaviour is as follows:
  • When the cm:storeSelector aspect is not present or is removed, the content is copied to a new location in the 'default' store
  • When the cm:storeSelector aspect is added or changed, the content is copied to the named store
  • Under normal circumstances, a trail of content will be left in the stores, just as it would be if the content were being modified. The normal processes to clean up the orphaned content will be followed.

To automate the store classification we can write a simple script in JavaScript and call it action_mediastore.js. The script contents would be:

var props = new Array(1);
props["cm:storeName"] = "mediastore";
document.addAspect("cm:storeSelector", props);
document.save();

We would then save the script in Data Dictionary/Scripts. Note that the script above is adding the storeSelector aspect and assigning a value (in this case mediastore) to the property.
Now we can execute the action over any file or folder and we select “Execute Script”. We then select our script action_mediastore.js

Step 2 – Coding the new behaviour

We will build a custom content policy (behaviour) that, depending on the document's mime-type, will apply the content store selector aspect to it and choose the appropriate content store.

Using the Alfresco sdk, start a new repository amp project (I will focus on how to use the alfresco sdk on a different post).

The first thing we need is to create a behaviour bound to the OnContentUpdatePolicy. Note that metadata detection (including mime-type) happens post-commit, which is why we need to use the OnContentUpdatePolicy: ContentServicePolicies.OnContentUpdatePolicy is fired *after* the Tika detection of the mimetype. For our use case we will be using onContentUpdate when newContent = true.

2.1 – Behaviour class implementing OnContentUpdatePolicy

Our class definition will be as follows :

public class SetStoreByMimeTypeBehaviour extends TransactionListenerAdapter
 implements ContentServicePolicies.OnContentUpdatePolicy {

Check that we are implementing one of the ContentService policies , in this case the OnContentUpdatePolicy

Next we define the properties that we will inject via Spring .

 private PolicyComponent policyComponent;

 private ServiceRegistry serviceRegistry;

 private Map<String, String> mimeToStoreMap;

We need to provide our class with the correspondent getters and setters for the properties that will be injected.

 

public PolicyComponent getPolicyComponent() {
 return policyComponent;
 }

 public void setPolicyComponent(PolicyComponent policyComponent) {
 this.policyComponent = policyComponent;
 }

 public ServiceRegistry getServiceRegistry() {
 return serviceRegistry;
 }

 public void setServiceRegistry(ServiceRegistry serviceRegistry) {
 this.serviceRegistry = serviceRegistry;
 }

 public void setMimeToStoreTypeMap(Map<String, String> mimeToModelTypeMap) {
 this.mimeToStoreMap = mimeToModelTypeMap;
 }

 public Map<String, String> getMimeToStoreTypeMap() {
 return mimeToStoreMap;
 }

The init method is one of the methods we need to provide to implement the behaviour. This method initiates the behaviour and registers the class with the chosen policy. In this case we are doing it for all content nodes (ContentModel.TYPE_CONTENT = cm:content).

Note the important NotificationFrequency.FIRST_EVENT. Behaviours can be defined with a notification frequency – “every event” (default), “first event”, “transaction commit”. In this case, we want the behavior to fire only on first event. Consider that during a given transaction, certain policies may fire multiple times (ie. “every event”).

 public void init() {
 if (log().isDebugEnabled()) {
 log().debug("Initializing Behavior");
 }
 this.onContentUpdate = new JavaBehaviour(this, "onContentUpdate", NotificationFrequency.FIRST_EVENT);
 this.policyComponent.bindClassBehaviour(QNAME_ONCONTENTUPDATE, ContentModel.TYPE_CONTENT, onContentUpdate);
 }

Last, but not least we need to override the onContentUpdate method to implement our logic. We get the mimeType of the node and assign the content a specific store depending on that. Note that the store name is coming from the mapping implemented on the spring bean configuration (explained on the next section)

 @Override
 public void onContentUpdate(NodeRef nodeRef, boolean newContent) {
 if (log().isDebugEnabled()) {
 log().debug("onContentUpdate, new[" + newContent + "]");
 }

 NodeService nodeService = serviceRegistry.getNodeService();
 ContentData contentData = (ContentData) nodeService.getProperty(nodeRef, ContentModel.PROP_CONTENT);
 String nodeMimeType = contentData.getMimetype();
 log().debug("nodeMimeType is " + nodeMimeType);

 QName storeName = getQNameMap().get(nodeMimeType);

 if (storeName != null) {
 log().debug("storeName is " + storeName.toString());
 String name = storeName.toString().substring(2,storeName.toString().length());
 log().debug("Stripped storeName is " + name);
 // add the aspect
 Map storeSelectorProps = new HashMap(1, 1.0f);
 storeSelectorProps.put(QName.createQName(NamespaceService.CONTENT_MODEL_1_0_URI,"storeName"), name);
 nodeService.addAspect(nodeRef, QName.createQName(NamespaceService.CONTENT_MODEL_1_0_URI, "storeSelector"), storeSelectorProps);

 // Extract meta-data here because it doesn't happen automatically when imported through FTP (for example)
 ActionService actionService = serviceRegistry.getActionService();
 Action extractMeta = actionService.createAction(ContentMetadataExtracter.EXECUTOR_NAME);
 actionService.executeAction(extractMeta, nodeRef);
 }
 else {
 log().debug("No specific store configured for mimetype [" + nodeMimeType + "]");
 }
 }

The full source code for our class is the following:

package org.alfresco.consulting.behaviours.mimetype;

import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;

import org.alfresco.model.ContentModel;
import org.alfresco.repo.action.executer.ContentMetadataExtracter;
import org.alfresco.repo.content.ContentServicePolicies;
import org.alfresco.repo.policy.Behaviour;
import org.alfresco.repo.policy.Behaviour.NotificationFrequency;
import org.alfresco.repo.policy.JavaBehaviour;
import org.alfresco.repo.policy.PolicyComponent;
import org.alfresco.repo.transaction.TransactionListenerAdapter;
import org.alfresco.service.ServiceRegistry;
import org.alfresco.service.cmr.action.Action;
import org.alfresco.service.cmr.action.ActionService;
import org.alfresco.service.cmr.repository.ContentData;
import org.alfresco.service.cmr.repository.NodeRef;
import org.alfresco.service.cmr.repository.NodeService;
import org.alfresco.service.namespace.NamespaceService;
import org.alfresco.service.namespace.QName;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

/**
 * Behavior that casts the node type to the appropriate store
 * based on the applied mime type.
 * @author Luis Cabaceira
 *
 */
public class SetStoreByMimeTypeBehaviour extends TransactionListenerAdapter
 implements ContentServicePolicies.OnContentUpdatePolicy {

 private static final QName QNAME_ONCONTENTUPDATE = QName.createQName(NamespaceService.ALFRESCO_URI, "onContentUpdate");

 private Behaviour onContentUpdate;

 private PolicyComponent policyComponent;

 private ServiceRegistry serviceRegistry;

 private Map<String, String> mimeToStoreMap;

 private Map<String, QName> qnameMap;

 public void init() {
 if (log().isDebugEnabled()) {
 log().debug("Initializing Behavior");
 }
 this.onContentUpdate = new JavaBehaviour(this, "onContentUpdate", NotificationFrequency.FIRST_EVENT);
 this.policyComponent.bindClassBehaviour(QNAME_ONCONTENTUPDATE, ContentModel.TYPE_CONTENT, onContentUpdate);
 }

 /*
 * (non-Javadoc)
 * @see org.alfresco.repo.content.ContentServicePolicies.OnContentUpdatePolicy#onContentUpdate(org.alfresco.service.cmr.repository.NodeRef, boolean)
 */
 @Override
 public void onContentUpdate(NodeRef nodeRef, boolean newContent) {
 if (log().isDebugEnabled()) {
 log().debug("onContentUpdate, new[" + newContent + "]");
 }

 NodeService nodeService = serviceRegistry.getNodeService();
 ContentData contentData = (ContentData) nodeService.getProperty(nodeRef, ContentModel.PROP_CONTENT);
 String nodeMimeType = contentData.getMimetype();
 log().debug("nodeMimeType is " + nodeMimeType);

 QName storeName = getQNameMap().get(nodeMimeType);

 if (storeName != null) {
 log().debug("storeName is " + storeName.toString());
 String name = storeName.toString().substring(2,storeName.toString().length());
 log().debug("Stripped storeName is " + name);
 // add the aspect
 Map<QName, Serializable> storeSelectorProps = new HashMap<QName, Serializable>(1, 1.0f);
 storeSelectorProps.put(QName.createQName(NamespaceService.CONTENT_MODEL_1_0_URI,"storeName"), name);
 nodeService.addAspect(nodeRef, QName.createQName(NamespaceService.CONTENT_MODEL_1_0_URI, "storeSelector"), storeSelectorProps);

 // Extract meta-data here because it doesn't happen automatically when imported through FTP (for example)
 ActionService actionService = serviceRegistry.getActionService();
 Action extractMeta = actionService.createAction(ContentMetadataExtracter.EXECUTOR_NAME);
 actionService.executeAction(extractMeta, nodeRef);
 }
 else {
 log().debug("No specific store configured for mimetype [" + nodeMimeType + "]");
 }
 }

 /**
 *
 * @return
 */
 private Map<String, QName> getQNameMap() {
 if (qnameMap == null) {
 qnameMap = new HashMap<String, QName>();
 // Pre-resolve QNames...
 for (Entry<String,String> e : mimeToStoreMap.entrySet()) {
 QName qname = this.qnameFromMimetype(e.getKey());
 if (qname != null) {
 qnameMap.put(e.getKey(), qname);
 }
 }
 }
 return qnameMap;
 }

 /**
 *
 * @param mimeType
 * @return
 */
 private QName qnameFromMimetype(String mimeType) {
 QName qname = null;

 String qNameStr = mimeToStoreMap.get(mimeType);
 qname = QName.createQName(qNameStr, serviceRegistry.getNamespaceService());
 return qname;
 }

 public PolicyComponent getPolicyComponent() {
 return policyComponent;
 }

 public void setPolicyComponent(PolicyComponent policyComponent) {
 this.policyComponent = policyComponent;
 }

 public ServiceRegistry getServiceRegistry() {
 return serviceRegistry;
 }

 public void setServiceRegistry(ServiceRegistry serviceRegistry) {
 this.serviceRegistry = serviceRegistry;
 }

 public void setMimeToStoreTypeMap(Map<String, String> mimeToModelTypeMap) {
 this.mimeToStoreMap = mimeToModelTypeMap;
 }

 public Map<String, String> getMimeToStoreTypeMap() {
 return mimeToStoreMap;
 }

 protected Log log() {
 return LogFactory.getLog(this.getClass());
 }

}

2.2 – Registering the behaviour with Spring

Next step is to register our behavior with Spring, for that we will need a context file (service-context.xml) that will register our bean and that has the mapping between the mimetype of the content and the correspondent store. Note the mapping of the several video formats to the new mediaStore.

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>

<beans>
 <!-- -->
 <!-- SetStoreByMimeTypeBehaviour -->
 <!-- -->
 <bean id="mime-based-store-selector-behavior"
 class="org.alfresco.consulting.behaviours.mimetype.SetStoreByMimeTypeBehaviour"
 init-method="init" depends-on="dictionaryBootstrap">
 <property name="policyComponent" ref="policyComponent" />
 <property name="serviceRegistry" ref="ServiceRegistry" />
 <!-- stores to mimetype map -->
 <property name="mimeToStoreTypeMap">
 <map>
 <entry key="video/mpeg"><value>mediaStore</value></entry>
 <entry key="audio/mpeg"><value>mediaStore</value></entry>
 <entry key="audio/mp4"><value>mediaStore</value></entry>
 <entry key="video/mp4"><value>mediaStore</value></entry>
 <entry key="video/x-m4v"><value>mediaStore</value></entry>
 <entry key="video/mpeg2"><value>mediaStore</value></entry>
 <entry key="video/mp2t"><value>mediaStore</value></entry>
 <entry key="video/quicktime"><value>mediaStore</value></entry>
 <entry key="video/3gpp"><value>mediaStore</value></entry>
 <entry key="video/3gpp2"><value>mediaStore</value></entry>
 <entry key="video/x-sgi-movie"><value>mediaStore</value></entry>
 </map>
 </property>
 </bean>
</beans>

2.3 – Deploy and Test

Package your class and your context file into an AMP file (using the Alfresco SDK) and deploy it to your repository with apply-amps.sh (which uses alfresco-mmt.jar).
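As a rough sketch (the module file name, webapp path and MMT jar location below are placeholders), the same AMP can also be built and applied manually with the Module Management Tool:

# mvn clean package
# java -jar alfresco-mmt.jar install target/mime-store-selector.amp /opt/tomcat/webapps/alfresco.war -verbose

The first command builds the AMP with the Alfresco SDK; the second installs it into the repository webapp.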

You can now test by uploading video or audio files and verifying that they are being stored in your new mediaStore.

Summary (Technical deep dive) on Policies and Behaviours

We’ve seen that Policies provide hook points to which we can bind behaviours, triggered by events on classes or associations.

Behaviours are (policy) handlers that execute specific business logic; they can be implemented in Java and/or JavaScript and can be bound to a type or aspect.

org.alfresco.repo.policy.JavaBehaviour
JavaBehaviour(Object instance, String method, NotificationFrequency frequency)

org.alfresco.repo.jscript.ScriptBehaviour
ScriptBehaviour(ServiceRegistry serviceRegistry, ScriptLocation location, NotificationFrequency frequency)


We can have several types of Policies
  • ClassPolicy (type or aspect)
  • AssociationPolicy (peer or parent-child)
  • PropertyPolicy (not used)


There are Different Types Of Bindings for Behaviors

Service – called every time
Class – most common, bound to type or aspect
Association – bound to association, useful for adding smarts to custom hierarchies
Properties – bound to property, too granular

Let’s take a look at a register and invoke pattern for Behaviour components

public interface NodeServicePolicies
{ 
 public interface OnAddAspectPolicy extends ClassPolicy
 {
 public static final QName QNAME = QName.createQName(NamespaceService.ALFRESCO_URI, "onAddAspect");

 // Called after an <b>aspect</b> has been added to a node
 public void onAddAspect(NodeRef nodeRef, QName aspectTypeQName);
 }
}

public abstract class AbstractNodeServiceImpl implements NodeService
{
 // note: policyComponent is injected … (not shown here)

 public void init()
 {
 // Register the policy
 onAddAspectDelegate = policyComponent.registerClassPolicy 
 (NodeServicePolicies.OnAddAspectPolicy.class);
 }

 protected void invokeOnAddAspect(NodeRef nodeRef, QName aspectTypeQName)
 { 
 NodeServicePolicies.OnAddAspectPolicy policy = onAddAspectDelegate.get(nodeRef, aspectTypeQName);
 policy.onAddAspect(nodeRef, aspectTypeQName);
 }
}

Build and Implement pattern for Behaviour components

public class XyzAspect implements NodeServicePolicies.OnAddAspectPolicy, ...
{
 // note: policyComponent is injected … (not shown here)

 public void init()
 {
 // bind to the policy
 policyComponent.bindClassBehaviour(
 OnAddAspectPolicy.QNAME,
 ContentModel.ASPECT_XYZ,
 new JavaBehaviour(this, "onAddAspect", 
 Behaviour.NotificationFrequency.TRANSACTION_COMMIT));
 }

 public void onAddAspect(NodeRef nodeRef, QName aspectTypeQName)
 {
 // implement behaviour here … (for when aspect XYZ is added)
 }
}

Conclusion

Alfresco Behaviours and Policies are very powerful features that can make your extensions very smart. I hope you enjoyed this post and that you can make use of its contents.

Stay tuned for more posts with my field experiences,

One Love,

Luis

Application Lifecycle Management Methodology for Alfresco

$
0
0


In Wikipedia, Application Lifecycle Management (ALM) is defined as :

“The marriage of business management to software engineering made possible by tools that facilitate and integrate requirements management, architecture, coding, testing, tracking, and release management”

Starting a development effort with the appropriate source control mechanisms and release methodology can save you hundreds of painful hours when managing releases and your source code. Adopting a smart, reliable and robust application lifecycle and release process can exceed your expectations and actually be the foundation for your project success.

PSG – Alfresco Sdk Rock and Roll with Jenkins


Following is a list of “mortal sins” of application lifecycle management that we seek to avoid.

– Manual changes in production environments
– Manual error prone testing procedures and stressful UAT phases
– Multiple development standards and unmanaged versioning policies

The main commandments of Application Lifecycle Management:

  • Identify and respect your release
  • If it’s not tested (automatically) it’s not working
  • If it’s not documented it doesn’t exist
  • Controlled integration is possible and should not limit business improvement
  • Centralize common configuration while leaving projects enough flexibility for special cases

The PSG Methodology

PSG stands for “Plain Simple Goals” and it’s an application lifecycle and release management methodology focused on Alfresco-based installations. It uses the Alfresco Maven SDK and Jenkins as its foundations, providing a methodology for Alfresco development, release management and application lifecycle management.

The main goal of this methodology is to provide a reproducible and scalable way to manage application build, test, release, maintenance and integration policies.

My invitation to you, in this particular post, is not only to read it but to actually try the project, following the easy, exact step-by-step instructions along the way. In return you will get:

  • Alfresco Development and Test Infra-structure
  • A Rapid and Smart Alfresco Development methodology
  • A Build Infra-structure
  • A reliable Release Process
  • A robust Alfresco Application lifecycle management approach

Technically speaking you will end up with :

  • In-memory H2 database and in-memory application server (Tomcat)
  • An extension module for the Alfresco repository that creates an Alfresco Module Package (.amp)
  • An extension module for Alfresco Share that creates an Alfresco Module Package (.amp)
  • A running instance of the Alfresco repository with your overrides and your .amp extension deployed and tested
  • A running instance of the Alfresco Share application with your overrides and your .amp extension deployed and tested

Let’s start then. PSG is a project hosted on my personal GitHub account and it contains the technical foundations of this methodology.

Step 0 – Download your working copy of the foundation

Start by clicking here to download your copy of the “Plain Simple Goals” methodology. Unzip the contents to a place on your computer. This will be your development environment home, so take note of this path. You will need to come back here later.

Step 1 – Pre-Requirements

1 – Software requirements

If you don’t have the Java JDK, Maven and Jenkins installed on your computer/server, now is the time to do so. Visit the respective project sites to install these prerequisites.

About Maven

Maven’s primary goal is to allow a developer to comprehend the complete state of a development effort in the shortest period of time. In order to attain this goal there are several areas of concern that Maven attempts to deal with:

  • Making the build process easy
  • Providing a uniform build system
  • Providing quality project information
  • Providing guidelines for best practices development
  • Allowing transparent migration to new features

About Jenkins

Jenkins is an award-winning application that monitors executions of repeated jobs, such as building a software project or jobs run by cron. Among those things, current Jenkins focuses on the following two jobs:

  • Building/testing software projects continuously, just like CruiseControl or DamageControl. In a nutshell, Jenkins provides an easy-to-use so-called continuous integration system, making it easier for developers to integrate changes to the project, and making it easier for users to obtain a fresh build. The automated, continuous build increases the productivity.
  • Monitoring executions of externally-run jobs, such as cron jobs and procmail jobs, even those that are run on a remote machine. For example, with cron, all you receive is regular e-mails that capture the output, and it is up to you to look at them diligently and notice when it broke. Jenkins keeps those outputs and makes it easy for you to notice when something is wrong.

About Java 

:) Just kidding

2 – Credentials for Enterprise

If you wish to work with alfresco enterprise you need to have login credentials on the Alfresco Nexus repository (artifacts.alfresco.com). You can request login credentials on the Alfresco support portal. Alternatively you can just build and run the open source version Alfresco Community.

3 – Configuration requirements to build Alfresco Enterprise/Community

During the installation of Maven, a new file named settings.xml was created. This file is the entry point to your local Maven settings configuration, including the remote Maven repositories. Edit your settings.xml file and update the servers section, including the Alfresco server id and your credentials.

Note that the root pom.xml references 2 different repositories: alfresco-private and alfresco-private-snapshots. If you are building Community you should call those alfresco-public and alfresco-public-snapshots. Note that the id of each repository must match a server id in your settings.xml (where you specify your credentials for that server).

Repository section from the root pom.xml

To build alfresco enterprise

...
   <repository>
     <id>alfresco-private</id>
     <url>https://artifacts.alfresco.com/nexus/content/groups/private</url>
   </repository>
   <repository>
     <id>alfresco-private-snapshots</id>
     <url>https://artifacts.alfresco.com/nexus/content/groups/private-snapshots</url>
   </repository>
 ...

To build alfresco Community

...
   <repository>
     <id>alfresco-public</id>
     <url>https://artifacts.alfresco.com/nexus/content/groups/public</url>
   </repository>
   <repository>
     <id>alfresco-public-snapshots</id>
     <url>https://artifacts.alfresco.com/nexus/content/groups/public-snapshots</url>
   </repository>
 ...

Server section from your local settings.xml (Enterprise)

...
        <server>
            <id>alfresco-private</id>
            <username>YOUR_USERNAME</username>
            <password>YOUR_PASSWORD</password>
        </server>
        <server>
            <id>alfresco-private-snapshots</id>
            <username>YOUR_USERNAME</username>
            <password>YOUR_PASSWORD</password>
        </server>
 ...

Server section from your local settings.xml (Community)

...
        <server>
            <id>alfresco-public</id>
            <username>YOUR_USERNAME</username>
            <password>YOUR_PASSWORD</password>
        </server>
        <server>
            <id>alfresco-public-snapshots</id>
            <username>YOUR_USERNAME</username>
            <password>YOUR_PASSWORD</password>
        </server>
 ...

Step 2 – Source Control Mechanism and Distribution Management

Source Control

The project uses the maven SCM plugin. The SCM Plugin offers vendor independent access to common scm commands by offering a set of command mappings for the configured scm. Each command is implemented as a goal.

Configure the main pom.xml with your own source control mechanism; the example in the download is configured with my GitHub account. If you don’t have one yet, you can create your own free GitHub account and use it. If you use Subversion you can have something like:

...
<scm>
  <connection>scm:svn:http://<YourRepo>/svn_repo/trunk</connection>
  <developerConnection>scm:svn:https://<YourRepo>/svn_repo/trunk</developerConnection>
  <url>http://<YourRepo>/view.cvs</url>
</scm>

 ...

Distribution Management
We need to configure the repository that will hold the artifacts of the releases. That is configured using the Maven deploy plugin. In the PSG project I have my own public CloudBees repository configured; you can create your own free CloudBees repository and update the pom.xml accordingly so that your releases are stored in your repository. Since during this post we will install an instance of Artifactory (see the bottom of this post for installation instructions), we should use it to hold our release artifacts. Configure it as follows:

...
<distributionManagement>
  <repository>
    <id><your-company>-private-release-repository</id>
    <url>dav:https://<YOUR_CI_SERVER_IP>:8080/artifactory/<project_name>/release/</url>
  </repository>
  <snapshotRepository>
   <id><your-company>-private-snapshot-repository</id>
   <url>dav:https://<YOUR_CI_SERVER_IP>:8080/artifactory/<project_name>/snapshot/</url>
  </snapshotRepository>
</distributionManagement>
 ...

Note that the repository id configured on the pom.xml must match a server id on your local maven settings.xml file.

Section from your local settings.xml maven configuration file

...
<server>
        <id><your-company>-private-snapshot-repository</id>
        <username>YOUR_PRIVATE_MAVEN_REPOSITORY_USERNAME</username>
        <password>YOUR_PRIVATE_MAVEN_REPOSITORY_PASSWORD</password>
        <filePermissions>664</filePermissions>
        <directoryPermissions>775</directoryPermissions>
    </server>
    <server>
        <id><your-company>-private-release-repository</id>
        <username>YOUR_PRIVATE_MAVEN_REPOSITORY_USERNAME</username>
        <password>YOUR_PRIVATE_MAVEN_REPOSITORY_PASSWORD</password>
        <filePermissions>664</filePermissions>
        <directoryPermissions>775</directoryPermissions>
    </server>
...

This will enable you to perform releases with :

  • Prepare release : mvn release:prepare
  • Perform release : mvn release:perform
  • Prepare and Perform release : mvn release:prepare release:perform
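For example, the target versions can also be passed explicitly on the command line (the flags are standard maven-release-plugin options; the version numbers are illustrative):

# mvn release:prepare release:perform -DreleaseVersion=1.0.0 -DdevelopmentVersion=1.1.0-SNAPSHOT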

Step 3 – Finally, let’s run it!

In the project’s root folder you have the heart of the project, the parent pom.xml. This is the file that aggregates your full build, including all the modules and overlays to the different applications, and generates deployable artifacts ready for your release. Before we run it for the first time, let’s review once more what is included in this project build:

  • Applications (apps folder)
  • Alfresco Module Packages (amps folder)

To run the project for the first time, issue the following maven command to run the project from the root of your development environment.

# mvn clean install -Prun

This will build and run all the modules. When the Maven process finishes, open a browser and point it to the Alfresco and Share web applications (with the SDK run profile these are typically available at http://localhost:8080/alfresco and http://localhost:8080/share).

Step 4 – Jenkins Integration – Build and Release Processes 

Now that we see what our build can do we will go a step further and we’ll perform the integration with Jenkins. This is where the automation fun begins.

Our goals are automating the build and deployment process. The project artifacts are always built after every check-in of new source code, which means early warnings of broken builds.

Process Goals

• Deploy to any environment by the push of a button
• Revert to a previous deployment by the push of a button.
• Deploy automatically every night on the dev environment.
• Log of each build

Our engine for the automated deployments and continuous integration (CI) will be Jenkins. Jenkins connects to your SCM (svn, git, …), downloads the source code and then builds your project artifacts. When the build has produced the required artifacts (<your>-alfresco.war, <your>-share.war) those can be deployed automatically to the different environments (e.g. DEV, TEST, PRODUCTION).

Step 1 – Install Jenkins and Artifactory

Let’s check the installation of required software on the Continuous Integration server. This should ideally be an independent server machine that will run:

• Jenkins Server
• Artifactory Repository

We will now cover the installation of those components on the designated CI server box (Linux). The Jenkins server will act as an automation tool that:

  • Monitors the execution of repeated jobs
  • Enables continuous integration
  • Orchestrates tests
  • Executes and tests releases
  • Rolls back and redeploys previous builds

The releases can be scheduled and run periodically in the early development stages.

Jenkins will deploy remotely to any Alfresco environment and run remote integration tests. The release reports should be published on the site and recorded as part of the release documentation. Your tests should be self-contained and runnable in CI, and they must produce intelligible reports. Every development task must include the appropriate tests.

Jenkins installation on the CI server box (let’s hire a free butler :))


STEP 1)

sudo wget -O /etc/yum.repos.d/jenkins.repo http://pkg.jenkins-ci.org/redhat/jenkins.repo

STEP 2)

sudo rpm --import http://pkg.jenkins-ci.org/redhat/jenkins-ci.org.key

STEP 3)

sudo yum install jenkins

Jenkins Details

Jenkins will be launched as a daemon up on start. See /etc/init.d/jenkins for more details.
To start, stop, or restart Jenkins you can use the following command:

sudo service jenkins start/stop/restart

The ‘jenkins’ user is created to run this service. If you change this to a different user via the config file, you must change the owner of

/var/log/jenkins
/var/lib/jenkins
/var/cache/jenkins

The Log file will be placed in /var/log/jenkins/jenkins.log. Check this file if you are troubleshooting Jenkins.

The configuration file /etc/sysconfig/jenkins captures the configuration parameters for the launch. By default, Jenkins listens on port 8080, but we can change this port to 9090 because the default 8080 may already be taken by another Tomcat instance.
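With the RedHat packaging, for instance, the HTTP port is defined in that service configuration file (a minimal sketch, assuming the default layout):

# grep JENKINS_PORT /etc/sysconfig/jenkins
JENKINS_PORT="9090"

After changing the value, restart the service (sudo service jenkins restart) for it to take effect.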

Note that the built-in firewall may have to be opened to access this port from other computers. (See http://www.cyberciti.biz/faq/disable-linux-firewall-under-centos-rhel-fedora/ for instructions how to disable the firewall permanently)

To test your Jenkins server, just type http://<server_ip>:9090 on your browser.

This concludes the installation of Jenkins, official documentation available:

https://wiki.jenkins-ci.org/display/JENKINS/Installing+Jenkins+on+RedHat+distributions

Artifactory installation on CI server box


Start by going to http://www.jfrog.com/open-source/#os-arti and downloading your zip bundle with Artifactory. At the time of this post the latest version is available at http://bit.ly/Hqv9aj

Setup is really easy; just follow the instructions available in their 1-minute setup video: http://www.jfrog.com/video/artifactory-1-min-setup/

After the installation is completed you can access artifactory by going to :

http://<server_ip>:8081

Source code control strategy

Your project should have

  • Standard trunk/branches/tags per project structure
  • Maintenance branches
  • Tagging per release
  • Build number based on SVN revision

This way you can have separate business projects (parties) running independently on separate SVN roots, whilst allowing the final binary product to be integrated in the main Enterprise Alfresco instance.

Release Artifacts

We will be producing and storing 2 different types of artifacts on our release. The artifacts have 2 categories, deployment artifacts and storage-only.

Deployment artifacts: alfresco.war and share.war and server-configuration.alf

Storage-only artifacts: repo-extension-amp, share-extension-amp, custom-amp

The deployment artifacts can be deployed to any target environment either by the click of a button or by a Jenkins scheduled task. Note that the storage-only artifacts are already contained within the deployment artifacts; this is achieved by the dependency management implicit in the Alfresco Maven SDK. The next diagram illustrates this.

[Diagram: deployment artifacts embedding the storage-only artifacts via Maven dependency management]

The server-configuration.alf is a compressed package that follows a specific structure and is part of every deployment. It contains environment-specific configuration for the target server (Dev, Test, Prod), such as the alfresco-global.properties file.

So, before we move on, we need to make sure this is working (using the Alfresco SDK to build the customised war files). Using the PSG template you can easily configure this, so I will not go into detail on this process.

Unit Testing

A set of unit tests is included as part of every code release/deployment to check that code is working and continues to work as intended.

Automated tests must meet very specific objectives:

  • Every developer must be able to run the combined collection of all the developer’s tests.
  • The continuous integration (CI) server must be able to run the entire suite of tests without any manual intervention.
  • The outcome of the tests must be unambiguous and repeatable.

Running these automated unit tests allows any developer to verify that their current changes do not break existing code-under-test. The team leader or manager should insist that this happens. It is very important as it virtually eliminates accidental or unintended side-effect problems.

There are 3 key objectives for developers to keep in mind when writing the unit tests:

  • Readability: Write test code that is easy to understand and communicates well.
  • Maintainability: Write tests that are robust and hold up well over time.
  • Automation: Write tests that require little setup and configuration (preferably none).
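As a small illustration of these objectives (the class and mapping below are purely hypothetical and only mirror the mimetype-to-store map used earlier in this post), a self-contained JUnit test needs no manual setup and produces an unambiguous, repeatable result:

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNull;

import java.util.HashMap;
import java.util.Map;

import org.junit.Test;

public class MimeToStoreMappingTest {

    // Minimal stand-in for the mimetype-to-store map configured in Spring
    private Map<String, String> buildMapping() {
        Map<String, String> map = new HashMap<String, String>();
        map.put("video/mp4", "mediaStore");
        map.put("audio/mpeg", "mediaStore");
        return map;
    }

    @Test
    public void videoContentIsRoutedToTheMediaStore() {
        assertEquals("mediaStore", buildMapping().get("video/mp4"));
    }

    @Test
    public void unmappedMimetypesFallBackToTheDefaultStore() {
        // a null result means the behaviour leaves the node in the default store
        assertNull(buildMapping().get("text/plain"));
    }
}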

Integration Testing

Once unit-tested components are delivered, Jenkins should integrate them together. These “integrated” components are tested to weed out errors and bugs caused by the integration. This is a very important step in the development life cycle.

The goal is to catch the bugs that only emerge when components developed by different programmers are integrated. In most projects a dedicated testing team focuses on Integration Testing.

The integration team should be able to

  • Step 1: Create a Test Plan
  • Step 2: Create Test Cases and Test Data
  • Step 3: If applicable create scripts to run test cases
  • Step 4: Once the components have been integrated execute the test cases
  • Step 5: Fix the bugs if any and re test the code
  • Step 6: Repeat the test cycle until the components have been successfully integrated

To write an Integration Test Case you should describe exactly how the test should be carried out. The Integration test cases specifically focus on the flow of data/information/control from one component to the other.

Integration Test cases should focus on scenarios where one component is being called from another. Also the overall application functionality should be tested to make sure the app works when the different components are brought together.

The various Integration Test Cases will be executed as part of the build process.

The release process

Maven will compile the extension amps and add them to the corresponding .war files. We will have 3 main artifacts for deployment (alfresco.war, share.war and server-configuration.alf).

The remaining artifacts (.amps) will also be created as part of the release but will not be directly deployed. They are already part of the war artifacts via the Maven dependency management.

To execute part of the release process in Jenkins we are using the maven release-plugin.

This plugin is used to release a project with Maven, saving a lot of repetitive, manual work.
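If the plugin is not already configured, a minimal sketch of its declaration in the root pom.xml could look like this (the version and options shown are indicative, not taken from the PSG project):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-release-plugin</artifactId>
  <version>2.5</version>
  <configuration>
    <autoVersionSubmodules>true</autoVersionSubmodules>
    <tagNameFormat>v@{project.version}</tagNameFormat>
  </configuration>
</plugin>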

Releasing a project is done in two steps: prepare and perform.

Try to prepare a release with maven by running the command : # mvn release:prepare

So what happens when preparing a release (release:prepare) ?

Preparing a release goes through the following release phases:

  • Check that there are no uncommitted changes in the local sources
  • Check that there are no SNAPSHOT dependencies
  • Change the version in the POMs from x-SNAPSHOT to a new version
  • Transform the SCM information in the POM to include the final destination of the tag
  • Run the project tests against the modified POMs to confirm everything is in working order
  • Commit the modified POMs
  • Tag the code in the SCM with a new version name
  • Bump the version in the POMs to a new value y-SNAPSHOT
  • Commit the modified POMs

 

After a successful build, maven has prepared your release by:

  1. Creating a new tag in your SCM with the release source code.
  2. Creating a new release version and locking the code for deployment.
  3. Automatically creating a new development version.

 

See more details on the prepare stage of a release on the official maven documentation

Performing a release (mvn release:perform)

After a successful prepare stage, you are now ready to perform your first release by running the command : # mvn release:perform

Performing a release runs the following release phases:

  • Checkout from an SCM URL with optional tag
  • Run the predefined Maven goals to release the project (by default, deploy site-deploy).
  • Upload the artifacts to your configured maven repository

See more details on the perform stage of a release on the official maven documentation

Deploy your artifacts

This is the last step of the release process; it’s where the release manager (impersonated by Jenkins) actually deploys the release to the target environment (DEV, QA, PROD). These are the steps that will be performed by Jenkins:

  1. Stops your application server (tomcat)
  2. Unzips the server-configuration.alf
  3. Updates server configuration.
  4. Copy the alfresco artifact war file (with your overlays) to your application server replacing the existing copy.
  5. Copy the share artifact war file (with your overlays) to your application server replacing the existing copy.
  6. Starts your application server (tomcat) with the new release

Note that we have configured (using maven dependency management) both the alfresco and share overlay modules to build their target artifacts (war files) including your amp extensions, meaning that they are already installed on the war artifacts (thanks to the alfresco Sdk).

All these steps are automated in Jenkins as part of the release process; no human intervention is necessary to perform a release.

Automating your build and releases in Jenkins

After configuring Jenkins to automate the main deployment steps (release prepare) and (release perform) we will still need to configure Jenkins to perform the final deployment. The goal is to completely eliminate human intervention in the deployment. We will be using an automated action in Jenkins that will do the following:

  • Stops target tomcat (DEV,PRE-PROD or PROD)
  • Download the artifacts from artifactory with CURL
  • Updates server configuration with alfresco server-configuration.alf
  • Copy the alfresco.war and share.war with scp to <tomcatRoot>/webapps replacing the existing versions.
  • Starts target tomcat (DEV, PRE-PROD or PROD)

NOTE: On the first run with the new release process you should probably backup the existing alfresco.war and share.war as those artifacts will not be on artifactory yet.
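A rough sketch of such an automated Jenkins shell step follows (host names, ports, paths, credentials and version numbers are placeholders, not values from the PSG project):

ssh tomcat@TARGET_HOST "sudo service tomcat7 stop"
curl -u REPO_USER:REPO_PASS -O "http://CI_SERVER:8081/artifactory/<project_name>/release/alfresco-1.0.0.war"
curl -u REPO_USER:REPO_PASS -O "http://CI_SERVER:8081/artifactory/<project_name>/release/share-1.0.0.war"
scp alfresco-1.0.0.war tomcat@TARGET_HOST:/opt/tomcat/webapps/alfresco.war
scp share-1.0.0.war tomcat@TARGET_HOST:/opt/tomcat/webapps/share.war
ssh tomcat@TARGET_HOST "sudo service tomcat7 start"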

Conclusion

In my opinion, a smart application lifecycle and release process is the foundation for any successful project. I hope you’ve enjoyed this post.

Stay tuned for more posts and advices on Alfresco and ECM

Until then, One Love !

“We’re together and we share, that is what makes us Strong”

Alfresco Repository Caches Unfolded

$
0
0

Alfresco Repository Caches Unfolded

During my consulting practice on various customer accounts i’ve faced some challenges regarding the tuning of the repository caches for increased performance, this post is a result of my personal experiences and it attempts to explain the way Alfresco makes usage of the repository caches.

Note : The information on this post applies only to Alfresco versions 4.1.X.

DISCLAIMER : You should not change the cache values unless you have performance issues and you are absolutely sure about what you are doing. Before changing the cache values you should first gather all the relevant information that will support and justify your changes. The repository caches can have a big positive difference on the performance of your Alfresco repository but they consume Java heap memory. Tuning the caches in a wrong way can make your system irresponsive and may lead to Out of Memory issues. The optimal settings to use on the caches depend on your particular use case and the amount of memory available to your Alfresco server.

The Repository Caches

The alfresco repository features various in-memory caches , they are transaction safe and clusterable. There are 2 levels of caching involved.

Level 1 Cache (cache-context.xml) – The in-transaction caches

Sizes in cache-context.xml are for the in-transaction caches (Level 1 cache) i.e. before it gets flushed to permanent storage. TransactionalCache has to hold any values (read, updated, new, removed) local to the transaction.  On commit, the values are flushed to the level 2 cache (EHCache in 4.1), which makes the data available to other transactions.  Of course, a rollback means the values get thrown away and the shared caches are not modified. This gives Alfresco the power of repeatable read as far as cache values are concerned.

So, if there are a lot of transactions that pull in a lot of cache entries, the transaction-local cache can get full, which is  bad for performance as the only way to guarantee correctness of the shared cache is to clear it. When it comes to site membership evaluation, a large number of ACLs are accessed within single transactions, which is why the transactional cache sizes are larger relative to the shared cache sizes.

Level 2 Cache (ehcache-custom.xml)

The Level 2 (L2) cache provides out-of-transaction caching of Java objects inside the Alfresco system. Alfresco provides support for EHCache. Using EHCache does not restrict the Alfresco system to any particular application server, so it is completely portable.
The L2 cache objects are stored in memory attached to the application scope of the server. Sticky sessions must be used to keep a user that has already established a session on one server for the entire session. By default, the cache replication makes use of RMI to replicate changes to all nodes in the cluster using the Peer Cache Replicator. Each replicated cache member notifies all other cache instances when its content has changed.
Level 2 cache is a technology to speed up database access. When the application makes a database query, it does not have to do a (costly) SQL request if the object is already present in the Level 2 cache. For debugging purposes, you can disable the L2 cache. The database will keep working, but at a slower rate.

If you have issues with the replication of information in clustered systems, that is, the cache cluster test fails, you may want to confirm this by setting the following properties to true in the alfresco-global.properties file as follows :

system.cache.disableMutableSharedCaches=true
system.cache.disableImmutableSharedCaches=true

Default Values for the Caches

Currently, out of the box, a vanilla Alfresco comes set up for approximately 512MB of cache heap memory, which is the recommended default for a Java heap size of 1GB. Nowadays we have much bigger heaps (I’ve seen heaps from 4GB up to 54GB) and also much bigger numbers in terms of users, concurrency and repository sizes. This means that the cache defaults in ehcache.xml are designed for dev environments (1GB heap) and I personally think they can be tuned on every production environment.

All default cache settings are available in the <configRoot>\alfresco\ehcache-default.xml file, but you should not directly modify this file.

Individual Cache Settings for L2 cache

Each cache is configured in an XML block similar to this:
<cache
    name="org.alfresco.cache.node.rootNodesCache"
    maxElementsInMemory="500"
    eternal="true"
    overflowToDisk="false"
    statistics="false"
/>

name
The name attribute is the name of the cache and generally indicates the type of objects being cached.
maxElementsInMemory
The maxElementsInMemory controls the maximum size of the cache. This value can be changed to tune the size of the cache for your system. Ehcache caches are implemented using a linked-map system, which means that memory is only required for objects that are actually in memory. If you set the maxElementsInMemory to a high value, it will not automatically allocate that number of slots. Instead, they are added to the linked list as required. When maxElementsInMemory is reached, the cache discards the oldest objects before adding new objects.
The cached objects will be garbage collected by means of weak referencing should memory become a problem. It should be noted that some object references are effectively shared by the caches, so the amount of memory used is generally not as high as the approximate value may suggest – but it’s best to err on the side of caution.
timeToIdleSeconds 
timeToIdleSeconds and timeToLiveSeconds control the automatic timeout of cached objects.
overflowToDisk
overflowToDisk controls whether the cache should overflow to disk rather than discarding old objects.
statistics
When set to true and the rest of the tracing mechanism is enabled, the alfresco logs will contain usage statistics of this specific cache.

How to Wisely tune the repository caches for your use-case

There are 2 main files that you can edit/enable to tune your repository caches: ehcache-custom.xml and cache-context.xml. You’ll find the ehcache-custom.xml.sample file in your shared/classes/alfresco/extension directory under your application server installation folder. To change the default values you need to rename this file to ehcache-custom.xml, perform your changes and restart your application server. When you decide that you need to tune a specific cache, you should do it in both the ehcache-custom.xml and cache-context.xml files.

We strongly advise you not to tune the caches without tracing their current usage first.

The best way to predict and determine the optimal values for your caches is by using a tracing mechanism that will help you to determine which caches fill up quickly for your particular server use-case.  Take a look below on how to enable the cache tracing mechanism.

TRACING YOUR CACHE SIZES

1 – log4j.properties

Edit your log4j.properties and set the following logging category  to DEBUG to output detailed Ehcache usage information.

org.alfresco.repo.cache.EhCacheTracerJob=DEBUG

To target specific caches, you can even append the cache name or package:

org.alfresco.repo.cache.EhCacheTracerJob.org.alfresco

2 – Override the ehCacheTracerJob bean

The configuration file <configRoot>/alfresco/scheduled-jobs-context.xml contains the
ehCacheTracerJob bean configuration. You will need to override this bean to change the trigger schedule. You will do this by enabling the scheduler property to activate the trigger. To override this bean create a <yourname>-context.xml file on the extensions root folder and provide the overriding for the bean as per the instructions below.

<!-- enable DEBUG for 'org.alfresco.repo.cache.EhCacheTracerJob' and enable the scheduler property to activate -->
<bean id="ehCacheTracerJob">
    <property name="jobDetail">
        <bean id="ehCacheTracerJobDetail">
            <property name="jobClass">
                <value>org.alfresco.repo.cache.EhCacheTracerJob</value>
            </property>
        </bean>
    </property>
    <!-- enable this to activate the bean
    <property name="scheduler">
        <ref bean="schedulerFactory" />
    </property>
    -->
    <!-- start after an hour and repeat hourly -->
    <property name="startDelayMinutes">
        <value>60</value>
    </property>
    <property name="repeatIntervalMinutes">
        <value>60</value>
    </property>
</bean>

When triggered, the job will collect detailed cache usage statistics and output them to the log/console, depending on how logging has been configured for the server.

3 – Set caches to use statistics

In your ehcache-custom.xml, choose the caches you want to monitor and set the statistics property=true.

<cache
    name="org.alfresco.cache.node.rootNodesCache"
    maxElementsInMemory="250000"
    eternal="true"
    overflowToDisk="false"
    statistics="true"
/>

After making those 3 changes and restarting your server, you should start seeing a detailed output with relevant information in regards to your usage of the repository caches. You should get log traces similar to the ones on the example below.

The following example is from a test of the Alfresco Repository running a simple 150 concurrent user test scenario. Randomly selected from a pool of 1000 test users, a user logs in, views their home space, uploads a small file to the repository and logs out. This test ensures that new objects are continually added to the caches as the new files are added by random users.

Some objects are shared between the caches, so the reported sizes are an overestimate in some cases. Nevertheless, they serve as a useful indication of relative sizes. Note the last statement that gets logged, which clearly indicates the estimated share of your heap that is being consumed by the caches.

09:09:34,458 DEBUG [org.alfresco.repo.cache.EhCacheTracerJob]    Analyzing EHCache:
===>  org.alfresco.repo.domain.hibernate.NodeImpl.sourceNodeAssocs
Hit Count:                        56245 hits            |         Miss Count:         20102 misses
Deep Size:                       19.62 MB              |         Current Count:       5000 entries
Percentage used:            100.00 percent     |         Max Count:           5000 entries
Estimated maximum size:      19.62 MB

09:10:06,099 DEBUG [org.alfresco.repo.cache.EhCacheTracerJob]    Analyzing EHCache:
===>  org.alfresco.repo.domain.hibernate.NodeImpl.targetNodeAssocs
Hit Count:                            56253 hits            |         Miss Count:         20114 misses
Deep Size:                           19.62 MB              |         Current Count:       5000 entries
Percentage used:                100.00 percent     |         Max Count:           5000 entries
Estimated maximum size:      19.62 MB

09:10:06,099 DEBUG [org.alfresco.repo.cache.EhCacheTracerJob]    Analyzing EHCache:
===>  org.alfresco.repo.domain.hibernate.VersionCountImpl
Hit Count:                          0 hits                   |         Miss Count:             0 misses
Deep Size:                         0.00 MB               |         Current Count:          0 entries
Percentage used:              0.00 percent        |         Max Count:            100 entries
Estimated maximum size:        NaN MB

09:10:06,115 DEBUG [org.alfresco.repo.cache.EhCacheTracerJob]    Analyzing EHCache:
===>  org.alfresco.repo.domain.hibernate.NodeAssocImpl
Hit Count:                          0 hits                |         Miss Count:             0 misses
Deep Size:                         0.00 MB            |         Current Count:          0 entries
Percentage used:              0.00 percent     |         Max Count:           1000 entries
Estimated maximum size:        NaN MB


09:10:31,428 DEBUG [org.alfresco.repo.cache.EhCacheTracerJob]    Analyzing EHCache:
===>  permissionsAccessCache
Hit Count:                         2610635 hits        |         Miss Count:       6148423 misses
Deep Size:                        6.02 MB                |         Current Count:      12165 entries
Percentage used:             24.33 percent       |         Max Count:          50000 entries
Estimated maximum size:      24.75 MB

09:10:31,615 DEBUG [org.alfresco.repo.cache.EhCacheTracerJob]    Analyzing EHCache:
===>  PermissionCache
Hit Count:                         9035796 hits        |         Miss Count:      19266775 misses
Deep Size:                        2.55 MB                |         Current Count:       3519 entries
Percentage used:             35.19 percent       |         Max Count:          10000 entries
Estimated maximum size:       7.23 MB

09:10:31,615 DEBUG [org.alfresco.repo.cache.EhCacheTracerJob] EHCaches currently consume 421.84 MB or 28.20% VM size
The criteria to check in the cache tracing logs are:

  • (MissCount – CurrentCount) must be as low as possible.
  • (HitCount/MissCount) must be as high as possible.

Estimated maximum size affects the permanent memory taken up by the cache. If the caches grow too large, they may crowd out transient session memory and slow down the system. It is useful to have this running, on occasion, to identify the caches with a low HitCount/MissCount ratio.

An important indicator that you need to increase your caches is when you see messages like the ones below on your alfresco.log file indicating that some specific caches are full.

13:25:12,901 WARN [cache.node.nodesTransactionalCache] Transactional update cache ‘org.alfresco.cache.node.nodesTransactionalCache’ is full (125000).
13:25:14,182 WARN [cache.node.aspectsTransactionalCache] Transactional update cache ‘org.alfresco.cache.node.aspectsTransactionalCache’ is full (65000).
13:25:14,214 WARN [cache.node.propertiesTransactionalCache] Transactional update cache ‘org.alfresco.cache.node.propertiesTransactionalCache’ is full (65000).

After analysing your tracing results and your alfresco.log file, if you decide to perform repository cache tuning, you should edit both the ehcache-custom.xml and cache-context.xml.

I strongly advise you not to change the original cache-context.xml file directly, but to create a new context file where you override the beans that you need. You can find the original cache-context.xml in the alfresco.war webapp, under /classes/alfresco/cache-context.xml. The tuning of a specific cache should occur in both the cache-context file and the ehcache-custom.xml file. For your reference, I’m attaching to this post 2 sample cache configuration files with settings that were required to handle an environment with a large group structure.

custom-cache-context.xml ehcache-custom.xml
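As a minimal, hedged sketch of such an override (the bean name, sharedCache reference and cache name must be copied from the cache-context.xml shipped with your exact version – the values below are illustrative), the in-transaction limit is raised through the maxCacheSize property of the TransactionalCache bean:

<!-- your custom context file: copy the original bean definition and raise maxCacheSize -->
<bean name="nodesCache" class="org.alfresco.repo.cache.TransactionalCache">
    <property name="sharedCache">
        <ref bean="nodesSharedCache" />
    </property>
    <property name="name">
        <value>org.alfresco.cache.node.nodesTransactionalCache</value>
    </property>
    <property name="maxCacheSize">
        <value>250000</value>
    </property>
</bean>

The matching shared (L2) cache entry in ehcache-custom.xml should then be increased via maxElementsInMemory, as in the examples above.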

Individual Caches Information

This section shows details on the most important repository caches with a brief definition on their purpose.

Default Cache
defaultCache – Used when someone forgets to write a cache sizing snippet.
Store Caches
rootNodesCache – Primary root nodes by store. Increase for Multi-Tenancy.
allRootNodesCache – All root nodes by store. Increase for Multi-Tenancy.
Nodes Caches
nodesCache – Bi-directional mapping between NodeRef and ID.
aspectsCache – Node aspects by node ID. Size relative to nodesCache.
propertiesCache – Node properties by node ID. Size relative to nodesCache.
childByNameCache – Child node IDs cached by parent node ID and cm:name. Used for direct name lookups and especially useful for heavy CIFS use.
contentDataCache – cm:content ContentData storage. Size according to the content in the system.
Permissions and ACLs (mostly used in Alfresco Share)
authorityToChildAuthorityCache – Size according to the total number of users accessing the system.
authorityToChildAuthorityCache – Members of each group. Size according to the number of groups, incl. site groups.
zoneToAuthorityCache – Authorities in each zone. Size according to zones.
authenticationCache – Scale it according to the number of users, not concurrency, because Share sites will force all of them to be hit regardless of who logs in.
authorityCache – NodeRef of the authority. Size according to total authorities.
permissionsAccessCache – Size according to total authorities.
readersCache – Who can read a node. Size according to total authorities.
readersDeniedCache – Who can’t read a node. Size according to total authorities.
nodeOwnerCache – cm:ownable mapping, needed during permission checks.
personCache – Size according to the number of people accessing the system.
aclCache – ACL mapped by access control list properties.
aclEntityCache – DAO-level bi-directional ID-ACL mapping.
Other Caches
propertyValueCache – Caches values stored for auditing. Size according to the frequency and size of audit queries and audit value generation.
immutableEntityCache – QNames, etc. Size according to static model sizes.
tagscopeSummaryCache – Size according to tag use. Stores rollups of tags.

AVM Caches are related to a deprecated part of Alfresco so you shouldn’t need to tune the following caches :

  • avmEntityCache
  • avmVersionRootEntityCache
  • avmNodeCache
  • avmStoreCache
  • avmNodeAspectsCache

Conclusion

Repository caches play an important role in repository performance and they should be tuned wisely. Always use the tracing method before you decide to tune your caches, and don’t forget to check your logs for signs that can help you decide whether you need to tune them.

I hope you enjoyed this article, stay tuned for more information from the field.

 

Alfresco Boxes

$
0
0

Alfresco Boxes – State of the Art Automation to create your alfresco environment

Hi folks, my second post on my new Alfresco blog could not be about a better topic. Let me start by telling you that Alfresco Boxes rocks, and it just makes me proud to be a personal friend and colleague of its creator, Maurizio Pillitu. It has never been so easy to set up and run a fully featured Alfresco environment.

Alfresco Boxes is a community-driven effort, currently in experimental phase, not supported and not guaranteed by Alfresco. As usual, if you choose to use this technology and you run into problems, please let us know so we can quickly change the project name and pretend that is someone else’s fault. :-).

Jokes apart, if you want to create your Alfresco environments in a fast, intelligent, reusable, automated way, then Alfresco Boxes is what you’re looking for. Let’s start jamming with some boring but necessary theory. I promise I won’t take it very far, and you’ll be able to start experimenting with all of this really quickly.

On the GitHub project page (https://github.com/maoo/alfresco-boxes), Maurizio defines Alfresco Boxes as:

“A collection of utilities for Alfresco VM provisioning and deployment that works with either Packer, Vagrant or Docker; the deployment and installation logic is provided by chef-alfresco.”

Packer is a free (open-source) tool for creating identical machine images for multiple platforms from a single source configuration. It is easy to use and automates the creation of any type of machine image. It embraces modern configuration management by encouraging you to use automated scripts to install and configure the software within your Packer-made images. Packer brings machine images into the modern age, unlocking untapped potential and opening new opportunities. Out of the box, Packer comes with support to build images for Amazon EC2, DigitalOcean, VirtualBox, and VMware.

Vagrant does the same as packer does, only using a specific file extension (.box) that contains all files needed by a local provider (VMWare or VirtualBox) to run the VM; check vagrantcloud and vagrantbox.es to get an idea of the pre-packaged boxes available that you can re-use on your own projects.

Docker builds on Linux kernel containment features to provide virtualization “super-powers”. Docker introduces the concept of a container, which is a virtualization of the host operating system resources. Docker is a server that can start, stop and kill containers, so it’s basically a virtualization server with a new approach to the virtualization paradigm. As with Vagrant, Docker delivers an Image Index (Alfresco images soon to come).

In a nutshell, you can use any one of the 3 technologies above (depending on what you want to achieve) to automate your Alfresco deployment strategy and the creation of your Alfresco environments. In this blog post I will focus exclusively on Packer but, from a technical standpoint, the 3 approaches have one thing in common… “Chef Alfresco”.

Chef Alfresco is a Chef cookbook (a collection of build tasks) that defines the installation steps needed to deliver a fully working Alfresco instance on a given VM. If you’re not familiar with Chef, it is a build automation tool that uses an agent-client/server architecture to process and execute build tasks, the so-called recipes.

Chef Alfresco depends on other cookbooks, such as artifact-deployer, that fetches artifacts from remote Apache Maven repositories and defines default values (i.e. Maven artifact coordinates) for all artifacts (WARs, ZIPs, JARs) involved in the Alfresco deployment process; it also depends on other third-party recipes that install the DB (MySQL), Servlet Container (Tomcat) and transformation tools (ImageMagick, LibreOffice, swftools). If you want to check the full list of dependencies, check the Berkshelf file of alfresco-boxes.

 ! OK, ITS TIME TO GET BUSY, LET’S TRY IT !

Now that you know the basics of the theory it’s time to have a real taste of the Alfresco Boxes technology. Following is a list of step-by-step actions and a how-to video that will guide you on how to use all of this in a simple way. The first thing you need to do is check whether you have the prerequisites necessary to proceed.

MacOS users: to have Ruby installed, the best way is to install Xcode, preferably the latest version.

Now that the prerequisites are in place, let’s start to install Alfresco Boxes and build our first Alfresco VM in an automated way. The following instructions focus on the Packer approach. You can find detailed documentation on the other variants, such as Docker or Vagrant, in the GitHub Alfresco Boxes project.

Installing Alfresco Boxes

1 – Download and install Packer.

  • Go to http://www.packer.io/downloads.html and download the appropriate version for your operating system.

  • Unzip the packer package and make sure the unzipped location is part of your local path.

    • For Linux/MacOs : export PATH=<your_path_to_packer>:$PATH

  • Test if packer is installed and available on your system by typing the following command :

    • # packer -v

2 – Download and install virtualBox

3 – Checkout the Alfresco Boxes git project.

  • Checkout alfresco boxes by running the following command on a terminal (you need to have git installed; if you don’t know about git, now it’s a good time to learn it and start using it)

  • # git clone -b alfresco-boxes-0.5.1 https://github.com/maoo/alfresco-boxes.git

The git clone command created an alfresco-boxes folder on your system and downloaded all of the project’s content for the specific branch being checked out.

4 – Local configuration adjustments

4.1 Change directory to alfresco-boxes/packer/

# cd alfresco-boxes/packer/

4.2 Edit file precise-alf421.json to choose an IP that can be bridged to one of your host Network Interfaces:

{
  "type": "shell",
  "execute_command": "echo 'vagrant' | sudo -S sh '{{ .Path }}'",
  "inline": ["/tmp/static-ip.sh <your_ip_range_here>.223 192.168.1.1 255.255.255.0"]
}

4.3 Edit the vbox-precise-alf421/data_bags/maven_repos/private.json 

Edit this file to set your access credentials to artifacts.alfresco.com ( access can be requested by Alfresco Customers via the Alfresco Support Portal).

P.S. – If you don’t have credentials to artifacts.alfresco.com you can still test alfresco-boxes  using the Community edition: change the alfresco-allinone.json version attribute from 4.2.1 to 4.2.f

{
  "id":"private",
  "url": "https://artifacts.alfresco.com/nexus/content/groups/private",
  "username":"your_user",
  "password":"your_password"
}

You can optionally use your Maven encrypted password and set your Maven master password in precise-alf421.json:

"maven": {
  "master_password":"{your_mvn_master_password}"
}

4.4 Generate the Virtual Machine box:

cd alfresco-boxes/packer/vbox-precise-421
packer build -only virtualbox-iso precise-alf421.json

This will create an output-virtualbox-iso/<name>.ovf and output-virtualbox-iso/<name>.vmdk, ready to be imported into VirtualBox. You should now have a fully functional version of Alfresco with everything installed and ready to run. :)
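As a minimal sketch (the generated file name varies per build), you can import the image from the command line instead of using the VirtualBox UI:

# VBoxManage import output-virtualbox-iso/<name>.ovf

After the import, start the VM from VirtualBox as usual.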

The user/password to login (and check the local IP – ifconfig – that is assigned by DHCP) is vagrant/vagrant.

5 – Virtual Machine (Ubuntu) Notes

Open a terminal with a ssh connection to your virtual machine.

# ssh vagrant@<your_vm_ip>

5.1 – Start by changing/setting the root password

# sudo sh

# passwd root

5.2 Change tomcat7 password

# su -

# passwd tomcat7

5.3 Take a note of your usernames and passwords :

Make notes of your usernames and passwords, you will need them later.

root | <your_root_password>
tomcat7 | <your_tomcat7_password>

5.4 Tomcat 7 locations

CATALINA_BASE=/var/lib/tomcat7
CATALINA_HOME=/usr/share/tomcat7

5.5 To start | stop tomcat ( use sudo and tomcat7)

# sudo service tomcat7 start|stop|restart|status

5.6 Tomcat logs directory

/var/log/tomcat7

5.7 Alfresco and Alfresco Share logs directory

/var/lib/tomcat7

5.8 Alfresco shared dir location

/var/lib/tomcat7/shared

5.9 Alfresco admin user 

( admin | admin )

Hope you enjoyed the article in its current state. I will be posting a step-by-step video in the near future to illustrate the most relevant steps, so stay tuned.

Love, Passion, Unity and OpenSource can take us further. We’re together, thanks for reading.

 

Solr Tuning – Maximizing your Solr Performance


Solr Tuning Tips

Hi folks, another useful post for your Alfresco-related knowledge, this time dedicated to Solr tuning tips. Using this information wisely can heavily contribute to increased performance in Solr-related areas such as searching and indexing content.

Solr, when properly tuned, is extremely fast, easy to use, and easy to scale. I wanted to share some lessons learned from my field experience while using Alfresco with Solr.

Solr It’s a search indexer built on top of Lucene , there are two main disciplines to consider for Solr :

* Indexing data (writing/committing)
* Search for data (reading/querying)

Each of those disciplines has different characteristics and considerations while addressing performance.

It’s important to mention that there is no rule of thumb that enables you to maximize your Solr performance for every project. Solr tuning is an exercise that highly depends
on the specific project use cases, architecture and business scenarios.  Depending on the particularities of your project, the actions to perform may vary in terms of what
needs to be done to achieve best Solr performance. This post includes procedures and methodologies that will help you to understand how Solr performance is driven.

Let’s first Analise how does alfresco search and indexing works together with Alfresco

2 main Solr Cores (Live content and archived content)

By default the Solr that comes with Alfresco contains 2 cores: the live content core (workspace://SpacesStore) and the archived content core (archive://SpacesStore). Each of these cores contains the indexes for its particular set of content.

[Diagram: communication between Alfresco and the two Solr cores]

Alfresco Search (what happens behind the scenes after a user searches for content?)

Alfresco sends a secure GET request (HTTPS) to the Solr cores and Solr responds with a stream, formatted in JSON or XML, containing the response to the search request.
This is then interpreted by Alfresco and the results are presented in a user-friendly format.

Solr Indexing new items (tracking Requests)

By default this tracking occurs every 15 seconds (configurable): Solr asks Alfresco for changes to content and newly created documents in order to index those changes in its cores. It also asks for changes to the content models and to the ACLs on documents.

In summary, Solr updates its indexes by looking at the number of transactions that have been committed since it last talked to Alfresco, a bit like index tracking in a cluster. In the diagram above you can see several HTTP requests going from Solr to Alfresco; those requests are explained below (a quick way to inspect the tracker status is shown right after the list):

  1. New models and model changes https://localhost:8443/alfresco/service/api/solr/model
    1. Solr keeps track of new custom content models that have been deployed and downloads them to be able to index the properties in these models.
  2. ACLs changes https://localhost:8443/alfresco/service/api/solr/aclchangesets
      1. Any changes on permission settings will also be downloaded by Solr so it can do query time permission filtering.
  3. Document Content changes
      1.  https://localhost:8443/alfresco/service/api/solr/textContent
  4. New transactions (create, delete, update or any other action that triggers a transaction)
      1. https://localhost:8443/alfresco/service/api/solr/transactions
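
A quick way to see how far the trackers are behind is the Solr cores SUMMARY report (URL as documented for Alfresco's bundled Solr; adjust host, port and protocol to your installation):

# reports, per core, the last indexed transaction, the tracking lag and the tracker status
https://localhost:8443/solr/admin/cores?action=SUMMARY&wt=xml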

     

Brief analysis of a new-document indexing scenario

Let’s check what happens in Solr when we create a new document and Solr executes it’s tracking detecting that a new document has been created.

  1. First Solr requests a list of ids of all new transactions on that document (create, update, delete, … ) https://localhost:8443/alfresco/service/api/solr/transactions
  2. Transactions and ACL changesets are indexed in parallel, and for each transactionId Solr requests, in this order:
    1. Document metadata
    2. Document Content

Solr architecture variations

There are 3 different architecture variations that can be considered while using Solr with Alfresco in a cluster. For the scope of this post I will only be addressing cluster-based configurations that include the following advantages:

Alfresco -> Solr search load balancing

This is the most obvious use case for scalability purposes. Search requests are directed from Alfresco to a pool of Solr instances, each of which contains a full copy of the index and is able to service requests in a purely stateless fashion.

Solr -> Alfresco index tracking balancing

In the other direction, Solr nodes use a load balancer to redirect their index tracking requests to one or multiple dedicated/shared Alfresco nodes. This is useful in case of large indexing load, due to a heavy concurrent write/update scenario.

Option 1 – Solr on the same machine as Alfresco, non-dedicated tracking

[Diagram: Option 1 – Solr, Alfresco and Share on the same application server]

In this architecture we have a Solr instance deployed in the same application server as both the Alfresco and Share web applications.

Advantages

  • Easy to maintain / backup.

Disadvantages

  • Shared JVM, if Solr crashes both Alfresco and Share become unavailable.
  • Shared hardware, Memory is shared between all layers of the application
  • When Solr downloads content, there is transformation load on the Alfresco side to generate the indexing text file (CPU/memory intensive). Having everything on the same box impacts both search and indexing, as all the applications are on the same application server sharing resources such as connection pools, threads, etc.
  • Only vertical scaling is possible (adding more CPU and memory)

Option 2 – Solr separated from Alfresco, non-dedicated tracking

[Diagram: Option 2 – Solr on separate machines from Alfresco and Share]

In this architecture variation we have the Solr instances deployed on separate machines and application servers from Alfresco and Share.

Advantages

  • Simple upgrade, administration and maintenance of the Solr server farm
  • Allows for vertical and horizontal scalability
  • Introduces the ability to load balance the queries
  • Ready for Future Solr sharding feature
    • It’s expected that alfresco will support, on a near future the ability to slit the index on different solr instances that will lead to increased performance, this architecture is ready to implement that feature.

Disadvantages

Remote tracking can stress the network; if network problems occur, Solr performance is affected.

Option 3 – Solr server with a dedicated tracking Alfresco instance

[Diagram: Option 3 – Solr servers with dedicated tracking Alfresco instances]

In this architecture variation we use dedicated Alfresco instances on the Solr servers that are only used for indexing and do not receive or process any user requests. These local Alfresco instances take care of important CPU/memory intensive operations such as the transformations and the overall tracking and indexing actions. With this scenario the repository tier is relieved of those operations, resulting in an overall performance gain. These Alfresco instances are not part of the cluster and do not require ehcache configuration.

Note: when Solr asks Alfresco for the content to be indexed, it's the Alfresco server that is responsible for performing the content transformation into a plain text file; only then is the content sent to Solr for indexing. This is an IO/CPU/memory intensive operation that can decrease the overall Alfresco performance.

Advantages

  • Indexing operations offloaded from repository and client tier
  • Dedicated Alfresco for transformation operations
    • Allows for specific transformation tuning on the index tier and on the repository tier, considering the use cases: transformations for previews and thumbnails (Share related) and transformations for Solr indexing.
  • Allows for Vertical and horizontal scalability
  • General performance increase

Disadvantages

  • None, in my opinion this is the best option :)

Solr Indexing Best practices

Now let's discuss the Solr indexing best practices. If your problem is indexing performance, this is the juicy part of the post for you.

General Indexing Golden Rules ( Solr Indexing Tuning )

  • Have local indexes (don't use shared folders or NFS) and use fast hardware (RAID, SSD, ...)
  • When using an Alfresco version prior to 4.1.4 you should reduce your caches, as the default cache configuration may lead to OOM when Solr is under heavy load.
  • Manage your RAM buffer size (ramBufferSizeMB) in solrconfig.xml wisely. This is set to 32 MB by default, but increasing it to 64 or even 128 has generally shown increased performance, depending on the amount of free memory you have available.
    • ramBufferSizeMB sets the amount of RAM that may be used by Solr indexing for buffering added documents and deletions before they are flushed to disk.
  • Tune the mergeFactor, 25 is ideal for indexing, while 2 is ideal for search. To maximize indexing performance use a mergeFactor of 25.
  • During indexing, plug in a profiling tool (e.g. YourKit) to check the repository health. Sometimes, during the indexing process, the repository layer executes heavy IO/CPU/memory intensive operations like transformation of content to text in order to send it to Solr for indexing. This can become a bottleneck when, for example, the transformations are not working properly or the GC cycles are taking a lot of time.
  • Monitor closely the JVM health of both Solr and Alfresco (GC, Heap usage)
  • Solr operations are memory intensive so tuning the Garbage collector is an important step to achieve good performance.
  • Consider if you really need tracking to happen every 15 seconds (the default). This can be configured in the Solr configuration files via the cron frequency property: alfresco.cron=0/15 * * * * ? *
  • This property can heavily affect performance, for example during bulk ingestion of documents or during a Lucene to Solr migration. You can change this to 30 seconds or more when you are re-indexing; this will allow more time for the indexing threads to perform their actions before they get more work in their queue.
  • Increase your index batch counts to get more results from the indexing webscript on the repository side. In each core's solrcore.properties, raise the batch count to 2000 or more: alfresco.batch.count=2000
  • In the solrconfig.xml of each core, configure the RAM buffer to be at least 64 MB; you can even use 128 if you have enough memory: <ramBufferSizeMB>64</ramBufferSizeMB>
  • In the solrconfig.xml of each core, configure the mergeFactor to 25; this is the ideal value for indexing: <mergeFactor>25</mergeFactor>
  • Disable full text indexing on the archive:SpacesStore core by adding the property alfresco.index.transformContent=false. Alfresco never searches for content inside files that are deleted/archived, so this saves disk space, memory in Solr, CPU during indexing and overall resources (see the configuration sketch after this list).
  • Tune the transformations that occur on the repository side, set a transformation timeout.
  • Important questions your project must answer:
    • Is SSL really needed? If you are inside the intranet, you should disable it to reduce complexity.
    • Is full text indexing really necessary? Some customers enable full text indexing but don't actually use it.
    • Is the archive core really necessary for indexing? If you are not making use of this core, it is beneficial to disable it.
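
A minimal sketch of the settings mentioned above, assuming the default core layout (one solrcore.properties and one solrconfig.xml per core); treat the values as starting points, not definitive recommendations:

# solrcore.properties (per core)
alfresco.batch.count=2000
alfresco.cron=0/30 * * * * ? *
# archive core only – skip full text extraction for archived content
alfresco.index.transformContent=false

# solrconfig.xml (per core)
<ramBufferSizeMB>64</ramBufferSizeMB>
<mergeFactor>25</mergeFactor>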

For index updates, Solr relies on fast bulk reads and writes. One way to satisfy these requirements is to ensure that a large disk cache is available. Use local indexes and the fastest disks possible. In a nutshell, you want to have enough memory available in the OS disk cache so that the important parts of your index, or ideally your entire index, will fit into the cache. Let’s say that you have a Solr index size of 8GB. If your OS, Solr’s Java heap, and all other running programs require 4GB of memory, then an ideal memory size for that server is at least 12GB. You might be able to make it work with 8GB total memory (leaving 4GB for disk cache), but that also might NOT be enough.

Solr Indexing Troubleshooting techniques

Troubleshooting Solr indexing performance means finding the bottleneck that is delaying the overall indexing process. Since this is a process that involves at least 2 layers of your application architecture, the best way to troubleshoot is through a dedicated series of tests, measuring performance and comparing results.

The first thing to discover is where the bottleneck is occurring; it can be in the:

  • Repository layer
    • Database – If it’s a database performance issue, normally adding more connections to the connection pool will increase performance.
    • I/O – If it’s a IO problem, it can normally occur when using virtualized environments, you should use hdparam to check read/write disk speed performance if you are running on a linux based system, there are also some variations for windows. Find the example below:

                  sudo hdparm -Tt /dev/sda

Timing cached reads: 12540 MB in 2.00 seconds = 6277.67 MB/sec  
Timing buffered disk reads: 234 MB in 3.00 seconds = 77.98 MB/sec
  • JVM – Jvm configuration can impact the performance on the repository layer indexing activities.
  • CPU and memory usage – monitor the usage of the CPU and memory on this layer and check for unusual usage of these two components.
  • Transformations – Set a timeout for the transformations that occur on the repository layer. There is no timeout set by default and sometimes, when there's a transformation issue, the threads are frozen waiting for the transformations to complete.
  • SOLR Indexing layer
    • Number of threads for indexing – You can add more threads to the indexing processes if you detect that indexing is slow on the Solr side.
    •  Solr caches  – There are several caches that you can configure to increase indexing performance.
    • JVM – Jvm configuration can impact the performance on the Solr layer indexing activities. Focus your efforts on analyzing and tuning the Garbage collector, check for big GC pauses by analyzing the gc logs.
    • Hardware scalability – If none of the above actions improve your performance you may need to increase memory and CPU power on the Solr layer. Also consider horizontal scaling when appropriate.

The rule for troubleshooting involves testing and measuring initial performance, applying some tuning and parameter changes, then retesting and measuring again until you reach the necessary performance. I strongly advise you to plug a profiling tool such as YourKit into both the repository and Solr servers to help with the troubleshooting.

Solr Search Best practices

This section is about tuning search performance while using Solr. In general it will be sufficient to follow the golden rules below; if applying those does not solve your problem you might need to scale your architecture.

General Search Golden Rules

  • Use local folders for the indexes (don’t use shared folders, NFS)
  • Use Fast hardware (RAID, SSD,..)
  • Tune the mergeFactor, a mergeFactor of 2 is ideal for search.
  • Decrease the Solr caches, especially when running an Alfresco version prior to 4.1.4.
  • Increase your query caches and the RAMBuffer.
  • Avoid path search queries, those are known to be slow.
  • Avoid using sort, you can sort your results on the client side using js or any client side framework of your choice.
  • Avoid * search, avoid ALL search
  • Tune your garbage collector policies and JVM memory settings (see the JVM options sketch after this list).
  • Consider lowering your memory on the JVM if the total heap that you are assigning is not being used. Big JVM heap sizes lead to bigger GC pauses.
  • Get the fastest CPU you can; search is CPU intensive rather than RAM intensive.
  • Separate search and indexing tiers. If you can have 2 separate Solr server farms, you can dedicate one to indexing and the other to search. This will increase your global performance (only available since Alfresco 4.2.x).
  • Optimize your ACL policy: re-use your permissions, use inheritance and use groups. Don't set up specific permissions for individual users or groups at the folder level; try to re-use your ACLs.
  • Upgrade your Alfresco release with the latest service packs and hotfixes. Those contain the latest Solr improvements and bug fixes that can have great impact on the overall search performance.
  • Make sure you are using only one transformation subsystem. Check alfresco-global.properties and see whether you are using OOoDirect or JodConverter; never enable both sub-systems.
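
As an illustration of the JVM-related rules above, here is a hedged sketch of Solr Tomcat JVM options; the exact values depend entirely on your hardware and index size, and the CMS collector shown here was simply the common choice on the Java 6/7 stacks these Alfresco versions run on:

# setenv.sh / JAVA_OPTS for the Solr Tomcat (illustrative values only)
export JAVA_OPTS="$JAVA_OPTS -Xms4g -Xmx4g \
  -XX:+UseConcMarkSweepGC -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 \
  -verbose:gc -Xloggc:/var/log/tomcat7/solr-gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"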

Typical issues with Searching

It can happen that you are searching and indexing at the same time; this causes concurrent accesses to the indexes, which is known to cause performance issues. There are some workarounds for this situation. To start, you should plug in a profiler and search for commit issues (I/O locks); this will allow you to check whether you are facing this problem.

Solr Search Troubleshooting techniques

To troubleshoot your Solr search problems you should start by choosing a load testing tool such as SolrMeter or JMeter and design your testing scenario with one of those tools. You can also choose to use the default Alfresco benchmark scenario. The second step is to attach a profiler like YourKit, or another Java profiler of your choice, and record search performance snapshots for analysis.

Apply the tunings suggested in this document (especially in the golden rules sections) and retest until you reach the necessary performance.

Solr usage in Share

If your project relies on the Share client offered by Alfresco, you should know that tuning your Solr indexing and search performance will positively affect the overall Share performance.

Share relies on Solr in the following situations:

  • Full Text Search (search field in top right corner)
  • Advanced Search
  • Filters
  • Tags
  • Categories (implemented as facets)
  • Dashlets such as the Recently Modified Documents
  • Wildcard searches for People, Groups, Sites (uses database search if not wildcard)

Overall Best Practices Technical Details

This section contains important technical details that will allow you to implement the various best practices mentioned previously in this post.

1 – Turn on Logging During Search

If you want to have a look at the queries that Alfresco runs against Solr when you click around in Alfresco Share, you can enable debug logging as follows in log4j.properties:

log4j.logger.org.alfresco.repo.search.impl.solr.SolrQueryHTTPClient=debug

A log for a full text search on “Alfresco” looks like this:

2014-01-17 08:21:15,696  DEBUG [impl.solr.SolrQueryHTTPClient] [http-8080-26] Sent :/solr/alfresco/afts?q=%28%28PATH%3A%22%2Fapp%3Acompany_home%2Fst%3Asites%2Fcm%3Atest2%2F*%2F%2F*%22+AND+%28Alfresco++AND+%28%2BTYPE%3A%22cm%3Acontent%22+%2BTYPE%3A%22cm%3Afolder%22%29%29+%29+AND+-TYPE%3A%22cm%3Athumbnail%22+AND+-TYPE%3A%22cm%3AfailedThumbnail%22+AND+-TYPE%3A%22cm%3Arating%22%29+AND+NOT+ASPECT%3A%22sys%3Ahidden%22&wt=json&fl=*%2Cscore&rows=502&df=keywords&start=0&locale=en_GB&fq=%7B%21afts%7DAUTHORITY_FILTER_FROM_JSON&fq=%7B%21afts%7DTENANT_FILTER_FROM_JSON

How to disable SSL communication between Solr and Alfresco

By default the communication between Solr and Alfresco is encrypted. If you don't need this encryption, it's a good idea to disable it in order to reduce complexity, which can contribute to increased performance.

On the Alfresco server, edit the alfresco-global.properties and set:

  • solr.secureComms=none
  • On the alfresco webapp deployment descriptor web.xml, comment out the security constraint.

<!--
<security-constraint>
  <web-resource-collection>
    <url-pattern>/service/api/solr/*</url-pattern>
  </web-resource-collection>
  <auth-constraint>
    <role-name>repoclient</role-name>
  </auth-constraint>
  <user-data-constraint>
    <transport-guarantee>CONFIDENTIAL</transport-guarantee>
  </user-data-constraint>
</security-constraint>
<login-config>
  <auth-method>CLIENT-CERT</auth-method>
  <realm-name>Repository</realm-name>
</login-config>
<security-role>
  <role-name>repoclient</role-name>
</security-role>
-->

  • For every Solr core that you have configured, set alfresco.secureComms=none in the solrcore.properties file.
  • In the Alfresco Solr deployment descriptor web.xml, or in solr.xml under Catalina/conf/localhost/solr.xml, comment out the security constraint as previously shown.

Detailed information can be found in the Alfresco Customer Portal.

How to set a transformation Timeout on Alfresco

Setting a timeout limit (none is set by default) can help you with your tuning and troubleshooting activities.

Timeout (ms): use this limit to set a timeout on reading data from the source file to be transformed. This limit works with transformers that don't bulk read their source data, as it is enforced by a modified InputStream that either throws an exception or returns an end of file (EOF) early. The property associated with this transformation limit is timeoutMs.

You can set this property in your alfresco-global.properties as in the following example:

content.transformer.default.timeoutMs=180000

How to set transformation limits on Alfresco

Setting appropriate transformation limits can help you to fine-tune your transformations and to improve indexing performance.

In Alfresco 4.2d, much of the configuration of transformers is done using Alfresco global properties. In the case of the Enterprise edition these may be changed dynamically via JMX without stopping the server. Prior to this it was possible to control content transformer limits, to a more limited extent, using Spring XML and a few linked Alfresco global properties.

You can find detailed information on transformation limits at http://wiki.alfresco.com/wiki/Content_Transformation_Limits#Introduction
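
As an illustrative sketch (property names taken from the transformation limits documentation referenced above; the values are arbitrary examples, not recommendations):

# alfresco-global.properties – example transformation limits
content.transformer.default.timeoutMs=180000
# skip transforming (and thus full-text indexing) sources larger than ~50 MB
content.transformer.default.maxSourceSizeKBytes=51200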

How to Rebuild Solr Indexes

One useful action that is sometimes required is to rebuild the indexes from scratch. In order to rebuild the Solr indexes, proceed as follows:

1. Stop the Tomcat that runs the Solr web application

2. Remove the index data of the archive core at alf_data/solr/archive/SpacesStore

3. Remove the index data of the workspace core at alf_data/solr/workspace/SpacesStore

4. Remove the cached content model info of the archive core at alf_data/solr/archive-SpacesStore/alfrescoModels/*

5. Remove the cached content model info of the workspace core at alf_data/solr/workspace-SpacesStore/alfrescoModels/*

6. Restart the Tomcat that runs the Solr web application

7. Wait a bit for Solr to start catching up…

Note : index.recovery.mode=FULL is not used by Solr – only by Lucene
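
A rough shell sketch of those steps, assuming the default alf_data layout, a tomcat7 service and that you accept losing the current indexes (adjust paths and service names to your installation):

# stop the Tomcat that runs Solr
sudo service tomcat7 stop

# remove index data and cached content models for both cores
rm -rf /path/to/alf_data/solr/archive/SpacesStore/*
rm -rf /path/to/alf_data/solr/workspace/SpacesStore/*
rm -rf /path/to/alf_data/solr/archive-SpacesStore/alfrescoModels/*
rm -rf /path/to/alf_data/solr/workspace-SpacesStore/alfrescoModels/*

# start Tomcat again and let Solr rebuild the indexes by tracking the repository
sudo service tomcat7 start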

About Sizing on the Solr servers

Sizing your Solr servers depends a lot on the specific search requirements of your project. The important factors you need to consider are :

  • Search ratio: to get the search ratio you should divide the typical usage of the system into read/write/search. Start with 100% and give a percentage to each of the operations.
  • Number of Documents in the repository
  • Group Hierarchy
  • Number of Acls
  • Amount of CPU cores you have available (the more the better :))

Solr can have high memory requirements. You can use a formula to calculate the memory needed for the Alfresco internal data structures used in Solr for PATH queries and read permission enforcement. By default, there are two cores in Solr: WorkspaceSpacesStore and ArchiveSpacesStore. Normally, each core has one searcher but can have a maximum of two searchers.

Alfresco provides a formula that helps you calculate the amount of memory needed on your Solr servers; check the following URL for guidance: http://docs.alfresco.com/community/concepts/solrnodes-memory.html

Below you will find an Excel file that will help you with the calculations (you need to rename the extension from .txt to .xlsx).

Calculate_Memory_Solr Beta 0.2_xlsx

I hope you enjoyed this post; I've surely enjoyed writing it and I hope it can help you with your projects. More interesting posts from my field experience are coming to the blog, so stay tuned.

“The greatest ideas are opensource, together we are Stronger”

 

 

 

The Science of Capacity Planning


Sizing and Architecture – The Science of Capacity Planning

Hi folks, I decided to write an article about capacity planning because it has always been one of my passions. I have 14 years of consulting experience across major accounts in EMEA and I've been involved in hundreds of ECM-related IT projects. I've found adequate capacity planning mechanisms in just a few of those, and by “magic” those were/are the most successful and long-lasting projects.

What is Capacity Planning in an ECM context

Capacity planning is the science and art of estimating the space, computer hardware, software and connection infrastructure resources that will be needed over some future period of time. It's a means to predict the types, quantities, and timing of critical resource capacities that are needed within an infrastructure to meet accurately forecasted workloads.

Capacity planning is critical to ensure the success of any ECM implementation. The prediction and sizing of a system is impossible without a good understanding of user behavior as well as an understanding of the complete deployment architecture, including network topology and database configuration.

A high-level description of my concept of capacity planning is shown below. Basically, a good capacity planning mechanism/application implements each one of the phases outlined below, built around a customized peak period methodology, explained next.

The capacity planning approach that I'm referring to is done after general deployment, so prior to this capacity planning you'll need a good sizing exercise to define the initial architectural requirements for your ECM platform.

In this article we are assuming that we have a fully deployed production environment where we will focus our capacity planning efforts.

Peak Period Methodology

I consider the peak period methodology to be the most efficient way to implement a capacity planning strategy, as it gathers vital performance information when the system is under the most load/stress. In essence, the peak period methodology collects and analyzes data during a configurable peak period. This allows the application to estimate the number of CPUs, the memory and the cluster nodes on the different layers of the application required to support a given expected load.

The peak period may be an hour, a day, 15 minutes or any other period that is used to collect utilization statistics. Assumptions may be estimated based on business requirements or specific benchmarks of a similar implementation.

In my personal approach to ECM capacity planning implementation, I focus my efforts on 6 key layers, obtaining specific metrics during a defined peak period:

  • Web server machines (Apache/web server for static content)

HTTP hits/sec – Useful for measuring the load on the web servers.

  • Application server machines holding the client application

Page views/sec – Understand the throughput of the client applications.

  • Application server machines holding the ECM server (Alfresco/FileNet/Documentum/SharePoint)

Transactions/sec – Understand the throughput of the ECM server.

  • LDAP server machines

Activities (reads)/sec – Understand the throughput of the LDAP servers.

  • Database server machines (Oracle)

Database transactions/sec – Measuring the load on the database servers.

  • Network

KB/sec – A measure of the raw volume of data received and sent by the application. Useful for measuring the load on the network and the machine network cards.

On top of that I also collect a very important metric on the main application client: the response time (the time taken for the client application to respond to a user request). The values I take into consideration for capacity are:

A.R.T – Average response time

M.R.T – Maximum response time

How to implement capacity Planning ?

I normally use a collector agent that collects the necessary data from the various sources during the defined peak period. The collector runs daily and stores its data in ElasticSearch for peak period analysis. The more data gets into ElasticSearch along the application life cycle, the more accurate the capacity predictions are, because they represent the “real” application usage during the defined peak period.

The collector agent uses ZooKeeper to store important information and definitions regarding repositories, machines, the peak period definition, URLs and other environment-related constants.

To minimize the impact on overall system performance, the collector executes every day at a chosen period (outside business hours). That is configured at the OS level (using the crontab functionality or similar), as sketched below.
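
For illustration only (the script path and schedule are hypothetical), a crontab entry that runs a collector script daily outside business hours could look like this:

# run the collector every day at 02:30 and append its output to a log file
30 2 * * * /opt/capacity/collector.sh >> /var/log/capacity-collector.log 2>&1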

Integration Capacity Planning with monitoring systems

This approach is designed to integrate with most of the existing system monitoring software, such as HP OpenView or JavaMelody. I'm currently implementing this approach to perform capacity planning on Alfresco installations and I've integrated it with our existing open source monitoring stack (great job by Miguel Rodriguez from Alfresco Support).

Capacity Planning Troubleshooting

Gathering this relevant data plays an important role in troubleshooting; in the capacity planning implementations I've seen, analysing capacity data is crucial when troubleshooting an application.

Data Analysis to predict architecture changes

By performing regular analysis of our capacity planning data, we know exactly when and how we need to scale our architecture; this plays a very important role when modeling and sizing our architecture for future business requirements.

What’s next

This September and October I will be speaking at the Alfresco Summit in San Francisco and London on how to appropriately size an Alfresco installation. The presentation will also include relevant information about capacity planning and the implementation of this approach in a real-life scenario. Consider yourself invited to join the Alfresco Summit and attend my presentation: http://summit.alfresco.com/2014-speakers/luis-cabaceira

Until then, all the best. One Love.

Luis

Sizing and tuning your Alfresco Database


In the human body the heart and the brain are the 2 most important organs; if those are not performing well, nothing else is. The Alfresco database and the filesystem where the content store resides are the brain and heart of Alfresco. They are the 2 most important layers of every Alfresco architecture.

Get to know the Alfresco Database throughput

If your project will have lots of concurrent users and operations, or the number (or estimated number) of documents is very big (> 1M), you need to be informed about your database throughput.


The most common throughput factor of a database is transactions per second.

The limiting factors for DB performance in a transactional system are usually the underlying database files and the log file. Both are factors because they require disk I/O, which is slow relative to other system resources such as the CPU.

In the worst-case scenario (in big Alfresco databases with a large number of documents):

Database access is truly random and the database is too large for any significant percentage of it to fit into the cache, resulting in a single I/O per requested key/data pair.

Both the database and the log are on a single disk. This means that, for each transaction, the Alfresco DB is potentially performing several filesystem operations:

  • Disk seek to database file
  • Database file read
  • Disk seek to log file
  • Log file write
  • Flush log file information to disk
  • Disk seek to update log file metadata (for example, inode information)
  • Log metadata write
  • Flush log file metadata to disk

Faster disks can normally help in such situations, but there are lots of ways (scale up, scale out) to increase transactional throughput.

In Alfresco the default RDBMS configurations are normally not suitable for large repository deployments and may result in:

  • I/O bottlenecks in the RDBMS throughput
  • Excessive queue for transactions due to overload of connections
  • On active-active cluster configurations, excessive latency

Alfresco database thread pool

Most Java application servers have higher default settings for concurrent access, and this, coupled with other threads in Alfresco (non-HTTP protocol threads, background jobs, etc.) can quickly result in excessive contention for database connections within Alfresco, manifesting as poor performance for users.

If Tomcat is being considered, this value is normally 275. The setting is called db.pool.max and should be added to your alfresco-global.properties (db.pool.max=275).

http://docs.oracle.com/cd/E17076_04/html/programmer_reference/transapp_throughput.html

 

How to calculate the size of your Alfresco database

All operations in Alfresco require a database connection, the database performance plays a crucial role on your Alfresco environment. It’s vital to have the database properly sized and tuned for your specific use case.

To size your alfresco database in terms of space we’ve done a series of tests by creating content (and metadata) on an empty repository and analyzing the database growth.

Be aware that:

  • Content is not stored in the database but is directly stored on the disk
  • Database size is unaffected by the size of the documents or the documents' content
  • Database size is affected by the number/type of metadata fields of the document


The following factors are relevant to calculate the approximate size for an Alfresco database

  • Number of meta data fields
  • Permissions
  • Number of folders
  • Number of documents
  • Number of versions
  • Number of users

I’ve made a series of tests where i could verify how the Alfresco database grows.

I’ve made a bulk import with the following data.

Document creation method: In-place bulk upload
Number of documents ingested: 148
Total size of documents: 929.14 MB
Number of metadata fields per document: 13
Total number of metadata fields: 1924

The table below shows the types of documents and their average sizes.

Document Type | Extension | Average Size (KB)
MS Word Document | .doc | 1024
Excel Sheet | .xls | 800
PDF document | .pdf | 10240
PowerPoint presentation | .ppt | 5120
JPG image | .jpg | 2048

Looking at the chart below we can see that the database indexes grow more than the data itself. By observing the growth of the database size we've concluded that the average metadata field occupation in the Alfresco database is approximately 5.5 KB per metadata field.

[Chart: database data vs. index growth during the ingestion test]

 

It is also interesting to verify which tables grow in size (KB) after the content ingestion. Note that we are not applying any permissions.

[Chart: table growth (KB) after the content ingestion]

 

To size your database appropriately you must ask the right questions, whose answers will help you determine the database sizing.

  1. Estimated number of users in Alfresco
  2. Estimated number of groups in Alfresco
  3. Estimated number of documents on the first year
  4. Documents growth rate
  5. Average number of versions per document
  6. Average number of meta-data fields per document
  7. Estimated number of folders
  8. Average number of meta-data fields per folder
  9. Estimated number of concurrent users
  10. Folder based permissions (inherited to child documents)?

Database sizing formulas

Consider to following figures to determine your approximate database size.

– DV = Average number of document versions

– F = Estimated number of folders

– FA = Estimated number of folder metadata fields (standard + custom)

– D = Number of documents * DV – estimated number of documents including the versions

– DA = Estimated number of document metadata fields (standard + custom)

The number of records on specific alfresco tables is calculated as follows:

– Number of records on alf_node (TN = F + D * DV)

– Number of records on node_properties (TNP = F * FA + D * DA)

– Number of records on node_status (TNS = F + D)

– Number of records on alf_acl_member (TP = D), assuming permissions will be set at the folder level and inherited

The approx. number of records in the database will be TRDB = TN + TNP + TNS + TP

The following formula is based on the number of database records. In our benchmarks we've observed that each database record takes about 4.5 KB of DB space.

Formula #1: Database size = TRDB * 4.5 KB

Alternatively, we can base our calculations on the number of metadata fields of the documents, considering 5.5 KB for each metadata field, and use the following formula.

Formula #2: Database size = (D * DA + F * FA) * 5.5 KB

The 2 formulas provided are only approximations of the size that your database will need and are based on benchmarks executed against a vanilla Alfresco version 4.2.2.
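
As a quick worked illustration (the figures are hypothetical): for 1,000,000 documents including versions with 15 metadata fields each, plus 50,000 folders with 10 fields each, Formula #2 gives (1,000,000 * 15 + 50,000 * 10) * 5.5 KB = 15,500,000 * 5.5 KB, which is roughly 85 GB of database space.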

If we wish to consider users and groups, add 2 KB for each user and 5 KB for each group.

Note that the formulas do not take into consideration additional space for logging, rollback, redo logs, etc.

Tuning your Alfresco database

In Alfresco the default RDBMS configurations are normally not suitable for large repository deployments and may result in:

•       Wrong or improper support for ACID transaction properties

•       I/O bottlenecks in the RDBMS throughput

•       Excessive queue for transactions due to overload of connections

•       On active-active cluster configurations, excessive latency

Considering that your database layer will be used under concurrent load, I've come up with a set of hints that will contribute to maximizing your Alfresco database performance.

Database Thread pool configuration

A default Alfresco instance is configured to use up to a maximum of forty (40) database connections.  Because all operations in Alfresco require a database connection, this places a hard upper limit on the amount of concurrent requests a single Alfresco instance can service (i.e. 40), from all protocols.

Most Java application servers have higher default settings for concurrent access, and this, coupled with other threads in Alfresco (non-HTTP protocol threads, background jobs, etc.) can quickly result in excessive contention for database connections within Alfresco, manifesting as poor performance for users.

It’s recommended to increase the maximum size of the database connection pool to at least [number of application server worker threads] + 75.  If tomcat is being considerer this value is normally 275. The setting is called db.pool.max and should be added to your alfresco-global.properties (db.pool.max=275).

After increasing the size of the Alfresco database connection pool, you must also increase the number of concurrent connections your database can handle, to at least the size of the Alfresco connection pool. Alfresco recommends configuring at least 10 more connections to the database than is configured into the Alfresco connection pool, to ensure that you can still connect to the database even if Alfresco saturates its connection pool.
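
A minimal sketch of the two sides of that change, assuming Tomcat's default 200 worker threads and a PostgreSQL database (adjust to your own stack):

# alfresco-global.properties
db.pool.max=275

# postgresql.conf – allow at least the pool size plus ~10 spare connections
max_connections=285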

Database Validation query

By default Alfresco does not periodically validate each database connection retrieved from the database connection pool. Validating connections is, however, very important for long-running Alfresco servers, since there are various ways database connections can unexpectedly be closed (for example by transient network glitches and database server timeouts). Enabling periodic validation of database connections involves adding the db.pool.validate.query property to alfresco-global.properties; the query is specific to your database type.

Database | Value for db.pool.validate.query
MySQL | SELECT 1
PostgreSQL | SELECT VERSION()
Oracle | SELECT 1 FROM DUAL
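
For example, on a PostgreSQL-backed installation the property would look like this in alfresco-global.properties (use the value from the table that matches your database):

# alfresco-global.properties – validate pooled connections before use
db.pool.validate.query=SELECT VERSION()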

Database Scaling

Alfresco relies heavily on fast, highly transactional interaction with the RDBMS, so the health of the underlying system is vital. Considering our existing customers, the biggest running repositories are on Oracle (most of them RAC).

If your project will have lots of concurrent users and operations, consider an active-active database cluster with at least 2 machines. This can be achieved using Oracle RAC, a MySQL-based solution with haproxy[1] (an open source solution), or a commercial solution like MariaDB[2] or Percona[3].

[1]http://haproxy.1wt.eu

[2]https://mariadb.org

[3]http://www.percona.com

You can use either solution, depending also on the knowledge that you have in-house. The golden rule is that the response time from the DB should, in general, be around 4 ms or lower. At this layer I don't recommend using any virtualization technology; the database servers should be physical servers.

The high availability and scalability for database is vendor dependent and should be addressed with the chosen vendor to achieve the maximum performance possible.

Database Monitoring

Monitoring your database performance is very important as it can detect some possible performance problems or scaling needs.

I’ve identified the following targets that should be monitored and analysed on a regular base.

  • Transactions
  • Number of Connections
  • Slow Queries
  • Query Plans
  • Critical DM database queries ( # documents of each mime type, … )
  • Database server health (cpu, memory, IO, Network)
  • Database sizing statistics (growth, etc)
  • Peak Period of resource usage
  • Indexes Size and Health

I hope this post can help you to understand the importance of the Alfresco database and that you can make use of it on your sizing exercise. Stay tuned for more Alfresco related posts.

All the best, One love,

Luis


Monitoring your Alfresco solution


Hi folks, this post follows my previous post about capacity planning and provides you with the tools (and a VMware image ready to run) for you to implement it.

I would like to start with a huge thank you message to Miguel Rodriguez (Alfresco Support Engineer). He is the creator of this monitoring solution and also the person responsible for setting up the VMware image with all the tools, scripts, etc. My hero list just got bigger; Miguel got a place just after Spider-Man and the Silver Surfer :)

Monitoring Alfresco with OpenSource tools

Monitoring your Alfresco architecture is a known best practice. It allows you to track and store all relevant system metrics and events, which can help you to:

  • Troubleshoot possible problems
  • Verify system health
  • Check user behavior
  • Build a robust historical data warehouse for later analysis and capacity planning

This post explains a typical monitoring scenario over an Alfresco deployment, using only open source tools.

I’m proposing a fully opensource stack of monitoring tools that build the global monitoring solution. The  solution will make use of the following opensource products.

The solution will be monitoring all layers of the application, producing valuable data on all critical aspects of the infrastructure. This will allow a pro-active system administration opposed to a reactive way of facing possible problems by predicting the problems before they happen and take the necessary measures to maintain a healthy system on all layers.

I see this approach as both a monitoring and a capacity planning system, providing “near” real-time information updates, customized reporting and a custom search mechanism over the collected data.

The diagram below shows how the different components of the solution integrate. Note that we centralize data from all nodes and the various layers of the application in a single location.

[Diagram: monitoring solution components and data flow across the application layers]

The sample architecture being monitored consists of a cluster of two Alfresco/Share nodes serving user requests and two Alfresco/Solr nodes for indexing/searching content.

Consider the 3 major components of the monitoring solution:

  • Logstash file tailing to monitor Alfresco log files, and Logstash command execution to monitor specific components, i.e. processes, memory, disk, Java stack traces, etc.
  • JavaMelody to monitor applications running in a JVM and other system resources.
  • Icinga to send jmx requests to Alfresco servers.

Dedicated Monitoring Server Download

All software components of the monitoring server come installed on a VMware image that we offer for free (within the open source spirit :)).

You can download your copy of this monitoring server in  http://eu.dl.alfresco.com.s3.amazonaws.com/release/Support/AlfrescoMonitoringVirtualServer/v1.0/AlfrescoMonitoringVirtualServer-1.0.tar 

The image runs the ElasticSearch server that will collect all the logs from the various components of the application and will host the graphical user interfaces (Kibana and Grafana) used to view the monitoring data.

About JavaMelody

JavaMelody is used to monitor Java or Java EE application servers in QA and production environments. It is a tool to measure and calculate statistics on the real operation of an application, based on how users actually use it. It is very easy to integrate into most applications and is lightweight, with almost no impact on target systems.

This tool is mainly based on statistics of requests and on evolution charts; for that reason it's an important add-on to our benchmarking project, as it allows us to see in real time the evolution charts of the most important aspects of our application.

It includes summary charts showing the evolution over time of the following indicators:

  • Number of executions, mean execution times and percentage of errors of http requests, sql requests, jsp pages or methods of business façades (if EJB3, Spring or Guice)
  • Java memory
  • Java CPU
  • Number of user sessions
  • Number of jdbc connections

These charts can be viewed on the current day, week, month, year or custom period.

You can find detailed information about JavaMelody at https://code.google.com/p/javamelody/

Installing JavaMelody

It's really easy to attach the JavaMelody monitor to all Alfresco applications (alfresco.war or share.war) and to every other web application that is deployed on your application server.

Step 1

Configure the JavaMelody monitoring on the Alfresco Tomcat by copying itextpdf-5.5.2.jar, javamelody.jar and jrobin-1.5.9.1.jar to the Tomcat shared lib folder under <tomcat_install_dir>\shared\lib, or to your application server's global classloader location (if not Tomcat).

Step 2

Edit the global Tomcat web.xml file (D:\alfresco\tomcat\conf\web.xml) to enable JavaMelody monitoring on every application. Add the following filter:

<filter>
<filter-name>monitoring</filter-name>
<filter-class>net.bull.javamelody.MonitoringFilter</filter-class>
</filter>
<filter-mapping>
<filter-name>monitoring</filter-name>
<url-pattern>/*</url-pattern>
</filter-mapping>
<listener>
<listener-class>net.bull.javamelody.SessionListener</listener-class>
</listener>

And that’s about it, after restarting you can access the monitorization of every application in http://<your_host>:<server_port>/<web-app-context>/monitoring, for example http://localhost:8080/alfresco/monitoring

[Screenshot: JavaMelody monitoring page for the Alfresco web application]

Monitoring Stages Breakdown

Stage 1 – Data Capturing (Logstash)

[Diagram: data capturing with Logstash and scheduled jobs]

We capture monitoring data using different procedures.

  • Scheduled Jobs (Db queries, Alfresco jmx Beans queries, OS level commands)
  • Log indexing with Logstash. We use Logstash to collect logs, parse them, and send them to ElasticSearch to be stored for later use (for example, searching)
  • The Alfresco audit log (when configured) is also parsed and indexed by ElasticSearch, providing all the enabled audit statistics.
  • Metrics with JavaMelody

Stage 2 – Monitoring Data Archiving (ElasticSearch)

[Diagram: data archiving flow with Logstash, Redis and ElasticSearch]

In the diagram above we can see the flow of data capturing using Logstash and ElasticSearch. Let's see some details on each of the boxes in the diagram.

Logstash is a tool for managing events and logs. You can use it to collect logs, parse them, and send them to ElasticSearch to be stored for later use (for example, searching).

Redis is a log data broker, receiving data from log “shippers” and sending the data to a log “indexer”.

ElasticSearch is a distributed, RESTful, free, Lucene-powered search engine/server.

Kibana3 is a tool for displaying and interacting with your data.

Stage 3 – Trending and Analysis (Kibana, Grafana)

[Diagram: trending and analysis flow with Statsd, Carbon, Whisper and Grafana]

To analyze the data and the trends we install 2 different GUIs on the monitoring server (Kibana and Grafana).

Kibana allows us to check the indexed logs with metadata, and to troubleshoot specific log traces. It provides a very robust search mechanism on top of the ElasticSearch indexes, and gives strategic technical insights with a global overview of all layers of the platform, delivering actionable insights in real time from almost any type of structured and unstructured data source.

In the flow above we can see how the information and statistics get to Grafana.

Grafana is a beautiful dashboard for displaying various Graphite metrics through a web browser. It has enormous potential and it's easy to set up and customize for different business needs.

Let’s have a closer look on the remaining components on the flow diagram.

Statsd is a network daemon that listens for statistics, like counters and timers sent over UDP and sends them to Carbon.

Carbon accepts metrics over various protocols and caches them in RAM as they are received, flushing them to disk on an interval using the underlying whisper library.

Whisper provides fast, reliable storage of numeric data over time

Grafana is an easy to use and feature rich Graphite dashboard

Stage 4 – Monitoring

[Screenshot: monitoring dashboards]

We use scheduled commands and index the data in ElasticSearch, checking the following monitoring information from the Alfresco and Solr servers.

  • JVM Memory Usage
  • Server Memory
  • Alfresco Cpu utilization
  • Overall Server Cpu utilization
  • Solr Indexing Information
  • Number of documents on Alfresco “live” store
  • Number of documents on Alfresco “archive” store
  • Number of concurrent users on Alfresco repository
  • Alfresco Database pool occupation
  • Number of active sessions on Alfresco Share
  • Number of active sessions on Alfresco Workdesk
  • Number of busy tomcat threads
  • Number of current tomcat threads
  • Number of maximum tomcat threads

These can be extended at any time, adding monitoring of any target relevant to your use case.

Stage 5 – Troubleshooting

[Diagram: troubleshooting with Kibana, Grafana and JavaMelody]

While troubleshooting we use Kibana/Grafana and JavaMelody.

Kibana allow’s us to check the “indexed” logs with meta-data and verify exactly what classes are related with the problem as well as the number of occurrences and root of the exceptions.

Grafana shows us what/how/when server resources are being affected by the problem.

JavaMelody provides detailed information on crucial sections of the application. The goal of JavaMelody is to monitor Java or Java EE application servers in QA and production environments.

It produces graphs for Memory, CPU, HTTP Sessions, Threads, GC, JDBC Connections, SQL Hits, Open Files, Disk Space, Network I/O, Statistics for HTTP traffic, Statistics for SQL queries, Thread dumps, JMX Beans information and overall System Information. Java Melody has a Web interface to report on data statistics.

Using these 3 tools, troubleshooting a possible problem becomes a friendly task and boosts the speed of investigations that would normally take ages to gather all the necessary information to get to the root cause of the issue.

Stage 6 – Notification and Reporting

[Diagram: notification and reporting with Icinga]

We use Icinga to notify the delegated Alfresco administrator (by email) when there is a problem with the Alfresco system. Icinga is an enterprise-grade open source monitoring system that keeps watch over networks and resources, notifies the user of errors and recoveries, and generates performance data for reporting.

Icinga Web is highly dynamic and laid out as a dashboard with tabs, which allow the user to flip between the different views that they need at any one time.

[Screenshot: Icinga Web dashboard]

Stage 7 – Sizing Adjustments

Sizing will be a human action on top of the capacity and monitoring solution. By performing a regular analysis of the monitoring/capacity planning data, we will know exactly when and how we need to scale our architecture.

The more data gets into ElasticSearch along the application life cycle, the more accurate the capacity predictions are, because they represent the “real” application usage during the defined period.

This represents a very important role when modeling and sizing the architecture for the future business requirements.

7.1 – Peak Period Methodology

The peak period methodology is the most efficient way to implement a capacity planning strategy, as it allows you to analyze vital performance information when the system is under the most load/stress. In essence, the peak period methodology collects and analyzes data during a configurable peak period. This allows the application to estimate the number of CPUs, the memory and the cluster nodes on the different layers of the application required to support a given expected load.

The peak period may be an hour, a day, 15 minutes or any other period that is used to analyze the collected utilization statistics. Assumptions may be estimated based on business requirements or specific benchmarks of a similar implementation.

Your monitoring targets on an Alfresco installation

I’ve identified the following targets to be candidates to participate on the Monitoring system and have their data indexed and stored on elastic search.

Database

  • Transactions
  • Number of Connections
  • Slow Queries
  • Query Plans
  • Critical DM database queries ( # documents of each mime type, … )
  • Database server health (CPU, memory, IO, network)
  • Database statistics integration
  • Database sizing statistics ( growth, etc)
  • Peak Period

Application Servers (Tomcats)

  • Request and response times
  • Access Logs ( number of concurrent requests, number concurrent users , etc)
  • Cpu
  • Io
  • Memory
  • Disk Space Usage
  • Peak period
  • Longest Request
  • Threads ( Concurrent Threads, Busy Threads )

Application JVM

  • Jvm Settings Analysis
  • GC Analysis
  • Log Analysis (Errors, Exceptions, Warnings, Class Segmentations(Authorization, Permissions, Authentication)
  • Auditing Enabling and Analysis (Logins, Reads, Writes, Changed Permissions, Workflows Running, Workflows States)
  • Caches Monitoring (Caches usage, invalidation, cache sizes )
  • Protocol Analysis (FTP, CMIS; Sharepoint, WEBDAV, IMAP, CIFS )
  • Architecture analysis

Search Subsystem(Solr)

  • JMX beans monitoring
  • Caches ( Configuration, Utilization, Tuning, Inserts, Evictions and Hits )
  • Indexes Health
  • Jvm Settings Analysis
  • Jvm Health Analysis
  • Garbage collection Analysis
  • Query Debug (Response times, query analysis, slow queries, Peak periods)
  • Search and Index Memory Usage

Network

  • Input/Output
  • High availability
  • Tcp Errors / Network errors at Network protocol level
  • Security Analysis ( Ports open, Firewalls, network topology , proxies, encryption )

Shared File Systems

  • Networking to clients hosts
  • Storage Type ( SAN, NAS )
  • I/O

Clustering

  • Cluster member subscription analysis
  • Cluster cache invalidation strategy and shared caches performance
  • Cluster load balancing algorithm performance (cluster nodes load distribution)

The Alfresco Audit Trail

The monitoring solution also uses and indexes the Alfresco audit trail log, when audit is enabled. Alfresco audit should be used with caution as auditing too many events may have a negative impact on performance.

Alfresco has the option of enabling and configuring an audit trail log. It stores specific user actions (configurable) on a dedicated log file (audit trail).

Building on the auditing architecture, the data producer org.alfresco.repo.audit.access.AccessAuditor gathers lower-level events together into user-recognizable events. For example, the download or preview of content is recorded as a single read. Similarly, the upload of a new version of a document is recorded as a single create version. By contrast, the AuditMethodInterceptor data producer would typically record multiple events.

A default audit configuration file located at <alfresco.war>/WEB-INF/classes/alfresco/audit/alfresco-audit-access.xml is provided that persists audit data for general use. This may be enhanced to extract additional data of interest to specific installations. For ease of use, login success, login failure and logout events are also persisted by the default configuration.

Default audit filter settings are also provided for the AccessAuditor data producer, so that internal events are not reported. These settings may be customized (by setting global properties) to include or exclude auditing of specific areas of the repository, users or some other value included in the audit data created by AccessAuditor.

No additional functionality is provided for the retrieval of persisted audit data, as all data is stored in the standard way, so it is accessible via the AuditService search, audit web scripts, database queries and the Alfresco Explorer show_audit.ftl preview.

Detailed information on the audit possibilities is available in the official Alfresco documentation.

And that’s about it folks, I hope you liked this article and that it can help you in monitoring your projects. More articles with relevant information from the field are coming up, so stay tuned.

All the best, One Love,

Luis

 

A formula for your Alfresco throughput


Hi all, back with another interesting topic, this time with a methodology to help you determine the throughput of your Alfresco repository server. I call it the C.A.R. :)


C.A.R = Capacity of an alfresco repository server

If you think about it, an ECM repository is very similar to a database with regard to the type of events that it manages and executes, especially when we consider a transaction to be a basic repository operation such as create, browse, download, update or delete. To know the capacity of an Alfresco repository server we can introduce the concept of “transactions per second”, where a “transaction” is considered to be a basic repository operation (create, browse, download, update and delete).


The “C.A.R.” methodology

The aim is to define a common, standard figure that can empirically define the capacity of a repository instance. The C.A.R. methodology is based on the following sentence and is represented in transactions per second.

The capacity of an Alfresco repository server is determined by the number of transactions it can handle in a single second before degrading the expected performance.


To create a formula that reflects that sentence we need to introduce 3 important figures:

EC = The expected concurrency represented in number of users.

TT = user think time, represented in seconds; it means that, on average, over a given period of time (“the think time”) the system will receive requests from N different users.

ERT = Expected response times object.

Decreasing the ERT values generally means the need to increase the capacity of the Alfresco repository server.

This is a complex object, represented as key/value pairs, with the types of response times being considered, the weight of each type and the corresponding value in seconds. It takes expected user behaviour as arguments.

Sample ERT

When we decrease our ERT argument values we will normally need to scale (up/out) our Alfresco and database servers.
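
Since the sample ERT above is only shown as an image in the original post, here is a minimal sketch of what such an object could look like; the operation names, weights and response times below are assumptions for illustration, not the figures from the original sample:

import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of an ERT (Expected Response Times) object: each operation type has a
// weight (its share of the workload) and an expected response time in seconds.
// All values below are illustrative assumptions.
public class ExpectedResponseTimes {

    static final class Entry {
        final double weight;          // fraction of the overall workload (0..1)
        final double expectedSeconds; // agreed response time for this operation type
        Entry(double weight, double expectedSeconds) {
            this.weight = weight;
            this.expectedSeconds = expectedSeconds;
        }
    }

    final Map<String, Entry> byOperation = new LinkedHashMap<>();

    ExpectedResponseTimes() {
        byOperation.put("browse",   new Entry(0.40, 1.0));
        byOperation.put("download", new Entry(0.30, 2.0));
        byOperation.put("create",   new Entry(0.15, 3.0));
        byOperation.put("update",   new Entry(0.10, 3.0));
        byOperation.put("delete",   new Entry(0.05, 1.0));
    }

    /** Weighted average of the expected response times, in seconds. */
    double weightedExpectedSeconds() {
        return byOperation.values().stream()
                .mapToDouble(e -> e.weight * e.expectedSeconds)
                .sum();
    }
}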

Introducing those 3 attributes (EC, TT and ERT), we can say that the C.A.R. of an Alfresco repository server is:

The number of transactions that the server can handle in one second under the expected concurrency (EC), with the agreed think time (TT), while ensuring the expected response times (ERT).

Shape shifting – a flexible formula approach

The C.A.R. formula is not deterministic, as no single expression can reflect the variables of every use case. It is dynamic and specific to each use case, and it is built on a system of attributes, values, weights and affected areas.

To have a definition of the formula that really represents the capacity of the Alfresco servers on your infrastructure, you need to consider one or more ERT (expected response times) objects representing the expected response times of use-case-specific operations. Those objects act as increasers (+) and reducers (-) of the server throughput.

The formula can shift and may be adapted with more ERT Objects that define the use case for fine tuned predictions.

The Heartbeat of the Alfresco server

The easiest way is to enable the audit trail and parse it, counting the transactions that are occurring on the repository. With the new reporting and analysis features coming up in Alfresco One 5.0 it will be even simpler to get access to this information.
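
As a very rough sketch of that idea, the snippet below scans an audit/access log and reports the busiest second it finds. It assumes one line per audited event prefixed with an ISO-8601 timestamp (a hypothetical format – adjust the parsing to whatever your audit trail actually produces):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch: derive a rough "transactions per second" figure from an audit/access log,
// assuming one line per audited event starting with "yyyy-MM-ddTHH:mm:ss" (hypothetical format).
public class AuditHeartbeat {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> perSecond = new HashMap<>();
        Files.lines(Paths.get(args[0])).forEach(line -> {
            if (line.length() >= 19) {
                String second = line.substring(0, 19);   // the timestamp truncated to the second
                perSecond.merge(second, 1, Integer::sum); // count events in that second
            }
        });
        int peakTps = perSecond.values().stream().mapToInt(Integer::intValue).max().orElse(0);
        System.out.println("Peak observed transactions per second: " + peakTps);
    }
}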

Some initial lab tests

We executed some simple lab tests, configured with one server running Alfresco and another running the database, and observed the following results for a single Alfresco server.

Alfresco Server Details

  • Processor: 64-bit Intel Xeon 3.3Ghz (Quad-Core)
  • Memory: 8GB RAM
  • JVM: 64-bit Sun Java 7 (JDK 1.7)
  • Operating System: 64-bit Red Hat Linux

Test Details

  • ERT = The sample ERT values shown on this post
  • Think Time = 30 seconds.
  • EC = 150 users

The C.A.R. of the server was between 10-15 TPS during usage peaks. Through JVM tuning, along with network and database optimizations, this number can rise to over 30 TPS.
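
As a back-of-the-envelope sanity check on these figures (an assumption of roughly one basic operation per user per think time, not part of the C.A.R. formula itself), the expected average arrival rate is EC / TT = 150 / 30 ≈ 5 requests per second, comfortably below the 10-15 TPS the server sustained at peak:

// Rough sanity check of the lab figures, assuming each of the EC concurrent users issues
// about one basic repository operation per think time (an assumption, not the C.A.R. formula).
public class CarSanityCheck {
    public static void main(String[] args) {
        int expectedConcurrency = 150;   // EC
        double thinkTimeSeconds = 30.0;  // TT
        double averageArrivalRate = expectedConcurrency / thinkTimeSeconds;
        System.out.printf("Average arrival rate: %.1f requests/s%n", averageArrivalRate);
    }
}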

I think this is a very reliable definition of the capacity of an Alfresco repository that can be used to support a sizing exercise. What do you think? Opinions are welcome and highly appreciated; use the comments area on the blog to add yours!

OpenSource – Together we are Stronger ! One Love“,

Luis

Super Sizing your Alfresco Repository


Hi everyone, I’m back to share with you a very interesting tool that can help you with your tests and benchmarks.

“Have you ever wanted to test/benchmark your Alfresco project implementation with millions of documents before you deliver it for its go-live stage?”

I’m sure you have; it’s normally not that easy to create a large number of dummy documents and corresponding metadata fields that can emulate what will be present in production. To solve this problem we’ve created a tool (open source, as always :) ) that can help you do just that. Many thanks to Alex Strachan from Alfresco support who wrote the user interface for this tool.

The tool is named “SuperSizeMyRepo” and it’s available at https://github.com/lcabaceira/supersizemyrepo. It’s a multi-threaded tool that enables you to create a huge amount (millions) of (bulk-import-ready) content and metadata for your Alfresco repository.

Types of documents created

  • MS Word Documents (.doc) with an average size of 1024k
  • MS Excel Documents (.xls) with an average size of 800k
  • PDF documents (.pdf) with an average size of 10MB
  • MS PowerPoint Presentation Documents (.ppt) with an average size of 5MB
  • JPEG images (.jpg) with an average size of 2MB

All documents are created with their corresponding metadata XML properties file.
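
As an illustration of what such a properties file can look like, the sketch below writes a metadata “shadow file” next to a document using the java.util.Properties XML format that the bulk filesystem importer understands; the file name convention (<document>.metadata.properties.xml) and the example type, aspects and property keys are given here as assumptions, so double-check them against the Bulk Importer documentation and your own content model:

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Properties;

// Minimal sketch of a metadata "shadow file" next to a document, in the java.util.Properties
// XML format. Type, aspects and property keys are illustrative assumptions.
public class MetadataShadowFileExample {
    public static void main(String[] args) throws IOException {
        Properties metadata = new Properties();
        metadata.setProperty("type", "cm:content");
        metadata.setProperty("aspects", "cm:titled,cm:author");
        metadata.setProperty("cm:title", "Quarterly report");
        metadata.setProperty("cm:author", "Luis");

        // e.g. report.doc.metadata.properties.xml sits next to report.doc
        try (FileOutputStream out = new FileOutputStream("report.doc.metadata.properties.xml")) {
            metadata.storeToXML(out, null, "UTF-8");
        }
    }
}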


Configuring the Documents meta-data

As you can see from the tool’s UI, you can manually configure the values of the metadata fields that will be created for the documents, but even more interesting is the ability to inject aspects directly into the document creation.

Injecting Aspects

You can also edit the field names, meaning that if you specify custom aspects, you can configure the remaining fields to carry the property names of the attributes present in your custom aspects.

How about Indexing ?

We’ve also thought about testing the search sub-system (Solr or Lucene) with big amounts of data. For this reason the documents are created with lots of random words that will get indexed into Solr or Lucene. This way you can test both the repository and the search layer.
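
The idea is simply to fill each document body with a large number of random dictionary words so that Solr/Lucene has realistic terms to index; a minimal sketch of that idea (not the actual SuperSizeMyRepo code, and with an illustrative word list) could be:

import java.util.Random;

// Minimal sketch: generate document text from random dictionary words so the search
// subsystem has realistic terms to index. Word list and sizes are illustrative.
public class RandomTextGenerator {
    private static final String[] WORDS = {
        "alfresco", "repository", "content", "metadata", "index", "search", "aspect", "node"
    };

    static String randomText(int wordCount, Random random) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < wordCount; i++) {
            text.append(WORDS[random.nextInt(WORDS.length)]).append(' ');
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(randomText(50, new Random()));
    }
}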

What are the images for ?

To have the documents created with the sizes announced we also needed to include some random images. We provide you with a set of images that you can download and use as your local library for the document creation. This set of random images is available here. You can also use your own set of images as long as they are all JPGs and they are present in the images folder root.

What is the deployment folder ?

The deployment folder is where your documents will be created; normally this is a place inside your contentStore, so that you can perform an in-place bulk import, one of the fastest ways to inject lots of content into your Alfresco repository. You can specify any folder for the document creation.

Maximum number of files per folder

When you import the documents, the folder structure (if any) will also be imported. According to Alfresco best practices, having a huge number of documents in the same folder can lead to performance degradation, mainly because of the ACL permission checking that happens when a user browses a folder. Alfresco needs to determine which documents can be shown to the user, and for that it needs to verify the permissions of each item in that directory. To reduce this overhead, we’ve introduced the option to specify a maximum number of documents that the tool can create in a single folder; when this number is reached the tool creates new folders and the new documents are created in those folders, as sketched below.
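
A minimal sketch of that rollover logic (an illustration of the approach, not the actual tool code) could look like this:

import java.io.File;

// Minimal sketch of the rollover idea: once maxFilesPerFolder documents have been written to
// the current folder, a new sibling folder is created and subsequent documents go there.
public class FolderRollover {
    private final File deploymentRoot;
    private final int maxFilesPerFolder;
    private File currentFolder;
    private int filesInCurrentFolder;
    private int folderIndex;

    FolderRollover(File deploymentRoot, int maxFilesPerFolder) {
        this.deploymentRoot = deploymentRoot;
        this.maxFilesPerFolder = maxFilesPerFolder;
        this.currentFolder = newFolder();
    }

    /** Returns the folder the next document should be written to. */
    File nextTargetFolder() {
        if (filesInCurrentFolder >= maxFilesPerFolder) {
            currentFolder = newFolder();
            filesInCurrentFolder = 0;
        }
        filesInCurrentFolder++;
        return currentFolder;
    }

    private File newFolder() {
        File folder = new File(deploymentRoot, "folder-" + (++folderIndex));
        folder.mkdirs();
        return folder;
    }
}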

JumpStart with the compiled version

If you wish to run the compiled version (available in the uiJars folder) there are no prerequisites apart from having Java installed on your server to be able to execute a jar file.

Download the jar file for your OS, currently the UI is released for 3 different OS.

Note for macOS users: to execute the jar you should open a terminal and run it with the -XstartOnFirstThread option, like the example below:

java -XstartOnFirstThread -jar ./ssmr-ui-1.0.3-osx-jar-with-dependencies.jar

Want to take the Deep dive approach ?

Great, I would like to take this opportunity to invite you to participate in this project and to contribute with new features and your own ideas. This section provides guidance for downloading the source code and building it yourself.

1 – Software requirements

  • JDK 1.7
  • Apache Maven 3.0.4+

2 – Configuration requirements

During the installation of Maven, a new file named settings.xml was created. This file is the entry point to your local Maven settings configuration, including the remote Maven repositories. Edit your settings.xml file and update the servers section, including the Alfresco server IDs and your credentials.

Note that the root pom.xml references 2 different repositories: alfresco-public and alfresco-public-snapshots. The id of each repository must match a server id in your settings.xml (where you specify your credentials for that server).

Section from configuration settings.xml

        <server>
            <id>alfresco-public</id>
            <username>YOUR_USERNAME</username>
            <password>YOUR_PASSWORD</password>
        </server>
        <server>
            <id>alfresco-public-snapshots</id>
            <username>YOUR_USERNAME</username>
            <password>YOUR_PASSWORD</password>
        </server>

Section from pom.xml

 <repository>
            <id>alfresco-public</id>
            <url>https://artifacts.alfresco.com/nexus/content/groups/public</url>
 </repository>
  <repository>
            <id>alfresco-public-snapshots</id>
            <url>https://artifacts.alfresco.com/nexus/content/groups/public-snapshots</url>
   </repository>

3 – Location/Path Where to create the files

Edit the src/main/java/super-size-my-repo.properties and configure your deployment location and the images location.

files_deployment_location : Should be in a place inside your contentStore. This will be the root for the in-place bulk import.

images_location : The tool randomly chooses from a folder of local images to include in the various document types. You need to set images_location to a folder where you have jpg images. You can use the sample images by pointing images_location to your /images folder. The bigger your images are, the bigger your target documents will be. For the document sizes considered, we expect jpg images of approximately 1.5MB.

Tool Configuration files and options

You can find the tool configuration file under src/main/java/super-size-my-repo.properties. This configuration file contains the following self-explanatory properties.

files_deployment_location=<PATH_WHERE_THE_FILES_WILL_BE_CREATED>
images_location=<DEFAULT_LOCATION_FOR_BASE_IMAGES>
num_Threads=<NUMBER_OF_THREADS_TO_EXECUTE>
threadPoolSize=<SIZE_OF_THE_THREAD_POOL>
max_files_per_folder=<NUMBER_OF_MAX_FILES_IN_A_SINGLE_FOLDER>

The only 2 properties that are mandatory to adjust are files_deployment_location and images_location. All of the other properties have default running values.

How to run with maven ?

Issue the following maven command to generate the targets (executable jar) from the project root.

P.S. – Don’t forget to configure your properties file.

# mvn clean install

This will build and generate the executable jar on the target directory.

To run this jar, just type :

java -jar super-size-my-repo-<YOUR_VERSION>-SNAPSHOT-jar-with-dependencies.jar

Next Steps ?

After running the tool, you will have lots of documents to import using the Alfresco bulk importer. To perform an in-place import, you need to set files_deployment_location to a location inside your contentstore.

Now you can execute the in-place bulk import action to add all the documents and corresponding metadata to a target Alfresco repository.

The Streaming bulk import url on your alfresco is : http://localhost:8080/alfresco/service/bulkfsimport

The in-place bulk import url on your alfresco is : http://localhost:8080/alfresco/service/bulkfsimport/inplace

Note that you may need to adjust localhost and the 8080 port with your server details if you are not running Alfresco locally or if you are not running it on the default 8080 port.

Check http://wiki.alfresco.com/wiki/Bulk_Importer for more details.

And that is it folks. If you would like to contribute to the evolution of this tool, send me an email and I will add you as a contributor with commit rights to the GitHub repository.

I hope you enjoyed this article as much as I enjoyed writing it, and that you can make good use of this nice tool. Stay tuned for more Alfresco-related articles and don’t forget to support open-source projects.

OpenSource – Together we are stronger, One Love

Luis

Alfresco Behaviours and Policies


The power of Alfresco Behaviours and Policies

Hi all, back with another post after some busy months in the field during which I just could not find time to share a post or two. Since I truly believe that technology know-how is meant to be shared, this time I will speak about a very powerful feature of Alfresco: Alfresco behaviours and policies.

I believe that an effective ECM project should be focused more on how creative we are while interacting with the technology than on the technology itself. It may sound weird at first, but the idea behind that thought is “Trust the technology. Use it creatively”. This invites us to redirect most of our implementation focus to how creatively and effectively we use the technology while implementing our business requirements and facing our project challenges.

Because the Alfresco technology was designed with extensibility and integration in mind, there are lots of different ways to reach the same business goals. Choosing which component to use to implement a specific goal can be hard. In my years of consulting practice, I’ve seen lots of “killing flies with machine guns” scenarios that could have been implemented with a simpler (sometimes much cheaper) approach. When I joined Alfresco, after a deep dive into the technology, I felt like a kid at a party with a huge table full of delicious flavours who just can’t make up his mind about what to eat.

If you think about it, the success factors of a project are not just “how good and optimised is my code, how fast are my servers performing, how effective are my processes”. With technologies such as Alfresco we must also factor in “how creative and effective were my choices inside my chosen technology”. I could write a very long post on this topic alone, but actually you are reading a post about Alfresco behaviours and policies, a very powerful Alfresco feature that is sometimes forgotten or not considered in important implementation decisions.

Imagine a business requirement defining that specific mime types (such as, for example, big video files and audio files) should be automatically stored on a different (cheaper) disk than all the other content. How can we accomplish this? Hopefully I will be able to explain it to you (and provide you with the code for it) in this post.

Alfresco allows you to fire automated actions over content in its repository. There are many ways of automating those actions on content (rules, scheduled tasks, policies…). For this post we will focus only on behaviours, which are pieces of business logic bound to repository policies and events.


There is a set of policies that are called from the Alfresco services. For example, the policies available in the NodeService are listed in the table below. Note the self-explanatory inner interface names that help deduce the events that trigger them.

Interface: NodeServicePolicies

Inner Interfaces:

  • BeforeCreateStorePolicy
  • OnCreateStorePolicy
  • BeforeCreateNodePolicy
  • OnCreateNodePolicy
  • BeforeMoveNodePolicy
  • OnMoveNodePolicy
  • BeforeUpdateNodePolicy
  • OnUpdateNodePolicy
  • OnUpdatePropertiesPolicy
  • BeforeDeleteNodePolicy
  • OnDeleteNodePolicy
  • BeforeAddAspectPolicy
  • OnAddAspectPolicy
  • BeforeRemoveAspectPolicy
  • OnRemoveAspectPolicy
  • OnRestoreNodePolicy
  • BeforeCreateNodeAssociationPolicy
  • OnCreateNodeAssociationPolicy
  • OnCreateChildAssociationPolicy
  • BeforeDeleteChildAssociationPolicy
  • OnDeleteChildAssociationPolicy
  • OnCreateAssociationPolicy
  • OnDeleteAssociationPolicy
  • BeforeSetNodeTypePolicy
  • OnSetNodeTypePolicy

There are also policies for the ContentService, CopyService, VersionService and some others. If you search the Alfresco source code using the “*Policies” pattern you will also find policies like CheckOutCheckInServicePolicies, LockServicePolicies, NodeServicePolicies, TransferServicePolicies, StoreSelectorPolicies, AsynchronousActionExecutionQueuePolicies and RecordsManagementPolicies.

An Alfresco behaviour is simply a Java class that implements one of the policy interfaces. One advantage of using behaviours over rules is that behaviours are applied globally to the repository, while rules can be disabled by configuration (such as in bulk import scenarios). In comparison to a scheduled task, a behaviour is applied in real time, while a scheduled task executes at a configured timestamp.

In a nutshell :

  • The Alfresco repository lets you inject behaviour into your content.
  • Custom behaviours serve as method handlers for node events (policies); they can make the repository react to changes
  • Behaviours can be bound to an event (policy) for a particular class (type, aspect, association)
  • Policies can extend Alfresco beyond content models and make extensions smarter by encapsulating features and business logic


At the end of this post I will provide deeper technical details on policies and behaviours, but for now let’s focus on a practical example that you can actually use on your projects.

A practical example

Let’s get back to our business requirement that says that specific mime types (such as, for example, big video files and audio files) should be stored on a different (cheaper) disk than all the other content.

Step 1 – The content store selector facade

Since we will have more than one content store, we will make use of the well-known content store selector facade. Alfresco manages the storage of binaries through one or more content stores, each of which is associated with a single location on a file system accessible to Alfresco, e.g. //data/alfresco_content. The Alfresco content store selector allows content to be directed to specific physical stores based upon the appropriate criteria (folder rules or policies).

Full documentation of the Content Store Selector can be found below with the appropriate configuration examples: http://docs.alfresco.com/4.2/concepts/store-manage-content.html

1.1 – Creating a new content store

Let’s start by creating a new content store for the media files. Create a new directory under your <alf_data> directory, or mount a file system that you want to use for the media files. We will call this store mediaStore. Now we need to make Alfresco aware of the new store by adding some configuration.

In <tomcat_dir>/shared/classes/alfresco/extension create a new Spring context file and name it content-store-selector-context.xml.
Paste the following configuration XML:
<?xml version='1.0' encoding='UTF-8'?>
 <!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
 <beans>
 <bean id="mediaSharedFileContentStore" class="org.alfresco.repo.content.filestore.FileContentStore">
 <constructor-arg>
 <value><PATH_TO_YOUR_MEDIA_STORE></value>
 </constructor-arg>
 </bean>
 <bean id="storeSelectorContentStore" parent="storeSelectorContentStoreBase">
 <property name="defaultStoreName">
 <value>default</value>
 </property>
 <property name="storesByName">
 <map>
 <entry key="default">
 <ref bean="fileContentStore" />
 </entry>
 <entry key="mediastore">
 <ref bean="mediaSharedFileContentStore" />
 </entry>
 </map>
 </property>
 </bean>
 <!-- Point the ContentService to the 'selector' store -->
 <bean id="contentService" parent="baseContentService">
 <property name="store">
 <ref bean="storeSelectorContentStore" />
 </property>
 </bean>
 <!-- Add the other stores to the list of stores for cleaning -->
 <bean id="eagerContentStoreCleaner" class="org.alfresco.repo.content.cleanup.EagerContentStoreCleaner" init-method="init">
 <property name="eagerOrphanCleanup" >
 <value>${system.content.eagerOrphanCleanup}</value>
 </property>
 <property name="stores" >
 <list>
 <ref bean="fileContentStore" />
 <ref bean="mediaSharedFileContentStore" />
 </list>
 </property>
 <property name="listeners" >
 <ref bean="deletedContentBackupListeners" />
 </property>
 </bean>
 </beans>

1.2 – Configuration of the cm:storeSelector aspect

Now we need to make the Alfresco web clients aware of the multiple content stores; to do that we need to configure the cm:storeSelector aspect.
In <tomcat_dir>/shared/classes/alfresco/web-extension rename the Spring context file web-client-config-custom.xml.sample to web-client-config-custom.xml and configure the cm:storeSelector aspect as follows:

<!-- Configuring in the cm:storeSelector aspect -->
<config evaluator="aspect-name" condition="cm:storeSelector">
 <property-sheet>
 <show-property name="cm:storeName" component-generator="StoreSelectorGenerator" />
 </property-sheet>
</config>
<config evaluator="string-compare" condition="Action Wizards">
 <aspects>
 <aspect name="cm:storeSelector"/>
 </aspects>
</config>

Next we need to merge the following XML snippet into our share-config-custom.xml file to make the content store selector aspect visible in Share.

<!-- Configuring in the cm:storeSelector aspect -->
 <config evaluator="node-type" condition="cm:content">
 <forms>
 <form>
 <field-visibility>
 <!-- aspect: cm:storeSelector -->
 <show id="cm:storeName" />
 </field-visibility>
 <appearance>
 <!-- Store Selector -->
 <field id="cm:storeName" label="Store Name" description="Content Store Name" />
 </appearance>
 </form>
 </forms>
 </config>
 <config evaluator="string-compare" condition="DocumentLibrary" replace="true">
 <aspects>
 <!-- Aspects that a user can see -->
 <visible>
 <aspect name="cm:storeSelector" />
 </visible>
 </aspects>
 </config>

1.3 – Some Simple ways of using the new content store

The new content store is selected using the cm:storeName property. The cm:storeName property can be set in a number of ways:

  • Manually, by exposing this property so its value can be set by either Explorer or Share
  • Running a script action that sets the cm:storeName property value within the script
  • Using a rule that runs a script action to set the property
  • Using a Behaviour that automates the choice of the store based on the mime-type of the content (we will see how during this post)

The default behaviour is as follows:
• When the cm:storeSelector aspect is not present or is removed, the content is copied to a new location in the ‘default’ store
• When the cm:storeSelector aspect is added or changed, the content is copied to the named store
• Under normal circumstances, a trail of content will be left in the stores, just as it would be if the content were being modified. The normal processes to clean up the orphaned content will be followed.

To automate the store classification we can write a simple script in JavaScript and call it action_mediastore.js. The script contents would be:

var props = new Array(1);
props["cm:storeName"] = "mediastore";
document.addAspect("cm:storeSelector", props);
document.save();

We would then save the script in Data Dictionary/Scripts. Note that the script above is adding the storeSelector aspect and assigning a value (in this case mediastore) to the property.
Now we can execute an action over any file or folder and select “Execute Script”. We then select our script, action_mediastore.js.

Step 2 – Coding the new behaviour

We will build a custom content policy (behaviour) that, depending on the document’s mime type, will apply the content store selector aspect to it and choose the appropriate content store.

Using the Alfresco SDK, start a new repository AMP project (I will focus on how to use the Alfresco SDK in a different post).

The first thing we need to do is create a behaviour bound to the OnContentUpdatePolicy. Note that metadata (and mime-type) detection is post-commit; this is why we need to use the OnContentUpdatePolicy, as the method ContentServicePolicies.OnContentUpdatePolicy is fired *after* the Tika detection of the mime type. We will be using onContentUpdate with newContent = true for our use case.

2.1 – Behaviour class implementing OnContentUpdatePolicy

Our class definition will be as follows :

public class SetStoreByMimeTypeBehaviour extends TransactionListenerAdapter
 implements ContentServicePolicies.OnContentUpdatePolicy {

Note that we are implementing one of the ContentService policies, in this case the OnContentUpdatePolicy.

Next we define the properties that we will inject via Spring .

 private PolicyComponent policyComponent;

 private ServiceRegistry serviceRegistry;

 private Map<String, String> mimeToStoreMap;

We need to provide our class with the corresponding getters and setters for the properties that will be injected.

 

public PolicyComponent getPolicyComponent() {
 return policyComponent;
 }

 public void setPolicyComponent(PolicyComponent policyComponent) {
 this.policyComponent = policyComponent;
 }

 public ServiceRegistry getServiceRegistry() {
 return serviceRegistry;
 }

 public void setServiceRegistry(ServiceRegistry serviceRegistry) {
 this.serviceRegistry = serviceRegistry;
 }

 public void setMimeToStoreTypeMap(Map<String, String> mimeToModelTypeMap) {
 this.mimeToStoreMap = mimeToModelTypeMap;
 }

 public Map<String, String> getMimeToStoreTypeMap() {
 return mimeToStoreMap;
 }

The init method is one of the methods that we need to implement for the behaviour. This method initializes the behaviour and binds the class to the chosen policy. In this case we are doing it for all content nodes (ContentModel.TYPE_CONTENT = cm:content).

Note the important NotificationFrequency.FIRST_EVENT. Behaviours can be defined with a notification frequency – “every event” (default), “first event” or “transaction commit”. In this case, we want the behaviour to fire only on the first event. Consider that during a given transaction, certain policies may fire multiple times (i.e. “every event”).

 public void init() {
 if (log().isDebugEnabled()) {
 log().debug("Initializing Behavior");
 }
 this.onContentUpdate = new JavaBehaviour(this, "onContentUpdate", NotificationFrequency.FIRST_EVENT);
 this.policyComponent.bindClassBehaviour(QNAME_ONCONTENTUPDATE, ContentModel.TYPE_CONTENT, onContentUpdate);
 }

Last, but not least, we need to implement the onContentUpdate method to hold our logic. We get the mime type of the node and assign the content to a specific store depending on it. Note that the store name comes from the mapping implemented in the Spring bean configuration (explained in the next section).

 @Override
 public void onContentUpdate(NodeRef nodeRef, boolean newContent) {
 if (log().isDebugEnabled()) {
 log().debug("onContentUpdate, new[" + newContent + "]");
 }

 NodeService nodeService = serviceRegistry.getNodeService();
 ContentData contentData = (ContentData) nodeService.getProperty(nodeRef, ContentModel.PROP_CONTENT);
 String nodeMimeType = contentData.getMimetype();
 log().debug("nodeMimeType is " + nodeMimeType);

 QName storeName = getQNameMap().get(nodeMimeType);

 if (storeName != null) {
 log().debug("storeName is " + storeName.toString());
 String name = storeName.toString().substring(2,storeName.toString().length());
 log().debug("Stripped storeName is " + name);
 // add the aspect
 Map storeSelectorProps = new HashMap(1, 1.0f);
 storeSelectorProps.put(QName.createQName(NamespaceService.CONTENT_MODEL_1_0_URI,"storeName"), name);
 nodeService.addAspect(nodeRef, QName.createQName(NamespaceService.CONTENT_MODEL_1_0_URI, "storeSelector"), storeSelectorProps);

 // Extract meta-data here because it doesn't happen automatically when imported through FTP (for example)
 ActionService actionService = serviceRegistry.getActionService();
 Action extractMeta = actionService.createAction(ContentMetadataExtracter.EXECUTOR_NAME);
 actionService.executeAction(extractMeta, nodeRef);
 }
 else {
 log().debug("No specific store configured for mimetype [" + nodeMimeType + "]");
 }
 }

The full source code for our class is the following:

package org.alfresco.consulting.behaviours.mimetype;

import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;

import org.alfresco.model.ContentModel;
import org.alfresco.repo.action.executer.ContentMetadataExtracter;
import org.alfresco.repo.content.ContentServicePolicies;
import org.alfresco.repo.policy.Behaviour;
import org.alfresco.repo.policy.Behaviour.NotificationFrequency;
import org.alfresco.repo.policy.JavaBehaviour;
import org.alfresco.repo.policy.PolicyComponent;
import org.alfresco.repo.transaction.TransactionListenerAdapter;
import org.alfresco.service.ServiceRegistry;
import org.alfresco.service.cmr.action.Action;
import org.alfresco.service.cmr.action.ActionService;
import org.alfresco.service.cmr.repository.ContentData;
import org.alfresco.service.cmr.repository.NodeRef;
import org.alfresco.service.cmr.repository.NodeService;
import org.alfresco.service.namespace.NamespaceService;
import org.alfresco.service.namespace.QName;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

/**
 * Behavior that casts the node type to the appropriate store
 * based on the applied mime type.
 * @author Luis Cabaceira
 *
 */
public class SetStoreByMimeTypeBehaviour extends TransactionListenerAdapter
 implements ContentServicePolicies.OnContentUpdatePolicy {

 private static final QName QNAME_ONCONTENTUPDATE = QName.createQName(NamespaceService.ALFRESCO_URI, "onContentUpdate");

 private Behaviour onContentUpdate;

 private PolicyComponent policyComponent;

 private ServiceRegistry serviceRegistry;

 private Map<String, String> mimeToStoreMap;

 private Map<String, QName> qnameMap;

 public void init() {
 if (log().isDebugEnabled()) {
 log().debug("Initializing Behavior");
 }
 this.onContentUpdate = new JavaBehaviour(this, "onContentUpdate", NotificationFrequency.FIRST_EVENT);
 this.policyComponent.bindClassBehaviour(QNAME_ONCONTENTUPDATE, ContentModel.TYPE_CONTENT, onContentUpdate);
 }

 /*
 * (non-Javadoc)
 * @see org.alfresco.repo.content.ContentServicePolicies.OnContentUpdatePolicy#onContentUpdate(org.alfresco.service.cmr.repository.NodeRef, boolean)
 */
 @Override
 public void onContentUpdate(NodeRef nodeRef, boolean newContent) {
 if (log().isDebugEnabled()) {
 log().debug("onContentUpdate, new[" + newContent + "]");
 }

 NodeService nodeService = serviceRegistry.getNodeService();
 ContentData contentData = (ContentData) nodeService.getProperty(nodeRef, ContentModel.PROP_CONTENT);
 String nodeMimeType = contentData.getMimetype();
 log().debug("nodeMimeType is " + nodeMimeType);

 QName storeName = getQNameMap().get(nodeMimeType);

 if (storeName != null) {
 log().debug("storeName is " + storeName.toString());
 String name = storeName.toString().substring(2,storeName.toString().length());
 log().debug("Stripped storeName is " + name);
 // add the aspect
 Map storeSelectorProps = new HashMap(1, 1.0f);
 storeSelectorProps.put(QName.createQName(NamespaceService.CONTENT_MODEL_1_0_URI,"storeName"), name);
 nodeService.addAspect(nodeRef, QName.createQName(NamespaceService.CONTENT_MODEL_1_0_URI, "storeSelector"), storeSelectorProps);

 // Extract meta-data here because it doesn't happen automatically when imported through FTP (for example)
 ActionService actionService = serviceRegistry.getActionService();
 Action extractMeta = actionService.createAction(ContentMetadataExtracter.EXECUTOR_NAME);
 actionService.executeAction(extractMeta, nodeRef);
 }
 else {
 log().debug("No specific store configured for mimetype [" + nodeMimeType + "]");
 }
 }

 /**
 *
 * @return
 */
 private Map<String, QName> getQNameMap() {
 if (qnameMap == null) {
 qnameMap = new HashMap<String, QName>();
 // Pre-resolve QNames...
 for (Entry<String,String> e : mimeToStoreMap.entrySet()) {
 QName qname = this.qnameFromMimetype(e.getKey());
 if (qname != null) {
 qnameMap.put(e.getKey(), qname);
 }
 }
 }
 return qnameMap;
 }

 /**
 *
 * @param mimeType
 * @return
 */
 private QName qnameFromMimetype(String mimeType) {
 QName qname = null;

 String qNameStr = mimeToStoreMap.get(mimeType);
 qname = QName.createQName(qNameStr, serviceRegistry.getNamespaceService());
 return qname;
 }

 public PolicyComponent getPolicyComponent() {
 return policyComponent;
 }

 public void setPolicyComponent(PolicyComponent policyComponent) {
 this.policyComponent = policyComponent;
 }

 public ServiceRegistry getServiceRegistry() {
 return serviceRegistry;
 }

 public void setServiceRegistry(ServiceRegistry serviceRegistry) {
 this.serviceRegistry = serviceRegistry;
 }

 public void setMimeToStoreTypeMap(Map<String, String> mimeToModelTypeMap) {
 this.mimeToStoreMap = mimeToModelTypeMap;
 }

 public Map<String, String> getMimeToStoreTypeMap() {
 return mimeToStoreMap;
 }

 protected Log log() {
 return LogFactory.getLog(this.getClass());
 }

}

2.2 – Registering the behaviour with Spring

The next step is to register our behaviour with Spring. For that we will need a context file (service-context.xml) that registers our bean and holds the mapping between the mime type of the content and the corresponding store. Note the mapping of the several video formats to the new mediaStore.

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>

<beans>
 <!-- -->
 <!-- SetStoreByMimeTypeBehaviour -->
 <!-- -->
 <bean id="mime-based-store-selector-behavior"
 class="org.alfresco.consulting.behaviours.mimetype.SetStoreByMimeTypeBehaviour"
 init-method="init" depends-on="dictionaryBootstrap">
 <property name="policyComponent" ref="policyComponent" />
 <property name="serviceRegistry" ref="ServiceRegistry" />
 <!-- stores to mimetype map -->
 <property name="mimeToStoreTypeMap">
 <map>
 <entry key="video/mpeg"><value>mediaStore</value></entry>
 <entry key="audio/mpeg"><value>mediaStore</value></entry>
 <entry key="audio/mp4"><value>mediaStore</value></entry>
 <entry key="video/mp4"><value>mediaStore</value></entry>
 <entry key="video/x-m4v"><value>mediaStore</value></entry>
 <entry key="video/mpeg2"><value>mediaStore</value></entry>
 <entry key="video/mp2t"><value>mediaStore</value></entry>
 <entry key="video/quicktime"><value>mediaStore</value></entry>
 <entry key="video/3gpp"><value>mediaStore</value></entry>
 <entry key="video/3gpp2"><value>mediaStore</value></entry>
 <entry key="video/x-sgi-movie"><value>mediaStore</value></entry>
 </map>
 </property>
 </bean>
</beans>

2.3 – Deploy and Test

Package your class and your context file into an AMP file (using the Alfresco SDK) and deploy it to your repository with the apply-amps.sh script (alfresco-mmt.jar).

You can now test by uploading video or audio files and verifying that they are being stored in your new mediaStore.

Summary (Technical deep dive) on Policies and Behaviours

We’ve seen that policies provide hook points to which we can bind behaviours, for events based on a class or an association.

Behaviours are (policy) handlers that execute specific business logic; they can be implemented in Java and/or JavaScript. Behaviours can be bound to a type or an aspect.

org.alfresco.repo.policy.JavaBehaviour
JavaBehaviour(Object instance, String method, NotificationFrequency frequency)

org.alfresco.repo.jscript.ScriptBehaviour
ScriptBehaviour(ServiceRegistry serviceRegistry, ScriptLocation location, NotificationFrequency frequency)


We can have several types of Policies
  • ClassPolicy (type or aspect)
  • AssociationPolicy (peer or parent-child)
  • PropertyPolicy (not used)


There are different types of bindings for behaviours:

  • Service – called every time
  • Class – most common, bound to a type or aspect
  • Association – bound to an association, useful for adding smarts to custom hierarchies
  • Properties – bound to a property, too granular

Let’s take a look at a register and invoke pattern for Behaviour components

public interface NodeServicePolicies
{
 public interface OnAddAspectPolicy extends ClassPolicy
 {
 public static final QName QNAME = QName.createQName(NamespaceService.ALFRESCO_URI, "onAddAspect");

 // Called after an <b>aspect</b> has been added to a node
 public void onAddAspect(NodeRef nodeRef, QName aspectTypeQName);
 }
}

public abstract class AbstractNodeServiceImpl implements NodeService
{
 // note: policyComponent is injected … (not shown here)

 public void init()
 {
 // Register the policy
 onAddAspectDelegate = policyComponent.registerClassPolicy
 (NodeServicePolicies.OnAddAspectPolicy.class);
 }

 protected void invokeOnAddAspect(NodeRef nodeRef, QName aspectTypeQName)
 {
 NodeServicePolicies.OnAddAspectPolicy policy = onAddAspectDelegate.get(nodeRef, aspectTypeQName);
 policy.onAddAspect(nodeRef, aspectTypeQName);
 }
}

Build and Implement pattern for Behaviour components

public class XyzAspect implements NodeServicePolicies.OnAddAspectPolicy, ...
{
 // note: policyComponent is injected … (not shown here)

 public void init()
 {
 // bind to the policy
 policyComponent.bindClassBehaviour(
 OnAddAspectPolicy.QNAME,
 ContentModel.ASPECT_XYZ,
 new JavaBehaviour(this, "onAddAspect”,
 Behaviour.NotificationFrequency.TRANSACTION_COMMIT));
 }

 public void onAddAspect(NodeRef nodeRef, QName aspectTypeQName)
 {
 // implement behaviour here … (for when aspect XYZ is added)
 }
}

Conclusion

Alfresco behaviours and policies are very powerful features that can make your extensions very smart. I hope you enjoyed this post and that you can make use of its contents.

Stay tuned for more posts with my field experiences,

One Love,

Luis

Application Lifecycle Management Methodology for Alfresco



In Wikipedia, Application Lifecycle Management (ALM) is defined as :

“The marriage of business management to software engineering made possible by tools that facilitate and integrate requirements management, architecture, coding, testing, tracking, and release management”

Starting a development effort with the appropriate source control mechanisms and release methodology can save you hundreds of painful hours when managing releases and source code. Adopting a smart, reliable and robust application lifecycle and release process can exceed your expectations and actually be the foundation of your project’s success.

PSG – Alfresco Sdk Rock and Roll with Jenkins


Following is a list of the “mortal sins” of application lifecycle management that we seek to avoid:

– Manual changes in production environments
– Manual error prone testing procedures and stressful UAT phases
– Multiple development standards and unmanaged versioning policies

The main commandments of Application Lifecycle management.

  • Identify and respect your release
  • If it’s not tested (automatically) it’s not working
  • If it’s not documented it doesn’t exist
  • Controlled integration is possible and should not limit business improvement
  • Centralize common configuration while leaving projects enough flexibility for special cases

The PSG Methodology

PSG stands for “Plain Simple Goals” and it’s an application lifecycle and release management methodology aimed at Alfresco-based installations. It uses the Alfresco Maven SDK and Jenkins as its foundations, providing a methodology for Alfresco development, release management and application lifecycle management.

The main goal of this methodology is to provide a reproducible and scalable way to manage application build, test, release, maintenance and integration policies.

My invitation to you, in this particular post, is not only to read the post but to actually try the project, following the easy and exact step-by-step instructions along the way. In return you will get:

  • Alfresco Development and Test Infra-structure
  • A Rapid and Smart Alfresco Development methodology
  • A Build Infra-structure
  • A reliable Release Process
  • A robust Alfresco Application lifecycle management approach

Technically speaking you will end up with :

  • An in-memory H2 database and an in-memory application server (Tomcat)
  • An extension module for the Alfresco repository that creates an Alfresco Module Package (.amp)
  • An extension module for Alfresco Share that creates an Alfresco Module Package (.amp)
  • A running instance of the Alfresco repository with your overrides and your .amp extension deployed and tested
  • A running instance of the Alfresco Share application with your overrides and your .amp extension deployed and tested

Let’s start then. PSG is a project hosted on my personal GitHub and it contains the technical foundations of this methodology.

Step 0 – Download your working copy of the foundation

Start by clicking here to download your copy of the “Plain Simple Goals” methodology. Unzip the contents to a place on your computer. This will be your development environment home, so take note of this path. You will need to come back here later.

Step 1 – Pre-Requirements

1 – Software requirements

If you don’t have the Java JDK, Maven and Jenkins installed on your computer/server, now is the time to do so. Visit the URLs below to install the prerequisites.

About Maven

Maven’s primary goal is to allow a developer to comprehend the complete state of a development effort in the shortest period of time. In order to attain this goal there are several areas of concern that Maven attempts to deal with:

  • Making the build process easy
  • Providing a uniform build system
  • Providing quality project information
  • Providing guidelines for best practices development
  • Allowing transparent migration to new features

About Jenkins

Jenkins is an award-winning application that monitors executions of repeated jobs, such as building a software project or jobs run by cron. Among those things, current Jenkins focuses on the following two jobs:

  • Building/testing software projects continuously, just like CruiseControl or DamageControl. In a nutshell, Jenkins provides an easy-to-use so-called continuous integration system, making it easier for developers to integrate changes to the project, and making it easier for users to obtain a fresh build. The automated, continuous build increases the productivity.
  • Monitoring executions of externally-run jobs, such as cron jobs and procmail jobs, even those that are run on a remote machine. For example, with cron, all you receive is regular e-mails that capture the output, and it is up to you to look at them diligently and notice when it broke. Jenkins keeps those outputs and makes it easy for you to notice when something is wrong.

About Java 

:) Just kidding

2 – Credentials for Enterprise

If you wish to work with Alfresco Enterprise you need to have login credentials for the Alfresco Nexus repository (artifacts.alfresco.com). You can request login credentials on the Alfresco support portal. Alternatively you can just build and run the open source version, Alfresco Community.

3 – Configuration requirements to build Alfresco Enterprise/Community

During the installation of Maven, a new file named settings.xml was created. This file is the entry point to your local Maven settings configuration, including the remote Maven repositories. Edit your settings.xml file and update the servers section, including the Alfresco server IDs and your credentials.

Note that the root pom.xml references 2 different repositories: alfresco-private and alfresco-private-snapshots. If you are building Community you should call those alfresco-public and alfresco-public-snapshots. Note that the id of each repository must match a server id in your settings.xml (where you specify your credentials for that server).

Section from the root pom.xml

To build alfresco enterprise

...
   <repository>
     <id>alfresco-private</id>
     <url>https://artifacts.alfresco.com/nexus/content/groups/private</url>
   </repository>
   <repository>
     <id>alfresco-private-snapshots</id>
     <url>https://artifacts.alfresco.com/nexus/content/groups/private-snapshots</url>
   </repository>
 ...

To build alfresco Community

...
   <repository>
     <id>alfresco-public</id>
     <url>https://artifacts.alfresco.com/nexus/content/groups/public</url>
   </repository>
   <repository>
     <id>alfresco-public-snapshots</id>
     <url>https://artifacts.alfresco.com/nexus/content/groups/public-snapshots</url>
   </repository>
 ...

Section from your local settings.xml maven configuration file (Enterprise)

...
        <server>
            <id>alfresco-private</id>
            <username>YOUR_USERNAME</username>
            <password>YOUR_PASSWORD</password>
        </server>
        <server>
            <id>alfresco-private-snapshots</id>
            <username>YOUR_USERNAME</username>
            <password>YOUR_PASSWORD</password>
        </server>
 ...

Section from your local settings.xml maven configuration file (Community)

...
        <server>
            <id>alfresco-public</id>
            <username>YOUR_USERNAME</username>
            <password>YOUR_PASSWORD</password>
        </server>
        <server>
            <id>alfresco-public-snapshots</id>
            <username>YOUR_USERNAME</username>
            <password>YOUR_PASSWORD</password>
        </server>
 ...

Step 2 – Source Control Mechanism and Distribution Management

Source Control

The project uses the Maven SCM plugin. The SCM plugin offers vendor-independent access to common SCM commands by offering a set of command mappings for the configured SCM. Each command is implemented as a goal.

Configure the main pom.xml with your own source control mechanism; the example in the download is configured with my GitHub account. If you don’t have one yet, you can create your own free GitHub account and use it. If you use Subversion you can have something like:

...
<scm>
  <connection>scm:svn:http://<YourRepo>/svn_repo/trunk</connection>
  <developerConnection>scm:svn:https://<YourRepo>/svn_repo/trunk</developerConnection>
  <url>http://<YourRepo>/view.cvs</url>
</scm>

 ...

Distribution Management
We need to configure the repository that will hold the artifacts of the releases. That is configured using the Maven deploy plugin. In the PSG project I have my own public CloudBees repository configured; you can create your own free CloudBees repository and update the pom.xml accordingly so that your releases are stored in your repository. Since during this post we will install an instance of Artifactory (see the bottom of this post for installation instructions), we should use it to hold our release artifacts. Configure it as follows:

...
<distributionManagement>
  <repository>
    <id><your-company>-private-release-repository</id>
    <url>dav:https://<YOUR_CI_SERVER_IP>:8080/artifactory/<project_name>/release/</url>
  </repository>
  <snapshotRepository>
   <id><your-company>-private-snapshot-repository</id>
   <url>dav:https://<YOUR_CI_SERVER_IP>:8080/artifactory/<project_name>/snapshot/</url>
  </snapshotRepository>
</distributionManagement>
 ...

Note that the repository id configured on the pom.xml must match a server id on your local maven settings.xml file.

Section from your local settings.xml maven configuration file

...
<server>
        <id><your-company>-private-snapshot-repository</id>
        <username>YOUR_PRIVATE_MAVEN_REPOSITORY_USERNAME</username>
        <password>YOUR_PRIVATE_MAVEN_REPOSITORY_PASSWORD</password>
        <filePermissions>664</filePermissions>
        <directoryPermissions>775</directoryPermissions>
    </server>
    <server>
        <id><your-company>-private-release-repository</id>
        <username>YOUR_PRIVATE_MAVEN_REPOSITORY_USERNAME</username>
        <password>YOUR_PRIVATE_MAVEN_REPOSITORY_PASSWORD</password>
        <filePermissions>664</filePermissions>
        <directoryPermissions>775</directoryPermissions>
    </server>
...

This will enable you to perform releases with :

  • Prepare release : mvn release:prepare
  • Perform release : mvn release:perform
  • Prepare and Perform release : mvn release:prepare release:perform

Step 3 – Finally , let run it ! 

In the project’s root folder you have the heart of the project, the parent pom.xml. This is the file that aggregates your full build, including all the modules and overlays to the different applications, and generates deployable artifacts ready for your release. Before we run it for the first time, let’s review once more what is included in this project build:

  • Applications (apps folder)
  • Alfresco Module Packages (amps folder)

To run the project for the first time, issue the following Maven command from the root of your development environment.

# mvn clean install -Prun

This will build and run all the modules. When the Maven process finishes, open a browser and point it to your Alfresco and Share applications (typically http://localhost:8080/alfresco and http://localhost:8080/share, depending on the ports configured in the project).

Step 4 – Jenkins Integration – Build and Release Processes 

Now that we see what our build can do we will go a step further and we’ll perform the integration with Jenkins. This is where the automation fun begins.

Our goal is to automate the build and deployment process. The project artifacts are always built after every check-in of new source code, which means early warnings of broken builds.

Process Goals

  • Deploy to any environment at the push of a button
  • Revert to a previous deployment at the push of a button
  • Deploy automatically every night to the dev environment
  • Keep a log of each build

Our engine for the automated deployments and continuous integration (CI) will be Jenkins. Jenkins connects to your SCM (SVN, Git, …), downloads the source code and then builds your project artifacts. When the build has produced the required artifacts (<your>-alfresco.war, <your>-share.war), those can be deployed automatically to the different environments (e.g. DEV, TEST, PRODUCTION).

Step 1 – Install Jenkins and Artifactory

Let’s cover the installation of the required software on the continuous integration server. This should ideally be an independent server machine that will run:

• Jenkins Server
• Artifactory Repository

We will now cover the installation of those components on the designated CI server box (LINUX). The Jenkins server will act as an Automation tool that :

  • Monitors execution of repeated jobs
  • Allows for Continuous Integration
  • Test orchestration
  • Executes and Tests Releases
  • Rolls back and redeploys previous builds

The releases can be scheduled and run periodically in early development stages.

Jenkins will deploy remotely to any Alfresco environment and run remote integration tests. The release reports should be published on the site and recorded as part of the release documentation. Your tests should be self-contained and runnable in CI, and they must produce intelligible reports. Every development task must include the appropriate tests.

Jenkins installation on the CI server box (let’s hire a free butler :))


STEP 1)

sudo wget -O /etc/yum.repos.d/jenkins.repo http://pkg.jenkins-ci.org/redhat/jenkins.repo

STEP 2)

sudo rpm --import http://pkg.jenkins-ci.org/redhat/jenkins-ci.org.key

STEP 3)

sudo yum install jenkins

Jenkins Details

Jenkins will be launched as a daemon on startup. See /etc/init.d/jenkins for more details.
To start, stop or restart Jenkins you can use the following command:

sudo service jenkins start/stop/restart

The ‘jenkins’ user is created to run this service. If you change this to a different user via the config file, you must change the owner of

/var/log/jenkins
/var/lib/jenkins
/var/cache/jenkins

The Log file will be placed in /var/log/jenkins/jenkins.log. Check this file if you are troubleshooting Jenkins.

The configuration file /etc/sysconfig/jenkins captures the configuration parameters for the launch. By default, Jenkins listens on port 8080, but we can change this port to 9090 because the default 8080 may be taken by another Tomcat instance.

Note that the built-in firewall may have to be opened to access this port from other computers. (See http://www.cyberciti.biz/faq/disable-linux-firewall-under-centos-rhel-fedora/ for instructions how to disable the firewall permanently)

To test your Jenkins server, just type http://<server_ip>:9090 in your browser.

This concludes the installation of Jenkins; the official documentation is available at:

https://wiki.jenkins-ci.org/display/JENKINS/Installing+Jenkins+on+RedHat+distributions

Artifactory installation on CI server box


Start by going to http://www.jfrog.com/open-source/#os-arti and downloading the Artifactory zip bundle. At the time of this post the latest version is available at http://bit.ly/Hqv9aj

Setup is really easy; we just need to follow the instructions available in their 1-minute setup video: http://www.jfrog.com/video/artifactory-1-min-setup/

After the installation is completed you can access artifactory by going to :

http://<server_ip>:8081

Source code control strategy

Your project should have

  • Standard trunk/branches/tags per project structure
  • Maintenance branches
  • Tagging per release
  • Build number based on SVN revision

This way you can have separate business projects (parties) running independently on separate SVN roots, whilst allowing the final binary product to be integrated into the main enterprise Alfresco instance.

Release Artifacts

We will be producing and storing 2 different types of artifacts in our release. The artifacts fall into 2 categories: deployment artifacts and storage-only artifacts.

Deployment artifacts: alfresco.war and share.war and server-configuration.alf

Storage-only artifacts: repo-extension-amp, share-extension-amp, custom-amp

The deployment artifacts can be deployed to any target environment either by the click of a button or by a Jenkins scheduled task. Note that the storage-only artifacts are contained within the deployment artifacts; this is achieved by the dependency management implicit in the Alfresco Maven SDK. The next diagram illustrates this.


The server-configuration.alf is a compressed package that follows a specific configuration structure and is part of every deployment. It contains environment-specific configuration for the target server: a set of configuration files that are specific to each target environment (Dev, Test, Prod), such as, for example, the alfresco-global.properties.

So, before we move on, we need to make sure this is working (using the Alfresco SDK to build the customised war files). Using the psg template you can easily configure this, so I will not go into the details of this process.

Unit Testing

A set of unit tests is included as part of every code release/deployment to check that code is working and continues to work as intended.

Automated tests must meet very specific objectives:

  • Every developer must be able to run the combined collection of all developers’ tests.
  • The continuous integration (CI) server must be able to run the entire suite of tests without any manual intervention.
  • The outcome of the tests must be unambiguous and repeatable.

Running these automated unit tests allows any developer to verify that their current changes do not break the existing code under test. The team leader or manager should insist that this happens; it is very important as it virtually eliminates accidental or unintended side-effect problems.
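In practice this boils down to a couple of Maven commands. A hedged example, assuming the standard Surefire setup that the Alfresco Maven SDK project structure gives you (the test class name is a hypothetical example):

mvn clean test                        # run every unit test in the module
mvn test -Dtest=MyCustomActionTest    # run a single test class while working on it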

There are 3 key objectives for developers to keep in mind when writing the unit tests:

  • Readability: Write test code that is easy to understand and communicates well.
  • Maintainability: Write tests that are robust and hold up well over time.
  • Automation: Write tests that require little setup and configuration (preferably none).

Integration Testing

Once unit-tested components are delivered, Jenkins should integrate them together. These integrated components are tested to weed out errors and bugs caused by the integration. This is a very important step in the development life cycle.

The goal is to catch the bugs that emerge when components developed by different programmers are brought together during the integration step. In most projects a dedicated testing team focuses on integration testing.

The integration team should be able to:

  • Step 1: Create a Test Plan
  • Step 2: Create Test Cases and Test Data
  • Step 3: If applicable create scripts to run test cases
  • Step 4: Once the components have been integrated execute the test cases
  • Step 5: Fix any bugs and re-test the code
  • Step 6: Repeat the test cycle until the components have been successfully integrated

To write an Integration Test Case you should describe exactly how the test should be carried out. The Integration test cases specifically focus on the flow of data/information/control from one component to the other.

Integration Test cases should focus on scenarios where one component is being called from another. Also the overall application functionality should be tested to make sure the app works when the different components are brought together.

The various Integration Test Cases will be executed as part of the build process.
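A hedged sketch of how Jenkins (or a developer) might trigger them, assuming the integration tests follow the Maven Failsafe naming convention (*IT classes bound to the integration-test phase; the class name is a hypothetical example):

mvn clean verify                      # unit tests, then integration tests, then verification
mvn verify -Dit.test=SiteCreationIT   # run a single integration test class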

The release process

Maven will compile the extension AMPs and add them to the corresponding .war files. We will have 3 main artifacts for deployment (alfresco.war, share.war and server-configuration.alf).

The remaining artifacts (.amps) will also be created as part of the release but will not be directly deployed; they are already part of the war artifacts via the Maven dependency management.

To execute part of the release process in Jenkins we are using the Maven Release Plugin.

This plugin is used to release a project with Maven, saving a lot of repetitive, manual work.

Releasing a project is done in two steps: prepare and perform.

Try to prepare a release with Maven by running the command: # mvn release:prepare
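If you want to rehearse the release before committing anything, the plugin supports a dry run. A minimal sketch:

mvn release:prepare -DdryRun=true     # simulates the version changes without committing or tagging
mvn release:clean                     # removes release.properties and the backup POMs
mvn release:prepare                   # the real run: commits, tags and bumps the versions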

So what happens when preparing a release (release:prepare) ?

Preparing a release goes through the following release phases:

  • Check that there are no uncommitted changes in the local sources
  • Check that there are no SNAPSHOT dependencies
  • Change the version in the POMs from x-SNAPSHOT to a new version
  • Transform the SCM information in the POM to include the final destination of the tag
  • Run the project tests against the modified POMs to confirm everything is in working order
  • Commit the modified POMs
  • Tag the code in the SCM with a new version name
  • Bump the version in the POMs to a new value y-SNAPSHOT
  • Commit the modified POMs

 

After a successful build, Maven has prepared your release by:

  1. Creating a new tag in your SCM with the release source code.
  2. Creating a new release version and locking the code for deployment.
  3. Automatically creating a new development version.

 

See more details on the prepare stage of a release in the official Maven documentation.

Performing a release (mvn release:perform)

After a successful prepare stage, you are now ready to perform your first release by running the command: # mvn release:perform
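When Jenkins runs this, you will want a non-interactive invocation. A hedged example of a single Jenkins build step that chains both goals in batch mode (SCM credentials are normally supplied by the Jenkins job configuration):

mvn -B release:prepare release:perform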

Performing a release runs the following release phases:

  • Checkout from an SCM URL with optional tag
  • Run the predefined Maven goals to release the project (by default, deploy site-deploy).
  • Upload the artifacts to your configured maven repository

See more details on the perform stage of a release in the official Maven documentation.

Deploy your artifacts

This is the last step of the release process; it’s where the release manager (impersonated by Jenkins) actually deploys the release to the target environment (DEV, QA, PROD). These are the steps that will be performed by Jenkins:

  1. Stops your application server (tomcat)
  2. Unzips the server-configuration.alf
  3. Updates server configuration.
  4. Copies the alfresco artifact war file (with your overlays) to your application server, replacing the existing copy.
  5. Copies the share artifact war file (with your overlays) to your application server, replacing the existing copy.
  6. Starts your application server (tomcat) with the new release

Note that we have configured (using Maven dependency management) both the alfresco and share overlay modules to build their target artifacts (war files) including your AMP extensions, meaning that the extensions are already installed in the war artifacts (thanks to the Alfresco SDK).

All these steps are automated in Jenkins as part of the release process; no human intervention is necessary to perform a release.

Automating your build and releases in Jenkins

After configuring Jenkins to automate the main release steps (release:prepare and release:perform), we still need to configure Jenkins to perform the final deployment. The goal is to completely eliminate human intervention in the deployment. We will be using an automated action in Jenkins that does the following (a sketch of such a script follows the list):

  • Stops the target Tomcat (DEV, PRE-PROD or PROD)
  • Downloads the artifacts from Artifactory with cURL
  • Updates the server configuration with server-configuration.alf
  • Copies alfresco.war and share.war with scp to <tomcatRoot>/webapps, replacing the existing versions
  • Starts the target Tomcat (DEV, PRE-PROD or PROD)
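Below is a minimal sketch of such an action as a shell build step. Every host name, path, repository name and version is a hypothetical example; adapt them to your environment and to the artifact coordinates you actually publish:

# stop the target Tomcat
ssh tomcat@dev-alfresco 'service tomcat stop'
# download the release artifacts from Artifactory
curl -O http://<server_ip>:8081/artifactory/libs-release-local/com/acme/alfresco/1.2.0/alfresco-1.2.0.war
curl -O http://<server_ip>:8081/artifactory/libs-release-local/com/acme/alfresco/1.2.0/share-1.2.0.war
curl -O http://<server_ip>:8081/artifactory/libs-release-local/com/acme/alfresco/1.2.0/server-configuration-1.2.0.alf
# push the wars and the server configuration to the target box
scp alfresco-1.2.0.war tomcat@dev-alfresco:/opt/tomcat/webapps/alfresco.war
scp share-1.2.0.war tomcat@dev-alfresco:/opt/tomcat/webapps/share.war
scp server-configuration-1.2.0.alf tomcat@dev-alfresco:/tmp/
ssh tomcat@dev-alfresco 'unzip -o /tmp/server-configuration-1.2.0.alf -d /opt/alfresco-config'
# start the target Tomcat with the new release
ssh tomcat@dev-alfresco 'service tomcat start'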

NOTE: On the first run of the new release process you should probably back up the existing alfresco.war and share.war, as those artifacts will not be in Artifactory yet.

Conclusion

In my opinion, a smart application lifecycle and release process is the foundation of any successful project. I hope you’ve enjoyed this post.

Stay tuned for more posts and advice on Alfresco and ECM.

Until then, One Love !

“We’re together and we share, that is what makes us Strong”
