Hi folks, this post follows my previous post about capacity planning and provides you with the tools (and a VMware image ready to run) for you to implement it.
I'd like to start with a huge thank you to Miguel Rodriguez (Alfresco Support Engineer). He is the creator of this monitoring solution and also the person responsible for setting up the VMware image with all the tools, scripts, etc. My hero list just got bigger; Miguel takes a place right after Spider-Man and the Silver Surfer.
Monitoring Alfresco with Open Source Tools
Monitoring your Alfresco architecture is a known best practice. It allows you to track and store all relevant system metrics and events, which can help you:
- Troubleshoot possible problems
- Verify system health
- Check user behavior
- Build a robust historical data warehouse for later analysis and capacity planning
This post explains a typical monitoring scenario for an Alfresco deployment, using only open source tools.
I'm proposing a fully open source stack of monitoring tools that together form the global monitoring solution. The solution makes use of the following products:
- ElasticSearch [http://www.elasticsearch.org/]
- Logstash [http://logstash.net/]
- Redis [http://logstash.net/docs/1.2.1/outputs/redis]
- Kibana [http://www.elasticsearch.org/overview/kibana/]
- Graphite (Grafana) [http://grafana.org/]
- JavaMelody [https://code.google.com/p/javamelody/]
- Icinga [http://www.icinga.org/]
The solution will monitor all layers of the application, producing valuable data on all critical aspects of the infrastructure. This allows for proactive system administration, as opposed to a reactive approach: you can predict problems before they happen and take the necessary measures to keep the system healthy on all layers.
I see this approach as both a monitoring and a capacity planning system: it provides “near” real-time information updates, customizable reporting, and a custom search mechanism over the collected data.
The diagram below shows how the different components of the solution integrate. Note that we centralize data from all nodes and the various layers of the application in a single location.
The sample architecture being monitored consists of a cluster of two Alfresco/Share nodes for serving user requests and two Alfresco/Solr nodes for indexing/searching content.
Consider the three major components of the monitoring solution:
- Logstash file tailing to monitor Alfresco log files, and Logstash command execution to monitor specific components, e.g. processes, memory, disk, Java stack traces, etc.
- JavaMelody to monitor applications running in a JVM and other system resources.
- Icinga to send JMX requests to the Alfresco servers.
Dedicated Monitoring Server Download
All software components of the monitoring server come pre-installed on a VMware image that we offer for free (in the open source spirit :)).
You can download your copy of this monitoring server at http://eu.dl.alfresco.com.s3.amazonaws.com/release/Support/AlfrescoMonitoringVirtualServer/v1.0/AlfrescoMonitoringVirtualServer-1.0.tar
This is the ElasticSearch server that collects all the logs from the various components of the application and hosts the graphical user interfaces (Kibana and Grafana) used to view the monitoring data.
About JavaMelody
JavaMelody is used to monitor Java or Java EE application servers in QA and production environments. It measures and calculates statistics on the real operation of an application, based on how users actually use it. It is very easy to integrate into most applications and is lightweight, with almost no impact on the target systems.
The tool is mainly based on request statistics and evolution charts; for that reason it is an important add-on to our benchmarking project, as it allows us to see in real time the evolution charts of the most important aspects of our application.
It includes summary charts showing the evolution over time of the following indicators:
- Number of executions, mean execution times and percentage of errors of HTTP requests, SQL requests, JSP pages or methods of business façades (if EJB3, Spring or Guice)
- Java memory
- Java CPU
- Number of user sessions
- Number of JDBC connections
These charts can be viewed on the current day, week, month, year or custom period.
You can find detailed information about JavaMelody at https://code.google.com/p/javamelody/
Installing JavaMelody
It's really easy to attach the JavaMelody monitor to the Alfresco applications (alfresco.war and share.war) and to every other web application deployed on your application server.
Step 1
Configure JavaMelody monitoring on the Alfresco Tomcat by copying itextpdf-5.5.2.jar, javamelody.jar and jrobin-1.5.9.1.jar to the Tomcat shared lib folder under <tomcat_install_dir>\shared\lib, or to your application server's global classloader location (if not Tomcat).
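On a Linux installation this can be as simple as the copy below (a minimal sketch; the <tomcat_install_dir> placeholder and the jar file names must match your actual installation, and Alfresco's bundled Tomcat typically already has shared/lib on its shared.loader path):

cp itextpdf-5.5.2.jar javamelody.jar jrobin-1.5.9.1.jar <tomcat_install_dir>/shared/lib/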
Step 2
Edit the global Tomcat web.xml file (e.g. D:\alfresco\tomcat\conf\web.xml on a Windows install) to enable JavaMelody monitoring on every application. Add the following filter:
<filter>
    <filter-name>monitoring</filter-name>
    <filter-class>net.bull.javamelody.MonitoringFilter</filter-class>
</filter>
<filter-mapping>
    <filter-name>monitoring</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>
<listener>
    <listener-class>net.bull.javamelody.SessionListener</listener-class>
</listener>
And that's about it. After restarting, you can access the monitoring of every application at http://<your_host>:<server_port>/<web-app-context>/monitoring, for example http://localhost:8080/alfresco/monitoring
Monitoring Stages Breakdown
Stage 1 – Data Capturing (Logstash)
We capture monitoring data using several procedures:
- Scheduled jobs (DB queries, Alfresco JMX bean queries, OS-level commands)
- Log indexing with Logstash. We use Logstash to collect logs, parse them, and send them to ElasticSearch to be stored for later use (for example, for searching); see the shipper sketch after this list.
- The Alfresco audit log (when configured) is also parsed and indexed by ElasticSearch, providing all the enabled audit statistics.
- Metrics with JavaMelody
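As an illustration, a minimal Logstash "shipper" configuration for tailing an Alfresco log and forwarding the events to the Redis broker could look like the sketch below. The log path, server name and Redis key are assumptions to adapt to your installation; the multiline filter is one common way to glue Java stack trace lines back onto their parent event:

input {
    file {
        # tail the main Alfresco/Tomcat log (example path)
        path => "/opt/alfresco/tomcat/logs/catalina.out"
        type => "alfresco"
    }
}
filter {
    # join stack trace lines (starting with whitespace) to the previous event
    multiline {
        pattern => "^\s"
        what => "previous"
    }
}
output {
    # ship the events to the Redis broker on the monitoring server
    redis {
        host => "monitoring-server"
        data_type => "list"
        key => "logstash"
    }
}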
Stage 2 – Monitoring Data Archiving (ElasticSearch)
In the diagram above we can see the data capturing flow using Logstash and ElasticSearch. Let's look at some details of each of the boxes in the diagram.
Logstash is a tool for managing events and logs. You can use it to collect logs, parse them, and send them to ElasticSearch to be stored for later use (for example, for searching).
Redis is a log data broker, receiving data from log "shippers" and handing it over to a log "indexer".
ElasticSearch is a distributed, RESTful, free, Lucene-powered search engine/server.
Kibana 3 is a tool for displaying and interacting with your data.
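On the monitoring server, a matching Logstash "indexer" pulls the events off Redis and writes them into ElasticSearch. A minimal sketch, assuming both services run locally and the Redis key matches the shipper above:

input {
    # read the events queued by the shippers
    redis {
        host => "127.0.0.1"
        data_type => "list"
        key => "logstash"
    }
}
output {
    # index the events into the local ElasticSearch server
    elasticsearch {
        host => "127.0.0.1"
    }
}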
Stage 3 – Trending and Analysis (Kibana, Grafana)
To analyze the data and the trends we install two different GUIs on the monitoring server (Kibana and Grafana).
Kibana allows us to check the indexed logs with their metadata and to troubleshoot specific log traces. It provides a very robust search mechanism on top of the ElasticSearch indexes, and it delivers strategic technical insights in near real time, with a global overview of all layers of the platform, from almost any type of structured or unstructured data source.
In the flow above we can see how the information and statistics get to Grafana.
Grafana is a beautiful dashboard for displaying various Graphite metrics through a web browser. It has enormous potential, and it's easy to set up and customize for different business needs.
Let's take a closer look at the remaining components in the flow diagram.
Statsd is a network daemon that listens for statistics, like counters and timers sent over UDP, and forwards them to Carbon.
Carbon accepts metrics over various protocols and caches them in RAM as they are received, flushing them to disk at an interval using the underlying Whisper library.
Whisper provides fast, reliable storage of numeric data over time.
Grafana is an easy to use and feature-rich Graphite dashboard.
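To make the flow concrete: any script can feed this pipeline by sending StatsD's plain-text protocol over UDP (port 8125 is the StatsD default; the metric name and value below are made-up examples):

# send a JVM heap gauge (in MB) to statsd over UDP
echo "alfresco.node1.jvm.heapmb:742|g" | nc -u -w1 monitoring-server 8125

StatsD aggregates the values and flushes them to Carbon, Whisper stores them on disk, and Grafana charts them.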
Stage 4 – Monitoring
We use scheduled commands and index the resulting data in ElasticSearch, checking the following monitoring information from the Alfresco and Solr servers:
- JVM Memory Usage
- Server Memory
- Alfresco CPU utilization
- Overall server CPU utilization
- Solr Indexing Information
- Number of documents on Alfresco “live” store
- Number of documents on Alfresco “archive” store
- Number of concurrent users on Alfresco repository
- Alfresco Database pool occupation
- Number of active sessions on Alfresco Share
- Number of active sessions on Alfresco Workdesk
- Number of busy tomcat threads
- Number of current tomcat threads
- Number of maximum tomcat threads
These checks can be extended at any time to monitor any other target relevant to your use case, as in the sketch below.
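As an illustration, a scheduled (cron) shell script in this style could capture the server load and index it into ElasticSearch. This is a sketch, not one of the actual scripts shipped on the image; the index and field names are assumptions:

#!/bin/bash
# read the 1-minute load average from the kernel
LOAD=$(awk '{print $1}' /proc/loadavg)
# index the sample into a daily ElasticSearch index (hypothetical index/type names)
curl -s -XPOST "http://localhost:9200/monitoring-$(date +%Y.%m.%d)/load" -d "{
    \"@timestamp\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\",
    \"host\": \"$(hostname)\",
    \"load_1m\": $LOAD
}"

Run it from cron, e.g. every minute: * * * * * /opt/monitoring/capture_load.sh (the script path is hypothetical).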
Stage 5 – Troubleshooting
While troubleshooting we use Kibana/Grafana and JavaMelody.
Kibana allows us to check the "indexed" logs with their metadata and verify exactly which classes are related to the problem, as well as the number of occurrences and the root of the exceptions.
Grafana shows us what/how/when server resources are being affected by the problem.
JavaMelody provides detailed information on crucial sections of the application. The goal of JavaMelody is to monitor Java or Java EE application servers in QA and production environments.
It produces graphs for Memory, CPU, HTTP Sessions, Threads, GC, JDBC Connections, SQL Hits, Open Files, Disk Space, Network I/O, Statistics for HTTP traffic, Statistics for SQL queries, Thread dumps, JMX Beans information and overall System Information. JavaMelody has a web interface for reporting on these statistics.
Using these three tools, troubleshooting a possible problem becomes a friendly task and boosts the speed of investigations that would normally take ages just to gather all the information needed to reach the root cause of the issue.
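Everything Kibana displays is also reachable directly through the ElasticSearch REST API, which is handy for quick checks from a shell. For example, to list the latest ERROR traces from the Logstash indexes (logstash-* is the Logstash default index pattern; the message field name is an assumption tied to your parsing rules):

curl -s "http://localhost:9200/logstash-*/_search?q=message:ERROR&size=5&pretty"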
Stage 6 – Notification and Reporting
We use Icinga to notify the delegated Alfresco administrator (by email) when there is a problem with the Alfresco system. Icinga is an enterprise-grade open source monitoring system that keeps watch over networks and resources, notifies the user of errors and recoveries, and generates performance data for reporting.
Icinga Web is highly dynamic and laid out as a dashboard with tabs, which allow the user to flip between the different views they need at any one time.
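Checks are declared with standard Nagios-style object definitions. As a sketch, an HTTP check against an Alfresco node could look like the following (the host name, port and URL are assumptions, and the generic-service template must exist in your Icinga configuration):

# alert when the Alfresco web application stops answering on port 8080
define service {
    use                     generic-service
    host_name               alfresco-node1
    service_description     Alfresco Tomcat HTTP
    check_command           check_http!-p 8080 -u /alfresco
}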
Stage 7 – Sizing Adjustments
Sizing remains a human action in the capacity and monitoring solution. By performing regular analysis of the monitoring/capacity planning data, we will know exactly when and how we need to scale our architecture.
The more data gets into ElasticSearch over the application life cycle, the more accurate the capacity predictions become, because they represent the “real” application usage during the defined period.
This plays a very important role when modeling and sizing the architecture for future business requirements.
7.1 – Peak Period Methodology
The peak period methodology is the most efficient way to implement a capacity planning strategy, as it allows you to analyze vital performance information while the system is under the most load/stress. In essence, the peak period methodology collects and analyzes data during a configurable peak period. This allows you to estimate the number of CPUs, the amount of memory and the number of cluster nodes required on the different layers of the application to support a given expected load.
The peak period may be an hour, a day, 15 minutes or any other period used to analyze the collected utilization statistics. Assumptions may be estimated based on business requirements or on specific benchmarks of a similar implementation.
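As a purely hypothetical example: if the busiest hour shows 200 concurrent users, each issuing on average 2 requests per minute, you are sizing for roughly 200 × 2 / 60 ≈ 6.7 requests per second at peak; comparing that figure with the CPU and memory headroom measured during the same window tells you how much extra capacity the next growth step requires.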
Your monitoring targets on an Alfresco installation
I've identified the following targets as candidates to participate in the monitoring system and have their data indexed and stored in ElasticSearch.
Database
- Transactions
- Number of Connections
- Slow Queries
- Query Plans
- Critical DM database queries (# documents of each mime type, …)
- Database server health (CPU, memory, IO, network)
- Database statistics integration
- Database sizing statistics (growth, etc.)
- Peak Period
Application Servers (Tomcats)
- Request and response times
- Access Logs (number of concurrent requests, number of concurrent users, etc.)
- CPU
- IO
- Memory
- Disk Space Usage
- Peak period
- Longest Request
- Threads ( Concurrent Threads, Busy Threads )
Application JVM
- JVM Settings Analysis
- GC Analysis
- Log Analysis (Errors, Exceptions, Warnings, Class Segmentation (Authorization, Permissions, Authentication))
- Audit Enabling and Analysis (Logins, Reads, Writes, Changed Permissions, Workflows Running, Workflow States)
- Caches Monitoring (Cache usage, invalidation, cache sizes)
- Protocol Analysis (FTP, CMIS, SharePoint, WebDAV, IMAP, CIFS)
- Architecture analysis
Search Subsystem(Solr)
- JMX Beans Monitoring
- Caches (Configuration, Utilization, Tuning, Inserts, Evictions and Hits)
- Indexes Health
- JVM Settings Analysis
- JVM Health Analysis
- Garbage Collection Analysis
- Query Debug (Response times, query analysis, slow queries, Peak periods)
- Search and Index Memory Usage
Network
- Input/Output
- High availability
- TCP Errors / Network errors at the network protocol level
- Security Analysis (Open ports, firewalls, network topology, proxies, encryption)
Shared File Systems
- Networking to client hosts
- Storage Type (SAN, NAS)
- I/O
Clustering
- Cluster member subscription analysis
- Cluster cache invalidation strategy and shared caches performance
- Cluster load balancing algorithm performance (cluster nodes load distribution)
The Alfresco Audit Trail
The monitoring solution also uses and indexes the Alfresco audit trail log, when audit is enabled. Alfresco audit should be used with caution as auditing too many events may have a negative impact on performance.
Alfresco has the option of enabling and configuring an audit trail log. It stores specific (configurable) user actions in a dedicated log file (the audit trail).
Building on the auditing architecture, the data producer org.alfresco.repo.audit.access.AccessAuditor gathers lower-level events into user-recognizable events. For example, the download or preview of content is recorded as a single read. Similarly, the upload of a new version of a document is recorded as a single create version. By contrast, the AuditMethodInterceptor data producer would typically record multiple events.
A default audit configuration file located at <alfresco.war>/WEB-INF/classes/alfresco/audit/alfresco-audit-access.xml is provided that persists audit data for general use. This may be enhanced to extract additional data of interest to specific installations. For ease of use, login success, login failure and logout events are also persisted by the default configuration.
Default audit filter settings are also provided for the AccessAuditor data producer, so that internal events are not reported. These settings may be customized (by setting global properties) to include or exclude auditing of specific areas of the repository, users or some other value included in the audit data created by AccessAuditor.
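As a sketch, enabling the audit subsystem and the AccessAuditor producer comes down to a few alfresco-global.properties entries like the ones below (property names as documented for Alfresco 4.x; the last line shows the documented default filter that excludes the internal System user):

# enable the audit subsystem and the alfresco-access audit application
audit.enabled=true
audit.alfresco-access.enabled=true
# enable the default audit filters; exclude events triggered by the System user
audit.filter.alfresco-access.default.enabled=true
audit.filter.alfresco-access.transaction.user=~System;~null;.*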
No additional functionality is provided for retrieving the persisted audit data; as all data is stored in the standard way, it is accessible via the AuditService search, audit web scripts, database queries and the Alfresco Explorer show_audit.ftl preview.
Detailed information on the audit possibilities is available at:
- https://wiki.alfresco.com/wiki/Content_Auditing
- http://docs.alfresco.com/4.2/concepts/audit-intro.html
And that's about it, folks. I hope you liked this article and that it helps you monitor your projects. More articles with relevant information from the field are coming up, so stay tuned.
All the best, One Love,
Luis