Real World Use Cases: Troubleshooting a large GeoServer deployment
In this post we would like to describe how we helped one of our clients troubleshoot its GeoServer and GeoWebCache deployment in order to respond quickly to a tremendous peak in the number of accesses to the geoportal sitting in front of this deployment. Unfortunately we cannot name the client or the portal explicitly; all we can say is that we are talking about a nationwide geoportal for an Eastern European country.
The data served by the system comprises around 2 TB of raster data (mostly nationwide orthophotos) plus another 60 GB of vector data (mostly cadastral parcels).
The initial deployment was designed to support 100 requests/second (with an average response time of 3 seconds and a maximum response time of 10 seconds), split 40% to GeoServer and 60% to GeoWebCache, since most vector layers were visible only at high resolutions and were not tile cached due to the extensive usage of labels (we wanted the best possible label layout, which is often broken by tiling).
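To make the design target more concrete, here is a rough back-of-the-envelope sketch in Python. It splits the 100 requests/second between the two tiers and uses Little's law (average requests in flight = arrival rate × average response time) to estimate how many requests each tier holds open at once. Applying the same 3 second average to both tiers is our simplifying assumption for illustration, and the function and variable names are made up for this example.

```python
# Back-of-the-envelope capacity budget for the design target.
# Little's law: average requests in flight = arrival rate * average response time.

TARGET_RPS = 100        # total requests per second (design target)
AVG_RESPONSE_S = 3      # average response time in seconds (design target)
GEOSERVER_PERCENT = 40  # share of traffic rendered directly by GeoServer
GWC_PERCENT = 60        # share of traffic served by GeoWebCache


def budget(total_rps, percent, avg_response_s):
    """Return (requests per second, average requests in flight) for one tier.

    Assumption: the same average response time applies to every tier.
    """
    rps = total_rps * percent / 100
    return rps, rps * avg_response_s


geoserver_rps, geoserver_in_flight = budget(TARGET_RPS, GEOSERVER_PERCENT, AVG_RESPONSE_S)
gwc_rps, gwc_in_flight = budget(TARGET_RPS, GWC_PERCENT, AVG_RESPONSE_S)

print(f"GeoServer:   {geoserver_rps:.0f} req/s, about {geoserver_in_flight:.0f} in flight")
print(f"GeoWebCache: {gwc_rps:.0f} req/s, about {gwc_in_flight:.0f} in flight")
```

Under these assumptions the GeoServer tier would need to sustain about 40 requests/second with roughly 120 requests open at any time, which gives a feeling for how many rendering slots the cluster must offer overall.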
The initial setup we put in place is described in the picture below.
We had deployed two large VMware virtual machines (see above for the characteristics), each running 3 instances of GeoServer, one instance of PostgreSQL and one instance of GeoWebCache. The two instances of PostgreSQL were clustered in Active-Active mode via PGPool, while the two instances of GeoWebCache were in Active-Passive mode since we had to enable the DiskQuota extension.
The requests were balanced using Apache HTTPd on a separate, dedicated VM.
For this specific project we investigated in depth the performance of ImageIO-Ext with large BigTIFF files and came up with some nice optimizations that are crucial for loading metadata without using a lot of memory when inner tiling is used. These tweaks also have the side effect of improving the performance of read operations on huge files by at least an order of magnitude, which brings a nice scalability improvement in GeoServer that we measured at 40% to 60% more throughput!
These improvements will be part of GeoServer 2.2.0 and 2.1.5.
The system was launched on a Thursday and we soon realized that we had to endure a load much higher than expected: the peak load we saw at the balancer was above 7000 requests/second and the performance of the system was sinking, so we had to react quickly.
We first analyzed the requests that we were getting and then decided to do the following:
- We replaced Apache HTTPd as a load balancer with HAProxy since Apache HTTPd was not really able to cope with the load
- Part of the team started to work on refactoring the front-end to use more tiled layers with GeoWebCache
- We increased the amount of disk space available via OCFS2 so that we could get rid of DiskQuota for GeoWebCache
- We moved the GeoWebCache instances to their own virtual machines, removed DiskQuota and MetaStore, and made them work in Active-Active mode. With this setup and the usage of the OCFS2 filesystem we were able to have the two instances working even against a cache that was not fully seeded
- We moved the PostgreSQL instances to their own VM
- We put 6 GeoServer instances on each of the two VMs we had at the beginning. We also added two more identical VMs, bringing the number of GeoServer instances in place to 24. Although we had decided to rely more on GeoWebCache and less on GeoServer from the front-end, we needed the seeding process to run as fast as possible, hence we needed processing power!
- We configured the GWC instances to go through the load balancer when creating tiles, so that they talk to the whole GeoServer cluster and the load is balanced as much as possible.
- We tweaked the setup of the control flow plugin in order to better control the load on the servers and queue excess requests, trying to preserve good throughput.
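The idea behind the last point, bounding how many requests run at once and queueing the rest, can be illustrated with a small Python sketch. This is not the control flow plugin's implementation or configuration, just a generic illustration of the technique; the class name, the limits and the fake "rendering" delay are all made up for the example.

```python
import threading
import time


class ConcurrencyLimiter:
    """Let at most `limit` requests run at once; the others wait up to `timeout` seconds."""

    def __init__(self, limit, timeout):
        self._slots = threading.BoundedSemaphore(limit)
        self._timeout = timeout

    def run(self, handler):
        # Queue until a slot is free; give up if the wait exceeds the timeout.
        if not self._slots.acquire(timeout=self._timeout):
            return None  # request dropped after waiting too long
        try:
            return handler()
        finally:
            self._slots.release()


def demo(limit=2, requests=6, work_s=0.05):
    """Fire `requests` concurrent calls and report (peak concurrency, completed calls)."""
    limiter = ConcurrencyLimiter(limit, timeout=5.0)
    lock = threading.Lock()
    state = {"running": 0, "peak": 0, "done": 0}

    def handler():
        with lock:
            state["running"] += 1
            state["peak"] = max(state["peak"], state["running"])
        time.sleep(work_s)  # stand-in for rendering a map
        with lock:
            state["running"] -= 1
            state["done"] += 1
        return True

    threads = [threading.Thread(target=limiter.run, args=(handler,)) for _ in range(requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"], state["done"]


peak, done = demo()
print(f"peak concurrency: {peak}, completed requests: {done}")
```

Capping concurrency this way keeps each instance working on a number of requests it can actually serve quickly, instead of letting every incoming request compete for CPU and memory at the same time, while the timeout prevents the queue from growing without bound.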
The GeoSolutions team,