Tigase Load Tests again - 500k user connections
Over the last couple of weeks I have had the great opportunity and pleasure to use Sun's environment, both hardware and software, to run load tests on the Tigase server. Sun also offered me something more: their knowledge and help during the tests allowed me to improve Tigase and fine-tune the system to get the best results. I would like to say a big thank you to Deepak Dhanukodi from ISV Engineering at Sun, who was a huge help.
Below is a description of the tests I ran, the environment, and the results I got.
I know a summary should come at the end, but I realize that many people may be interested in the results without being interested in all the details. So here we go.
The main goal of all these tests was to run the Tigase server under a load of 500k active user connections with an empty user roster. This test was meant to show how well the Tigase server handles a huge number of I/O operations and a huge number of user sessions on a single machine.
Success! Tigase easily handled the load, with CPU usage below 30% and memory consumption at about 50%. The test was so successful that we tried to run another, similar test to get to 1 million online users. This, however, failed because the client machines couldn't generate such a load.
The secondary goal was to run comparison tests with different user roster sizes and a user connection count above 100k, to see how the roster size impacts the load and resource consumption.
This test was not about reaching a maximum score, but I still consider it a great success. At a roster size of 40 and above, the Tigase server started to behave unstably. Long GC activity impacted overall performance and in some cases led to an unresponsive service. More details below. I learnt not only that the default GC is not a good choice for the Tigase server under high load, but I also found the best GC and GC parameters to get a stable service under an even higher load than I had planned. The CMS GC is the one which should be used to run Tigase.
Maximum connections with a roster of 50 elements was the last test I wanted to run. In most XMPP installations I have helped to set up, the average roster size was just below 50 elements. So the goal of this test was to see how many connections Tigase can handle with such a roster.
The result, 300k user connections with a roster size of 50, is quite good. CPU usage was below 50% and memory consumption below 60%. We could certainly have tried to handle more connections. Unfortunately, I never expected that the system could handle more than 300k user connections with a 50-element roster, so this is all I had prepared in the database for the test.
I had 12 machines to run my tests: one for the Tigase server, a second for the database, and 10 more machines to generate the clients' load:
- Tigase server: SPARC Enterprise T5220, 32GB RAM, UltraSPARC-T2 CPU with 8 cores and 8 threads per core (64 processing units), CPU clock speed 1165MHz, 146GB 10k HDD SAS and SCSI.
- Database server: Sun Fire X4600, 32GB RAM, 2x AMD Opteron 854 CPU with 4 cores each (8 processing units), CPU clock speed 2.8GHz, 73GB 10k SAS HDD.
- Client machines (10x): Sun Fire V20z, 4GB RAM, AMD Opteron dual-core CPU at 2.1GHz, 36GB 10k SCSI HDD.
The software used on these machines:
- Tigase XMPP Server 4.1.5 as the XMPP (Jabber) server.
- TSung 1.3.0 as the clients' load generator.
- MySQL 5.1.33 Community Server as a database and the configuration file.
- Solaris 10 Update 6 as the OS on the server, Solaris Express Community Edition snv_110 X86 as the OS on the load generators.
There were 2 main types of tests I ran:
- Standard test: the user session was about 20 minutes long and the arrival duration was 60 minutes. This test was mainly used to compare the server behavior with different user roster sizes. The maximum number of user connections was tuned by adjusting the connection rate. This was, however, limited by the database, which couldn't handle the load generated by connections arriving faster than one every 0.0045 sec.
- Extended test: a script similar to the standard one, but the user session time was extended by putting the script body in a loop. This was done to get the maximum possible number of user connections in the test and see how Tigase handles that.
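To see how connection rate and session length together bound the number of concurrent connections, here is a back-of-the-envelope sketch. It is not part of the original test scripts. It assumes users arrive at a constant interval and stay online for a fixed session time, and that the extended test also used a 60-minute arrival phase. The function name `estimate_peak_connections` is hypothetical.

```python
def estimate_peak_connections(interarrival_sec, session_sec, arrival_phase_sec):
    """Rough upper bound on the number of concurrent connections.

    Simplifying assumptions: one new user every interarrival_sec seconds,
    and every user stays online for exactly session_sec seconds.
    """
    if interarrival_sec <= 0:
        raise ValueError("interarrival_sec must be positive")
    # Users online at a given moment arrived within the last session_sec
    # seconds, and arrivals only happen during the arrival phase.
    window_sec = min(session_sec, arrival_phase_sec)
    return round(window_sec / interarrival_sec)


# Standard test: about 20 min sessions, one new connection every 0.0045 sec.
standard = estimate_peak_connections(0.0045, 20 * 60, 60 * 60)
# Extended test: 80 min sessions, one new connection every 0.005 sec.
extended = estimate_peak_connections(0.005, 80 * 60, 60 * 60)
print(standard, extended)
```

The estimates land in the same range as the peaks reported in the results tables but do not match them exactly. That is expected from such a simple model: the session length was only approximately 20 minutes, and login time and load generator limits are ignored.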
Here is a complete description of the Tigase installation, which was fine-tuned to get maximum performance during all tests. Please note that I am not a MySQL expert and I couldn't get the database working fast enough not to impact performance. Therefore the system was configured to avoid any writes to the database during the test.
The complete JVM parameters for the tests are:
-XX:+UseLargePages -XX:+UseBiasedLocking -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:ParallelCMSThreads=8 -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8 -Djdbc.drivers=com.mysql.jdbc.Driver -server -d64 -Xms28G -Xmx28G -XX:PermSize=64m -XX:MaxPermSize=512m
The Tigase server parameters:
--property-file etc/init.properties --test
The '--test' parameter only excludes the offline messages plugin from the installation and decreases the default logging level. This is done to avoid slowing the server down with any unnecessary I/O operations. During the tests Tsung sends lots of messages to online users. In the second phase it happens quite often that a message sent to an online user is processed after the user has already gone, and it then goes to the database for offline storage. This introduced long delays. Heavy logging also introduces significant delays and impacts overall performance, so it is set to the absolute minimum during tests.
The Tigase server configuration properties
config-type=--gen-config-def
--email@example.com
--virt-hosts = tigase.test
--auth-db=tigase-auth
--user-db=mysql
--user-db-uri=jdbc:mysql://192.168.111.32/tigasedb_20roster?user=tigase_user&password=tigase_passwd
--user-repo-pool-size=12
--comp-name-1=srecv
--comp-class-1=tigase.server.sreceiver.StanzaReceiver
#--debug=server
--monitoring=jmx:10000
A few notes on the parameters:
- 'tigase-auth' was used as the authentication connector. It uses stored procedures to perform all user login/logout actions. Normally these procedures also update the last login/logout time. For this test, however, updating the login/logout time was removed from the stored procedures to minimize database delays.
- A different database was used depending on the roster size.
- A database connection pool of 12 was used for the user data database. There was only a single database connection for the user authentication connector.
- StanzaReceiver was loaded to run Tigase internal monitoring tools, which detect system overload, thread deadlocks and other possible problems.
- Monitoring via JMX was enabled and the system was also monitored using JConsole.
The user roster
The user roster was either empty or had the same fixed size for all users. It was built in such a way that exactly half of the buddies were online and the other half were offline when the user was logging in. Later on the rest of the buddies logged in too, so eventually all buddies in the roster were online for the rest of the test.
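One way to get the "half online, half offline at login" property is sketched below. This is an illustration under stated assumptions, not the scheme actually used to fill the test database: it assumes users log in strictly in index order and that sessions last long enough for earlier buddies to still be online. The names `build_roster` and `online_at_login` are hypothetical.

```python
def build_roster(user_index, total_users, roster_size):
    """Return the buddy indexes for one user (hypothetical scheme).

    Each user gets roster_size // 2 buddies with the nearest lower indexes
    and the same number with the nearest higher indexes, wrapping around
    at the ends of the user range.
    """
    if roster_size % 2 != 0:
        raise ValueError("roster_size must be even to split it in half")
    if roster_size >= total_users:
        raise ValueError("roster_size must be smaller than total_users")
    half = roster_size // 2
    earlier = [(user_index - d) % total_users for d in range(half, 0, -1)]
    later = [(user_index + d) % total_users for d in range(1, half + 1)]
    return earlier + later


def online_at_login(user_index, roster):
    # Assumption: logins happen in index order, so a buddy is already
    # online exactly when its index is lower than the user's own.
    return [buddy for buddy in roster if buddy < user_index]


roster = build_roster(1000, 300_000, 50)
print(len(roster), len(online_at_login(1000, roster)))
```

Because the offsets are symmetric, every roster entry is mutual: if A has B as a buddy, B has A. The first and last few users are an exception to the half-and-half split, since their rosters wrap around the ends of the user range.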
Tests and test results
| Name | Roster | Session length | Connections rate | Max connections | CPU usage | RAM usage | Tsung reports | Comments |
|---|---|---|---|---|---|---|---|---|
| 500k | empty | 80 min | 0.005 sec | 622k | CPU | Memory | Tsung report | We also attempted to reach 1 million connections. This failed because of limits on the load-generating machines, which were maxing out their resources above 500k connections. |
| 300k* | 50 | 20 min | 0.0045 sec* | 300k | CPU | Memory | Tsung report | The requirement was to keep the user session within 20 min, so generating more connections required a higher connection rate. Unfortunately 0.0045 sec was the highest rate the database could handle, so 300k was the limit of the test and of the database, not of the Tigase server. |
* - the database limit.
| No | Roster | Session length | Connections rate | Max connections | Tsung reports | Comments |
|---|---|---|---|---|---|---|
| 1. | Empty | 20 min | 0.015 sec | >100k | Tsung report | Default GC. |
| 2. | 10 | 20 min | 0.015 sec | >100k | Tsung report | Default GC. |
| 3. | 20 | 20 min | 0.015 sec | >100k | Tsung report | Default GC. |
| 4. | 30 | 20 min | 0.015 sec | >100k | Tsung report | Default GC. |
| 5. | 40 | 20 min | 0.015 sec | >100k | Tsung report | Default GC. |
| 6. | 50 | 20 min | 0.015 sec | >100k | Tsung report | GC settings: -XX:+UseLargePages -XX:+UseBiasedLocking -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=32. This didn't help much. At a certain load GC delays make Tigase unresponsive. |
| 7. | 50 | 20 min | 0.0045 sec | 299k | Tsung report | GC settings: -XX:+UseLargePages -XX:+UseBiasedLocking -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:ParallelCMSThreads=8. This is the secret formula. CMS GC is the one which works well with Tigase and offers a stable service even under a very high load. |