Wednesday, March 21, 2007

Windows Server Limitations

We have run into an interesting scaling problem with Domino on Windows Server 2003. We have 6 Domino mail servers with about 850 users on each, about 5000 users in total. These systems are configured as 3 Windows servers, each supporting two Domino partitions. With a corporate policy of unlimited mail file size, this means about 6 TB of data in total, or about 2 TB per Windows server. Due to the high volume of open mail files, Windows can't quite cope, and occasionally corrupts mail files. When you see the message "Insufficient system resources..." in the Domino log, you know database corruption is not far away! The problem is caused by the way Notes uses the Windows paged pool to cache files (see IBM technote 1093511 and Microsoft document Q312362). Of course you can disable OS caching altogether but, as IBM says, this will seriously impact server performance.
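Since that message is the early warning sign, it is worth watching for it. Below is a minimal sketch in Python, assuming the Domino log or console output has been exported to a plain text file. The function name and the file name are hypothetical, and nothing is assumed about the log line format beyond the quoted text.

```python
import sys
from typing import Iterable, List, Tuple

# The only text taken from the post; the rest of the line format is not assumed.
WARNING_TEXT = "Insufficient system resources"


def find_resource_warnings(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs for lines containing the warning text."""
    hits = []
    for number, line in enumerate(lines, start=1):
        # Case-insensitive substring match on the quoted message.
        if WARNING_TEXT.lower() in line.lower():
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_log.py exported_log.txt  (hypothetical file name)
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for number, line in find_resource_warnings(handle):
            print(f"{number}: {line}")
```

Run on a schedule against a fresh export, something like this would at least flag the warning before the CEO does.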

Initially we were alerted to the problem last December when massive data corruption occurred on all mail servers in the space of a week or so. That first occurrence was shortly after we completed our server consolidation project, putting two Domino partitions on one Windows server. After this cache problem happens, Domino finds some mail files with thousands of corrupt documents. It then removes those documents (i.e. deletes them without leaving a deletion stub). Missing documents are replaced from the clustered mail server in the next scheduled replication, and life continues.

Since this is a known limitation on Windows, the server operations teams patched the Windows NetApp drivers. They also tweaked the servers to increase the size of the paged pool. That certainly reduced the problem, but in the first 3 months of this year we had another 3 minor occurrences, seemingly at random. There seems to be little correlation between server load and the problem happening. In fact, it is possible to go for quite a while without noticing the problem. Unless it happens to the CEO's mail file, which it did on Monday.

We are tackling the problem on a number of fronts. Short term, we are adding a new (clustered) server and moving about 100 users off each Domino server to this new server. That will reduce the number of open files on each Windows box by about 200. In Q3 we are upgrading to Domino 7.0.2, and we will upgrade Windows to the 64-bit version at the same time, which increases the size of the paged pool.
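The arithmetic behind that figure of 200 can be sketched as follows. This is a rough back-of-the-envelope calculation in Python using the approximate numbers from this post, and it assumes one open mail file per user, which is a simplification.

```python
# Approximate figures from the post.
USERS_PER_PARTITION = 850          # users on each Domino mail server
PARTITIONS_PER_WINDOWS_BOX = 2     # Domino partitions per Windows server
WINDOWS_BOXES = 3
USERS_MOVED_PER_PARTITION = 100    # users moved off each Domino server


def open_files_per_box(users_per_partition: int, partitions_per_box: int) -> int:
    """Estimate open mail files per Windows box, assuming one file per user."""
    return users_per_partition * partitions_per_box


before = open_files_per_box(USERS_PER_PARTITION, PARTITIONS_PER_WINDOWS_BOX)
after = open_files_per_box(
    USERS_PER_PARTITION - USERS_MOVED_PER_PARTITION, PARTITIONS_PER_WINDOWS_BOX
)
total_users = USERS_PER_PARTITION * PARTITIONS_PER_WINDOWS_BOX * WINDOWS_BOXES
users_on_new_server = (
    USERS_MOVED_PER_PARTITION * PARTITIONS_PER_WINDOWS_BOX * WINDOWS_BOXES
)

print(f"Open files per box before: {before}")   # 1700
print(f"Open files per box after:  {after}")    # 1500
print(f"Reduction per box:         {before - after}")  # 200
print(f"Total users (approx):      {total_users}")     # 5100
print(f"Users on the new server:   {users_on_new_server}")  # 600
```

So each Windows box drops from roughly 1700 open mail files to roughly 1500, and the new server picks up about 600 users.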

I am also curious to see how well Linux fares under a real production load, so long term we are setting up a production mail server on Red Hat Linux. Fortunately Domino supports application-level clustering, which means we can leave the current pairs of clustered Windows servers in place and bring up a Linux server as a 3rd member of the cluster. We will then migrate users over to it slowly (can you tell I am risk averse?!!). Finally we will restrict access to the current main production server, and run Domino on Linux with a full production load. This is a long-term project, and I doubt we will get it completed this year. But it promises to be a very interesting comparison to watch.