Ticket #544 (closed defect: fixed)

Opened 12 years ago

Last modified 12 years ago

Email down again

Reported by: caleb@astro.columbia.edu Assigned to: secrest
Priority: critical Component: mail
Keywords: email imap Cc: gbryan, fernando, dkg, jrollins, savin

Description

Email is down again at this time...

Change History

03/12/07 10:17:04 changed by fernando@astro.columbia.edu

  • cc changed from caleb@astro.columbia.edu, gbryan@astro.columbia.edu to gbryan, fernando, dkg.

03/12/07 10:30:20 changed by jrollins@astro.columbia.edu

can I put in a vote for having dspam taken out of the loop permanently? i'm not sure whether we know exactly what the culprit is, but dspam can't be helping.

do we have any idea what is causing these imap crashes?

03/12/07 10:46:00 changed by jrollins@astro.columbia.edu

  • cc changed from gbryan, fernando, dkg to gbryan, fernando, dkg, jrollins.

03/12/07 10:52:25 changed by secrest@astro.columbia.edu

  • priority changed from blocker to critical.

It is back up, started in the same way as in ticket #539. Again, it didn't seem to work unless I also stopped the postfix process and then restarted imap and then postfix, in that order.

Only this time, before restarting it, I ran strace on the cyrmaster process and captured the raw imap data while I connected to the server from an imap client. I will analyze this to see if I can get a better clue as to the source of the problem.

03/12/07 10:55:36 changed by dkg@astro.columbia.edu

When the IMAP service has been going down recently, it has just been cyrus IMAP that is in a bad state. Mail delivery is still going through, but users are unable to access it. I know this is a meaningless distinction for users themselves (they still can't get their mail), but it's useful to know when trying to debug.

dspam isn't even in the loop at the point of imap access. its use is isolated to tagging mails during SMTP delivery time.

03/12/07 11:00:33 changed by jrollins@astro.columbia.edu

you're obviously right, dkg, and I should have known better before making my comment about dspam. sorry to muck up the ticket.

03/12/07 18:58:32 changed by secrest@astro.columbia.edu

The strace output and the captured imap data were not of much use. They just showed an inability of the imap client to connect to the server.

In the log file, mail.log, on sedna, however, there was a segfault in one lmtpd process at 7:02 this morning:

Mar 12 07:02:08 sedna cyrus/lmtpd[31036]: accepted connection
Mar 12 07:02:08 sedna cyrus/lmtpd[31036]: lmtp connection preauth'd as postman
Mar 12 07:02:08 sedna cyrus/lmtpd[31036]: DBERROR db3: 2 lockers
Mar 12 07:02:08 sedna postfix/qmgr[5558]: 8F49916F821C: from=<logcheck@astro.columbia.edu>, size=1688, nrcpt=1 (queue active)
Mar 12 07:02:08 sedna cyrus/master[4520]: process 31036 exited, signaled to death by 11
Mar 12 07:02:08 sedna cyrus/master[4520]: service lmtpunix pid 31036 in BUSY state: terminated abnormally

Judging from the munin graph for the postfix mail queue, this seems to be the moment when the imap server went down.

There are similar entries that occurred at about the time the imap server crashed yesterday, too:

Mar 11 12:43:21 sedna cyrus/lmtpd[31566]: mystore: starting txn 2147596036
Mar 11 12:43:21 sedna cyrus/lmtpd[31566]: DBERROR db3: 3 lockers
Mar 11 12:43:21 sedna postfix/qmgr[11430]: 987C816F81D0: from=<>, size=4203, nrcpt=1 (queue active)
Mar 11 12:43:21 sedna postfix/smtpd[32071]: disconnect from unknown[59.45.103.93]
Mar 11 12:43:21 sedna cyrus/master[28675]: process 31566 exited, signaled to death by 11
Mar 11 12:43:21 sedna cyrus/master[28675]: service lmtpunix pid 31566 in BUSY state: terminated abnormally

In the Prerequisites section of the Installation documentation for cyrus imap, there is a statement:

It is strongly recommended that libsasl be compiled with Berkeley DB support, using the same version of Berkeley DB. (If you have a Berkeley DB version mismatch, somewhat perplexing crashes result.)

I checked the versions of libdb that imapd and libsasl were linked to:

[0 root@sedna log]# ldd /usr/lib/cyrus/bin/imapd
        libsasl2.so.2 => /usr/lib/libsasl2.so.2 (0xb7edc000)
        libresolv.so.2 => /lib/tls/i686/cmov/libresolv.so.2 (0xb7ec9000)
        libdb3.so.3 => /usr/lib/libdb3.so.3 (0xb7e1f000)
        libssl.so.0.9.7 => /usr/lib/i686/cmov/libssl.so.0.9.7 (0xb7dee000)
        libcrypto.so.0.9.7 => /usr/lib/i686/cmov/libcrypto.so.0.9.7 (0xb7cef000)
        libwrap.so.0 => /lib/libwrap.so.0 (0xb7ce6000)
        libnsl.so.1 => /lib/tls/i686/cmov/libnsl.so.1 (0xb7cd2000)
        libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0xb7b9c000)
        libdl.so.2 => /lib/tls/i686/cmov/libdl.so.2 (0xb7b99000)
        /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0xb7ef8000)
[0 root@sedna log]# ldd /usr/lib/sasl2/libsasldb.so
        libdb-4.2.so => /usr/lib/libdb-4.2.so (0xb7e66000)
        libresolv.so.2 => /lib/tls/i686/cmov/libresolv.so.2 (0xb7e53000)
        libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0xb7d1e000)
        /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x80000000)
[0 root@sedna log]#

In particular, notice that imapd links to libdb3.so.3, and libsasldb.so links to libdb-4.2.so.

03/12/07 19:39:13 changed by fernando@astro.columbia.edu

should ticket #539 be closed, as a 'subset'/duplicate of this one?

03/13/07 01:33:46 changed by kvj@astro.columbia.edu

Not sure if this is relevant to this discussion, but I know I'm missing one (important) email from last week (from Thurs 8th at around 15:10) - a colleague happened to forward it to me. But I can't find it in spam or my main mailbox. Is it obvious what happened, or could there be more missing?

kvj

03/13/07 21:32:01 changed by savin@astro.columbia.edu

  • cc changed from gbryan, fernando, dkg, jrollins to gbryan, fernando, dkg, jrollins, savin.

Just adding myself to the cc on this thread.

03/22/07 12:12:59 changed by secrest@astro.columbia.edu

Greg, thanks for fixing the same problem again this morning (ticket #555). At this point, I am going to start a wiki page devoted to this problem because there are so many tickets that refer to it, and there should be one place to collect all the information. I'll post this here, for now, because this seems to be the one open ticket on this issue.

There is a signature in the /var/log/mail.log file on sedna that, I think, marks the onset of this type of problem:

Mar 22 07:10:00 sedna cyrus/master[32208]: about to exec /usr/lib/cyrus/bin/lmtpd
Mar 22 07:10:00 sedna cyrus/lmtpunix[32208]: executed
Mar 22 07:10:00 sedna cyrus/lmtpd[32208]: accepted connection
Mar 22 07:10:00 sedna cyrus/lmtpd[32208]: lmtp connection preauth'd as postman
Mar 22 07:10:00 sedna cyrus/lmtpd[32208]: DBERROR db3: 2 lockers
Mar 22 07:10:00 sedna postfix/qmgr[23230]: 256D216F81D3: from=<MaiaBrahe1246@yahoo.com>, size=6688, nrcpt=1 (queue active)
Mar 22 07:10:00 sedna cyrus/master[23125]: process 32208 exited, signaled to death by 11
Mar 22 07:10:00 sedna cyrus/master[23125]: service lmtpunix pid 32208 in BUSY state: terminated abnormally

It is the line that says 'signaled to death by 11'. Signal 11 is a segmentation fault, which means the process tried to access memory it had no valid mapping for.

What causes that, I don't know, but the last log entry for the process that crashes is always the 'DBERROR db3: n lockers' entry. That means that the process tried to obtain a lock on a Berkeley database and failed. That happens regularly during normal operation, but in this case, my guess is that the process died while holding a lock, and that is why one process crashing halts most of the server operations--connections from imap clients trying to retrieve or move mail and connections from postfix for receiving new mail.
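
For what it's worth, next time this happens it might be worth inspecting the Berkeley DB lock region before restarting anything, using the db_stat utility. Roughly like this (the db3 binary name and the cyrus environment directory are assumptions, not verified on sedna):

# show lock-region statistics (current lockers, locks held, deadlocks)
# for the cyrus Berkeley DB environment
db3_stat -c -h /var/lib/cyrus/db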

Postfix does seem to keep running but is not able to deliver the mail, with messages like:

Mar 22 07:31:24 sedna postfix/pipe[32669]: warning: pipe_command_read: read time limit exceeded
Mar 22 07:31:24 sedna postfix/pipe[32669]: 8CFF116F82BF: to=<lidz@astro.columbia.edu>, relay=caldeliver, delay=1002, status=bounced (Command time limit exceeded: "/usr/local/sbin/cal-deliver")
...
Mar 22 07:31:41 sedna postfix/pipe[32675]: warning: pipe_command_read: read time limit exceeded
Mar 22 07:31:41 sedna postfix/pipe[32675]: 7705D16F82C0: to=<djh@astro.columbia.edu>, relay=caldeliver, delay=1001, status=bounced (Command time limit exceeded: "/usr/local/sbin/cal-deliver")

appearing in the log. And cyrus seems to drop out of the log entirely, as if it weren't running, except for cyrus/squatter, which indexes the mail in users' folders.

And after cyrus is restarted, postfix delivers all the mail that it was spooling while cyrus was down.

Of course, what happens afterwards is not as interesting because what we really want to find out is what is causing the seg fault.

But you may be right, Greg, in that dspam_clean has something to do with it. It kicks off every morning at 6:25 and consumes 95-99% of the processor power for about the next five hours. That happens every day, of course, while cyrus doesn't crash every day, but the timing of the cyrus crashes is suspicious.

03/22/07 13:33:28 changed by secrest@astro.columbia.edu

Just a note, last time it crashed I changed the settings for cyrus to increase the logging level. I did that by putting a line:

CYRUS_VERBOSE=2

in /etc/default/cyrus21.

That doesn't take effect until it restarts, but now that Greg restarted it, it is logging more verbosely. Maybe that will provide useful diagnostic information next time it crashes.

03/29/07 12:56:01 changed by secrest@astro.columbia.edu

The imap server crashed again this morning (ticket #563).

I changed dspam_clean to run weekly instead of daily. On sedna, I did

mv /etc/cron.daily/dspam /etc/cron.weekly/dspam

I will analyze the more verbose logs and post my findings here.

04/05/07 00:08:30 changed by secrest@astro.columbia.edu

The log file, /var/log/mail.log, showed a similar signature on the last crash. The line "signaled to death by 11" marks the server crashing:

Mar 29 08:02:24 sedna cyrus/lmtpd[32132]: mystore: starting txn 2147788506
Mar 29 08:02:24 sedna cyrus/lmtpd[32132]: mystore: committing txn 2147788506
Mar 29 08:02:24 sedna cyrus/lmtpd[32132]: duplicate_mark: <E1HWtKQ-00050r-Vp@titan.astro.columbia.edu> user.dkg.logcheck    1175169738
Mar 29 08:02:24 sedna cyrus/lmtpd[32132]: duplicate_check: <E1HWtKQ-00050r-Vp@titan.astro.columbia.edu>-dkg-cal@fifthhorseman.net .dkg.sieve.          0
Mar 29 08:02:24 sedna cyrus/lmtpd[32226]: telling master 1
Mar 29 08:02:24 sedna cyrus/master[13255]: service lmtpunix pid 32226 in BUSY state: now available and in READY state
Mar 29 08:02:24 sedna cyrus/master[13255]: service lmtpunix now has 1 ready workers
...
Mar 29 08:02:24 sedna cyrus/lmtpd[32132]: mystore: starting txn 2147788510
Mar 29 08:02:24 sedna cyrus/lmtpd[32132]: DBERROR db3: 2 lockers
...
Mar 29 08:02:25 sedna cyrus/master[13255]: process 32132 exited, signaled to death by 11
Mar 29 08:02:25 sedna cyrus/master[13255]: service lmtpunix pid 32132 in BUSY state: terminated abnormally
Mar 29 08:02:25 sedna cyrus/master[13255]: service lmtpunix now has 0 ready workers
...
Mar 29 08:02:25 sedna dspam[32239]: Delivery agent returned exit code 75: /usr/sbin/cyrdeliver dkg -f logcheck@astro.columbia.edu

Again, the last entry logged by the process that died (lmtpd, pid=32132) was "DBERROR db3: 2 lockers", which occurred right after it did a duplicate check. The duplicate database (/var/lib/cyrus/deliver.db) is a db3 (Berkeley DB 3.2) database, which you can see from the line that starts with DUPLICATE in the file /usr/lib/cyrus/cyrus-db-types.active:

DBENGINE BerkeleyDB3.2
DUPLICATE db3_nosync
MBOX skiplist
SEEN skiplist
SUBS flat
TLS db3_nosync

This is further evidence that the crash is caused by a process holding a lock on deliver.db and then crashing.

The dspam entry in the portion of the log above that says "Delivery agent returned exit code 75" is related. When postfix has mail to deliver locally, it pipes it to a perl script, cal-deliver, that Daniel wrote (see ticket #335), which in turn pipes the message to dspam. If the dspam hand-off encounters any errors, cal-deliver instead pipes the message directly to cyrdeliver (which in turn sends it to lmtpd), and if that also fails, it leaves the message in the directory /var/spool/dspam/delivery-cache/, which, incidentally, has 50,000 messages sitting there right now.
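
In shell terms, the fallback logic is roughly the following (a paraphrase, not Daniel's actual perl; the dspam invocation and the cache file naming here are assumptions):

# cal-deliver fallback, paraphrased:
#   1. stash a copy of the incoming message in the delivery cache
#   2. hand it to dspam; if that fails, fall back to cyrdeliver
#   3. the cached copy is only cleaned up when the dspam hand-off succeeds
rcpt="$1"
msg=/var/spool/dspam/delivery-cache/$(date +%s).$$
cat > "$msg"
if dspam --user "$rcpt" --deliver=innocent < "$msg"; then
    rm -f "$msg"
else
    /usr/sbin/cyrdeliver "$rcpt" < "$msg"   # cached copy stays behind for inspection
fi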

From the file lib/sysexits.h in the source directory for cyrus21-common, the line:

#define EX_TEMPFAIL     75      /* temp failure; user is invited to retry */

indicates that "exit code 75" signals a temporary failure. Since dspam is not running as a daemon, it probably never attempts redelivery.

For the record, there are numerous entries in the mail.err file for "exit code 65", which is a "data format error".

Since the problem is a seg fault, it is clearly a bug in the cyrus code. The most direct way to solve it would be to recompile the cyrus server with debugging enabled and get a backtrace from gdb (the GNU debugger). That could pinpoint the exact line in the source code where the seg fault occurred. Of course, that line could be in one of the Berkeley database libraries, too.
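
For the record, that procedure would look roughly like this (the core file location is a guess; I believe the cyrus processes run with /var/lib/cyrus as their working directory):

# rebuild the package with symbols and without stripping, e.g.
#   DEB_BUILD_OPTIONS="noopt nostrip" dpkg-buildpackage -rfakeroot
# enable core dumps for the cyrus services (ulimit -c unlimited in the init script),
# then after the next segfault:
gdb /usr/lib/cyrus/bin/lmtpd /var/lib/cyrus/core
(gdb) bt full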

In the interests of time, however, it might be easier to switch the backend for the duplicates database to type skiplist. That may not fix the seg fault problem, but it may handle the locking problem better, i.e., one process crashing won't bring down the whole server.

And the kludgiest solution would be to have a process monitor the logs and, when it sees that "signaled to death by 11", restart the server.

04/05/07 21:54:13 changed by gbryan@astro.columbia.edu

Hi David -- Thanks for this. I agree that the middle solution (i.e. switching the duplicates database) seems like the best compromise, assuming, of course, that it is relatively easy to do the switch. Debugging this sort of locking problem in the cyrus and/or berkeley database code sounds like it could be difficult, and the restart kludge doesn't seem appealing.

I was struck by what you said, if I understood correctly, about 50,000 potentially undelivered messages. Do we think these are duplicates, or could there possibly be real, undelivered, unbounced email sitting there? It is true that there have been occasional complaints of lost email (e.g. kvj's comment earlier on this ticket).

04/06/07 14:30:38 changed by secrest@astro.columbia.edu

I will plan on switching the backend for the duplicates database to type 'skiplist' on Monday evening. The default installation, which we still have, uses the 'Berkeley DB (no sync)' backend for the duplicates database. From the cyrus wiki notes, it sounds like skiplist is better able to deal with crashes--exactly what we need if we are not going to fix the seg fault problem. (And if the seg fault is occurring in the Berkeley database libraries, switching to skiplist could possibly even stop the crashes. That would be a nice bonus. The main issue, though, is that we don't want a single process crashing to bring down the entire server.)

The server will have to be down during the switch, but probably for less than half an hour. I will do it Monday evening, when I can work late. I am normally in early on Tuesdays, so I will be able to monitor it closely the next morning.

In looking at the messages in /var/spool/dspam/delivery-cache/, it seems that there are typically 5-30/day, all the way back to Oct 14, 2006, around the time that the cal-deliver script was first installed (ticket #335). The bulk of the emails in there, though, are from Feb 7-13, which happened to be the time that dspam was temporarily turned off (ticket #488). Those probably are almost all duplicates because the script /usr/local/sbin/cal-deliver-no-dspam (which, presumably, was what ran in the place of cal-deliver during that time) left a copy of all mail messages in there, whether delivery was successful or not, and most of the time, I am sure, it was.

In general, it is hard to tell whether a given message in /var/spool/dspam/delivery-cache is a duplicate or a "lost" message. cal-deliver removes the cached copy if the message was successfully delivered to dspam, but if delivery to dspam failed, it leaves the copy there regardless of whether a subsequent delivery attempt via cyrdeliver succeeded. However, cal-deliver also logged information to /var/spool/dspam/alerts, and by combining that with the delivery-cache contents, I can probably figure out which of them are duplicates. The ones that aren't duplicates ("lost" emails) I can re-inject into the system.

04/06/07 21:38:36 changed by gbryan@astro.columbia.edu

OK, this sounds good to me. I wouldn't worry too much about finding and delivering old lost emails. I'm more concerned about this being a possible indication that we are losing emails (without any indication to the sender) on an ongoing basis. Obviously if this were true, it would be very important to fix it. Thanks.

04/10/07 09:49:49 changed by secrest@astro.columbia.edu

The duplicates database for cyrus imap was changed from type 'db3_nosync' to type 'skiplist'. To do that, I brought down the imap server:

/etc/init.d/cyrus21 stop

moved the current duplicates database and the database types file:

mv /var/lib/cyrus/deliver.db /root
mv /usr/lib/cyrus/cyrus-db-types.active /root

changed the line that started with DUPLICATE in the /usr/lib/cyrus/cyrus-db-types.txt file to read:

DUPLICATE skiplist

and ran:

dpkg-reconfigure cyrus-common

That restarted the server and recreated the cyrus-db-types.active and deliver.db files.

Mostly I followed the directions in the README.Debian.database file in the Debian source package cyrus-common. I tried to use cvt_cyrusdb as instructed, but that didn't work. (First I converted it from db3 to flat, which seemed to work. But when I then tried to convert it from flat to skiplist, it said the database was empty. When I tried to convert directly from db3 to skiplist--contrary to the instructions--it created a massive file that filled up the /var partition. So I just blew it away. The worst that can happen is that some people will get some duplicate messages.)
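
For the record, the conversion attempts looked roughly like this (the backend names and the path to cvt_cyrusdb are from memory, so treat this as a sketch rather than a transcript):

# attempt 1: db3 -> flat (seemed to work), then flat -> skiplist (came out empty)
/usr/lib/cyrus/bin/cvt_cyrusdb /var/lib/cyrus/deliver.db db3 /root/deliver.flat flat
/usr/lib/cyrus/bin/cvt_cyrusdb /root/deliver.flat flat /root/deliver.skiplist skiplist

# attempt 2: db3 -> skiplist directly, contrary to the README (filled up /var)
/usr/lib/cyrus/bin/cvt_cyrusdb /var/lib/cyrus/deliver.db db3 /var/lib/cyrus/deliver.db.skiplist skiplist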

04/15/07 09:35:45 changed by savin@astro.columbia.edu

It seems to be down yet again Sunday 2007-04-15 at 9:35 AM. When I try to log in through mail.astro.columbia.edu I get the following error message:

Error connecting to IMAP server: tls://mail.astro.columbia.edu. 115 : Operation now in progress

04/15/07 09:38:56 changed by savin@astro.columbia.edu

BTW, when the email goes down after hours, how can one let you folks know? Using the wiki doesn't seem like it would work, since the emails the wiki sends to notify folks of a new or updated ticket go to astro addresses and won't be received until the email is back up. Any suggestions?

04/15/07 10:17:50 changed by dkg@astro.columbia.edu

I've restarted the imap service as mentioned in #539.

reporting things here is entirely reasonable, since the RSS feeds of various ticket types are an alternate form of notification. Thank you for the report, Daniel.

04/15/07 10:18:13 changed by dkg@astro.columbia.edu

btw, dspam_clean was indeed running full tilt while this crash happened.

04/15/07 10:53:20 changed by fernando@astro.columbia.edu

David, maybe you can move the dspam_clean run to, e.g., Monday morning, when you're around, so that you can immediately spot the crash and restart imap? (of course, really one should figure out what the problem with dspam_clean is, but in the meantime, since this has been happening for months and months and still hasn't been taken care of, mitigating the hassle as much as possible would be very nice). thanks.

04/17/07 18:59:58 changed by dkg@astro.columbia.edu

Trying to recompile cyrus21-imapd against Berkeley DB version 4.2 is proving tougher than I expected, because the autoconf scripts don't seem to detect libdb4.2 properly.

A couple things to note:

  • cyrus-imapd-2.2 is now in etch, and is built natively against libdb4.2. It is also in sarge-backports, which is already in sedna's sources.list.
  • the cyrus changelog seems ambiguous about whether db4.2 is supported in cyrus 2.1. The changelog entry for 2.2.2 claims Berkeley DB 4.2 support, but the entry for 2.1.1 only claims "Now compatible with Berkeley DB4". It's not clear to me whether that means our version (2.1.18) can handle db4.2 or not.

I'm also not sure what the transition from cyrus21-imapd to cyrus-imapd-2.2 would look like.

04/24/07 09:31:41 changed by gbryan@astro.columbia.edu

The IMAP server failed again this morning at about 4 AM (April 24, 2007). I restarted it as in ticket #539.

04/24/07 11:52:52 changed by secrest@astro.columbia.edu

Thanks again, Greg. I checked my mail at 7am this morning, and it was working. I had left myself logged in overnight, though, so I guess it continues to work for users who already had established imap connections. The logs look the same again:

Apr 24 04:11:12 sedna cyrus/lmtpd[15520]: lmtp connection preauth'd as postman
Apr 24 04:11:12 sedna cyrus/lmtpd[15520]: duplicate_check: <0D9EC6E8.8EE26567@italnet.com.br>-Marc.Audard@obs.unige.ch .audard.sieve.       0
...
Apr 24 04:11:12 sedna cyrus/lmtpd[15520]: mystore: starting txn 2147788827
Apr 24 04:11:12 sedna cyrus/lmtpd[15520]: DBERROR db3: 3 lockers
Apr 24 04:11:12 sedna cyrus/master[9813]: process 15520 exited, signaled to death by 11
Apr 24 04:11:12 sedna cyrus/master[9813]: service lmtpunix pid 15520 in BUSY state: terminated abnormally
Apr 24 04:11:12 sedna cyrus/master[9813]: service lmtpunix now has 0 ready workers
Apr 24 04:11:12 sedna dspam[15505]: Delivery agent returned exit code 75: /usr/sbin/cyrdeliver audard -f ffxvhaof@italnet.com.br

Later though, there was:

Apr 24 04:26:55 sedna cyrus/imapd[7862]: imaps TLS negotiation failed: [local]
Apr 24 04:26:55 sedna cyrus/imapd[7862]: Fatal error: tls_start_servertls() failed

And at shutdown:

Apr 24 09:23:18 sedna cyrus/lmtpd[15513]: DBERROR db3: Error: closing the transaction region with active transactions
Apr 24 09:23:18 sedna cyrus/lmtpd[15477]: DBERROR db3: Error: closing the transaction region with active transactions
...
Apr 24 09:23:18 sedna cyrus/imapd[12458]: DBERROR db3: Error: closing the transaction region with active transactions

which seems to indicate that the problem is not with the duplicates backend but with the tls backend, which is also db3. I had been jumping to the conclusion that, because the dying process always logged a duplicate check and then a db3 locking error before seg faulting, the crash was related to a lock on the duplicates database. From these other entries in the log, which I hadn't noticed before, it now seems that the problem is related to a lock on the tls database.

And, for the record, my attempt to change the duplicates database to skiplist didn't seem to work anyway:

[0 root@sedna root]# file /var/lib/cyrus/deliver.db
/var/lib/cyrus/deliver.db: Berkeley DB (Btree, version 8, native byte-order)
[0 root@sedna root]#

I followed the directions in the README.Debian.database file and it even updated the /usr/lib/cyrus/cyrus-db-types.active file, but I guess that wasn't right. I think it is time to include a quote from the README.Debian.database file:

"All systems administrators have their horror stories. For me, it was setting up a HP Color Bubblejet under Linux using ghostscript before linuxprinting.org was alive. Well that was a piece of cake compared to what I am about to describe in this document."

-- "Hosting email for virtual domains using Postfix and Cyrus"

Haim Dimermanas, 2001-08-01

Now I think, once again, that the problem may be a mismatch in the Berkeley database shared libraries between cyrus and libsasl, as mentioned at http://cyrusimap.web.cmu.edu/imapd/install-prereq.html.

At this point, I am not going to analyze this any more. I am just going to write a script that restarts the server when it sees the "signaled to death by 11" message in the log, and hope that the problem goes away on the next major upgrade of the mail server.

05/02/07 08:09:42 changed by savin@astro.columbia.edu

Seems to be down again at 8:10 AM Wed 2 May.

05/02/07 09:09:11 changed by secrest@astro.columbia.edu

A script has been written to check the log every five minutes for this problem. For now, it only sends a text message to my phone, which it did at 8:03; the crash occurred at 8:02:22. Unfortunately I was on the subway, but I got in and restarted it at 8:45.

I hadn't included the part of the script that actually restarts the server, yet, because I wanted to make sure that the script didn't have any unexpected behavior. Unexpectedly, it got tested and worked exactly as it was supposed to.

The script is at /usr/local/sbin/cyrus-watch on sedna and is set (in /etc/crontab) to run every five minutes. It logs its activities in /var/log/cyrus-watch.log.

05/02/07 12:22:53 changed by secrest@astro.columbia.edu

I updated the script, /usr/local/sbin/cyrus-watch, to restart the cyrus imap server when it observes the signature in the log for this kind of crash. As I mentioned in the previous comment, it logs its actions in /var/log/cyrus-watch.log.
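
The logic amounts to something like the following sketch (illustrative only; the installed script differs in details, this version ignores log rotation, and the alert address is obviously a placeholder):

#!/bin/sh
# cyrus-watch (sketch) -- run from /etc/crontab every five minutes.
# Scans new mail.log lines for the crash signature; if found, alerts
# my phone and restarts cyrus, logging what it did.
LOG=/var/log/mail.log
STATE=/var/run/cyrus-watch.offset
WATCHLOG=/var/log/cyrus-watch.log

last=$(cat "$STATE" 2>/dev/null || echo 0)
lines=$(wc -l < "$LOG")
echo "$lines" > "$STATE"

if tail -n "+$((last + 1))" "$LOG" | grep -q 'signaled to death by 11'; then
    echo "$(date): crash signature seen, restarting cyrus" >> "$WATCHLOG"
    echo "cyrus crash signature seen on sedna" | mail -s 'cyrus-watch' pager-address@example.invalid
    /etc/init.d/cyrus21 restart >> "$WATCHLOG" 2>&1
fi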

This does not fix the underlying cause of the crashes, but from the users' standpoint, it solves the problem. The server should never be down (due to this type of problem) for more than 5 minutes. Since we are planning on upgrading the mail server at some point, I am not going to attempt to solve the underlying problem. I will just deal with it if it still exists in the upgraded version of cyrus imap.

Since I couldn't really fully test this script, I will wait until the next crash before closing this ticket. It will still send a message to my cell phone, so I will know to check it right away.

05/07/07 11:34:28 changed by secrest@astro.columbia.edu

  • status changed from new to closed.
  • resolution set to fixed.

The imap server crashed again at 7:02 am on Sunday morning (5/6). At 7:06 am, the script /usr/local/sbin/cyrus-watch on sedna noticed the problem in the log, alerted my cell phone, and then completely shut down and restarted the imap server.

This is as far as I am going to take this problem until the next mail server upgrade, so I am closing this ticket.

05/07/07 14:39:11 changed by fernando@astro.columbia.edu

David, that's excellent, thanks.