PostgreSQL Does Not Start

Expected situations in which PostgreSQL fails to start on an SMW HA system

There are several situations in which PostgreSQL may fail to start. The status of PostgreSQL can be checked with crm status.

PostgreSQL Failure on Passive SMW Due to Synchronization

If the output of crm status is as follows:
2 nodes configured
33 resources configured

Online: [ smw1 smw2 ]

Full list of resources:

 ClusterIP      (ocf::heartbeat:IPaddr2):       Started smw1
 ClusterIP1     (ocf::heartbeat:IPaddr2):       Started smw1
 ClusterIP2     (ocf::heartbeat:IPaddr2):       Started smw1
 ClusterIP3     (ocf::heartbeat:IPaddr2):       Started smw1
 ClusterIP4     (ocf::heartbeat:IPaddr2):       Started smw1
 ClusterIP5     (ocf::heartbeat:IPaddr2):       Started smw1
 ClusterMonitor (ocf::smw:ClusterMonitor):      Started smw1
 ClusterTimeSync        (ocf::smw:ClusterTimeSync):     Started smw1
 HSSDaemonMonitor       (ocf::smw:HSSDaemonMonitor):    Started smw1
 Notification   (ocf::heartbeat:MailTo):        Started smw1
 ResourceInit   (ocf::smw:ResourceInit):        Started smw1
 cray-cfgset-cache      (systemd:cray-cfgset-cache):    Started smw1
 dhcpd  (systemd:dhcpd.service):        Started smw1
 fsync  (ocf::smw:fsync):       Started smw1
 hss-daemons    (lsb:rsms):     Started smw1
 stonith-1      (stonith:external/ipmi):        Started smw2
 stonith-2      (stonith:external/ipmi):        Started smw1
 Resource Group: HSSGroup
     mysqld     (ocf::heartbeat:mysql): Started smw1
 Resource Group: IMPSGroup
     cray-ids-service   (systemd:cray-ids-service):     Started smw1
     cray-ansible       (systemd:cray-ansible): Started smw1
     IMPSFilesystemConfig       (ocf::smw:FileSystemConfig):    Started smw1
 Resource Group: LogGroup
     cray-syslog        (systemd:llmrd.service):        Started smw1
     LogFilesystemConfig        (ocf::smw:FileSystemConfig):    Started smw1
 Resource Group: SharedFilesystemGroup
     homedir    (ocf::heartbeat:Filesystem):    Started smw1
     md-fs      (ocf::heartbeat:Filesystem):    Started smw1
     imps-fs    (ocf::heartbeat:Filesystem):    Started smw1
     ml-fs      (ocf::heartbeat:Filesystem):    Started smw1
     repos-fs   (ocf::heartbeat:Filesystem):    Started smw1
 Resource Group: SystemGroup
     NFSServer  (systemd:nfsserver):    Started smw1
     EnableRsyslog      (ocf::smw:EnableRsyslog):       Started smw1
     syslog.socket      (systemd:syslog.socket):        Started smw1
 Clone Set: clo_PostgreSQL [PostgreSQL]
     Started: [ smw1 ]
     Stopped: [ smw2 ]
The last line indicates that PostgreSQL is not running on smw2. If smw2 is the passive SMW, it may still be synchronizing with the active SMW; PostgreSQL starts automatically once synchronization completes.
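As a minimal sketch, the clone-set state can be picked out of crm status output with grep. The here-document below reproduces the sample lines above; on a live SMW, replace it with the real command output.

```shell
# Sketch: extract the PostgreSQL clone-set lines from saved "crm status" output.
# On a live SMW, use instead:  crm_output=$(crm status)
crm_output=$(cat <<'EOF'
 Clone Set: clo_PostgreSQL [PostgreSQL]
     Started: [ smw1 ]
     Stopped: [ smw2 ]
EOF
)
# Print the clone-set header plus the two lines that follow it.
echo "$crm_output" | grep -A2 'clo_PostgreSQL'
```

A "Stopped" entry for the passive SMW in this output is the condition discussed above.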
To confirm that synchronization is the issue, run journalctl -u pmdb_util. The following lines should appear near the bottom of the output:
Nov 30 14:43:27 smw1 pmdb_util[40749]: [init_standby()]: INFO: Initializing HA standby system...
Nov 30 14:43:40 smw1 pmdb_util[40749]: [init_standby()]: INFO: Old data directory removed.
Nov 30 14:43:40 smw1 pmdb_util[40749]: [init_standby()]: INFO: Synchronizing this standby with master. This might take a while!
This indicates that synchronization is not yet complete; wait longer before re-checking the status of PostgreSQL.
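The check above can be scripted as a simple text match. This is a minimal sketch that operates on a sample log line; on a live SMW the text would come from journalctl -u pmdb_util | tail.

```shell
# Sketch: decide whether standby synchronization is still in progress by
# looking for the "Synchronizing" marker in recent pmdb_util log lines.
# On a live SMW, use instead:  log_tail=$(journalctl -u pmdb_util | tail -n 20)
log_tail='INFO: Synchronizing this standby with master. This might take a while!'
if echo "$log_tail" | grep -q 'Synchronizing this standby with master'; then
    echo "synchronization still in progress; wait before re-checking"
fi
```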

Reported Failures in journalctl -u pmdb_util

If the last line of the journalctl -u pmdb_util output is Initial replication NOT successful!, synchronization has failed. Review the full output to locate the underlying problem; the resolution depends on the specific error reported.
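This check can also be automated. The sketch below tests a sample final line for the failure marker; on a live SMW the line would come from journalctl itself.

```shell
# Sketch: test the final pmdb_util journal line for the failure marker.
# On a live SMW, use instead:  last_line=$(journalctl -u pmdb_util | tail -n 1)
last_line='Initial replication NOT successful!'
case "$last_line" in
    *'Initial replication NOT successful!'*) status=failed ;;
    *)                                       status=ok ;;
esac
echo "standby initialization: $status"
```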

Reported Failures in crm status on the Passive SMW

Problems with PostgreSQL may also be found in the output of crm status. PostgreSQL may have timed out, as seen below:
PostgreSQL_start_0 on smw2 'unknown error' (1): call=615, status=Timed Out, exitreason='none',
    last-rc-change='Fri Dec  8 11:57:41 2017', queued=0ms, exec=200002ms
It may also have failed, as seen in the following output:
PostgreSQL_start_0 on ethel 'not running' (7): call=7480, status=complete, exitreason='none',
    last-rc-change='Wed Feb 21 12:29:12 2018', queued=0ms, exec=3141ms
Since smw2 is the passive SMW, the most likely cause of this failure is that it could not replicate data from the master quickly enough. Re-initialize the passive SMW manually with pmdb_util ha --init_standby. When that completes, run clear_failcounts to remove the reported failure from crm status; the cluster should then return to a healthy state.
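The recovery sequence above can be sketched as a short script. Because pmdb_util and clear_failcounts exist only on the SMW, this hedged sketch defaults to a dry run that prints each command instead of executing it; set DRY_RUN=0 on a real SMW.

```shell
# Hedged sketch of the passive-SMW recovery sequence. DRY_RUN=1 (the default)
# prints each command rather than running it, since pmdb_util and
# clear_failcounts are SMW-only tools.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run pmdb_util ha --init_standby   # re-initialize the passive SMW's PMDB
run clear_failcounts              # clear the reported failure from crm status
```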

Investigate PostgreSQL Failure on the Active SMW

If the output of crm status is as follows:
2 nodes configured
33 resources configured

Online: [ smw1 smw2 ]

Full list of resources:

 ClusterIP      (ocf::heartbeat:IPaddr2):       Started smw1
 ClusterIP1     (ocf::heartbeat:IPaddr2):       Started smw1
 ClusterIP2     (ocf::heartbeat:IPaddr2):       Started smw1
 ClusterIP3     (ocf::heartbeat:IPaddr2):       Started smw1
 ClusterIP4     (ocf::heartbeat:IPaddr2):       Started smw1
 ClusterIP5     (ocf::heartbeat:IPaddr2):       Started smw1
 ClusterMonitor (ocf::smw:ClusterMonitor):      Started smw1
 ClusterTimeSync        (ocf::smw:ClusterTimeSync):     Started smw1
 HSSDaemonMonitor       (ocf::smw:HSSDaemonMonitor):    Started smw1
 Notification   (ocf::heartbeat:MailTo):        Started smw1
 ResourceInit   (ocf::smw:ResourceInit):        Started smw1
 cray-cfgset-cache      (systemd:cray-cfgset-cache):    Started smw1
 dhcpd  (systemd:dhcpd.service):        Started smw1
 fsync  (ocf::smw:fsync):       Started smw1
 hss-daemons    (lsb:rsms):     Started smw1
 stonith-1      (stonith:external/ipmi):        Started smw2
 stonith-2      (stonith:external/ipmi):        Started smw1
 Resource Group: HSSGroup
     mysqld     (ocf::heartbeat:mysql): Started smw1
 Resource Group: IMPSGroup
     cray-ids-service   (systemd:cray-ids-service):     Started smw1
     cray-ansible       (systemd:cray-ansible): Started smw1
     IMPSFilesystemConfig       (ocf::smw:FileSystemConfig):    Started smw1
 Resource Group: LogGroup
     cray-syslog        (systemd:llmrd.service):        Started smw1
     LogFilesystemConfig        (ocf::smw:FileSystemConfig):    Started smw1
 Resource Group: SharedFilesystemGroup
     homedir    (ocf::heartbeat:Filesystem):    Started smw1
     md-fs      (ocf::heartbeat:Filesystem):    Started smw1
     imps-fs    (ocf::heartbeat:Filesystem):    Started smw1
     ml-fs      (ocf::heartbeat:Filesystem):    Started smw1
     repos-fs   (ocf::heartbeat:Filesystem):    Started smw1
 Resource Group: SystemGroup
     NFSServer  (systemd:nfsserver):    Started smw1
     EnableRsyslog      (ocf::smw:EnableRsyslog):       Started smw1
     syslog.socket      (systemd:syslog.socket):        Started smw1
 Clone Set: clo_PostgreSQL [PostgreSQL]
     Stopped: [ smw1 smw2 ]
The last line indicates that PostgreSQL is not running on either SMW, which means it is not running on the active SMW. Further investigation is required to find the error; inspect the logs with journalctl. The resolution depends on the specific problem found.
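As a starting point for that investigation, journal text can be scanned for PostgreSQL-related errors. The sample line below is an illustrative placeholder, not real SMW output; on a live SMW the input would come from journalctl --no-pager.

```shell
# Sketch: count PostgreSQL-related error lines in captured journal text.
# The sample line is hypothetical; on a live SMW use:  journal=$(journalctl --no-pager)
journal='Dec 08 11:57:41 smw1 PostgreSQL[1234]: FATAL: could not start server'
echo "$journal" | grep -icE 'postgres.*(error|fatal)'
```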

Resolve PostgreSQL Failure on the Active SMW

If the cause of the PostgreSQL failure cannot be determined, the active SMW must be reinitialized manually. For this procedure, see Reinitialize the PMDB.

For further assistance with PostgreSQL, contact a Cray representative.