PostgreSQL Does Not Start
Situations in which PostgreSQL may fail to start on an SMW HA system
There are several situations in which PostgreSQL may fail to start. Check the status of PostgreSQL with the crm status command.
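For a quick check, the full crm status listing can be filtered down to the PostgreSQL clone section. A minimal sketch, assuming the resource is named clo_PostgreSQL as in the example output in this section and that the listing is supplied on stdin:

```shell
# Sketch: show only the PostgreSQL clone section of a saved `crm status`
# listing. The resource name clo_PostgreSQL matches the example output in
# this section; adjust it if the local cluster uses a different name.
pg_clone_status() {
  # stdin: output of `crm status`
  grep -A 2 'Clone Set: clo_PostgreSQL'
}
```

On a live SMW this could be invoked as crm status | pg_clone_status.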
PostgreSQL Failure on Passive SMW Due to Synchronization
If the output of crm status is as follows:
2 nodes configured
33 resources configured
Online: [ smw1 smw2 ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started smw1
ClusterIP1 (ocf::heartbeat:IPaddr2): Started smw1
ClusterIP2 (ocf::heartbeat:IPaddr2): Started smw1
ClusterIP3 (ocf::heartbeat:IPaddr2): Started smw1
ClusterIP4 (ocf::heartbeat:IPaddr2): Started smw1
ClusterIP5 (ocf::heartbeat:IPaddr2): Started smw1
ClusterMonitor (ocf::smw:ClusterMonitor): Started smw1
ClusterTimeSync (ocf::smw:ClusterTimeSync): Started smw1
HSSDaemonMonitor (ocf::smw:HSSDaemonMonitor): Started smw1
Notification (ocf::heartbeat:MailTo): Started smw1
ResourceInit (ocf::smw:ResourceInit): Started smw1
cray-cfgset-cache (systemd:cray-cfgset-cache): Started smw1
dhcpd (systemd:dhcpd.service): Started smw1
fsync (ocf::smw:fsync): Started smw1
hss-daemons (lsb:rsms): Started smw1
stonith-1 (stonith:external/ipmi): Started smw2
stonith-2 (stonith:external/ipmi): Started smw1
Resource Group: HSSGroup
mysqld (ocf::heartbeat:mysql): Started smw1
Resource Group: IMPSGroup
cray-ids-service (systemd:cray-ids-service): Started smw1
cray-ansible (systemd:cray-ansible): Started smw1
IMPSFilesystemConfig (ocf::smw:FileSystemConfig): Started smw1
Resource Group: LogGroup
cray-syslog (systemd:llmrd.service): Started smw1
LogFilesystemConfig (ocf::smw:FileSystemConfig): Started smw1
Resource Group: SharedFilesystemGroup
homedir (ocf::heartbeat:Filesystem): Started smw1
md-fs (ocf::heartbeat:Filesystem): Started smw1
imps-fs (ocf::heartbeat:Filesystem): Started smw1
ml-fs (ocf::heartbeat:Filesystem): Started smw1
repos-fs (ocf::heartbeat:Filesystem): Started smw1
Resource Group: SystemGroup
NFSServer (systemd:nfsserver): Started smw1
EnableRsyslog (ocf::smw:EnableRsyslog): Started smw1
syslog.socket (systemd:syslog.socket): Started smw1
Clone Set: clo_PostgreSQL [PostgreSQL]
Started: [ smw1 ]
Stopped: [ smw2 ]

The last line indicates that PostgreSQL is not running on smw2. If smw2 is the passive SMW, this may mean that it is still synchronizing with the active SMW. Once synchronization is complete, PostgreSQL will start.

To confirm that synchronization is the issue, run journalctl -u pmdb_util. Lines like the following should appear near the bottom of the output:
Nov 30 14:43:27 smw1 pmdb_util[40749]: [init_standby()]: INFO: Initializing HA standby system...
Nov 30 14:43:40 smw1 pmdb_util[40749]: [init_standby()]: INFO: Old data directory removed.
Nov 30 14:43:40 smw1 pmdb_util[40749]: [init_standby()]: INFO: Synchronizing this standby with master. This might take a while!

These messages indicate that synchronization is not yet complete; allow more time before verifying the status of PostgreSQL again.
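The check for the synchronization message can also be scripted. A minimal sketch, assuming the message text quoted above and the journal output supplied on stdin:

```shell
# Sketch: return success if captured `journalctl -u pmdb_util` output shows
# that the standby is still synchronizing (message text as quoted above).
sync_in_progress() {
  # stdin: output of `journalctl -u pmdb_util`
  grep -q 'Synchronizing this standby with master'
}
```

On a live SMW this could be invoked as journalctl -u pmdb_util | sync_in_progress && echo "still synchronizing".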
Reported Failures in journalctl -u pmdb_util
If the last line of the journalctl -u pmdb_util output is Initial replication NOT successful!, synchronization has failed. Review the entire output to find the underlying problem; the resolution depends on the error reported.
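Because the failure is reported as the last line of the journal, this check can also be scripted. A minimal sketch, assuming the message text quoted above and the journal output supplied on stdin:

```shell
# Sketch: return success if the final line of captured `journalctl -u
# pmdb_util` output reports that initial replication failed.
replication_failed() {
  # stdin: output of `journalctl -u pmdb_util`
  case "$(tail -n 1)" in
    *'Initial replication NOT successful!'*) return 0 ;;
    *) return 1 ;;
  esac
}
```

On a live SMW this could be invoked as journalctl -u pmdb_util | replication_failed && echo "replication failed".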
Reported Failures in crm status on the Passive SMW
Problems with PostgreSQL may also be found in the output of crm status. PostgreSQL may have timed out, as seen below:
PostgreSQL_start_0 on smw2 'unknown error' (1): call=615, status=Timed Out, exitreason='none',
last-rc-change='Fri Dec 8 11:57:41 2017', queued=0ms, exec=200002ms
It may also have failed outright, as in the following output:

PostgreSQL_start_0 on ethel 'not running' (7): call=7480, status=complete, exitreason='none',
last-rc-change='Wed Feb 21 12:29:12 2018', queued=0ms, exec=3141ms

Since smw2 is the passive SMW, the most likely reason for this failure is that it was not able to replicate the data from the master quickly enough. Re-initialize the passive SMW manually with pmdb_util ha --init_standby. Once this is complete, run clear_failcounts to remove the reported failure from crm status. The cluster is then in a healthy state.

Investigate PostgreSQL Failure on the Active SMW
If the output of crm status is as follows:
2 nodes configured
33 resources configured
Online: [ smw1 smw2 ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started smw1
ClusterIP1 (ocf::heartbeat:IPaddr2): Started smw1
ClusterIP2 (ocf::heartbeat:IPaddr2): Started smw1
ClusterIP3 (ocf::heartbeat:IPaddr2): Started smw1
ClusterIP4 (ocf::heartbeat:IPaddr2): Started smw1
ClusterIP5 (ocf::heartbeat:IPaddr2): Started smw1
ClusterMonitor (ocf::smw:ClusterMonitor): Started smw1
ClusterTimeSync (ocf::smw:ClusterTimeSync): Started smw1
HSSDaemonMonitor (ocf::smw:HSSDaemonMonitor): Started smw1
Notification (ocf::heartbeat:MailTo): Started smw1
ResourceInit (ocf::smw:ResourceInit): Started smw1
cray-cfgset-cache (systemd:cray-cfgset-cache): Started smw1
dhcpd (systemd:dhcpd.service): Started smw1
fsync (ocf::smw:fsync): Started smw1
hss-daemons (lsb:rsms): Started smw1
stonith-1 (stonith:external/ipmi): Started smw2
stonith-2 (stonith:external/ipmi): Started smw1
Resource Group: HSSGroup
mysqld (ocf::heartbeat:mysql): Started smw1
Resource Group: IMPSGroup
cray-ids-service (systemd:cray-ids-service): Started smw1
cray-ansible (systemd:cray-ansible): Started smw1
IMPSFilesystemConfig (ocf::smw:FileSystemConfig): Started smw1
Resource Group: LogGroup
cray-syslog (systemd:llmrd.service): Started smw1
LogFilesystemConfig (ocf::smw:FileSystemConfig): Started smw1
Resource Group: SharedFilesystemGroup
homedir (ocf::heartbeat:Filesystem): Started smw1
md-fs (ocf::heartbeat:Filesystem): Started smw1
imps-fs (ocf::heartbeat:Filesystem): Started smw1
ml-fs (ocf::heartbeat:Filesystem): Started smw1
repos-fs (ocf::heartbeat:Filesystem): Started smw1
Resource Group: SystemGroup
NFSServer (systemd:nfsserver): Started smw1
EnableRsyslog (ocf::smw:EnableRsyslog): Started smw1
syslog.socket (systemd:syslog.socket): Started smw1
Clone Set: clo_PostgreSQL [PostgreSQL]
Stopped: [ smw1 smw2 ]

The last line indicates that PostgreSQL is not running on either SMW, and therefore not on the active SMW. This requires further investigation to find the error: examine the logs with journalctl. The resolution depends on the problem found.

Resolve PostgreSQL Failure on the Active SMW
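One way to narrow the search through the logs is to scan captured journal output for error-level PostgreSQL messages. A minimal sketch; the patterns matched here are assumptions about what a failed start may log, not an exhaustive list:

```shell
# Sketch: pull likely PostgreSQL error lines from captured journalctl
# output. The patterns (fatal, error, 'not running') are assumptions,
# not an exhaustive list; widen them as needed.
pg_journal_errors() {
  # stdin: journalctl output
  grep -iE 'postgres.*(fatal|error|not running)'
}
```

On a live SMW this could be invoked as journalctl | pg_journal_errors to list candidate lines for closer inspection.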
If the problem with PostgreSQL can't be found, the active SMW will need to be reinitialized manually. For this procedure, see Reinitialize the PMDB.
For further assistance with PostgreSQL, contact a Cray representative.