Oracle Zero Downtime Migration troubleshooting.

The Oracle Zero Downtime Migration (ZDM) tool was created to help with Oracle database migrations. It saves time and automates many tasks, and the savings add up when you need to move multiple databases. Behind the scenes it uses the well-known Oracle Data Guard. As a result, you get solid technology as the foundation, but you are also limited to what Data Guard can and cannot do. All details are available in the Oracle ZDM documentation. The tool works fine when all prerequisites are met, but when you hit an issue you need to dig in and troubleshoot. Here I will share some of my experience with ZDM troubleshooting. Please note that the information in this post is current for ZDM version 19.2, and the behaviour may be different in future versions.

We set up the tool, verified all prerequisites, and ran a migration job in evaluation mode using the “-eval” parameter, but the job failed.

To monitor a job, you can use a command like “zdmcli query job -jobid 5”. The output provides basic information about the job and the result of each phase. On success it looks like this:

[zdmuser@zdmserver ~]$ /opt/oracle/app/zdmhome/bin/zdmcli query job -jobid 8
zdmserver: Audit ID: 516
Job ID: 8
User: zdmuser
Client: zdmuser
Scheduled job command: "zdmcli migrate database -sourcesid SOURCEDB -sourcenode source.localdomain -srcauth zdmauth -srcarg1 user:zdmuser -srcarg2 identity_file:/home/zdmuser/.ssh/id_rsa -srcarg3 sudo_location:/usr/bin/sudo -targetnode target.localdomain -tgtauth zdmauth -tgtarg1 user:opc -tgtarg2 identity_file:/home/zdmuser/.ssh/id_rsa -tgtarg3 sudo_location:/usr/bin/sudo -targethome /u02/app/oracle/product/12.2.0/dbhome1 -rsp /home/zdmuser/zdm_template_sourcedbstg.rsp -eval"
Scheduled job execution start time: 2020-04-20T14:26:25-03. Equivalent local time: 2020-04-20 14:26:25
Current status: SUCCEEDED
Result file path: "/opt/oracle/app/zdmbase/chkbase/scheduled/job-8-2020-04-20-14:26:38.log"
Job execution start time: 2020-04-20 14:26:38
Job execution end time: 2020-04-20 14:32:29
Job execution elapsed time: 5 minutes 50 seconds
ZDM_GET_SRC_INFO ………. COMPLETED
ZDM_GET_TGT_INFO ………. COMPLETED
ZDM_SETUP_SRC …………. COMPLETED
ZDM_SETUP_TGT …………. COMPLETED
ZDM_GEN_RMAN_PASSWD ……. COMPLETED
ZDM_PREUSERACTIONS …….. COMPLETED
ZDM_PREUSERACTIONS_TGT …. COMPLETED
ZDM_VALIDATE_SRC ………. COMPLETED
ZDM_VALIDATE_TGT ………. COMPLETED
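
When several jobs are running, reading this output by eye gets tedious. Below is a minimal sketch in Python that pulls the overall status and the per-phase results out of captured “zdmcli query job” output. It assumes the layout shown above (a “Current status:” line followed by phase lines); the function names are my own and are not part of ZDM.

```python
import re

# Matches phase lines such as "ZDM_SETUP_SRC .......... COMPLETED".
# The filler between the phase name and its state may be dots,
# ellipsis characters, or spaces, depending on how the output was captured.
PHASE_RE = re.compile(r"^(ZDM_[A-Z_]+)[\s.\u2026]+([A-Z]+)\s*$")
STATUS_RE = re.compile(r"^Current status:\s*(\S+)")


def parse_job_output(text):
    """Return (status, phases) parsed from `zdmcli query job` output.

    `phases` is a list of (phase_name, state) tuples in output order.
    """
    status = None
    phases = []
    for line in text.splitlines():
        line = line.strip()
        m = STATUS_RE.match(line)
        if m:
            status = m.group(1)
            continue
        m = PHASE_RE.match(line)
        if m:
            phases.append((m.group(1), m.group(2)))
    return status, phases


def first_unfinished(phases):
    """Return the first (phase, state) that is not COMPLETED, or None."""
    for name, state in phases:
        if state != "COMPLETED":
            return name, state
    return None
```

Feed it the text captured from the query command; first_unfinished then points at the phase where the investigation should start.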

Unfortunately, in our case the job failed, and all we could see was something like this:

zdmserver: Processing response file …
zdmserver: Starting zero downtime migrate operation …
zdmserver: Executing phase ZDM_GET_SRC_INFO
zdmserver: retrieving information about database "DSST" …
zdmserver: Executing phase ZDM_GET_TGT_INFO
zdmserver: Retrieving information from target node "target.localhost" …
zdmserver: Executing phase ZDM_SETUP_SRC
zdmserver: Setting up ZDM on the source node source.localhost …
zdmserver: Executing phase ZDM_SETUP_TGT
zdmserver: Setting up ZDM on the target node target.localhost …
zdmserver: Executing phase ZDM_GEN_RMAN_PASSWD
zdmserver: Executing phase ZDM_PREUSERACTIONS
zdmserver: Executing phase ZDM_PREUSERACTIONS_TGT
zdmserver: Executing phase ZDM_VALIDATE_SRC
zdmserver: Validating standby on the source node source.localhost …
zdmserver: Executing phase ZDM_VALIDATE_TGT
zdmserver: Validating standby on the target node target.localhost …

That was clearly not enough to troubleshoot the problem; we needed more logs. Luckily, we had a full set of logs on the source and target in the /tmp/zdm-*/zdm/log/ directory.

[zdmuser@plxde746 ~]$ view /tmp/zdm-237637609/zdm/log/mZDM_oss_standby_validate_src_3119.log
 
 
19:25:17.000: Command received is : mZDM_oss_standby_validate_src -sdbsid SOURCEDB -sdbdomain localdomain -sdbhome /opt/oracle/release/12.2.0.1 -dbid 111111111 -scn 331201554581 -tdbname sourcedb -tdbhome /u02/app/oracle/product/12.2.0/dbhome1 -sdbScanName source.localdomain -tdbScanName test-scan.localdomain -tdbScanPort 1521 -tdatadg +DATAC1 -tredodg +DATAC1 -trecodg +RECOC1 -bkpPath /migration/staging
19:25:17.000: ### Printing the configuration values from files:
19:25:17.000: /tmp/zdm-237637609/zdm/mZDMconfig_params
19:25:17.000: DATA_DG=+DATAC1

That log provides a fully detailed record of the execution, with all parameters, commands, and values. It helped us nail down the problem and resolve it.
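
Since the numeric part of the /tmp/zdm-* directory name changes, it can help to list the log files instead of guessing the path. Here is a small sketch in Python that does that, newest first. It assumes the /tmp/zdm-*/zdm/log/ layout seen in the example above; the function name and its parameters are hypothetical helpers of mine, not anything shipped with ZDM.

```python
import glob
import os


def find_zdm_logs(tmp_root="/tmp", name_filter=""):
    """List ZDM log files under <tmp_root>/zdm-*/zdm/log/, newest first.

    `name_filter` is an optional substring of the file name,
    for example "validate_src".
    """
    pattern = os.path.join(tmp_root, "zdm-*", "zdm", "log", "*.log")
    paths = [
        p for p in glob.glob(pattern)
        if name_filter in os.path.basename(p)
    ]
    # Sort by modification time so the most recent run comes first.
    return sorted(paths, key=os.path.getmtime, reverse=True)
```

Run on the source or target host, find_zdm_logs(name_filter="validate_src") would return the validation logs from the latest attempts at the top of the list.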

But that doesn’t work in all cases. Sometimes, when for some reason the tool cannot even create the /tmp/zdm* directory on the source or target, you don’t have any logs at all.

For example, if the ZDM user connecting to the source doesn’t have the privilege to execute some commands as root, the job fails on the very first steps. In such a case you have no other option than to run the migration job in real migration mode rather than in evaluation mode (-eval). When you do that, I recommend adding the -pauseafter parameter and specifying the phase after which you want the job to stop. In my case I used “-pauseafter ZDM_SETUP_SRC”. We ran the job and the execution failed on the very first step.

[zdmuser@zdmserver ~]$ /opt/oracle/app/zdmhome/bin/zdmcli query job -jobid 6
zdmserver: Audit ID: 514
Job ID: 6
User: zdmuser
Client: vlxpr1008
Scheduled job command: "zdmcli migrate database -sourcesid SOURCEDB -sourcenode source.localdomain -srcauth zdmauth -srcarg1 user:zdmuser -srcarg2 identity_file:/home/zdmuser/.ssh/id_rsa -srcarg3 sudo_location:/usr/bin/sudo -targetnode target.localdomain -tgtauth zdmauth -tgtarg1 user:opc -tgtarg2 identity_file:/home/zdmuser/.ssh/id_rsa -tgtarg3 sudo_location:/usr/bin/sudo -targethome /u02/app/oracle/product/12.2.0./dbhome1 -rsp /home/zdmuser/zdm_template_sourcedbstg.rsp -pauseafter ZDM_SETUP_SRC"
Scheduled job execution start time: 2020-03-27T09:20:30-03. Equivalent local time: 2020-03-27 09:20:30
Current status: FAILED
Result file path: "/opt/oracle/app/zdmbase/chkbase/scheduled/job-6-2020-03-27-09:20:31.log"
Job execution start time: 2020-03-27 09:20:31
Job execution end time: 2020-03-27 09:20:48
Job execution elapsed time: 16 seconds
ZDM_GET_SRC_INFO ………….. FAILED
ZDM_GET_TGT_INFO ………….. PENDING
ZDM_SETUP_SRC …………….. PENDING
ZDM_SETUP_TGT …………….. PENDING

I checked the log and found the following.

[zdmuser@zdmserver ~]$ cat /opt/oracle/app/zdmbase/chkbase/scheduled/job-6-2020-03-27-09\:20\:31.log
zdmserver: Processing response file …
zdmserver: Starting zero downtime migrate operation …
zdmserver: Executing phase ZDM_GET_SRC_INFO
zdmserver: retrieving information about database "DSST" …
PRCF-2056 : The copy operation failed on node: "source.localdomain". Details:
{1}
PRCZ-4002 : failed to execute command "/bin/cp" using the privileged execution plugin "zdmauth" on nodes "source.localdomain"
[zdmuser@zdmserver ~]$

At first glance it looked like we were unable to use “sudo cp”, but after several tests we discovered that we lacked the privilege to run “/bin/scp” on the source and could not copy the ZDM files from zdmserver. After fixing the problem you can either resume the job with “zdmcli resume job -jobid 6” or destroy it and run the “eval” job again. To destroy a job, run “zdmcli abort job -jobid 6”.
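
The PRCF and PRCZ lines are the useful part of such a result file. As a rough sketch, the Python snippet below collects them from the file contents. The regular expression only assumes the “PRxx-NNNN : message” shape seen in the log above, so other message families may need a wider pattern; the function name is my own.

```python
import re

# Message IDs such as PRCF-2056 or PRCZ-4002, followed by " : message"
# or ": message", as they appear in the result file above.
ERROR_RE = re.compile(r"^(PR[A-Z]{2}-\d{4})\s*:\s*(.*)$")


def extract_errors(log_text):
    """Return a list of (code, message) pairs found in a ZDM result file."""
    errors = []
    for line in log_text.splitlines():
        m = ERROR_RE.match(line.strip())
        if m:
            errors.append((m.group(1), m.group(2)))
    return errors
```

Keep in mind what happened in our case: the message named “/bin/cp”, while the real culprit was “/bin/scp”, so treat the extracted text as a starting point and not as the final diagnosis.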

In my experience working with ZDM in several different environments, most of the problems boiled down to network, database software, instance configuration, and permission issues. Let me dwell on the last category. The ZDM user on the source and target is supposed to have full superuser privileges. In Oracle Cloud it is the “opc” user, which can run any command through “sudo”. But if you move a database from on-prem, you might have difficulty getting such privileges. In our case we got help from the Oracle ZDM team and from the Oracle product manager for ZDM. We also did some troubleshooting and adjusting ourselves to add all the required commands to the /etc/sudoers list for the ZDM user on the source machines.
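
Before starting a job, you can check on the source host which commands the ZDM user is actually allowed to run through sudo. The sketch below is in Python and relies on “sudo -n -l <command>”, which exits with a non-zero status when the command is not permitted. The two commands in the usage note are only the ones mentioned in this post; I am not aware of a complete published list of what ZDM needs, so treat any list you pass in as an assumption. The function names are hypothetical.

```python
import subprocess


def sudo_allows(command, run=subprocess.run):
    """Return True if `sudo -n -l <command>` reports the command as permitted.

    -n keeps sudo from prompting for a password, so a False result can
    also mean that a password would be required.
    """
    result = run(
        ["sudo", "-n", "-l", command],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return result.returncode == 0


def missing_privileges(commands, run=subprocess.run):
    """Return the commands from `commands` that sudo would refuse to run."""
    return [c for c in commands if not sudo_allows(c, run=run)]
```

Running missing_privileges(["/bin/cp", "/bin/scp"]) as the ZDM user on the source would, in our situation, most likely have pointed at “/bin/scp” much earlier than the job log did.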

A couple of other problems were related to a discrepancy in software levels between source and target. The documented parameter “-ignore PATCH_CHECK” didn’t work for us, so we used “-ignore ALL” instead. I also found that for 12.1 and 11g the ZDM tool didn’t encrypt tablespaces on the cloud side during standby creation: it used the “restoreDatabase” subprogram instead of “restoreAndEncryptDatabase”, which was used for 12.2 and later versions.

In summary, despite a few bumps and problems with the tool, ZDM significantly reduced the effort and the number of errors during migrations, even in cases where it was used for only part of the migration process. I am looking forward to the new version and hope it provides more options for migrations. Shoot me an email or get me on Twitter if you need help with a migration or with getting ZDM working.
