Cloudera Enterprise 6.3.x | Other versions

Copying Cluster Data Using DistCp

  Note: This page contains references to CDH 5 components or features that have been removed from CDH 6. These references are only applicable if you are managing a CDH 5 cluster with Cloudera Manager 6. For more information, see Deprecated Items.

The distributed copy command, distcp, is a general utility for copying large data sets between distributed filesystems within and across clusters. You can also use distcp to copy data to and from an Amazon S3 bucket. The distcp command submits a regular MapReduce job that performs a file-by-file copy.

To see the distcp command options, run the built-in help:
hadoop distcp
  Important:
  • Do not run distcp as the hdfs user which is blacklisted for MapReduce jobs by default.
  • Do not use Hadoop shell commands (such as cp, copyfromlocal, put, get) for large copying jobs or you may experience I/O bottlenecks.

Continue reading:

DistCp Syntax and Examples

You can use distcp to copy files between compatible clusters in either direction. The most basic form of the distcp command only requires that information:

hadoop distcp <source> <destination>

The following section provides the basic syntax for different distcp scenarios. For more information, see the relevant section on this page.

Copying Between the Same CDH Version

Use the following syntax:

hadoop distcp hdfs://<namenode>:<port> hdfs://<namenode>

For example, the following command copies data from example-source to example-dest:

hadoop distcp hdfs://example-source.cloudera.com:50070 hdfs://example-dest.cloudera.com

Port 50070 is the default NameNode port for HDFS.

Different but Compatible CDH Major Versions

Run the distcp command on the cluster that runs the higher version of CDH, which should be the destination cluster. Use the following syntax:

hadoop distcp webhdfs://<namenode>:<port> hdfs://<namenode>

Note the webhdfs prefix for the remote cluster, which should be your source cluster. You must use webhdfs when the clusters run different major versions. When clusters run the same version, you can use the hdfs protocol for better performance.

For example, the following command copies data from a lower CDH source cluster named example-source to a higher CDH version destination cluster namedexample-dest:

hadoop distcp webhdfs://example-source.cloudera.com:50070 hdfs://example-dest.cloudera.com

Copying a Specific Path

You can also use a specific path, such as /hbase to move HBase data:

hadoop distcp hdfs://example-source.cloudera.com:50070/hbase hdfs://example-dest.cloudera.com/hbase

Copying to/from Amazon S3

The following syntax for distcp shows how to copy data to/from S3:

#Copying from S3
hadoop distcp s3a://<bucket>/<data> hdfs://<namenode>/<directory>/
#Copying to S3
hadoop distcp hdfs://<namenode>/<directory> s3a://<bucket>/<data>

This is a basic example of using distcp with S3. For more information, see Using DistCp with Amazon S3.

Copying to/from ADLS Gen1 and Gen2

The following syntax for distcp shows how to copy data to/from ADLS Gen1:
#Copying from ADLS Gen1
hadoop distcp adl://store.azuredatalakestore.net/src hdfs://hdfs_destination_path
#Copying to ADLS Gen1
hadoop distcp hdfs://hdfs_destination_path adl://store.azuredatalakestore.net/src 
To use distcp with ADLS Gen2, use the Gen2 URI instead of the Gen1 URI, for example:
#Copying from ADLS Gen2
hadoop distcp abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file_name> hdfs://hdfs_destination_path
#Copying to ADLS Gen2
hadoop distcp hdfs://hdfs_destination_path abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file_name> 

Using DistCp with Highly Available Remote Clusters

You can use distcp to copy files between highly available clusters by configuring access to the remote cluster with the nameservice ID. To enable support, perform the following steps:
  1. Create a new directory and copy the contents of the /etc/hadoop/conf directory on the local cluster to this directory. The local cluster is the cluster where you plan to run the distcp command.

    Specify this directory for the --config parameter when you run the distcp command in step 5.

    The following steps use distcpConf as the directory name. Substitute the name of the directory you created for distcpConf.

  2. In the hdfs-site.xml file in the distcpConf directory, add the nameservice ID for the remote cluster to the dfs.nameservices property.
      Note:

    If the remote cluster has the same nameservice ID as the local cluster, change the remote cluster’s nameservice ID. Nameservice names must be unique.

    For example, if the nameservice name for both clusters is nameservice1, change the nameservice ID of the remote cluster to a different ID, such as externalnameservice:
    <property>
     <name>dfs.nameservices</name>
     <value>nameservice1,externalnameservice</value>
     </property>
    
  3. On the remote cluster, find the hdfs-site.xml file and copy the properties that refers to the nameservice ID to the end of the hdfs-site.xml file in the distcpConf directory you created in step 1:
    • dfs.ha.namenodes.<nameserviceID>
    • dfs.client.failover.proxy.provider.<remote nameserviceID>
    • dfs.ha.automatic-failover.enabled.<remote nameserviceID>
    • dfs.namenode.rpc-address.<nameserviceID>.<namenode1>
    • dfs.namenode.servicerpc-address.<nameserviceID>.<namenode1>
    • dfs.namenode.http-address.<nameserviceID>.<namenode1>
    • dfs.namenode.https-address.<nameserviceID>.<namenode1>
    • dfs.namenode.rpc-address.<nameserviceID>.<namenode2>
    • dfs.namenode.servicerpc-address.<nameserviceID>.<namenode2>
    • dfs.namenode.http-address.<nameserviceID>.<namenode2>
    • dfs.namenode.https-address.<nameserviceID>.<namenode2>
    By default, you can find the hdfs-site.xml file in the /etc/hadoop/conf directory on a node of the remote cluster.
  4. If you changed the nameservice ID for the remote cluster in step 2, update the nameservice ID used in the properties you copied in step 3 with the new nameservice ID, accordingly.

    The following example shows the properties copied from the remote cluster with the following values:

    • A remote nameservice called externalnameservice
    • NameNodes called namenode1 and namenode2
    • A host named remotecluster.com
     <property>
    <name>dfs.ha.namenodes.externalnameservice</name>
    <value>namenode1,namenode2</value>
    </property>
    
    <property>
    <name>dfs.client.failover.proxy.provider.externalnameservice</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>
    
    <property>
    <name>dfs.ha.automatic-failover.enabled.externalnameservice</name>
    <value>true</value>
    </property>
    
    <property>
    <name>dfs.namenode.rpc-address.externalnameservice.namenode1</name>
    <value>remotecluster.com:8020</value>
    </property>
    
    <property>
    <name>dfs.namenode.servicerpc-address.externalnameservice.namenode1</name>
    <value>remotecluster.com:8022</value>
    </property>
    
    <property>
    <name>dfs.namenode.http-address.externalnameservice.namenode1</name>
    <value>remotecluster.com:20101</value>
    </property>
    
    <property>
    <name>dfs.namenode.https-address.externalnameservice.namenode1</name>
    <value>remotecluster.com:20102</value>
    </property>
    
    <property>
    <name>dfs.namenode.rpc-address.externalnameservice.namenode2</name>
    <value>remotecluster.com:8020</value>
    </property>
    
    <property>
    <name>dfs.namenode.servicerpc-address.externalnameservice.namenode2</name>
    <value>remotecluster.com:8022</value>
    </property>
    
    <property>
    <name>dfs.namenode.http-address.externalnameservice.namenode2</name>
    <value>remotecluster.com:20101</value>
    </property>
    
    <property>
    <name>dfs.namenode.https-address.externalnameservice.namenode2</name>
    <value>remotecluster.com:20102</value>
    </property>
    

    At this point, the hdfs-site.xml file in the distcpConf directory should have both clusters and 4 NameNode IDs.

  5. Depending on the use case, the options specified when you run the distcp may differ. Here are some examples:
      Note: The remote cluster can be either the source or the target. The examples provided specify the remote cluster as the source.
    To copy data from an insecure cluster , run the following command:
    hadoop --config distcpConf distcp hdfs://<nameservice>/<source_directory> <target directory>
    To copy data from a secure cluster, run the following command:
    hadoop --config distcpConf distcp -Dmapreduce.job.hdfs-servers.token-renewal.exclude=<nameservice>  hdfs://<nameservice>/<source_directory> <target directory>
    For example:
    hadoop --config distcpConf distcp -Dmapreduce.job.hdfs-servers.token-renewal.exclude=ns1 hdfs://ns1/xyz /tmp/test

    If the distcp source or target are in encryption zones, include the following distcp options: -skipcrccheck -update. The distcp command may fail if you do not include these options when the source or target are in encryption zones because the CRC for the files may differ.

    For CDH 5.12.0 and later, distcp between clusters that both use HDFS Transparent Encryption, you must include the exclude parameter.

Using DistCp with Amazon S3

You can copy HDFS files to and from an Amazon S3 instance. You must provision an S3 bucket using Amazon Web Services and obtain the access key and secret key. You can pass these credentials on the distcp command line, or you can reference a credential store to "hide" sensitive credentials so that they do not appear in the console output, configuration files, or log files.

Amazon S3 block and native filesystems are supported with the s3a:// protocol.

Example of an Amazon S3 Block Filesystem URI: s3a://bucket_name/path/to/file

S3 credentials can be provided in a configuration file (for example, core-site.xml):
<property>
    <name>fs.s3a.access.key</name>
    <value>...</value>
</property>
<property>
    <name>fs.s3a.secret.key</name>
    <value>...</value>
</property>
You can also enter the configurations in the Advanced Configuration Snippet for core-site.xml, which allows Cloudera Manager to manage this configuration. See Custom Configuration.
You can also provide the credentials on the command line:
hadoop distcp -Dfs.s3a.access.key=... -Dfs.s3a.secret.key=... s3a://
For example:
hadoop distcp -Dfs.s3a.access.key=myAccessKey -Dfs.s3a.secret.key=mySecretKey /user/hdfs/mydata s3a://myBucket/mydata_backup
  Important: Entering secrets on the command line is inherently insecure. These secrets may be accessed in log files and other artifacts. Cloudera recommends that you use a credential provider to store secrets. See Using a Credential Provider to Secure S3 Credentials.
  Note: Using the -diff option with the distcp command requires a DistributedFileSystem on both the source and destination and is not supported when using distcp to copy data to or from Amazon S3.

Using a Credential Provider to Secure S3 Credentials

You can run the distcp command without having to enter the access key and secret key on the command line. This prevents these credentials from being exposed in console output, log files, configuration files, and other artifacts. Running the command in this way requires that you provision a credential store to securely store the access key and secret key. The credential store file is saved in HDFS.
  Note: Using a Credential Provider does not work with MapReduce v1 (MRV1).
To provision credentials in a credential store:
  1. Provision the credentials by running the following commands:
    hadoop credential create fs.s3a.access.key -value access_key -provider jceks://hdfs/path_to_credential_store_file
    hadoop credential create fs.s3a.secret.key -value secret_key -provider jceks://hdfs/path_to_credential_store_file
    For example:
    hadoop credential create fs.s3a.access.key -value foobar -provider jceks://hdfs/user/alice/home/keystores/aws.jceks
    hadoop credential create fs.s3a.secret.key -value barfoo -provider jceks://hdfs/user/alice/home/keystores/aws.jceks

    You can omit the -value option and its value and the command will prompt the user to enter the value.

    For more details on the hadoop credential command, see Credential Management (Apache Software Foundation).

  2. Copy the contents of the /etc/hadoop/conf directory to a working directory.
  3. Add the following to the core-site.xml file in the working directory:
    <property>
    <name>hadoop.security.credential.provider.path</name>
    <value>jceks://hdfs/path_to_credential_store_file</value>
    </property>
  4. Set the HADOOP_CONF_DIR environment variable to the location of the working directory:
    export HADOOP_CONF_DIR=path_to_working_directory

After completing these steps, you can run the distcp command using the following syntax:

hadoop distcp source_path s3a://destination_path
You can also reference the credential store on the command line, without having to enter it in a copy of the core-site.xml file. You also do not need to set a value for HADOOP_CONF_DIR. Use the following syntax:
hadoop distcp source_path s3a://bucket_name/destination_path
-Dhadoop.security.credential.provider.path=jceks://hdfspath_to_credential_store_file

There are additional options for the distcp command. See DistCp Guide (Apache Software Foundation).

Examples of DistCP Commands Using the S3 Protocol and Hidden Credentials

Copying files to Amazon S3
hadoop distcp /user/hdfs/mydata s3a://myBucket/mydata_backup
Copying files from Amazon S3
hadoop distcp s3a://myBucket/mydata_backup //user/hdfs/mydata
Copying files to Amazon S3 using the -filters option to exclude specified source files
You specify a file name with the -filters option. The referenced file contains regular expressions, one per line, that define file name patterns to exclude from the distcp job. The pattern specified in the regular expression should match the fully-qualified path of the intended files, including the scheme (hdfs, webhdfs, s3a, etc.). For example, the following are valid expressions for excluding files:
hdfs://x.y.z:8020/a/b/c
webhdfs://x.y.z:50070/a/b/c
s3a://bucket/a/b/c
Reference the file containing the filter expressions using -filters option. For example:
hadoop distcp -filters /user/joe/myFilters /user/hdfs/mydata s3a://myBucket/mydata_backup
Contents of the sample myFilters file:
.*foo.*
.*/bar/.*
hdfs://x.y.z:8020/tmp/.*
hdfs://x.y.z:8020/tmp1/file1
The regular expressions in the myFilters exclude the following files:
  • .*foo.* – excludes paths that contain the string "foo".
  • .*/bar/.* – excludes paths that include a directory named bar.
  • hdfs://x.y.z:8020/tmp/.* – excludes all files in the /tmp directory.
  • hdfs://x.y.z:8020/tmp1/file1 – excludes the file /tmp1/file1.
Copying files to Amazon S3 with the -overwrite option.
The -overwrite option overwrites destination files that already exist.
hadoop distcp -overwrite /user/hdfs/mydata  s3a://user/mydata_backup

For more information about the -filters, -overwrite, and other options, see DIstCp Guide: Command Line Options (Apache Software Foundation).

Using DistCp with Microsoft Azure (ADLS)

You can use the distcp command to copy data from/to ADLS Gen1 and Gen2 Preview to/from your cluster :
  Note: The following examples use the ADLS Gen1 URI. To use ADLS Gen2 Preview, replace the Gen1 URI with the Gen2 URI:
abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file_name>
  1. Configure connectivity to ADLS using one of the methods described in Configuring ADLS Gen1 Connectivity or Configuring ADLS Gen2 Connectivity.
  2. If you are copying data to or from Amazon S3, also configure connectivity to S3 as described above. See Using DistCp with Amazon S3
  3. Use the following syntax to define the Hadoop Credstore:
    export HADOOP_CONF_DIR=path_to_working_directory
    export HADOOP_CREDSTORE_PASSWORD=hadoop_credstore_password
  4. Run distcp jobs using the following syntax:
    ADLS to local cluster:
    hadoop distcp adl://store.azuredatalakestore.net/src hdfs://hdfs_destination_path
    Local cluster to ADLS:
    hadoop distcp hdfs://hdfs_destination_path adl://store.azuredatalakestore.net/src 

    You can also use distcp to copy data between Amazon S3 and Microsoft ADLS.

    S3 to ADLS:
    hadoop distcp s3a://user/my_data adl://Account_Name.azuredatalakestore.net/my_data_backup/
    ADLS to S3:
    hadoop distcp s3a://user/my_data adl://Account_Name.azuredatalakestore.net/my_data_backup/

Note that when copying data between these remote filesystems, the data is first copied form the source filesystem to the local cluster before being copied to the destination filesystem.

Using DistCp with Microsoft Azure (WASB)

You can use the distcp command to copy data from Azure WASB to your cluster :
  1. Configure connectivity to Azure by setting the following property in core-site.xml.
    <property>
      <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
      <value>your_access_key</value>
    </property>

    Note that in practice, you should never store your Azure access key in cleartext. Protect your Azure credentials using one of the methods described at Configuring Azure Blob Storage Credentials.

  2. Run your distcp jobs using the following syntax:
    hadoop distcp wasb://<sample_container>@<sample_account>.blob.core.windows.net/ /hdfs_destination_path
Reference

Kerberos Setup Guidelines for Distcp between Secure Clusters

The guidelines mentioned in this section are only applicable for the following sample deployment:
  • Let's assume you have two clusters with the realms: SOURCE and DESTINATION
  • You have data that needs to be copied from SOURCE to DESTINATION
  • Trust exists between SOURCE and Active Directory, and DESTINATION and Active Directory.
If your environment matches the one described above, use the following table to configure Kerberos delegation tokens on your cluster so that you can successfully distcp across two secure clusters. Based on the direction of the trust between the SOURCE and DESTINATION clusters, you can use the mapreduce.job.hdfs-servers.token-renewal.exclude property to instruct ResourceManagers on either cluster to skip or perform delegation token renewal for NameNode hosts.
  Note: For CDH 5.12.0 and later, you must use the mapreduce.job.hdfs-servers.token-renewal.exclude parameter if both clusters use the HDFS Transparent Encryption feature.
Environment Type Kerberos Delegation Token Setting
SOURCE trusts DESTINATION Distcp job runs on the DESTINATION cluster You do not need to set the mapreduce.job.hdfs-servers.token-renewal.exclude property.
Distcp job runs on the SOURCE cluster Set the mapreduce.job.hdfs-servers.token-renewal.exclude property to a comma-separated list of the hostnames of the NameNodes of the DESTINATION cluster.
DESTINATION trusts SOURCE Distcp job runs on the DESTINATION cluster Set the mapreduce.job.hdfs-servers.token-renewal.exclude property to a comma-separated list of the hostnames of the NameNodes of the SOURCE cluster.
Distcp job runs on the SOURCE cluster You do not need to set the mapreduce.job.hdfs-servers.token-renewal.exclude property.
Both SOURCE and DESTINATION trust each other You do not need to set the mapreduce.job.hdfs-servers.token-renewal.exclude property.
Neither SOURCE nor DESTINATION trusts the other If a common realm is usable (such as Active Directory), set the mapreduce.job.hdfs-servers.token-renewal.exclude property to a comma-separated list of hostnames of the NameNodes of the cluster that is not running the distcp job. For example, if you are running the job on the DESTINATION cluster:
  1. kinit on any DESTINATION YARN Gateway host using an AD account that can be used on both SOURCE and DESTINATION.
  2. Run the distcp job as the hadoop user:
    $ hadoop distcp -Ddfs.namenode.kerberos.principal.pattern=*  \
    -Dmapreduce.job.hdfs-servers.token-renewal.exclude=SOURCE-nn-host1,SOURCE-nn-host2   \
    hdfs://source-nn-nameservice/source/path    \
    /destination/path

    By default, the YARN ResourceManager renews tokens for applications. The mapreduce.job.hdfs-servers.token-renewal.exclude property instructs ResourceManagers on either cluster to skip delegation token renewal for NameNode hosts.

Distcp between Secure Clusters in Distinct Kerberos Realms

This section explains how to copy data between two secure clusters in distinct Kerberos realms.

  Note: Both clusters must run a supported JDK version when copying data between Kerberized clusters that are in different realms. For information about supported JDK versions, see Cloudera Enterprise 6 Requirements and Supported Versions.
  1. Specify the Destination Parameters in krb5.conf
  2. Configure HDFS RPC Protection and Acceptable Kerberos Principal Patterns
  3. (If TLS/SSL is enabled) Specify Truststore Properties
  4. Set HADOOP_CONF to the Destination Cluster
  5. Launch Distcp

Specify the Destination Parameters in krb5.conf

Edit the krb5.conf file on the client (where the distcp job will be submitted) to include the destination hostname and realm.
[realms]
HADOOP.QA.domain.COM = { kdc = kdc.domain.com:88 admin_server = admin.test.com:749
default_domain = domain.com supported_enctypes = arcfour-hmac:normal des-cbc-crc:normal
des-cbc-md5:normal des:normal des:v4 des:norealm des:onlyrealm des:afs3 } 

[domain_realm]
.domain.com = HADOOP.test.domain.COM
domain.com = HADOOP.test.domain.COM
test03.domain.com = HADOOP.QA.domain.COM

Configure HDFS RPC Protection and Acceptable Kerberos Principal Patterns

Set the hadoop.rpc.protection property to authentication in both clusters. You can modify this property either in hdfs-site.xml, or using Cloudera Manager as follows:
  1. Open the Cloudera Manager Admin Console.
  2. Go to the HDFS service.
  3. Click the Configuration tab.
  4. Select Scope > HDFS-1 (Service-Wide).
  5. Select Category > Security.
  6. Locate the Hadoop RPC Protection property and select authentication.
  7. Enter a Reason for change, and then click Save Changes to commit the changes.

The following steps are not required if the two realms are already set up to trust each other, or have the same principal pattern. However, this isn't usually the case.

Set the dfs.namenode.kerberos.principal.pattern property to * to allow distcp irrespective of the principal patterns of the source and destination clusters. You can modify this property either in hdfs-site.xml on both clusters, or using Cloudera Manager as follows:
  1. Open the Cloudera Manager Admin Console.
  2. Go to the HDFS service.
  3. Click the Configuration tab.
  4. Select Scope > Gateway.
  5. Select Category > Advanced.
  6. Edit the HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml property to add:
    <property>
      <name>dfs.namenode.kerberos.principal.pattern</name>
      <value>*</value>
    </property>
  7. Enter a Reason for change, and then click Save Changes to commit the changes.

(If TLS/SSL is enabled) Specify Truststore Properties

The following properties must be configured in the ssl-client.xml file on the client submitting the distcp job to establish trust between the target and destination clusters.
<property>
<name>ssl.client.truststore.location</name>
<value>path_to_truststore</value>
</property>

<property>
<name>ssl.client.truststore.password</name>
<value>XXXXXX</value>
</property>

<property>
<name>ssl.client.truststore.type</name>
<value>jks</value>
</property>

Set HADOOP_CONF to the Destination Cluster

Set the HADOOP_CONF path to be the destination environment. If you are not using HFTP, set the HADOOP_CONF path to the source environment instead.

Launch Distcp

Kinit on the client and launch the distcp job.
hadoop distcp hdfs://test01.domain.com:8020/user/alice hdfs://test02.domain.com:8020/user/alice
If launching distcp fails, force Kerberos to use TCP instead of UDP by adding the following parameter to the krb5.conf file on the client.
[libdefaults]
udp_preference_limit = 1

Enabling Fallback Configuration

To enable the fallback configuration, for copying between secure and insecure clusters, add the following to the HDFS configuration file, core-default.xml, by using an advanced configuration snippet if you use Cloudera Manager, or editing the file directly otherwise.
<property>
  <name>ipc.client.fallback-to-simple-auth-allowed</name>
  <value>true</value>
</property>

Protocol Support for Distcp

To use distcp to copy data between CDH 6 clusters, use the hdfs protocol. When you are copying data between two different but compatible versions, use the webhdfs protocol for the remote cluster.

Page generated August 29, 2019.