
Running Your First Spark Application

The simplest way to run a Spark application is by using the Scala or Python shells.
  Important:

By default, CDH is configured to permit any user to access the Hive Metastore. However, if you have modified the hadoop.proxyuser.hive.groups configuration property (set in Cloudera Manager through the Hive Metastore Access Control and Proxy User Groups Override property), your Spark applications might throw exceptions at run time. To address this issue, add the groups containing the Spark users that need metastore access to this property in Cloudera Manager:

  1. In the Cloudera Manager Admin Console Home page, click the Hive service.
  2. On the Hive service page, click the Configuration tab.
  3. In the Search well, type hadoop.proxyuser.hive.groups to locate the Hive Metastore Access Control and Proxy User Groups Override property.
  4. Click the plus sign (+), enter the groups you want to have access to the metastore, and then click Save Changes. For the change to take effect, restart the Hive Metastore Server by clicking the restart icon at the top of the page. A quick way to confirm metastore access from a Spark shell is sketched below.
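    After the restart, you can confirm from a Spark shell that the metastore is reachable. The following is a minimal sketch, run inside spark-shell (started as shown in step 1 below), and it assumes only that the default database exists; if the proxy user groups are set correctly, the queries return results instead of throwing an access exception:
      scala> spark.sql("SHOW DATABASES").show()
      scala> spark.sql("SHOW TABLES IN default").show()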
  1. To start one of the shell applications, run one of the following commands:
    • Scala:
      $ SPARK_HOME/bin/spark-shell
      Spark context Web UI available at ...
      Spark context available as 'sc' (master = yarn, app id = ...).
      Spark session available as 'spark'.
      Welcome to
            ____              __
           / __/__  ___ _____/ /__
          _\ \/ _ \/ _ `/ __/  '_/
         /___/ .__/\_,_/_/ /_/\_\   version ...
            /_/
               
      Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_141)
      Type in expressions to have them evaluated.
      Type :help for more information.
      
      scala> 
    • Python:
        Note: Spark 2 requires Python 2.7 or higher. You might need to install a new version of Python on all hosts in the cluster, because some Linux distributions come with Python 2.6 by default. If the right level of Python is not picked up by default, set the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables to point to the correct Python executable before running the pyspark command.
      $ SPARK_HOME/bin/pyspark
      Python 2.7.5 (default, Jun 20 2019, 20:27:34) 
      [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
      Type "help", "copyright", "credits" or "license" for more information
      ...
      Welcome to
            ____              __
           / __/__  ___ _____/ /__
          _\ \/ _ \/ _ `/ __/  '_/
         /__ / .__/\_,_/_/ /_/\_\   version ...
            /_/
      
      Using Python version 2.7.5 (default, Jun 20 2019 20:27:34)
      SparkSession available as 'spark'.
      >>>

    In a CDH deployment, SPARK_HOME defaults to /usr/lib/spark in package installations and /opt/cloudera/parcels/CDH/lib/spark in parcel installations. In a Cloudera Manager deployment, the shells are also available from /usr/bin.

    For a complete list of shell options, run spark-shell or pyspark with the -h flag.
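    Both shells pre-create a SparkContext (sc) and a SparkSession (spark), so you can run a quick sanity check before moving on. This minimal sketch, shown in Scala, uses only those pre-created objects; the small parallelize job should return 15.0 if the shell is connected to the cluster:
      scala> sc.master
      scala> spark.version
      scala> sc.parallelize(1 to 5).sum()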

  2. To run the classic Hadoop word count application, copy an input file to HDFS. The following command copies a local file named input into your HDFS home directory; a way to verify the upload from within the shell is sketched after it:
    hdfs dfs -put input
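    To confirm from within the shell that the file landed in HDFS, the following minimal sketch uses the Hadoop FileSystem API; it assumes the file was uploaded to your HDFS home directory under the name input:
      scala> import org.apache.hadoop.fs.{FileSystem, Path}
      scala> val fs = FileSystem.get(sc.hadoopConfiguration)
      scala> fs.exists(new Path("input"))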
  3. Within a shell, run the word count application using the following code examples, substituting your own values for namenode_host, path/to/input, and path/to/output. A quick way to spot-check the output from the shell is sketched after the examples:
    • Scala
      scala> val myfile = sc.textFile("hdfs://namenode_host:8020/path/to/input")
      scala> val counts = myfile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
      scala> counts.saveAsTextFile("hdfs://namenode_host:8020/path/to/output")
    • Python
      >>> myfile = sc.textFile("hdfs://namenode_host:8020/path/to/input")
      >>> counts = myfile.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda v1,v2: v1 + v2)
      >>> counts.saveAsTextFile("hdfs://namenode_host:8020/path/to/output")
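    To spot-check the result without leaving the shell, you can pull a few of the highest counts back to the driver. This minimal sketch is shown in Scala (the Python shell supports the same RDD calls with lambda syntax) and assumes the counts RDD from the example above:
      scala> counts.sortBy(_._2, ascending = false).take(10).foreach(println)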