Running Your First Spark Application
By default, CDH is configured to permit any user to access the Hive Metastore. However, if you have modified the value of the hadoop.proxyuser.hive.groups configuration property (set in Cloudera Manager through the Hive Metastore Access Control and Proxy User Groups Override property), your Spark application might throw exceptions when it runs. To address this issue, add the groups that contain the Spark users that need access to the metastore to this property in Cloudera Manager:
- On the Cloudera Manager Admin Console Home page, click the Hive service.
- On the Hive service page, click the Configuration tab.
- In the Search well, type hadoop.proxyuser.hive.groups to locate the Hive Metastore Access Control and Proxy User Groups Override property.
- Click the plus sign (+), enter the groups you want to have access to the metastore, and then click Save Changes. For the change to take effect, restart the Hive Metastore Server by clicking the restart icon at the top of the page.
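After the restart, one quick way to confirm that a Spark user can reach the metastore is to query it from a Spark shell (started as described in the next step). This is a minimal sketch, assuming the shell's built-in Hive-enabled session is available as 'spark'; the databases listed depend on your cluster:

scala> spark.sql("SHOW DATABASES").show()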
- To start one of the shell applications, run one of the following commands:
- Scala:
$ SPARK_HOME/bin/spark-shell
Spark context Web UI available at ...
Spark context available as 'sc' (master = yarn, app id = ...).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version ...
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_141)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
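Once the prompt appears, you can run a quick sanity check against the shell's built-in objects. This is a minimal sketch that uses only the 'sc' and 'spark' objects the shell creates for you; it does not read any data from HDFS:

scala> sc.parallelize(1 to 100).sum()
scala> spark.range(1, 101).count()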
- Python:
Note: Spark 2 requires Python 2.7 or higher. You might need to install a new version of Python on all hosts in the cluster, because some Linux distributions come with Python 2.6 by default. If the right level of Python is not picked up by default, set the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables to point to the correct Python executable before running the pyspark command.
$ SPARK_HOME/bin/pyspark
Python 2.7.5 (default, Jun 20 2019, 20:27:34)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
Type "help", "copyright", "credits" or "license" for more information
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version ...
      /_/

Using Python version 2.7.5 (default, Jun 20 2019 20:27:34)
SparkSession available as 'spark'.
>>>
In a CDH deployment, SPARK_HOME defaults to /usr/lib/spark in package installations and /opt/cloudera/parcels/CDH/lib/spark in parcel installations. In a Cloudera Manager deployment, the shells are also available from /usr/bin.
For a complete list of shell options, run spark-shell or pyspark with the -h flag.
- To run the classic Hadoop word count application, copy an input file to HDFS:
hdfs dfs -put input
- Within a shell, run the word count application using the following code examples, substituting namenode_host, path/to/input, and path/to/output with values appropriate for your cluster; a short sketch for inspecting the results follows the examples:
- Scala
scala> val myfile = sc.textFile("hdfs://namenode_host:8020/path/to/input")
scala> val counts = myfile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.saveAsTextFile("hdfs://namenode_host:8020/path/to/output")
- Python
>>> myfile = sc.textFile("hdfs://namenode_host:8020/path/to/input")
>>> counts = myfile.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda v1, v2: v1 + v2)
>>> counts.saveAsTextFile("hdfs://namenode_host:8020/path/to/output")
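To check the results without leaving the shell, you can sort the counts and print the most frequent words before writing the output. This is a hedged sketch in Scala, continuing from the 'counts' value defined in the Scala example above:

scala> counts.sortBy(_._2, ascending = false).take(10).foreach(println)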