Spark QuickStart

Preparation

Hadoop

  • Set up the Hadoop execution environment first. Reference

Spark

cd ~
wget https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
tar zvxf spark-2.3.1-bin-hadoop2.7.tgz
sudo mv ./spark-2.3.1-bin-hadoop2.7 /opt/
sudo chown hadoopuser:hadoop -R /opt/spark-2.3.1-bin-hadoop2.7
  • Register the environment variables.
vim ~/.bashrc

Add the following settings at the end of the file.

# set Spark
export SPARK_HOME=/opt/spark-2.3.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Don't forget to run source ~/.bashrc afterwards so the new variables take effect in the current shell.

  • Install PySpark from PyPI.
# Python 2.x
# sudo apt-get install python-pip
# pip install pyspark

# Python 3.x; pin pyspark to 2.3.1 to match the Spark installed above
sudo apt-get install python3-pip
pip3 install pyspark==2.3.1
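
To confirm the pip installation works, you can start a local SparkSession from plain Python. The snippet below is only a minimal sanity check and assumes a local[*] master with no cluster configuration.

# verify_pyspark.py - minimal sanity check for the pip-installed PySpark
from pyspark.sql import SparkSession

# start a purely local Spark session (no Hadoop/YARN cluster required)
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("pyspark-sanity-check") \
    .getOrCreate()

print(spark.version)            # should print 2.3.1
print(spark.range(5).count())   # runs a tiny job; should print 5

spark.stop()

If this prints the version and 5 without errors, both the package and local execution are working.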

Docker

  • Set up the Docker execution environment. Reference

Example In Scala

  • Spark Operations
spark-shell
  • Example with a text file
// cd ~/spark_example
// execute `spark-shell` first

// the path of the file is ~/spark_example/sample.txt
val lines = sc.textFile("sample.txt")
val lineCount = lines.count

// anonymous function
val bsdLines = lines.filter(line => line.contains("google"))
bsdLines.count    // count all matching lines

// named function
def isBSD(line: String) = { line.contains("bank") }
val filterLines = lines.filter(isBSD)
filterLines.count
filterLines.foreach(bLine => println(bLine))    // print each match
  • RDD Operations
// cd ~/spark_example
// execute `spark-shell` first

// map
val numbers = sc.parallelize(List(1, 2, 3, 4, 5))
val numberSquared = numbers.map(num => num * num)
numberSquared.foreach(num => print(num + " "))
numberSquared.map(_.toString.reverse).collect()

// flatMap (reuses `lines` from the text-file example above)
val idsStr = lines.flatMap(line => line.split(","))

// distinct
idsStr.distinct.count
idsStr.distinct.collect()

Example In Python

  • Spark Operations
pyspark
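
Below is a rough PySpark equivalent of the Scala text-file example above. It is only a sketch: it assumes the same ~/spark_example/sample.txt file and uses the sc SparkContext that the pyspark shell already provides.

# cd ~/spark_example
# execute `pyspark` first

# the path of the file is ~/spark_example/sample.txt
lines = sc.textFile("sample.txt")
line_count = lines.count()

# anonymous function (lambda)
bsd_lines = lines.filter(lambda line: "google" in line)
bsd_lines.count()

# named function
def is_bsd(line):
    return "bank" in line

filter_lines = lines.filter(is_bsd)
filter_lines.count()
for b_line in filter_lines.collect():    # print each match on the driver
    print(b_line)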
