Spark QuickStart
Preparation
Hadoop
- Set up a Hadoop execution environment first. Reference
Spark
- Download the Spark release that matches your Hadoop version from https://spark.apache.org/downloads.html.
cd ~
wget https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
tar -zxvf spark-2.3.1-bin-hadoop2.7.tgz
sudo mv ./spark-2.3.1-bin-hadoop2.7 /opt/
sudo chown -R hadoopuser:hadoop /opt/spark-2.3.1-bin-hadoop2.7
- Set the environment variables.
vim ~/.bashrc
Add the following settings at the end of the file.
# set Spark
export SPARK_HOME=/opt/spark-2.3.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
And don't forget to run source ~/.bashrc afterwards.
- Install with pip from PyPI.
# Python 2.x
# sudo apt-get install python-pip
# pip install pyspark
# Python 3.x; pin PySpark to 2.3.1 to match the install above
sudo apt-get install python3-pip
pip3 install pyspark==2.3.1
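To verify that the installation works, here is a minimal sanity check (the app name InstallCheck is just an illustrative label):
# quick check that the pip-installed PySpark runs in local mode
from pyspark import SparkContext
sc = SparkContext("local[*]", "InstallCheck")  # local mode, all CPU cores
print(sc.version)                              # expected: 2.3.1
print(sc.parallelize([1, 2, 3, 4, 5]).sum())   # expected: 15
sc.stop()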
Docker
- Set up a Docker execution environment. Reference
Example In Scala
- Spark Operations
spark-shell
- Example with a text file
# cd ~/spark_example
# start `spark-shell` first
# the sample file lives at ~/spark_example/sample.txt
val lines = sc.textFile("sample.txt")  // load the file as an RDD of lines
val lineCount = lines.count()          // total number of lines
// anonymous function
val bsdLines = lines.filter(line => line.contains("google"))
bsdLines.count()  // count all matching lines
// named function
def isBSD(line: String) = line.contains("bank")
val filterLines = lines.filter(isBSD)
filterLines.count()
filterLines.foreach(bLine => println(bLine))  // print each matching line
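One caveat worth knowing: on a real cluster, foreach(println) prints on the executors rather than in your shell, so a common pattern is to bring a small sample back to the driver instead. A minimal sketch:
// take() ships a small sample back to the driver, so the output
// always appears in your shell, even on a cluster
filterLines.take(5).foreach(println)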
- RDD Operations
# cd ~/spark_example
# start `spark-shell` first
// map
val numbers = sc.parallelize(List(1, 2, 3, 4, 5))
val numberSquared = numbers.map(num => num * num)
numberSquared.foreach(num => print(num + " "))
numberSquared.map(_.toString.reverse).collect()
// flatMap
val lines = sc.textFile("sample.txt")  // same sample file as above
val idsStr = lines.flatMap(line => line.split(","))
// distinct
idsStr.distinct().count()
idsStr.distinct().collect()
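These transformations compose naturally. As an illustration, a minimal word-count sketch over the same sample.txt, assuming words are space-separated:
// split lines into words, pair each word with 1, then sum per word
val words = lines.flatMap(line => line.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
counts.take(10).foreach(println)  // first ten (word, count) pairs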
Example In Python
- Spark Operations
pyspark
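The Python shell predefines sc just like spark-shell does. For comparison, a minimal sketch mirroring the Scala examples above (same sample.txt assumed):
# inside pyspark, `sc` is already defined
lines = sc.textFile("sample.txt")
lines.count()                                        # total number of lines
lines.filter(lambda line: "google" in line).count()  # matching lines
numbers = sc.parallelize([1, 2, 3, 4, 5])
numbers.map(lambda num: num * num).collect()         # [1, 4, 9, 16, 25]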