
1 Pyspark 최 현 영, School of Computer Science

2 Distributed Processing with the Spark Framework
[Stack diagram]
- API: Spark
- Cluster computing: Spark Standalone, YARN, Mesos
- Storage: HDFS

3 The Hadoop Distributed File System (HDFS)

4 Chapter Topics - The Hadoop Distributed File System
- Why HDFS?
- HDFS Architecture
- Using HDFS
- Conclusion
- Hands-On Exercise: Using HDFS

5 Big Data Processing with Spark
Three key concepts (illustrated in the sketch below):
- Distribute data when the data is stored – HDFS
- Run computation where the data is – HDFS and Spark
- Cache data in memory – Spark
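The third concept, in-memory caching, is what Spark adds on top of HDFS. A minimal PySpark sketch of the idea (the weblogs path is hypothetical; assumes a running shell with a SparkContext named sc):

pyspark> logs = sc.textFile("hdfs://localhost/user/training/weblogs")  # data already distributed by HDFS
pyspark> errors = logs.filter(lambda line: "ERROR" in line)
pyspark> errors.cache()    # keep the filtered RDD in executor memory
pyspark> errors.count()    # first action reads from HDFS, then caches
pyspark> errors.count()    # repeated actions are served from memory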

6 Chapter Topics - The Hadoop Distributed File System
- Why HDFS?
- HDFS Architecture
- Using HDFS
- Hands-On Exercise: Using HDFS

7 HDFS Basic Concepts
- HDFS is a filesystem written in Java, based on Google's GFS
- Provides redundant storage for massive amounts of data
- Uses readily available, industry-standard computers
[Layer diagram: HDFS on top of the native OS filesystem, on top of disk storage]

8 How Files Are Stored
[Diagram: a very large data file is split into blocks, which are distributed across the cluster]
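As a worked example of block storage (a sketch assuming the common 128 MB default block size; clusters can configure other sizes, such as 64 MB):

>>> import math
>>> block_size = 128 * 1024 * 1024     # common HDFS default, configurable per cluster
>>> file_size = 500 * 1024 * 1024      # a hypothetical 500 MB file
>>> math.ceil(file_size / block_size)  # number of blocks; each block is replicated (3x by default)
4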

9 Example: Storing and Retrieving Files (1)

10 Example: Storing and Retrieving Files (2)

11 Example: Storing and Retrieving Files (3)

12 Example: Storing and Retrieving Files (4)

13 HDFS NameNode Availability
- The NameNode daemon must be running at all times
  – If the NameNode stops, the cluster becomes inaccessible
- HDFS is typically set up for High Availability
  – Two NameNodes: Active and Standby
- Small clusters may use 'classic mode'
  – One NameNode
  – One 'helper' node called the Secondary NameNode (bookkeeping, not a backup)

14 Chapter Topics - The Hadoop Distributed File System
- Why HDFS?
- HDFS Architecture
- Using HDFS
- Hands-On Exercise: Using HDFS


16 hdfs dfs Examples (1)
- Copy the file foo.txt from the local disk to the user's home directory in HDFS:
  $ hdfs dfs -put foo.txt foo.txt
  – This copies the file to /user/username/foo.txt
- List the user's home directory in HDFS:
  $ hdfs dfs -ls
- List the HDFS root directory:
  $ hdfs dfs -ls /

17 hdfs dfs Examples (2)
- Display the contents of the HDFS file /user/fred/bar.txt:
  $ hdfs dfs -cat /user/fred/bar.txt
- Copy that file to the local disk, naming it baz.txt:
  $ hdfs dfs -get /user/fred/bar.txt baz.txt
- Create a directory called input under the user's home directory:
  $ hdfs dfs -mkdir input
- Delete the directory input_old and all its contents:
  $ hdfs dfs -rm -r input_old
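Once a file is in HDFS, Spark can read it directly. A minimal sketch using the /user/fred/bar.txt path from the example above (assumes a running PySpark shell with SparkContext sc and HDFS as the default filesystem):

pyspark> lines = sc.textFile("/user/fred/bar.txt")   # absolute path, resolved against HDFS
pyspark> lines.take(3)                               # inspect the first few lines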

18 Example: HDFS in Spark
- Specify HDFS files in Spark by URI:
  hdfs://hdfs-host[:port]/path
  – The default port is 8020
  > mydata = sc.textFile("hdfs://hdfs-host:port/user/training/purplecow.txt")
  > mydata.map(lambda s: s.upper()) \
          .saveAsTextFile("hdfs://hdfs-host:port/user/training/purplecowuc")
- Paths are relative to the user's home directory in HDFS:
  > mydata = sc.textFile("purplecow.txt")
  – This resolves to hdfs://hdfs-host:port/user/training/purplecow.txt

19 Chapter Topics - The Hadoop Distributed File System
- Why HDFS?
- HDFS Architecture
- Using HDFS
- Hands-On Exercise: Using HDFS

20 Hands-On Exercise: Using HDFS (1)
- Upload a file:
  $ cd ~/training_materials/sparkdev/data
  $ hdfs dfs -put weblogs /user/training/weblogs
- List the contents of your HDFS home directory:
  $ hdfs dfs -ls /user/training

21 Hands-On Exercise: Using HDFS (2)
- Create an RDD based on one of the files you uploaded to HDFS:
  pyspark> logs = sc.textFile("hdfs://localhost/user/training/weblogs/ log")
  pyspark> logs.filter(lambda s: ".jpg" in s) \
               .saveAsTextFile("hdfs://localhost/user/training/jpgs")
- View the result:
  $ hdfs dfs -ls jpgs
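The result can also be sanity-checked from inside the shell (a minimal sketch; the jpgs path comes from the step above):

pyspark> sc.textFile("jpgs").take(2)   # relative path resolves to /user/training/jpgs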

22 Running Spark on a Cluster

23 Distributed Processing with the Spark Framework
[Stack diagram]
- API: Spark
- Cluster computing: Spark Standalone, YARN, Mesos
- Storage: HDFS

24 Chapter Topics - Running Spark on a Cluster
- Overview
- A Spark Standalone Cluster
- Hands-On Exercise: Running the Spark Shell on a Cluster

25 Environments Where Spark Can Run
- Locally – no distributed processing
- Locally with multiple worker threads
- On a cluster:
  – Spark Standalone
  – Apache Hadoop YARN (Yet Another Resource Negotiator)
  – Apache Mesos
The master URL selects among these, as the sketch below shows.
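Each environment corresponds to a master URL passed to Spark. A hedged sketch (host names and ports below are placeholders, not values from the course environment):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("example")
conf.setMaster("local")                 # local, no distributed processing
# conf.setMaster("local[4]")            # local with 4 worker threads
# conf.setMaster("spark://host:7077")   # Spark Standalone cluster
# conf.setMaster("yarn")                # Hadoop YARN (older releases used "yarn-client")
# conf.setMaster("mesos://host:5050")   # Apache Mesos
sc = SparkContext(conf=conf)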

26 Why Run on a Cluster?
- Running Spark on a cluster gives the advantages of distributed processing:
  – The ability to process large amounts of data efficiently
  – Fault tolerance and scalability
- Local mode is useful for development and testing
- Production use is almost always on a cluster

27 Cluster Architecture

28 Spark Cluster Terminology

29 The Spark Driver Program

30 Starting the Spark Shell on a Cluster

31 Chapter Topics - Running Spark on a Cluster
!! Overview !! A Spark Standalone Cluster !! Hands"On Exercise: Running the Spark Shell on a Cluster

32 Spark Standalone Daemons

33 Running Spark on a Standalone Cluster (1)

34 Running Spark on a Standalone Cluster (2)

35 Running Spark on a Standalone Cluster (3)

36 Running Spark on a Standalone Cluster (4)

37 Running Spark on a Standalone Cluster (5)

38 Hands-On Exercise: Running the Spark Shell on a Cluster
- Start the Spark Standalone cluster:
  $ sudo service spark-master start
  $ sudo service spark-worker start
- View the Spark Standalone cluster UI

39 Chapter Topics - Running Spark on a Cluster
- Overview
- A Spark Standalone Cluster
- Hands-On Exercise: Running the Spark Shell on a Cluster

40 Hands-On Exercise: Running the Spark Shell on a Cluster
- Start the Spark shell against the Standalone cluster:
  $ MASTER=spark://localhost:7077 pyspark

41 Hands-On Exercise: Running the Spark Shell on a Cluster
- View the sc.master property:
  pyspark> sc.master
- Execute a simple operation:
  pyspark> sc.textFile("weblogs/*").count()

