Spark Core Knowledge: Notes from the Official Site


The article below is the official Spark site's overview of cluster mode, along with explanations of some terminology and the overall architecture.

Cluster Mode Overview

 


This document gives a short overview of how Spark runs on clusters, to make it easier to understand the components involved. Read through the application submission guide to learn about launching applications on a cluster.


Components


Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program). Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark's own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
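As a minimal sketch of this flow (an illustration, not the official example; "local[*]" is used as the master URL so it runs without a real cluster, and the application name is a placeholder), a driver program creates a SparkContext, which connects to a cluster manager, acquires executors, and sends them tasks:

    import org.apache.spark.{SparkConf, SparkContext}

    object MinimalDriver {
      def main(args: Array[String]): Unit = {
        // Master URL examples: "spark://host:7077" (standalone), "yarn",
        // or "local[*]" for a single-machine test run (used here).
        val conf = new SparkConf()
          .setAppName("minimal-driver-example")
          .setMaster("local[*]")

        // Creating the SparkContext connects to the cluster manager,
        // which allocates executors for this application.
        val sc = new SparkContext(conf)

        // The action below triggers tasks that the SparkContext
        // sends to the executors to run.
        val sum = sc.parallelize(1 to 100).reduce(_ + _)
        println(s"sum = $sum")

        sc.stop()
      }
    }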

There are several useful things to note about this architecture:


 

1. Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system (see the first sketch after this list).

 

2. Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN); see the second sketch after this list.

3. The driver program must listen for and accept incoming connections from its executors throughout its lifetime (e.g., see spark.driver.port in the network config section). As such, the driver program must be network addressable from the worker nodes (see the third sketch after this list).

 

4. Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you'd like to send requests to the cluster remotely, it's better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.
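First sketch (for note 1): the way for two applications to share data is through an external storage system. The HDFS path and the two-application structure below are hypothetical, a hedged illustration rather than a prescribed pattern:

    import org.apache.spark.{SparkConf, SparkContext}

    // Application A: compute a result and persist it to shared storage.
    object WriterApp {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("writer-app"))
        // The HDFS path is hypothetical; any external storage works.
        sc.parallelize(1 to 10).saveAsTextFile("hdfs://namenode:8020/shared/results")
        sc.stop()
      }
    }

    // Application B: a separate SparkContext (a different application)
    // sees the data only by reading it back from that storage.
    object ReaderApp {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("reader-app"))
        val shared = sc.textFile("hdfs://namenode:8020/shared/results")
        println(shared.count())
        sc.stop()
      }
    }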
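Second sketch (for note 2): since Spark is agnostic to the cluster manager, switching managers usually amounts to changing the master URL. In spark-shell one could write (host names are placeholders):

    import org.apache.spark.SparkConf

    // The same application targets different cluster managers by
    // changing only the master URL (host names are placeholders).
    val standalone = new SparkConf().setMaster("spark://master:7077") // Spark standalone
    val mesos      = new SparkConf().setMaster("mesos://master:5050") // Mesos
    val yarn       = new SparkConf().setMaster("yarn")                // YARN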
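Third sketch (for note 3): making the driver network addressable from the worker nodes. spark.driver.host and spark.driver.port are standard Spark network settings; the host name and port below are placeholders for your environment:

    import org.apache.spark.SparkConf

    // Pin the driver's address and port so executors on the worker
    // nodes can open connections back to it (values are placeholders).
    val conf = new SparkConf()
      .setAppName("addressable-driver")
      .set("spark.driver.host", "driver.internal.example.com")
      .set("spark.driver.port", "7078")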

Glossary


The following table summarizes terms you’ll see used to refer to cluster concepts:

Application: User program built on Spark. Consists of a driver program and executors on the cluster.
Application jar: A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries; however, these will be added at runtime.
Driver program: The process running the main() function of the application and creating the SparkContext.
Cluster manager: An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).
Deploy mode: Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster.
Worker node: Any node that can run application code in the cluster.
Executor: A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
Task: A unit of work that will be sent to one executor.
Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.

 

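To tie Job, Stage and Task together, a hedged sketch (again using "local[*]" so it runs without a cluster): the collect() action below spawns a job; the shuffle required by reduceByKey divides that job into two stages, and each stage runs as a set of tasks on the executors.

    import org.apache.spark.{SparkConf, SparkContext}

    object JobStageTaskExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("job-stage-task").setMaster("local[*]"))

        val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

        // reduceByKey needs a shuffle, so the job triggered by collect()
        // is divided into two stages that depend on each other.
        val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

        // collect() is the action that spawns the job.
        counts.collect().foreach(println)

        sc.stop()
      }
    }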

Reposted from: https://www.cnblogs.com/xuziyu/p/10845686.html
