Spark performance optimization: resource allocation

The most straightforward way to tune performance is to allocate more resources. As long as more resources are available, the more you allocate, the more obvious the improvement in performance and speed will be. Only when no more resources can be allocated should the subsequent tuning methods be considered.

1. What resources are allocated?

1. The number of executors;

2. The number of CPU cores for each executor;

3. The amount of memory for each executor;

4. The amount of memory for the driver (this has relatively little effect).

2. Where are these resources allocated?

These resources are set through parameters of the spark-submit shell script used to submit the Spark job:

--num-executors 3 // the number of executors

--driver-memory 100m // the memory for the driver

--executor-memory 100m // the memory for each executor

--executor-cores 3 // the number of cores for each executor
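
Put together, a submission that sets all four parameters might look like the sketch below; the class name, master, and jar path are placeholders rather than values from this article:

```bash
# Minimal sketch of a spark-submit call using the four resource parameters above.
# The class name, master, and jar path are hypothetical placeholders.
spark-submit \
  --class com.example.MySparkApp \
  --master yarn \
  --num-executors 3 \
  --driver-memory 100m \
  --executor-memory 100m \
  --executor-cores 3 \
  /path/to/my-spark-app.jar
```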

3. How much should be allocated?

**The first case: Spark Standalone mode.** For a Spark cluster built by the company, count how many machines there are, how much memory each machine has, and how many CPU cores each machine has, and then allocate the Spark job's resources according to these figures.

For example, suppose the cluster has 20 machines, each with 4 GB of memory and a 2-core CPU. You could then allocate 20 executors, giving each executor 4 GB of memory and 2 cores.
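
Expressed as a spark-submit command, that standalone allocation might look roughly like the sketch below; note that standalone mode typically sizes the executor count through --total-executor-cores rather than --num-executors, and the master URL, class name, and jar path are placeholders:

```bash
# Sketch of the standalone example: 20 executors with 4g of memory and 2 cores each.
# Standalone mode typically derives the executor count as
# total-executor-cores / executor-cores (40 / 2 = 20 executors).
spark-submit \
  --class com.example.MySparkApp \
  --master spark://master-host:7077 \
  --total-executor-cores 40 \
  --executor-cores 2 \
  --executor-memory 4g \
  /path/to/my-spark-app.jar
```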

**The second case: YARN is used as the resource scheduler.** For this kind of cluster, check roughly how many resources the queue your Spark job is submitted to has, and then allocate accordingly.

For example, if the resource queue has 500 GB of memory and 100 CPU cores, you can allocate 50 executors, giving each executor 10 GB of memory and 2 cores.
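
On YARN the same allocation could be submitted roughly as below; the queue name, class name, and jar path are placeholders:

```bash
# Sketch of the YARN example: 50 executors with 10g of memory and 2 cores each.
# The queue name, class name, and jar path are hypothetical placeholders.
spark-submit \
  --class com.example.MySparkApp \
  --master yarn \
  --queue my_queue \
  --num-executors 50 \
  --executor-memory 10g \
  --executor-cores 2 \
  --driver-memory 1g \
  /path/to/my-spark-app.jar
```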

4. Why allocating more resources improves performance

1. The Spark application is started in the driver process, so allocating more driver memory can, to a degree, speed up the driver's execution;

2. A Spark application is split into multiple jobs, each job is divided into multiple stages, and each stage is executed as a set of tasks. These tasks are ultimately distributed to the executor processes on the worker nodes, and each task runs as a thread inside an executor process.

It follows that, as long as cluster resources are sufficient, the more executors there are, the more tasks can run at the same time and the faster the application executes.

Similarly, the more CPU cores an executor's JVM process can use (bounded by the number of CPU cores on the worker node), the more efficiently it executes, because the task threads inside an executor run concurrently: more CPU cores means a higher degree of parallelism.
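
To actually exploit the extra cores, each stage must have enough tasks to keep every core busy. A common rule of thumb (an assumption here, not something stated above) is to set the default parallelism to roughly two to three times the total number of executor cores; a sketch using the YARN figures above:

```bash
# Sketch: with 50 executors x 2 cores = 100 cores in total, set the default
# parallelism to ~2-3x that so every core always has tasks to run.
# The 2-3x factor is a common rule of thumb, not a value from this article.
spark-submit \
  --class com.example.MySparkApp \
  --master yarn \
  --num-executors 50 \
  --executor-cores 2 \
  --executor-memory 10g \
  --conf spark.default.parallelism=200 \
  /path/to/my-spark-app.jar
```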

In addition, a Spark application sometimes needs to cache RDDs, the map and reduce sides of a shuffle need memory to buffer data, and many objects are created while tasks execute. All three of these consume executor memory, so allocating more memory also improves Spark's performance: more RDD data can be kept in the cache, less shuffle data has to spill to disk, and pressure from frequent garbage collection is reduced.
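
For reference, the executor memory shared by these three consumers is governed by a couple of configuration keys; the sketch below assumes Spark 1.6+ unified memory management and shows the default fractions, which are assumptions rather than values from this article:

```bash
# Sketch: how executor memory is split between the consumers above, assuming
# Spark 1.6+ unified memory management. spark.memory.fraction is the share of
# the heap used for execution (shuffle) plus storage (RDD cache); the remainder
# is left for task objects and other user data. The values shown are the defaults.
spark-submit \
  --class com.example.MySparkApp \
  --master yarn \
  --executor-memory 10g \
  --conf spark.memory.fraction=0.6 \
  --conf spark.memory.storageFraction=0.5 \
  /path/to/my-spark-app.jar
```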