Using Parallel Execution (Doc ID 203238.1)

Posted by lovestanford on 2014-12-07

APPLIES TO:

Oracle Database - Personal Edition - Version 8.1.5.0 and later
Oracle Database - Enterprise Edition - Version 8.1.5.0 and later
Information in this document applies to any platform.

PURPOSE


The document gives a beginner's overview of parallel execution. It does not cover everything in great detail, rather it covers the most important concepts required to get up and running with parallel execution (PX).

SCOPE

For users requiring further understanding of how parallel execution works.

DETAILS

Content
Introduction to Parallel Execution
How Parallel Execution works (very basic!)
Database Parameter Setup for Parallel Execution
     PARALLEL_MAX_SERVERS        
     PARALLEL_MIN_SERVERS
     PARALLEL_MIN_PERCENT
     OPTIMIZER_PERCENT_PARALLEL 
     PARALLEL_ADAPTIVE_MULTI_USER     
     PARALLEL_AUTOMATIC_TUNING 
Optimizer
How to determine if Parallel Execution is being used
Parallel Execution Performance
Performance summary 
References 

Introduction to Parallel Execution
When an operation has to handle a lot of data, such as a full table scan of a large table or the creation of a large index, the work can be divided. Instead of using a single process for one statement, the work can be spread across multiple processes. This is called parallel execution or parallel processing. Parallel execution normally reduces response time when you are accessing large amounts of data, but it is not always faster than a serial operation.

Parallel execution is useful for many types of operations like:

  • Queries requiring large table scans, joins, or partitioned index scans
  • Creation of large indexes and large tables
  • Bulk inserts, updates and deletes
  • Aggregations 


How Parallel Execution works
A query is submitted to the Oracle server and parsed. During optimization a parallel plan is considered. If a parallel plan is chosen, then at execution time the user shadow process becomes the query coordinator (QC) and slaves are allocated to the session as requested. The QC divides the workload among the allocated slaves using ROWID ranges or, since 8i, partitions (for some operations). In an attempt to share the workload evenly, a simple distribution algorithm is used. Producer slaves read the data and place it on table queues ready for consumption by consumer slaves or the QC.

If sorting is taking place, then these table queues are read by consumer slaves and sorted data is placed on fresh table queues associated with the consumer slaves. These queues are then read by the QC process. 

If no sorting is taking place, then the QC reads the producer queues directly.
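The producer flow described above can be sketched with ordinary threads and queues. This is an illustrative model only, not Oracle internals: each "producer slave" scans its own ROWID-style range of the data and enqueues rows on a shared table queue, which the coordinator then drains (the no-sort case, where the QC reads the producer queue directly).

```python
import queue
import threading

def producer(rowid_range, table_queue):
    # Each producer slave scans its assigned range and enqueues the rows.
    for row in rowid_range:
        table_queue.put(row)

def run_parallel_scan(data, degree):
    """Split the data into `degree` ranges, scan them with one thread per
    slave, then let the query coordinator drain the shared table queue."""
    table_queue = queue.Queue()
    chunk = (len(data) + degree - 1) // degree
    slaves = [
        threading.Thread(target=producer,
                         args=(data[i * chunk:(i + 1) * chunk], table_queue))
        for i in range(degree)
    ]
    for slave in slaves:
        slave.start()
    for slave in slaves:
        slave.join()
    # The QC dequeues everything the producer slaves enqueued.
    rows = []
    while not table_queue.empty():
        rows.append(table_queue.get())
    return rows
```

Note that, as in a real parallel scan with no ORDER BY, the rows arrive at the coordinator in no guaranteed order.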

 

With sorting:

                 Query Coordinator
                         |
                    Table Queue
                   |X| |X| |X|
                         |
                  Consumer Slaves
                         |
                    Table Queue
                   |X| |X| |X|
                         |
                  Producer Slaves
                       /    \
                   Disks    Disks

Without sorting:

                 Query Coordinator
                         |
                    Table Queue
                         |
                  Producer Slaves
                       /    \
                   Disks    Disks

 

Terminology: 

Query Coordinator - (QC) Foreground process for the driving session. This process receives data from the query slaves. 

Slaves - Slaves read data either straight from disk or from table queue structures (which are populated by other slaves) and write it to their own queues. When slaves read from disk, they use direct path I/O reads, which bypass the buffer cache. In fact, slaves force blocks which have been updated, but not yet written to disk, to be flushed from the buffer cache to disk; they then read the data using direct path I/O.

Slaves waiting for data poll a message queue and dequeue messages from the queue when they are available. There are 2 types of slaves - producers and consumers:

Producer slaves are given ROWID ranges or partitions by the query coordinator, visit the data blocks and retrieve the relevant data. The data they return is placed (enqueued) onto table queues ready for reading by the query coordinator or the consumer slaves.

Consumer slaves are only created when they are required, and dequeue data from the table queues that are filled by the producer slaves. Any slave associated with the same QC can read from any table queue. The data is then processed and returned to the query coordinator process.

In the simplest queries (i.e. those that do not require data to be sorted), the consumer slaves are not required and the QC reads directly from the producer slaves' table queue.

The presence of producers and consumers is the reason why the degree of parallelism sometimes requires double the number of query slaves when sorting is added to a parallel query.
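This rule of thumb can be expressed as a one-line helper (the function name and return convention are my own, for illustration):

```python
def px_servers_required(degree, needs_consumers):
    """Number of PX slaves a statement needs at a given degree of
    parallelism.  When a second slave set (consumers, e.g. for a sort)
    is required, the slave count doubles."""
    return degree * 2 if needs_consumers else degree
```

So a degree-8 query with a sort allocates 16 slaves: 8 producers plus 8 consumers.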

Slaves wait to dequeue messages from the table queues. Waits for 'PX Deque wait' events are part of the normal operation of parallel query: the slaves must wait for messages to be dequeued before they can be worked on.

Table Queues - (TQ) are queues that allow processes to send rows to other processes. For example, data is sent from the producer slaves to the consumer slaves, and the consumer slaves can in turn send data to the query coordinator.


Database Parameter Setup for Parallel Execution

PARALLEL_MAX_SERVERS 
Maximum slaves possible per instance. A value of 0 means no parallel query.

PARALLEL_MIN_SERVERS 
Minimum slaves spawned at instance startup. These will remain for the duration of the instance.

PARALLEL_MIN_PERCENT 
If the optimizer has determined that a query should run in parallel, but at execution time there is insufficient resource available to meet the desired parallel degree, then by default the query will run serially with no message presented to the user. This could easily result in very long execution times. PARALLEL_MIN_PERCENT provides the facility to prevent the query running, and to display an error, if insufficient resource is available. It defines the cut-off point for the minimum amount of resource that is acceptable for PX usage, expressed as a percentage of the requested parallel execution slaves.

If this parameter is not set and resources are unavailable, then queries will serialize.
If this parameter is set and the desired percentage of query slaves is unavailable, then an error (ORA-12827) is raised rather than serializing.
If the required resource is unavailable, then a value of:

0 will serialize queries and no errors will be raised.
50 means that it is acceptable for queries to take up to 2x the best parallel time to complete.
100 means that unless full resource is available, any attempted parallel query will fail with an error: ORA-12827.

See <39034.1> Init.ora Parameter "PARALLEL_MIN_PERCENT" Reference Note
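The behaviour described above can be sketched as follows. This is an illustrative model only: the function name and the convention of returning the granted degree (1 for a serial downgrade) are assumptions, not an Oracle API.

```python
def grant_degree(requested_degree, available_slaves, min_percent):
    """Model of the PARALLEL_MIN_PERCENT decision at query start time.
    min_percent=0 models the parameter being unset (silent downgrade)."""
    if available_slaves >= requested_degree:
        return requested_degree            # full requested degree granted
    if available_slaves * 100 < requested_degree * min_percent:
        # Fewer than min_percent% of the requested slaves are free.
        raise RuntimeError("ORA-12827: insufficient parallel query "
                           "slaves available")
    return max(1, available_slaves)        # downgrade, possibly to serial
```

With min_percent=50 and 8 slaves requested, at least 4 must be free or the query fails rather than silently serializing.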

OPTIMIZER_PERCENT_PARALLEL 
See <39027.1> Init.ora Parameter "OPTIMIZER_PERCENT_PARALLEL" Reference Note 

PARALLEL_ADAPTIVE_MULTI_USER
PARALLEL_ADAPTIVE_MULTI_USER, when set to TRUE, enables an adaptive algorithm designed to improve performance in multi-user environments that use parallel execution. The algorithm automatically reduces the requested degree of parallelism based on the system load at query startup time. The effective degree of parallelism is based on the default degree of parallelism, or the degree from the table or hints, divided by a reduction factor.

Example: On a 16 CPU machine the default degree of parallelism could be set to 32. If one user issues a parallel query, that user gets a degree of 32, effectively using all of the CPUs and memory in the system. When a second user enters the system, that user gets a degree of 16. As the number of users on the system increases, this algorithm continues to reduce the degree until users are running with degree 1, when there are 32 users on the system.
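The reduction in this example can be modelled with a simple division by the number of active PX users. This is an illustrative approximation of the adaptive algorithm, not the exact internal formula:

```python
def effective_degree(requested_degree, active_px_users):
    """Approximate the adaptive reduction: divide the requested degree by
    the number of concurrently active parallel users, never below 1."""
    return max(1, requested_degree // active_px_users)
```

With a requested degree of 32 this reproduces the example: the first user gets 32, the second gets 16, and once 32 users are active each runs with degree 1.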

PARALLEL_AUTOMATIC_TUNING (introduced in 8i)
The setting of parallel_automatic_tuning affects several PX parameters and their behaviour.

Parameter                        Default when                      Default when
                                 parallel_automatic_tuning=FALSE   parallel_automatic_tuning=TRUE
-------------------------------  --------------------------------  ----------------------------------
parallel_execution_message_size  2 KByte                           4 KByte
parallel_adaptive_multi_user     FALSE                             TRUE
large_pool_size                  no effect                         increased, based on a complicated
                                                                   computation using various other
                                                                   parameters
processes                        no effect                         increased if
                                                                   processes < parallel_max_servers
parallel_max_servers             5                                 if parallel_adaptive_multi_user=TRUE:
                                                                     cpus * parallel_threads_per_cpu *
                                                                     _parallel_adaptive_max_users * 5
                                                                   else:
                                                                     cpus * parallel_threads_per_cpu *
                                                                     _parallel_adaptive_max_users * 8
parallel_threads_per_cpu         OS-dependent, usually 2           OS-dependent, usually 2

A list of all relevant PX parameters and a suggested PX setup can be found in:
Note 201799.1 Relevant Parallel Execution INIT.ora Parameters
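The parallel_max_servers defaults from the table above can be written out as a small helper. This is a sketch of the table's formula only; _parallel_adaptive_max_users is a hidden parameter, so its value is taken here as an argument:

```python
def default_parallel_max_servers(cpus,
                                 parallel_threads_per_cpu=2,
                                 adaptive_max_users=1,
                                 adaptive_multi_user=True,
                                 automatic_tuning=True):
    """Default for parallel_max_servers, following the table above:
    a flat 5 without automatic tuning, otherwise a CPU-based formula
    whose final factor depends on parallel_adaptive_multi_user."""
    if not automatic_tuning:
        return 5
    factor = 5 if adaptive_multi_user else 8
    return cpus * parallel_threads_per_cpu * adaptive_max_users * factor
```

For example, a 4-CPU box with automatic tuning and the adaptive algorithm enabled would default to 4 * 2 * 1 * 5 = 40 slaves.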

PARALLEL_FORCE_LOCAL

Introduced in 11.2, this parameter controls whether the parallel server processes are restricted so that they can only operate on the same instance as the query coordinator (the node on which the SQL statement was executed).

For more information see:

Note 1216278.1 Init.ora Parameter "PARALLEL_FORCE_LOCAL" Reference Note

Note 1260674.1 RAC: How to control parallel execution across instances in 10g?

Note 1207190.1 RAC: How to control parallel execution across instances in 11g?

Note 1264548.1 Oracle 11g Release 2: new parallel query parameters

PARALLEL_DEGREE_POLICY

Introduced in 11.2, this parameter controls whether or not automatic degree of parallelism, statement queuing, and in-memory parallel execution are enabled.

For more information see:

Note 1269321.1 Automatic Degree of Parallelism in 11.2.0.2

Note 1542556.1 Automatic Degree of Parallelism (AutoDOP) is enabled when "PARALLEL_DEGREE_POLICY=MANUAL and PARALLEL HINT" is Used at SQL Statement

Note 1380736.1 How to Analyze Why PX Queries are Slow with AutoDOP Feature

Note 1264548.1 Oracle 11g Release 2: new parallel query parameters

In 12.1 a new value, ADAPTIVE, was introduced for this parameter in addition to the existing values MANUAL, LIMITED and AUTO.

PARALLEL_MIN_TIME_THRESHOLD

When automatic degree of parallelism is enabled, this parameter specifies the execution-time threshold above which parallelization is considered worthwhile; statements with an estimated execution time below the threshold run serially.

This parameter was introduced in 11.2.

For more information see:

Note 1264548.1 Oracle 11g Release 2: new parallel query parameters
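The threshold logic, together with the PARALLEL_DEGREE_LIMIT cap covered in the next section, can be sketched as follows. The 10-second default and the exact interaction of the two parameters are assumptions for illustration, not documented values:

```python
def auto_dop_decision(estimated_serial_seconds, computed_dop,
                      min_time_threshold=10, degree_limit=None):
    """If the estimated serial execution time does not exceed
    PARALLEL_MIN_TIME_THRESHOLD, run serially; otherwise use the
    computed degree, capped by PARALLEL_DEGREE_LIMIT if one is set."""
    if estimated_serial_seconds <= min_time_threshold:
        return 1                      # not worth parallelizing
    if degree_limit is not None:
        return min(computed_dop, degree_limit)
    return computed_dop
```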

PARALLEL_DEGREE_LIMIT

Introduced in 11.2, this parameter caps the degree of parallelism of any parallel statement running in the system.

For more information see:

Note 1216276.1 Init.ora Parameter "PARALLEL_DEGREE_LIMIT" Reference Note

Note 1264548.1 Oracle 11g Release 2: new parallel query parameters

PARALLEL_SERVERS_TARGET

This parameter was introduced in 11.2.

When statement queueing is enabled, the value of this parameter specifies the number of parallel server processes allowed to run parallel statements before statement queuing will be used.

For more information see Note 1216280.1, Init.ora Parameter "PARALLEL_SERVERS_TARGET" Reference Note.

Optimizer
Parallel execution can only be used with the Cost Based Optimizer (CBO). In some circumstances, use of the CBO will be forced when you have a non-zero degree of parallelism on a table or on an index. See
Note 73095.1 Optimizer and Parallel Query
Note 196938.1 Why did my query go parallel?

We have the following hints to influence the use of parallel execution

  • PARALLEL
  • NOPARALLEL
  • PQ_DISTRIBUTE
  • PARALLEL_INDEX
  • NOPARALLEL_INDEX


Also note that if the Rule Based Optimizer code path has already been entered, then any parallel hints in the query will be ignored.


How to determine if Parallel Execution is being used:

1. Check the explain plan.
If there are no details of parallelism, then parallel execution will not be used: the optimizer must have decided on a parallel plan for a parallel plan to be implemented.

Check the OTHER column in the plan table, as this contains the SQL generated for use by the slave processes. If this SQL contains hints such as:
/*+  ROWID(A1) */
and these hints are clearly not in the original code, then it is likely that PX is going to be used. The ROWID hint is one of the internal mechanisms used to process parallel execution queries. Note that the ROWID hint has a special meaning inasmuch as it bypasses the buffer cache. In fact, it causes dirty buffers to be flushed to disk so that direct path I/O can be performed on the base data files.

There are two scripts which can be used to format a query on the plan_table: one formats the output for a serial plan (utlxpls.sql), the other formats a parallel plan (utlxplp.sql). Both scripts can be found in $ORACLE_HOME/rdbms/admin. To avoid lines truncating or wrapping, use:

          'set charwidth 108' in svrmgrl
          'set linesize 108' in SQL*Plus

See also <31101.1>SCRIPT: DIAG - Obtaining Formatted Explain Plan Output 


2. From the session where the query was run, type:
         select * from v$pq_sesstat; 

            STATISTIC                      LAST_QUERY SESSION_TOTAL
            ------------------------------ ---------- -------------
            Queries Parallelized                    1             1

The above row tells us that the last query run in this session was parallelized.

3. Check in v$ views for slave activity:
Query v$pq_slave a couple of times:

     SELECT slave_name, status, cpu_secs_total
       FROM v$pq_slave;

If there are no rows selected, there are no slaves running. If there is no difference in CPU usage between the 2 runs, then there has been no CPU activity for the period you sampled over.


v$session_wait can also be queried to look for PX activity.

You can detect the use of PX with the following statement in versions later than 8.x:

     SELECT sid, event, seq#, p1, p2, p3, wait_time
       FROM v$session_wait
      WHERE upper(event) LIKE 'PX%'
      ORDER BY 1;


A really useful script to monitor PX activity can be found in
<202219.1>  Script to map parallel query coordinators to slaves


4. Trace the Query coordinator using Event 10046 Level 12:
Look for parallel query wait events in the resultant trace file. Also look for trace files from the query slaves.


Parallel Execution Performance
Running queries in parallel, as opposed to serially, can benefit performance. However, for a parallel query to be effective, a number of conditions must be met.

Multiple slave processes consume more CPU than single processes. Each slave has its own address space (memory allocation).

Parallel execution was designed to reduce execution time for queries that have no option but to read large quantities of data. It maximizes data throughput by spreading the read workload across multiple processes, or slaves. Sort operations are also managed using the slaves' table queue structures. This has the effect of increasing the CPU load on the system as a whole. If the system is already running at (or close to) maximum CPU utilization, parallel query will not make the query any faster; if no more CPU is available, the parallel operations will serialize.

The additional I/O requirements of the slave processes reading the data can also stress a heavily loaded I/O subsystem. Disk I/O may become a bottleneck in some cases. Distributing data across multiple disk spindles can help.

Additionally, some queries may simply run quicker in serial. Typically, queries that use index look-ups are not suited to PX.

  •  Nested Loops vs. Hash/Sort Merge Joins
    Typically, parallel query tries to use full table scans (FTS) to access the data, and the lookups are divided between slaves on a ROWID-range basis. Nested loop joins are usually not efficient with FTS (unless the inputs are small); hash joins and sort-merge joins are usually much more efficient at processing large amounts of data. However, there is a downside: HJ and SMJ do not allow row elimination based on data from a driving row source. This elimination can drastically reduce the data sets involved, which can mean that a serial access path using index look-ups is quicker than a parallel one simply because of the volume of data eliminated.
  •  Cost of slave generation, dividing the data up, passing it through multiple processes and collating the results, may be greater than retrieving the data serially.
  •  Data skew
    Parallel queries divide data between reader slaves on a ROWID-range basis. Just because the same number of blocks is given to each slave does not mean that those blocks contain an identical number of rows; in fact, some of the blocks could be completely empty. This can be especially problematic where large quantities of data are archived and deleted, as this results in many empty or sparsely filled blocks. The effect of this non-uniform data distribution is that queries run slower than they otherwise might, because one slave does most of the work (i.e. the data distribution serializes access on one slave, causing a bottleneck). There is little that can be done about this other than to reorganize the data.
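The skew effect is easy to demonstrate: give each slave the same number of blocks from a table whose older blocks have been emptied by deletes, and one slave ends up with all the rows. The helper below is illustrative, not an Oracle tool:

```python
def rows_per_slave(rows_in_each_block, degree):
    """Divide the blocks into equal ROWID-style ranges, one per slave,
    and report how many rows each slave actually has to process.
    The most loaded slave gates the whole query."""
    n = len(rows_in_each_block)
    chunk = (n + degree - 1) // degree
    return [sum(rows_in_each_block[i * chunk:(i + 1) * chunk])
            for i in range(degree)]

# A table where old data was deleted: the first half of the blocks are empty.
blocks = [0] * 50 + [100] * 50
print(rows_per_slave(blocks, 2))   # [0, 5000]: one slave does all the work
```

With uniformly filled blocks the same split is perfectly balanced, which is why reorganizing the data is the usual remedy.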


Performance summary: 
It cannot be stressed strongly enough that parallel query is not always quicker than serial. Some queries are more suited to parallel operations than others. If you are using parallel execution, then you should endeavour to maximize I/O throughput to your disk. Ensure that:

  •  you have enough slaves to retrieve the data efficiently
  •  you do not have too many slaves (avoid maxing out the CPU)
  •  you set realistic memory parameters (sort_area_size etc.) so that you do not max out memory and cause swapping
  •  the data is spread over multiple disk spindles so that slaves do not contend for I/O resource
  •  the type of query you are attempting to execute in parallel is suitable for parallelisation.
  •  watch for non-uniform loading on slave processes as this could indicate data skew.

Additional Information

To view White Papers visit Note 1329441.1



Selected Webcasts in the Oracle Data Warehouse Global Leaders Webcast Series in Note 1306350.1

 


From the "ITPUB Blog"; link: http://blog.itpub.net/21754115/viewspace-1359873/. Please cite the source when reposting; otherwise legal liability may be pursued.
