深入理解Oracle的並行操作(原創)

並行（Parallel）和OLAP系統
並行的實現機制是：首先，Oracle會建立一個程式用於協調並行服務程式之間的資訊傳遞，這個協調程式將需要操作的資料集（比如表的資料塊）分割成很多部分，稱為並行處理單元，然後並行協調程式給每個並行程式分配一個資料單元。比如有四個並行服務程式，他們就會同時處理各自分配的單元，當一個並行服務程式處理完畢後，協調程式就會給它們分配另外的單元，如此反覆，直到表上的資料都處理完畢，最後協調程式負責將每個小的集合合併為一個大集合作為最終的執行結果，返回給使用者。
並行處理的機制實際上就是把一個要掃描的資料集分成很多小資料集，Oracle會啟動幾個並行服務程式同時處理這些小資料集，最後將這些結果彙總，作為最終的處理結果返回給使用者。
這種資料並行處理方式在OLAP系統中非常有用，OLAP系統的表通常來說都是非常大，如果系統的CPU比較多，讓所有的CPU共同來處理這些資料，效果就會比序列執行要高得多。
然而對於OLTP系統，通常來講，並行並不合適，原因是OLTP系統上幾乎在所有的SQL操作中，資料訪問路徑基本上以索引訪問為主，並且返回結果集非常小，這樣的SQL操作的處理速度一般非常快，不需要啟用並行。
並行處理的機制
當Oracle啟動的時候，例項會根據初始化引數 PARALLEL_MIN_SERVERS=n的值來預先分配n個並行服務程式，當一條SQL被CBO判斷為需要並行執行時發出SQL的會話程式變成並行協助程式，它按照並行執行度的值來分配程式伺服器程式。
首先協調程式會使用ORACLE啟動時根據引數： parallel_min_servers=n的值啟動相應的並行服務程式，如果啟動的並行伺服器程式數不足以滿足並行度要求的並行服務程式數，則並行協調程式將額外啟動並行服務程式以提供更多的並行服務程式來滿足執行的需求。然後並行協調程式將要處理的物件劃分成小資料片，分給並行服務程式處理；並行服務程式處理完畢後將結果傳送給並行協調程式，然後由並行協調程式將處理結果彙總併傳送給使用者。
剛才講述的是一個並行處理的基本流程。實際上，在一個並行執行的過程中，還存在著並行服務程式之間的通訊問題。
在一個並行服務程式需要做兩件事情的時候，它會再啟用一個程式來配合當前的程式完成一個工作，比如這樣的一條SQL語句：
Select * from employees order by last_name;
假設employees表中last_name列上沒有索引，並且並行度為4，此時並行協調程式會分配4個並行服務程式對錶employees進行全表掃描操作，因為需要對結果集進行排序，所以並行協調程式會額外啟用4個並行服務程式，用於處理4個程式傳送過來的資料，這新啟用的使用者處理傳遞過來資料的程式稱為父程式，使用者傳出資料（最初的4個並行服務程式）稱為子程式，這樣整個並行處理過程就啟用了8個並行服務程式。其中每個單獨的並行服務程式的行為叫作並行的內部操作，而並行服務程式之間的資料交流叫做並行的互動操作。
這也是有時我們發現並行服務程式數量是並行度的2倍，就是因為啟動了並行服務父程式操作的緣故。
讀懂一個並行處理的執行計劃
CREATE TABLE emp2 AS SELECT * FROM employees;
ALTER TABLE emp2 PARALLEL 2;
EXPLAIN PLAN FOR SELECT SUM(salary) FROM emp2 GROUP BY department_id;
SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------
Plan hash value: 3939201228
------------------------------------------------------------------------------------------------------------------
| Id | Operation                | Name     | Rows | Bytes | Cost (%CPU)| Time     |    TQ |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |          |   107 | 2782 |     3 (34)| 00:00:01 |        |      |            |
|   1 | PX COORDINATOR          |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)    | :TQ10001 |   107 | 2782 |     3 (34)| 00:00:01 | Q1,01 | P->S | QC (RAND) |
|   3 |    HASH GROUP BY         |          |   107 | 2782 |     3 (34)| 00:00:01 | Q1,01 | PCWP |            |
|   4 |     PX RECEIVE           |          |   107 | 2782 |     3 (34)| 00:00:01 | Q1,01 | PCWP |            |
|   5 |      PX SEND HASH        | :TQ10000 |   107 | 2782 |     3 (34)| 00:00:01 | Q1,00 | P->P | HASH       |
|   6 |       HASH GROUP BY      |          |   107 | 2782 |     3 (34)| 00:00:01 | Q1,00 | PCWP |            |
|   7 |        PX BLOCK ITERATOR |          |   107 | 2782 |     2   (0)| 00:00:01 | Q1,00 | PCWC |            |
|   8 |         TABLE ACCESS FULL| EMP2     |   107 | 2782 |     2   (0)| 00:00:01 | Q1,00 | PCWP |            |
------------------------------------------------------------------------------------------------------------------
Note
-----
   - dynamic sampling used for this statement
19 rows selected.
透過執行計劃，我們來看一下它的執行步驟：
1、並行服務程式對EMP2表進行全表掃描。
2、並行服務程式以ITERATOR（迭代）方式訪問資料塊，也就是並行協調程式分給每個並行服務程式一個資料片，在這個資料片上，並行服務程式順序地訪問每個資料塊（Iterator），所有的並行服務程式將掃描的資料塊傳給另一組並行服務程式（父程式）用於做Hash Group操作。
3、並行服務父程式對子程式傳遞過來的資料做Hash Group操作。
4、並行服務程式（子程式）將處理完的資料傳送出去。
5、並行服務程式（父程式）接收到處理過的資料。
6、合併處理過的資料，按照隨機的順序發給並行協調程式（QC：Query Conordinator）。
7、並行協調程式將處理結果發給使用者。
當使用了並行執行，SQL的執行計劃中就會多出一列：in-out。該列幫助我們理解資料流的執行方法。它的一些值的含義如下：
Parallel to Serial（P->S）: 表示一個並行操作傳送資料給一個序列操作
Parallel to Parallel（P->P）：表示一個並行操作向另一個並行操作傳送資料
Parallel Combined with parent(PCWP): 同一個從屬程式執行的並行操作，同時父操作也是並行的。
Parallel Combined with Child(PCWC): 同一個從屬程式執行的並行操作，子操作也是並行的。
Serial to Parallel（S->P）: 一個序列操作傳送資料給並行操作，如果select部分是序列操作，就會出現這個情況。
並行執行等待事件
在做並行執行方面的效能最佳化的時候，可能會遇到如下等待事件
PX Deq Credit: send blkd
這是一個有並行環境的資料庫中，從statspack 或者AWR中經常可以看到的等待事件。在Oracle 9i 裡面，這個等待時間被列入空閒等待。
一般來說空閒等待可以忽略它，但是實際上空閒等待也是需要關注的，因為一個空閒的等待，它反映的是另外的資源已經超負荷執行了。基於這個原因，在Oracle 10g裡已經把PX Deq Credit: send blkd等待時間不再視為空閒等待，而是列入了Others 等待事件範圍。
PX Deq Credit: send blkd 等待事件的意思是：當並行服務程式向並行協調程式QC（也可能是上一層的並行服務程式）傳送訊息時，同一時間只有一個並行服務程式可以向上層程式傳送訊息，這時候如果有其他的並行服務程式也要傳送訊息，就只能等待了。直到獲得一個傳送訊息的信用資訊（Credit），這時候會觸發這個等待事件，這個等待事件的超時時間為2秒鐘。
如果我們啟動了太多的並行程式，實際上系統資源（CPU）或者QC 無法即時處理並行服務傳送的資料，那麼等待將不可避免。對於這種情況，我們就需要降低並行處理的並行度。
當出現PX Deq Credit：send blkd等待的時間很長時，我們可以透過平均等待時間來判斷等待事件是不是下層的並行服務程式空閒造成的。該等待事件的超時時間是2秒，如果平均等待時間也差不多是2秒，就說明是下層的並行程式“無事所做”，處於空閒狀態。如果和2秒的差距很大，就說明不是下層並行服務超時導致的空閒等待，而是並行服務之間的競爭導致的，因為這個平均等待事件非常短，說明並行服務程式在很短時間的等待之後就可以獲取資源來處理資料。
所以對於非下層的並行程式造成的等待，解決的方法就是降低每個並行執行的並行度，比如物件（表，索引）上預設的並行度或者查詢Hint 指定的並行度。
並行執行的使用範圍
Parallel Query( 並行查詢 )
並行查詢可以在查詢語句，子查詢語句中使用，但是不可以使用在一個遠端引用的物件上（如DBLINK）。
一個查詢能夠並行執行，需要滿足以下條件
1、SQL語句中有Hint提示，比如Parallel或者Parallel_index。
2、SQL語句中引用的物件被設定了並行屬性。
3、多表關聯中，至少有一個表執行全表掃描（Full table scan）或者跨分割槽的Index range SCAN。如：
select /*+parallel(t 4)×/ * from t;
Parallel DDL（並行DDL操作，如建表，建索引等）
表的並行操作
CREATE TABLE table_name parallel 4 AS SELECT ....
ALTER TABLE table_name move partition partition_name parallel 4;
Alter table table_name split partition partition_name ...... parallel 4;
Alter table table_name coalesce partition parallel 4;
DDL操作，我們可以透過trace 檔案來檢視它的執行過程。
alter session set events '10046 trace name context forever,level 12';
alter session set events '10046 trace name context off' ;
建立索引的並行執行
建立索引時使用並行方式在系統資源充足的時候會使效能得到很大的提高，特別是在OLAP系統上對一些很大的表建立索引時更是如此。以下的建立和更改索引的操作都可以使用並行：
Create index index_name on table_name(colum_name) parallel 4;
Alter index index_name rebuild parallel 4
Alter index index_name rebuild partition partition_name parallel 4;
Alter index index_name split partition partition_name .... parallel 4;
注意：索引上的並行度只有在訪問索引的時候才可以被使用。
總結：
使用並行方式，不論是建立表，修改表，建立索引，重建索引，他們的機制都是一樣的，那就是Oracle 給每個並行服務程式分配一塊空間，每個程式在自己的空間裡處理資料，最後將處理完畢的資料彙總，完成SQL的操作。
Parallel DML（並行DML操作，如insert，update，delete等）
Oracle 可以對DML操作使用並行執行，但是有很多限制。如果我們要讓DML操作使用並行執行，必須顯示地在會話裡執行如下命令：
SQL> alter session enable parallel dml;
會話已更改。
只有執行了這個操作，Oracle 才會對之後符合並行條件的DML操作並行執行，如果沒有這個設定，即使SQL中指定了並行執行，Oracle也會忽略它。
delete，update和merge樣例如下
delete /*+ parallel(table_name 4) */ from test;
update/*+ parallel(table_name 4) */ test set id=100;
merge /*+ parallel(table_name 4) */ into table_name ...
Oracle 對Delete,update,merge的操作限制在，只有操作的物件是分割槽表示，Oracle才會啟動並行操作。原因在於，對於分割槽表，Oracle 會對每個分割槽啟用一個並行服務程式同時進行資料處理，這對於非分割槽表來說是沒有意義的。分割槽表的並行屬性只能在表級別設定，不能在分割槽級別設定。
注:經筆者測試中非分割槽表下， Delete,update,merge也可以執行並行操作，不知道是版本原因還是其他原因，待考證
Insert 的並行操作
實際上只有對於insert into … select … 這樣的SQL語句啟用並行才有意義。對於insert into .. values… 並行沒有意義，因為這條語句本身就是一個單條記錄的操作。
Insert 並行常用的語法是：
Insert /*+parallel(t 2) */ into t select /*+parallel(t1 2) */ * from t1;
這條SQL 語句中，可以讓兩個操作insert 和select 分別使用並行，這兩個並行是相互獨立，互補干涉的，也可以單獨使用其中的一個並行。也有如下語法
Insert into t select /*+parallel(t1 2) */ * from t1;
Insert /*+parallel(t 2) */ into t select * from t1;
並行執行的設定
並行相關的初始話引數
parallel_min_servers=n
在初始化引數中設定了這個值，Oracle 在啟動的時候就會預先啟動N個並行服務程式，當SQL執行並行操作時，並行協調程式首先根據並行度的值，在當前已經啟動的並行服務中條用n個並行服務程式，當並行度大於n時，Oracle將啟動額外的並行服務程式以滿足並行度要求的並行服務程式數量。
parallel_max_servers=n
如果並行度的值大於parallel_min_servers或者當前可用的並行服務程式不能滿足SQL的並行執行要求，Oracle將額外建立新的並行服務程式，當前例項總共啟動的並行服務程式不能超過這個引數的設定值。預設值為PARALLEL_THREADS_PER_CPU * CPU_COUNT * concurrent_parallel_users * 5
parallel_adaptive_multi_user=true|false
Oracle 10g R2下，並行執行預設是啟用的。這個引數的預設值為true，它讓Oracle根據SQL執行時系統的負載情況，動態地調整SQL的並行度，以取得最好的SQL執行效能。

PARALLEL_DEGREE_POLICY
PARALLEL_DEGREE_POLICY specifies whether or not automatic degree of Parallelism, statement queuing, and in-memory parallel execution will be enabled.Values:

MANUAL: Disables automatic degree of parallelism, statement queuing, and in-memory parallel execution. This reverts the behavior of parallel execution to what it was prior to Oracle Database 11g Release 2 (11.2). This is the default.
LIMITED: Enables automatic degree of parallelism for some statements but statement queuing and in-memory Parallel Execution are disabled. Automatic degree of parallelism is only applied to those statements that access tables or indexes decorated explicitly with the DEFAULT degree of parallelism using the PARALLEL clause. Statements that do not access any tables or indexes decorated with the DEFAULT degree of parallelism will retain the MANUAL behavior.
AUTO: Enables automatic degree of parallelism, statement queuing, and in-memory parallel execution.

Note: If the table parallel degree has been set as "default", sql would compute its own DOP regardless parallel_degree_policy is MANUAL or LIMITED. When parallel_degree_policy=AUTO, it would trigger automatically parallel degree .
parallel_min_percent
這個引數指定並行執行時，申請並行服務程式的最小值，它是一個百分比，比如我們設定這個值為50. 當一個SQL需要申請20個並行程式時，如果當前並行服務程式不足，按照這個引數的要求，這個SQL比如申請到20*50%=10個並行服務程式，如果不能夠申請到這個數量的並行服務，SQL將報出一個ORA-12827的錯誤。當這個值設為Null時，表示所有的SQL在做並行執行時，至少要獲得兩個並行服務程式。
Note:當parallel_degree_policy=AUTO時，如果沒有獲得申請的程式，則該並行sql則會進入佇列

PARALLEL_FORCE_LOCAL
This parameter specifies whether a SQL statement executed in parallel is restricted to a single instance in an Oracle RAC environment. By setting this parameter to TRUE, you restrict the scope of the parallel server processed to the single Oracle RAC instance where the query coordinator is running.
The recommended value for the PARALLEL_FORCE_LOCAL parameter is FALSE.
PARALLEL_DEGREE_POLICY
該引數為11g的新增引數，有如下3個屬性值：
manual，預設值，表示不自動調節並行度。
auto，自動調節並行度。
limited，對於在相關表或者索引上已經定義了並行度的查詢進行自動並行度調節，沒有在相關表或者索引上已經定義並行度的查詢不進行自動並行度調節。
PARALLEL_THREADS_PER_ CPU
一個CPU 在並行執行過程中可處理的程式或執行緒的數量，並最佳化並行自適應和負載均衡演算法。如果計算機在執行一個典型查詢時有超負荷的跡象，應減小該數值為任何非零值。根據作業系統而定 (通常為 2)
parallel_automatic_tuning
如果設定為 TRUE，Oracle 將為控制並行執行的引數確定預設值。除了設定該引數外，你還必須為系統中的表設定並行性。以在10g中廢棄，只為向下相容保留。

PARALLEL_MIN_TIME_THRESHOLD
This parameter specifies the minimum execution time a statement should have before the statement is considered for automatic degree of parallelism. By default, this is set to 10 seconds. Automatic degree of parallelism is only enabled if PARALLEL_DEGREE_POLICY is set to AUTO or LIMITED. The syntax is:
PARALLEL_MIN_TIME_THRESHOLD = { AUTO | integer }
The default is AUTO.
PARALLEL_SERVERS_TARGET
This parameter specifies the number of parallel server processes allowed to run parallel statements before statement queuing is used. The default value is:
PARALLEL_THREADS_PER_CPU * CPU_COUNT * concurrent_parallel_users * 2
In the formula, the value assigned to concurrent_parallel_users running at the default degree of parallelism on an instance is dependent on the memory management setting. If automatic memory management is disabled (manual mode), then the value of concurrent_parallel_users is 1. If PGA automatic memory management is enabled, then the value of concurrent_parallel_users is 2. If global memory management or SGA memory target is used in addition to PGA automatic memory management, then the value of concurrent_parallel_users is 4.
When PARALLEL_DEGREE_POLICY is set to AUTO, statements that require parallel execution are queued if the number of parallel processes currently in use on the system equals or is greater than PARALLEL_SERVERS_TARGET. This is not the maximum number of parallel server processes allowed on a system (that is controlled by PARALLEL_MAX_SERVERS). However, PARALLEL_SERVERS_TARGET and parallel statement queuing is used to ensure that each statement that requires parallel execution is allocated the necessary parallel server resources and the system is not flooded with too many parallel server processes.
parallel_execution_message_size
指定並行執行 (並行查詢、PDML、並行恢復和複製) 訊息的大小。如果值大於 2048 或 4096，就需要更大的共享池。如果 PARALLEL_AUTOMATIC_TUNING =TRUE，將在大儲存池之外指定訊息緩衝區。如果PARALLEL_AUTOMATIC_TUNING 為 FALSE，通常值為 2148；如果PARALLEL_AUTOMATIC_TUNING 為 TRUE ，則值為 4096 (根據作業系統而定)。

PARALLEL_DEGREE_LIMIT
With automatic degree of parallelism, Oracle automatically decides whether or not a statement should execute in parallel and what degree of parallelism the statement should use. The optimizer automatically determines the degree of parallelism for a statement based on the resource requirements of the statement. However, the optimizer will limit the degree of parallelism used to ensure parallel server processes do not flood the system. This limit is enforced by PARALLEL_DEGREE_LIMIT
Values:

CPU: The maximum degree of parallelism is limited by the number of CPUs in the system. The formula used to calculate the limit is PARALLEL_THREADS_PER_CPU * CPU_COUNT * the number of instances available (by default, all the opened instances on the cluster but can be constrained using PARALLEL_INSTANCE_GROUP or service specification). This is the default.
IO: The maximum degree of parallelism the optimizer can use is limited by the I/O capacity of the system. The value is calculated by dividing the total system throughput by the maximum I/O bandwidth per process. You must run the DBMS_RESOURCE_MANAGER.CALIBRATE_IO procedure on the system in order to use the IO setting. This procedure will calculate the total system throughput and the maximum I/O bandwidth per process.
integer: A numeric value for this parameter specifies the maximum degree of parallelism the optimizer can choose for a SQL statement when automatic degree of parallelism is active. Automatic degree of parallelism is only enabled if PARALLEL_DEGREE_POLICY is set to AUTO or LIMITED.

並行度可以透過以下三種方式來設定：
1、使用Hint 指定並行度。
2、使用alter session force parallel 設定並行度。
3、使用SQL中引用的表或者索引上設定的並行度，原則上Oracle 使用這些物件中並行度最高的那個值作為當前執行的並行度。
示例：
SQL>Select /*+parallel(t 4) */ count(*) from t;
SQL>Alter table t parallel 4;
SQL>Alter session force parallel query parallel 4;
SQL>alter session force parallel query；
Oracle 預設並行度計算方式：
1、Oracle 根據CPU的個數，RAC例項的個數以及引數parallel_threads_per_cpu的值，計算出一個並行度。
2、對於並行訪問分割槽操作，取需要訪問的分割槽數為並行度。
並行度的優先順序別從高到低：
Hint->alter session force parallel->表，索引上的設定-> 系統引數
實際上，並行只有才系統資源比較充足的情況下，才會取得很好的效能，如果系統負擔很重，不恰當的設定並行，反而會使效能大幅下降。
如何啟用並行

可以用hint、alter session或者設定物件並行屬性三種方式設定啟用並行。三種方式任意一種就可以使並行生效，如果多種方式同時存在的話，則優先順序順序是：hint -> alter session -> table/index degree。

alter session force query parallel 8;

alter session enable parallel dml;

alter session force parallel dml parallel 8;

alter session force parallel ddl parallel 8;

注意：上述的alter session enable只是表示讓當前會話支援並行，最終並行需要透過hint或者table/index degree來實現；而alter session force表示強制並行，無需hint等配合使用。

11g中的並行
在Oracle 11g R2以及之前的版本，你的SQL執行可能被延遲，直到有充足的並行伺服器可用。

Oracle有一種輪換IO機制，叫做“直接路徑IO”，如果它判斷到繞過快取區直接執行IO會更快速的話，它就會啟用。例如，Oracle在讀寫臨時段進行排序或者整理中間結果集時就會使用直接IO。從Oracle 11g開始，Oracle有時也優先利用直接路徑IO來處理序列表訪問，而不是正常的快取IO。

在執行並行查詢操作時，Oracle通常會使用直接路徑IO。透過使用直接路徑IO，Oracle可以避免建立快取記憶體競爭，並可以使IO更加最佳化地在並行程式之間分配。此外，對於執行全表掃描的並行操作，在快取記憶體找到匹配資料的機會相當低，因此快取記憶體幾乎沒有增加什麼價值。

在Oracle 10g以及更早的版本，並行查詢總是使用直接路徑IO，而序列查詢將總是使用快取IO。在11g中，Oracle可以對並行查詢利用快取IO(從11g R2以後的版本支援)，而且序列查詢也可能利用直接路徑IO。然而，並行查詢仍然不太可能利用快取IO，因此，可能比序列查詢需要更高的IO成本。當然，更高的IO成本將在所有並行程式之間共享，這樣整體效能仍可能更勝一籌。

並行的最佳化準則
從最佳化序列執行的SQL開始
一個最理想的並行計劃與最佳的序列計劃可能是有區別的。例如，並行處理通常從表或索引的掃描開始，而最佳序列計劃可能是基於索引查詢開始。然而，你應該確保你的查詢在進行並行最佳化之前先對序列執行進行最佳化，原因如下：
1、序列除錯的結構和方法主要針對索引和統計集合，而這些經常對好的並行除錯也非常關鍵。
2、如果請求並行執行的資源不可用，你的查詢可能被序列化(這取決於“PARALLEL_DEGREE_ POLICY”和“PARALLEL_MIN_PERCENT”的設定)。在這種情況下，你要確保你並行查詢的序列計劃足夠好。
缺少調優的SQL甚至可能變成更差的SQL，至少考慮到對其他使用者的影響時是這樣，這使它被允許消耗資料庫伺服器更多的CPU和IO資源。
在為並行執行最佳化SQL語句時，要從未序列執行SQL最佳化開始。
確保該SQL是合適的並行執行SQL
不是每個SQL都能從並行執行中獲益的。下面是一些例子，這些情況的SQL語句可能不應該被並行化。
1、序列執行時，執行時間很短的SQL語句。
2、可能在多個會話中高併發率執行的SQL語句。
3、基於索引查詢的SQL語句。非並行的索引查詢或者範圍掃描不能被並行化。然而，索引全掃描可以被並行化。在分割槽索引上的索引查詢也可以被並行化。
綜上3點，OLTP型別的查詢通常不適合並行化處理。
確保系統適合配置為並行執行
不是所有的SQL都適合並行執行，也不是所有的資料庫伺服器主機適合配置並行處理。在當今世界，大部分物理伺服器主機都滿足如下最小需求：多塊CPU和跨多個物理驅動器的資料帶。然而，一些虛擬主機可能不滿足這些最小需求，而桌面計算機通常只有唯一的磁碟裝置，因此通常不適合調整為並行執行。
不要嘗試在那些不滿足最小需求(多塊CPU和跨多個磁碟驅動器的資料帶)的計算機系統上使用並行執行。
確保執行計劃的所有部分都被並行化了
在複雜的並行SQL語句中，很重要的一點是要確保該查詢執行的所有重要步驟都實現了並行。如果某複雜查詢的其中一個步驟是序列執行的，其他並行步驟可能也不得不等待該序列步驟完成，這樣並行機制的優勢就完全喪失了。“PLAN_TABLE”表中的“OTHER_TAG”列用 “PARALLEL_FROM_SERIAL”標記指定了這樣一個步驟，“DBMS_XPLAN”在“IN-OUT”列中記錄了“S->P”。例如：在下面的例子中表“T1”是並行化的，但是表“T”不是。對兩個表的連線和“GROUP BY”包括許多並行操作，但是對“T”表的全表掃描不是並行化的，串到並(S->P)標記展示了“t”行被序列提取到後續並行操作中：
SQL> ALTER TABLE customers PARALLEL(DEGREE 4);
SQL> ALTER TABLE sales NOPARALLEL ;
SQL> explain plan for select /*+ ordered use_hash(t1) */ t.name,sum(t.id) from t,t1 where t.id=t1.object_id group by t.name;
SQL> 1* select * from table(dbms_xplan.display)
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 2575143521
----------------------------------------------------------------------------------------------------------------------
| Id | Operation                    | Name     | Rows | Bytes | Cost (%CPU)| Time     |    TQ |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |          |    61 | 4758 |    22 (10)| 00:00:01 |        |      |            |
|   1 | PX COORDINATOR              |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)        | :TQ10002 |    61 | 4758 |    22 (10)| 00:00:01 | Q1,02 | P->S | QC (RAND) |
|   3 |    HASH GROUP BY             |          |    61 | 4758 |    22 (10)| 00:00:01 | Q1,02 | PCWP |            |
|   4 |     PX RECEIVE               |          |    61 | 4758 |    22 (10)| 00:00:01 | Q1,02 | PCWP |            |
|   5 |      PX SEND HASH            | :TQ10001 |    61 | 4758 |    22 (10)| 00:00:01 | Q1,01 | P->P | HASH       |

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   6 |       HASH GROUP BY          |          |    61 | 4758 |    22 (10)| 00:00:01 | Q1,01 | PCWP |            |
|* 7 |        HASH JOIN             |          |    61 | 4758 |    21   (5)| 00:00:01 | Q1,01 | PCWP |            |
|   8 |         BUFFER SORT          |          |       |       |            |          | Q1,01 | PCWC |            |
|   9 |          PX RECEIVE          |          |    61 | 3965 |    12   (0)| 00:00:01 | Q1,01 | PCWP |            |
| 10 |           PX SEND BROADCAST | :TQ10000 |    61 | 3965 |    12   (0)| 00:00:01 |        | S->P | BROADCAST |
| 11 |            TABLE ACCESS FULL | T        |    61 | 3965 |    12   (0)| 00:00:01 |        |      |            |
| 12 |         PX BLOCK ITERATOR    |          | 52078 |   661K|     8   (0)| 00:00:01 | Q1,01 | PCWC |            |
| 13 |          INDEX FAST FULL SCAN| T1_IDX   | 52078 |   661K|     8   (0)| 00:00:01 | Q1,01 | PCWP |            |
----------------------------------------------------------------------------------------------------------------------
像前面這種情況，部分並行化執行計劃可能會導致兩方面效果都很差：消耗的時間並沒有改善，因為序列操作形成了整個執行的瓶頸。然而，該SQL還捆綁了並行伺服器程式，而且可能影響其他併發執行SQL的效能。
如果我們為表“t”設定一個預設的並行度，該序列瓶頸將會消失。對“t”表的全掃描現在是按並行執行了，而且“串到並S->P”瓶頸將被全並行的“併到並P->P”操作替代
在最佳化並行執行計劃時，要確保所有相關步驟都在並行執行：“DBMS_XPLAN”中的串到並S->P 標記或者“PLAN_TABLE”中的“PARALLEL_FROM_SERIAL”通常指示在並行計劃的某些方面存在序列瓶頸。
Monitor parallel execution
SQL> select * from v$pq_sesstat;
STATISTIC                      LAST_QUERY SESSION_TOTAL
------------------------------ ---------- -------------
Queries Parallelized                    1            13
DML Parallelized                        0             0
DDL Parallelized                        0             0
DFO Trees                               1            13
Server Threads                          6             0
Allocation Height                       3             0
Allocation Width                        1             0
Local Msgs Sent                       362        303740
Distr Msgs Sent                         0             0
Local Msgs Recv'd                     368        303782
Distr Msgs Recv'd                       0             0
11 rows selected.
SQL> select DFO_NUMBER, TQ_ID, SERVER_TYPE, NUM_ROWS ,BYTES,process from v$pq_tqstat order by dfo_number , tq_id , server_type;
DFO_NUMBER      TQ_ID SERVER_TYPE       NUM_ROWS      BYTES PROCESS
---------- ---------- --------------- ---------- ---------- --------
         1          0 Consumer             17151     100454 P001
         1          0 Consumer             17242     100969 P002
         1          0 Consumer             17257     101058 P000
         1          0 Producer              1971       9955 P004
         1          0 Producer             29565     174989 P005
         1          0 Producer             20114     117537 P003
         1          1 Consumer               893       8107 P002
         1          1 Consumer              2914      26341 P001
         1          1 Consumer                 0         60 P000
         1          1 Producer               611       5494 P003
         1          1 Producer              2593      23493 P005
         1          1 Producer               603       5521 P004
         1          2 Consumer              3807      19040 QC
         1          2 Producer              2914      14545 P001
         1          2 Producer               893       4475 P002
         1          2 Producer                 0         20 P000
16 rows selected.
Obviously, from the v$pq_tqstat, the DOP is downgrade to 3, and have two slave sets. From the v$tq_sesstat, we can see that what’s the process. How these processes cooperated together to generated the result. In this sample, P003, P004,P005 generate the data and feed to P000,P001,P002, totally twice. Then P000,P001,P002 feed the data to QC, the query coordinator. Also, we can find that P000 at the first time received 17257 rows, but at the second time it received 0 rows, and lastly, it product 0 rows to QC. Why?
Let’s look at the execution plan for further investigation.
SQL> @utlxplp
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------
Plan hash value: 1177066807
-----------------------------------------------------------------------------------------------------------------
| Id | Operation               | Name     | Rows | Bytes | Cost (%CPU)| Time     |    TQ |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          | 3867 |   147K|    13   (8)| 00:00:01 |        |      |            |
|   1 | PX COORDINATOR         |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)   | :TQ10002 | 3867 |   147K|    13   (8)| 00:00:01 | Q1,02 | P->S | QC (RAND) |
|* 3 |    HASH JOIN BUFFERED   |          | 3867 |   147K|    13   (8)| 00:00:01 | Q1,02 | PCWP |            |
|   4 |     PX RECEIVE          |          | 53332 |   677K|     7   (0)| 00:00:01 | Q1,02 | PCWP |            |
|   5 |      PX SEND HASH       | :TQ10000 | 53332 |   677K|     7   (0)| 00:00:01 | Q1,00 | P->P | HASH       |
|   6 |       PX BLOCK ITERATOR |          | 53332 |   677K|     7   (0)| 00:00:01 | Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| PRODUCTS | 53332 |   677K|     7   (0)| 00:00:01 | Q1,00 | PCWP |            |
|   8 |     PX RECEIVE          |          | 3867 |    98K|     5   (0)| 00:00:01 | Q1,02 | PCWP |            |
|   9 |      PX SEND HASH       | :TQ10001 | 3867 |    98K|     5   (0)| 00:00:01 | Q1,01 | P->P | HASH       |
| 10 |       PX BLOCK ITERATOR |          | 3867 |    98K|     5   (0)| 00:00:01 | Q1,01 | PCWC |            |
| 11 |        TABLE ACCESS FULL| COSTS    | 3867 |    98K|     5   (0)| 00:00:01 | Q1,01 | PCWP |            |
-----------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("C"."PROD_ID"="P"."PROD_ID")
This is a hash-hash parallel distribute, using hash joining. firstly, Slave set(Q1,00) scan products table and send the results to slave set(Q1,02), and slave set(Q1,01) scan the costs table and send results to slave set(Q1,02), which slave in slave set(Q1,02) to send to is dependent on the hash value of join key. as the table costs only have two distinct values for prod_id, so the rows in costs only have two target slave to send to, resulting one slave of the slave set (Q1,02) receive 0 rows. After the slaves in (Q1,02) received rows from the other two slave sets, they hash join the rows and send the result to QC, obviously, one slave will have no rows to process.
This is why 0 rows in the above output.
Also, you can enable the trace for parallel execution, through this way:
alter session set events '10132 trace name context forever';

Compute the variance of bytes grouped by TQ_ID. Large variances indicate workload imbalances. You should investigate large variances to determine whether the producers start out with unequal distributions of data, or whether the distribution itself is skewed. If the data itself is skewed, this might indicate a low cardinality, or low number of distinct values.

Some Common Sense

One of my favourite descriptions of performance tuning, althou gh I can’t remember where I first heard it, is that it is based on ‘informed common sense’. That really captures my ow n experiences of performance tu ning. Yes, you need to use proper analysis techniques and often a great deal of technical knowledge, but that’s all devalued if you’re completely missing the point . So let’s take a step away from the technical and consider the big picture.
Don’t even think about implementing Parallel Execution un less you are prepared to invest some time in initial testing, followed by ongoing performance monitoring. If you don’t, you might one day hit performance problems either server-wide or on an individual user session that you’d never believe (until it happens to you).
Parallel Execution is designed to utilise hardware as heavily as possible. If you are running on a single-CPU server with two hard disk drives and 512Mb RAM, don’t expect significant perfor mance improvements just because you switch PX on. The more CPUs , disk drives, controllers and RAM you have installed on your server, the better the results are going to be.
Although you may be able to use Parallel Execution to make an inefficient SQL stat ement run many times faster, that would be incredibly stupid. It’s essential that you tune the SQL first . In the end, doing more work than you should be, but more quickly, is still doing more work than you should be! To put it another way, don’t use PX as a dressing for a poorly designed application. Reduce the wo rkload to the minimum needed to achieve the task and then start using the server facilities to make it run as quickly as possible. Seems obvious, doesn’t it?
If you try to use PX to benefit a large number of user s performing online queries yo u may eventually bring the server to its knees. Well, maybe not if you use th e Adaptive Multi-User algorithm, but then it’s essential that both you and, more important, your users unders tand that response time is going to be very variable when the machine gets busy.
Using PX for a query that runs in a few seconds is pointless. You’re just going to use more resources on the server for very little improvemen t in the run time of the query. It might well run more slowly!
Sometimes when faced with a slow i/o subsystem you migh t find that higher degrees of parallelism are useful because the CPUs are spending more ti me waiting for i/o to complete. Ther efore they are more likely to be available for another PX slave (that isn’t waiting on i/o) to use. This was certainly my experience at one site. However, it’s also true that using PX will usually lead to a busier i/o subsystem because the server is likely to favour full scans over indexed retrieva l. There are no easy answers here - you really need to carry out some analysis of overall system resource usage to identify where the bottlenecks are an d adjust the configuration accordingly.
Consider whether PX is the correct parallel solution for overnight batch operations. It may be that you can achieve better performance using multip le streams of jobs, each single-threa ded, or maybe you would be better with one stream of jobs which uses PX. It depends on your application so the only sure way to find out is to try the different approaches .

BTW, introduce two useful link about parallel execution.

Oracle Database Parallel Execution Fundamental

http://www.oracle.com/technetwork/articles/datawarehouse/twp-parallel-execution-fundamentals-133639.pdf

How Parallel Execution Works

參考至:《讓Oracle跑得更快》譚懷遠著
           《讓Oracle跑得更快2》譚懷遠著
           %E6%95%B0%E6%8D%AE%E5%BA%93/137.html
           http://www.cnblogs.com/daduxiong/archive/2010/08/24/1807427.html


           http://blog.csdn.net/tianlesoftware/article/details/5854583

           http://space.itpub.net/17203031/viewspace-696389
           http://os2ora.com/monitor-parallel-execution/

           http://www.oracle.com/technetwork/articles/datawarehouse/twp-parallel-execution-fundamentals-133639.pdf


           http://www.cnblogs.com/luoyx/archive/2011/12/26/2302587.html
本文原創，轉載請註明出處、作者
如有錯誤，歡迎指正
郵箱:czmcj@163.com

來自czmmiao深入理解Oracle的並行操作

相關文章