A UTL_FILE problem encountered at work

Posted by abstractcyj on 2013-03-11

2012-03-28 SSO2 DATA Loading Failure Part 2

General Information

Case provider: Sidney Chen          Case Type: Trouble-Shooting
DB Version: 11.2.0.2                First occur time: 2012/Mar/18
OS: Linux                           First occur project: SSM
RAC: Yes                            Environment Type: Production
SQL_ID:                             Application: SSO2 Data Loading

Symptom

Since Mar 19th, the SSO2 data loading has been failing with an ORA-4030 error. When the loading fails, the physical free memory of the db server hklp152p is exhausted by process 13225, which is the server process for session 181. Session 181 is running the PL/SQL function SS_CITY_UTIL.SAVE_IB_OB_CITY_HUB_FILE and has eaten up 4G of memory.

Errors in file /opt/oracle/diag/rdbms/ssmdev/ssmdev/trace/ssmdev_ora_29771.trc  (incident=15891):
ORA-04030: out of process memory when trying to allocate 16328 bytes (koh-kghu sessi,pl/sql vc2)
ORA-29282: invalid file ID
Incident details in: /opt/oracle/diag/rdbms/ssmdev/ssmdev/incident/incdir_15891/ssmdev_ora_29771_i15891.trc

launa-root@hklp152p:/root$free -m
             total       used       free     shared    buffers     cached
Mem:         16050      15960         90          0        126       3777
-/+ buffers/cache:      12056       3994
Swap:         4095          0       4095

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
13225 oracle    15   0 10.2g 4.0g  51m S  0.0 25.8  29:05.39 oracle
 1278 root      17   0 1602m 308m 9508 S  0.0  1.9  37:45.32 java
12987 ssmdbjob  17   0  266m 258m  976 R 53.6  1.6 364:29.86 SlInsCityHubLoa

ssmsupp@SSMPRD> @ses 181 mem

       SID NAME                                VALUE
---------- ------------------------------ ----------
       181 session uga memory             4281132768
       181 session uga memory max         4281656888
       181 session pga memory             4285223864
       181 session pga memory max         4288500664

ssmsupp@SSMPRD> @spid 13225

SPID              SID    SERIAL# USERNAME
---------- ---------- ---------- ------------
13225             181      22507 SSMWSPLSQL

declare
v_out number;
begin
ssmplsql.UPDATE_CS2_MASTER_CITY_PKG.CLEAR_HAS_SS;
dbms_output.put_line('[UPDATE_CS2_MASTER_CITY_PKG.CLEAR_HAS_SS] Clear HAS_SS completed.');
SS_CITY_UTIL.SAVE_IB_OB_CITY_HUB_FILE('IBOB_CITY','CSS_SS_ALL_CITY_HUB.data', 'IB_OB_CITY_REJECT_LOG.data', 'CSS_SS_REJECT_LOG.data','CSS_SS_REJECT_CARRIER_CITY.data');
dbms_output.put_line('[SS_CITY_UTIL.SAVE_IB_OB_CITY_HUB_FILE] Handle CSS_IB/OB_CITY_HUB data file completed.');
end;
/

Impact

Users can use the SSM online function, but may not see the latest SSM data.

Analysis

The procedure for the data loading is basically as below; the process failed at step 3. Before the function SS_CITY_UTIL.SAVE_IB_OB_CITY_HUB_FILE is called, the OOCL and COSCON data files are combined into a single temporary file and processed together.

1. Combine the OOCL and COSCON data files into a single temporary file.
2. Update CS2_MASTER_CITY.HAS_SS.
3. Call the SS_CITY_UTIL.SAVE_IB_OB_CITY_HUB_FILE function: for each IB/OB record, find the city ID and parent city ID, then insert into the tables css_Ib_city_hub and css_Ob_city_hub.
4. Remove duplicate records by carr_scac, rte_key, door_ind, then merge records by door_ind (after the IB/OB records are loaded successfully and indexes are created).
5. Update CS2_MASTER_CITY (after the IB/OB records are loaded successfully and indexes are created). Refer to ssmplsql.update_cs2_master_city_pkg.main: clear CS2_MASTER_CITY.HAS_SS => update HAS_SS according to SSC_B2B_VOYAGE and SSC_B2B_ROUTE => generate trimmed city names for city auto-complete.
6. Delete duplicates from CSS_FULL_RTE_LOAD.
7. When each shell script exits, check for any ORA errors; if any are found, abort the whole process.
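Step 7 can be sketched in shell. This is a hedged illustration only: the function name check_ora_errors and the log paths are hypothetical, not the actual SlDbloader.sh code.

```shell
# Hypothetical sketch of step 7: scan a loader step's log for ORA- errors
# and signal the caller to abort. Names and paths are illustrative.
check_ora_errors()
{
        if grep -q 'ORA-' "$1"; then
                echo "ORA error found in $1, aborting whole process" >&2
                return 1
        fi
        return 0
}

printf 'rows loaded: 1000\n' > /tmp/demo_loader.log
check_ora_errors /tmp/demo_loader.log && echo "step OK"

printf 'ORA-04030: out of process memory\n' >> /tmp/demo_loader.log
check_ora_errors /tmp/demo_loader.log || echo "step failed, abort"
```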

Guesswork

1. The COSCON data size increased.
On Mar 19th, when the first SSO2 data loading failed, the size of the data file from COSCON had increased from several megabytes to around 40 megabytes, which might be the root cause. So we reverted to the previously successfully loaded COSCON data, combined it with the latest OOCL data file, and tried again; the data loading process still failed.

2. Kernel parameters.

On the advice of Raymond@ORASUPP, the ORA-04030 symptom matches the description in Oracle Support note 1325100.1, which recommends changing the kernel parameters kernel.shmall and vm.max_map_count. Without the change, the memory usage limit for an Oracle session is 4G, and it is possible that the data loading session needs to allocate more than 4G of memory and fails at the 4G limit. We arranged weekend downtime and changed the two kernel parameters. The SSO2 data loading still failed with ORA-04030 after the kernel parameter change.

sysctl -w kernel.shmall=4194304
sysctl -w vm.max_map_count=200000

Further Investigation

The last successful data loading was on Mar 18th. To reproduce the case, we tried different data files on the local ssmdev db, and the behaviour is the same as on production.

1. OOCL DATA (Mar 18th) + COSCON DATA (Mar 18th): success
2. OOCL DATA (Mar 25th) + COSCON DATA (Mar 18th): failed
3. OOCL DATA (Mar 25th) + COSCON DATA (Mar 25th): failed

Changing the OOCL data from Mar 18th to Mar 25th makes the data loading fail, so we suspect something changed in the OOCL data files. The file size did not change noticeably, but the max line size of the data files shows an increasing trend.

Successful loading, Mar 18th: the max line size is 32505.

[oracle@ssmlqa-zha data]$ wc -L CSS_SS_IB_CITY_HUB.data
28794 CSS_SS_IB_CITY_HUB.data
[oracle@ssmlqa-zha data]$ wc -L CSS_SS_OB_CITY_HUB.data
32505 CSS_SS_OB_CITY_HUB.data

Failed loading, Mar 19th: the max line size is 32799.

oracle@cargosmart:~/SSO2/data$ wc -L CSS_SS_IB_CITY_HUB.data && wc -L CSS_SS_OB_CITY_HUB.data
28752 CSS_SS_IB_CITY_HUB.data
32799 CSS_SS_OB_CITY_HUB.data

Failed loading, Mar 25th: the max line size is 37573.

[oracle@ssmlqa-zha data]$ wc -L CSS_SS_IB_CITY_HUB.data && wc -L CSS_SS_OB_CITY_HUB.data
31468 CSS_SS_IB_CITY_HUB.data
37573 CSS_SS_OB_CITY_HUB.data
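The same check can be automated: before loading, count how many lines in a data file are at or above UTL_FILE's 32767-byte line limit. A minimal sketch on a generated sample (the file name sample.data is illustrative):

```shell
# Build a tiny sample: one short line and one oversized (40000-char) line
{ echo "short|k1,k2"; head -c 40000 /dev/zero | tr '\0' 'x'; echo; } > sample.data

# Flag lines at or above the 32767-byte UTL_FILE line limit
awk 'length($0) >= 32767 { n++ } END { printf "%d oversized line(s)\n", n }' sample.data
# prints: 1 oversized line(s)
```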

The data file is processed by the function SS_CITY_UTIL.SAVE_IB_OB_CITY_HUB_FILE, so we went through the code of this specific function. When I came across the lines below, it crossed my mind: is it possible there is an exception or buffer overflow when the line size is greater than 32K?

...
v_str_buffer          VARCHAR2(32767);
...
v_utl_file_city_hub   := utl_file.fopen(p_directory,
                                        p_ss_city_hub_file,
                                        'r',
                                        32767);
...
utl_file.get_line(v_utl_file_city_hub, v_str_buffer);
...

So we created a data file with a single line exceeding 32K, and the issue was reproduced.

Root cause

It turns out that the application can't handle data files containing a line exceeding 32K, and a bug then leads to the ORA-4030 out-of-memory error. The Oracle utl_file.get_line interface can fetch at most 32K bytes at a time; if a line exceeds 32K, the call to utl_file.get_line raises the utl_file.read_error exception. The exception is caught as expected, and the data file is closed at the same time. The problem is that the program does not exit when the exception is caught: it goes on to process the next line, but the file has already been closed. Reading the closed file raises again on every iteration, producing an endless loop whose allocations pile up, the so-called UGA memory leak.

Actually, the line size in the OOCL data did not change suddenly; the max line size of previously successfully loaded files was already approaching 32K. Starting from Mar 19th, it just crossed the 32K threshold and triggered the application bug.

Reproduce

Here is the procedure to reproduce the ORA-04030:
1. Prepare a file test.data containing only one line; the line size is 37769, which is over 32K.
2. Query the free memory available on the OS before testing.
3. Prepare a session 36 and create a directory object utl_dir.
4. Take a snapshot of the uga/pga of session 36 before testing.
5. In session 36, run the PL/SQL to simulate the problem. Session 36 hangs as expected.
6. Query the uga/pga of session 36 in another session.
7. Query OS CPU activity and memory usage.

1. Prepare a file test.data containing only one line; the line size is 37769, which is over 32K.

oracle@cargosmart:~/utl$ wc -l test.data
1 test.data
oracle@cargosmart:~/utl$ wc -L test.data
37769 test.data
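For reference, a one-line file like test.data above can be generated with standard tools (the 37769-character size mirrors the test file; the contents are arbitrary):

```shell
# Generate a single 37769-character line, over the 32767-byte UTL_FILE limit,
# followed by one newline so wc -l reports exactly 1 line.
head -c 37769 /dev/zero | tr '\0' 'A' > test.data
echo >> test.data

wc -l test.data   # 1 test.data
wc -L test.data   # 37769 test.data  (wc -L: length of the longest line)
```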

2. Query the free memory available on the OS. There is 4G of memory on the server: 1.6G free and 646M in the page cache. Under memory pressure, the page cache will be shrunk and the memory used as process memory.

oracle@cargosmart:~/scripts$ free -m
             total       used       free     shared    buffers     cached
Mem:          3925       2251       1674          0         98        646
-/+ buffers/cache:       1506       2419
Swap:         1992        717       1275

3. Prepare a session 36 and create a directory object utl_dir.

sid@CS11GR2> CREATE DIRECTORY utl_dir AS '/home/oracle/utl';

Directory created.

sid@CS11GR2> select sid from v$mystat where rownum=1;

       SID
----------
        36

4. Take a snapshot of the uga/pga of session 36 before testing.

sid@CS11GR2> @ses 36 session%mem
       SID NAME                                VALUE
---------- ------------------------------ ----------
        36 session uga memory                 241748
        36 session uga memory max             438284
        36 session pga memory                1007660
        36 session pga memory max            1007660

5. In session 36, run the PL/SQL to simulate the problem. Session 36 hangs as expected and burns CPU.

declare
f    varchar2(10) := 'test.data';
h    utl_file.file_type := utl_file.fopen('utl_dir', f, 'r', 32767);
buf  varchar2(32767);
BEGIN
    LOOP
        BEGIN
            BEGIN
                utl_file.get_line(h, buf);

                exception
                when no_data_found then
                    exit;
                when utl_file.read_error then
                    utl_file.fclose_all;
                    dbms_output.put_line('read error : ' || sqlerrm);
                    raise;    -- The read_error exception is raised here!
                when others then
                    utl_file.fclose_all;
                    dbms_output.put_line('others2 : ' || sqlerrm);
                    raise;
            end;
            exception
            when others then -- The read_error exception is caught here without exit.
                dbms_output.put_line('others2 : ' || sqlerrm);
        end;
    end loop;
    utl_file.fclose(h);
end;
/

6. Query the uga/pga of session 36 in another session. The uga/pga keeps climbing.

--The UGA usage climbs to 140M
sid@CS11GR2> @ses 36 session%mem

       SID NAME                                VALUE
---------- ------------------------------ ----------
        36 session uga memory              141289084
        36 session uga memory max          141289084
        36 session pga memory              141844524
        36 session pga memory max          141844524
        36 cell num smart IO sessions in           0

--The UGA usage climbs to 330M
sid@CS11GR2> /

       SID NAME                                VALUE
---------- ------------------------------ ----------
        36 session uga memory              333239244
        36 session uga memory max          333239244
        36 session pga memory              333865004
        36 session pga memory max          333865004

--The UGA usage climbs to 2.4G
sid@CS11GR2> /

       SID NAME                                VALUE
---------- ------------------------------ ----------
        36 session uga memory             2497428164
        36 session uga memory max         2497428164
        36 session pga memory             2498846764
        36 session pga memory max         2498846764

7. Query OS CPU activity and memory usage. The physical free memory dropped from 1.6G to 110M, and the page cache shrank from 646M to 361M. From the top output, process 31063 is the server process of session 36; it has already consumed 2G of memory and is looping on the getrusage syscall, burning CPU.

oracle@cargosmart:~/scripts$ free -m
             total       used       free     shared    buffers     cached
Mem:          3925       3815        110          0          4        361
-/+ buffers/cache:       3449        476
Swap:         1992       1009        983

oracle@cargosmart:~/scripts$ top -d 5
top - 13:58:56 up 99 days, 19:24,  6 users,  load average: 1.73, 1.25, 0.73
Tasks: 270 total,   2 running, 264 sleeping,   0 stopped,   4 zombie
Cpu0  : 96.0%us,  4.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.8%us,  3.2%sy,  0.0%ni,  0.0%id, 96.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   4020164k total,  3906520k used,   113644k free,     4032k buffers
Swap:  2040212k total,  1054512k used,   985700k free,   360780k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
31063 oracle    20   0 2912m 2.3g  22m R  101 59.3  11:44.25 oraclecs11gR2 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
   33 root      15  -5     0    0    0 D    2  0.0  29:04.40 [kswapd0]

8. Finally, session 36 aborts with ORA-04030 out of memory, which is the same behavior as in the production issue.

sid@CS11GR2> declare
  2  f    varchar2(10) := 'test.data';
  3  h    utl_file.file_type := utl_file.fopen('UTL_DIR', f, 'r', 32767);
  4  buf  varchar2(32767);
  5  BEGIN
  6          LOOP
  7                  BEGIN
  8                          BEGIN
  9                                  utl_file.get_line(h, buf);
 10
 11                                  exception
 12                                  when no_data_found then
 13                                          exit;
 14                                  when utl_file.read_error then
 15                                          utl_file.fclose_all;
 16                                          dbms_output.put_line('read error : ' || sqlerrm);
 17                                          raise;
 18                                  when others then
 19                                          utl_file.fclose_all;
 20                                          dbms_output.put_line('others2 : ' || sqlerrm);
 21                                          raise;
 22                          end;
 23                          exception
 24                          when others then
 25                                  dbms_output.put_line('others2 : ' || sqlerrm);
 26                  end;
 27          end loop;
 28          utl_file.fclose(h);
 29  end;
 30  /
read error : ORA-29284: file read error
others2 : ORA-29284: file read error
others2 : ORA-29282: invalid file ID
others2 : ORA-29282: invalid file ID
others2 : ORA-29282: invalid file ID
others2 : ORA-29282: invalid file ID
others2 : ORA-29282: invalid file ID
others2 : ORA-29282: invalid file ID
others2 : ORA-29282: invalid file ID
others2 : ORA-29282: invalid file ID
others2 : ORA-29282: invalid file ID
others2 : ORA-29282: invalid file ID
others2 : ORA-29282: invalid file ID
others2 : ORA-29282: invalid file ID
others2 : ORA-29282: invalid file ID
declare
*
ERROR at line 1:
ORA-04030: out of process memory when trying to allocate 16356 bytes (koh-kghu sessi,pl/sql vc2)
ORA-29282: invalid file ID

Solution

To work around the application bug, we added a pre-processing step to SlDbloader.sh that splits lines exceeding 32K, so the subsequent call to the function SS_CITY_UTIL.SAVE_IB_OB_CITY_HUB_FILE will never encounter a line over 32K.

Here is the change to SlDbloader.sh. The data is separated at the 32K threshold, and each line exceeding 32K is split by the function split_route_keys.

split_route_keys()
{
cat $1 | while read line
      do
        if [ -z "$line" ]; then
                exit;
        fi
        CITY_INFO=`echo $line | sed 's/\(.*|\)\(.*\)/\1/'`
        ALL_KEYS=`echo $line | sed 's/\(.*|\)\(.*\)/\2/'`
        echo $ALL_KEYS | awk -v var="$CITY_INFO" '{
                max=split($0, a, ",");
                #print a[1],a[2]
                i=1
                str = "";
                while ( i <= max )
                {
                        str = sprintf("%s,%s",str,a[i]);
                        i = i + 1;
                        if ( i%1000 == 0) {
                                print var substr(str,2);
                                str = "";
                        }
                }
                print var substr(str,2);
        }'
done
}

cat $TMP/CSS_SS_ALL_CITY_HUB.data.temp | awk '{if( length($0) >= 32767) print $0}' >> $TMP/CSS_SS_ALL_CITY_HUB.data.temp.invalid
cat $TMP/CSS_SS_ALL_CITY_HUB.data.temp | awk '{if( length($0) < 32767) print $0}' >> $TMP/CSS_SS_ALL_CITY_HUB.data.temp.valid

split_route_keys $TMP/CSS_SS_ALL_CITY_HUB.data.temp.invalid > $TMP/CSS_SS_ALL_CITY_HUB.data.temp.invalid_revised

cat $TMP/CSS_SS_ALL_CITY_HUB.data.temp.invalid_revised >> $TMP/CSS_SS_ALL_CITY_HUB.data.temp.valid

rm -r $TMP/CSS_SS_ALL_CITY_HUB.data.temp
rm -r $TMP/CSS_SS_ALL_CITY_HUB.data.temp.invalid
rm -r $TMP/CSS_SS_ALL_CITY_HUB.data.temp.invalid_revised

fmove $TMP/CSS_SS_ALL_CITY_HUB.data.temp.valid $TMP/CSS_SS_ALL_CITY_HUB.data
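The key awk logic in split_route_keys can be seen on a toy input. This is a slightly simplified rendering of the awk block, using a chunk size of 3 instead of 1000 so the splitting is visible on a short line; the input value and city name are made up for illustration. As in the real script, the city prefix is repeated on every output line.

```shell
# Toy run of the split logic: chunk size 3 instead of 1000.
# Input format mirrors the data files: CITY_INFO|key1,key2,...
line='CITY_A|k1,k2,k3,k4,k5,k6,k7'
CITY_INFO="${line%%|*}|"       # everything up to and including the '|'
ALL_KEYS="${line#*|}"          # the comma-separated route keys

echo "$ALL_KEYS" | awk -v var="$CITY_INFO" '{
        max = split($0, a, ",");
        str = "";
        for (i = 1; i <= max; i++) {
                str = sprintf("%s,%s", str, a[i]);
                if (i % 3 == 0) { print var substr(str, 2); str = ""; }
        }
        if (str != "") print var substr(str, 2);
}'
# prints:
# CITY_A|k1,k2,k3
# CITY_A|k4,k5,k6
# CITY_A|k7
```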

From the ITPUB blog: http://blog.itpub.net/8520577/viewspace-755737/. If reposting, please credit the source; otherwise legal liability may be pursued.
