2012-03-28 SSO2 DATA Loading Failure Part 2¶
General Information¶
Case provider: | Sidney Chen | Case Type: | Trouble-Shooting |
DB Version: | 11.2.0.2 | First occur time: | 2012/Mar/18 |
OS: | Linux | First occur project: | SSM |
RAC: | Yes | Environment Type: | Production |
SQL_ID: | Application: | SSO2 Data Loading |
Symptom¶
Since Mar 19th, SSO2 data loading has failed with the ORA-4030 error. When the loading failed, the physical free memory on the DB server hklp152p was exhausted by process 13225, the server process for session 181. Session 181 was running the PL/SQL function SS_CITY_UTIL.SAVE_IB_OB_CITY_HUB_FILE and had eaten up 4G of memory.
Errors in file /opt/oracle/diag/rdbms/ssmdev/ssmdev/trace/ssmdev_ora_29771.trc (incident=15891):
ORA-04030: out of process memory when trying to allocate 16328 bytes (koh-kghu sessi,pl/sql vc2)
ORA-29282: invalid file ID
Incident details in: /opt/oracle/diag/rdbms/ssmdev/ssmdev/incident/incdir_15891/ssmdev_ora_29771_i15891.trc

launa-root@hklp152p:/root$ free -m
             total       used       free     shared    buffers     cached
Mem:         16050      15960         90          0        126       3777
-/+ buffers/cache:      12056       3994
Swap:         4095          0       4095

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
13225 oracle    15   0 10.2g 4.0g  51m S  0.0 25.8  29:05.39 oracle
 1278 root      17   0 1602m 308m 9508 S  0.0  1.9  37:45.32 java
12987 ssmdbjob  17   0  266m 258m  976 R 53.6  1.6 364:29.86 SlInsCityHubLoa

ssmsupp@SSMPRD> @ses 181 mem

       SID NAME                                VALUE
---------- ------------------------------ ----------
       181 session uga memory             4281132768
       181 session uga memory max         4281656888
       181 session pga memory             4285223864
       181 session pga memory max         4288500664

ssmsupp@SSMPRD> @spid 13225

SPID              SID    SERIAL# USERNAME
---------- ---------- ---------- ------------
13225             181      22507 SSMWSPLSQL

declare
  v_out number;
begin
  ssmplsql.UPDATE_CS2_MASTER_CITY_PKG.CLEAR_HAS_SS;
  dbms_output.put_line('[UPDATE_CS2_MASTER_CITY_PKG.CLEAR_HAS_SS] Clear HAS_SS completed.');
  SS_CITY_UTIL.SAVE_IB_OB_CITY_HUB_FILE('IBOB_CITY', 'CSS_SS_ALL_CITY_HUB.data', 'IB_OB_CITY_REJECT_LOG.data', 'CSS_SS_REJECT_LOG.data', 'CSS_SS_REJECT_CARRIER_CITY.data');
  dbms_output.put_line('[SS_CITY_UTIL.SAVE_IB_OB_CITY_HUB_FILE] Handle CSS_IB/OB_CITY_HUB data file completed.');
end;
/
Impact¶
Users can still use the SSM online function, but may not see the latest SSM data.
Analysis¶
The procedure for the data loading is basically as below; the process failed at step 3. Before calling the function SS_CITY_UTIL.SAVE_IB_OB_CITY_HUB_FILE, the OOCL and COSCON data files are combined into a single temporary file and processed together.
1. Combine the OOCL and COSCON data files into a single temporary file.
2. Update CS2_MASTER_CITY.HAS_SS.
3. Call the SS_CITY_UTIL.SAVE_IB_OB_CITY_HUB_FILE function: for each IB/OB record, find the city ID and parent city ID, and insert them into the tables css_Ib_city_hub and css_Ob_city_hub.
4. Remove duplicate records by carr_scac, rte_key, door_ind, then merge records by door_ind (after the IB/OB records are loaded successfully and indexes are created).
5. Update CS2_MASTER_CITY (after the IB/OB records are loaded successfully and indexes are created). Refer to ssmplsql.update_cs2_master_city_pkg.main: clear CS2_MASTER_CITY.HAS_SS => update HAS_SS according to SSC_B2B_VOYAGE and SSC_B2B_ROUTE => generate trimmed city names for the city auto-complete.
6. Delete duplicates from CSS_FULL_RTE_LOAD.
7. When each shell script exits, check for any ORA errors; if any are found, abort the whole process.
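Step 7 above can be sketched as a small post-check in shell; the log contents and temp-file name here are made up for illustration, not taken from the actual loader:

```shell
#!/bin/sh
# Hedged sketch of step 7: scan a loader log for ORA- errors and signal
# the caller to abort the whole process. The log content is simulated.
log=$(mktemp)
printf 'loading...\nORA-04030: out of process memory\n' > "$log"

if grep -q 'ORA-' "$log"; then
    echo "abort"    # the real script would exit non-zero here
fi
rm -f "$log"
```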
Guesswork¶
1. The COSCON data size increased.
On Mar 19th, when the first SSO2 data loading failed, the data file size from COSCON had increased from several megabytes to around 40 megabytes, which might have been the root cause. So we reverted to the previous successfully loaded COSCON data, combined it with the latest OOCL data file, and tried again; the data loading still failed.
2. Kernel parameters.
On the advice of Raymond@ORASUPP, the ORA-04030 symptom matches the description in note 1325100.1, which recommends changing the kernel parameters kernel.shmall and vm.max_map_count. Without the change, the memory usage limit for an Oracle session is 4G; it is possible that the data loading session needed to allocate more than 4G of memory and failed on the 4G limit. We arranged weekend downtime and changed the two kernel parameters. The SSO2 data loading still failed with ORA-04030 after the kernel parameter change.
sysctl -w kernel.shmall=4194304
sysctl -w vm.max_map_count=200000
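As a sanity check on the 4G figure: the usual reading of note 1325100.1 is that the default vm.max_map_count of 65530, combined with Oracle allocating PGA in 64KB mmap chunks, caps a single process at roughly 4G. The arithmetic below is our interpretation, not a quote from the note:

```shell
#!/bin/sh
# 65530 maps (default vm.max_map_count) * 64KB per map, expressed in MB:
echo $(( 65530 * 64 / 1024 ))   # prints 4095, i.e. just under 4G
```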
Further Investigation¶
The last successful data loading was on Mar 18th. To reproduce the case, we tried different data file combinations on the local ssmdev DB; the behaviour was the same as on production.
1. OOCL DATA (Mar 18th) + COSCON DATA (Mar 18th): success
2. OOCL DATA (Mar 25th) + COSCON DATA (Mar 18th): failed
3. OOCL DATA (Mar 25th) + COSCON DATA (Mar 25th): failed
Changing the OOCL data from Mar 18th to Mar 25th made the data loading fail, so we suspected something had changed in the OOCL data files. The file size did not change noticeably, but the maximum line size in the data files shows an increasing trend.
Successful loading, Mar 18th: the max line size is 32505.
[oracle@ssmlqa-zha data]$ wc -L CSS_SS_IB_CITY_HUB.data
28794 CSS_SS_IB_CITY_HUB.data
[oracle@ssmlqa-zha data]$ wc -L CSS_SS_OB_CITY_HUB.data
32505 CSS_SS_OB_CITY_HUB.data
Failed loading, Mar 19th: the max line size is 32799.
oracle@cargosmart:~/SSO2/data$ wc -L CSS_SS_IB_CITY_HUB.data && wc -L CSS_SS_OB_CITY_HUB.data
28752 CSS_SS_IB_CITY_HUB.data
32799 CSS_SS_OB_CITY_HUB.data
Failed loading, Mar 25th: the max line size is 37573.
[oracle@ssmlqa-zha data]$ wc -L CSS_SS_IB_CITY_HUB.data && wc -L CSS_SS_OB_CITY_HUB.data
31468 CSS_SS_IB_CITY_HUB.data
37573 CSS_SS_OB_CITY_HUB.data
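The same kind of check can be scripted. A minimal sketch, run against a generated throwaway file rather than the real data, that reports the max line length and which line numbers exceed the 32767-byte limit:

```shell
#!/bin/sh
# Build a throwaway two-line file: one short line, one 40000-byte line.
f=$(mktemp)
printf 'A%.0s' $(seq 1 100)   >> "$f"; echo >> "$f"
printf 'B%.0s' $(seq 1 40000) >> "$f"; echo >> "$f"

wc -L < "$f"                                 # max line length: 40000
awk 'length($0) > 32767 { print NR }' "$f"   # offending line numbers: 2
rm -f "$f"
```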
The data file is processed by the function SS_CITY_UTIL.SAVE_IB_OB_CITY_HUB_FILE, so we went through the code of that specific function. On encountering the lines below, it crossed my mind: is it possible that an exception or buffer overflow occurs when the line size is greater than 32K?
...
v_str_buffer VARCHAR2(32767);
...
v_utl_file_city_hub := utl_file.fopen(p_directory,
                                      p_ss_city_hub_file,
                                      'r',
                                      32767);
...
utl_file.get_line(v_utl_file_city_hub, v_str_buffer);
...
So we created a data file with a single line exceeding 32K, and the issue was reproduced.
Root cause¶
It turns out that the application cannot handle data files containing a line exceeding 32K; a bug then causes the ORA-4030 out-of-memory error. The Oracle utl_file.get_line interface can only fetch 32K bytes at a time. If one line exceeds 32K, the call to utl_file.get_line raises the utl_file.read_error exception. The exception is caught as expected, and the data file is closed at the same time. The problem is that the program does not exit when the exception is caught; it goes on to process the next line, but the file has already been closed. Reading from the closed file causes an endless loop whose per-iteration buffer allocations are never released, in effect a UGA memory leak.
Actually, the line size in the OOCL data did not change suddenly; the max line size of previously successfully loaded files had already been approaching 32K. Starting from Mar 19th, it simply crossed the 32K threshold and triggered the application bug.
Reproduce¶
Here is the procedure to reproduce the ORA-04030:
1. Prepare a file test.data containing only one line; the line size is 37769, which is over 32K.
2. Query the free memory available on the OS before testing.
3. Prepare a session 36 and create a directory object utl_dir.
4. Take a snapshot of the uga/pga of session 36 before testing.
5. In session 36, run the PL/SQL to simulate the problem. Session 36 hangs as expected.
6. Query the uga/pga of session 36 from another session.
7. Query OS CPU activity and memory usage.
1. Prepare a file test.data containing only one line; the line size is 37769, which is over 32K.
oracle@cargosmart:~/utl$ wc -l test.data
1 test.data
oracle@cargosmart:~/utl$ wc -L test.data
37769 test.data
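A file like this can be generated directly. A sketch, where 37769 is the size used above but the construction itself is ours:

```shell
#!/bin/sh
# Emit a single line of 37769 'X' characters (doubling a string until it
# is long enough, then trimming), then verify with wc -l / wc -L.
size=37769
f=$(mktemp)
awk -v n="$size" 'BEGIN { s = "X"; while (length(s) < n) s = s s;
                          print substr(s, 1, n) }' > "$f"
wc -l < "$f"   # 1
wc -L < "$f"   # 37769
rm -f "$f"
```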
2. Query the free memory available on the OS. There is 4G of memory on the server: 1.6G free and 646M used for page cache. Under memory pressure, the page cache will be shrunk and the memory used as process memory.
oracle@cargosmart:~/scripts$ free -m
             total       used       free     shared    buffers     cached
Mem:          3925       2251       1674          0         98        646
-/+ buffers/cache:       1506       2419
Swap:         1992        717       1275
3. Prepare a session 36, create a directory object utl_dir.
sid@CS11GR2> CREATE DIRECTORY utl_dir AS '/home/oracle/utl';

Directory created.

sid@CS11GR2> select sid from v$mystat where rownum=1;

       SID
----------
        36
4. Take a snapshot of uga/pga of session 36, before testing.
sid@CS11GR2> @ses 36 session%mem

       SID NAME                                VALUE
---------- ------------------------------ ----------
        36 session uga memory                 241748
        36 session uga memory max             438284
        36 session pga memory                1007660
        36 session pga memory max            1007660
5. In session 36, run the PL/SQL to simulate the problem. Session 36 hangs as expected and burns CPU.
declare
  f   varchar2(10) := 'test.data';
  h   utl_file.file_type := utl_file.fopen('utl_dir', f, 'r', 32767);
  buf varchar2(32767);
BEGIN
  LOOP
    BEGIN
      BEGIN
        utl_file.get_line(h, buf);
      exception
        when no_data_found then
          exit;
        when utl_file.read_error then
          utl_file.fclose_all;
          dbms_output.put_line('read error : ' || sqlerrm);
          raise; -- The read_error exception is raised here!
        when others then
          utl_file.fclose_all;
          dbms_output.put_line('others2 : ' || sqlerrm);
          raise;
      end;
    exception
      when others then -- The read_error exception is caught here without exit.
        dbms_output.put_line('others2 : ' || sqlerrm);
    end;
  end loop;
  utl_file.fclose(h);
end;
/
6. Query the uga/pga of session 36 from another session. The uga/pga keeps climbing.
--The UGA usage climbs to 140M
sid@CS11GR2> @ses 36 session%mem

       SID NAME                                VALUE
---------- ------------------------------ ----------
        36 session uga memory              141289084
        36 session uga memory max          141289084
        36 session pga memory              141844524
        36 session pga memory max          141844524
        36 cell num smart IO sessions in           0

--The UGA usage climbs to 330M
sid@CS11GR2> /

       SID NAME                                VALUE
---------- ------------------------------ ----------
        36 session uga memory              333239244
        36 session uga memory max          333239244
        36 session pga memory              333865004
        36 session pga memory max          333865004

--The UGA usage climbs to 2.4G
sid@CS11GR2> /

       SID NAME                                VALUE
---------- ------------------------------ ----------
        36 session uga memory             2497428164
        36 session uga memory max         2497428164
        36 session pga memory             2498846764
        36 session pga memory max         2498846764
7. Query OS CPU activity and memory usage. The physical free memory is reduced from 1.6G to 110M, and the page cache from 646M to 361M. From the top output, process 31063 is the server process of session 36; it has already consumed 2.3G of memory and is looping on the getrusage syscall, burning CPU.
oracle@cargosmart:~/scripts$ free -m
             total       used       free     shared    buffers     cached
Mem:          3925       3815        110          0          4        361
-/+ buffers/cache:       3449        476
Swap:         1992       1009        983

oracle@cargosmart:~/scripts$ top -d 5
top - 13:58:56 up 99 days, 19:24,  6 users,  load average: 1.73, 1.25, 0.73
Tasks: 270 total,   2 running, 264 sleeping,   0 stopped,   4 zombie
Cpu0  : 96.0%us,  4.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.8%us,  3.2%sy,  0.0%ni,  0.0%id, 96.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   4020164k total,  3906520k used,   113644k free,     4032k buffers
Swap:  2040212k total,  1054512k used,   985700k free,   360780k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
31063 oracle    20   0 2912m 2.3g  22m R  101 59.3  11:44.25 oraclecs11gR2 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
   33 root      15  -5     0    0    0 D    2  0.0  29:04.40 [kswapd0]
8. Finally, session 36 aborts with ORA-04030 out of memory, which is the same behaviour as the production issue.
sid@CS11GR2> declare
  2  f varchar2(10) := 'test.data';
  3  h utl_file.file_type := utl_file.fopen('UTL_DIR', f, 'r', 32767);
  4  buf varchar2(32767);
  5  BEGIN
  6    LOOP
  7      BEGIN
  8        BEGIN
  9          utl_file.get_line(h, buf);
 10
 11        exception
 12          when no_data_found then
 13            exit;
 14          when utl_file.read_error then
 15            utl_file.fclose_all;
 16            dbms_output.put_line('read error : ' || sqlerrm);
 17            raise;
 18          when others then
 19            utl_file.fclose_all;
 20            dbms_output.put_line('others2 : ' || sqlerrm);
 21            raise;
 22        end;
 23      exception
 24        when others then
 25          dbms_output.put_line('others2 : ' || sqlerrm);
 26      end;
 27    end loop;
 28    utl_file.fclose(h);
 29  end;
 30  /
read error : ORA-29284: file read error
others2 : ORA-29284: file read error
others2 : ORA-29282: invalid file ID
others2 : ORA-29282: invalid file ID
others2 : ORA-29282: invalid file ID
others2 : ORA-29282: invalid file ID
others2 : ORA-29282: invalid file ID
others2 : ORA-29282: invalid file ID
others2 : ORA-29282: invalid file ID
others2 : ORA-29282: invalid file ID
others2 : ORA-29282: invalid file ID
others2 : ORA-29282: invalid file ID
others2 : ORA-29282: invalid file ID
others2 : ORA-29282: invalid file ID
others2 : ORA-29282: invalid file ID
declare
*
ERROR at line 1:
ORA-04030: out of process memory when trying to allocate 16356 bytes (koh-kghu sessi,pl/sql vc2)
ORA-29282: invalid file ID
Solution¶
To work around the application bug, we added a pre-processing step to SlDbloader.sh that splits the lines exceeding 32K, so the subsequent call to the function SS_CITY_UTIL.SAVE_IB_OB_CITY_HUB_FILE will not encounter any line exceeding 32K.
Here is the change to SlDbloader.sh: the data is separated at the 32K threshold, and lines exceeding 32K are split by the function split_route_keys.
split_route_keys()
{
  cat $1 | while read line
  do
    if [ -z "$line" ]; then
      exit;
    fi
    CITY_INFO=`echo $line | sed 's/\(.*|\)\(.*\)/\1/'`
    ALL_KEYS=`echo $line | sed 's/\(.*|\)\(.*\)/\2/'`
    echo $ALL_KEYS | awk -v var="$CITY_INFO" '{
      max=split($0, a, ",");
      #print a[1],a[2]
      i=1
      str = "";
      while ( i <= max )
      {
        str = sprintf("%s,%s",str,a[i]);
        i = i + 1;
        if ( i%1000 == 0) {
          print var substr(str,2);
          str = "";
        }
      }
      print var substr(str,2);
    }'
  done
}

cat $TMP/CSS_SS_ALL_CITY_HUB.data.temp | awk '{if( length($0) >= 32767) print $0}' >> $TMP/CSS_SS_ALL_CITY_HUB.data.temp.invalid
cat $TMP/CSS_SS_ALL_CITY_HUB.data.temp | awk '{if( length($0) < 32767) print $0}' >> $TMP/CSS_SS_ALL_CITY_HUB.data.temp.valid

split_route_keys $TMP/CSS_SS_ALL_CITY_HUB.data.temp.invalid > $TMP/CSS_SS_ALL_CITY_HUB.data.temp.invalid_revised

cat $TMP/CSS_SS_ALL_CITY_HUB.data.temp.invalid_revised >> $TMP/CSS_SS_ALL_CITY_HUB.data.temp.valid

rm -r $TMP/CSS_SS_ALL_CITY_HUB.data.temp
rm -r $TMP/CSS_SS_ALL_CITY_HUB.data.temp.invalid
rm -r $TMP/CSS_SS_ALL_CITY_HUB.data.temp.invalid_revised

fmove $TMP/CSS_SS_ALL_CITY_HUB.data.temp.valid $TMP/CSS_SS_ALL_CITY_HUB.data
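To see the splitting effect on a toy input, the same prefix-plus-chunks idea can be run with a chunk size of 3 instead of 1000; the sample city code and keys below are made up:

```shell
#!/bin/sh
# Toy version of split_route_keys: keep the "CITY|" prefix and re-emit
# the comma-separated keys in chunks of 3 keys per output line.
line='HKHKG|k1,k2,k3,k4,k5,k6,k7'
prefix=$(echo "$line" | sed 's/\(.*|\)\(.*\)/\1/')
keys=$(echo "$line" | sed 's/\(.*|\)\(.*\)/\2/')
echo "$keys" | awk -v var="$prefix" '{
    max = split($0, a, ",");
    str = "";
    for (i = 1; i <= max; i++) {
        str = str "," a[i];
        if (i % 3 == 0) { print var substr(str, 2); str = "" }
    }
    if (str != "") print var substr(str, 2);
}'
# prints:
# HKHKG|k1,k2,k3
# HKHKG|k4,k5,k6
# HKHKG|k7
```

Each output line keeps the full city prefix, so the loader still sees well-formed records, just shorter ones.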