SQL functions

Posted by 牧野之月 on 2024-08-30

**1. Presto**

1. Check whether an array contains an element: array_position(applist_install, 'com.funtomic.matchmasters') > 0
2. split('joyit_daily_mas_cleaner','_')[3] -> Presto indexes arrays from 1 (Spark indexes from 0)
3. Date difference:
	1. DATE_DIFF('day', DATE_PARSE('20220301', '%Y%m%d'), DATE_PARSE('20220313', '%Y%m%d'))
	2. date_diff('day', date(date_parse(datepart, '%Y-%m-%d')), current_date) = 1
4. Date conversion (see the combined query after this list):
	+8 hours: format_datetime((cast(create_time as timestamp) + interval '8' hour), 'yyyyMMdd') = '20230927'
	Last day of the month: last_day_of_month(date_parse(dt, '%Y%m%d'))
	Epoch milliseconds to an hour-granularity string in UTC+8:
	split_part(try_cast(from_unixtime((try_cast(event_time AS double)+8*60*60*1000)/1000) AS varchar),':',1)
5. Drop a partition:
	ALTER TABLE table_name DROP IF EXISTS PARTITION(year = 2015, month = 10, day = 1);
6. Rename a table: alter table table_name rename to new_table_name
7. Rename a column: alter table table_name change column old_column_name new_column_name data_type
8. Add a column: alter table detail_flow_test add columns(original_union_id string);
9. Check whether a string contains a substring: select strpos(addr, '北京') (returns 0 when not found)
10. Parse JSON: select json_extract(field, '$.name') as field_name
11. Deduplicate and collapse rows into an array: array_sort(array_distinct(array_agg(col_name)))
12. Show the table DDL and storage location: show create table table_name
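A minimal sketch pulling several of the functions above into one Presto query; the table name events and the columns create_time, dt, event_time, extra, applist_install are hypothetical:

select
    format_datetime(cast(create_time as timestamp) + interval '8' hour, 'yyyyMMdd') as dt_utc8,  -- shift UTC to UTC+8
    date_diff('day', date(date_parse(dt, '%Y%m%d')), current_date) as days_ago,
    split('joyit_daily_mas_cleaner', '_')[3] as third_token,  -- 1-based indexing: returns 'mas'
    split_part(try_cast(from_unixtime((try_cast(event_time as double) + 8*60*60*1000)/1000) as varchar), ':', 1) as hour_utc8,
    json_extract(extra, '$.name') as extra_name
from events
where array_position(applist_install, 'com.funtomic.matchmasters') > 0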

**2. Spark**

1. In an INSERT, dynamic partition columns must be written after the static partition columns.
2. MSCK REPAIR TABLE repairs (re-registers) a table's partitions.
3. collect_set: collects values from multiple rows into one deduplicated array.
4. Create an Iceberg table: using iceberg partitioned by (dt)
5. Check whether an array contains an element: array_contains(applist_install, 'com.funtomic.matchmasters')
6. Last day of the month: last_day(to_date('20230727','yyyyMMdd'))
7. Create a table (CSV format); remember to register partitions afterwards (see the sketch after the DDL):
CREATE EXTERNAL TABLE analyst.huanglu_test  
(
`nation` string COMMENT 'from deserializer',
`beyla_id` string COMMENT 'from deserializer',
`push_launch_pv` bigint COMMENT 'from deserializer',
`other_launch_pv` bigint COMMENT 'from deserializer',

`extra1` string COMMENT 'from deserializer',
`extra2` string COMMENT 'from deserializer',
`extra3` string COMMENT 'from deserializer'
)
PARTITIONED BY (`dt` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',','quoteChar' = '\"','escapeChar' = '\\')
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://da.results.prod.us-east-1/huanglu/temp/test_20200529改一改這裡'  -- change this line (placeholder path)
TBLPROPERTIES ('has_encrypted_data'='false','skip.header.line.count'='1')
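Partitions of an external table like this are not queryable until they are registered in the metastore; a minimal sketch, assuming a dt=20200529 directory already exists under the LOCATION prefix:

ALTER TABLE analyst.huanglu_test ADD IF NOT EXISTS PARTITION (dt = '20200529');
-- or discover all partition directories in one pass:
MSCK REPAIR TABLE analyst.huanglu_test;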
8. Create a table (MongoDB JSON format):
CREATE EXTERNAL TABLE  game_ue1.ods_gameet_pp_feedback(
  `_id` STRING,
  u STRING,
  app STRING,
  uc STRING,
  i STRING,
  obj STRING,
  cont STRING,
  imgs STRING,
  star double,
  biz STRING,
  ct BIGINT
) PARTITIONED BY (datepart STRING)
 ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 
 WITH SERDEPROPERTIES ('serialization.format' = '1') 
 STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
 OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' 
 LOCATION 's3://game.data.us-east-1/data_analyst/server_ods/ods_gameet_pp_feedback'
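A hedged usage sketch against this table; the column meanings (u = user id, cont = feedback text, star = rating) are guesses from the field names, and the partition value is hypothetical:

select u as user_id,
       star,                  -- rating, stored as double
       cont as content        -- feedback text
from game_ue1.ods_gameet_pp_feedback
where datepart = '20240830'   -- hypothetical partition value
  and star <= 2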

9. Timestamp conversion:

select from_unixtime(unix_timestamp('20230515','yyyyMMdd'), 'yyyy-MM-dd')
cast(create_time as timestamp) + interval '8' hour   -- add 8 hours
select from_unixtime(cast('1668009600000' as bigint)/1000 + 86060, 'yyyy-MM-dd')
from_unixtime(unix_timestamp(cast(create_time as timestamp) + interval '8' hour), 'yyyyMMdd')
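A quick sanity check of the round trip, assuming a UTC session time zone (1684108800 is 2023-05-15 00:00:00 UTC):

select unix_timestamp('20230515', 'yyyyMMdd')             -- 1684108800
select from_unixtime(1684108800, 'yyyy-MM-dd HH:mm:ss')   -- 2023-05-15 00:00:00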

10. Dynamic partitioning:
	1. When migrating data, the partition column(s) must be selected last (a full sketch follows the settings below).

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
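Putting the two settings together with the column-order rule, a minimal sketch (target_table, source_table, and their columns are hypothetical):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table target_table partition (dt)
select user_id,
       event_name,
       dt              -- the dynamic partition column goes last in the select
from source_table;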

11. Running (cumulative) sum:

sum(order_pv) OVER (PARTITION BY pp_user_id, item_id, item_type, stage ORDER BY datepart ASC)
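In context, the running total looks like this; the table name order_detail is assumed:

select pp_user_id, item_id, item_type, stage, datepart,
       sum(order_pv) over (partition by pp_user_id, item_id, item_type, stage
                           order by datepart asc) as cum_order_pv  -- running sum up to each datepart
from order_detail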

12. Dynamic partition overwrite for a job: --conf spark.sql.sources.partitionOverwriteMode=dynamic
13. Spark job parameters are passed via --conf, without `set`: --conf hive.exec.dynamic.partition.mode=nonstrict
(in strict mode dynamic partitioning requires at least one static partition column, so add this parameter when writing with dynamic partitions only)

**3. Tableau**
1. Latest date: {MAX(DATETRUNC('day', [date]))}
2. Day-of-week name: DATENAME('weekday', [YourDateField])
3. Truncate a date to a given granularity:
	1. DATETRUNC('date_part', date_expression, [start_of_week])
		1. 'date_part' is the unit to truncate to, e.g. 'year', 'quarter', 'month', 'week', 'day'.
4. Parse a date string: DATEPARSE('yyyy-MM-dd', '2023-01-15')
