A Spark data-cleaning demo

Posted by 一隻勤奮愛思考的豬 on 2018-07-30
# -*- coding: utf-8 -*-
import sys

# Python 2 only: force UTF-8 as the default encoding so the Chinese
# company and person names below can be concatenated without
# UnicodeDecodeError.
reload(sys)
sys.setdefaultencoding('utf8')

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType


master_url = 'spark://sc-bd-10:7077'

# Driver-side settings such as spark.driver.maxResultSize must be fixed
# before the session (and its JVM) starts, so pass all of them to the
# builder instead of calling spark.conf.set() after getOrCreate().
spark = SparkSession.builder \
    .master(master_url) \
    .appName("saic_huangyu") \
    .config("spark.driver.maxResultSize", "4g") \
    .config("spark.sql.broadcastTimeout", 1200) \
    .config("spark.sql.crossJoin.enabled", "true") \
    .getOrCreate()

# Ship the helper module to every executor, then import it on the driver.
spark.sparkContext.addPyFile("md5_eid_pid.py")
from md5_eid_pid import gen_md5_pid
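The md5_eid_pid.py module isn't shown in the post. As a rough sketch, assuming gen_md5_pid simply derives a stable person ID by hashing the company-name-plus-person-name string (the real implementation may differ):

# md5_eid_pid.py -- hypothetical sketch, not the original module
import hashlib

def gen_md5_pid(text):
    # Python 2: hash the UTF-8 bytes of the (possibly unicode) input
    if isinstance(text, unicode):
        text = text.encode('utf-8')
    return hashlib.md5(text).hexdigest()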


# Investment records: person (share_name) -> invested company (eid_merged).
person_inv_company_without_pid_list = ["eid_merged", "share_name", "share_type", "inv_conum", "con_date"]
person_inv_company_without_pid_list_schema = StructType([StructField(field_name, StringType(), True) for field_name in person_inv_company_without_pid_list])
df_person_inv_company_without_pid = spark.read.load("hdfs://sc-bd-10:9000/scdata/huangyu/result/person_inv_company_table_person_without_pid_compliment_without_pid.csv", format="csv", schema=person_inv_company_without_pid_list_schema, delimiter=',')
df_person_inv_company_without_pid.createOrReplaceTempView("person_inv_company_table_person_without_pid_compliment_without_pid")


# Position records per company: person_name, is_fr (likely a
# legal-representative flag) and position.
person_name_without_pid_list = ["eid_merged", "person_name", "is_fr", "position"]
person_name_without_pid_schema = StructType([StructField(field_name, StringType(), True) for field_name in person_name_without_pid_list])
df_person_name_without_pid = spark.read.load("hdfs://sc-bd-10:9000/scdata/huangyu/result/person_position_company_table_person_without_pid_compliment_without_pid.csv", format="csv", schema=person_name_without_pid_schema, delimiter=',')
df_person_name_without_pid.createOrReplaceTempView("person_position_company_table_person_without_pid_compliment_without_pid")


# Mapping from merged eids to the new eids and company names.
merged_eid_table_list = ["eid_merged", "eid_new", "name"]
merged_eid_table_schema = StructType([StructField(field_name, StringType(), True) for field_name in merged_eid_table_list])
df_merged_eid_table = spark.read.load("hdfs://sc-bd-10:9000/scdata/huangyu/result/merge_new_old_table.csv", format="csv", schema=merged_eid_table_schema, delimiter=',')
df_merged_eid_table.createOrReplaceTempView("merged_eid_table")
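The three loads above repeat the same read-and-register pattern; a small helper (hypothetical, not in the original post) could factor it out:

# Hypothetical helper factoring out the repeated load-and-register pattern.
def load_csv_view(spark, path, columns, view_name):
    # Read a headerless CSV as all-string columns and register a temp view.
    schema = StructType([StructField(c, StringType(), True) for c in columns])
    df = spark.read.load(path, format="csv", schema=schema, delimiter=',')
    df.createOrReplaceTempView(view_name)
    return df

For example, the last block above would shrink to a single call: load_csv_view(spark, "hdfs://sc-bd-10:9000/scdata/huangyu/result/merge_new_old_table.csv", ["eid_merged", "eid_new", "name"], "merged_eid_table").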


# Output schema for the generated pid table; the map below must emit
# values in exactly this column order.
pid_eid_table_list = ["eid_merged", "pid", "person_name"]
pid_eid_table_schema = StructType([StructField(field_name, StringType(), True) for field_name in pid_eid_table_list])


spark.sql("""
select t1.eid_merged, t2.name as company, t1.person_name
from (
    -- union (without ALL) also deduplicates the combined rows
    select eid_merged, share_name as person_name
    from person_inv_company_table_person_without_pid_compliment_without_pid
    union
    select eid_merged, person_name
    from person_position_company_table_person_without_pid_compliment_without_pid
) t1
left join merged_eid_table t2
on t1.eid_merged = t2.eid_merged
-- filtering t2.eid_merged is not null turns the left join into an inner join
where t1.eid_merged is not null and t2.eid_merged is not null
""")\
    .rdd\
    .map(lambda r: (
        # Emit a plain tuple in pid_eid_table_schema column order.
        # (A Row built from keyword arguments sorts its fields
        # alphabetically in Python 2, which would silently swap the
        # pid and person_name columns against the schema.)
        r["eid_merged"],
        # guard against a null company name before concatenating
        gen_md5_pid((r["company"] or u"") + (r["person_name"] or u"")),
        r["person_name"],
    ))\
    .toDF(pid_eid_table_schema)\
    .write\
    .save("hdfs://sc-bd-10:9000/scdata/huangyu/result/person_pid_gen_new_all_field.csv", format="csv", header=False, delimiter=',', mode="overwrite")
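Before stopping the session, an optional sanity check (not part of the original demo) can read the result back and eyeball a few generated pids:

# Optional sanity check, assuming the write above succeeded.
df_check = spark.read.load(
    "hdfs://sc-bd-10:9000/scdata/huangyu/result/person_pid_gen_new_all_field.csv",
    format="csv", schema=pid_eid_table_schema, delimiter=',')
print(df_check.count())
df_check.show(5, truncate=False)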


spark.stop()
