Inspecting Spark processes / distinguishing table merges in pyspark vs pandas: pyspark uses join, pandas uses merge

Posted by 一隻勤奮愛思考的豬 on 2018-10-08
Commands:
vim ~/.bashrc

source ~/.bashrc

ps aux | grep spark

pkill -f "spark"


sudo chown -R sc:sc  spark-2.3.1-bin-hadoop2.7/

sudo mv /home/sc/Downloads/spark-2.3.1-bin-hadoop2.7 /opt/


locate *punish*
Find a file's path.


Error when trying a pyspark-style join in pandas:
I wrote the join in pandas like this: df22 = df1.join(df2, df2.company_name_a == df1.company_name, 'left_outer') and got: ValueError: Can only compare identically-labeled Series objects. The reason is that pandas' DataFrame.join does not accept a boolean join condition the way pyspark does; comparing two Series from different frames with == fails because their indexes are not identically labeled. In pandas, a column-based join is expressed with merge instead.
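A minimal sketch of the pandas equivalent using pd.merge with left_on/right_on; the toy frames and values below are made up for illustration, only the column names come from the failing call above:

import pandas as pd

# Hypothetical toy frames mirroring the column names in the failing call.
df1 = pd.DataFrame({'company_name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'company_name_a': ['A', 'C'], 'flag': [1, 1]})

# pandas expresses the join key via on / left_on / right_on rather than
# a boolean expression; how='left' corresponds to pyspark's 'left_outer'.
df22 = pd.merge(df1, df2, left_on='company_name',
                right_on='company_name_a', how='left')
print(df22)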

The pyspark documentation for join:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql

>>> df1.join(df2, df1["value"] == df2["value"]).count()
0
>>> df1.join(df2, df1["value"].eqNullSafe(df2["value"])).count()

The pandas merge documentation:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.merge.html
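A quick contrast between the two pandas APIs, on hypothetical toy frames: DataFrame.join aligns on the index (or a caller key against the other frame's index), while pd.merge joins on arbitrary columns, which is usually what a pyspark-style join maps to:

import pandas as pd

left = pd.DataFrame({'company_name': ['A', 'B'], 'x': [1, 2]})
right = pd.DataFrame({'company_name': ['A', 'C'], 'y': [10, 30]})

# merge: join on a shared column, like a SQL / pyspark join.
merged = pd.merge(left, right, on='company_name', how='left')

# join: aligns on the index, so the key must be moved there first.
joined = left.set_index('company_name').join(
    right.set_index('company_name'), how='left')

print(merged)
print(joined)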

import pandas as pd

train_x = pd.read_csv('/home/sc/PycharmProjects/sc/risk_rules/sklearn_result_02/the_check_shixin_train.csv')
print(train_x.columns)
train_x['add_companyname'] = train_x['company_name']
print(train_x.columns)
df_check_1000 = pd.read_csv('/home/sc/Desktop/shixin_detect_result_shixin_cnt.csv')
df_check_1000 = df_check_1000.drop_duplicates()
df_ch1 = pd.merge(df_check_1000, train_x, on='company_name', how='left')
print(df_ch1.head(2))
df_ch2 = df_ch1[(df_ch1['add_companyname'].isnull()) & (df_ch1['shixin_cnt'] != 1)]  # 248 companies: multiple shixin (dishonesty) records and not seen in the training set
print(df_ch2.groupby(['id']).size())
print(df_ch2.groupby(['shixin_cnt']).size())
print(len(df_ch2))

df_ch2 = pd.merge(df_ch2, df_check_1000, on='company_name', how='left')
print(len(df_ch2))
cols = ['company_name','established_years',
       'industry_dx_rate', 'regcap_change_cnt', 'industry_dx_cnt',
       'address_change_cnt', 'network_share_cancel_cnt', 'cancel_cnt',
       'fr_change_cnt', 'network_share_zhixing_cnt',
       'network_share_judge_doc_cnt', 'judge_doc_cnt', 'share_change_cnt',
       'industry_all_cnt', 'network_share_or_pos_shixin_cnt',
       'judgedoc_cnt']
print("hahahhaha")
print(df_ch2.columns)
df_ch22 = df_ch2.loc[:, cols]
print(df_ch22.columns)
