3.0 Hands-On Alibaba Cloud Big Data Project Development

Posted by weixin_34162695 on 2018-11-09

Task objectives:

Read table 1 (hdp6_result) and table 2 (hdp6_locationresult) from the Alibaba Cloud database.
After simple processing of the retrieved data, create a new table 3 (hdp5_resultsave) in the database,
and store the data into table 3 in the form of a partitioned table.

1. Reading from the database

Step 1: connect to the Spark master

SparkConf conf = new SparkConf()
                    .setAppName("ResultData")
                    .setMaster("local");
JavaSparkContext ctx = new JavaSparkContext(conf);  

Step 2: initialize the OdpsContext and run a SELECT against the database

OdpsContext sqlContext = new OdpsContext(ctx);
String sql = "select * from hdp6_result limit 100";
DataFrame df = sqlContext.sql(sql);
df.show(100);
df.printSchema();

2. Multi-table query returning a single DataFrame

Method 1: multi-table query in a single SQL statement, returning one DataFrame

String sql = "select result_value,lower_tolerance,upper_tolerance,"
        + "time_stamp,type_number from hdp6_result "
        + "INNER JOIN hdp6_locationresult ON "
        + "hdp6_result.location_result_uid = hdp6_locationresult.location_result_uid";
DataFrame df = sqlContext.sql(sql);
df.show();

Important note!!!

Verified in practice: running a multi-table query as a single SQL statement does not work on Alibaba Cloud ODPS, possibly because multi-table queries are inconvenient for the vendor's billing.

Method 2: fetch each table as its own DataFrame, then combine the DataFrames into a new one

String sql1 = "select * from hdp6_result limit 100";
DataFrame df1 = sqlContext.sql(sql1);
df1.show(100);
df1.printSchema();

String sql2 = "select * from hdp6_locationresult limit 100";
DataFrame df2 = sqlContext.sql(sql2);
df2.show(100);
df2.printSchema();

DataFrame df0 = df2.join(df1, df1.col("location_result_uid").equalTo(df2.col("location_result_uid")));
System.out.println("----3----");
df0.show(200);
df0.printSchema();
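
One note on this join: equalTo on two identically named columns keeps both copies of location_result_uid in the result, which can cause ambiguous-column errors in later selects. A small alternative sketch (mine, not from the original post) using Spark's usingColumn join variant, which keeps a single copy of the key:

// Equivalent inner join that deduplicates the join key automatically
DataFrame joined = df2.join(df1, "location_result_uid");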

3. Simple processing of the data and generation of a new DataFrame

        //--------------------------------------- pull the distinct location_result values out of the table as strings
        List<String> resultlist = df1.select("location_result").distinct().javaRDD().map(new Function<Row, String>() {
            public String call(Row row) {
                return row.getString(0);
            }
        }).collect();
        //--------------------------------------- loop over the data
        for (String locationresult : resultlist) {
            df2 = df1.filter("location_result = '" + locationresult + "'");// rows for this location only
            df2.cache();// persist df2, it is reused several times below
            // avg/count/stddev come from: import static org.apache.spark.sql.functions.*;
            DataFrame valueByType0 = df2.groupBy("tolerance", "type_number")
                    .agg(avg("result_value"), count("result_value"), stddev("result_value"))
                    .filter("count(result_value) >= " + PART_COUNT_LOW_LIMIT)
                    .orderBy("tolerance", "type_number");// per tolerance and type number: average, count, standard deviation
            List<Row> valueByType = valueByType0.javaRDD().map(new Function<Row, Row>() {
                public Row call(Row row) {// pass-through map; documents the column layout
                    row.getString(0);// tolerance
                    row.getString(1);// type
                    row.getDouble(2);// avg
                    row.getLong(3);// count
                    row.getDouble(4);// std
                    return row;
                }
            }).collect();
            //----------------------------------- row layout: String tolerance 0, String type 1, Double avg 2, Long count 3, Double std 4
            for (Row row : valueByType) {
                if (processed.contains(row.getString(0))) {
                    msg = "processed tolerance " + row.getString(0);
                    System.out.println(msg);
                } else if (valueByType0.filter("tolerance = '" + row.getString(0) + "'").count() == 1) {
                    msg = "only one line " + row.getString(0);
                    System.out.println(msg);
                } else {
                    msg = "---------DATA Handling--------";
                    System.out.println(msg);
                    processed.add(row.getString(0));// mark this tolerance as processed
                    List<Row> valueByTypeOneTol = valueByType0.filter("tolerance = '" + row.getString(0) + "'").collectAsList();
                    double sum = 0;
                    long count = 0;
                    double stdsum = 0;
                    for (Row row1 : valueByTypeOneTol) // accumulate for the weighted overall average and pooled std
                    {
                        sum += row1.getDouble(2) * row1.getLong(3);
                        count += row1.getLong(3);
                        stdsum += Math.pow(row1.getDouble(4), 2);
                    }
                    msg = "---------Print result of avg&std--------";
                    System.out.println(msg);
                    double avg = sum / count;
                    double std = Math.pow((stdsum / valueByTypeOneTol.size()), 0.5);
                    System.out.println("avg = " + avg + ", std = " + std);

                    msg = "---------Result Judgement--------";
                    System.out.println(msg);
                    for (Row row2 : valueByTypeOneTol) {
                        System.out.println("delta avg  = " + (avg - row2.getDouble(2)));
                        String tol[] = row2.getString(0).split("/");
                        double tolSpan = Math.abs(Double.valueOf(tol[1]) - Double.valueOf(tol[0]));
                        System.out.println("tolSpan = " + tolSpan);
                        if (tolSpan <= 0) // skip when the tolerance is missing or zero
                        {
                            msg = "no tolerance or tolerance equals to 0";
                            System.out.println(msg);
                        } else {
                            if ((avg - row2.getDouble(2)) >= tolSpan * AVG_TOL)// mean check: overall avg minus this type's avg, larger is worse
                            {
                                msg = "typeno = " + row2.getString(1) + " avg needs to be handled";
                                System.out.println(msg);
                                String typeno = row2.getString(1);
                                errorMessage = "type = " + typeno + " avg is less than the mean, please check if there is any problem in this station\n";
                            } else {
                                msg = "typeno = " + row2.getString(1) + " does not need to be handled";
                                System.out.println(msg);
                            }
                            System.out.println("delta std  = " + (row2.getDouble(4) - std));
                            if ((row2.getDouble(4) - std) >= tolSpan * STD_TOL)// std check: this type's std minus the pooled std, larger is worse
                            {
                                msg = "typeno = " + row2.getString(1) + " std needs to be handled";
                                System.out.println(msg);
                                String typeno = row2.getString(1);
                                errorMessage = "type = " + typeno + " std is greater than the pooled std, please check if there is any problem in this station\n";
                            } else {
                                msg = "typeno = " + row2.getString(1) + " does not need to be handled";
                                System.out.println(msg);
                            }
                        }
                    }
                }
            }
        }// end of the loop over resultlist
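
As written, the loop only prints its findings; nothing is handed to the save step in section 5. One way to bridge the two is to collect the flagged rows into a new DataFrame. The sketch below is an illustration under loud assumptions: the flagged list, the RowFactory rows, and the two-column schema are mine, chosen to match the uid/time_stamp table created in section 4, and OdpsContext is assumed to expose the standard SQLContext.createDataFrame.

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical collector: one row per flagged type number; populate it
// with flagged.add(RowFactory.create(row2.getString(1), timeStamp))
// inside the judgement loop above.
List<Row> flagged = new ArrayList<Row>();

// Assumed schema matching the table created in section 4 (uid, time_stamp)
StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("uid", DataTypes.StringType, false),
        DataTypes.createStructField("time_stamp", DataTypes.StringType, true) });
DataFrame resultDf = sqlContext.createDataFrame(flagged, schema);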

4. Create a new table to store the processed data

//---------------------------------------------- create a new table
sqlContext.sql("Create table if not exists scxtest001 (uid varchar(500), time_stamp String) partitioned by (pt String)");

5. Save the processed DataFrame

Method 1: write into the new table via a temporary view

Note: on Alibaba Cloud, only the Spark 2.3.0 build currently supports creating temporary views.
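
The post shows no code for this method; here is a minimal sketch assuming Spark 2.3.0 (so createOrReplaceTempView is available), a processed DataFrame named resultDf with uid and time_stamp columns, and a made-up partition value '20181109':

// Register the DataFrame as a temporary view, then insert into the
// target partition with plain SQL.
resultDf.createOrReplaceTempView("result_tmp");
sqlContext.sql("INSERT INTO TABLE scxtest001 PARTITION (pt = '20181109') "
        + "SELECT uid, time_stamp FROM result_tmp");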

Method 2: write into the new table via insertInto or saveAsTable
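
Likewise a hedged sketch rather than the original code: insertInto appends into the existing table (the DataFrame must then carry the partition column pt as its last column), while saveAsTable with partitionBy lets Spark create and manage the table itself; scxtest001_copy is a made-up name:

import org.apache.spark.sql.SaveMode;

// Append into the existing partitioned table created in section 4;
// the DataFrame must include pt as its last column.
resultDf.write().mode(SaveMode.Append).insertInto("scxtest001");

// Or let Spark create and manage a partitioned table of its own.
resultDf.write().mode(SaveMode.Append).partitionBy("pt").saveAsTable("scxtest001_copy");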

For the full code of both methods above, see my personal GitLab project:
https://gitlab.com/uaes-tef3/hdev6online
