3.0 Hands-On Alibaba Cloud Big Data Project Development

Posted by weixin_34162695 on 2018-11-09

Task objectives:

Read table 1 (hdp6_result) and table 2 (hdp6_locationresult) from the Alibaba Cloud database.
After simple processing of the retrieved data, create a new table 3 (hdp5_resultsave) in the database,
and store the data into table 3 in the form of a partitioned table.

1. Reading from the database

Step 1: connect to the Spark master

SparkConf conf = new SparkConf()
                    .setAppName("ResultData")
                    .setMaster("local");
JavaSparkContext ctx = new JavaSparkContext(conf);  

Step 2: initialize the OdpsContext and run a SELECT against the database

OdpsContext sqlContext = new OdpsContext(ctx);
String sql = "select * from hdp6_result limit 100";
DataFrame df = sqlContext.sql(sql);
df.show(100);
df.printSchema();

2. Multi-table query returning a single DataFrame

Method 1: multi-table query in a single SQL statement, returning one DataFrame

String sql = "select result_value,lower_tolerance,upper_tolerance,"
        + "time_stamp,type_number from hdp6_result "
        + "INNER JOIN hdp6_locationresult ON "
        + "hdp6_result.location_result_uid = hdp6_locationresult.location_result_uid";
DataFrame df = sqlContext.sql(sql);
df.show();

Important note!!!

Verified in practice: running a multi-table query as a single SQL statement does not work on Alibaba Cloud ODPS, possibly because multi-table queries are inconvenient for the vendor's billing.

Method 2: fetch each table as its own DataFrame, then combine the DataFrames into a new one

String sql1 = "select * from hdp6_result limit 100";
DataFrame df1 = sqlContext.sql(sql1);
df1.show(100);
df1.printSchema();

String sql2 = "select * from hdp6_locationresult limit 100";
DataFrame df2 = sqlContext.sql(sql2);
df2.show(100);
df2.printSchema();

DataFrame df0 = df2.join(df1, df1.col("location_result_uid").equalTo(df2.col("location_result_uid")));
System.out.println("----3----");
df0.show(200);
df0.printSchema();
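
One note on this join: equalTo on two identically named columns keeps both copies of location_result_uid in the result, which can cause ambiguous-column errors in later selects. A small alternative sketch (mine, not from the original post) using Spark's usingColumn join variant, which keeps a single copy of the key:

// Equivalent inner join that deduplicates the join key automatically
DataFrame joined = df2.join(df1, "location_result_uid");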

3. Simple processing of the data and generation of a new DataFrame

        //--------------------------------------- pull the distinct location_result values out of the table as strings
        List<String> resultlist = df1.select("location_result").distinct().javaRDD().map(new Function<Row, String>() {
            public String call(Row row) {
                return row.getString(0);
            }
        }).collect();
        //--------------------------------------- loop over the data
        for (String locationresult : resultlist) {
            df2 = df1.filter("location_result = '" + locationresult + "'");// rows for this location only
            df2.cache();// persist df2, it is reused several times below
            // avg/count/stddev come from: import static org.apache.spark.sql.functions.*;
            DataFrame valueByType0 = df2.groupBy("tolerance", "type_number")
                    .agg(avg("result_value"), count("result_value"), stddev("result_value"))
                    .filter("count(result_value) >= " + PART_COUNT_LOW_LIMIT)
                    .orderBy("tolerance", "type_number");// per tolerance and type number: average, count, standard deviation
            List<Row> valueByType = valueByType0.javaRDD().map(new Function<Row, Row>() {
                public Row call(Row row) {// pass-through map; documents the column layout
                    row.getString(0);// tolerance
                    row.getString(1);// type
                    row.getDouble(2);// avg
                    row.getLong(3);// count
                    row.getDouble(4);// std
                    return row;
                }
            }).collect();
            //----------------------------------- row layout: String tolerance 0, String type 1, Double avg 2, Long count 3, Double std 4
            for (Row row : valueByType) {
                if (processed.contains(row.getString(0))) {
                    msg = "processed tolerance " + row.getString(0);
                    System.out.println(msg);
                } else if (valueByType0.filter("tolerance = '" + row.getString(0) + "'").count() == 1) {
                    msg = "only one line " + row.getString(0);
                    System.out.println(msg);
                } else {
                    msg = "---------DATA Handling--------";
                    System.out.println(msg);
                    processed.add(row.getString(0));// mark this tolerance as processed
                    List<Row> valueByTypeOneTol = valueByType0.filter("tolerance = '" + row.getString(0) + "'").collectAsList();
                    double sum = 0;
                    long count = 0;
                    double stdsum = 0;
                    for (Row row1 : valueByTypeOneTol) // accumulate for the weighted overall average and pooled std
                    {
                        sum += row1.getDouble(2) * row1.getLong(3);
                        count += row1.getLong(3);
                        stdsum += Math.pow(row1.getDouble(4), 2);
                    }
                    msg = "---------Print result of avg&std--------";
                    System.out.println(msg);
                    double avg = sum / count;
                    double std = Math.pow((stdsum / valueByTypeOneTol.size()), 0.5);
                    System.out.println("avg = " + avg + ", std = " + std);

                    msg = "---------Result Judgement--------";
                    System.out.println(msg);
                    for (Row row2 : valueByTypeOneTol) {
                        System.out.println("delta avg  = " + (avg - row2.getDouble(2)));
                        String tol[] = row2.getString(0).split("/");
                        double tolSpan = Math.abs(Double.valueOf(tol[1]) - Double.valueOf(tol[0]));
                        System.out.println("tolSpan = " + tolSpan);
                        if (tolSpan <= 0) // skip when the tolerance is missing or zero
                        {
                            msg = "no tolerance or tolerance equals to 0";
                            System.out.println(msg);
                        } else {
                            if ((avg - row2.getDouble(2)) >= tolSpan * AVG_TOL)// mean check: overall avg minus this type's avg, larger is worse
                            {
                                msg = "typeno = " + row2.getString(1) + " avg needs to be handled";
                                System.out.println(msg);
                                String typeno = row2.getString(1);
                                errorMessage = "type = " + typeno + " avg is less than the mean, please check if there is any problem in this station\n";
                            } else {
                                msg = "typeno = " + row2.getString(1) + " does not need to be handled";
                                System.out.println(msg);
                            }
                            System.out.println("delta std  = " + (row2.getDouble(4) - std));
                            if ((row2.getDouble(4) - std) >= tolSpan * STD_TOL)// std check: this type's std minus the pooled std, larger is worse
                            {
                                msg = "typeno = " + row2.getString(1) + " std needs to be handled";
                                System.out.println(msg);
                                String typeno = row2.getString(1);
                                errorMessage = "type = " + typeno + " std is greater than the pooled std, please check if there is any problem in this station\n";
                            } else {
                                msg = "typeno = " + row2.getString(1) + " does not need to be handled";
                                System.out.println(msg);
                            }
                        }
                    }
                }
            }
        }// end of the loop over resultlist
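
As written, the loop only prints its findings; nothing is handed to the save step in section 5. One way to bridge the two is to collect the flagged rows into a new DataFrame. The sketch below is an illustration under loud assumptions: the flagged list, the RowFactory rows, and the two-column schema are mine, chosen to match the uid/time_stamp table created in section 4, and OdpsContext is assumed to expose the standard SQLContext.createDataFrame.

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical collector: one row per flagged type number; populate it
// with flagged.add(RowFactory.create(row2.getString(1), timeStamp))
// inside the judgement loop above.
List<Row> flagged = new ArrayList<Row>();

// Assumed schema matching the table created in section 4 (uid, time_stamp)
StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("uid", DataTypes.StringType, false),
        DataTypes.createStructField("time_stamp", DataTypes.StringType, true) });
DataFrame resultDf = sqlContext.createDataFrame(flagged, schema);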

4. Create a new table to store the processed data

//---------------------------------------------- create a new table
sqlContext.sql("Create table if not exists scxtest001 (uid varchar(500), time_stamp String) partitioned by (pt String)");

5. Save the processed DataFrame

Method 1: write into the new table via a temporary view

Note: on Alibaba Cloud, only the Spark 2.3.0 build currently supports creating temporary views.
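
The post shows no code for this method; here is a minimal sketch assuming Spark 2.3.0 (so createOrReplaceTempView is available), a processed DataFrame named resultDf with uid and time_stamp columns, and a made-up partition value '20181109':

// Register the DataFrame as a temporary view, then insert into the
// target partition with plain SQL.
resultDf.createOrReplaceTempView("result_tmp");
sqlContext.sql("INSERT INTO TABLE scxtest001 PARTITION (pt = '20181109') "
        + "SELECT uid, time_stamp FROM result_tmp");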

Method 2: write into the new table via insertInto or saveAsTable
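
Likewise a hedged sketch rather than the original code: insertInto appends into the existing table (the DataFrame must then carry the partition column pt as its last column), while saveAsTable with partitionBy lets Spark create and manage the table itself; scxtest001_copy is a made-up name:

import org.apache.spark.sql.SaveMode;

// Append into the existing partitioned table created in section 4;
// the DataFrame must include pt as its last column.
resultDf.write().mode(SaveMode.Append).insertInto("scxtest001");

// Or let Spark create and manage a partitioned table of its own.
resultDf.write().mode(SaveMode.Append).partitionBy("pt").saveAsTable("scxtest001_copy");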

For the full code of both methods above, see my personal GitLab project:
https://gitlab.com/uaes-tef3/hdev6online
