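The task: read a table and apply a binarization feature transform to it. The catch is that binarization requires the input column to be of type double, so how do we convert the column type?
It can be done directly with a Spark Column: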
DataFrame dataset = hive.sql("select age,sex,race from hive_race_sex_bucktizer ");
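/**
 * Type conversion: cast the age column to double and keep sex and race unchanged.
 * DoubleType here is DataTypes.DoubleType (static import).
 */
dataset = dataset.select(
    dataset.col("age").cast(DoubleType).as("age"),
    dataset.col("sex"),
    dataset.col("race"));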
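To make the purpose of the cast concrete, here is a minimal plain-Java sketch (no Spark involved) of what happens to each age value: it is widened from long to double, then compared against a threshold. The class name BinarizeSketch, the helper names, and the threshold of 30.0 are assumptions made up for illustration. The rule itself follows the documented behaviour of Spark ML's Binarizer: values greater than the threshold become 1.0, all others become 0.0.

```java
class BinarizeSketch {
    // Widening conversion long -> double; for non-null values this is
    // roughly what cast(DoubleType) does to each cell of the column.
    public static double toDouble(long value) {
        return (double) value;
    }

    // Threshold rule: strictly greater than the threshold -> 1.0, else 0.0.
    public static double binarize(double value, double threshold) {
        return value > threshold ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        long[] ages = {18L, 30L, 45L};
        double threshold = 30.0; // hypothetical threshold, not from the original post
        for (long age : ages) {
            System.out.println(age + " -> " + binarize(toDouble(age), threshold));
        }
        // prints:
        // 18 -> 0.0
        // 30 -> 0.0
        // 45 -> 1.0
    }
}
```

Note that a value equal to the threshold maps to 0.0, because the comparison is strict.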
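The column cast really is that simple. Compare it with the way I used to convert types: iterate over the rows to build a second RDD that satisfies the type requirements, then create a DataFrame from that RDD. So complicated!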
JavaRDD<Row> parseDataset = dataset.toJavaRDD().map(new Function<Row, Row>() {
    @Override
    public Row call(Row row) throws Exception {
        System.out.println(row); // debug output, printed on the executors
        long age = row.getLong(row.fieldIndex("age"));
        String sex = row.getAs("sex");
        String race = row.getAs("race");
        // Encode race as a numeric code; unrecognized values end up as 0
        double raceV = -1;
        if ("white".equalsIgnoreCase(race)) {
            raceV = 1;
        } else if ("black".equalsIgnoreCase(race)) {
            raceV = 2;
        } else if ("yellow".equalsIgnoreCase(race)) {
            raceV = 3;
        } else if ("Asian-Pac-Islander".equalsIgnoreCase(race)) {
            raceV = 4;
        } else if ("Amer-Indian-Eskimo".equalsIgnoreCase(race)) {
            raceV = 3; // note: same code as "yellow"
        } else {
            raceV = 0;
        }
        // Encode sex as 1 for "male" and 0 for anything else
        return RowFactory.create(age, ("male".equalsIgnoreCase(sex) ? 1 : 0), raceV);
    }
});
// Schema for the new rows; field order and types must match RowFactory.create(...) above
StructType schema = new StructType(new StructField[]{
    createStructField("_age", LongType, false),
    createStructField("_sex", IntegerType, false),
    createStructField("_race", DoubleType, false)
});
DataFrame df = hive.createDataFrame(parseDataset, schema);
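If a per-row mapping like this is still needed, the if/else chain above can be written as a lookup table, which keeps the codes in one place. The following is only a plain-Java sketch: the class name RaceEncoderSketch and the method names are hypothetical, and the codes are copied as-is from the chain above (including the shared code 3). It treats a null label as unknown, which matches what equalsIgnoreCase does in the chain.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class RaceEncoderSketch {
    // Codes copied from the if/else chain, keyed by lower-cased label.
    private static final Map<String, Double> RACE_CODES = new HashMap<>();
    static {
        RACE_CODES.put("white", 1.0);
        RACE_CODES.put("black", 2.0);
        RACE_CODES.put("yellow", 3.0);
        RACE_CODES.put("asian-pac-islander", 4.0);
        RACE_CODES.put("amer-indian-eskimo", 3.0); // same code as "yellow", as in the chain
    }

    // Case-insensitive lookup; null or unrecognized labels map to 0.0.
    public static double encodeRace(String race) {
        if (race == null) {
            return 0.0;
        }
        Double code = RACE_CODES.get(race.toLowerCase(Locale.ROOT));
        return code != null ? code : 0.0;
    }

    // 1 for "male" (any letter case), 0 for anything else, including null.
    public static int encodeSex(String sex) {
        return "male".equalsIgnoreCase(sex) ? 1 : 0;
    }

    public static void main(String[] args) {
        System.out.println(encodeRace("White"));
        System.out.println(encodeRace("unknown"));
        System.out.println(encodeSex("Male"));
    }
}
```

Inside the map function, the body would then reduce to reading the three fields and calling these two helpers before RowFactory.create.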
Keep exploring, keep experimenting!