I. Interfaces in SQLContext.scala
Most of these interfaces create a DataFrame.
1. Construction: a SQLContext needs only a SparkContext argument.
2. Set/get configuration: setConf/getConf.
3. isCached/cacheTable/uncacheTable/clearCache: table caching, which speeds up repeated queries; use carefully to avoid OOM.
4. read: reads from external data sources (described in detail below):
  /**
   * Returns a [[DataFrameReader]] that can be used to read data in as a [[DataFrame]].
   * {{{
   *   sqlContext.read.parquet("/path/to/file.parquet")
   *   sqlContext.read.schema(schema).json("/path/to/file.json")
   * }}}
   */
  def read: DataFrameReader = new DataFrameReader(this)
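As a sketch of how read is typically chained (assuming an existing SparkContext `sc`; the paths are placeholders, and this requires a Spark 1.x runtime):

```scala
// Sketch; `sc` is an existing SparkContext, paths are placeholders.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// The default source is parquet, so load() alone reads parquet:
val parquetDF = sqlContext.read.load("/path/to/data.parquet")

// Explicit format plus an option (samplingRatio controls JSON schema inference):
val jsonDF = sqlContext.read
  .format("json")
  .option("samplingRatio", "0.5")
  .load("/path/to/data.json")
```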
5. Drop a temporary table:

  def dropTempTable(tableName: String): Unit

6. List the current tables (how tables come into being is covered under the DataFrame methods below):

  def tableNames(): Array[String]
  def tableNames(databaseName: String): Array[String]
7. Create a DataFrame, from an existing RDD or a Seq:

  def createDataFrame[A <: Product : TypeTag](rdd: RDD[A]): DataFrame
  def createDataFrame[A <: Product : TypeTag](data: Seq[A]): DataFrame

  /**
   * Example:
   * {{{
   *  import org.apache.spark.sql._
   *  import org.apache.spark.sql.types._
   *  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
   *
   *  val schema =
   *    StructType(
   *      StructField("name", StringType, false) ::
   *      StructField("age", IntegerType, true) :: Nil)
   *
   *  val people =
   *    sc.textFile("examples/src/main/resources/people.txt").map(
   *      _.split(",")).map(p => Row(p(0), p(1).trim.toInt))
   *  val dataFrame = sqlContext.createDataFrame(people, schema)
   *  dataFrame.printSchema
   *  // root
   *  // |-- name: string (nullable = false)
   *  // |-- age: integer (nullable = true)
   *
   *  dataFrame.registerTempTable("people")
   *  sqlContext.sql("select name from people").collect.foreach(println)
   * }}}
   */
  def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame
8. Create a DataFrame backed by an external table:

  def createExternalTable(tableName: String, path: String): DataFrame

  def createExternalTable(
      tableName: String,
      path: String,
      source: String): DataFrame

  def createExternalTable(
      tableName: String,
      source: String,
      options: Map[String, String]): DataFrame

  def createExternalTable(
      tableName: String,
      source: String,
      schema: StructType,
      options: Map[String, String]): DataFrame
9. Create a DataFrame with a single column named id, whose values span the given range:

  def range(start: Long, end: Long): DataFrame
  def range(end: Long): DataFrame
  def range(start: Long, end: Long, step: Long, numPartitions: Int): DataFrame
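A minimal sketch of the range overloads, assuming an existing SQLContext `sqlContext` (Spark 1.4+):

```scala
// Sketch; requires a Spark 1.x runtime and an existing SQLContext.
val ids = sqlContext.range(0, 10)             // one column "id" with values 0..9
val sparse = sqlContext.range(0, 100, 10, 4)  // step 10, spread over 4 partitions
ids.show()
```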
10. Create a DataFrame from a SQL query:

  def sql(sqlText: String): DataFrame
11. Create a DataFrame from a table:

  def table(tableName: String): DataFrame

  def tables(): DataFrame = {
    DataFrame(this, ShowTablesCommand(None))
  }

  def tables(databaseName: String): DataFrame = {
    DataFrame(this, ShowTablesCommand(Some(databaseName)))
  }
II. Implicit conversions in SQLContext
1. StringToColumn: implicitly wraps a StringContext as a StringToColumn object, giving it a $ method that turns a column name into a Column:

  implicit class StringToColumn(val sc: StringContext) {
    def $(args: Any*): ColumnName = {
      new ColumnName(sc.s(args : _*))
    }
  }
2. Conversion of Scala's Symbol type:

  implicit def symbolToColumn(s: Symbol): ColumnName = new ColumnName(s.name)
3. Convert an RDD or Seq (with case-class/tuple elements) to a DataFrameHolder; the DataFrameHolder methods are listed in item 5 below:

  implicit def rddToDataFrameHolder[A <: Product : TypeTag](rdd: RDD[A]): DataFrameHolder = {
    DataFrameHolder(self.createDataFrame(rdd))
  }

  implicit def localSeqToDataFrameHolder[A <: Product : TypeTag](data: Seq[A]): DataFrameHolder =
  {
    DataFrameHolder(self.createDataFrame(data))
  }
4. Convert an RDD of Int/Long/String elements to a DataFrameHolder:

  implicit def intRddToDataFrameHolder(data: RDD[Int]): DataFrameHolder
  implicit def longRddToDataFrameHolder(data: RDD[Long]): DataFrameHolder
  implicit def stringRddToDataFrameHolder(data: RDD[String]): DataFrameHolder
5. DataFrameHolder has the following methods for the final conversion to a DataFrame:

  private[sql] case class DataFrameHolder(df: DataFrame) {
    def toDF(): DataFrame = df
    def toDF(colNames: String*): DataFrame = df.toDF(colNames : _*)
  }
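Putting the implicit conversions and DataFrameHolder together, a sketch of the usual toDF path (assuming an existing SparkContext `sc`; Person is a hypothetical case class):

```scala
// Sketch; requires a Spark 1.x runtime. Person is a hypothetical case class
// and must be defined at the top level so a TypeTag is available.
case class Person(name: String, age: Int)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._  // brings rddToDataFrameHolder into scope

val df = sc.parallelize(Seq(Person("Alice", 29), Person("Bob", 31))).toDF()
val renamed = df.toDF("person_name", "person_age")  // override the column names
```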
III. DataFrameReader
The read method of SQLContext returns a DataFrameReader, which provides methods for obtaining a DataFrame from external sources.
1. Format: the format method sets the data source to the source argument; the default source is parquet:

  def format(source: String): DataFrameReader = {
    this.source = source
    this
  }
2. Specify the schema:

  def schema(schema: StructType): DataFrameReader = {
    this.userSpecifiedSchema = Option(schema)
    this
  }
3. Additional options:

  def option(key: String, value: String): DataFrameReader
  def options(options: scala.collection.Map[String, String]): DataFrameReader
4. load: loads the file(s) at the given path on the external file system:

  def load(path: String): DataFrame = {
    option("path", path).load()
  }

  def load(): DataFrame
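As the load(path) body above shows, the path is just the "path" option, so these two calls are equivalent (a sketch; the path is a placeholder):

```scala
// Sketch; requires a Spark 1.x runtime and an existing SQLContext.
val a = sqlContext.read.format("json").load("/data/events.json")
val b = sqlContext.read.format("json").option("path", "/data/events.json").load()
```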
5. jdbc
The matching JDBC driver must first be on the Spark classpath, and every worker node must be able to load it as well, e.g. by placing the driver jars on each node's classpath:

  def jdbc(url: String, table: String, properties: Properties): DataFrame

  def jdbc(
      url: String,
      table: String,
      columnName: String,
      lowerBound: Long,
      upperBound: Long,
      numPartitions: Int,
      connectionProperties: Properties): DataFrame

  def jdbc(
      url: String,
      table: String,
      predicates: Array[String],
      connectionProperties: Properties): DataFrame
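A sketch of the first two overloads; the URL, table name, and credentials are placeholders, and the driver jar must already be on the classpath of the driver and the workers:

```scala
// Sketch; requires a Spark 1.x runtime and a reachable database.
import java.util.Properties

val props = new Properties()
props.setProperty("user", "reader")
props.setProperty("password", "secret")

// Whole-table read:
val orders = sqlContext.read.jdbc("jdbc:mysql://dbhost:3306/shop", "orders", props)

// Partitioned read: 8 partitions over the numeric column "id" in [0, 1000000):
val ordersPar = sqlContext.read.jdbc(
  "jdbc:mysql://dbhost:3306/shop", "orders",
  "id", 0L, 1000000L, 8, props)
```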
6. json:

  def json(path: String): DataFrame = format("json").load(path)
  def json(jsonRDD: RDD[String]): DataFrame
7. parquet:

  def parquet(paths: String*): DataFrame
8. Load a table as a DataFrame; same as the method of the same name on SQLContext:

  def table(tableName: String): DataFrame
IV. DataFrame
1. Inject column names into the schema; the number of names must match the number of columns:

  def toDF(colNames: String*): DataFrame

2. Get the schema information:

  def schema: StructType
3. Get the column names and types:

  def dtypes: Array[(String, String)] = schema.fields.map { field =>
    (field.name, field.dataType.toString)
  }
4. Get the column names:

  def columns: Array[String]

5. Print the schema information:

  def printSchema(): Unit
6. Print the data:

  def show(numRows: Int): Unit
  def show(): Unit = show(20)
7. join:

  def join(right: DataFrame): DataFrame
  def join(right: DataFrame, usingColumn: String): DataFrame
  def join(right: DataFrame, joinExprs: Column): DataFrame
  def join(right: DataFrame, joinExprs: Column, joinType: String): DataFrame
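The four join overloads can be sketched as follows, assuming hypothetical DataFrames `people` (name, deptId) and `depts` (deptId, deptName):

```scala
// Sketch; requires a Spark 1.x runtime, column names are hypothetical.
val cross   = people.join(depts)                             // Cartesian product
val natural = people.join(depts, "deptId")                   // join on a shared column
val byExpr  = people.join(depts, people("deptId") === depts("deptId"))
val outer   = people.join(depts, people("deptId") === depts("deptId"), "left_outer")
```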
8. Sorting, sort/orderBy:

  @scala.annotation.varargs
  def sort(sortCol: String, sortCols: String*): DataFrame

  def sort(sortExprs: Column*): DataFrame

  def orderBy(sortCol: String, sortCols: String*): DataFrame = sort(sortCol, sortCols : _*)

  def orderBy(sortExprs: Column*): DataFrame
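A sketch of the sorting overloads, assuming a DataFrame `df` with "age" and "name" columns and a SQLContext `sqlContext` for the implicits:

```scala
// Sketch; requires a Spark 1.x runtime.
import sqlContext.implicits._

df.sort("age", "name")             // ascending on both columns
df.sort($"age".desc, $"name".asc)  // mixed directions require Column arguments
df.orderBy($"age".desc)            // orderBy delegates to sort
```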
9. Parentheses (apply) convert a column name to a Column:

  def apply(colName: String): Column = col(colName)

10. col converts a column name to a Column, same as apply:

  def col(colName: String): Column

11. Alias:

  def as(alias: String): DataFrame
  def as(alias: Symbol): DataFrame
12. Select a subset of columns:

  def select(cols: Column*): DataFrame
  def select(col: String, cols: String*): DataFrame
13. Select a subset of columns; more flexible than select, since the arguments can be SQL expressions:

  @scala.annotation.varargs
  def selectExpr(exprs: String*): DataFrame
14. filter:

  def filter(condition: Column): DataFrame

  def filter(conditionExpr: String): DataFrame
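The two filter overloads accept either a Column predicate or a SQL-expression string; a sketch, assuming a DataFrame `df` with an "age" column:

```scala
// Sketch; requires a Spark 1.x runtime.
import sqlContext.implicits._

df.filter($"age" > 21)  // Column-based predicate
df.filter("age > 21")   // SQL-expression string, same result
```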
15. where:

  def where(condition: Column): DataFrame
16. groupBy:

  @scala.annotation.varargs
  def groupBy(cols: Column*): GroupedData

  def groupBy(col1: String, cols: String*): GroupedData
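groupBy returns a GroupedData, on which aggregations are then applied; a sketch, assuming a DataFrame `df` with "dept" and "salary" columns:

```scala
// Sketch; requires a Spark 1.x runtime.
import sqlContext.implicits._
import org.apache.spark.sql.functions._

df.groupBy("dept").count()                               // rows per department
df.groupBy($"dept").agg(avg($"salary"), max($"salary"))  // multiple aggregates
```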
17. Take the first n rows:

  def limit(n: Int): DataFrame
18. unionAll:

  def unionAll(other: DataFrame): DataFrame

19. intersect:

  def intersect(other: DataFrame): DataFrame
20. First row(s):

  def head(n: Int): Array[Row]
  def head(): Row
  override def first(): Row
21. map — here rdd is an RDD[Row]; flatMap, mapPartitions, foreach, foreachPartition, take, collect, count, repartition, persist, and cache work as the RDD methods of the same names:

  override def map[R: ClassTag](f: Row => R): RDD[R] = rdd.map(f)
22. distinct:

  override def distinct: DataFrame
23. rdd: the content of the DataFrame as an RDD[Row].
24. Register this DataFrame as a temporary table:

  def registerTempTable(tableName: String): Unit

25. write — output to external storage, described below:

  def write: DataFrameWriter
26. Convert to JSON:

  def toJSON: RDD[String]
V. DataFrameWriter
1. Set the save mode:

  def mode(saveMode: SaveMode): DataFrameWriter = {
    this.mode = saveMode
    this
  }
2. format:

  def format(source: String): DataFrameWriter

3. option:

  def option(key: String, value: String): DataFrameWriter
  def options(options: scala.collection.Map[String, String]): DataFrameWriter

4. partitionBy:

  def partitionBy(colNames: String*): DataFrameWriter
5. save:

  def save(path: String): Unit
  def save(): Unit
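The writer methods above chain in the same fluent style as the reader; a sketch, assuming an existing DataFrame `df` (the output path and partition columns are placeholders):

```scala
// Sketch; requires a Spark 1.x runtime.
import org.apache.spark.sql.SaveMode

df.write
  .format("parquet")
  .mode(SaveMode.Overwrite)      // replace any existing data at the target
  .partitionBy("year", "month")  // hypothetical partition columns
  .save("/output/events")
```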
6. insertInto:

  def insertInto(tableName: String): Unit

7. saveAsTable:

  def saveAsTable(tableName: String): Unit

8. jdbc:

  def jdbc(url: String, table: String, connectionProperties: Properties): Unit
9. json:

  def json(path: String): Unit = format("json").save(path)

10. parquet — parquet is the default format for save, so it need not be set beforehand:

  def parquet(path: String): Unit = format("parquet").save(path)
Reposted from: http://blog.csdn.net/yueqian_zhu/article/details/49587433