Spark从入门到精通67:Spark2.0读取文件转成DF
2020-08-21 本文已影响0人
勇于自信
案例1:通过实体类转换
读取数据:\t分割的日志文件
441336 1597716959042 5.0 (Linux; Android 9; JKM-AL00b Build/HUAWEIJKM-AL00b; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/79.0.3945.116 Mobile Safari/537.36;zgjyApp http://css.tt.gzedu.com/h5 求学圆梦 广东省 广州市 天河区 广东省广州市天河区天府路67号靠近御景雅苑 chooseSchool other zgjy app 2020-08-18
441336 1597716993088 5.0 (Linux; Android 9; JKM-AL00b Build/HUAWEIJKM-AL00b; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/79.0.3945.116 Mobile Safari/537.36;zgjyApp; [did=gkzx863079047666136; gsign=gkzx] http://css.tt.gzedu.com/h5/company-subsidy 企业补助 广东省 广州市 天河区 广东省广州市天河区天府路67号靠近御景雅苑 chooseSchool other zgjy app 2020-08-18
441336 1597717005412 5.0 (Linux; Android 9; JKM-AL00b Build/HUAWEIJKM-AL00b; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/79.0.3945.116 Mobile Safari/537.36;zgjyApp; [did=gkzx863079047666136; gsign=gkzx] http://gd.ttt.workeredu.com/APP2/v2_xx.html 消息 广东省 广州市 天河区 广东省广州市天河区天府路67号靠近御景雅苑 subsidy other zgjy app 2020-08-18
441336 1597717006384 5.0 (Linux; Android 9; JKM-AL00b Build/HUAWEIJKM-AL00b; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/79.0.3945.116 Mobile Safari/537.36;zgjyApp; [did=gkzx863079047666136; gsign=gkzx] http://gd.ttt.workeredu.com/APP2/v2_xx.html 消息 广东省 广州市 天河区 广东省广州市天河区天府路67号靠近御景雅苑 subsidy other zgjy app 2020-08-18
441336 1597717011482 5.0 (Linux; Android 9; JKM-AL00b Build/HUAWEIJKM-AL00b; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/79.0.3945.116 Mobile Safari/537.36;zgjyApp; [did=gkzx863079047666136; gsign=gkzx] http://gd.ttt.workeredu.com/APP2/v2_xx_xq.html?message_type=5 培训消息 广东省 广州市 天河区 广东省广州市天河区天府路67号靠近御景雅苑 subsidy other zgjy app 2020-08-18
441336 1597717016762 5.0 (Linux; Android 9; JKM-AL00b Build/HUAWEIJKM-AL00b; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/79.0.3945.116 Mobile Safari/537.36;zgjyApp; [did=gkzx863079047666136; gsign=gkzx] http://gd.ttt.workeredu.com/APP2/v2_xx_xq.html?message_type=5 培训消息 广东省 广州市 天河区 广东省广州市天河区天府路67号靠近御景雅苑 subsidy other zgjy app 2020-08-18
441336 1597717016773 5.0 (Linux; Android 9; JKM-AL00b Build/HUAWEIJKM-AL00b; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/79.0.3945.116 Mobile Safari/537.36;zgjyApp; [did=gkzx863079047666136; gsign=gkzx] http://gd.ttt.workeredu.com/APP2/v2_xx_xq.html?message_type=5 培训消息 广东省 广州市 天河区 广东省广州市天河区天府路67号靠近御景雅苑 subsidy other zgjy app 2020-08-18
441336 1597717021798 5.0 (Linux; Android 9; JKM-AL00b Build/HUAWEIJKM-AL00b; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/79.0.3945.116 Mobile Safari/537.36;zgjyApp; [did=gkzx863079047666136; gsign=gkzx] http://gd.ttt.workeredu.com/APP2/v2_xx_nr.html?d_id=1173 消息详情 广东省 广州市 天河区 广东省广州市天河区天府路67号靠近御景雅苑 subsidy other zgjy app 2020-08-18
441336 1597717023109 5.0 (Linux; Android 9; JKM-AL00b Build/HUAWEIJKM-AL00b; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/79.0.3945.116 Mobile Safari/537.36;zgjyApp; [did=gkzx863079047666136; gsign=gkzx] http://gd.ttt.workeredu.com/APP2/v2_xx_nr.html?d_id=1173 消息详情 广东省 广州市 天河区 广东省广州市天河区天府路67号靠近御景雅苑 subsidy other zgjy app 2020-08-18
441336 1597717026543 5.0 (Linux; Android 9; JKM-AL00b Build/HUAWEIJKM-AL00b; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/79.0.3945.116 Mobile Safari/537.36;zgjyApp; [did=gkzx863079047666136; gsign=gkzx] http://gd.ttt.workeredu.com/APP2/v2_xx_xq.html?message_type=5 培训消息 广东省 广州市 天河区 广东省广州市天河区天府路67号靠近御景雅苑 subsidy other zgjy app 2020-08-18
441336 1597717030282 5.0 (Linux; Android 9; JKM-AL00b Build/HUAWEIJKM-AL00b; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/79.0.3945.116 Mobile Safari/537.36;zgjyApp; [did=gkzx863079047666136; gsign=gkzx] http://gd.ttt.workeredu.com/APP2/v2_xx.html 消息 广东省 广州市 天河区 广东省广州市天河区天府路67号靠近御景雅苑 subsidy other zgjy app 2020-08-18
实现方法:
public static void localData(JavaSparkContext sc,SparkSession sparkSession) {
boolean local = ConfigurationManager.getBoolean(Constants.SPARK_LOCAL);
if(local) {
JavaRDD<String> lines = sc.textFile("data/000000_0");
JavaRDD<UserView> studentJavaRDD = lines.map(line->{
String[] split = line.split("\t");
String mark_user = split[0];
String page = split[3];
String current_page_title = split[4];
String dt = split[13];
UserView userView = new UserView(mark_user,page, current_page_title, dt);
return userView;
});
Dataset<Row> studentDF = sparkSession.createDataFrame(studentJavaRDD, UserView.class);
studentDF.registerTempTable("dwd_pageview_log");
Dataset<Row> sql = sparkSession.sql("select * from dwd_pageview_log limit 1");
sql.show(false);
}
}
输出结果:
+------------------+----------+---------+--------------------------+
|current_page_title|dt |mark_user|page |
+------------------+----------+---------+--------------------------+
|求学圆梦 |2020-08-18|441336 |http://css.tt.gzedu.com/h5|
+------------------+----------+---------+--------------------------+
案例二:通过StructType转换
输入数据:
1,leo,17
2,marry,17
3,jack,18
4,tom,19
代码:
package com.micro.spark.zgjy
import com.micro.bigdata.constant.Constants
import com.micro.bigdata.utils.SparkUtils
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
object PVUV {
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder()
.appName(Constants.SPARK_APP_NAME_PV)
.master("local")
.config("spark.sql.warehouse.dir","d:/")
// .enableHiveSupport()
.getOrCreate()
val sc = spark.sparkContext
val frame = sc.textFile("data/hello.txt")
val colsLength = frame.first().split(",").length
val colNames = new Array[String](colsLength)
colNames(0) = "username"
colNames(1) = "age"
colNames(2) = "gender"
val schema = StructType(colNames.map(fieldName => StructField(fieldName, StringType)))
val rowRDD = frame.map(_.split(",")).map(p => Row(p: _*))
val data = spark.createDataFrame(rowRDD,schema)
data.createOrReplaceTempView("student")
val rd = spark.sql("select * from student")
rd.show(2,false)
// SparkUtils.localData(sc,spark)
}
}
输出结果:
+--------+---------+--------+
|username|age |gender |
+--------+---------+--------+
|hello |worldjava|beijing |
|shanghai|hubei |changsha|
+--------+---------+--------+
only showing top 2 rows