Spark

Spark从入门到精通67:Spark2.0读取文件转成DF

2020-08-21  本文已影响0人  勇于自信

案例1:通过实体类转换
读取数据:\t分割的日志文件

441336  1597716959042   5.0 (Linux; Android 9; JKM-AL00b Build/HUAWEIJKM-AL00b; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/79.0.3945.116 Mobile Safari/537.36;zgjyApp    http://css.tt.gzedu.com/h5  求学圆梦    广东省 广州市 天河区 广东省广州市天河区天府路67号靠近御景雅苑   chooseSchool    other   zgjy    app 2020-08-18
441336  1597716993088   5.0 (Linux; Android 9; JKM-AL00b Build/HUAWEIJKM-AL00b; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/79.0.3945.116 Mobile Safari/537.36;zgjyApp; [did=gkzx863079047666136; gsign=gkzx] http://css.tt.gzedu.com/h5/company-subsidy  企业补助    广东省 广州市 天河区 广东省广州市天河区天府路67号靠近御景雅苑   chooseSchool    other   zgjy    app 2020-08-18
441336  1597717005412   5.0 (Linux; Android 9; JKM-AL00b Build/HUAWEIJKM-AL00b; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/79.0.3945.116 Mobile Safari/537.36;zgjyApp; [did=gkzx863079047666136; gsign=gkzx] http://gd.ttt.workeredu.com/APP2/v2_xx.html 消息  广东省 广州市 天河区 广东省广州市天河区天府路67号靠近御景雅苑   subsidy other   zgjy    app 2020-08-18
441336  1597717006384   5.0 (Linux; Android 9; JKM-AL00b Build/HUAWEIJKM-AL00b; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/79.0.3945.116 Mobile Safari/537.36;zgjyApp; [did=gkzx863079047666136; gsign=gkzx] http://gd.ttt.workeredu.com/APP2/v2_xx.html 消息  广东省 广州市 天河区 广东省广州市天河区天府路67号靠近御景雅苑   subsidy other   zgjy    app 2020-08-18
441336  1597717011482   5.0 (Linux; Android 9; JKM-AL00b Build/HUAWEIJKM-AL00b; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/79.0.3945.116 Mobile Safari/537.36;zgjyApp; [did=gkzx863079047666136; gsign=gkzx] http://gd.ttt.workeredu.com/APP2/v2_xx_xq.html?message_type=5   培训消息    广东省 广州市 天河区 广东省广州市天河区天府路67号靠近御景雅苑   subsidy other   zgjy    app 2020-08-18
441336  1597717016762   5.0 (Linux; Android 9; JKM-AL00b Build/HUAWEIJKM-AL00b; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/79.0.3945.116 Mobile Safari/537.36;zgjyApp; [did=gkzx863079047666136; gsign=gkzx] http://gd.ttt.workeredu.com/APP2/v2_xx_xq.html?message_type=5   培训消息    广东省 广州市 天河区 广东省广州市天河区天府路67号靠近御景雅苑   subsidy other   zgjy    app 2020-08-18
441336  1597717016773   5.0 (Linux; Android 9; JKM-AL00b Build/HUAWEIJKM-AL00b; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/79.0.3945.116 Mobile Safari/537.36;zgjyApp; [did=gkzx863079047666136; gsign=gkzx] http://gd.ttt.workeredu.com/APP2/v2_xx_xq.html?message_type=5   培训消息    广东省 广州市 天河区 广东省广州市天河区天府路67号靠近御景雅苑   subsidy other   zgjy    app 2020-08-18
441336  1597717021798   5.0 (Linux; Android 9; JKM-AL00b Build/HUAWEIJKM-AL00b; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/79.0.3945.116 Mobile Safari/537.36;zgjyApp; [did=gkzx863079047666136; gsign=gkzx] http://gd.ttt.workeredu.com/APP2/v2_xx_nr.html?d_id=1173    消息详情    广东省 广州市 天河区 广东省广州市天河区天府路67号靠近御景雅苑   subsidy other   zgjy    app 2020-08-18
441336  1597717023109   5.0 (Linux; Android 9; JKM-AL00b Build/HUAWEIJKM-AL00b; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/79.0.3945.116 Mobile Safari/537.36;zgjyApp; [did=gkzx863079047666136; gsign=gkzx] http://gd.ttt.workeredu.com/APP2/v2_xx_nr.html?d_id=1173    消息详情    广东省 广州市 天河区 广东省广州市天河区天府路67号靠近御景雅苑   subsidy other   zgjy    app 2020-08-18
441336  1597717026543   5.0 (Linux; Android 9; JKM-AL00b Build/HUAWEIJKM-AL00b; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/79.0.3945.116 Mobile Safari/537.36;zgjyApp; [did=gkzx863079047666136; gsign=gkzx] http://gd.ttt.workeredu.com/APP2/v2_xx_xq.html?message_type=5   培训消息    广东省 广州市 天河区 广东省广州市天河区天府路67号靠近御景雅苑   subsidy other   zgjy    app 2020-08-18
441336  1597717030282   5.0 (Linux; Android 9; JKM-AL00b Build/HUAWEIJKM-AL00b; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/79.0.3945.116 Mobile Safari/537.36;zgjyApp; [did=gkzx863079047666136; gsign=gkzx] http://gd.ttt.workeredu.com/APP2/v2_xx.html 消息  广东省 广州市 天河区 广东省广州市天河区天府路67号靠近御景雅苑   subsidy other   zgjy    app 2020-08-18

实现方法:

public static void localData(JavaSparkContext sc,SparkSession sparkSession) {
        boolean local = ConfigurationManager.getBoolean(Constants.SPARK_LOCAL);
        if(local) {
            JavaRDD<String> lines = sc.textFile("data/000000_0");
            JavaRDD<UserView> studentJavaRDD = lines.map(line->{
                String[] split = line.split("\t");
                String mark_user = split[0];
                String page = split[3];
                String current_page_title = split[4];
                String dt = split[13];
                UserView userView = new UserView(mark_user,page, current_page_title, dt);
                return userView;
            });
            Dataset<Row> studentDF  = sparkSession.createDataFrame(studentJavaRDD, UserView.class);
            studentDF.registerTempTable("dwd_pageview_log");
            Dataset<Row> sql = sparkSession.sql("select * from dwd_pageview_log limit 1");
            sql.show(false);
        }
    }

输出结果:

+------------------+----------+---------+--------------------------+
|current_page_title|dt        |mark_user|page                      |
+------------------+----------+---------+--------------------------+
|求学圆梦              |2020-08-18|441336   |http://css.tt.gzedu.com/h5|
+------------------+----------+---------+--------------------------+

案例二:通过StructType转换
输入数据:

1,leo,17
2,marry,17
3,jack,18
4,tom,19

代码:

package com.micro.spark.zgjy

import com.micro.bigdata.constant.Constants
import com.micro.bigdata.utils.SparkUtils
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object PVUV {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName(Constants.SPARK_APP_NAME_PV)
      .master("local")
      .config("spark.sql.warehouse.dir","d:/")
//      .enableHiveSupport()
      .getOrCreate()
    val sc = spark.sparkContext
    val frame = sc.textFile("data/hello.txt")
    val colsLength = frame.first().split(",").length
    val colNames = new Array[String](colsLength)
    colNames(0) = "username"
    colNames(1) = "age"
    colNames(2) = "gender"
    val schema = StructType(colNames.map(fieldName => StructField(fieldName, StringType)))
    val rowRDD = frame.map(_.split(",")).map(p => Row(p: _*))
    val data = spark.createDataFrame(rowRDD,schema)
    data.createOrReplaceTempView("student")
    val rd = spark.sql("select * from student")
    rd.show(2,false)
//    SparkUtils.localData(sc,spark)

  }
}

输出结果:

+--------+---------+--------+
|username|age      |gender  |
+--------+---------+--------+
|hello   |worldjava|beijing |
|shanghai|hubei    |changsha|
+--------+---------+--------+
only showing top 2 rows
上一篇下一篇

猜你喜欢

热点阅读