Java-Spark Dataset中向前填充缺失值
2023-01-29 本文已影响0人
小KKKKKKKK
在处理时序数据时,会遇到缺失数据的情况,在Dataset数据填充中会有均值填充、最大值填充、最小值填充,但业务要求中要使用前一个有值的数据去填充缺失数据。通过查询发现可以使用Spark-SQL中的window函数(窗口函数)进行处理。
示例如下
先将数据读取出来,这步省略,读取出的dataset如下
+-----------+-------------+-------+
| id| timestamp| value|
+-----------+-------------+-------+
|10001990000|1663120800000|285.108|
|10001990000|1663121100000|293.296|
|10001990000|1663121400000|297.988|
|10001990000|1663121700000|310.960|
|10001990000|1663122000000|317.216|
|10001990000|1663122300000|337.180|
|10001990000|1663122600000|344.080|
|10001990000|1663122900000|346.656|
|10001990000|1663123200000|358.892|
|10001990000|1663123500000|362.204|
|10001990000|1663123800000|368.092|
|10001990000|1663124100000|373.060|
|10001990000|1663124400000|384.928|
|10001990000|1663124700000|393.852|
|10001990000|1663125000000|395.140|
|10001990000|1663125300000|403.052|
|10001990000|1663125600000|403.236|
|10001990000|1663125900000|408.756|
|10001990000|1663126200000|413.540|
|10001990000|1663126500000|421.636|
|10001990000|1663126800000|421.912|
|10001990000|1663127100000|422.648|
|10001990000|1663127400000|429.272|
|10001990000|1663127700000|439.576|
|10001990000|1663128000000|448.592|
|10001990000|1663128300000|435.344|
|10001990000|1663128600000|454.112|
|10001990000|1663128900000|450.800|
|10001990000|1663129200000|461.656|
|10001990000|1663129500000|460.092|
|10001990000|1663129800000|469.752|
|10001990000|1663130100000|461.932|
|10001990000|1663130400000|463.496|
|10001990000|1663130700000| null|
|10001990000|1663131000000| null|
|10001990000|1663131300000| null|
|10001990000|1663131600000| null|
|10001990000|1663131900000| null|
|10001990000|1663132200000| null|
|10001990000|1663132500000|463.772|
|10001990000|1663132800000|460.368|
|10001990000|1663133100000|463.036|
|10001990000|1663133400000|478.676|
|10001990000|1663133700000|455.676|
|10001990000|1663134000000|453.468|
|10001990000|1663134300000|460.920|
|10001990000|1663134600000|457.424|
|10001990000|1663134900000|440.312|
|10001990000|1663135200000|443.716|
|10001990000|1663135500000|422.740|
|10001990000|1663135800000|434.608|
|10001990000|1663136100000|430.928|
|10001990000|1663136400000|429.272|
|10001990000|1663136700000|417.220|
|10001990000|1663137000000|415.656|
|10001990000|1663137300000|261.740|
|10001990000|1663137600000|425.316|
|10001990000|1663137900000|406.916|
|10001990000|1663138200000|400.108|
|10001990000|1663138500000|209.484|
|10001990000|1663138800000|394.128|
|10001990000|1663139100000|382.260|
|10001990000|1663139400000|390.632|
|10001990000|1663139700000|363.768|
|10001990000|1663140000000|226.320|
|10001990000|1663140300000|213.716|
|10001990000|1663140600000|125.488|
|10001990000|1663140900000|100.372|
|10001990000|1663141200000|111.872|
|10001990000|1663141500000|178.388|
|10001990000|1663141800000|106.260|
|10001990000|1663142100000|119.048|
|10001990000|1663142400000|133.676|
|10001990000|1663142700000|116.104|
|10001990000|1663143000000| 88.688|
|10001990000|1663143300000| 94.852|
|10001990000|1663143600000| 69.184|
|10001990000|1663143900000| 67.436|
|10001990000|1663144200000| 88.136|
|10001990000|1663144500000|212.244|
|10001990000|1663144800000|228.436|
|10001990000|1663145100000|228.436|
|10001990000|1663145400000|160.264|
|10001990000|1663145700000|196.788|
|10001990000|1663146000000|184.736|
|10001990000|1663146300000| 97.980|
|10001990000|1663146600000|159.988|
|10001990000|1663146900000|196.052|
|10001990000|1663147200000|171.120|
|10001990000|1663147500000|146.556|
|10001990000|1663147800000|130.456|
|10001990000|1663148100000|121.072|
|10001990000|1663148400000|108.928|
|10001990000|1663148700000|101.568|
|10001990000|1663149000000| 94.024|
|10001990000|1663149300000| 82.340|
|10001990000|1663149600000| 69.736|
|10001990000|1663149900000| 61.364|
|10001990000|1663150200000| 53.820|
|10001990000|1663150500000| 49.312|
|10001990000|1663150800000| 39.836|
|10001990000|1663151100000| 33.212|
|10001990000|1663151400000| 26.680|
|10001990000|1663151700000| 22.356|
|10001990000|1663152000000| 18.032|
|10001990000|1663152300000| 13.708|
|10001990000|1663152600000| 9.844|
|10001990000|1663152900000| 7.636|
|10001990000|1663153200000| 2.392|
|10001990000|1663153500000| 2.208|
|10001990000|1663153800000| 0.920|
+-----------+-------------+-------+
可以发现,有一段时间戳对应的value值为null,现在需要将其补全。补全后的数据与timestamp为“1663130400000”时刻的一模一样才对,代码如下:
WindowSpec windowSpec = Window.partitionBy(col("id")).orderBy(col("timestamp"));
Dataset<Row> datasetNew = dataset.withColumn("valueNew", last(col("value"), true).over(windowSpec));
datasetNew.show()
新的dataset为:
+-----------+-------------+-------+--------+
| id| timestamp| value|valueNew|
+-----------+-------------+-------+--------+
|10001990000|1663120800000|285.108| 285.108|
|10001990000|1663121100000|293.296| 293.296|
|10001990000|1663121400000|297.988| 297.988|
|10001990000|1663121700000|310.960| 310.960|
|10001990000|1663122000000|317.216| 317.216|
|10001990000|1663122300000|337.180| 337.180|
|10001990000|1663122600000|344.080| 344.080|
|10001990000|1663122900000|346.656| 346.656|
|10001990000|1663123200000|358.892| 358.892|
|10001990000|1663123500000|362.204| 362.204|
|10001990000|1663123800000|368.092| 368.092|
|10001990000|1663124100000|373.060| 373.060|
|10001990000|1663124400000|384.928| 384.928|
|10001990000|1663124700000|393.852| 393.852|
|10001990000|1663125000000|395.140| 395.140|
|10001990000|1663125300000|403.052| 403.052|
|10001990000|1663125600000|403.236| 403.236|
|10001990000|1663125900000|408.756| 408.756|
|10001990000|1663126200000|413.540| 413.540|
|10001990000|1663126500000|421.636| 421.636|
|10001990000|1663126800000|421.912| 421.912|
|10001990000|1663127100000|422.648| 422.648|
|10001990000|1663127400000|429.272| 429.272|
|10001990000|1663127700000|439.576| 439.576|
|10001990000|1663128000000|448.592| 448.592|
|10001990000|1663128300000|435.344| 435.344|
|10001990000|1663128600000|454.112| 454.112|
|10001990000|1663128900000|450.800| 450.800|
|10001990000|1663129200000|461.656| 461.656|
|10001990000|1663129500000|460.092| 460.092|
|10001990000|1663129800000|469.752| 469.752|
|10001990000|1663130100000|461.932| 461.932|
|10001990000|1663130400000|463.496| 463.496|
|10001990000|1663130700000| null| 463.496|
|10001990000|1663131000000| null| 463.496|
|10001990000|1663131300000| null| 463.496|
|10001990000|1663131600000| null| 463.496|
|10001990000|1663131900000| null| 463.496|
|10001990000|1663132200000| null| 463.496|
|10001990000|1663132500000|463.772| 463.772|
|10001990000|1663132800000|460.368| 460.368|
|10001990000|1663133100000|463.036| 463.036|
|10001990000|1663133400000|478.676| 478.676|
|10001990000|1663133700000|455.676| 455.676|
|10001990000|1663134000000|453.468| 453.468|
|10001990000|1663134300000|460.920| 460.920|
|10001990000|1663134600000|457.424| 457.424|
|10001990000|1663134900000|440.312| 440.312|
|10001990000|1663135200000|443.716| 443.716|
|10001990000|1663135500000|422.740| 422.740|
|10001990000|1663135800000|434.608| 434.608|
|10001990000|1663136100000|430.928| 430.928|
|10001990000|1663136400000|429.272| 429.272|
|10001990000|1663136700000|417.220| 417.220|
|10001990000|1663137000000|415.656| 415.656|
|10001990000|1663137300000|261.740| 261.740|
|10001990000|1663137600000|425.316| 425.316|
|10001990000|1663137900000|406.916| 406.916|
|10001990000|1663138200000|400.108| 400.108|
|10001990000|1663138500000|209.484| 209.484|
|10001990000|1663138800000|394.128| 394.128|
|10001990000|1663139100000|382.260| 382.260|
|10001990000|1663139400000|390.632| 390.632|
|10001990000|1663139700000|363.768| 363.768|
|10001990000|1663140000000|226.320| 226.320|
|10001990000|1663140300000|213.716| 213.716|
|10001990000|1663140600000|125.488| 125.488|
|10001990000|1663140900000|100.372| 100.372|
|10001990000|1663141200000|111.872| 111.872|
|10001990000|1663141500000|178.388| 178.388|
|10001990000|1663141800000|106.260| 106.260|
|10001990000|1663142100000|119.048| 119.048|
|10001990000|1663142400000|133.676| 133.676|
|10001990000|1663142700000|116.104| 116.104|
|10001990000|1663143000000| 88.688| 88.688|
|10001990000|1663143300000| 94.852| 94.852|
|10001990000|1663143600000| 69.184| 69.184|
|10001990000|1663143900000| 67.436| 67.436|
|10001990000|1663144200000| 88.136| 88.136|
|10001990000|1663144500000|212.244| 212.244|
|10001990000|1663144800000|228.436| 228.436|
|10001990000|1663145100000|228.436| 228.436|
|10001990000|1663145400000|160.264| 160.264|
|10001990000|1663145700000|196.788| 196.788|
|10001990000|1663146000000|184.736| 184.736|
|10001990000|1663146300000| 97.980| 97.980|
|10001990000|1663146600000|159.988| 159.988|
|10001990000|1663146900000|196.052| 196.052|
|10001990000|1663147200000|171.120| 171.120|
|10001990000|1663147500000|146.556| 146.556|
|10001990000|1663147800000|130.456| 130.456|
|10001990000|1663148100000|121.072| 121.072|
|10001990000|1663148400000|108.928| 108.928|
|10001990000|1663148700000|101.568| 101.568|
|10001990000|1663149000000| 94.024| 94.024|
|10001990000|1663149300000| 82.340| 82.340|
|10001990000|1663149600000| 69.736| 69.736|
|10001990000|1663149900000| 61.364| 61.364|
|10001990000|1663150200000| 53.820| 53.820|
|10001990000|1663150500000| 49.312| 49.312|
|10001990000|1663150800000| 39.836| 39.836|
|10001990000|1663151100000| 33.212| 33.212|
|10001990000|1663151400000| 26.680| 26.680|
|10001990000|1663151700000| 22.356| 22.356|
|10001990000|1663152000000| 18.032| 18.032|
|10001990000|1663152300000| 13.708| 13.708|
|10001990000|1663152600000| 9.844| 9.844|
|10001990000|1663152900000| 7.636| 7.636|
|10001990000|1663153200000| 2.392| 2.392|
|10001990000|1663153500000| 2.208| 2.208|
|10001990000|1663153800000| 0.920| 0.920|
+-----------+-------------+-------+--------+
可以发现在valueNew字段中,没有null值得出现,在value为null的情况下,valueNew的值与timestamp为“1663130400000”的时候的值是相同的,补全成功。