Implementing Data Bucketing in PySpark (Bucketizer)

2021-07-23  米斯特芳
from pyspark.sql import SparkSession
from pyspark.ml.feature import Bucketizer
spark = SparkSession\
    .builder\
    .appName("BucketizerExample")\
    .getOrCreate()
# Bucket boundaries: five edges define four buckets
splits = [-float("inf"), -0.5, 0.0, 0.5, float("inf")]
data = [(-999.9,), (-0.5,), (-0.3,), (0.0,), (0.2,), (999.9,)]
dataFrame = spark.createDataFrame(data, ["features"])
# splits: bucket boundaries; inputCol: input column; outputCol: name of the bucketed output column
bucketizer = Bucketizer(splits=splits, inputCol="features", outputCol="bucketedFeatures")
# Transform original data into its bucket index.
bucketedData = bucketizer.transform(dataFrame)
print("Bucketizer output with %d buckets" % (len(bucketizer.getSplits())-1))
bucketedData.show()
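Each bucket is a half-open interval [a, b), except that the last bucket also includes its upper boundary. With the splits above, running the snippet should produce output along these lines:

Bucketizer output with 4 buckets
+--------+----------------+
|features|bucketedFeatures|
+--------+----------------+
|  -999.9|             0.0|
|    -0.5|             1.0|
|    -0.3|             1.0|
|     0.0|             2.0|
|     0.2|             2.0|
|   999.9|             3.0|
+--------+----------------+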
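A note on robustness: real data often contains NULLs or values outside the outer boundaries. The sketch below is not from the original post; it assumes Spark 3.0+ (for the multi-column form), and the column names feat1/feat2 are made up for illustration. It shows the handleInvalid option and bucketizing several columns in one pass.

# A minimal sketch, assuming Spark 3.0+; feat1/feat2 are hypothetical columns
multi_bucketizer = Bucketizer(
    splitsArray=[splits, [-float("inf"), 0.0, float("inf")]],  # one splits list per input column
    inputCols=["feat1", "feat2"],
    outputCols=["feat1_bucket", "feat2_bucket"],
    handleInvalid="keep",  # NULL/out-of-range rows go into an extra bucket instead of raising an error
)
df2 = spark.createDataFrame([(0.1, -1.0), (None, 2.0)], ["feat1", "feat2"])
multi_bucketizer.transform(df2).show()

With handleInvalid="keep", the NULL in feat1 lands in an extra bucket whose index is one past the last regular bucket (here, 4.0); with the default "error", the transform would fail on that row.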