Saya memiliki kerangka data di mana kolom dalam bentuk daftar json. Saya ingin mengekstrak nilai (skor) tertentu dari kolom dan membuat kolom independen.

raw_data = [{"user_id" : 1234, "col" : [{"id":14577120145280,"score":64.71,"Elastic_position":0},{"id":14568530280240,"score":88.53,"Elastic_position":1},{"id":14568530119661,"score":63.75,"Elastic_position":2},{"id":14568530205858,"score":62.79,"Elastic_position":3},{"id":14568530414899,"score":60.88,"Elastic_position":4}]}]

df = pd.DataFrame.from_dict(raw_data)

Saya ingin meledakkan kerangka data hasil saya sebagai:

enter image description here

1
Teresa 21 September 2021, 13:37

2 jawaban

Jawaban Terbaik

Dengan asumsi Anda memiliki json Anda terlihat seperti ini

# a.json
# {
#     "user_id" : 1234,
#     "col" : [
#         {"id":14577120145280,"score":64.71,"Elastic_position":0},
#         {"id":14568530280240,"score":88.53,"Elastic_position":1},
#         {"id":14568530119661,"score":63.75,"Elastic_position":2},
#         {"id":14568530205858,"score":62.79,"Elastic_position":3},
#         {"id":14568530414899,"score":60.88,"Elastic_position":4}
#     ]
# }

Anda dapat membacanya, meratakannya, lalu memutarnya seperti itu

from pyspark.sql import functions as F
from pyspark.sql import types as T

schema = T.StructType([
    T.StructField('user_id', T.IntegerType()),
    T.StructField('col', T.StringType()),
])

df = spark.read.json('a.json', multiLine=True, schema=schema)
df.show(10, False)

# +-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
# |user_id|col                                                                                                                                                                                                                                                                                           |
# +-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
# |1234   |[{"id":14577120145280,"score":64.71,"Elastic_position":0},{"id":14568530280240,"score":88.53,"Elastic_position":1},{"id":14568530119661,"score":63.75,"Elastic_position":2},{"id":14568530205858,"score":62.79,"Elastic_position":3},{"id":14568530414899,"score":60.88,"Elastic_position":4}]|
# +-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+


df.printSchema()
# root
#  |-- user_id: integer (nullable = true)
#  |-- col: string (nullable = true)

(df
    # this will parse your JSON string to JSON object
    .withColumn('col', F.from_json(
        F.col('col'),
        T.ArrayType(T.StructType([
            T.StructField('id', T.LongType()),
            T.StructField('score', T.DoubleType()),
            T.StructField('Elastic_position', T.IntegerType()),
        ]))
    ))
 
    .select('user_id', F.explode('col'))
    .groupBy('user_id')
    .pivot('col.Elastic_position')
    .agg(F.first('col.score'))
    .show(10, False)
)

# Output
# +-------+-----+-----+-----+-----+-----+
# |user_id|0    |1    |2    |3    |4    |
# +-------+-----+-----+-----+-----+-----+
# |1234   |64.71|88.53|63.75|62.79|60.88|
# +-------+-----+-----+-----+-----+-----+
1
pltc 22 September 2021, 15:45

Coba gunakan pd.Series.explode dengan groupby:

df = pd.DataFrame.from_dict(raw_data).explode('col')
df.assign(col=df['col'].str['score']).groupby('user_id').agg(list).apply(lambda x: (y:=x.explode()).set_axis(y.index + '_' + y.groupby(level=0).cumcount().astype(str)), axis=1).reset_index()

   user_id  col_0  col_1  col_2  col_3  col_4
0     1234  64.71  88.53  63.75  62.79  60.88

Jika pertama-tama membuat kerangka data dan meledakkan kolom col, kemudian mengelompokkan dengan user_id yang digandakan dan melakukan explode lain untuk membuatnya panjang menjadi lebar, lalu tambahkan awalan 0 ke 4 dengan cumcount.

0
U12-Forward 21 September 2021, 10:45