I am running a basic script that loops over a nested dictionary, grabs the data from each record, and appends it to a pandas DataFrame. The data looks something like this:

```python
data = {"SomeCity": {"Date1": {record1, record2, record3, ...}, "Date2": {}, ...}, ...}
```

It has a few million records in total. The script itself looks like this (note the list must be named `cities` for the loop below to work):

```python
cities = ["SomeCity"]
df = DataFrame({}, columns=['Date', 'HouseID', 'Price'])
for city in cities:
    for dateRun in data[city]:
        for record in data[city][dateRun]:
```
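The truncated loop above can be sketched end to end as follows. This is a minimal illustration, not the asker's actual code: the record layout (`(house_id, price)` tuples) and the sample values are assumptions, and rows are collected in a plain list and turned into a DataFrame once at the end, since appending to a DataFrame inside a loop copies it on every iteration:

```python
import pandas as pd

# Hypothetical nested structure mirroring the question:
# data[city][date] -> iterable of (house_id, price) records
data = {
    "SomeCity": {
        "Date1": [("H1", 100000), ("H2", 250000)],
        "Date2": [("H3", 175000)],
    }
}

cities = ["SomeCity"]

# Collect plain Python rows first; build the DataFrame a single time.
rows = []
for city in cities:
    for dateRun in data[city]:
        for house_id, price in data[city][dateRun]:
            rows.append({"Date": dateRun, "HouseID": house_id, "Price": price})

df = pd.DataFrame(rows, columns=["Date", "HouseID", "Price"])
```

With a few million records, the list-then-construct pattern avoids the quadratic cost of growing the DataFrame row by row.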
I have some data in the following format (an RDD or a Spark DataFrame):

```python
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
rdd = sc.parallelize([('X01', 41, 'US', 3),
                      ('X01', 41, 'UK', 1),
                      ('X01', 41, 'CA', 2),
                      ('X02', 72, 'US', 4),
                      ('X02', 72, 'UK', 6),
                      ('X02', 72, 'CA', 7),
                      ('X02', 72, 'XX', 8)])

# convert to a Spark DataFrame
schema = StructType([StructField('ID',
```
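The question cuts off mid-schema, but the shape of the data suggests a pivot: one output row per `(ID, Age)` pair with one column per country. In newer Spark versions this is what `df.groupBy('ID', 'Age').pivot('Country').sum(...)` does; as a dependency-free illustration of the same reshaping, here is a pure-Python sketch over the sample rows (the column names `ID`, `Age`, `Country`, and the score field are assumptions inferred from the data):

```python
from collections import defaultdict

# The question's sample rows: (ID, Age, Country, score)
rows = [
    ('X01', 41, 'US', 3), ('X01', 41, 'UK', 1), ('X01', 41, 'CA', 2),
    ('X02', 72, 'US', 4), ('X02', 72, 'UK', 6), ('X02', 72, 'CA', 7),
    ('X02', 72, 'XX', 8),
]

# Group scores by (ID, Age), keyed by country.
pivoted = defaultdict(dict)
for id_, age, country, score in rows:
    pivoted[(id_, age)][country] = score

# One column per distinct country, in a stable order;
# missing (ID, Country) combinations become None.
countries = sorted({country for _, _, country, _ in rows})
table = [
    (id_, age, *[scores.get(c) for c in countries])
    for (id_, age), scores in sorted(pivoted.items())
]
```

The `None` for `('X01', 'XX')` mirrors the null a Spark pivot would produce for a country that never occurs for that ID.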