Problem Description
I'm trying to read data from MySQL and write it back to Parquet files in S3 with specific partitions, as follows:
from pyspark.sql.functions import to_date

df = sqlContext.read.format('jdbc') \
    .options(driver='com.mysql.jdbc.Driver',
             url="jdbc:mysql://<host>:3306/<db>?user=<usr>&password=<pass>",
             dbtable='tbl',
             numPartitions=4) \
    .load()

df2 = df.withColumn('updated_date', to_date(df.updated_at))
df2.write.parquet(path='s3n://parquet_location', mode='append', partitionBy=['updated_date'])
My problem is that Spark opens only one connection to MySQL (instead of 4), and it doesn't write any Parquet output until it has fetched all the data from MySQL. Because my table in MySQL is huge (100M rows), the process fails with an OutOfMemory error.
Is there a way to configure Spark to open more than one connection to MySQL and to write partial data to Parquet as it goes?
Recommended Answer
You should set these properties:
partitionColumn,
lowerBound,
upperBound,
numPartitions
as documented here: http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
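A minimal sketch of the reworked read, assuming the table has a numeric, indexed column named `id` to split on (the column name and the bounds are assumptions, not from the question; use the real min/max of your partition column):

```python
# Sketch: the same JDBC read, but with the four partitioning options set
# so Spark opens numPartitions parallel connections instead of one.
jdbc_options = {
    "driver": "com.mysql.jdbc.Driver",
    "url": "jdbc:mysql://<host>:3306/<db>?user=<usr>&password=<pass>",
    "dbtable": "tbl",
    "partitionColumn": "id",     # assumed: a numeric, indexed column
    "lowerBound": "1",           # assumed: min(id)
    "upperBound": "100000000",   # assumed: max(id), ~100M rows per the question
    "numPartitions": "4",        # 4 parallel connections / read tasks
}

# Spark turns these options into one range query per partition,
# e.g. WHERE id >= 1 AND id < 25000001, and so on.
# df = sqlContext.read.format('jdbc').options(**jdbc_options).load()
```

Each partition is then fetched and written as its own task, so the whole 100M-row table never has to sit in a single executor's memory before the Parquet write starts.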