Problem Description
I'm trying to read data from MySQL and write it back to Parquet files in S3 with specific partitions, as follows:
from pyspark.sql.functions import to_date

df = sqlContext.read.format('jdbc') \
    .options(driver='com.mysql.jdbc.Driver',
             url="jdbc:mysql://<host>:3306/<db>?user=<usr>&password=<pass>",
             dbtable='tbl',
             numPartitions=4) \
    .load()

df2 = df.withColumn('updated_date', to_date(df.updated_at))
df2.write.parquet(path='s3n://parquet_location', mode='append', partitionBy=['updated_date'])
My problem is that Spark opens only one connection to MySQL (instead of 4), and it doesn't write any Parquet output until it has fetched all the data from MySQL. Because my table in MySQL is huge (100M rows), the process fails with an OutOfMemory error.
Is there a way to configure Spark to open more than one connection to MySQL and to write partial data to Parquet as it goes?
Recommended Answer
You should set these properties:
partitionColumn,
lowerBound,
upperBound,
numPartitions
as documented here: http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
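A minimal sketch of the reworked read, assuming the table has a numeric, indexed column named `id` to split on (the column name and the bounds are assumptions, not from the question; use the real min/max of your partition column):

```python
# Sketch: the same JDBC read, but with the four partitioning options set
# so Spark opens numPartitions parallel connections instead of one.
jdbc_options = {
    "driver": "com.mysql.jdbc.Driver",
    "url": "jdbc:mysql://<host>:3306/<db>?user=<usr>&password=<pass>",
    "dbtable": "tbl",
    "partitionColumn": "id",     # assumed: a numeric, indexed column
    "lowerBound": "1",           # assumed: min(id)
    "upperBound": "100000000",   # assumed: max(id), ~100M rows per the question
    "numPartitions": "4",        # 4 parallel connections / read tasks
}

# Spark turns these options into one range query per partition,
# e.g. WHERE id >= 1 AND id < 25000001, and so on.
# df = sqlContext.read.format('jdbc').options(**jdbc_options).load()
```

Each partition is then fetched and written as its own task, so the whole 100M-row table never has to sit in a single executor's memory before the Parquet write starts.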