Problem description
I have a DynamoDB table that stores email attribute information. It has a hash key on the email and a range key on the timestamp (a number). The initial idea behind using the email as the hash key was to query all records for a given email. One thing I am now trying to do is retrieve all email IDs (the hash key values). I am using boto for this, but I am unsure how to retrieve distinct email IDs.
My current code to pull 10,000 email records is:
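import boto.dynamodb2
from boto.dynamodb2.table import Table
conn = boto.dynamodb2.connect_to_region('us-west-2')
email_attributes = Table('email_attributes', connection=conn)
# Scan up to 10,000 items, fetching only the 'email' attribute
s = email_attributes.scan(limit=10000, attributes=['email'])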
But to retrieve the distinct records, I will have to do a full table scan and then pick out the distinct records in code. Another idea is to maintain a second table that stores only these emails, using a conditional write that writes an email ID only if it does not already exist. I am trying to work out whether this would be more expensive, given that every write would be a conditional write.
Q1.) Is there a way to retrieve distinct records using a DynamoDB scan?
Q2.) Is there a good way to calculate the cost per query?
Recommended answer
Using a DynamoDB Scan, you would need to filter out duplicates on the client side (in your case, using boto). Even if you create a GSI with the reverse schema, you will still get duplicates. Given an H+R table of email_id+timestamp called stamped_emails, the list of all unique email_ids is a materialized view of that table. You could enable a DynamoDB Stream on the stamped_emails table and subscribe a Lambda function to that Stream that does a PutItem (email_id) to a hash-only table called emails_only. Then you could Scan emails_only and you would get no duplicates.
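Both approaches above can be sketched in Python. This is a minimal sketch under stated assumptions, not a definitive implementation: the helper names (distinct_emails, emails_from_stream_event) and the put_item parameter are hypothetical, the Lambda side is assumed to use boto3 rather than boto, and the stream records are assumed to carry the hash key as a string attribute named email_id.

```python
def distinct_emails(items, key="email"):
    """Client-side de-duplication of scanned items, keeping first-seen order."""
    seen = set()
    result = []
    for item in items:
        value = item[key]
        if value not in seen:
            seen.add(value)
            result.append(value)
    return result


def emails_from_stream_event(event, key="email_id"):
    """Collect hash-key values of newly inserted items in a DynamoDB Streams event."""
    emails = set()
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue  # MODIFY/REMOVE records cannot introduce a new email_id
        emails.add(record["dynamodb"]["Keys"][key]["S"])
    return emails


def lambda_handler(event, context, put_item=None):
    """Write each new email_id to the hash-only emails_only table."""
    if put_item is None:
        # Assumption: boto3 is available in the Lambda runtime. Imported lazily
        # so the helpers above can be used without AWS access.
        import boto3
        put_item = boto3.resource("dynamodb").Table("emails_only").put_item
    emails = emails_from_stream_event(event)
    for email in sorted(emails):
        # PutItem overwrites an existing item, so repeated emails are harmless.
        put_item(Item={"email_id": email})
    return len(emails)
```

With the boto code from the question, distinct_emails(s) would de-duplicate the scan results on the client. The optional put_item argument exists only so the handler can be exercised locally; Lambda itself calls the handler with event and context.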
Finally, regarding your question about cost: first, Scan reads entire items even if you request only certain projected attributes from those items. Second, Scan has to read through every item, even those filtered out by a FilterExpression (Condition Expression). Third, Scan reads through items sequentially, which means each Scan call is treated as one big read for metering purposes. The cost implication is that a Scan call that reads 200 different items will not necessarily cost 100 RCU (what 200 separate eventually consistent reads of small items would cost). If each of those items is 100 bytes, that Scan call will cost ROUND_UP((20000 bytes / 1024 bytes per KB) / 8 KB per eventually consistent RCU) = 3 RCU. Even if the call returns only 123 items, if the Scan had to read 200 items you would still incur 3 RCU.
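The arithmetic above can be checked with a short Python sketch. It simply encodes the formula as stated in this answer (8 KB per eventually consistent RCU, rounded up once over the whole Scan call); the function names are hypothetical, and the rounding DynamoDB actually applies when metering may differ in detail.

```python
import math


def scan_rcu_estimate(item_count, item_size_bytes, kb_per_rcu=8):
    """Estimate RCU for one eventually consistent Scan call over the given items."""
    total_kb = item_count * item_size_bytes / 1024
    return math.ceil(total_kb / kb_per_rcu)


def per_item_read_rcu(item_count, item_size_bytes):
    """Cost of reading each item separately: 0.5 RCU per 4 KB, eventually consistent."""
    return item_count * math.ceil(item_size_bytes / 4096) * 0.5


print(scan_rcu_estimate(200, 100))   # 3
print(per_item_read_rcu(200, 100))   # 100.0
```

The second function is included only for comparison with the 100 RCU figure: the gap between the two results is why one sequential Scan is much cheaper than reading the same small items one by one.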