
pandas DataFrame: finding duplicates based on a column and a time range

Question

I have a (heavily simplified) pandas DataFrame that looks like this:

df

    datetime             user   type   msg
0  2012-11-11 15:41:08   u1     txt    hello world
1  2012-11-11 15:41:11   u2     txt    hello world
2  2012-11-21 17:00:08   u3     txt    hello world
3  2012-11-22 18:08:35   u4     txt      hello you
4  2012-11-22 18:08:37   u5     txt      hello you

What I would like to do is get all duplicate messages whose timestamps are within 3 seconds of each other. The desired output would be:

    datetime             user   type   msg
0  2012-11-11 15:41:08   u1     txt    hello world
1  2012-11-11 15:41:11   u2     txt    hello world
3  2012-11-22 18:08:35   u4     txt      hello you
4  2012-11-22 18:08:37   u5     txt      hello you

The third row (index 2) is left out: its text is the same as in rows one and two, but its timestamp is not within 3 seconds of theirs.

I tried passing the columns datetime and msg to the duplicated() method, but it returns an empty DataFrame because the timestamps are not identical:

mask = df.duplicated(subset=['datetime', 'msg'], keep=False)

print(df[mask])
Empty DataFrame
Columns: [datetime, user, type, msg, MD5]
Index: []

Is there a way to define a range for the "datetime" parameter? To illustrate, something like:

mask = df.duplicated(subset=['datetime_between_3_seconds', 'msg'], keep=False)

Any help would, as always, be very much appreciated.

Recommended answer

This piece of code gives the expected output:

gap = df.groupby("msg")["datetime"].diff().fillna(pd.Timedelta(0))  # needs a datetime64 column and pandas imported as pd
df[gap <= pd.Timedelta(seconds=3)]

I grouped the DataFrame by the "msg" column, selected the "datetime" column, and applied the built-in diff function, which computes the difference between consecutive values within each group. I then filled the NaT values (the first row of each group) with zero and kept only the rows whose difference is at most 3 seconds.
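One detail worth checking when comparing the gaps produced by diff: the seconds attribute of a timedelta holds only the seconds component (0 to 86399), not the whole duration, so a gap of one day and two seconds also reports 2. The short sketch below, using made-up gap values, shows the difference and two safer comparisons.

```python
import pandas as pd

# Two illustrative gaps (not taken from the question's data):
# one of 2 seconds, one of 1 day and 2 seconds.
gaps = pd.Series(pd.to_timedelta(["2s", "1 days 00:00:02"]))

# .dt.seconds drops the day part, so both gaps look like 2 seconds.
seconds_component = gaps.dt.seconds.tolist()

# .dt.total_seconds() returns the full duration in seconds.
total_seconds = gaps.dt.total_seconds().tolist()

# Comparing directly against a Timedelta avoids the conversion entirely.
within_window = (gaps <= pd.Timedelta(seconds=3)).tolist()

print(seconds_component, total_seconds, within_window)
```

Comparing against pd.Timedelta(seconds=3) or using total_seconds() keeps long gaps from slipping through the filter.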

Before using the code above, make sure the DataFrame is sorted by datetime in ascending order.
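Note that filling the first row of each group with zero keeps that row even when no other message with the same text falls inside the window (for example, a message that occurs only once). The following is a minimal sketch of a variant that checks the gap to both the previous and the next row with the same text. The helper name near_duplicates and its parameters are hypothetical, it is not part of the answer above, and it assumes a unique index and a datetime64 column.

```python
import pandas as pd

# Sample data rebuilt from the question's table.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2012-11-11 15:41:08", "2012-11-11 15:41:11", "2012-11-21 17:00:08",
        "2012-11-22 18:08:35", "2012-11-22 18:08:37",
    ]),
    "user": ["u1", "u2", "u3", "u4", "u5"],
    "type": ["txt"] * 5,
    "msg": ["hello world", "hello world", "hello world", "hello you", "hello you"],
})


def near_duplicates(frame, key="msg", time_col="datetime", window="3s"):
    """Hypothetical helper: rows sharing `key` with another row within `window`."""
    window = pd.Timedelta(window)
    ordered = frame.sort_values(time_col)        # diff only makes sense on sorted times
    times = ordered.groupby(key)[time_col]
    gap_prev = times.diff()                      # time since the previous row with the same key
    gap_next = times.diff(-1).abs()              # time until the next row with the same key
    # Comparisons against NaT are False, so rows without a neighbour drop out.
    mask = (gap_prev <= window) | (gap_next <= window)
    return frame[mask.reindex(frame.index)]      # assumes a unique index


result = near_duplicates(df)
print(result)
```

On the question's data this selects the rows with index 0, 1, 3 and 4. Because the helper sorts internally, the input does not need to be pre-sorted.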
