Problem description
I have a (greatly simplified) pandas DataFrame that looks like this:
df
datetime user type msg
0 2012-11-11 15:41:08 u1 txt hello world
1 2012-11-11 15:41:11 u2 txt hello world
2 2012-11-21 17:00:08 u3 txt hello world
3 2012-11-22 18:08:35 u4 txt hello you
4 2012-11-22 18:08:37 u5 txt hello you
What I would like to do now is get all the duplicate messages whose timestamps are within 3 seconds of each other. The desired output would be:
datetime user type msg
0 2012-11-11 15:41:08 u1 txt hello world
1 2012-11-11 15:41:11 u2 txt hello world
3 2012-11-22 18:08:35 u4 txt hello you
4 2012-11-22 18:08:37 u5 txt hello you
The third row is excluded: its text is the same as in rows one and two, but its timestamp is not within 3 seconds of theirs.
I tried passing the columns datetime and msg as the subset for the duplicated() method, but it returns an empty DataFrame because the timestamps are not identical:
mask = df.duplicated(subset=['datetime', 'msg'], keep=False)
print(df[mask])
Empty DataFrame
Columns: [datetime, user, type, msg, MD5]
Index: []
Is there a way to define a range for my "datetime" parameter? To illustrate, something like:
mask = df.duplicated(subset=['datetime_between_3_seconds', 'msg'], keep=False)
Any help here would, as always, be very much appreciated.
Recommended answer
This piece of code gives the expected output:
df[df.groupby("msg")["datetime"].diff().fillna(pd.Timedelta(0)).dt.total_seconds() <= 3]
I grouped the DataFrame on the "msg" column, selected the "datetime" column, and applied the built-in diff function, which computes the difference between consecutive values within each group. The NaT values (the first row of each group) are filled with zero, and only the rows whose difference is at most 3 seconds are selected.
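One detail matters when comparing these differences against a threshold: on a timedelta column, `.dt.seconds` holds only the seconds component of each value (the days are ignored), while `.dt.total_seconds()` gives the full duration. The following small sketch uses two made-up timestamps, ten days and three seconds apart, to show the difference:

```python
import pandas as pd

# Two illustrative timestamps, ten days and three seconds apart.
stamps = pd.Series(pd.to_datetime(["2012-11-11 15:41:08", "2012-11-21 15:41:11"]))
gap = stamps.diff()

seconds_component = gap.dt.seconds.iloc[1]      # seconds part only, days ignored
total_seconds = gap.dt.total_seconds().iloc[1]  # full duration in seconds

print(seconds_component, total_seconds)
```

With a `.dt.seconds <= 3` test, a gap like this one would wrongly pass as being within 3 seconds, so `.dt.total_seconds()` is the safer choice.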
Before using the code above, make sure your DataFrame is sorted by datetime in ascending order.
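Because the NaT values are filled with zero, the first message of every group is always kept, even if no other message with the same text follows within 3 seconds. As a possible variation (a sketch, not the answer's original method), the gap to both the previous and the next message with the same text can be checked. The sample data is rebuilt from the question, and the names `window`, `gap_prev`, `gap_next` and `result` are illustrative:

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2012-11-11 15:41:08", "2012-11-11 15:41:11", "2012-11-21 17:00:08",
        "2012-11-22 18:08:35", "2012-11-22 18:08:37",
    ]),
    "user": ["u1", "u2", "u3", "u4", "u5"],
    "type": ["txt"] * 5,
    "msg": ["hello world", "hello world", "hello world", "hello you", "hello you"],
})

window = pd.Timedelta(seconds=3)  # threshold taken from the question

# The diffs below assume ascending time order within each message group.
df = df.sort_values("datetime")
times = df.groupby("msg")["datetime"]

# Gap to the previous and to the next message with the same text.
gap_prev = times.diff()
gap_next = times.diff(-1).abs()

# Comparisons with NaT evaluate to False, so a message without a
# close neighbour on either side is dropped.
mask = (gap_prev <= window) | (gap_next <= window)
result = df[mask]
print(result)
```

On the sample data this keeps rows 0, 1, 3 and 4, which matches the desired output in the question.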