Question:

Grouping by column and summing with TimeGrouper '1M' messes up my date index

严元白
2023-03-14

Error found: the code snippet posted below as a solution works. The problem with my results was rooted in the data source (FEC.GOV). I've tracked it down and am now moving on. Thanks to the community for all the time, patience, help, etc. on this question!

Since it was suggested that I use the code snippets from the github site, I'm providing a link to the original files (http://fec.gov/finance/disclosure/ftpdet.shtml#a2011_2012). I'm using the 2008-2014 data files: pas212.zip, data name: Contributions to Candidates (and other expenditures) from Committees. Again, the code below can be found at [https://github.com/Michae108/python-coding.git]. Thanks in advance for any help in solving this problem. I've been working on it for three days, and it should be a very simple task. I import and concatenate four '|'-separated value files, read them into a pandas DataFrame, and set the date column as the datetime index. That gives me the following output:

              cmte_id trans_typ entity_typ state  amount     fec_id    cand_id
date                                                                          
2007-08-15  C00112250       24K        ORG    DC    2000  C00431569  P00003392
2007-09-26  C00119040       24K        CCM    FL    1000  C00367680  H2FL05127
2007-09-26  C00119040       24K        CCM    MD    1000  C00140715  H2MD05155
2007-07-20  C00346296       24K        CCM    CA    1000  C00434571  H8CA37137

Next, I want to group the index at a one-month frequency, and then sum [amount] by [trans_typ] and [cand_id].

Here is my code for doing that:

import numpy as np
import pandas as pd
import glob

df = pd.concat(
    (pd.read_csv(f, sep='|', header=None, low_memory=False,
                 names=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10',
                        '11', '12', '13', 'date', '15', '16', '17', '18',
                        '19', '20', '21', '22'],
                 index_col=None, dtype={'date': str})
     for f in glob.glob('/home/jayaramdas/anaconda3/Thesis/FEC_data/itpas2_data/itpas2**.txt')))

df.dropna(subset=['17'], inplace=True)  
df.dropna(subset=['date'], inplace=True)  
df['date'] = pd.to_datetime(df['date'], format='%m%d%Y')
df1 = df.set_index('date')
df2 = df1[['1', '6', '7', '10', '15', '16', '17']].copy() 
df2.columns = ['cmte_id', 'trans_typ', 'entity_typ', 'state', 'amount',\
               'fec_id','cand_id']

df2['amount'] = df2['amount'].astype(float)

grouper = df2.groupby([pd.TimeGrouper('1M'), 'cand_id', 'trans_typ'])

df = grouper['amount'].sum()
grouper['amount'].sum().unstack().fillna(0)
#print (df.head())
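
For reference, newer pandas versions replace pd.TimeGrouper with pd.Grouper; below is a minimal sketch of the intended monthly grouping under that assumption, using a tiny made-up frame shaped like df2 above:

import pandas as pd

# tiny made-up frame shaped like df2 (values are illustrative only)
df2 = pd.DataFrame(
    {'cand_id':   ['P00003392', 'H2FL05127', 'H2FL05127'],
     'trans_typ': ['24K', '24K', '24K'],
     'amount':    [2000.0, 1000.0, 1000.0]},
    index=pd.DatetimeIndex(['2007-08-15', '2007-09-26', '2007-09-26'],
                           name='date'))

# pd.Grouper(freq='M') is the modern spelling of pd.TimeGrouper('1M')
monthly = (df2.groupby([pd.Grouper(freq='M'), 'cand_id', 'trans_typ'])['amount']
              .sum()
              .unstack()
              .fillna(0))
print(monthly)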

This is the output when I run the code:

trans_typ             24A  24C  24E  24F   24K  24N  24R  24Z
date       cand_id
1954-07-31 S8AK00090    0    0    0    0  1000    0    0    0
1985-09-30 H8OH18088    0    0   36    0     0    0    0    0
1997-04-30 S6ND00058    0    0    0    0  1000    0    0    0

Here is a sample of my raw data, in case it helps:

C00409409|N|Q2|P|29992447808|24K|CCM|PERRIELLO FOR CONGRESS|IVY|VA|22945|||06262009|500|C00438788|H8VA05106|D310246|424490|||4072320091116608455
C00409409|N|Q2|P|29992447807|24K|CCM|JOHN BOCCIERI FOR CONGRESS|ALLIANCE|OH|44601|||06262009|500|C00435065|H8OH16058|D310244|424490|||4072320091116608452
C00409409|N|Q2|P|29992447807|24K|CCM|MIKE MCMAHON FOR CONGRESS|STATEN ISLAND|NY|10301|||06262009|500|C00451138|H8NY13077|D310245|424490|||4072320091116608453
C00409409|N|Q2|P|29992447808|24K|CCM|MINNICK FOR CONGRESS|BOISE|ID|83701|||06262009|500|C00441105|H8ID01090|D310243|424490|||4072320091116608454
C00409409|N|Q2|P|29992447807|24K|CCM|ADLER FOR CONGRESS|MARLTON|NJ|08053|||06262009|500|C00439067|H8NJ03156|D310247|424490|||4072320091116608451
C00435164|N|Q2|P|29992448007|24K|CCM|ALEXI FOR ILLINOIS EXPLORATORY COMMITTEE||||||06292009|1500|C00459586|S0IL00204|SB21.4124|424495|||4071620091116385529
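
As a sanity check on the raw format: the date is the 14th pipe-separated field (0-based column 13) in MMDDYYYY form, so a single record should parse like this (a quick sketch, assuming nothing beyond the sample line above):

from io import StringIO
import pandas as pd

# one raw record copied from the sample above
line = ('C00409409|N|Q2|P|29992447808|24K|CCM|PERRIELLO FOR CONGRESS|IVY|VA|'
        '22945|||06262009|500|C00438788|H8VA05106|D310246|424490|||'
        '4072320091116608455')

row = pd.read_csv(StringIO(line), sep='|', header=None, dtype={13: str})
# MMDDYYYY -> Timestamp; errors='coerce' would turn a bad/empty field into NaT
print(pd.to_datetime(row.iloc[0, 13], format='%m%d%Y'))  # 2009-06-26 00:00:00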

2 Answers

百里成仁
2023-03-14

Update

I think the problem, which @jezrael already mentioned, is caused by missing dates together with the following two lines:

df.dropna(subset=['17'], inplace=True)  
df.dropna(subset=['date'], inplace=True) 

That's why you may want to find the "problematic" rows first and then sanitize them (set some date that makes sense to you):

import pandas as pd
import glob

def get_headers(fn):
    with open(fn, 'r') as f:
        for line in f:
            if ',' in line:
                return line.strip().split(',')


####################################################
# Data Dictionary - Contributions to Candidates from Committees
# http://www.fec.gov/finance/disclosure/metadata/DataDictionaryContributionstoCandidates.shtml
# http://www.fec.gov/finance/disclosure/metadata/pas2_header_file.csv
#
headers_file = 'pas2_header_file.csv'

interesting_cols = ['CMTE_ID', 'TRANSACTION_TP', 'ENTITY_TP', 'STATE',
                    'TRANSACTION_DT', 'TRANSACTION_AMT', 'OTHER_ID', 'CAND_ID']

#
# rename columns rules
#
rename_cols = {
  'TRANSACTION_TP':     'trans_typ',
  'TRANSACTION_DT':     'date',
  'TRANSACTION_AMT':    'amount',
  'OTHER_ID':           'fec_id',
}

#
# all columns/headers (already renamed)
#
all_cols = [rename_cols.get(col, col.lower())
            for col in get_headers(headers_file)]

#
# columns to use in read_csv() (already renamed)
#
cols = [rename_cols.get(col, col.lower())
        for col in get_headers(headers_file) if col in interesting_cols]


####################################################


df = pd.concat(
        (pd.read_csv(
            f,
            sep='|',
            usecols=cols,
            header=None,
            low_memory=False,
            names=all_cols,
            index_col=None,
            parse_dates=['date'],
            date_parser=lambda x: pd.to_datetime(x, format='%m%d%Y'),
         )
         for f in glob.glob('./itpas2.txt'))
     )

# print rows where 'date' is empty
print(df[pd.isnull(df.date)])

#
# sanitize NaT/empty dates in order to prevent problems with an index in future
#
df.date.fillna(pd.Timestamp('20110101'), inplace=True)

# the rest is your code almost unchanged; note that the date column must be
# the index for pd.TimeGrouper to work:
grouper = df.set_index('date').groupby([pd.TimeGrouper('1M'), 'cand_id', 'trans_typ'])
grouper['amount'].sum().unstack().fillna(0)

Rows with empty dates:

          cmte_id trans_typ entity_tp state date  amount     fec_id    cand_id
52372   C00317446       24K       NaN    CA  NaT    2500  C00409219  H6CA05195
57731   C00416693       24K       IND    DC  NaT    2500  C00463836  H2NM02126
58386   C00152892       24K       NaN    DC  NaT    1000  C00359034  H0MO06073
145715  C00154641       24K       IND    DC  NaT    1000  C00257337  H2CA37023
193651  C00000992       24K       NaN    MI  NaT     500  C00390724  H4MI07103
212982  C00454074       24E       ORG    CA  NaT    1138  S2TX00312  S2TX00312
212983  C00454074       24E       ORG    CA  NaT    4764  S2TX00312  S2TX00312
212984  C00454074       24E       ORG    CA  NaT    7058  S2MO00403  S2MO00403
212985  C00454074       24E       ORG    CA  NaT    5000  S2MO00403  S2MO00403
212986  C00454074       24E       ORG    CA  NaT   50003  S8WI00158  S8WI00158
212987  C00454074       24E       ORG    CA  NaT    8830  S8WI00158  S8WI00158
212988  C00454074       24E       ORG    CA  NaT   22189  S8WI00158  S8WI00158
212989  C00454074       24E       ORG    CA  NaT   11258  S8WI00158  S8WI00158
212990  C00454074       24E       ORG    CA  NaT    5000  S8WI00158  S8WI00158
212991  C00454074       24E       ORG    CA  NaT    7743  S2MO00403  S2MO00403
212992  C00454074       24E       ORG    CA  NaT   12463  S0MI00056  S0MI00056
212993  C00454074       24E       ORG    CA  NaT    2795  S8WI00158  S8WI00158
213034  C00454074       24E       ORG    CA  NaT    6431  S2IN00083  S2IN00083
213035  C00454074       24E       ORG    CA  NaT   28015  S2TX00312  S2TX00312
213036  C00454074       24E       ORG    CA  NaT    5395  S8NE00091  S8NE00091
213037  C00454074       24E       ORG    CA  NaT   19399  S2MO00403  S2MO00403
213038  C00454074       24E       ORG    CA  NaT    2540  S2IN00083  S2IN00083
213039  C00454074       24E       ORG    FL  NaT    1500  S2IN00083  S2IN00083
213040  C00454074       24E       ORG    CA  NaT    8065  S2TX00312  S2TX00312
213041  C00454074       24E       ORG    CA  NaT   11764  S2TX00312  S2TX00312
213042  C00454074       24E       ORG    CA  NaT   61214  S2TX00312  S2TX00312
213043  C00454074       24E       ORG    CA  NaT   44634  S2MO00403  S2MO00403
213044  C00454074       24E       ORG    TN  NaT   15000  S2TX00312  S2TX00312
213045  C00454074       24E       ORG    CA  NaT    5176  S2TX00312  S2TX00312
214642  C90014358       24E       NaN    VA  NaT    2000  S6MT00097  S6MT00097
214643  C90014358       24E       NaN    VA  NaT    2000  H2MT01060  H2MT01060
214644  C90014358       24E       NaN    DC  NaT     139  S6MT00097  S6MT00097
214645  C90014358       24E       NaN    DC  NaT     139  H2MT01060  H2MT01060
214646  C90014358       24E       NaN    DC  NaT     149  S6MT00097  S6MT00097
214647  C90014358       24E       NaN    DC  NaT     149  H2MT01060  H2MT01060
216428  C00023580       24E       ORG    VA  NaT    3352  P80003338  P80003338
216445  C00023580       24E       ORG    VA  NaT     250  P80003338  P80003338
216446  C00023580       24E       ORG    VA  NaT     333  P80003338  P80003338
216447  C00023580       24E       ORG    VA  NaT    2318  P80003338  P80003338
216448  C00023580       24E       ORG    VA  NaT     583  P80003338  P80003338
216449  C00023580       24E       ORG    VA  NaT    2969  P80003338  P80003338
216450  C00023580       24E       ORG    VA  NaT   14011  P80003338  P80003338
216451  C00023580       24E       ORG    VA  NaT     383  P80003338  P80003338
216452  C00023580       24E       ORG    VA  NaT     366  P80003338  P80003338
216453  C00023580       24E       ORG    VA  NaT     984  P80003338  P80003338
216454  C00023580       24E       ORG    VA  NaT     542  P80003338  P80003338
216503  C00023580       24E       ORG    VA  NaT    3077  P80003338  P80003338
216504  C00023580       24E       ORG    VA  NaT    3002  P80003338  P80003338
216505  C00023580       24E       ORG    VA  NaT    5671  P80003338  P80003338
216506  C00023580       24E       ORG    VA  NaT    3853  P80003338  P80003338
231905  C00454074       24E       ORG    CA  NaT   26049  S4WV00084  S4WV00084
231906  C00454074       24E       ORG    NC  NaT  135991  P80003353  P80003353
231907  C00454074       24E       ORG    FL  NaT    5000  P80003353  P80003353
231908  C00454074       24E       ORG    TX  NaT   12500  P80003353  P80003353
231909  C00454074       24A       ORG    TX  NaT   12500  P80003338  P80003338
234844  C00417519       24K       NaN    NY  NaT    2500  C00272633  H2NY26080
281989  C00427203       24K       NaN    DC  NaT     500  C00412304  S6MT00162
309146  C00500785       24A       NaN   NaN  NaT       0  H4FL20023  H4FL20023
310225  C00129189       24K       NaN    MI  NaT    1000  C00347476  H0MI10071
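
An alternative to back-filling a sentinel date (just a sketch of another option, not part of the code above) is to coerce unparseable dates to NaT up front, inspect them, and then drop them before grouping:

import pandas as pd

# hypothetical mini-frame standing in for the concatenated FEC data
raw = pd.DataFrame({'date_raw': ['06262009', '', '13451999'],  # ok / empty / bad month
                    'amount':   [500, 2500, 1000]})

# errors='coerce' turns unparseable strings into NaT instead of raising
raw['date'] = pd.to_datetime(raw['date_raw'], format='%m%d%Y', errors='coerce')
print(raw[raw['date'].isnull()])     # inspect the problematic rows first
clean = raw.dropna(subset=['date'])  # ...or fillna() a sentinel date, as above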

PS: I've added a few helper functions/variables (get_headers(), interesting_cols, rename_cols, all_cols, cols), which may help you process other data/CSV files from fec.gov in the future.
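
A possible usage of that helper, assuming pas2_header_file.csv from the link above has been downloaded next to the script:

# get_headers() returns the comma-separated header names from the FEC header file
headers = get_headers('pas2_header_file.csv')
print(headers)  # a list of column names starting with 'CMTE_ID', per the data dictionary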

Original answer, based on the sample data

Code for the "cut" sample data set:

#import numpy as np
import pandas as pd
import glob

#dtparser = lambda x: pd.datetime.fromtimestamp(int(x))

cols = ['cmte_id', 'trans_typ', 'entity_typ', 'state',
        'date', 'amount', 'fec_id', 'cand_id']

df = pd.concat(
        (pd.read_csv(
            f,
            sep='|',
            usecols=[0, 5, 6, 9, 13, 14, 15, 16],
            header=None,
            low_memory=False,
            #names=cols,
            index_col=None,
            parse_dates=[13],
            date_parser=lambda x: pd.to_datetime(x, format='%m%d%Y'),
            #dtype={5: np.float64}
         )
         for f in glob.glob('./itpas2**github.txt'))
     )
df.columns = cols
df.trans_typ = df.trans_typ.astype('category')
#print(df.head())
#print(df.dtypes)

a = df.set_index('date').\
        groupby([pd.TimeGrouper('1M'), 'cand_id', 'trans_typ']).\
        agg({'amount': sum}).\
        reset_index()

print(a.pivot_table(index=['date', 'cand_id'],
                    columns='trans_typ',
                    values='amount',
                    fill_value=0,
                    aggfunc='sum').tail(10))

Output:

trans_typ             24A  24C  24E  24F   24K  24N  24R  24Z
date       cand_id
2013-02-28 S0FL00312    0    0    0    0     0    0    0    0
           S0IA00028    0    0    0    0     0    0    0    0
           S0IL00204    0    0    0    0     0    0    0    0
           S2ND00099    0    0    0    0  1000    0    0    0
           S4ME00055    0    0    0    0     0    0    0    0
           S4SC00240    0    0    0    0  5000    0    0    0
           S6MN00267    0    0    0    0     0    0    0    0
           S6NV00028    0    0    0    0  2500    0    0    0
           S6PA00100    0    0    0    0     0    0    0    0
           S8MT00010    0    0    0    0  3500    0    0    0

PS: In your files, trans_typ has only the single value 24K, so it couldn't be pivoted into multiple columns. That's why I manipulated it in the CSV files, so that we now have different values.
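
A tiny illustration of that point (made-up values, not from the FEC data): when trans_typ holds a single value, the pivot collapses to one column, so there is nothing to spread out:

import pandas as pd

# illustrative frame in which trans_typ has only the value '24K'
demo = pd.DataFrame({'date':      pd.to_datetime(['2013-02-28', '2013-02-28']),
                     'cand_id':   ['S2ND00099', 'S4SC00240'],
                     'trans_typ': ['24K', '24K'],
                     'amount':    [1000, 5000]})

print(demo.pivot_table(index=['date', 'cand_id'], columns='trans_typ',
                       values='amount', fill_value=0, aggfunc='sum'))
# -> the result has just a single '24K' column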

宗政学
2023-03-14

This is a bit complicated. date_parser raised an error, so the date column is first read as string by read_csv. Then the date column is converted with to_datetime and all NaN values are dropped. Finally, you can use groupby and unstack:

import pandas as pd
import glob



# change the path to yours
df = pd.concat((pd.read_csv(f, 
                            sep='|', 
                            header=None, 
                            names=['cmte_id', '2', '3', '4', '5', 'trans_typ', 'entity_typ', '8', '9', 'state', '11', 'employer', 'occupation', 'date', 'amount', 'fec_id', 'cand_id', '18', '19', '20', '21', '22'], 
                            usecols= ['date', 'cmte_id', 'trans_typ', 'entity_typ', 'state', 'employer', 'occupation', 'amount', 'fec_id', 'cand_id'],
                            dtype={'date': str}
                           ) for f in glob.glob('test/itpas2_data/itpas2**.txt')), ignore_index=True)


#parse column date to datetime
df['date'] = pd.to_datetime(df['date'], format='%m%d%Y')

#remove rows, where date is NaN
df = df[df['date'].notnull()]

#set column date to index
df = df.set_index('date')

g = df.groupby([pd.TimeGrouper('1M'), 'cand_id', 'trans_typ'])['amount'].sum()
print(g.unstack().fillna(0))

trans_typ                24A  24C   24E  24F     24K  24N  24R  24Z
date       cand_id                                                 
2001-09-30 H2HI02110       0    0     0    0    2500    0    0    0
2007-03-31 S6TN00216       0    0     0    0    2000    0    0    0
2007-10-31 H8IL21021       0    0     0    0   -1000    0    0    0
2008-03-31 S6TN00216       0    0     0    0    1000    0    0    0
2008-07-31 H2PA11098       0    0     0    0    1000    0    0    0
           H4KS03105       0    0     0    0   49664    0    0    0
           H6KS03183       0    0     0    0    1000    0    0    0
2008-10-31 H8KS02090       0    0     0    0    1000    0    0    0
           S6TN00216       0    0     0    0    1500    0    0    0
2008-12-31 H6KS01146       0    0     0    0    2000    0    0    0
2009-02-28 S6OH00163       0    0     0    0   -1000    0    0    0
2009-03-31 S2KY00012       0    0     0    0    2000    0    0    0
           S6WY00068       0    0     0    0   -2500    0    0    0
2009-06-30 S6TN00216       0    0     0    0   -1000    0    0    0
2009-08-31 S0MO00183       0    0     0    0    1000    0    0    0
2009-09-30 S0NY00410       0    0     0    0    1000    0    0    0
2009-10-31 S6OH00163       0    0     0    0   -2500    0    0    0
           S6WY00068       0    0     0    0   -1000    0    0    0
2009-11-30 H8MO09153       0    0     0    0     500    0    0    0
           S0NY00410       0    0     0    0   -1000    0    0    0
           S6OH00163       0    0     0    0    -500    0    0    0
2009-12-31 H0MO00019       0    0     0    0     500    0    0    0
           S6TN00216       0    0     0    0   -1000    0    0    0
2010-01-31 H0CT03072       0    0     0    0     250    0    0    0
           S0MA00109       0    0     0    0    5000    0    0    0
2010-02-28 S6TN00216       0    0     0    0   -1500    0    0    0
2010-03-31 H0MO00019       0    0     0    0     500    0    0    0
           S0NY00410       0    0     0    0   -2500    0    0    0
2010-05-31 H0MO06149       0    0     0    0     530    0    0    0
           S6OH00163       0    0     0    0   -1000    0    0    0
...                      ...  ...   ...  ...     ...  ...  ...  ...
2012-12-31 S6UT00063       0    0     0    0    5000    0    0    0
           S6VA00093       0    0     0    0   97250    0    0    0
           S6WY00068       0    0     0    0    1500    0    0    0
           S6WY00126       0    0     0    0   11000    0    0    0
           S8AK00090       0    0     0    0  132350    0    0    0
           S8CO00172       0    0     0    0   88500    0    0    0
           S8DE00079       0    0     0    0    6000    0    0    0
           S8FL00166       0    0     0    0    -932    0    0  651
           S8ID00027       0    0     0    0   13000    0    0  326
           S8ID00092       0    0     0    0    2500    0    0    0
           S8MI00158       0    0     0    0    7500    0    0    0
           S8MI00281     110    0     0    0    3000    0    0    0
           S8MN00438       0    0     0    0   65500    0    0    0
           S8MS00055       0    0     0    0   21500    0    0    0
           S8MS00196       0    0     0    0     500    0    0  650
           S8MT00010       0    0     0    0  185350    0    0    0
           S8NC00239       0    0     0    0   67000    0    0    0
           S8NE00067       0   40     0    0       0    0    0    0
           S8NE00117       0    0     0    0   13000    0    0    0
           S8NJ00392       0    0     0    0   -5000    0    0    0
           S8NM00168       0    0     0    0   -2000    0    0    0
           S8NM00184       0    0     0    0   51000    0    0    0
           S8NY00082       0    0     0    0    1000    0    0    0
           S8OR00207       0    0     0    0   23500    0    0    0
           S8VA00214       0    0   120    0   -2000    0    0    0
           S8WA00194       0    0     0    0   -4500    0    0    0
2013-10-31 P80003338  314379    0     0    0       0    0    0    0
           S8VA00214   14063    0     0    0       0    0    0    0
2013-11-30 H2NJ03183       0    0  2333    0       0    0    0    0
2014-10-31 S6PA00217       0    0     0    0    1500    0    0    0
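
As a small follow-up (not part of the answer itself), one way to pull a single candidate's monthly totals back out of the unstacked result:

totals = g.unstack().fillna(0)
# .xs() slices one cand_id across all months
print(totals.xs('S8AK00090', level='cand_id'))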