pandas DataFrame: Filtering Data with Boolean Conditions (Part 3)

2019-01-11, by 筝韵徽

import pandas as pd
import numpy as np
from tabulate import tabulate

df=pd.read_csv('data/stackoverflow_qa.csv')
dd=df.head()
print(tabulate(dd,headers=df.columns,tablefmt='pipe'))

|    |      id | creationdate        |   score |   viewcount | title                                                                               |   answercount |   commentcount |   favoritecount | quest_name     |   quest_rep | ans_name        |   ans_rep |
|---:|--------:|:--------------------|--------:|------------:|:------------------------------------------------------------------------------------|--------------:|---------------:|----------------:|:---------------|------------:|:----------------|----------:|
|  0 | 5486226 | 2011-03-30 12:26:50 |       4 |        2113 | Rolling median in python                                                            |             3 |              4 |               1 | yueerhu        |         125 | Mike Pennington |     26995 |
|  1 | 5515021 | 2011-04-01 14:50:44 |       8 |        7015 | Compute a compounded return series in Python                                        |             3 |              6 |               7 | Jason Strimpel |        3301 | Mike Pennington |     26995 |
|  2 | 5558607 | 2011-04-05 21:13:50 |       2 |        7392 | Sort a pandas DataMatrix in ascending order                                         |             2 |              0 |               1 | Jason Strimpel |        3301 | Wes McKinney    |     43310 |
|  3 | 6467832 | 2011-06-24 12:31:45 |       9 |       13056 | How to get the correlation between two timeseries using Pandas                      |             1 |              0 |               7 | user814005     |         117 | Wes McKinney    |     43310 |
|  4 | 7577546 | 2011-09-28 01:58:38 |       9 |        2488 | Using pandas, how do I subsample a large DataFrame by group in an efficient manner? |             1 |              0 |               5 | Uri Laserson   |         958 | HYRY            |     54137 |

Filter conditions built by comparing one column with another
- The rows where answercount > score are shown below

dd=df.loc[df['answercount']>df['score']].head()
print(tabulate(dd,headers=df.columns,tablefmt='pipe'))

|    |       id | creationdate        |   score |   viewcount | title                                                                    |   answercount |   commentcount |   favoritecount | quest_name       |   quest_rep | ans_name               |   ans_rep |
|---:|---------:|:--------------------|--------:|------------:|:-------------------------------------------------------------------------|--------------:|---------------:|----------------:|:-----------------|------------:|:-----------------------|----------:|
| 10 |  8273092 | 2011-11-25 18:39:02 |       1 |        2333 | python: pandas install errors                                            |             2 |              0 |             nan | codingknob       |        2279 | codingknob             |      2279 |
| 46 |  9927711 | 2012-03-29 14:42:42 |       1 |        1659 | Reading csv in python pandas and handling bad values                     |             3 |              0 |               2 | Dave31415        |         914 | eumiro                 |    104313 |
| 54 | 10003171 | 2012-04-03 23:59:41 |       1 |         404 | What is an efficient way in pandas to do summaryBy(...,full.dimension=T) |             2 |              1 |             nan | LmW.             |         486 | Wes McKinney           |     43310 |
| 59 | 10027719 | 2012-04-05 11:28:00 |       0 |         500 | Installing Pandas with Python 2.5 on Windows                             |             1 |              0 |             nan | JamesS           |         191 | Wes McKinney           |     43310 |
| 77 | 10393447 | 2012-05-01 04:12:13 |       0 |         130 | Scope gotcha when dynamically adding methods in a loop                   |             2 |              0 |             nan | Chris Billington |         424 | Ignacio Vazquez-Abrams |    513959 |
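
The column-to-column comparison above can be sketched on a small frame. The `toy` data below is hypothetical and only borrows the column names from the dataset; `DataFrame.query` is shown as an alternative spelling of the same condition, not as something the original post uses.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the article's dataset.
toy = pd.DataFrame(
    {
        "score": [4, 1, 0, 9],
        "answercount": [3, 2, 1, 1],
    }
)

# Comparing two columns element-wise yields a boolean Series
# that shares the DataFrame's index.
mask = toy["answercount"] > toy["score"]

# loc keeps the rows whose mask value is True; index labels are preserved.
by_loc = toy.loc[mask]

# DataFrame.query expresses the same condition as a string.
by_query = toy.query("answercount > score")

print(mask.tolist())
print(by_loc.index.tolist())
```

Both forms select the same rows; the explicit mask is easier to reuse when several conditions have to be combined.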

A condition such as df['answercount'] > 2 cannot be passed directly to iloc. Let's look at why.

s = df['answercount'] >2

s.head()

0     True
1     True
2    False
3    False
4    False
Name: answercount, dtype: bool

type(s)

pandas.core.series.Series

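In the example above, df['answercount'] > 2 returns a Series whose index is the same label index that loc uses,
whereas iloc indexes by integer position.
Let's see what happens; it raises an error: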

df.iloc[s]

---------------------------------------------------------------------------

NotImplementedError                       Traceback (most recent call last)

<ipython-input-10-9ddb204bb531> in <module>()
----> 1 df.iloc[s]


~/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
   1476 
   1477             maybe_callable = com._apply_if_callable(key, self.obj)
-> 1478             return self._getitem_axis(maybe_callable, axis=axis)
   1479 
   1480     def _is_scalar_access(self, key):


~/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   2084 
   2085         if com.is_bool_indexer(key):
-> 2086             self._validate_key(key, axis)
   2087             return self._getbool_axis(key, axis=axis)
   2088 


~/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_key(self, key, axis)
   1945             if hasattr(key, 'index') and isinstance(key.index, Index):
   1946                 if key.index.inferred_type == 'integer':
-> 1947                     raise NotImplementedError("iLocation based boolean "
   1948                                               "indexing on an integer type "
   1949                                               "is not available")


NotImplementedError: iLocation based boolean indexing on an integer type is not available
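
One way to see why an indexed mask is ambiguous for iloc is to reorder it. The sketch below uses a hypothetical `toy` frame with non-default labels: loc matches the mask to rows by label, while the bare array of the same mask would be read position by position. The exception type caught here is what recent pandas versions are known to raise and may differ in other releases.

```python
import pandas as pd

# Hypothetical toy frame with labels that differ from positions.
toy = pd.DataFrame({"answercount": [3, 3, 2, 1]}, index=[10, 20, 30, 40])
mask = toy["answercount"] > 2  # True at labels 10 and 20

# Reverse the order of the mask; its labels travel with the values.
shuffled = mask.sort_index(ascending=False)

# loc aligns the mask with the frame by label, so the order does not matter.
by_label = toy.loc[shuffled]

# Read purely by position, the same values would select different rows.
by_position = toy.iloc[shuffled.to_numpy()]

# iloc refuses a mask that carries an index rather than guessing.
try:
    toy.iloc[mask]
    iloc_error = None
except (NotImplementedError, ValueError) as exc:
    iloc_error = type(exc).__name__

print(by_label.index.tolist())
print(by_position.index.tolist())
print(iloc_error)
```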

You can pass only the Series's values instead, as follows:

s.values

array([ True,  True, False, ..., False, False, False])

dd=df.iloc[s.values].head()
print(tabulate(dd,headers=df.columns,tablefmt='pipe'))

|    |      id | creationdate        |   score |   viewcount | title                                                                      |   answercount |   commentcount |   favoritecount | quest_name     |   quest_rep | ans_name        |   ans_rep |
|---:|--------:|:--------------------|--------:|------------:|:---------------------------------------------------------------------------|--------------:|---------------:|----------------:|:---------------|------------:|:----------------|----------:|
|  0 | 5486226 | 2011-03-30 12:26:50 |       4 |        2113 | Rolling median in python                                                   |             3 |              4 |               1 | yueerhu        |         125 | Mike Pennington |     26995 |
|  1 | 5515021 | 2011-04-01 14:50:44 |       8 |        7015 | Compute a compounded return series in Python                               |             3 |              6 |               7 | Jason Strimpel |        3301 | Mike Pennington |     26995 |
|  7 | 7813132 | 2011-10-18 20:16:12 |      10 |       18917 | Convert array of string (category) to array of int from a pandas dataframe |             3 |              0 |               6 | Jean-Pat       |         882 | Wes McKinney    |     43310 |
|  8 | 7837722 | 2011-10-20 14:46:14 |     201 |      223746 | What is the most efficient way to loop through dataframes with pandas?     |             8 |              3 |             115 | Muppet         |        1563 | Nick Crawford   |      2779 |
| 12 | 8842114 | 2012-01-12 20:52:41 |       8 |        4009 | How to apply slicing on pandas Series of strings                           |             3 |              2 |               1 | davidbrai      |         934 | Rob Wouters     |     10083 |
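
As a small sketch of the workaround, the hypothetical `toy` frame below shows that stripping the index from the mask lets iloc accept it, and that the result matches what loc returns with the original Series. `Series.to_numpy()` is used here as the currently recommended spelling of `.values`; that substitution is an assumption of this example, not part of the original post.

```python
import pandas as pd

# Hypothetical toy data with a default integer index.
toy = pd.DataFrame(
    {"title": ["a", "b", "c", "d"], "answercount": [3, 3, 2, 8]}
)
mask = toy["answercount"] > 2

# A plain NumPy boolean array has no index, so iloc reads it by position.
positional_mask = mask.to_numpy()

by_iloc = toy.iloc[positional_mask]
by_loc = toy.loc[mask]

print(positional_mask.tolist())
print(by_iloc.index.tolist())
```

With a default index, positions and labels coincide, so both selections return the same rows.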

dd=df.iloc[s.values,[1,3,4]].head()
print(tabulate(dd,headers=dd.columns,tablefmt='pipe'))

|    | creationdate        |   viewcount | title                                                                      |
|---:|:--------------------|------------:|:---------------------------------------------------------------------------|
|  0 | 2011-03-30 12:26:50 |        2113 | Rolling median in python                                                   |
|  1 | 2011-04-01 14:50:44 |        7015 | Compute a compounded return series in Python                               |
|  7 | 2011-10-18 20:16:12 |       18917 | Convert array of string (category) to array of int from a pandas dataframe |
|  8 | 2011-10-20 14:46:14 |      223746 | What is the most efficient way to loop through dataframes with pandas?     |
| 12 | 2012-01-12 20:52:41 |        4009 | How to apply slicing on pandas Series of strings                           |
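
When iloc selects columns by position, the result carries only the chosen column labels, so any header list should come from the subset rather than from the full frame. The sketch below uses a hypothetical `toy` frame to map positions to names with `df.columns[...]` and to show the equivalent label-based selection with loc.

```python
import pandas as pd

# Hypothetical toy data; only the column names follow the article's dataset.
toy = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "creationdate": ["2011-03-30", "2011-04-01", "2011-04-05"],
        "score": [4, 8, 2],
        "viewcount": [2113, 7015, 7392],
        "answercount": [3, 3, 2],
    }
)
mask = toy["answercount"] > 2
col_positions = [1, 3]

# Rows by boolean array, columns by integer position.
subset = toy.iloc[mask.to_numpy(), col_positions]

# Translate the positions into column names.
selected_names = toy.columns[col_positions].tolist()

# The same selection expressed with labels.
by_name = toy.loc[mask, selected_names]

print(subset.columns.tolist())
```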
