9 - Importing and exporting data with pandas
eddy_linux, posted 3 months ago

#encoding:utf8



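'''
Importing and exporting data:
    index and column names
    handling missing values
    reading data in chunks
    saving data to disk
    binary data
    other formats: HDF5, Excel, JSON, SQL, NoSQL
'''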

import numpy as np
import pandas as pd

# Read a file
'''
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
(the file has a header row but no row-index column)
'''
print(pd.read_csv('data/ex1.csv'))
'''
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
'''
# read_table with an explicit comma separator; sep also accepts a regular expression
print(pd.read_table('data/ex1.csv',sep=','))
'''
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
'''

# A file with neither a header row nor a row index
'''
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
'''
print(pd.read_csv('data/ex2.csv'))
# Tell pandas there is no header row and supply the column names explicitly
print(pd.read_csv('data/ex2.csv',header=None,names=['a','b','c','d','msg']))

'''
   1   2   3   4  hello
0  5   6   7   8  world
1  9  10  11  12    foo
By default the first row is used as the column names.

   a   b   c   d    msg
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo
'''
# Use one of the named columns as the row index
print(pd.read_csv('data/ex2.csv',header=None,names=['a','b','c','d','msg'],index_col='msg'))
'''
       a   b   c   d
msg
hello  1   2   3   4
world  5   6   7   8
foo    9  10  11  12
'''
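
The header and index options above can be tried without any file on disk. The following is a minimal sketch that assumes an in-memory buffer (`io.StringIO`) as a stand-in for data/ex2.csv; the variable names `raw` and `frame` are illustrative only.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for data/ex2.csv (no header row).
raw = io.StringIO("1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n")

# header=None: the first line is data, not column names.
# names=: supplies the column labels.
# index_col=: promotes one of those named columns to the row index.
frame = pd.read_csv(raw, header=None,
                    names=['a', 'b', 'c', 'd', 'msg'], index_col='msg')

# Rows can now be looked up by their msg label.
print(frame.loc['world', 'c'])
```

Once `msg` is the index it no longer appears among the data columns, which is why the output above shows only a, b, c and d.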

# Irregular separators (a variable amount of whitespace)
'''
            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382  1.100491
'''
print(pd.read_csv('data/ex3.csv'))
'''
               A         B         C
0  aaa -0.264438 -1.026059 -0.619500
1  bbb  0.927272  0.302904 -0.032399
2  ccc -0.264273 -0.386314 -0.217601
3  ddd -0.871858 -0.348382  1.100491
With no commas in the file, read_csv treats each whole line as a single field.
'''
# Raw string, so that \s is not parsed as a string escape sequence
print(pd.read_table('data/ex3.csv',sep=r'\s+'))
'''
            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382  1.100491
The header row has one fewer field than the data rows, so pandas infers that the first column is the index.
'''
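
The whitespace case can be reproduced in a self-contained way. This sketch assumes two rows copied from the listing above and held in an in-memory buffer; `text` and `ws` are illustrative names, and `pd.read_csv` is used with `sep` since it accepts the same argument as `read_table`.

```python
import io
import pandas as pd

# Two data rows from the example, separated by a variable number of spaces.
text = (
    "            A         B         C\n"
    "aaa -0.264438 -1.026059 -0.619500\n"
    "bbb  0.927272  0.302904 -0.032399\n"
)

# r'\s+' matches one or more whitespace characters.
ws = pd.read_csv(io.StringIO(text), sep=r'\s+')

# The header has 3 fields and each data row has 4, so the first
# field of every data row becomes the index.
print(ws.index.tolist())
print(ws.columns.tolist())
```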

# Missing data
'''
something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
'''
print(pd.read_csv('data/ex5.csv'))
print(pd.read_csv('data/ex5.csv',na_values=['NA','NULL','foo']))
'''
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo
'NA' and empty fields are both read as NaN.
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     NaN
With the extra markers passed in na_values, 'foo' becomes NaN as well.
'''
# Use different missing-value markers for different columns
print(pd.read_csv('data/ex5.csv',na_values={'message':['NA','NULL','foo'],'something':['one','two']}))
'''
  something  a   b     c   d message
0       NaN  1   2   3.0   4     NaN
1       NaN  5   6   NaN   8   world
2     three  9  10  11.0  12     NaN
'''
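
To check what a per-column na_values mapping actually marks as missing, counting the NaN values per column is a quick test. This is a sketch under the assumption that the ex5.csv content shown earlier is held in a string; the names `csv_text`, `sentinels` and `marked` are made up for illustration.

```python
import io
import pandas as pd

csv_text = ("something,a,b,c,d,message\n"
            "one,1,2,3,4,NA\n"
            "two,5,6,,8,world\n"
            "three,9,10,11,12,foo\n")

# Per-column markers; these are added to the default markers
# (such as 'NA' and the empty string), not used instead of them.
sentinels = {'message': ['foo'], 'something': ['two']}
marked = pd.read_csv(io.StringIO(csv_text), na_values=sentinels)

# Number of missing values in each column.
missing_per_column = marked.isna().sum()
print(missing_per_column)
```

The message column ends up with two missing values: one from the default 'NA' marker and one from the custom 'foo' marker.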


# Working with larger files
'''
one,two,three,four,key
0.467976300189,-0.0386485396255,-0.295344251987,-1.82472622729,L
-0.358893469543,1.40445260007,0.704964644926,-0.200638304015,B
(about 10,000 rows in total)
'''
# Goal: count how often each value occurs in the key column and print the top 10
# First, read only the first 10 rows
print(pd.read_csv('data/ex6.csv',nrows=10))
'''
        one       two     three      four key
0  0.467976 -0.038649 -0.295344 -1.824726   L
1 -0.358893  1.404453  0.704965 -0.200638   B
2 -0.501840  0.659254 -0.421691 -0.057688   G
3  0.204886  1.074134  1.388361 -0.982404   R
4  0.354628 -0.133116  0.283763 -0.837063   Q
5  1.817480  0.742273  0.419395 -2.251035   Q
6 -0.776764  0.935518 -0.332872 -1.875641   U
7 -0.913135  1.530624 -0.572657  0.477252   K
8  0.358480 -0.497572 -0.367016  0.507702   S
9 -1.740877 -1.160417 -1.637830  2.172201   G
'''
# Read in chunks
# 1000 rows per chunk
tr = pd.read_csv('data/ex6.csv',chunksize=1000)
print(type(tr))
'''
<class 'pandas.io.parsers.TextFileReader'>
'''
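# Iterate over the chunks with a for loop.
# An explicit dtype keeps the running total numeric across pandas versions.
result = pd.Series(dtype='float64')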
for chunk in tr:
    result = result.add(chunk['key'].value_counts(),fill_value=0)
print(result)
'''
0    151.0
1    146.0
2    152.0
3    162.0
4    171.0
5    157.0
6    166.0
7    164.0
8    162.0
9    150.0
A    320.0
B    302.0
C    286.0
D    320.0
E    368.0
F    335.0
G    308.0
H    330.0
I    327.0
J    337.0
K    334.0
L    346.0
M    338.0
N    306.0
O    343.0
P    324.0
Q    340.0
R    318.0
S    308.0
T    304.0
U    326.0
V    328.0
W    305.0
X    364.0
Y    314.0
Z    288.0
'''
print(result.sort_values(ascending=False)[:10])
'''
E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
This gives the 10 most frequent keys.
'''
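
The chunked counting pattern can be verified on a tiny data set whose totals are easy to work out by hand. The sketch below assumes a made-up 12-row table in memory rather than data/ex6.csv; `rows`, `buffer` and `totals` are illustrative names.

```python
import io
import pandas as pd

# Hypothetical 12-row data set: keys cycle through A, B, C, A.
rows = ["value,key"] + [f"{i},{'ABCA'[i % 4]}" for i in range(12)]
buffer = io.StringIO("\n".join(rows))

# Accumulate per-chunk counts; fill_value=0 treats a key that is
# absent from one side as a count of zero instead of NaN.
totals = pd.Series(dtype='float64')
for chunk in pd.read_csv(buffer, chunksize=5):
    totals = totals.add(chunk['key'].value_counts(), fill_value=0)

totals = totals.sort_values(ascending=False)
print(totals)
```

Only one chunk is held in memory at a time, so the same loop works for files that are too large to load in a single read_csv call.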

# Saving

df = pd.read_csv('data/ex5.csv')
print(df)
'''
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo
'''
df.to_csv('data/ex5_out.csv')
'''
,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo
Note that the original index was also written to the file, as an unnamed first column.
'''
df.to_csv('data/ex5_out.csv',index=False)
'''
something,a,b,c,d,message
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo
'''
# Omit the header row
df.to_csv('data/ex5_out.csv',index=False,header=False)

# Write only selected columns
df.to_csv('data/ex5_out.csv',index=False,header=False,columns=['b','c','message'])
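
The effect of these to_csv options is easy to inspect when no file is written. This sketch assumes a small hand-built DataFrame (`small`, a made-up name) and relies on to_csv returning the CSV text when it is called without a path; `na_rep` is an extra option, not used above, that controls how missing values are written.

```python
import io
import pandas as pd

small = pd.DataFrame({'b': [2, 6],
                      'c': [3.0, None],
                      'message': [None, 'world']})

# No path argument: to_csv returns the CSV text as a string.
# columns= selects and orders the columns; na_rep= replaces the
# default empty string for missing values.
text = small.to_csv(index=False, columns=['b', 'message'], na_rep='NULL')
print(text)

# Reading the text back restores the missing value.
restored = pd.read_csv(io.StringIO(text), na_values=['NULL'])
print(restored['message'].isna().tolist())
```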




 
