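Common Pandas Operations in Python
Pandas is an open-source data analysis library for Python that provides powerful data structures and operations on them. Its two main data structures are Series and DataFrame.
- Series: a one-dimensional array, similar to a one-dimensional NumPy array but with index labels. It can hold values of different types, such as strings, booleans, and numbers.
- DataFrame: a two-dimensional tabular structure, similar to a SQL table or an Excel worksheet. Each column can have its own data type (numeric, string, date, and so on), and the table has column names and a row index. DataFrame is the core data structure of Pandas and offers a rich set of data manipulation methods.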
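As a rough illustration of the two structures described above, the sketch below builds one of each. The values and the column names (name, age, joined) are made up for this example.

```python
import pandas as pd

# A Series is one-dimensional and labelled; mixed value types give dtype object.
mixed = pd.Series(["text", True, 3.14], index=["a", "b", "c"])
print(mixed.dtype)  # object

# A DataFrame is two-dimensional; each column keeps its own dtype.
frame = pd.DataFrame({
    "name": ["Tom", "Bob"],
    "age": [18, 30],
    "joined": pd.to_datetime(["2024-01-01", "2024-02-01"]),
})
print(frame.dtypes)
```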
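Series
Creating a Series
There are two main ways to create a Series: from a dictionary or from a list.
When created from a dictionary, the dictionary keys become the index of the Series by default.
When created from a list, the index is generated automatically, counting up from 0.
import pandas as pd

if __name__ == '__main__':
    # Create from a dictionary
    data = {'a': 0, 'b': 1, 'c': 2}
    print(pd.Series(data))
    # a    0
    # b    1
    # c    2
    # dtype: int64

    # Create from a list
    data = [18, 30, 25, 40]
    print(pd.Series(data))
    # 0    18
    # 1    30
    # 2    25
    # 3    40
    # dtype: int64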
Setting the index
The index of a Series can be replaced after creation, as long as the new index has the same length as the data.
import pandas as pd

if __name__ == '__main__':
    # Create from a list
    data = [18, 30, 25, 40]
    user_age = pd.Series(data)
    user_age.index = ["Tom", "Bob", "Mary", "James"]
    print(user_age)
    # Tom      18
    # Bob      30
    # Mary     25
    # James    40
    # dtype: int64
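The length requirement can be checked with a small sketch. It also shows that labels can be passed at construction time through the index argument; the variable names here are illustrative only.

```python
import pandas as pd

# Labels supplied directly when the Series is created.
labelled = pd.Series([18, 30, 25, 40], index=["Tom", "Bob", "Mary", "James"])

# Assigning three labels to four values is rejected with a ValueError.
user_age = pd.Series([18, 30, 25, 40])
try:
    user_age.index = ["Tom", "Bob", "Mary"]
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(mismatch_rejected)  # True
```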
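Adding data
Data can be appended with pd.concat. Older versions offered Series.append, but it was deprecated in pandas 1.4 and removed in 2.0, and _append is a private method that is best avoided. Appending does not modify the Series in place, so the result must be assigned back to take effect.
import pandas as pd

if __name__ == '__main__':
    # Create from a list
    data = [18, 30, 25, 40]
    user_age = pd.Series(data)
    user_age.index = ["Tom", "Bob", "Mary", "James"]
    # concat returns a new Series, so assign the result back
    user_age = pd.concat([user_age, pd.Series({"Looking": 100})])
    print(user_age)
    # Tom         18
    # Bob         30
    # Mary        25
    # James       40
    # Looking    100
    # dtype: int64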
A simpler alternative is to assign a value to a label that does not exist yet, which adds the new entry in place.
import pandas as pd

if __name__ == '__main__':
    # Create from a list
    data = [18, 30, 25, 40]
    user_age = pd.Series(data)
    user_age.index = ["Tom", "Bob", "Mary", "James"]
    user_age.at["Looking"] = 99
    print(user_age)
    # Tom        18
    # Bob        30
    # Mary       25
    # James      40
    # Looking    99
    # dtype: int64
Modifying data
Values can be modified directly through the index label. You can also use at or loc to locate a label, or iloc to address an element by its integer position (0, 1, 2, ...) regardless of the labels.
import pandas as pd

if __name__ == '__main__':
    # Create from a list
    data = [18, 30, 25, 40]
    user_age = pd.Series(data)
    user_age.index = ["Tom", "Bob", "Mary", "James"]
    user_age["Tom"] = 100
    user_age.at["Bob"] = 99
    user_age.loc["Mary"] = 98
    user_age.iloc[3] = 97  # modify by integer position
    print(user_age)
    # Tom      100
    # Bob       99
    # Mary      98
    # James     97
    # dtype: int64
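The difference between label-based and position-based access is easiest to see when the labels are themselves integers. The sketch below uses a hypothetical Series whose labels (10, 20, 30) differ from the positions (0, 1, 2).

```python
import pandas as pd

# Hypothetical data: integer labels that do not match the positions.
scores = pd.Series([18, 30, 25], index=[10, 20, 30])

scores.loc[10] = 100  # label 10 is the first element
scores.iloc[2] = 97   # position 2 is the third element (label 30)

print(scores.tolist())  # [100, 30, 97]
```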
Deleting data
Use the drop method to delete data. drop returns a new Series, so assign the result back.
import pandas as pd

if __name__ == '__main__':
    # Create from a list
    data = [18, 30, 25, 40]
    user_age = pd.Series(data)
    user_age.index = ["Tom", "Bob", "Mary", "James"]
    user_age = user_age.drop("Tom")
    print(user_age)
    # Bob      30
    # Mary     25
    # James    40
    # dtype: int64
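A short sketch of two further drop variants: passing a list of labels, and using errors="ignore" so that a missing label is skipped rather than raising KeyError. The label "Nobody" is made up for the example.

```python
import pandas as pd

user_age = pd.Series([18, 30, 25, 40], index=["Tom", "Bob", "Mary", "James"])

# Drop several labels at once; user_age itself is left untouched.
remaining = user_age.drop(["Tom", "Mary"])
print(list(remaining.index))  # ['Bob', 'James']
print(len(user_age))          # 4

# A label that does not exist is skipped when errors="ignore" is given.
unchanged = user_age.drop("Nobody", errors="ignore")
print(len(unchanged))         # 4
```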
Sorting data
A Series can be sorted by index (sort_index) or by value (sort_values), in ascending or descending order.
import pandas as pd

if __name__ == '__main__':
    # Create from a list
    data = [18, 30, 25, 40]
    user_age = pd.Series(data)
    user_age.index = ["Tom", "Bob", "Mary", "James"]
    # user_age = user_age.sort_index(ascending=True)
    user_age = user_age.sort_values(ascending=False)
    print(user_age)
    # James    40
    # Bob      30
    # Mary     25
    # Tom      18
    # dtype: int64
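For comparison, here is a sketch of sorting the same data by index label instead of by value; string labels are ordered alphabetically.

```python
import pandas as pd

user_age = pd.Series([18, 30, 25, 40], index=["Tom", "Bob", "Mary", "James"])

# Sort by index label, ascending.
by_name = user_age.sort_index(ascending=True)
print(list(by_name.index))  # ['Bob', 'James', 'Mary', 'Tom']
print(by_name.tolist())     # [30, 40, 25, 18]
```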
Querying data
A Series can largely be treated like a list; only a few operations are shown here. The dtype in the output is the data type of the values; when the values have mixed types, dtype is object.
import pandas as pd

if __name__ == '__main__':
    # Create from a list
    data = [18, 30, 25, 40]
    user_age = pd.Series(data)
    user_age.index = ["Tom", "Bob", "Mary", "James"]
    print(user_age[:2])
    # Tom    18
    # Bob    30
    # dtype: int64
    print(user_age[2:])
    # Mary     25
    # James    40
    # dtype: int64
    print(list(user_age.keys()))  # the list of index labels
    # ['Tom', 'Bob', 'Mary', 'James']
    print(user_age.values)
    # [18 30 25 40]
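Beyond slicing, a Series also supports lookup by label and filtering with a boolean condition. The sketch below shows both, plus one difference from lists: the in operator tests the index labels, not the values.

```python
import pandas as pd

user_age = pd.Series([18, 30, 25, 40], index=["Tom", "Bob", "Mary", "James"])

# Look up a single value by its label.
print(user_age["Mary"])    # 25

# Keep only the entries that satisfy a condition.
adults = user_age[user_age >= 30]
print(list(adults.index))  # ['Bob', 'James']

# Membership is checked against the index labels.
print("Tom" in user_age)   # True
print(18 in user_age)      # False
```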
Data operations
Arithmetic
Arithmetic between two Series is aligned on index labels. If a label is missing from one of the Series, the result for that label is NaN. You can pass fill_value to the operation; the missing side is then replaced by fill_value before the calculation.
import pandas as pd

if __name__ == '__main__':
    data = [18, 30, 25, 40]
    s1 = pd.Series(data)
    s1.index = ["Tom", "Bob", "Mary", "James"]
    s2 = pd.Series([1, 2, 3, 4], index=["Tom", "Bob", "Mary", "Looking"])
    print(s1.add(s2))
    # Bob        32.0
    # James       NaN
    # Looking     NaN
    # Mary       28.0
    # Tom        19.0
    # dtype: float64
    print(s1.sub(s2, fill_value=0))
    # Bob        28.0
    # James      40.0
    # Looking    -4.0
    # Mary       22.0
    # Tom        17.0
    # dtype: float64
    print(s1.mul(s2))
    # Bob        60.0
    # James       NaN
    # Looking     NaN
    # Mary       75.0
    # Tom        18.0
    # dtype: float64
    print(s1.div(s2))
    # Bob        15.000000
    # James            NaN
    # Looking          NaN
    # Mary        8.333333
    # Tom        18.000000
    # dtype: float64
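The ordinary operators follow the same alignment rule as the named methods; the methods are mainly needed when you want fill_value. A minimal sketch with the same two Series:

```python
import pandas as pd

s1 = pd.Series([18, 30, 25, 40], index=["Tom", "Bob", "Mary", "James"])
s2 = pd.Series([1, 2, 3, 4], index=["Tom", "Bob", "Mary", "Looking"])

# The + operator aligns on labels exactly like add().
total = s1 + s2
print(total.equals(s1.add(s2)))  # True

# With fill_value, a label present on only one side is treated as 0 on the other.
filled = s1.add(s2, fill_value=0)
print(filled["James"], filled["Looking"])  # 40.0 4.0
```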
Summary statistics
describe returns the common summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) in one call.
import pandas as pd

if __name__ == '__main__':
    data = [18, 30, 25, 40]
    s1 = pd.Series(data)
    s1.index = ["Tom", "Bob", "Mary", "James"]
    print(s1.describe())
    # count     4.000000
    # mean     28.250000
    # std       9.251126
    # min      18.000000
    # 25%      23.250000
    # 50%      27.500000
    # 75%      32.500000
    # max      40.000000
    # dtype: float64
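Each statistic is also available as its own method when only one value is needed. A small sketch on the same data:

```python
import pandas as pd

s1 = pd.Series([18, 30, 25, 40], index=["Tom", "Bob", "Mary", "James"])

print(s1.mean())          # 28.25
print(s1.median())        # 27.5
print(s1.quantile(0.25))  # 23.25
print(s1.idxmax())        # James (the label of the largest value)
```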
apply
apply calls a function on each value of the Series and returns a new Series. Its argument is a function, which makes it similar to Python's built-in map.
import pandas as pd


def apply_function(age):
    return age + 100


if __name__ == '__main__':
    data = [18, 30, 25, 40]
    s1 = pd.Series(data)
    s1.index = ["Tom", "Bob", "Mary", "James"]
    s2 = s1.apply(apply_function)
    print(s2)
    # Tom      118
    # Bob      130
    # Mary     125
    # James    140
    # dtype: int64
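For simple arithmetic like this, the same result can be obtained without apply by operating on the whole Series at once. The sketch below compares the two approaches; apply remains useful when the per-value logic cannot be expressed as a vectorized operation.

```python
import pandas as pd

s1 = pd.Series([18, 30, 25, 40], index=["Tom", "Bob", "Mary", "James"])

via_apply = s1.apply(lambda age: age + 100)  # function called per value
via_vector = s1 + 100                        # one vectorized operation

print(via_apply.equals(via_vector))  # True
```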
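DataFrame
Creating a DataFrame
A DataFrame also gets a default index of 0, 1, 2, ..., and a different index can be specified at creation. When a DataFrame is built from Series, the Series index should contain the same labels as the DataFrame index. The order does not have to match, because values are aligned by label automatically. In the example below the age Series lists Bob before Tom, so after alignment Tom gets 30 and Bob gets 18, while the plain city list is assigned by position.
import pandas as pd

if __name__ == '__main__':
    # Create from a dictionary
    index = pd.Index(data=["Tom", "Bob", "Mary", "James"])
    data = {
        "age": pd.Series([18, 30, 25, 40], index=["Bob", "Tom", "Mary", "James"]),
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
    }
    user_info = pd.DataFrame(data=data, index=index)
    print(user_info)
    #        age       city
    # Tom     30    BeiJing
    # Bob     18   ShangHai
    # Mary    25  GuangZhou
    # James   40   ShenZhen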
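Label alignment can be seen in isolation with a smaller sketch. Here the Series has no entry for Mary, so that row is filled with NaN and the column becomes float64; the data is made up for the example.

```python
import pandas as pd

# The Series lists Bob first and has no value for Mary.
age = pd.Series([18, 30], index=["Bob", "Tom"])
frame = pd.DataFrame({"age": age}, index=["Tom", "Bob", "Mary"])

print(frame.loc["Tom", "age"])        # 30.0
print(frame.loc["Bob", "age"])        # 18.0
print(frame["age"].isna().tolist())   # [False, False, True]
```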
Querying data
The values, index, and columns attributes expose the underlying data, the row labels, and the column names.
import pandas as pd

if __name__ == '__main__':
    # Create from a dictionary
    index = pd.Index(data=["Tom", "Bob", "Mary", "James"])
    data = {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
    }
    user_info = pd.DataFrame(data=data, index=index)
    print(user_info.values)
    # [[18 'BeiJing']
    #  [30 'ShangHai']
    #  [25 'GuangZhou']
    #  [40 'ShenZhen']]
    print(user_info.index)
    # Index(['Tom', 'Bob', 'Mary', 'James'], dtype='object')
    print(user_info.columns)
    # Index(['age', 'city'], dtype='object')
Swapping rows and columns
This can be thought of as a matrix transpose in mathematics.
import pandas as pd

if __name__ == '__main__':
    index = pd.Index(data=["Tom", "Bob", "Mary", "James"])
    data = {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
    }
    user_info = pd.DataFrame(data=data, index=index)
    print(user_info.T)
    #           Tom       Bob       Mary     James
    # age        18        30         25        40
    # city  BeiJing  ShangHai  GuangZhou  ShenZhen
Extracting data
Extracting columns
Passing a list of column names returns a DataFrame; passing a single column name returns a Series.
import pandas as pd

if __name__ == '__main__':
    index = pd.Index(data=["Tom", "Bob", "Mary", "James"])
    data = {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
    }
    user_info = pd.DataFrame(data=data, index=index)
    print(user_info[["city", "age"]])
    #             city  age
    # Tom      BeiJing   18
    # Bob     ShangHai   30
    # Mary   GuangZhou   25
    # James   ShenZhen   40
    print(user_info["city"])
    # Tom        BeiJing
    # Bob       ShangHai
    # Mary     GuangZhou
    # James     ShenZhen
    # Name: city, dtype: object
Extracting rows
loc selects rows by label. iloc selects rows by integer position, and an iloc slice excludes its end position.
import pandas as pd

if __name__ == '__main__':
    index = pd.Index(data=["Tom", "Bob", "Mary", "James"])
    data = {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
    }
    user_info = pd.DataFrame(data=data, index=index)
    print(user_info.loc["Tom"])
    # age          18
    # city    BeiJing
    # Name: Tom, dtype: object
    print(user_info.iloc[3])
    # age           40
    # city    ShenZhen
    # Name: James, dtype: object
    print(user_info.iloc[1:3])
    #       age       city
    # Bob    30   ShangHai
    # Mary   25  GuangZhou
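Slicing rows and columns
Use : to specify a range of rows and columns. This resembles list slicing, except that for a two-dimensional structure both the rows and the columns must be given. Unlike list slicing, a loc slice works on labels and includes both endpoints.
import pandas as pd

if __name__ == '__main__':
    index = pd.Index(data=["Tom", "Bob", "Mary", "James"])
    data = {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
    }
    user_info = pd.DataFrame(data=data, index=index)
    print(user_info.loc["Tom":"Mary", "age":"city"])
    #       age       city
    # Tom    18    BeiJing
    # Bob    30   ShangHai
    # Mary   25  GuangZhou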
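The same block of cells can be selected by position with iloc. The sketch below puts the two side by side: the label slice includes its end label, while the positional slice excludes its end position.

```python
import pandas as pd

user_info = pd.DataFrame(
    {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"],
    },
    index=["Tom", "Bob", "Mary", "James"],
)

by_label = user_info.loc["Tom":"Mary", "age":"city"]  # end labels included
by_position = user_info.iloc[0:3, 0:2]                # end positions excluded

print(by_label.equals(by_position))  # True
```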
Testing and filtering data
A comparison on a column yields a boolean Series, which can then be used to select the matching rows.
import pandas as pd

if __name__ == '__main__':
    index = pd.Index(data=["Tom", "Bob", "Mary", "James"])
    data = {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
    }
    user_info = pd.DataFrame(data=data, index=index)
    print(user_info["age"] > 20)  # effectively builds a filter
    # Tom      False
    # Bob       True
    # Mary      True
    # James     True
    # Name: age, dtype: bool
    print(user_info[user_info["age"] > 20])
    #        age       city
    # Bob     30   ShangHai
    # Mary    25  GuangZhou
    # James   40   ShenZhen
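Filters can be combined. A minimal sketch, assuming the same user_info data: conditions are joined with & (and) or | (or), each wrapped in parentheses, and isin tests membership in a list of values.

```python
import pandas as pd

user_info = pd.DataFrame(
    {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"],
    },
    index=["Tom", "Bob", "Mary", "James"],
)

# Both conditions must hold; the parentheses are required.
mask = (user_info["age"] > 20) & (user_info["city"] != "ShenZhen")
print(list(user_info[mask].index))  # ['Bob', 'Mary']

# Rows whose city appears in the given list.
in_cities = user_info[user_info["city"].isin(["BeiJing", "ShenZhen"])]
print(list(in_cities.index))        # ['Tom', 'James']
```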
Deleting data
Deleting rows
import pandas as pd

if __name__ == '__main__':
    index = pd.Index(data=["Tom", "Bob", "Mary", "James"])
    data = {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
    }
    user_info = pd.DataFrame(data=data, index=index)
    user_info = user_info.drop(["Tom"], axis=0)
    print(user_info)
    #        age       city
    # Bob     30   ShangHai
    # Mary    25  GuangZhou
    # James   40   ShenZhen
Deleting columns
import pandas as pd

if __name__ == '__main__':
    index = pd.Index(data=["Tom", "Bob", "Mary", "James"])
    data = {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
    }
    user_info = pd.DataFrame(data=data, index=index)
    user_info = user_info.drop(["age"], axis=1)
    print(user_info)
    #             city
    # Tom      BeiJing
    # Bob     ShangHai
    # Mary   GuangZhou
    # James   ShenZhen
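Instead of the axis argument, drop also accepts the index and columns keywords, which some readers find easier to follow. A brief sketch of the equivalent calls:

```python
import pandas as pd

user_info = pd.DataFrame(
    {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"],
    },
    index=["Tom", "Bob", "Mary", "James"],
)

without_tom = user_info.drop(index=["Tom"])    # same as drop(["Tom"], axis=0)
without_age = user_info.drop(columns=["age"])  # same as drop(["age"], axis=1)

print(list(without_tom.index))    # ['Bob', 'Mary', 'James']
print(list(without_age.columns))  # ['city']
```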
Data operations
apply
This works much like apply on a Series and can operate on a single column, for example to format values or to derive a new column through a simple transformation.
import pandas as pd

if __name__ == '__main__':
    index = pd.Index(data=["Tom", "Bob", "Mary", "James"])
    data = {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
    }
    user_info = pd.DataFrame(data=data, index=index)
    user_info["new_age"] = user_info["age"].apply(lambda item: item + 100)
    print(user_info)
    #        age       city  new_age
    # Tom     18    BeiJing      118
    # Bob     30   ShangHai      130
    # Mary    25  GuangZhou      125
    # James   40   ShenZhen      140
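When the new value depends on more than one column, apply can be called on the DataFrame itself with axis=1, in which case the function receives one row at a time. A sketch, where the label column is a made-up name for the example:

```python
import pandas as pd

user_info = pd.DataFrame(
    {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"],
    },
    index=["Tom", "Bob", "Mary", "James"],
)

# axis=1 passes each row to the function as a Series keyed by column name.
user_info["label"] = user_info.apply(
    lambda row: f"{row['city']}-{row['age']}", axis=1
)
print(user_info.loc["Tom", "label"])  # BeiJing-18
```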
concat
Similar to UNION ALL in a database: the rows of the inputs are stacked, and duplicates are kept.
import pandas as pd

if __name__ == '__main__':
    index = pd.Index(data=["Tom", "Bob", "Mary", "James"])
    data = {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
    }
    s1 = pd.DataFrame(data=data, index=index)
    index = pd.Index(data=["Looking", "Sandra"])
    data = {
        "age": [10, 20],
        "city": ["ChongQing", "XiAn"]
    }
    s2 = pd.DataFrame(data=data, index=index)
    s = pd.concat([s1, s2])
    print(s)
    #          age       city
    # Tom       18    BeiJing
    # Bob       30   ShangHai
    # Mary      25  GuangZhou
    # James     40   ShenZhen
    # Looking   10  ChongQing
    # Sandra    20       XiAn
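Two details of concat are worth a quick sketch: identical rows are not removed unless drop_duplicates is called afterwards, and ignore_index=True replaces the original labels with a fresh 0, 1, 2, ... index. The two one-row frames below are made up for the example.

```python
import pandas as pd

a = pd.DataFrame({"age": [18], "city": ["BeiJing"]}, index=["Tom"])
b = pd.DataFrame({"age": [18], "city": ["BeiJing"]}, index=["Tom"])

stacked = pd.concat([a, b])
print(len(stacked))                    # 2, the duplicate row is kept
print(len(stacked.drop_duplicates()))  # 1

renumbered = pd.concat([a, b], ignore_index=True)
print(list(renumbered.index))          # [0, 1]
```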
merge
Similar to a JOIN in a database. By default merge performs an inner join on the given key. Columns that share a name on both sides receive the suffixes _x and _y, and the original row index is discarded.
import pandas as pd

if __name__ == '__main__':
    index = pd.Index(data=["Tom", "Bob", "Mary", "James"])
    data = {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
    }
    s1 = pd.DataFrame(data=data, index=index)
    index = pd.Index(data=["Looking", "Sandra"])
    data = {
        "age": [10, 20],
        "city": ["ShangHai", "GuangZhou"]
    }
    s2 = pd.DataFrame(data=data, index=index)
    s = pd.merge(s1, s2, on="city")
    print(s)
    #    age_x       city  age_y
    # 0     30   ShangHai     10
    # 1     25  GuangZhou     20
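Other join types are selected with the how parameter, and the suffixes can be customized. The sketch below runs a left join on the same data; the variable names and the suffixes _left and _right are chosen only for this example.

```python
import pandas as pd

left = pd.DataFrame(
    {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"],
    },
    index=["Tom", "Bob", "Mary", "James"],
)
right = pd.DataFrame(
    {"age": [10, 20], "city": ["ShangHai", "GuangZhou"]},
    index=["Looking", "Sandra"],
)

# A left join keeps every row of left; unmatched rows get NaN on the right side.
joined = pd.merge(left, right, on="city", how="left", suffixes=("_left", "_right"))

print(list(joined.columns))                # ['age_left', 'city', 'age_right']
print(len(joined))                         # 4
print(int(joined["age_right"].isna().sum()))  # 2
```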
Generating JSON
to_json serializes a DataFrame to a JSON string; the orient parameter controls the layout.
import pandas as pd

if __name__ == '__main__':
    index = pd.Index(data=["Tom", "Bob", "Mary", "James"])
    data = {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
    }
    s1 = pd.DataFrame(data=data, index=index)
    print(s1.to_json(orient="records", indent=2))  # each row becomes an object; the objects form an array
    # [
    #   {
    #     "age":18,
    #     "city":"BeiJing"
    #   },
    #   {
    #     "age":30,
    #     "city":"ShangHai"
    #   },
    #   {
    #     "age":25,
    #     "city":"GuangZhou"
    #   },
    #   {
    #     "age":40,
    #     "city":"ShenZhen"
    #   }
    # ]
    print(s1.to_json(orient="split", indent=2))  # columns, index, and data are returned separately
    # {
    #   "columns":[
    #     "age",
    #     "city"
    #   ],
    #   "index":[
    #     "Tom",
    #     "Bob",
    #     "Mary",
    #     "James"
    #   ],
    #   "data":[
    #     [
    #       18,
    #       "BeiJing"
    #     ],
    #     [
    #       30,
    #       "ShangHai"
    #     ],
    #     [
    #       25,
    #       "GuangZhou"
    #     ],
    #     [
    #       40,
    #       "ShenZhen"
    #     ]
    #   ]
    # }
    print(s1.to_json(orient="values", indent=2))  # only the values, as a two-dimensional array
    # [
    #   [
    #     18,
    #     "BeiJing"
    #   ],
    #   [
    #     30,
    #     "ShangHai"
    #   ],
    #   [
    #     25,
    #     "GuangZhou"
    #   ],
    #   [
    #     40,
    #     "ShenZhen"
    #   ]
    # ]
    print(s1.to_json(orient="index", indent=2))  # index labels are the top-level keys, column names the second-level keys
    # {
    #   "Tom":{
    #     "age":18,
    #     "city":"BeiJing"
    #   },
    #   "Bob":{
    #     "age":30,
    #     "city":"ShangHai"
    #   },
    #   "Mary":{
    #     "age":25,
    #     "city":"GuangZhou"
    #   },
    #   "James":{
    #     "age":40,
    #     "city":"ShenZhen"
    #   }
    # }
    print(s1.to_json(orient="columns", indent=2))  # column names are the top-level keys, index labels the second-level keys
    # {
    #   "age":{
    #     "Tom":18,
    #     "Bob":30,
    #     "Mary":25,
    #     "James":40
    #   },
    #   "city":{
    #     "Tom":"BeiJing",
    #     "Bob":"ShangHai",
    #     "Mary":"GuangZhou",
    #     "James":"ShenZhen"
    #   }
    # }
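The generated strings are ordinary JSON, so they can be parsed back with the standard json module. The sketch below is one possible round trip, not the only one: it reads the records layout into a list of dictionaries and rebuilds a DataFrame from the split layout, whose three keys happen to match the data, index, and columns arguments of the DataFrame constructor.

```python
import json

import pandas as pd

s1 = pd.DataFrame(
    {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"],
    },
    index=["Tom", "Bob", "Mary", "James"],
)

# records: a list with one dictionary per row (the index is not included).
records = json.loads(s1.to_json(orient="records"))
print(records[0])  # {'age': 18, 'city': 'BeiJing'}

# split: the keys "columns", "index", and "data" can be passed as keyword arguments.
restored = pd.DataFrame(**json.loads(s1.to_json(orient="split")))
print(restored.loc["Mary", "city"])  # GuangZhou
```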