[Pandas] pandas DataFrame sample

news 2025/8/30 12:06:22

Pandas 2.2 DataFrame

Reindexing / selection / label manipulation

Method	Description
DataFrame.add_prefix(prefix[, axis])	Prepend a prefix to the row labels or column labels
DataFrame.add_suffix(suffix[, axis])	Append a suffix to the row labels or column labels
DataFrame.align(other[, join, axis, level, …])	Align two `DataFrame` or `Series` objects on their axes
DataFrame.at_time(time[, asof, axis])	Select rows at a particular time of day
DataFrame.between_time(start_time, end_time)	Select rows whose time of day falls within a given range
DataFrame.drop([labels, axis, index, …])	Drop the specified rows or columns from a `DataFrame`
DataFrame.drop_duplicates([subset, keep, …])	Remove duplicate rows
DataFrame.duplicated([subset, keep])	Flag duplicate rows
DataFrame.equals(other)	Test whether two `DataFrame` objects contain the same elements
DataFrame.filter([items, like, regex, axis])	Subset rows or columns by their labels
DataFrame.first(offset)	Select the initial period of a time-series DataFrame based on a date offset (deprecated since pandas 2.1)
DataFrame.head([n])	Return the first `n` rows of a `DataFrame`
DataFrame.idxmax([axis, skipna, numeric_only])	Return the index label of the maximum value in each column or row
DataFrame.idxmin([axis, skipna, numeric_only])	Return the index label of the minimum value in each column or row
DataFrame.last(offset)	Select the final period of a time-series DataFrame based on a date offset (deprecated since pandas 2.1)
DataFrame.reindex([labels, index, columns, …])	Conform a DataFrame to a new index and/or new columns
DataFrame.reindex_like(other[, method, …])	Reindex a DataFrame so that its index and columns match those of another object (such as another DataFrame or Series)
DataFrame.rename([mapper, index, columns, …])	Rename row index labels or column names
DataFrame.rename_axis([mapper, index, …])	Set the name of the index axis or the column axis
DataFrame.reset_index([level, drop, …])	Reset the index to the default integer index; the old index is inserted as a column unless `drop=True`
DataFrame.sample([n, frac, replace, …])	Return a random sample of rows or columns from a DataFrame

`pandas.DataFrame.sample()`

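pandas.DataFrame.sample() returns a random sample of rows or columns from a DataFrame. You can sample by count (n) or by fraction (frac), with or without replacement, and optionally with per-item weights. Typical uses include exploratory analysis, data cleaning, and splitting data before model training.

📌 Method signature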

DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)

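🔧 Parameters

Parameter	Type	Description
`n`	int, optional	Number of items to return. Cannot be combined with `frac`; defaults to 1 when `frac` is also `None`
`frac`	float, optional	Fraction of the items on the axis to return (e.g. `0.5` returns 50% of the data). Cannot be combined with `n`
`replace`	`bool`, default `False`	Whether to sample with replacement (`True` allows the same item to be drawn more than once)
`weights`	str or array-like, optional	Sampling weights for each row/column; equal probability when omitted. A str names a column to use as weights (row sampling only). Weights that do not sum to 1 are normalized
`random_state`	int, array-like, `BitGenerator`, `numpy.random.RandomState` or `numpy.random.Generator`, optional	Seed or random number generator, used to make results reproducible
`axis`	`{0/'index', 1/'columns', None}`, default `None`	Whether to sample rows or columns; `None` means rows (axis 0) for a DataFrame
`ignore_index`	`bool`, default `False`	If `True`, the result is relabeled 0, 1, …, n - 1 instead of keeping the original labels

⚠️ n and frac cannot be used together.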
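
The restriction above can be checked directly. The following is a minimal sketch, assuming pandas 2.x; the small frame and the variable names `both_allowed` and `default_rows` are illustrative only and do not come from the original article.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]})

# Passing both n and frac is rejected before any sampling happens.
try:
    df.sample(n=2, frac=0.5)
    both_allowed = True
except ValueError as exc:
    both_allowed = False
    print(f"ValueError: {exc}")

# With neither argument, sample() falls back to a single row.
default_rows = len(df.sample(random_state=0))
print(default_rows)  # 1
```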

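✅ Return value

Returns a new DataFrame containing the randomly sampled items;
sample() has no inplace parameter, so the result must be assigned to a variable;
the original index labels are kept unless ignore_index=True is set.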
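
A short sketch of these three points, assuming pandas 2.x. The seed value 0 and the variable names are arbitrary choices made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]},
                  index=['x', 'y', 'z', 'w'])

sampled = df.sample(n=2, random_state=0)

# The source frame is left untouched; the sample is a separate object.
print(df.shape)  # (4, 2)

# Sampled rows keep their original labels by default ...
print(set(sampled.index) <= set(df.index))  # True

# ... unless ignore_index=True, which relabels them 0..n-1.
renumbered = df.sample(n=2, random_state=0, ignore_index=True)
print(list(renumbered.index))  # [0, 1]
```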

🧪 Examples and output

Example 1: Basic usage: randomly sample 2 rows

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [10, 20, 30, 40]
}, index=['x', 'y', 'z', 'w'])

print("Original DataFrame:")
print(df)

# Randomly sample 2 rows
sampled = df.sample(n=2, random_state=42)
print("\nRandomly sampled 2 rows:")
print(sampled)

Output:

Original DataFrame:
   A   B
x  1  10
y  2  20
z  3  30
w  4  40

Randomly sampled 2 rows:
   A   B
y  2  20
w  4  40

Setting random_state=42 makes the result identical on every run.
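
Reproducibility can be verified by sampling twice with the same seed. This is a minimal sketch on a hypothetical 100-row frame; passing a `numpy.random.Generator` is assumed to require pandas 1.4 or later.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(100)})

# Same integer seed -> identical samples on every call.
first = df.sample(n=5, random_state=42)
second = df.sample(n=5, random_state=42)
print(first.equals(second))  # True

# A NumPy Generator is also accepted; two generators built from the
# same seed produce the same draw.
g1 = df.sample(n=5, random_state=np.random.default_rng(7))
g2 = df.sample(n=5, random_state=np.random.default_rng(7))
print(g1.equals(g2))  # True
```

Note that a single Generator instance advances its state with every call, so reusing one instance across calls gives different samples each time.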

Example 2: Sampling by fraction (frac=0.5)

# Sample 50% of the rows
sampled_frac = df.sample(frac=0.5, random_state=42)
print("\nSampled 50% of the rows:")
print(sampled_frac)

Output:

Sampled 50% of the rows:
   A   B
y  2  20
w  4  40
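
How frac maps to a row count can be seen on a slightly larger frame. The sketch below uses a hypothetical 10-row frame and an arbitrary seed; it relies on pandas rounding `frac * len(df)` to the nearest integer.

```python
import pandas as pd

df10 = pd.DataFrame({'A': range(10)})

# frac is converted to a row count: 30% of 10 rows -> 3 rows.
n_small = len(df10.sample(frac=0.3, random_state=0))
print(n_small)  # 3

# frac=1 returns every row exactly once, in shuffled order.
shuffled = df10.sample(frac=1, random_state=0)
print(sorted(shuffled['A']) == list(range(10)))  # True

# frac > 1 asks for more rows than exist, so it needs replace=True.
upsampled = df10.sample(frac=1.5, replace=True, random_state=0)
print(len(upsampled))  # 15
```

The `frac=1` form is a common idiom for shuffling a whole DataFrame.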

Example 3: Sampling with replacement (replace=True)

# Draw 5 rows from a 4-row DataFrame (repeats must be allowed)
sampled_replace = df.sample(n=5, replace=True, random_state=42)
print("\nSampled with replacement (n=5):")
print(sampled_replace)

Output:

Sampled with replacement (n=5):
   A   B
z  3  30
w  4  40
x  1  10
z  3  30
z  3  30

Note that some rows appear more than once.
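
To see why replace=True is required here, the following sketch contrasts the two modes on a hypothetical 4-row frame. The names `oversample_ok` and `boot` are illustrative.

```python
import pandas as pd

df4 = pd.DataFrame({'A': [1, 2, 3, 4]})

# Without replacement, asking for more rows than exist fails.
try:
    df4.sample(n=5)
    oversample_ok = True
except ValueError as exc:
    oversample_ok = False
    print(f"ValueError: {exc}")

# With replacement the same request succeeds, and since 5 draws come
# from only 4 rows, at least one label must repeat.
boot = df4.sample(n=5, replace=True, random_state=0)
print(len(boot), boot.index.has_duplicates)  # 5 True
```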

Example 4: Weighted sampling (the weights parameter)

# Give the last row a much higher weight than the others
sampled_weighted = df.sample(n=2, weights=[1, 1, 1, 10], random_state=42)
print("\nWeighted sampling (last row has higher weight):")
print(sampled_weighted)

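Output:

Weighted sampling (last row has higher weight):
   A   B
w  4  40
z  3  30

The last row has the highest weight (10 out of a total of 13), so it is the most likely to be selected. Because replace defaults to False, the two sampled rows are always distinct, so w can appear at most once.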
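
Weights can also be taken from a column by name, and rows with zero weight are never drawn. The sketch below assumes a hypothetical weight column called `w`; the data is made up so that the outcome does not depend on the seed.

```python
import pandas as pd

items = pd.DataFrame({
    'item': ['a', 'b', 'c', 'd'],
    'w': [0, 0, 1, 3],   # hypothetical weight column
})

# Passing a column name uses that column as the weights (row sampling only).
picked = items.sample(n=2, weights='w', random_state=0)

# Rows with weight 0 can never be drawn, so only 'c' and 'd' are eligible.
print(sorted(picked['item']))  # ['c', 'd']
```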

Example 5: Sampling columns (axis=1)

# Randomly sample 1 column
sampled_col = df.sample(n=1, axis=1, random_state=42)
print("\nRandomly sampled 1 column:")
print(sampled_col)

Output:

Randomly sampled 1 column:
    B
x  10
y  20
z  30
w  40

Example 6: Discarding the original index (ignore_index=True)

# Sample and reset the index
sampled_ignore = df.sample(n=2, ignore_index=True, random_state=42)
print("\nSampled and reset index:")
print(sampled_ignore)

Output:

Sampled and reset index:
   A   B
0  2  20
1  4  40

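🧠 Use cases

Data exploration: take a quick look at a random subset of the data;
Splitting data before model training: randomly select training and validation sets;
Resampling: draw additional rows by sampling with replacement (for example, bootstrap resampling);
Test scripts: debug against a small sample of the data;
Statistical analysis: sample surveys or Monte Carlo simulation.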
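
The train/validation split mentioned above can be sketched as follows. This is one simple approach rather than the only one, and it assumes the index labels are unique; the 80/20 ratio, column names, and seed are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'feature': range(10), 'label': [0, 1] * 5})

# Draw 80% of the rows for training ...
train = data.sample(frac=0.8, random_state=42)

# ... and keep the rest for validation by dropping the sampled labels.
valid = data.drop(train.index)

print(len(train), len(valid))  # 8 2
```

Dropping by label only works as a complement when the index has no duplicates, so call reset_index(drop=True) first if it does.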

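⚠️ Notes

n and frac cannot be used together;
sampling more items than exist (n larger than the axis length, or frac > 1) requires replace=True;
use random_state to make results reproducible;
rows or columns can be sampled (via axis);
the original index is kept by default and can be reset with ignore_index=True;
for weighted sampling, weights must be non-negative and must not all be zero, otherwise a ValueError is raised; weights that do not sum to 1 are normalized automatically.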
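
The weight rules can be probed with a small helper. This is a sketch assuming pandas 2.x behavior; `try_weights` is a hypothetical helper written for this illustration, not part of pandas.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3, 4]})

def try_weights(weights):
    """Return the error message if sample() rejects the weights, else None."""
    try:
        frame.sample(n=1, weights=weights, random_state=0)
        return None
    except ValueError as exc:
        return str(exc)

all_zero = try_weights([0, 0, 0, 0])     # rejected: weights sum to zero
negative = try_weights([1, -1, 1, 1])    # rejected: negative weight
wrong_len = try_weights([1, 1, 1])       # rejected: length does not match the axis
unnormalized = try_weights([2, 2, 2, 2]) # accepted: normalized internally

for name, msg in [('all_zero', all_zero), ('negative', negative),
                  ('wrong_len', wrong_len), ('unnormalized', unnormalized)]:
    print(name, '->', msg)
```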
