Pandas: How to easily share a sample dataframe using df.to_dict()?

Date: 2023-09-29


Question

Despite the clear guidance in How do I ask a good question? and How to create a Minimal, Reproducible Example, many people simply neglect to include a reproducible data sample in their questions. So what is a practical and easy way to reproduce a data sample when a simple pd.DataFrame(np.random.random(size=(5, 5))) is not enough? How can you, for example, use df.to_dict() and include the output in a question?

The answer:

In many situations, an approach based on df.to_dict() will do the job perfectly! Here are two cases that come to mind:

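Case 1: You've built or loaded a dataframe in Python from a local source

Case 2: You've got a table in another application (like Excel)

Case 1: You've built or loaded a dataframe from a local source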

Given that you've got a pandas dataframe named df, just

1. run df.to_dict() in your console or editor,
2. copy the output, which is formatted as a dictionary, and
3. paste the content into pd.DataFrame(<output>) and include that chunk in your now reproducible code snippet.
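The three steps above can be sketched as a round trip. The dataframe below is made up for illustration (the column names and values are not from the original answer); the point is only that the pasted dictionary rebuilds the same data.

```python
import pandas as pd

# Made-up dataframe standing in for your real data
df = pd.DataFrame({'city': ['Oslo', 'Bergen', 'Tromso'],
                   'temp': [4.5, 7.1, -1.0]})

# Step 1: this is what you would run, then copy from the console
as_dict = df.to_dict()
print(as_dict)

# Steps 2-3: the copied dict, pasted into pd.DataFrame(), rebuilds the data
rebuilt = pd.DataFrame({'city': {0: 'Oslo', 1: 'Bergen', 2: 'Tromso'},
                        'temp': {0: 4.5, 1: 7.1, 2: -1.0}})

# Compare the rebuilt frame with the original
print(rebuilt.equals(df))
```

Anyone reading the question only needs the second half of the snippet (the pd.DataFrame(...) call) to get the same data.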

Case 2: You've got a table in another application (like Excel)

Depending on the source and its separator (for example ',', ';' or '\s+', where the last one matches any run of whitespace), you can simply:

1. Ctrl+C the contents,
2. run df=pd.read_clipboard(sep='\s+') in your console or editor,
3. run df.to_dict(), and
4. include the output in df=pd.DataFrame(<output>)

In this case, the start of your question would look something like this:

import pandas as pd
df = pd.DataFrame({0: {0: 0.25474768796402636, 1: 0.5792136563952824, 2: 0.5950396800676201},
                   1: {0: 0.9071073567355232, 1: 0.1657288354283053, 2: 0.4962367707789421},
                   2: {0: 0.7440601352930207, 1: 0.7755487356392468, 2: 0.5230707257648775}})
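The clipboard steps for Case 2 are hard to demonstrate without a real clipboard, so here is a rough stand-in. pd.read_clipboard reads the clipboard text and hands it to pd.read_csv, so the sketch parses a made-up string with pd.read_csv and the same separator. The table contents are invented for illustration, and a raw string (r'\s+') is used to avoid Python's warning about the \s escape.

```python
import io
import pandas as pd

# Made-up text standing in for whitespace-separated cells copied from a spreadsheet
copied = """name score passed
alice 9.5 True
bob 7.0 False
"""

# Stand-in for pd.read_clipboard(sep=r'\s+'), which needs a real clipboard
df = pd.read_csv(io.StringIO(copied), sep=r'\s+')

# This dict is what you would paste into pd.DataFrame(<output>)
print(df.to_dict())
```

If the copied cells themselves contain spaces, a whitespace separator will split them, so a tab separator (sep='\t') may be the safer choice for data copied from Excel.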

Of course, this gets a little clumsy with larger dataframes. But very often, all that anyone trying to answer your question needs is a small sample of your real-world data, so that they can take the structure of your data into consideration.

There are two ways to handle larger dataframes:

1. run df.head(20).to_dict() to only include the first 20 rows, and
2. change the format of your dict using, for example, df.to_dict('split') (there are other options besides 'split') to reshape your output into a dict that requires fewer lines.
Here's an example using the iris dataset, which is available from plotly express, among other places.

If you just run:

import plotly.express as px
import pandas as pd
df = px.data.iris()
df.to_dict()

This will produce nearly 1000 lines of output, which isn't very practical as a reproducible sample. But if you run df.head(10).to_dict() instead, you'll get:

{'sepal_length': {0: 5.1, 1: 4.9, 2: 4.7, 3: 4.6, 4: 5.0, 5: 5.4, 6: 4.6, 7: 5.0, 8: 4.4, 9: 4.9},
 'sepal_width': {0: 3.5, 1: 3.0, 2: 3.2, 3: 3.1, 4: 3.6, 5: 3.9, 6: 3.4, 7: 3.4, 8: 2.9, 9: 3.1},
 'petal_length': {0: 1.4, 1: 1.4, 2: 1.3, 3: 1.5, 4: 1.4, 5: 1.7, 6: 1.4, 7: 1.5, 8: 1.4, 9: 1.5},
 'petal_width': {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.4, 6: 0.3, 7: 0.2, 8: 0.2, 9: 0.1},
 'species': {0: 'setosa', 1: 'setosa', 2: 'setosa', 3: 'setosa', 4: 'setosa', 5: 'setosa', 6: 'setosa', 7: 'setosa', 8: 'setosa', 9: 'setosa'},
 'species_id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1}}

And now we're getting somewhere. But depending on the structure and content of the data, ten rows may not cover the complexity of the contents in a satisfactory manner. You can fit more data on fewer lines by using to_dict('split') like this:

import plotly.express as px
df = px.data.iris().head(10)
df.to_dict('split')

Now your output will look like:

{'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 'columns': ['sepal_length',
  'sepal_width',
  'petal_length',
  'petal_width',
  'species',
  'species_id'],
 'data': [[5.1, 3.5, 1.4, 0.2, 'setosa', 1],
  [4.9, 3.0, 1.4, 0.2, 'setosa', 1],
  [4.7, 3.2, 1.3, 0.2, 'setosa', 1],
  [4.6, 3.1, 1.5, 0.2, 'setosa', 1],
  [5.0, 3.6, 1.4, 0.2, 'setosa', 1],
  [5.4, 3.9, 1.7, 0.4, 'setosa', 1],
  [4.6, 3.4, 1.4, 0.3, 'setosa', 1],
  [5.0, 3.4, 1.5, 0.2, 'setosa', 1],
  [4.4, 2.9, 1.4, 0.2, 'setosa', 1],
  [4.9, 3.1, 1.5, 0.1, 'setosa', 1]]}

And now you can easily increase the number in .head(10) without cluttering your question too much. But there's one minor drawback: you can no longer pass the output directly to pd.DataFrame. If you specify index, columns, and data explicitly, though, you'll be just fine. So for this particular dataset, my preferred approach would be:

import pandas as pd

sample = {'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
             'columns': ['sepal_length',
              'sepal_width',
              'petal_length',
              'petal_width',
              'species',
              'species_id'],
             'data': [[5.1, 3.5, 1.4, 0.2, 'setosa', 1],
              [4.9, 3.0, 1.4, 0.2, 'setosa', 1],
              [4.7, 3.2, 1.3, 0.2, 'setosa', 1],
              [4.6, 3.1, 1.5, 0.2, 'setosa', 1],
              [5.0, 3.6, 1.4, 0.2, 'setosa', 1],
              [5.4, 3.9, 1.7, 0.4, 'setosa', 1],
              [4.6, 3.4, 1.4, 0.3, 'setosa', 1],
              [5.0, 3.4, 1.5, 0.2, 'setosa', 1],
              [4.4, 2.9, 1.4, 0.2, 'setosa', 1],
              [4.9, 3.1, 1.5, 0.1, 'setosa', 1],
              [5.4, 3.7, 1.5, 0.2, 'setosa', 1],
              [4.8, 3.4, 1.6, 0.2, 'setosa', 1],
              [4.8, 3.0, 1.4, 0.1, 'setosa', 1],
              [4.3, 3.0, 1.1, 0.1, 'setosa', 1],
              [5.8, 4.0, 1.2, 0.2, 'setosa', 1]]}

df = pd.DataFrame(index=sample['index'], columns=sample['columns'], data=sample['data'])
df

Now you'll have this dataframe to work with:

    sepal_length  sepal_width  petal_length  petal_width species  species_id
0            5.1          3.5           1.4          0.2  setosa           1
1            4.9          3.0           1.4          0.2  setosa           1
2            4.7          3.2           1.3          0.2  setosa           1
3            4.6          3.1           1.5          0.2  setosa           1
4            5.0          3.6           1.4          0.2  setosa           1
5            5.4          3.9           1.7          0.4  setosa           1
6            4.6          3.4           1.4          0.3  setosa           1
7            5.0          3.4           1.5          0.2  setosa           1
8            4.4          2.9           1.4          0.2  setosa           1
9            4.9          3.1           1.5          0.1  setosa           1
10           5.4          3.7           1.5          0.2  setosa           1
11           4.8          3.4           1.6          0.2  setosa           1
12           4.8          3.0           1.4          0.1  setosa           1
13           4.3          3.0           1.1          0.1  setosa           1
14           5.8          4.0           1.2          0.2  setosa           1

This will significantly increase your chances of receiving useful answers!
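As a side note on rebuilding from a 'split' dict: its three keys ('index', 'columns', 'data') happen to match keyword arguments of pd.DataFrame, so the dict can also be unpacked in one go. This is a shorthand sketch, not the approach used in the answer, and the dataframe is made up for illustration.

```python
import pandas as pd

# Made-up dataframe; in a question this would be your own data
df = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c']})

# The 'split' orientation returns a dict with keys 'index', 'columns', 'data'
sample = df.to_dict('split')

# Unpack the three keys as keyword arguments to pd.DataFrame
rebuilt = pd.DataFrame(**sample)
print(rebuilt.equals(df))
```

This only works when the dict contains exactly those three keys, which is what to_dict('split') returns by default.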

Note: if your dataframe contains datetimes, the output of df.to_dict() will include entries like 1: Timestamp('2020-01-02 00:00:00'). Pasting that into pd.DataFrame() raises a NameError unless the snippet also includes from pandas import Timestamp.
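A minimal sketch of a pasted snippet that handles timestamps is shown below. The dictionary is a hypothetical df.to_dict() output with invented column names; the extra import is what lets the Timestamp(...) calls resolve.

```python
import pandas as pd
from pandas import Timestamp  # needed so the pasted Timestamp(...) calls resolve

# Hypothetical pasted output of df.to_dict() for a frame with a datetime column
df = pd.DataFrame({'date': {0: Timestamp('2020-01-01 00:00:00'),
                            1: Timestamp('2020-01-02 00:00:00')},
                   'value': {0: 1.5, 1: 2.5}})

# The 'date' column comes back as a datetime column
print(df.dtypes)
```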
