Documentation | Slack | Stack Overflow | Latest changelog
Generates profile reports from a pandas DataFrame.
The pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.
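As a point of reference, the sketch below shows what df.describe() returns on its own for a small mixed-type frame. The column names and values are made up for illustration.

```python
import pandas as pd

# Made-up data: one numeric column and one text column.
df = pd.DataFrame(
    {"price": [1.0, 2.5, 4.0, 2.5], "city": ["Oslo", "Rome", "Oslo", "Lima"]}
)

# By default, describe() summarizes only the numeric columns.
numeric_summary = df.describe()

# include="all" adds count/unique/top/freq for the non-numeric columns.
full_summary = df.describe(include="all")

print(numeric_summary)
print(full_summary)
```

Anything beyond these summary rows has to be computed separately with plain pandas, which is the gap the profile report is meant to fill.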
For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:
Spark backend in progress: We can happily announce that we're nearing v1 for the Spark backend for generating profile reports. Beta testers wanted! The Spark backend will be released as a pre-release for this package.
Monitoring time series?: I'd like to draw your attention to popmon. Whereas pandas-profiling allows you to explore patterns in a single dataset, popmon allows you to uncover temporal patterns. It's worth checking out!
Support pandas-profiling
The development of pandas-profiling relies completely on contributions. If you find value in the package, we welcome you to support the project directly through GitHub Sponsors! Please help me to continue to support this package. Find more information: Sponsor the project on GitHub
Contents: Examples | Installation | Documentation | Large datasets | Command line usage | Advanced usage | Support | Go beyond | Support the project | Types | How to contribute | Editor Integration | Dependencies
The following examples can give you an impression of what the package can do:
Specific features:
Tutorials:
You can install using the pip package manager by running
pip install pandas-profiling[notebook]
Alternatively, you could install the latest version directly from GitHub:
pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
You can install using the conda package manager by running
conda install -c conda-forge pandas-profiling
Download the source code by cloning the repository or by pressing 'Download ZIP' on this page.
Install by navigating to the proper directory and running:
python setup.py install
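After any of the installation routes above, a quick check that the package is visible to the current interpreter can save some confusion. This is a minimal sketch using only the standard library (Python 3.8+); the helper name installed_version is hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "pandas-profiling" is the distribution name used in the pip command above.
version = installed_version("pandas-profiling")
print(version if version else "pandas-profiling is not installed")
```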
The documentation for pandas_profiling can be found here. Previous documentation is still available here.
Start by loading in your pandas DataFrame, e.g. by using:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
df = pd.DataFrame(np.random.rand(100, 5), columns=["a", "b", "c", "d", "e"])
To generate the report, run:
profile = ProfileReport(df, title="Pandas Profiling Report")
You can configure the profile report in any way you like. The example code below loads the explorative configuration, which includes many features for text (length distribution, unicode information), files (file size, creation time) and images (dimensions, exif information). If you are interested in exactly which settings were used, you can compare with the default configuration file.
profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)
Learn more about configuring pandas-profiling on the Advanced usage page.
We recommend generating reports interactively by using the Jupyter notebook. There are two interfaces: through widgets and through an HTML report.
This is achieved by simply displaying the report. In the Jupyter Notebook, run:
profile.to_widgets()
The HTML report can be included in a Jupyter notebook:
Run the following code:
profile.to_notebook_iframe()
If you want to generate an HTML report file, save the ProfileReport to an object and use the to_file() function:
profile.to_file("your_report.html")
Alternatively, you can obtain the data as JSON:
# As a string
json_data = profile.to_json()
# As a file
profile.to_file("your_report.json")
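The JSON file can then be read back with the standard library. The layout of the JSON may differ between versions, so this sketch only lists the top-level keys instead of assuming any particular names; it also assumes the report is a JSON object at the top level. The helper name top_level_keys is hypothetical.

```python
import json
from pathlib import Path
from typing import List


def top_level_keys(json_path: str) -> List[str]:
    """Return the sorted top-level keys of a JSON report file."""
    with Path(json_path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return sorted(data)


if __name__ == "__main__":
    report = Path("your_report.json")  # file written by profile.to_file(...)
    if report.exists():
        print(top_level_keys(str(report)))
```

Inspecting the keys first is a cautious way to find out which sections your installed version actually writes.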
Version 2.4 introduces minimal mode.
This is a default configuration that disables expensive computations (such as correlations and duplicate row detection).
Use the following syntax:
profile = ProfileReport(large_dataset, minimal=True)
profile.to_file("output.html")
Benchmarks are available here.
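Another common way to keep profiling time down, not described in this README, is to profile a random sample of the rows. The sketch below combines sampling with minimal mode; the helper names sample_rows and profile_sample and the 10,000-row default are assumptions for illustration.

```python
import pandas as pd


def sample_rows(df: pd.DataFrame, max_rows: int, seed: int = 0) -> pd.DataFrame:
    """Return df unchanged if small enough, else a reproducible random sample."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)


def profile_sample(df: pd.DataFrame, max_rows: int = 10_000):
    """Profile a sample of df in minimal mode (requires pandas-profiling)."""
    from pandas_profiling import ProfileReport  # imported lazily on purpose

    title = f"Sample of {min(len(df), max_rows)} rows"
    return ProfileReport(sample_rows(df, max_rows), title=title, minimal=True)
```

Statistics computed on a sample are estimates of the full dataset, so rare values and exact counts may be missed.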
For standard formatted CSV files that can be read immediately by pandas, you can use the pandas_profiling executable.
Run the following for information about options and arguments.
pandas_profiling -h
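The executable can also be driven from a Python script, for example in a batch job. This sketch assumes the executable takes an input file followed by an output file, the same order as the arguments in the PyCharm integration section of this document; confirm with pandas_profiling -h for your version. The helper names are hypothetical.

```python
import shutil
import subprocess
from typing import List


def build_command(csv_path: str, html_path: str) -> List[str]:
    """Build the argument list: executable, input CSV, output HTML."""
    return ["pandas_profiling", csv_path, html_path]


def run_profiling(csv_path: str, html_path: str) -> bool:
    """Run the executable if it is on PATH; return True on success."""
    if shutil.which("pandas_profiling") is None:
        return False  # executable not installed or not on PATH
    result = subprocess.run(build_command(csv_path, html_path), check=False)
    return result.returncode == 0
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.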
A set of options is available in order to adapt the report generated.
title (str): Title for the report ('Pandas Profiling Report' by default).
pool_size (int): Number of workers in thread pool. When set to zero, it is set to the number of CPUs available (0 by default).
progress_bar (bool): If True, pandas-profiling will display a progress bar.
infer_dtypes (bool): When True (default), the dtype of variables is inferred by visions using the typeset logic (for instance, a column that has integers stored as strings will be analyzed as if it were numeric).
More settings can be found in the default configuration file and minimal configuration file.
You can find the configuration docs on the advanced usage page here.
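To make the infer_dtypes case concrete, the sketch below shows a column of integers stored as strings using plain pandas. It is only an illustration of the situation described above, not how visions performs the inference internally; the column name is made up.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Integers stored as strings: pandas does not treat this column as numeric.
raw = pd.Series(["1", "2", "3", "10"], name="quantity")

# Converting reveals the numeric content that type inference would analyze.
as_numbers = pd.to_numeric(raw)

print(is_numeric_dtype(raw))
print(is_numeric_dtype(as_numbers))
print(as_numbers.mean())
```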
Example
profile = df.profile_report(
title="Pandas Profiling Report", plot={"histogram": {"bins": 8}}
)
profile.to_file("output.html")
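The options listed above can be collected in a dictionary and passed in one go. This is a sketch that assumes they are accepted as keyword arguments in the same way as title in the example above; the values and the helper name build_report are illustrative.

```python
import pandas as pd

# Keyword arguments named after the options listed above; values are illustrative.
report_options = {
    "title": "Sales data profile",  # hypothetical title
    "pool_size": 2,                 # two workers instead of one per CPU
    "progress_bar": False,          # keep logs quiet in batch jobs
    "infer_dtypes": True,           # the documented default
}


def build_report(df: pd.DataFrame, options: dict):
    """Create a ProfileReport with the given options (requires pandas-profiling)."""
    from pandas_profiling import ProfileReport  # imported lazily on purpose

    return ProfileReport(df, **options)
```

Keeping the options in one place makes it easier to reuse the same settings across several datasets.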
Need help? Want to share a perspective? Want to report a bug? Ideas for collaboration? You can reach out via the following channels:
For many real-world problems we are interested in how the data changes over time. The excellent package popmon allows you to uncover temporal patterns. To learn more about popmon, have a look at these resources here.
Profiling your data is closely related to data validation: often validation rules are defined in terms of well-known statistics. For that purpose, pandas-profiling integrates with Great Expectations. You can find more details on the Great Expectations integration here.
Maintaining and developing the open-source code for pandas-profiling, with millions of downloads and thousands of users, would not be possible without support of our gracious sponsors.
Lambda workstations, servers, laptops, and cloud services power engineers and researchers at Fortune 500 companies and 94% of the top 50 universities. Lambda Cloud offers 4 & 8 GPU instances starting at $1.50 / hr. Pre-installed with TensorFlow, PyTorch, Ubuntu, CUDA, and cuDNN.
We would like to thank our generous GitHub Sponsors supporters who make pandas-profiling possible:
Martin Sotir, Stephanie Rivera, abdulAziz
More info if you would like to appear here: GitHub Sponsor page
Types are a powerful abstraction for effective data analysis that goes beyond the logical data types (integer, float etc.). pandas-profiling currently recognizes the following types: Boolean, Numerical, Date, Categorical, URL, Path, File and Image.
We have developed a type system for Python, tailored for data analysis: visions. Choosing an appropriate typeset can both improve the overall expressiveness and reduce the complexity of your analysis/code. To learn more about pandas-profiling's type system, check out the default implementation here. In the meantime, user customized summarizations and type definitions are now fully supported - if you have a specific use-case please reach out with ideas or a PR!
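The difference between a storage dtype and a type such as URL can be illustrated with plain pandas and the standard library. This is a rough sketch of the idea only, not the visions implementation; the helper names and the rule "scheme plus network location" are assumptions for illustration.

```python
from urllib.parse import urlparse

import pandas as pd


def looks_like_url(value: object) -> bool:
    """Rough check: a string with both a scheme and a network location."""
    if not isinstance(value, str):
        return False
    parts = urlparse(value)
    return bool(parts.scheme and parts.netloc)


def is_url_column(series: pd.Series) -> bool:
    """True if the column has values and every non-missing one looks like a URL."""
    values = series.dropna()
    return len(values) > 0 and all(looks_like_url(v) for v in values)


links = pd.Series(["https://pandas.pydata.org", "https://github.com"])
names = pd.Series(["alice", "bob"])
print(is_url_column(links), is_url_column(names))
```

Both columns are stored as text, yet only one of them would sensibly be summarized by scheme or domain, which is the kind of distinction a typeset captures.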
Read on getting involved in the Contribution Guide.
A low-threshold place to ask questions or start contributing is the pandas-profiling Slack. Join the Slack community.
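1. Install pandas-profiling via the instructions above.
2. Locate your pandas-profiling executable.
On macOS / Linux:
$ which pandas_profiling
(example) /usr/local/bin/pandas_profiling
On Windows:
$ where pandas_profiling
(example) C:\ProgramData\Anaconda3\Scripts\pandas_profiling.exe
3. In PyCharm, add a new external tool named Pandas Profiling with the following values:
Program: the location obtained in step 2
Arguments: "$FilePath$" "$FileDir$/$FileNameWithoutAllExtensions$_report.html"
Working directory: $ProjectFileDir$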
To use the PyCharm Integration, right click on any dataset file:
External Tools > Pandas Profiling.
Other editor integrations may be contributed via pull requests.
The profile report is written in HTML and CSS, which means pandas-profiling
requires a modern browser.
You need Python 3 to run this package. Other dependencies can be found in the requirements files:
Filename | Requirements |
---|---|
requirements.txt | Package requirements |
requirements-dev.txt | Requirements for development |
requirements-test.txt | Requirements for testing |
setup.py | Requirements for Widgets etc. |