Integrations
Pandas
pandas-profiling is built on pandas and numpy.
Pandas supports a wide range of data formats including CSV, XLSX, SQL, JSON, HDF5, SAS, BigQuery and Stata.
Read more about the data formats supported by pandas in the pandas documentation.
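The typical workflow is to load the data with pandas and pass the resulting DataFrame to pandas-profiling. A minimal sketch, assuming a local file named data.csv:
import pandas as pd
from pandas_profiling import ProfileReport

# Load the data with pandas (any format pandas can read will work)
df = pd.read_csv("data.csv")

# Generate a profiling report from the DataFrame
profile = ProfileReport(df, title="Profiling Report")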
Other frameworks
If you have data in another Python framework, you can use pandas-profiling by first converting it to a pandas DataFrame, as shown below. For large datasets you might need to sample first. Direct integrations are not yet supported.
# Convert a Spark DataFrame to a pandas DataFrame
df = spark_df.toPandas()
# Convert a Dask DataFrame to a pandas DataFrame
df = dask_df.compute()
# Convert a Vaex DataFrame to a pandas DataFrame
df = vaex_df.to_pandas_df()
# Convert a Modin DataFrame to a pandas DataFrame
df = modin_df._to_pandas()
# Note that:
# "This is not part of the API as pandas.DataFrame, naturally, does not posses such a method.
# You can use the private method DataFrame._to_pandas() to do this conversion.
# If you would like to do this through the official API you can always save the Modin DataFrame to
# storage (csv, hdf, sql, ect) and then read it back using Pandas. This will probably be the safer
# way when working big DataFrames, to avoid out of memory issues."
# Source: https://github.com/modin-project/modin/issues/896
User interfaces
This section lists the various ways the user can interact with the profiling results.
HTML Report
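A minimal sketch of exporting the report to a standalone HTML file, assuming the profile object created in the Pandas section above:
# Save the report as a self-contained HTML file
profile.to_file("report.html")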

Jupyter Lab/Notebook
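Inside Jupyter Lab or a notebook, the report can be rendered inline. A minimal sketch, again assuming the profile object from above:
# Render the report as a set of interactive widgets
profile.to_widgets()

# Or embed the full report as an iframe in the notebook output
profile.to_notebook_iframe()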

Command line
For standard formatted CSV files that can be read immediately by pandas, you can use the pandas_profiling executable. Run
pandas_profiling -h
for information about options and arguments.
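For example, to profile a CSV file and write the report to an HTML file (file names are placeholders):
pandas_profiling dataset.csv report.html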

Streamlit
Streamlit is an open-source Python library for building web apps for machine learning and data science.

import pandas as pd
import pandas_profiling
import streamlit as st
from streamlit_pandas_profiling import st_profile_report
df = pd.read_csv(
    "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
)
pr = df.profile_report()
st.title("Pandas Profiling in Streamlit")
st.write(df)
st_profile_report(pr)
You can install this Pandas Profiling component for Streamlit with pip:
pip install streamlit-pandas-profiling
Panel
For more information on how to use pandas-profiling
in Panel, see https://github.com/pandas-profiling/pandas-profiling/issues/491 and the Pandas Profiling example at https://awesome-panel.org.
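A minimal sketch of one possible approach (not taken from the linked resources): render the report to HTML and embed it in a Panel pane. The file name dataset.csv is a placeholder, and depending on your Panel version you may prefer to serve the HTML in an iframe instead:
import pandas as pd
import panel as pn
from pandas_profiling import ProfileReport

pn.extension()

df = pd.read_csv("dataset.csv")
report = ProfileReport(df, title="Profiling Report in Panel")

# Render the report to HTML and embed it in a Panel pane
profile_pane = pn.pane.HTML(report.to_html(), sizing_mode="stretch_width")
pn.Column("# Pandas Profiling in Panel", profile_pane).servable()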
Cloud Integrations
Lambda GPU Cloud

pandas-profiling will be pre-installed on one of the Lambda GPU Cloud images. Pandas Profiling itself does not provide GPU acceleration, but it supports workflows in which GPU acceleration is possible; for example, this is a great setup for profiling your image datasets while developing computer vision applications. Learn how to launch a 4x GPU instance here.
Google Cloud
The Google Cloud Platform documentation features an article that uses pandas-profiling.
Read it here: Building a propensity model for financial services on Google Cloud.
Kaggle
pandas-profiling is available in Kaggle notebooks by default, as it is included in the standard Kaggle image.
Pipeline Integrations
With Python, command-line and Jupyter interfaces, pandas-profiling integrates seamlessly with DAG execution tools like Airflow, Dagster, Kedro and Prefect.
Integration with Dagster or Prefect can be achieved in a similar way as with Airflow; a Prefect sketch is shown after the Airflow example below.
Airflow
Integration with Airflow can be easily achieved through the BashOperator or the PythonOperator.
# Using the command line interface
# (assumes an existing DAG object `dag` and, for Airflow 2.x, the imports
#  `from airflow.operators.bash import BashOperator` and
#  `from airflow.operators.python import PythonOperator`)
profiling_task = BashOperator(
    task_id="profile_data_cli",
    bash_command="pandas_profiling dataset.csv report.html",
    dag=dag,
)

# Using the Python interface
import pandas as pd
import pandas_profiling


def profile_data(file_name, report_file):
    df = pd.read_csv(file_name)
    report = pandas_profiling.ProfileReport(df, title="Profiling Report in Airflow")
    report.to_file(report_file)
    return "Report generated at {}".format(report_file)


profiling_task2 = PythonOperator(
    task_id="profile_data_python",
    op_kwargs={"file_name": "dataset.csv", "report_file": "report.html"},
    python_callable=profile_data,
    dag=dag,
)
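For Prefect, a minimal sketch of the same idea, assuming Prefect 2.x; the file names and flow structure are illustrative, not part of the pandas-profiling documentation:
from prefect import flow, task
import pandas as pd
from pandas_profiling import ProfileReport


@task
def profile_data(file_name: str, report_file: str) -> str:
    # Same profiling logic as in the Airflow example above
    df = pd.read_csv(file_name)
    ProfileReport(df, title="Profiling Report in Prefect").to_file(report_file)
    return report_file


@flow
def profiling_flow():
    profile_data("dataset.csv", "report.html")


if __name__ == "__main__":
    profiling_flow()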
Kedro
There is a community-created Kedro plugin available.
Editor Integrations
PyCharm
1. Install pandas-profiling via the instructions above.
2. Locate your pandas-profiling executable.
   On macOS / Linux / BSD:
   $ which pandas_profiling
   (example) /usr/local/bin/pandas_profiling
   On Windows:
   $ where pandas_profiling
   (example) C:\ProgramData\Anaconda3\Scripts\pandas_profiling.exe
3. In PyCharm, go to Settings (or Preferences on macOS) > Tools > External Tools.
4. Click the + icon to add a new external tool.
5. Insert the following values:
   Name: Pandas Profiling
   Program: the location obtained in step 2
   Arguments: "$FilePath$" "$FileDir$/$FileNameWithoutAllExtensions$_report.html"
   Working Directory: $ProjectFileDir$

To use the PyCharm integration, right-click on any dataset file and select External Tools > Pandas Profiling.