Advanced Usage
A set of options is available to adapt the generated report.
General settings
Parameter | Type | Default | Description |
---|---|---|---|
title | string | Pandas Profiling Report | Title for the report, shown in the header and title bar. |
pool_size | integer | 0 | Number of workers in the thread pool. When set to zero, it is set to the number of CPUs available. |
progress_bar | boolean | True | If True, pandas-profiling will display a progress bar. |
The configuration can be changed in the following ways:
# Change the config when creating the report
profile = df.profile_report(title="Pandas Profiling Report", pool_size=1)
# Change the config after
profile.config.html.minify_html = False
profile.to_file("output.html")
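The rule that a pool size of zero means "use all available CPUs" can be pictured with a small standalone sketch. The helper name resolve_pool_size is hypothetical and is not part of pandas-profiling; it only illustrates the documented behaviour under the assumption that the CPU count comes from the standard library.

```python
import os


def resolve_pool_size(pool_size: int) -> int:
    """Return the worker count, treating 0 as 'use every available CPU'.

    Hypothetical helper for illustration; not the library's own code.
    """
    if pool_size < 0:
        raise ValueError("pool_size must be zero or a positive integer")
    if pool_size == 0:
        # os.cpu_count() may return None on unusual platforms; fall back to 1.
        return os.cpu_count() or 1
    return pool_size


print(resolve_pool_size(0))  # number of CPUs on this machine
print(resolve_pool_size(1))  # a single worker
```

Setting pool_size=1, as in the example above, keeps the computation on a single worker, which can make debugging easier.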
Variable summary settings
Parameter | Type | Default | Description |
---|---|---|---|
sort | None, asc or desc | None | Sort the variables asc(ending), desc(ending) or None (leaves original sorting). |
variables.descriptions | dict | {} | Ability to display a description alongside the descriptive statistics of each variable ({"var_name": "Description"}). |
vars.num.quantiles | list[float] | [0.05,0.25,0.5,0.75,0.95] | The quantiles to calculate. Note that 0.25, 0.5 and 0.75 are required to compute other metrics (median and IQR). |
vars.num.skewness_threshold | integer | 20 | Warn if the skewness is above this threshold. |
vars.num.low_categorical_threshold | integer | 5 | If the number of distinct values is smaller than this number, then the series is considered to be categorical. Set to 0 to disable. |
vars.num.chi_squared_threshold | float | 0.999 | Set to zero to disable chi squared calculation. |
vars.cat.length | boolean | True | Check the string length and aggregate values (min, max, mean, median). |
vars.cat.characters | boolean | False | Check the distribution of characters and their Unicode properties. Often informative, but may be computationally expensive. |
vars.cat.words | boolean | False | Check the distribution of words. Often informative, but may be computationally expensive. |
vars.cat.cardinality_threshold | integer | 50 | Warn if the number of distinct values is above this threshold. |
vars.cat.n_obs | integer | 5 | Display this number of observations. |
vars.cat.chi_squared_threshold | float | 0.999 | Same as for numeric variables: set to zero to disable chi squared calculation. |
vars.bool.n_obs | integer | 3 | Same as for categorical variables: display this number of observations. |
profile = df.profile_report(
    sort="ascending",
    vars={
        "num": {"low_categorical_threshold": 0},
        "cat": {
            "length": True,
            "characters": False,
            "words": False,
            "n_obs": 5,
        },
    },
)
profile.config.variables.descriptions = {
    "files": "Files in the filesystem",
    "datec": "Creation date",
    "datem": "Modification date",
}
profile.to_file("report.html")
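The note that the 0.25, 0.5 and 0.75 quantiles are needed for the median and IQR can be made concrete with a short standard-library sketch. It does not use pandas-profiling; it only shows how the two metrics follow from those three quantiles, assuming linear interpolation between observed values.

```python
from statistics import quantiles

values = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 yields the three quartile cut points (0.25, 0.5, 0.75).
# method="inclusive" interpolates linearly between the observed values.
q25, q50, q75 = quantiles(values, n=4, method="inclusive")

median = q50       # the 0.5 quantile is the median
iqr = q75 - q25    # interquartile range: spread of the middle 50%

print(median)  # 5.0
print(iqr)     # 4.0
```

If any of these three quantiles were dropped from the configured list, the median or IQR could no longer be derived from the computed quantiles alone.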
Missing data overview plots
Parameter | Type | Default | Description |
---|---|---|---|
missing_diagrams.bar | boolean | True | Display a bar chart with counts of missing values for each column. |
missing_diagrams.matrix | boolean | True | Display a matrix of missing values. Similar to the bar chart, but may also give an overview of the co-occurrence of missing values in rows. |
missing_diagrams.heatmap | boolean | True | Display a heatmap of missing values that measures nullity correlation (i.e. how strongly the presence or absence of one variable affects the presence of another). |
missing_diagrams.dendrogram | boolean | True | Display a dendrogram. Provides insight into the co-occurrence of missing values (i.e. columns that are both filled or both none). |
profile = df.profile_report(
    missing_diagrams={
        "heatmap": False,
        "dendrogram": False,
    }
)
profile.to_file("report.html")
The missing data diagrams are generated by the missingno package.
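The nullity correlation shown in the heatmap can be illustrated without any plotting. The sketch below is a simplified standalone illustration, not the missingno implementation: it turns each column into a present/absent indicator and computes the Pearson correlation between indicators. The column names and data are made up for the example.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def presence(column):
    """1 where a value is present, 0 where it is missing (None)."""
    return [0 if value is None else 1 for value in column]


# Illustrative data: b is missing exactly where a is missing,
# c is present exactly where a is missing.
a = [1, None, 3, None, 5, 6]
b = [10, None, 30, None, 50, 60]
c = [None, 2, None, 4, None, None]

print(round(pearson(presence(a), presence(b)), 6))  # 1.0
print(round(pearson(presence(a), presence(c)), 6))  # -1.0
```

A value near 1 means two columns tend to be missing together; a value near -1 means one tends to be present when the other is missing.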
Correlations
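Parameter | Type | Default | Description |
---|---|---|---|
correlations.pearson.calculate | boolean | True | Whether to calculate this coefficient. |
correlations.pearson.warn_high_correlations | boolean | True | Warn for correlations higher than the threshold. |
correlations.pearson.threshold | float | 0.9 | Warning threshold. |
correlations.spearman.calculate | boolean | True | Whether to calculate this coefficient. |
correlations.spearman.warn_high_correlations | boolean | False | Warn for correlations higher than the threshold. |
correlations.spearman.threshold | float | 0.9 | Warning threshold. |
correlations.kendall.calculate | boolean | True | Whether to calculate this coefficient. |
correlations.kendall.warn_high_correlations | boolean | False | Warn for correlations higher than the threshold. |
correlations.kendall.threshold | float | 0.9 | Warning threshold. |
correlations.phi_k.calculate | boolean | True | Whether to calculate this coefficient. |
correlations.phi_k.warn_high_correlations | boolean | False | Warn for correlations higher than the threshold. |
correlations.phi_k.threshold | float | 0.9 | Warning threshold. |
correlations.cramers.calculate | boolean | True | Whether to calculate this coefficient. |
correlations.cramers.warn_high_correlations | boolean | True | Warn for correlations higher than the threshold. |
correlations.cramers.threshold | float | 0.9 | Warning threshold. |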
Disable all correlations:
profile = df.profile_report(
    title="Report without correlations",
    correlations={
        "pearson": {"calculate": False},
        "spearman": {"calculate": False},
        "kendall": {"calculate": False},
        "phi_k": {"calculate": False},
        "cramers": {"calculate": False},
    },
)
# or using a shorthand that is available for correlations
profile = df.profile_report(
    title="Report without correlations",
    correlations=None,
)
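To picture what a warning threshold of 0.9 means in practice, the following standalone sketch lists the variable pairs in a correlation matrix that exceed a threshold. The function high_correlation_pairs and the sample matrix are hypothetical, and comparing the absolute value is an assumption of this sketch rather than documented library behaviour.

```python
def high_correlation_pairs(matrix, threshold=0.9):
    """Return the variable pairs whose correlation exceeds the threshold.

    `matrix` is a nested dict: matrix[a][b] is the correlation of a and b.
    Assumption: strong negative correlations count too (absolute value).
    """
    names = sorted(matrix)
    pairs = []
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            if abs(matrix[left][right]) > threshold:
                pairs.append((left, right))
    return pairs


# Made-up correlation values for illustration.
sample = {
    "height": {"height": 1.0, "weight": 0.95, "age": 0.2},
    "weight": {"height": 0.95, "weight": 1.0, "age": 0.1},
    "age": {"height": 0.2, "weight": 0.1, "age": 1.0},
}

print(high_correlation_pairs(sample))  # [('height', 'weight')]
```

Raising the threshold reduces the number of warnings; disabling the warning for a coefficient removes them for that coefficient while still calculating it.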
Interactions
Parameter | Type | Default | Description |
---|---|---|---|
interactions.continuous | boolean | True | Generate a 2D scatter plot (or hexagonal binned plot) for all continuous variable pairs. |
interactions.targets | list | [] | When a list of variable names is given, only interactions between these and all other variables are given. |
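The effect of restricting interactions to a list of variable names can be sketched as a simple pairing rule. The function interaction_pairs below is hypothetical and only mirrors the description in the table; the assumptions are that an empty list means "all variables" and that a variable is not paired with itself.

```python
def interaction_pairs(columns, targets=None):
    """List the (target, other) pairs for which a plot would be drawn.

    Assumptions of this sketch: an empty or missing `targets` list means
    every column is a target, and self-pairs are skipped.
    """
    selected = targets or columns
    return [
        (target, other)
        for target in selected
        for other in columns
        if other != target
    ]


columns = ["age", "height", "weight"]

print(len(interaction_pairs(columns)))            # 6 pairs without a restriction
print(interaction_pairs(columns, targets=["age"]))
# [('age', 'height'), ('age', 'weight')]
```

With many continuous variables, limiting the targets keeps the number of plots roughly linear in the number of columns instead of quadratic.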
The HTML Report
Parameter | Type | Default | Description |
---|---|---|---|
html.minify_html | boolean | True | If True, the output HTML is minified using the htmlmin package. |
html.use_local_assets | boolean | True | If True, all assets (stylesheets, scripts, images) are stored locally. If False, a CDN is used for some stylesheets and scripts. |
html.inline | boolean | True | If True, all assets are contained in the report. If False, then a web export is created, where all assets are stored in the "[REPORT_NAME]_assets/" directory. |
html.navbar_show | boolean | True | Whether to include a navigation bar in the report. |
html.style.theme | string | None | Select a "bootswatch" theme. Available options: "flatly" (dark) and "united" (orange). |
html.style.logo | string | | A base64 encoded logo, to display in the navigation bar. |
html.style.primary_color | string | #337ab7 | The primary color to use in the report. |
html.style.full_width | boolean | False | By default, the width of the report is fixed. If set to True, the full width of the screen is used. |
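Since the logo setting expects a base64 encoded string, a short standard-library sketch of the encoding step may help. The helper encode_logo is hypothetical, and the eight bytes used here are just the PNG file signature standing in for real image data; whether the setting expects the bare string or some prefix is not stated in this document, so treat that as an open assumption.

```python
import base64


def encode_logo(raw: bytes) -> str:
    """Base64-encode raw image bytes into an ASCII string."""
    return base64.b64encode(raw).decode("ascii")


# Stand-in for the bytes of a real image file, e.g. the result of
# pathlib.Path("logo.png").read_bytes() for a file of your own.
png_signature = b"\x89PNG\r\n\x1a\n"

logo = encode_logo(png_signature)
print(logo)  # iVBORw0KGgo=
```

The resulting string is what would be supplied as the logo value in the configuration.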
Using a custom configuration file
To set the configuration of pandas-profiling using a custom file, you can start from one of the sample configuration files below and change the configuration to your liking.
from pandas_profiling import ProfileReport
profile = ProfileReport(df, config_file="your_config.yml")
profile.to_file("report.html")
Sample configuration files
A great way to get an overview of the possible configuration options is to look through the sample configuration files. The repository contains the following files:
- default configuration file (default)
- minimal configuration file (minimal computation, optimized for performance)
Configuration shorthands
It’s possible to disable certain groups of features through configuration shorthands.
# Disable samples, correlations, missing diagrams, duplicates and interactions at once
r = ProfileReport(
    df,
    samples=None,
    correlations=None,
    missing_diagrams=None,
    duplicates=None,
    interactions=None,
)
Customise plots
Plot rendering options
Arguments can be passed to the underlying matplotlib through the plot argument. The default image format can be changed from svg to png with the key-value pair image_format: "png", and the image resolution can be set with dpi: 800.
For example:
profile = ProfileReport(
    planets,
    title="Pandas Profiling Report",
    explorative=True,
    plot={"dpi": 200, "image_format": "png"},
)
Pie charts
Pie charts are used to plot the frequency of categories in categorical (or boolean) features.
By default, a feature is considered categorical if it has no more than 10 distinct values.
This threshold can be configured with the plot.pie.max_unique setting.
If the feature is not considered categorical, the pie chart is not displayed.
All pie charts can therefore be removed by setting plot.pie.max_unique = 0.
The pie chart colors can be set to any colours recognised by matplotlib with the plot.pie.colors setting.
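The pie chart rule can be summarised as a comparison between the number of distinct values and plot.pie.max_unique. The function shows_pie_chart below is a hypothetical illustration of that rule, not the library's code; ignoring missing values when counting is an assumption of the sketch.

```python
def shows_pie_chart(values, max_unique=10):
    """Decide whether a feature would get a pie chart under the stated rule.

    Assumption: missing values (None) are not counted as a category.
    """
    distinct = {value for value in values if value is not None}
    # A feature needs at least one category and no more than max_unique.
    return 0 < len(distinct) <= max_unique


colours = ["red", "green", "red", "blue", None]
identifiers = list(range(100))

print(shows_pie_chart(colours))                # True: 3 distinct values
print(shows_pie_chart(identifiers))            # False: 100 distinct values
print(shows_pie_chart(colours, max_unique=0))  # False: pie charts disabled
```

Because every non-empty feature has at least one distinct value, a threshold of 0 can never be satisfied, which is why it removes all pie charts.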
Customise correlation matrix
It's also possible to customise the correlation matrix plot directly.
This is done through the plot argument, using the correlation key.
To customise the palette, use one of the palettes from the list below (as used in seaborn) or create a custom matplotlib palette.
Supported values are:
‘Accent’, ‘Accent_r’, ‘Blues’, ‘Blues_r’, ‘BrBG’, ‘BrBG_r’, ‘BuGn’, ‘BuGn_r’, ‘BuPu’, ‘BuPu_r’, ‘CMRmap’, ‘CMRmap_r’, ‘Dark2’, ‘Dark2_r’, ‘GnBu’, ‘GnBu_r’, ‘Greens’, ‘Greens_r’, ‘Greys’, ‘Greys_r’, ‘OrRd’, ‘OrRd_r’, ‘Oranges’, ‘Oranges_r’, ‘PRGn’, ‘PRGn_r’, ‘Paired’, ‘Paired_r’, ‘Pastel1’, ‘Pastel1_r’, ‘Pastel2’, ‘Pastel2_r’, ‘PiYG’, ‘PiYG_r’, ‘PuBu’, ‘PuBuGn’, ‘PuBuGn_r’, ‘PuBu_r’, ‘PuOr’, ‘PuOr_r’, ‘PuRd’, ‘PuRd_r’, ‘Purples’, ‘Purples_r’, ‘RdBu’, ‘RdBu_r’, ‘RdGy’, ‘RdGy_r’, ‘RdPu’, ‘RdPu_r’, ‘RdYlBu’, ‘RdYlBu_r’, ‘RdYlGn’, ‘RdYlGn_r’, ‘Reds’, ‘Reds_r’, ‘Set1’, ‘Set1_r’, ‘Set2’, ‘Set2_r’, ‘Set3’, ‘Set3_r’, ‘Spectral’, ‘Spectral_r’, ‘Wistia’, ‘Wistia_r’, ‘YlGn’, ‘YlGnBu’, ‘YlGnBu_r’, ‘YlGn_r’, ‘YlOrBr’, ‘YlOrBr_r’, ‘YlOrRd’, ‘YlOrRd_r’, ‘afmhot’, ‘afmhot_r’, ‘autumn’, ‘autumn_r’, ‘binary’, ‘binary_r’, ‘bone’, ‘bone_r’, ‘brg’, ‘brg_r’, ‘bwr’, ‘bwr_r’, ‘cividis’, ‘cividis_r’, ‘cool’, ‘cool_r’, ‘coolwarm’, ‘coolwarm_r’, ‘copper’, ‘copper_r’, ‘crest’, ‘crest_r’, ‘cubehelix’, ‘cubehelix_r’, ‘flag’, ‘flag_r’, ‘flare’, ‘flare_r’, ‘gist_earth’, ‘gist_earth_r’, ‘gist_gray’, ‘gist_gray_r’, ‘gist_heat’, ‘gist_heat_r’, ‘gist_ncar’, ‘gist_ncar_r’, ‘gist_rainbow’, ‘gist_rainbow_r’, ‘gist_stern’, ‘gist_stern_r’, ‘gist_yarg’, ‘gist_yarg_r’, ‘gnuplot’, ‘gnuplot2’, ‘gnuplot2_r’, ‘gnuplot_r’, ‘gray’, ‘gray_r’, ‘hot’, ‘hot_r’, ‘hsv’, ‘hsv_r’, ‘icefire’, ‘icefire_r’, ‘inferno’, ‘inferno_r’, ‘jet’, ‘jet_r’, ‘magma’, ‘magma_r’, ‘mako’, ‘mako_r’, ‘nipy_spectral’, ‘nipy_spectral_r’, ‘ocean’, ‘ocean_r’, ‘pink’, ‘pink_r’, ‘plasma’, ‘plasma_r’, ‘prism’, ‘prism_r’, ‘rainbow’, ‘rainbow_r’, ‘rocket’, ‘rocket_r’, ‘seismic’, ‘seismic_r’, ‘spring’, ‘spring_r’, ‘summer’, ‘summer_r’, ‘tab10’, ‘tab10_r’, ‘tab20’, ‘tab20_r’, ‘tab20b’, ‘tab20b_r’, ‘tab20c’, ‘tab20c_r’, ‘terrain’, ‘terrain_r’, ‘turbo’, ‘turbo_r’, ‘twilight’, ‘twilight_r’, ‘twilight_shifted’, ‘twilight_shifted_r’, ‘viridis’, ‘viridis_r’, ‘vlag’, ‘vlag_r’, ‘winter’, ‘winter_r’
For example:
from pandas_profiling import ProfileReport
profile = ProfileReport(
    df,
    title="Pandas Profiling Report",
    explorative=True,
    plot={"correlation": {"cmap": "RdBu_r", "bad": "#000000"}},
)
Similarly, the palette for the missing values plots can be changed using the missing key, e.g.:
from pandas_profiling import ProfileReport
profile = ProfileReport(
    df,
    title="Pandas Profiling Report",
    explorative=True,
    plot={"missing": {"cmap": "RdBu_r"}},
)
Multiple runs
The ProfileReport caches intermediary results for improved performance.
Rendering the HTML report and writing the statistics to a JSON file will reuse the same computations.
If you modify the configuration in between runs, you should either create a new ProfileReport object or invalidate the relevant cached values.
If only the configuration for the HTML report is changed (for instance, you would like to tune the theme), then you only need to reset the cached HTML report.
You can use the report.invalidate_cache() method for this.
Passing the value "rendering" only resets previously rendered reports (HTML, JSON or widgets).
Alternatively, "report" also resets the report structure.
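The difference between the two cache levels can be pictured with a toy model. The class ToyReport below is entirely hypothetical and much simpler than ProfileReport; it only mimics the described semantics, and the behaviour when no value is passed (clear everything) is an assumption of the sketch.

```python
class ToyReport:
    """Toy stand-in for a report object with three cached layers."""

    def __init__(self):
        self.description = None  # computed statistics
        self.structure = None    # report structure built from the statistics
        self.html = None         # rendered output

    def to_html(self):
        # Each layer is computed once and then reused.
        if self.description is None:
            self.description = "statistics"
        if self.structure is None:
            self.structure = f"structure({self.description})"
        if self.html is None:
            self.html = f"<html>{self.structure}</html>"
        return self.html

    def invalidate_cache(self, subset=None):
        if subset not in (None, "rendering", "report"):
            raise ValueError("subset must be None, 'rendering' or 'report'")
        self.html = None  # every option drops the rendered output
        if subset in (None, "report"):
            self.structure = None  # "report" also drops the structure
        if subset is None:
            self.description = None  # assumption: no argument clears all


report = ToyReport()
report.to_html()
report.invalidate_cache("rendering")
print(report.html, report.structure)  # None structure(statistics)
```

In this model, a theme change only needs the "rendering" level, because the statistics and the report structure are unaffected.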
Read config from environment
Any profile report config setting can also be read in from environment variables.
For example:
from pandas_profiling import ProfileReport
profile = ProfileReport(df, title="My Custom Pandas Profiling Report")
is equivalent to setting the title as an environment variable
export PROFILE_TITLE="My Custom Pandas Profiling Report"
and running
from pandas_profiling import ProfileReport
profile = ProfileReport(df)
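The mapping between a setting name and its environment variable in the example above (title and PROFILE_TITLE) suggests a prefix-based lookup. The helper read_setting below is a hypothetical standard-library sketch of that idea, assuming a PROFILE_ prefix and an upper-cased setting name; it is not how pandas-profiling itself is implemented.

```python
import os


def read_setting(name, default, prefix="PROFILE_"):
    """Look up a setting in the environment, falling back to a default.

    Assumption: the variable name is the prefix plus the upper-cased
    setting name, as in PROFILE_TITLE for the title setting.
    """
    return os.environ.get(prefix + name.upper(), default)


# Equivalent to `export PROFILE_TITLE=...` in the shell before running Python.
os.environ["PROFILE_TITLE"] = "My Custom Pandas Profiling Report"

print(read_setting("title", "Pandas Profiling Report"))
# My Custom Pandas Profiling Report
```

Environment variables are convenient when the same script runs in several environments and only the configuration should differ.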