Changelog v3.2.0

🎉 Features

  • Add stop words to word_summary_vc [#863]

  • Show categorical frequencies with a stacked horizontal bar chart (barh) instead of a pie chart

  • Make pie plot colors customizable

🐛 Bug fixes

  • Fix pandas 1.4.x compatibility [#911]

  • Omit setting of mpl backend (special thanks to Jake Odom)

  • Character counts bugfix [#842]

  • Default type for render map (Unsupported)

👷‍♂️ Internal Improvements

  • tryceratops for CI: improved exception handling

⬆️ Dependencies

  • tangled-up-in-unicode 0.2.0 (unicode 14)

  • Loosen jupyter-client dependency for Colab (now >=5.3.4, was >=6.0.0)

Changelog v3.1.0

🎉 Features

  • Fine-grained progress bar

🐛 Bug fixes

  • Python 3.9 and 3.10 compatibility

  • Phik correlation order

📖 Documentation

  • Several link fixes, readme updates

👷‍♂️ Internal Improvements

  • Matplotlib backend

⬆️ Dependencies

  • pre-commit

  • visions to 0.7.4

Changelog v3.0.0

This is the first release to adhere to the SemVer and Conventional Commits specifications.

🎉 Features

  • The report configuration was completely overhauled, providing a more intuitive API and fixing issues inherent to the previous global config.

🐛 Bug fixes

  • Various issues that could not be (easily) solved in the previous configuration architecture are fixed in this release ([584], [644], [698], [720] and [724])

  • Fix crash with exotic characters ([707])

  • Fixed the way (sub)titles were shown in the report grids.

📖 Documentation

  • Enforce QA using flake8 for documentation, for instance checking for backticks and enforcing black code style on examples.

  • Automated configuration documentation API.

👷‍♂️ Internal Improvements

  • CI: mypy type checking was moved to the pre-commit hooks.

🚨 Breaking changes

The configuration syntax has changed!

The yaml configuration now requires the official syntax (e.g. null instead of None). The previously used configuration library could not handle comments with indentation - you are now free to use conventional yaml.

For the Python configuration, the set_variable method has been replaced by more intuitive access to the configuration object. For example, you can now set the title as follows: report.config.title = "My title".

The docs provide additional examples.

⬆️ Dependencies

  • pydantic and PyYaml are dependencies for the new configuration.

  • confuse and attrs are no longer (explicit) dependencies.

  • Upgraded tangled-up-in-unicode to 0.0.7.

Changelog v2.13.0

🎉 Features

  • configurable numeric precision

👷‍♂️ Internal Improvements

  • string type detection performance optimization

  • various improvements to software quality (flake8, commitlint)

⬆️ Dependencies

  • upgrade from visions 0.6.0 to 0.7.1

  • upgrade from coverage <5 to ~=5.5

Changelog v2.12.0

🎉 Features

  • Add the number and the percentage of negative values for numerical variables [695] (contributed by @gverbock)

  • Enable setting of typeset/summarizer (contributed by @ieaves)

  • Allow empty data frames [678] (contributed by @spbail, @fwd2020-c)

🐛 Bug fixes

  • Patch args for great_expectations datetime profiler [727] (contributed by @jstammers)

  • Negative exponent formatting [723] (reported by @rdpapworth)

📖 Documentation

  • Fix link syntax (contributed by @ChrisCarini)

👷‍♂️ Internal Improvements

  • Several performance improvements (minimal mode, duplicates, frequency table sorting)

  • Introduce pytest-benchmark in CI to monitor commit performance impact

  • Introduce commitlint in CI to start automating the changelog generation

⬆️ Dependencies

  • The ipywidgets dependency was moved to the [notebook] extra, so most of Jupyter will not be installed alongside this package by default (contributed by @akx)

  • Replaced the (testing only) fastparquet dependency with pyarrow (default pandas parquet engine, contributed by @kurosch)

  • Upgrade phik. This drops the hard dependency on numba (contributed by @akx)

Changelog v2.11.0

🎉 Features

  • Great Expectations integration [430] docs (thanks @spbail, @talagluck and the Great Expectations team).

  • Introduced the infer_dtypes parameter to control automatic inference of data types [676] (thanks @mohith7548 and @ieaves).

  • Improved JSON representation for pd.Series, pd.DataFrame, numpy data and Samples.

🚨 Breaking changes

  • Global config setting removed; config resets on report initialization.

⬆️ Dependencies

  • Update pyupgrade to 2.10.0.

Changelog v2.10.1

🐛 Bug fixes

  • Fixed recursion error for NaN values [683] and [671]

  • Fixed error for empty dataframe [664]

  • Fixed Jupyter notebook widget string rendering issue [668]

  • Fixed histogram of string length with NaNs [642] and [613]

  • Fixed slugify logic for interaction columns [663]

📖 Documentation

  • Update Slack community link on readme [673]

  • Include recent contributions to the “Resources” page.

Changelog v2.10.0

🎉 Features

  • Restructured the overview for categorical variables.

  • Handling of compressed files

  • Option for random sample

👷‍♂️ Internal Improvements

  • Full visions integration for type system: read more here.

  • Migrate from Travis CI to GitHub Actions…

🚨 Breaking changes

  • The configuration parameter is replaced by

Changelog v2.9.0

🎉 Features

  • Descriptions per variable are now possible (see the metadata page or the Census example).

🐛 Bug fixes

  • Fixed bug for small DataFrames with unused categories.

  • Fixed bug where parallelization would have side effects.

  • Removed warning where colormap was modified in place.

  • Distinguish between unique and distinct correctly.
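
The difference between "distinct" and "unique" in the last fix can be illustrated with a small standard-library sketch. It assumes the usual definitions (distinct: the number of different values; unique: the number of values that occur exactly once). The function names are hypothetical and not part of the package's API.

```python
from collections import Counter


def count_distinct(values):
    """Number of different values in the column."""
    return len(set(values))


def count_unique(values):
    """Number of values that occur exactly once."""
    counts = Counter(values)
    return sum(1 for n in counts.values() if n == 1)


column = ["a", "b", "b", "c", "c", "c"]
print(count_distinct(column))  # 3 different values: a, b, c
print(count_unique(column))    # 1 value occurs exactly once: a
```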

📖 Documentation

  • Extend documentation for frequent issues.

  • Extended documentation for Streamlit and Panel.

  • Provide visibility to our supporters.

⬆️ Dependencies

  • Pandas 1.1.0 contains bugs that make it incompatible. Please up- or downgrade.

  • Upgraded visions to 0.5.0.

Changelog v2.9.0rc1

🎉 Features

  • Working with sensitive data: Introduced sensitive=True option to mask non-aggregated data (such as samples, duplicates, frequency tables for categorical columns) [#503].

  • The sample section can be parametrized with a custom sample (for instance mock data).

  • Introduce shorthands for groups of parameters for styles and explorative mode [#499].

  • Metadata of a dataset can be added to the report (see documentation).

  • Numeric columns now report monotonicity information.

  • A pie chart can be generated for boolean and (low) categorical columns.

🐛 Bug fixes

  • NaT in date columns were interpreted as a date in 1680 by histograms [#507].

  • ValueError: (‘widget type not understood’, ‘select’) [#493].

  • Fixed regression in working with pandas’ nullable integers [#502].

  • Formatting of precision of numeric values has been improved in a few places.

👷‍♂️ Internal Improvements

  • Histograms used to be calculated at view time (single thread) and are now computed in parallel.

  • Matplotlib’s rcParams are now modified through the contextmanager [#494].

📖 Documentation

  • Links to Colab and Binder notebooks [#480 and #497].

  • The documentation for sensitive data, large datasets and metadata has been extended.

🚨 Breaking changes

  • bayesian_blocks binning has been removed, together with the astropy dependency.

  • Config files config_dark.yaml, config_united.yaml and config_explorative.yaml have been removed in favour of shorthand for groups of parameters.

⬆️ Dependencies

  • isort updated to major version 5.

  • attrs is now required for classes.

Changelog v2.8.0

🎉 Features

  • Expanded the Unicode analysis capabilities: next to the most occurring unicode scripts, categories and blocks, it’s now possible to inspect the most frequent characters for each of them.

  • ProfileReport.set_variable now accepts nested parameters such as report.set_variable("variables.descriptions", {"var1": "Identifier"}).

  • Ability to have descriptions of the variables alongside the descriptive statistics (#232, #402).

  • Config: Introducing config shorthands.

  • Config: plot.scatter_threshold configures the threshold above which scatter plots are replaced with hexbin plots.

  • Config: html.inline allows for rendering assets as vector images to package the export as a folder and file (similar to exporting a website) (#452).

  • It’s now possible to specify which interactions to compute to filter out un-needed interactions between columns (#451).

  • When the output_file is omitted in the CLI, it uses the input_file with the HTML extension. This can be useful when profiling a complete directory from the command line, e.g. find . -type f -name "*.csv" -exec pandas_profiling {} \;.

  • Config: Split the in and for more control on the summaries.

  • Config: Included a new configuration sample file config_explorative.yml, including Text (length distribution, unicode information), File (file size, creation time), Image (dimensions, exif information).
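
The nested-parameter form of set_variable listed above addresses a setting by a dotted path such as "variables.descriptions". As a rough illustration of that idea only, the sketch below walks a plain dictionary. set_nested is a hypothetical helper, the "title" entry is made-up sample data, and none of this is how the package implements set_variable.

```python
def set_nested(config, dotted_key, value):
    """Set config["a"]["b"]... = value for a key written as "a.b..."."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Create intermediate levels on demand.
        node = node.setdefault(name, {})
    node[leaf] = value
    return config


config = {"title": "My report"}
set_nested(config, "variables.descriptions", {"var1": "Identifier"})
print(config["variables"]["descriptions"]["var1"])
```

A key without dots falls through the loop and is set at the top level, so flat and nested parameters share one code path.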

🐛 Bug fixes

  • Resolved color ValueError on Mac (#464).

  • Style: too many interactions overflowed tabs. Now they elegantly turn into a select control.

  • Unique variables are always uniform and have high cardinality, hence we can remove the redundant labels.

  • The counts for unicode properties were based on unique characters, instead of following the original frequency distribution.

  • Slimmed down the HTML by removing classes and using more effective CSS.

👷‍♂️ Internal Improvements

  • CI: Added macOS and Windows to the testing environments (experimental).

  • CI: Added python3.9-dev to the testing environment (experimental).

  • CI: Reduced the number of permutations for code formatting and type checking.

📖 Documentation

  • API documentation is now available.

⚠️ Deprecated

  • The bayesian_bins parameter will be removed in the next release.

🚨 Breaking changes

  • Config: is replaced by and

⬆️ Dependencies

  • Update visions to 0.4.4 for more informative Unicode summaries.

Changelog v2.7.1

⬆️ Dependencies

  • Fix version of visions due to breaking changes in new summarization functions.

Changelog v2.7.0

🎉 Features

  • Reports are built in phases, see issue for details (#421)

  • The most frequently occurring duplicate rows are included in the report.

  • ProfileReports can now be saved to and loaded from disk (for caching).

  • Explicit analysis duration is added to the reproduction section of the report.

  • Doc: this version introduces documentation powered by Sphinx. The previously used pdoc3 was adequate initially, but lacks functionality and extensibility.

  • Doc: A dedicated page for large datasets was created (#420).

  • Doc: The installation instructions have been extended; installation via conda would default to 1.4.1 (#449, #448).

  • CI: Linting, building the documentation and examples and uploading the package to PyPI have been automated using git flow and GitHub Actions.

🐛 Bug fixes

  • Warnings were not shown in the “warnings” tab, but were shown at variable level (#389).

  • The “median absolute deviation” is now reported instead of the “mean absolute deviation” (#453).

  • Several style-related fixes for Jupyter lab and notebooks (tables, warnings, wide images).

  • pd.NA, introduced in pandas 1, is now supported (#437).

  • The logic for calculating infinite values is now correct (#397).
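
The two MAD statistics named in the fix above can differ considerably on skewed data. A minimal standard-library sketch, using made-up sample values and hypothetical function names (the package itself may compute these differently):

```python
from statistics import mean, median


def median_absolute_deviation(values):
    """Median of the absolute deviations from the median."""
    center = median(values)
    return median([abs(v - center) for v in values])


def mean_absolute_deviation(values):
    """Mean of the absolute deviations from the mean."""
    center = mean(values)
    return mean([abs(v - center) for v in values])


data = [1, 2, 3, 4, 100]  # one large outlier
print(median_absolute_deviation(data))  # 1
print(mean_absolute_deviation(data))    # 31.2
```

The median-based statistic is barely affected by the outlier, while the mean-based one is dominated by it.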

👷‍♂️ Internal Improvements

  • The number of progress bars is reduced. The progress bars are now grouped by build phase (e.g. describing dataset, building report structure, rendering report, exporting to file).

  • The progress bars provide more information about the current step to the user (#434).

  • Invalid correlation coefficients no longer cause the complete variable to be dropped; instead, the plot now propagates the NaN (#417).

  • Performance: type inference tests now short-circuit, as visions does by default.

  • Performance: the numerical summary is optimized to use numpy directly, instead of slower methods provided by pandas.

  • Config: dynamic histogram bins are now disabled by default for better computational performance (#441).

  • Config: the type inference used to warn when date variables are processed as categorical is now set to False by default, as it is a bottleneck for larger datasets.

  • Warn: the user is warned that to_widgets does not work in Google Colab, which doesn’t support ipywidgets properly (#462).

  • Cln: Moved ProfileReport out of __init__ to its own class file.

  • Cln: removed the output_file parameter from examples.

  • Cln: the HTML representations of the footer and wrapper are moved out of ProfileReport to the report structure.

  • Cln: the imports are automatically ordered with isort.

⚠️ Deprecated

  • Doc: the pdoc3 documentation will be removed in the future.

  • Config: using the config globally is deprecated. In the future, the configuration will be tied to the ProfileReport.

🚨 Breaking changes

  • Doc: the example HTML reports were removed from the repository (still available in the gh-pages branch and documentation).

  • The recoded “correlation” was removed for not being informative enough to justify its costs.

⬆️ Dependencies

  • The requirements now correctly exclude pandas 1.0.0, 1.0.1 and 1.0.2. Either use pandas <1 or >=1.0.3.
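
The version rule in this note can be expressed as a small check. This is a sketch only: is_supported_pandas is a hypothetical helper, it assumes plain numeric versions such as "1.0.3", it ignores any lower bound the package may set for pandas <1, and it does not reproduce the package's actual requirement specifier.

```python
def is_supported_pandas(version):
    """Return True for pandas <1 or >=1.0.3, per the note above."""
    parts = tuple(int(p) for p in version.split("."))
    # Pad to three components so "1.0" compares like "1.0.0".
    parts = parts + (0,) * (3 - len(parts))
    return parts < (1, 0, 0) or parts >= (1, 0, 3)


for v in ["0.25.3", "1.0.0", "1.0.2", "1.0.3", "1.1.5"]:
    print(v, is_supported_pandas(v))
```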

Prior to v2.7.0

Previously, there was no explicit changelog. However, changes were included in the release descriptions on GitHub, which you can find on this page.