pandas-profiling generates profile reports from a pandas
df.describe() function is handy yet a little basic for exploratory data analysis.
pandas-profiling extends pandas
which automatically generates a standardized univariate and multivariate report for data understanding.
For each column, the following information (whenever relevant for the column type) is presented in an interactive HTML report:
Type inference: detect the types of columns in a
Essentials: type, unique values, indication of missing values
Quantile statistics: minimum value, Q1, median, Q3, maximum, range, interquartile range
Descriptive statistics: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
Most frequent and extreme values
Histograms: categorical and numerical
Correlations: high correlation warnings, based on different correlation metrics (Spearman, Pearson, Kendall, Cramér’s V, Phik, Auto)
Missing values: through counts, matrix and heatmap
Duplicate rows: list of the most common duplicated rows
Text analysis: most common categories (uppercase, lowercase, separator), scripts (Latin, Cyrillic) and blocks (ASCII, Cyrilic)
File and Image analysis: file sizes, creation dates, dimensions, indication of truncated images and existence of EXIF metadata
The report contains three additional sections:
Overview: mostly global details about the dataset (number of records, number of variables, overall missigness and duplicates, memory footprint)
Alerts: a comprehensive and automatic list of potential data quality issues (high correlation, skewness, uniformity, zeros, missing values, constant values, between others)
Reproduction: technical details about the analysis (time, version and configuration)
The package can be used via code but also directly as a CLI utility. The generated interactive report can be consumed and shared as regular HTML or embedded in an interactive way inside Jupyter Notebooks.
⚡ Looking for a Spark backend to profile large datasets?
While not yet finished, a Spark backend is in development. Progress can be tracked here. Testing and contributions are welcome!