Concepts

Data types

Types that go beyond logical data types such as integers and floats are a powerful abstraction for effective data analysis, because they let you examine data through higher-level lenses. pandas-profiling is backed by visions, a type system developed specifically for data analysis. Currently, pandas-profiling recognizes the following types:

  • Boolean

  • Numerical

  • Date (and Datetime)

  • Categorical

  • URL

  • Path

  • File

  • Image

Appropriate typesets can both improve the overall expressiveness and reduce the complexity of the analysis and its code. User-customized summarizations and type definitions are fully supported, and PRs that add new data types for specific use cases are more than welcome. For reference, you can check the implementation of pandas-profiling’s default typeset here.
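
To make the idea of types beyond logical data types concrete, here is a minimal, standard-library sketch of semantic type inference over a column of strings. It is not how visions or pandas-profiling implement their typeset: the helper names, the `CHECKS` ordering, and the rule that every non-missing value must pass a check are all assumptions made for illustration.

```python
from datetime import datetime
from urllib.parse import urlparse


def _is_boolean(value: str) -> bool:
    return value.strip().lower() in {"true", "false"}


def _is_numerical(value: str) -> bool:
    try:
        float(value)
        return True
    except ValueError:
        return False


def _is_date(value: str) -> bool:
    try:
        datetime.fromisoformat(value)
        return True
    except ValueError:
        return False


def _is_url(value: str) -> bool:
    parts = urlparse(value)
    return parts.scheme in {"http", "https"} and bool(parts.netloc)


# Hypothetical ordering: the first check that every value passes wins.
CHECKS = [
    ("Boolean", _is_boolean),
    ("Numerical", _is_numerical),
    ("Date", _is_date),
    ("URL", _is_url),
]


def infer_semantic_type(values):
    """Return the first type whose check passes for every non-missing value."""
    present = [v for v in values if v is not None and v != ""]
    if not present:
        # Nothing to inspect, so no type can be assigned.
        return "Unsupported"
    for name, check in CHECKS:
        if all(check(v) for v in present):
            return name
    # Fallback for strings that match none of the more specific checks.
    return "Categorical"
```

For example, `infer_semantic_type(["https://example.com/a", "http://example.org"])` classifies the column as a URL rather than as a generic string, which is the kind of higher-level lens a typeset provides.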

Data quality alerts

Alerts section in the NASA Meteorites dataset’s report. Some alerts include numerical indicators.

The Alerts section of the report automatically lists potential data quality issues. The list is useful, but deciding whether an alert is in fact a data quality issue always requires domain validation. Some alerts refer to a specific column, others to inter-column relationships, and others are dataset-wide. The table below lists all possible data quality alerts and their meanings.

| Alert | Description |
|---|---|
| Constant | Column only contains one value |
| Zeros | Column only contains zeros |
| High Correlation | Correlations (either Spearman, Cramér, Pearson, Kendall, 𝜙k) are above the warning threshold (configurable) |
| High Cardinality | Column has more than 50 distinct values. Threshold is configurable. |
| Skewness | Column’s univariate distribution presents skewness. Threshold value is configurable. |
| Missing Values | Column has missing values |
| Infinite Values | Column has infinite values (either np.inf or -np.inf) |
| Unique Values | All values of the column are unique (count of unique values equals column’s length) |
| Date | Column (likely/mostly) contains Date or Datetime records |
| Uniform | Column follows a uniform distribution (Chi-squared test score > 0.999, threshold score is configurable) |
| Constant length | String, date, or datetime column whose entries all have the same length |
| Rejected | Variable has mixed types or is constant (thus not suitable for meaningful analysis) |
| Unsupported | Column can’t be analysed (type is not supported, has mixed types, has lists/dicts/tuples, is empty, wrongly formatted) |
| Duplicates | Dataset-level alert signaling the presence of more than 10 duplicated records |
| Empty | Dataset-level alert signaling there’s no data to be analysed |
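
Several of the alerts in the table reduce to simple counting rules. The sketch below follows the table's wording using only the Python standard library. It is an illustration, not pandas-profiling's implementation: the function names, the parameters `cardinality_threshold` and `duplicates_threshold`, and the use of `None` for a missing value are assumptions, with defaults taken from the 50 and 10 quoted in the table.

```python
import math


def column_alerts(values, cardinality_threshold=50):
    """Return the names of the simple column-level alerts that apply.

    `None` stands for a missing value (an assumption of this sketch).
    """
    alerts = []
    present = [v for v in values if v is not None]
    if len(present) < len(values):
        alerts.append("Missing Values")
    if not present:
        return alerts
    distinct = set(present)
    if len(distinct) == 1:
        alerts.append("Constant")
    if all(isinstance(v, (int, float)) and v == 0 for v in present):
        alerts.append("Zeros")
    if any(isinstance(v, float) and math.isinf(v) for v in present):
        alerts.append("Infinite Values")
    # Unique: the count of distinct values equals the column's length.
    if len(distinct) == len(values):
        alerts.append("Unique Values")
    if len(distinct) > cardinality_threshold:
        alerts.append("High Cardinality")
    if all(isinstance(v, str) for v in present) and len({len(v) for v in present}) == 1:
        alerts.append("Constant length")
    return alerts


def dataset_alerts(rows, duplicates_threshold=10):
    """Return dataset-level alerts for a list of rows given as tuples."""
    if not rows:
        return ["Empty"]
    # A row counts as duplicated when it repeats an earlier row.
    n_duplicated = len(rows) - len(set(rows))
    return ["Duplicates"] if n_duplicated > duplicates_threshold else []
```

Calling `column_alerts(["ab", "cd", "ab"])` flags only the constant length, while `dataset_alerts([(1, "a")] * 12)` reports duplicates because eleven rows repeat the first one.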

Information on the default values and the specific parameters/thresholds used in the computation of these alerts, as well as settings to disable specific ones, can be consulted in Available settings.
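
As a rough illustration of how a configurable threshold turns a statistic into an alert, the sketch below computes skewness and the Pearson correlation with the standard library and compares them against thresholds. The function names and the default values `skewness_threshold=20.0` and `correlation_threshold=0.9` are placeholders chosen for this example, not values confirmed by this page. The skewness used here is the plain population formula, which may differ from the estimator the library applies, and both helpers assume non-constant input.

```python
import math


def skewness(values):
    """Population (Fisher-Pearson) skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def threshold_alerts(xs, ys, skewness_threshold=20.0, correlation_threshold=0.9):
    """Flag column `xs` for skewness and the pair (`xs`, `ys`) for correlation."""
    alerts = []
    if abs(skewness(xs)) > skewness_threshold:
        alerts.append("Skewness")
    if abs(pearson(xs, ys)) > correlation_threshold:
        alerts.append("High Correlation")
    return alerts
```

Lowering a threshold makes the corresponding alert fire more readily: with `xs = [1, 2, 3, 4, 100]` and `ys = [2, 4, 6, 8, 200]`, the placeholder defaults report only the high correlation, whereas passing `skewness_threshold=1.0` also reports skewness.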