Profiling large datasets
pandas-profiling comprehensively summarizes the input dataset in a way that gives the most insights for data analysis. For small datasets, these computations can be performed in quasi real-time. For larger datasets, deciding upfront which calculations to make might be required.
Whether a computation scales to a large datasets not only depends on the exact size of the detaset, but also on its complexity and on whether fast computations are available. If the computation time of the profiling becomes a bottleneck,
pandas-profiling offers several alternatives to overcome it.
pandas-profiling includes a minimal configuration file where the most expensive computations are turned off by default.
This is the recommended starting point for larger datasets.
profile = ProfileReport(large_dataset, minimal=True) profile.to_file("output.html")
(minimal mode was introduced in version v2.4.0)
Sample the dataset
An alternative way to handle really large datasets is to use a portion of it to generate the profiling report. Several users report this is a good way to scale back the computation time while maintaining representativity.
# Sample 10.000 rows sample = large_dataset.sample(10000) profile = ProfileReport(sample, minimal=True) profile.to_file("output.html")
The reader of the report might want to know that the profile is generated using a sample from the data. This can be done by adding a description to the report (see Dataset metadata and data dictionaries for details).
description = "Disclaimer: this profiling report was generated using a sample of 5% of the original dataset." sample = large_dataset.sample(frac=0.05) profile = sample.profile_report(description=description, minimal=True) profile.to_file("output.html")
Disable expensive computations
To decrease the computational burden in particularly large datasets but still maintain some information of interest that may stem from them, some computations can be filtered only for certain columns. Particularly, a list of targets can be provided to Interactions, so that only the interactions with these variables in specific are computed.
from pandas_profiling import ProfileReport import pandas as pd # Reading the data data = pd.read_csv( "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv" ) # Creating the profile without specifying the data source, to allow editing the configuration profile = ProfileReport() profile.config.interactions.targets = ["Name", "Sex", "Age"] # Assigning a DataFrame and exporting to a file, triggering computation profile.df = data profile.to_file("report.html")
The setting controlling this,
ìnteractions.targets, can be changed via multiple interfaces (configuration files or environment variables). For details, see Changing settings.