Overview

Dataset statistics

Number of variables: 14
Number of observations: 32561
Missing cells: 4262
Missing cells (%): 0.9%
Duplicate rows: 23
Duplicate rows (%): 0.1%
Total size in memory: 18.1 MiB
Average record size in memory: 583.0 B
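
The percentage and per-record figures in the dataset statistics follow from the raw counts. The sketch below recomputes them, assuming the missing-cell percentage is taken over all cells (variables × observations), the duplicate percentage over all rows, and 1 MiB = 1,048,576 bytes. The variable names are illustrative and not part of the report.

```python
# Raw counts copied from the dataset statistics.
n_variables = 14
n_observations = 32561
missing_cells = 4262
duplicate_rows = 23
total_size_mib = 18.1  # already rounded in the report, so derived values are approximate

# Assumed denominators: all cells for missing values, all rows for duplicates.
total_cells = n_variables * n_observations
missing_pct = 100 * missing_cells / total_cells
duplicate_pct = 100 * duplicate_rows / n_observations
avg_record_bytes = total_size_mib * 1024 * 1024 / n_observations

print(f"Missing cells (%): {missing_pct:.1f}%")
print(f"Duplicate rows (%): {duplicate_pct:.1f}%")
print(f"Average record size: about {avg_record_bytes:.0f} B")
```

Under these assumptions the recomputed values round to the figures shown in the report.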

Variable types

Numeric: 6
Categorical: 8

Dataset

Description: Predict whether income exceeds $50K/yr based on census data. Also known as the "Census Income" dataset. Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over 50K a year.
Creator: Barry Becker
Author: Ronny Kohavi and Barry Becker
URL: https://archive.ics.uci.edu/ml/datasets/adult
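
The extraction conditions quoted in the description can be read as a simple row filter. The sketch below expresses them as a predicate. The function name and the reading of the abbreviations (AAGE as age, AGI as adjusted gross income, AFNLWGT as final weight, HRSWK as hours per week) are assumptions, since the report does not define them.

```python
def is_clean_record(aage: float, agi: float, afnlwgt: float, hrswk: float) -> bool:
    """Return True when a record satisfies all four extraction conditions.

    Mirrors ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)).
    The parameter meanings are an assumed reading of the abbreviations.
    """
    return aage > 16 and agi > 100 and afnlwgt > 1 and hrswk > 0


# Hypothetical records, for illustration only.
print(is_clean_record(aage=39, agi=25000, afnlwgt=77516, hrswk=40))
print(is_clean_record(aage=16, agi=25000, afnlwgt=77516, hrswk=40))
```

The strict lower bound on age is consistent with the minimum age of 17 reported for the age variable.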

Variable descriptions

age: definition 0
workclass: definition 1
fnlwgt: definition 2
education: definition 3
education-num: definition 4
marital-status: definition 5
occupation: definition 6
relationship: definition 7
race: definition 8
sex: definition 9
capital-gain: definition 10
capital-loss: definition 11
hours-per-week: definition 12
native-country: definition 13

Alerts

Dataset has 23 (0.1%) duplicate rows [Duplicates]
sex is highly correlated with relationship [High correlation]
relationship is highly correlated with sex [High correlation]
age is highly correlated with marital-status [High correlation]
workclass is highly correlated with occupation [High correlation]
education is highly correlated with education-num and 1 other field [High correlation]
education-num is highly correlated with education and 1 other field [High correlation]
marital-status is highly correlated with age and 1 other field [High correlation]
occupation is highly correlated with workclass and 3 other fields [High correlation]
relationship is highly correlated with marital-status and 1 other field [High correlation]
race is highly correlated with native-country [High correlation]
sex is highly correlated with occupation and 1 other field [High correlation]
native-country is highly correlated with race [High correlation]
workclass has 1836 (5.6%) missing values [Missing]
occupation has 1843 (5.7%) missing values [Missing]
native-country has 583 (1.8%) missing values [Missing]
capital-gain has 29849 (91.7%) zeros [Zeros]
capital-loss has 31042 (95.3%) zeros [Zeros]
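
The Missing and Zeros alerts report a count and its share of all rows. The sketch below shows one way such counts could be computed for a single column. The helper `summarize_column` and the use of `None` to mark a missing value are assumptions for illustration, not the profiler's own implementation, and no alert thresholds are modelled.

```python
def summarize_column(values):
    """Count missing (None) and zero entries and express them as row percentages."""
    n = len(values)
    missing = sum(1 for v in values if v is None)
    zeros = sum(1 for v in values if v is not None and v == 0)
    return {
        "missing": missing,
        "missing_pct": 100 * missing / n,
        "zeros": zeros,
        "zeros_pct": 100 * zeros / n,
    }


# Toy column, for illustration only.
summary = summarize_column([0, 0, None, 5])
print(summary)

# The same percentage arithmetic applied to counts from the alerts.
n_rows = 32561
capital_gain_zero_pct = 100 * 29849 / n_rows
occupation_missing_pct = 100 * 1843 / n_rows
print(f"{capital_gain_zero_pct:.1f}% zeros, {occupation_missing_pct:.1f}% missing")
```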

Reproduction

Analysis started: 2022-09-06 19:01:15.019036
Analysis finished: 2022-09-06 19:01:23.282137
Duration: 8.26 seconds
Software version: pandas-profiling v3.2.0
Download configuration: config.json

Variables

age
Real number (ℝ≥0)

HIGH CORRELATION

Distinct: 73
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 38.58164676
Minimum: 17
Maximum: 90
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 254.5 KiB

Quantile statistics

Minimum: 17
5-th percentile: 19
Q1: 28
median: 37
Q3: 48
95-th percentile: 63
Maximum: 90
Range: 73
Interquartile range (IQR): 20

Descriptive statistics

Standard deviation: 13.64043255
Coefficient of variation (CV): 0.3535471837
Kurtosis: -0.1661274596
Mean: 38.58164676
Median Absolute Deviation (MAD): 10
Skewness: 0.5587433694
Sum: 1256257
Variance: 186.0614002
Monotonicity: Not monotonic
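
Several of the statistics reported for age are derived from one another. The sketch below checks those relationships using the reported values. It assumes the usual definitions (range = maximum − minimum, IQR = Q3 − Q1, CV = standard deviation / mean, variance = standard deviation squared, mean = sum / number of observations). The variable names are illustrative.

```python
import math

# Values copied from the age statistics in this report.
mean = 38.58164676
std = 13.64043255
q1, q3 = 28, 48
minimum, maximum = 17, 90
total, n_observations = 1256257, 32561

# Derived quantities under the usual definitions.
value_range = maximum - minimum
iqr = q3 - q1
cv = std / mean
variance = std ** 2
mean_from_sum = total / n_observations

print(value_range, iqr)
print(math.isclose(cv, 0.3535471837, rel_tol=1e-6))
print(math.isclose(variance, 186.0614002, rel_tol=1e-6))
print(math.isclose(mean_from_sum, mean, rel_tol=1e-6))
```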
Histogram with fixed size bins (bins=50)
Value              Count  Frequency (%)
36                 898    2.8%
31                 888    2.7%
34                 886    2.7%
23                 877    2.7%
35                 876    2.7%
33                 875    2.7%
28                 867    2.7%
30                 861    2.6%
37                 858    2.6%
25                 841    2.6%
Other values (63)  23834  73.2%

Value  Count  Frequency (%)
17     395    1.2%
18     550    1.7%
19     712    2.2%
20     753    2.3%
21     720    2.2%
22     765    2.3%
23     877    2.7%
24     798    2.5%
25     841    2.6%
26     785    2.4%

Value  Count  Frequency (%)
90     43     0.1%
88     3      < 0.1%
87     1      < 0.1%
86     1      < 0.1%
85     3      < 0.1%
84     10     < 0.1%
83     6      < 0.1%
82     12     < 0.1%
81     20     0.1%
80     22     0.1%

workclass
Categorical

HIGH CORRELATION
MISSING

Distinct: 8
Distinct (%): < 0.1%
Missing: 1836
Missing (%): 5.6%
Memory size: 2.0 MiB

Category          Count
Private           22696
Self-emp-not-inc  2541
Local-gov         2093
State-gov         1298
Self-emp-inc      1116
Other values (3)  981

Length

Max length: 17
Median length: 8
Mean length: 9.274597234
Min length: 8

Characters and Unicode

Total characters: 284962
Distinct characters: 28
Distinct categories: 4
Distinct scripts: 2
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
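
The per-category character counts in this section can be approximated with Python's standard `unicodedata` module, which returns the two-letter general category of a character (for example `Ll` for Lowercase Letter, `Lu` for Uppercase Letter, `Zs` for Space Separator, `Pd` for Dash Punctuation). This is a sketch only: the helper name is illustrative, and the leading space in the sample values is an assumption inferred from the Space Separator count (30725) matching the number of non-missing rows.

```python
import unicodedata
from collections import Counter


def category_counts(values):
    """Count characters by Unicode general category across all strings."""
    counts = Counter()
    for value in values:
        for ch in value:
            counts[unicodedata.category(ch)] += 1
    return counts


# Sample values with an assumed leading space.
sample = [" State-gov", " Self-emp-not-inc", " Private"]
counts = category_counts(sample)
lengths = [len(value) for value in sample]

print(dict(counts))
print(min(lengths), max(lengths))
```

With the assumed leading space, the shortest and longest sample strings have lengths 8 and 17, which agrees with the Min length and Max length reported for workclass.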

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row State-gov
2nd row Self-emp-not-inc
3rd row Private
4th row Private
5th row Private

Common Values

Value             Count  Frequency (%)
Private           22696  69.7%
Self-emp-not-inc  2541   7.8%
Local-gov         2093   6.4%
State-gov         1298   4.0%
Self-emp-inc      1116   3.4%
Federal-gov       960    2.9%
Without-pay       14     < 0.1%
Never-worked      7      < 0.1%
(Missing)         1836   5.6%

Length

Histogram of lengths of the category

Category Frequency Plot

Value             Count  Frequency (%)
private           22696  73.9%
self-emp-not-inc  2541   8.3%
local-gov         2093   6.8%
state-gov         1298   4.2%
self-emp-inc      1116   3.6%
federal-gov       960    3.1%
without-pay       14     < 0.1%
never-worked      7      < 0.1%
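
The two workclass frequency tables list the same counts with different percentages. The numbers are consistent with the Common Values table dividing by all rows and the category frequency table dividing by non-missing rows only. The sketch below checks that reading for the Private category. The choice of denominators is an inference from the figures, not something the report states.

```python
# Counts copied from the workclass section.
count_private = 22696
n_rows = 32561
n_missing = 1836

# Assumed denominators: all rows vs. rows with a non-missing workclass.
share_of_all_rows = 100 * count_private / n_rows
share_of_non_missing = 100 * count_private / (n_rows - n_missing)

print(f"{share_of_all_rows:.1f}% of all rows")
print(f"{share_of_non_missing:.1f}% of non-missing rows")
```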

Most occurring characters

ValueCountFrequency (%)
e33249
11.7%
30725
10.8%
t27861
9.8%
a27061
9.5%
v27054
9.5%
i26367
9.3%
r23670
8.3%
P22696
8.0%
-14227
 
5.0%
o9006
 
3.2%
Other values (18)43046
15.1%

Most occurring categories

ValueCountFrequency (%)
Lowercase Letter209285
73.4%
Space Separator30725
 
10.8%
Uppercase Letter30725
 
10.8%
Dash Punctuation14227
 
5.0%

Most frequent character per category

Lowercase Letter
ValueCountFrequency (%)
e33249
15.9%
t27861
13.3%
a27061
12.9%
v27054
12.9%
i26367
12.6%
r23670
11.3%
o9006
 
4.3%
l6710
 
3.2%
n6198
 
3.0%
c5750
 
2.7%
Other values (10)16359
7.8%
Uppercase Letter
ValueCountFrequency (%)
P22696
73.9%
S4955
 
16.1%
L2093
 
6.8%
F960
 
3.1%
W14
 
< 0.1%
N7
 
< 0.1%
Space Separator
ValueCountFrequency (%)
30725
100.0%
Dash Punctuation
ValueCountFrequency (%)
-14227
100.0%

Most occurring scripts

ValueCountFrequency (%)
Latin240010
84.2%
Common44952
 
15.8%

Most frequent character per script

Latin
ValueCountFrequency (%)
e33249
13.9%
t27861
11.6%
a27061
11.3%
v27054
11.3%
i26367
11.0%
r23670
9.9%
P22696
9.5%
o9006
 
3.8%
l6710
 
2.8%
n6198
 
2.6%
Other values (16)30138
12.6%
Common
ValueCountFrequency (%)
30725
68.4%
-14227
31.6%

Most occurring blocks

ValueCountFrequency (%)
ASCII284962
100.0%

Most frequent character per block

ASCII
ValueCountFrequency (%)
e33249
11.7%
30725
10.8%
t27861
9.8%
a27061
9.5%
v27054
9.5%
i26367
9.3%
r23670
8.3%
P22696
8.0%
-14227
 
5.0%
o9006
 
3.2%
Other values (18)43046
15.1%

fnlwgt
Real number (ℝ≥0)

Distinct: 21648
Distinct (%): 66.5%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 189778.3665
Minimum: 12285
Maximum: 1484705
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 254.5 KiB

Quantile statistics

Minimum: 12285
5-th percentile: 39460
Q1: 117827
median: 178356
Q3: 237051
95-th percentile: 379682
Maximum: 1484705
Range: 1472420
Interquartile range (IQR): 119224

Descriptive statistics

Standard deviation: 105549.9777
Coefficient of variation (CV): 0.5561749721
Kurtosis: 6.218810978
Mean: 189778.3665
Median Absolute Deviation (MAD): 59894
Skewness: 1.446980095
Sum: 6179373392
Variance: 1.114079779 × 10^10
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Value                 Count  Frequency (%)
164190                13     < 0.1%
203488                13     < 0.1%
123011                13     < 0.1%
148995                12     < 0.1%
121124                12     < 0.1%
113364                12     < 0.1%
126675                12     < 0.1%
111483                11     < 0.1%
120277                11     < 0.1%
123983                11     < 0.1%
Other values (21638)  32441  99.6%

Value  Count  Frequency (%)
12285  1      < 0.1%
13769  1      < 0.1%
14878  1      < 0.1%
18827  1      < 0.1%
19214  1      < 0.1%
19302  5      < 0.1%
19395  2      < 0.1%
19410  1      < 0.1%
19491  1      < 0.1%
19520  1      < 0.1%

Value    Count  Frequency (%)
1484705  1      < 0.1%
1455435  1      < 0.1%
1366120  1      < 0.1%
1268339  1      < 0.1%
1226583  1      < 0.1%
1184622  1      < 0.1%
1161363  1      < 0.1%
1125613  1      < 0.1%
1097453  1      < 0.1%
1085515  1      < 0.1%

education
Categorical

HIGH CORRELATION

Distinct: 16
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 2.1 MiB

Category           Count
HS-grad            10501
Some-college       7291
Bachelors          5355
Masters            1723
Assoc-voc          1382
Other values (11)  6309

Length

Max length: 13
Median length: 12
Mean length: 9.433709038
Min length: 4

Characters and Unicode

Total characters: 307171
Distinct characters: 32
Distinct categories: 5
Distinct scripts: 2
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row Bachelors
2nd row Bachelors
3rd row HS-grad
4th row 11th
5th row Bachelors

Common Values

Value             Count  Frequency (%)
HS-grad           10501  32.3%
Some-college      7291   22.4%
Bachelors         5355   16.4%
Masters           1723   5.3%
Assoc-voc         1382   4.2%
11th              1175   3.6%
Assoc-acdm        1067   3.3%
10th              933    2.9%
7th-8th           646    2.0%
Prof-school       576    1.8%
Other values (6)  1912   5.9%

Length

Histogram of lengths of the category
Value             Count  Frequency (%)
hs-grad           10501  32.3%
some-college      7291   22.4%
bachelors         5355   16.4%
masters           1723   5.3%
assoc-voc         1382   4.2%
11th              1175   3.6%
assoc-acdm        1067   3.3%
10th              933    2.9%
7th-8th           646    2.0%
prof-school       576    1.8%
Other values (6)  1912   5.9%

Most occurring characters

ValueCountFrequency (%)
32561
 
10.6%
e29415
 
9.6%
o26424
 
8.6%
-21964
 
7.2%
l20564
 
6.7%
a19059
 
6.2%
r18619
 
6.1%
c18584
 
6.1%
S17792
 
5.8%
g17792
 
5.8%
Other values (22)84397
27.5%

Most occurring categories

ValueCountFrequency (%)
Lowercase Letter205896
67.0%
Uppercase Letter38860
 
12.7%
Space Separator32561
 
10.6%
Dash Punctuation21964
 
7.2%
Decimal Number7890
 
2.6%

Most frequent character per category

Lowercase Letter
ValueCountFrequency (%)
e29415
14.3%
o26424
12.8%
l20564
10.0%
a19059
9.3%
r18619
9.0%
c18584
9.0%
g17792
8.6%
s14494
7.0%
d11568
 
5.6%
h11163
 
5.4%
Other values (4)18214
8.8%
Decimal Number
ValueCountFrequency (%)
13884
49.2%
0933
 
11.8%
7646
 
8.2%
8646
 
8.2%
9514
 
6.5%
2433
 
5.5%
5333
 
4.2%
6333
 
4.2%
4168
 
2.1%
Uppercase Letter
ValueCountFrequency (%)
S17792
45.8%
H10501
27.0%
B5355
 
13.8%
A2449
 
6.3%
M1723
 
4.4%
P627
 
1.6%
D413
 
1.1%
Space Separator
ValueCountFrequency (%)
32561
100.0%
Dash Punctuation
ValueCountFrequency (%)
-21964
100.0%

Most occurring scripts

ValueCountFrequency (%)
Latin244756
79.7%
Common62415
 
20.3%

Most frequent character per script

Latin
ValueCountFrequency (%)
e29415
12.0%
o26424
10.8%
l20564
8.4%
a19059
 
7.8%
r18619
 
7.6%
c18584
 
7.6%
S17792
 
7.3%
g17792
 
7.3%
s14494
 
5.9%
d11568
 
4.7%
Other values (11)50445
20.6%
Common
ValueCountFrequency (%)
32561
52.2%
-21964
35.2%
13884
 
6.2%
0933
 
1.5%
7646
 
1.0%
8646
 
1.0%
9514
 
0.8%
2433
 
0.7%
5333
 
0.5%
6333
 
0.5%

Most occurring blocks

ValueCountFrequency (%)
ASCII307171
100.0%

Most frequent character per block

ASCII
ValueCountFrequency (%)
32561
 
10.6%
e29415
 
9.6%
o26424
 
8.6%
-21964
 
7.2%
l20564
 
6.7%
a19059
 
6.2%
r18619
 
6.1%
c18584
 
6.1%
S17792
 
5.8%
g17792
 
5.8%
Other values (22)84397
27.5%

education-num
Real number (ℝ≥0)

HIGH CORRELATION

Distinct: 16
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 10.08067934
Minimum: 1
Maximum: 16
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 254.5 KiB

Quantile statistics

Minimum: 1
5-th percentile: 5
Q1: 9
median: 10
Q3: 12
95-th percentile: 14
Maximum: 16
Range: 15
Interquartile range (IQR): 3

Descriptive statistics

Standard deviation: 2.572720332
Coefficient of variation (CV): 0.2552129916
Kurtosis: 0.6234440748
Mean: 10.08067934
Median Absolute Deviation (MAD): 1
Skewness: -0.3116758679
Sum: 328237
Variance: 6.618889907
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=16)
Value             Count  Frequency (%)
9                 10501  32.3%
10                7291   22.4%
13                5355   16.4%
14                1723   5.3%
11                1382   4.2%
7                 1175   3.6%
12                1067   3.3%
6                 933    2.9%
4                 646    2.0%
15                576    1.8%
Other values (6)  1912   5.9%

Value  Count  Frequency (%)
1      51     0.2%
2      168    0.5%
3      333    1.0%
4      646    2.0%
5      514    1.6%
6      933    2.9%
7      1175   3.6%
8      433    1.3%
9      10501  32.3%
10     7291   22.4%

Value  Count  Frequency (%)
16     413    1.3%
15     576    1.8%
14     1723   5.3%
13     5355   16.4%
12     1067   3.3%
11     1382   4.2%
10     7291   22.4%
9      10501  32.3%
8      433    1.3%
7      1175   3.6%
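
The ten most common education-num codes have exactly the same counts as the ten most common education labels, which is consistent with the High correlation alert between the two variables. The sketch below pairs labels and codes by matching counts. Matching counts only suggest a one-to-one encoding and do not prove it, and the dictionary names are illustrative.

```python
# Counts copied from the education and education-num common-value tables.
education_counts = {
    "HS-grad": 10501, "Some-college": 7291, "Bachelors": 5355,
    "Masters": 1723, "Assoc-voc": 1382, "11th": 1175,
    "Assoc-acdm": 1067, "10th": 933, "7th-8th": 646, "Prof-school": 576,
}
education_num_counts = {
    9: 10501, 10: 7291, 13: 5355, 14: 1723, 11: 1382,
    7: 1175, 12: 1067, 6: 933, 4: 646, 15: 576,
}

# Pair each label with the code that has the same count (all counts here are distinct).
code_by_count = {count: code for code, count in education_num_counts.items()}
pairing = {label: code_by_count[count] for label, count in education_counts.items()}

for label, code in sorted(pairing.items(), key=lambda item: item[1]):
    print(code, label)
```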

marital-status
Categorical

HIGH CORRELATION

Distinct: 7
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 2.2 MiB

Category            Count
Married-civ-spouse  14976
Never-married       10683
Divorced            4443
Separated           1025
Widowed             993
Other values (2)    441

Length

Max length: 22
Median length: 19
Mean length: 15.41405362
Min length: 8

Characters and Unicode

Total characters: 501897
Distinct characters: 25
Distinct categories: 4
Distinct scripts: 2
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row Never-married
2nd row Married-civ-spouse
3rd row Divorced
4th row Married-civ-spouse
5th row Married-civ-spouse

Common Values

Value                  Count  Frequency (%)
Married-civ-spouse     14976  46.0%
Never-married          10683  32.8%
Divorced               4443   13.6%
Separated              1025   3.1%
Widowed                993    3.0%
Married-spouse-absent  418    1.3%
Married-AF-spouse      23     0.1%

Length

Histogram of lengths of the category

Category Frequency Plot

Value                  Count  Frequency (%)
married-civ-spouse     14976  46.0%
never-married          10683  32.8%
divorced               4443   13.6%
separated              1025   3.1%
widowed                993    3.0%
married-spouse-absent  418    1.3%
married-af-spouse      23     0.1%

Most occurring characters

ValueCountFrequency (%)
e70787
14.1%
r68351
13.6%
i46512
9.3%
-41517
8.3%
d33554
 
6.7%
32561
 
6.5%
s31252
 
6.2%
v30102
 
6.0%
a28568
 
5.7%
o20853
 
4.2%
Other values (15)97840
19.5%

Most occurring categories

ValueCountFrequency (%)
Lowercase Letter395212
78.7%
Dash Punctuation41517
 
8.3%
Uppercase Letter32607
 
6.5%
Space Separator32561
 
6.5%

Most frequent character per category

Lowercase Letter
ValueCountFrequency (%)
e70787
17.9%
r68351
17.3%
i46512
11.8%
d33554
8.5%
s31252
7.9%
v30102
7.6%
a28568
7.2%
o20853
 
5.3%
c19419
 
4.9%
p16442
 
4.2%
Other values (6)29372
7.4%
Uppercase Letter
ValueCountFrequency (%)
M15417
47.3%
N10683
32.8%
D4443
 
13.6%
S1025
 
3.1%
W993
 
3.0%
A23
 
0.1%
F23
 
0.1%
Dash Punctuation
ValueCountFrequency (%)
-41517
100.0%
Space Separator
ValueCountFrequency (%)
32561
100.0%

Most occurring scripts

ValueCountFrequency (%)
Latin427819
85.2%
Common74078
 
14.8%

Most frequent character per script

Latin
ValueCountFrequency (%)
e70787
16.5%
r68351
16.0%
i46512
10.9%
d33554
7.8%
s31252
7.3%
v30102
7.0%
a28568
6.7%
o20853
 
4.9%
c19419
 
4.5%
p16442
 
3.8%
Other values (13)61979
14.5%
Common
ValueCountFrequency (%)
-41517
56.0%
32561
44.0%

Most occurring blocks

ValueCountFrequency (%)
ASCII501897
100.0%

Most frequent character per block

ASCII
ValueCountFrequency (%)
e70787
14.1%
r68351
13.6%
i46512
9.3%
-41517
8.3%
d33554
 
6.7%
32561
 
6.5%
s31252
 
6.2%
v30102
 
6.0%
a28568
 
5.7%
o20853
 
4.2%
Other values (15)97840
19.5%

occupation
Categorical

HIGH CORRELATION
MISSING

Distinct: 14
Distinct (%): < 0.1%
Missing: 1843
Missing (%): 5.7%
Memory size: 2.1 MiB

Category          Count
Prof-specialty    4140
Craft-repair      4099
Exec-managerial   4066
Adm-clerical      3770
Sales             3650
Other values (9)  10993

Length

Max length: 18
Median length: 16
Mean length: 13.87398268
Min length: 6

Characters and Unicode

Total characters: 426181
Distinct characters: 32
Distinct categories: 4
Distinct scripts: 2
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row Adm-clerical
2nd row Exec-managerial
3rd row Handlers-cleaners
4th row Handlers-cleaners
5th row Prof-specialty

Common Values

Value              Count  Frequency (%)
Prof-specialty     4140   12.7%
Craft-repair       4099   12.6%
Exec-managerial    4066   12.5%
Adm-clerical       3770   11.6%
Sales              3650   11.2%
Other-service      3295   10.1%
Machine-op-inspct  2002   6.1%
Transport-moving   1597   4.9%
Handlers-cleaners  1370   4.2%
Farming-fishing    994    3.1%
Other values (4)   1735   5.3%
(Missing)          1843   5.7%

Length

Histogram of lengths of the category
Value              Count  Frequency (%)
prof-specialty     4140   13.5%
craft-repair       4099   13.3%
exec-managerial    4066   13.2%
adm-clerical       3770   12.3%
sales              3650   11.9%
other-service      3295   10.7%
machine-op-inspct  2002   6.5%
transport-moving   1597   5.2%
handlers-cleaners  1370   4.5%
farming-fishing    994    3.2%
Other values (4)   1735   5.6%

Most occurring characters

ValueCountFrequency (%)
e42979
 
10.1%
r40333
 
9.5%
a39289
 
9.2%
30718
 
7.2%
-29219
 
6.9%
i28751
 
6.7%
c26001
 
6.1%
l22136
 
5.2%
s20302
 
4.8%
t17359
 
4.1%
Other values (22)129094
30.3%

Most occurring categories

ValueCountFrequency (%)
Lowercase Letter335517
78.7%
Uppercase Letter30727
 
7.2%
Space Separator30718
 
7.2%
Dash Punctuation29219
 
6.9%

Most frequent character per category

Lowercase Letter
ValueCountFrequency (%)
e42979
12.8%
r40333
12.0%
a39289
11.7%
i28751
8.6%
c26001
 
7.7%
l22136
 
6.6%
s20302
 
6.1%
t17359
 
5.2%
n15992
 
4.8%
p15696
 
4.7%
Other values (10)66679
19.9%
Uppercase Letter
ValueCountFrequency (%)
P4938
16.1%
C4099
13.3%
E4066
13.2%
A3779
12.3%
S3650
11.9%
O3295
10.7%
T2525
8.2%
M2002
6.5%
H1370
 
4.5%
F1003
 
3.3%
Space Separator
ValueCountFrequency (%)
30718
100.0%
Dash Punctuation
ValueCountFrequency (%)
-29219
100.0%

Most occurring scripts

ValueCountFrequency (%)
Latin366244
85.9%
Common59937
 
14.1%

Most frequent character per script

Latin
ValueCountFrequency (%)
e42979
11.7%
r40333
11.0%
a39289
10.7%
i28751
 
7.9%
c26001
 
7.1%
l22136
 
6.0%
s20302
 
5.5%
t17359
 
4.7%
n15992
 
4.4%
p15696
 
4.3%
Other values (20)97406
26.6%
Common
ValueCountFrequency (%)
30718
51.3%
-29219
48.7%

Most occurring blocks

ValueCountFrequency (%)
ASCII426181
100.0%

Most frequent character per block

ASCII
ValueCountFrequency (%)
e42979
 
10.1%
r40333
 
9.5%
a39289
 
9.2%
30718
 
7.2%
-29219
 
6.9%
i28751
 
6.7%
c26001
 
6.1%
l22136
 
5.2%
s20302
 
4.8%
t17359
 
4.1%
Other values (22)129094
30.3%

relationship
Categorical

HIGH CORRELATION

Distinct: 6
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 2.1 MiB

Category       Count
Husband        13193
Not-in-family  8305
Own-child      5068
Unmarried      3446
Wife           1568

Length

Max length: 15
Median length: 14
Mean length: 10.11974448
Min length: 5

Characters and Unicode

Total characters: 329509
Distinct characters: 26
Distinct categories: 4
Distinct scripts: 2
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row Not-in-family
2nd row Husband
3rd row Not-in-family
4th row Husband
5th row Wife

Common Values

Value           Count  Frequency (%)
Husband         13193  40.5%
Not-in-family   8305   25.5%
Own-child       5068   15.6%
Unmarried       3446   10.6%
Wife            1568   4.8%
Other-relative  981    3.0%

Length

Histogram of lengths of the category

Category Frequency Plot

Value           Count  Frequency (%)
husband         13193  40.5%
not-in-family   8305   25.5%
own-child       5068   15.6%
unmarried       3446   10.6%
wife            1568   4.8%
other-relative  981    3.0%

Most occurring characters

ValueCountFrequency (%)
32561
 
9.9%
n30012
 
9.1%
i27673
 
8.4%
a25925
 
7.9%
-22659
 
6.9%
d21707
 
6.6%
l14354
 
4.4%
H13193
 
4.0%
u13193
 
4.0%
s13193
 
4.0%
Other values (16)115039
34.9%

Most occurring categories

ValueCountFrequency (%)
Lowercase Letter241728
73.4%
Space Separator32561
 
9.9%
Uppercase Letter32561
 
9.9%
Dash Punctuation22659
 
6.9%

Most frequent character per category

Lowercase Letter
ValueCountFrequency (%)
n30012
12.4%
i27673
11.4%
a25925
10.7%
d21707
 
9.0%
l14354
 
5.9%
u13193
 
5.5%
s13193
 
5.5%
b13193
 
5.5%
m11751
 
4.9%
t10267
 
4.2%
Other values (9)60460
25.0%
Uppercase Letter
ValueCountFrequency (%)
H13193
40.5%
N8305
25.5%
O6049
18.6%
U3446
 
10.6%
W1568
 
4.8%
Space Separator
ValueCountFrequency (%)
32561
100.0%
Dash Punctuation
ValueCountFrequency (%)
-22659
100.0%

Most occurring scripts

ValueCountFrequency (%)
Latin274289
83.2%
Common55220
 
16.8%

Most frequent character per script

Latin
ValueCountFrequency (%)
n30012
 
10.9%
i27673
 
10.1%
a25925
 
9.5%
d21707
 
7.9%
l14354
 
5.2%
H13193
 
4.8%
u13193
 
4.8%
s13193
 
4.8%
b13193
 
4.8%
m11751
 
4.3%
Other values (14)90095
32.8%
Common
ValueCountFrequency (%)
32561
59.0%
-22659
41.0%

Most occurring blocks

ValueCountFrequency (%)
ASCII329509
100.0%

Most frequent character per block

ASCII
ValueCountFrequency (%)
32561
 
9.9%
n30012
 
9.1%
i27673
 
8.4%
a25925
 
7.9%
-22659
 
6.9%
d21707
 
6.6%
l14354
 
4.4%
H13193
 
4.0%
u13193
 
4.0%
s13193
 
4.0%
Other values (16)115039
34.9%

race
Categorical

HIGH CORRELATION

Distinct: 5
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 2.0 MiB

Category            Count
White               27816
Black               3124
Asian-Pac-Islander  1039
Amer-Indian-Eskimo  311
Other               271

Length

Max length: 19
Median length: 6
Mean length: 6.53898836
Min length: 6

Characters and Unicode

Total characters: 212916
Distinct characters: 23
Distinct categories: 4
Distinct scripts: 2
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row White
2nd row White
3rd row White
4th row Black
5th row Black

Common Values

Value               Count  Frequency (%)
White               27816  85.4%
Black               3124   9.6%
Asian-Pac-Islander  1039   3.2%
Amer-Indian-Eskimo  311    1.0%
Other               271    0.8%

Length

Histogram of lengths of the category

Category Frequency Plot

Value               Count  Frequency (%)
white               27816  85.4%
black               3124   9.6%
asian-pac-islander  1039   3.2%
amer-indian-eskimo  311    1.0%
other               271    0.8%

Most occurring characters

ValueCountFrequency (%)
32561
15.3%
i29477
13.8%
e29437
13.8%
h28087
13.2%
t28087
13.2%
W27816
13.1%
a6552
 
3.1%
l4163
 
2.0%
c4163
 
2.0%
k3435
 
1.6%
Other values (13)19138
9.0%

Most occurring categories

ValueCountFrequency (%)
Lowercase Letter142394
66.9%
Uppercase Letter35261
 
16.6%
Space Separator32561
 
15.3%
Dash Punctuation2700
 
1.3%

Most frequent character per category

Lowercase Letter
ValueCountFrequency (%)
i29477
20.7%
e29437
20.7%
h28087
19.7%
t28087
19.7%
a6552
 
4.6%
l4163
 
2.9%
c4163
 
2.9%
k3435
 
2.4%
n2700
 
1.9%
s2389
 
1.7%
Other values (4)3904
 
2.7%
Uppercase Letter
ValueCountFrequency (%)
W27816
78.9%
B3124
 
8.9%
A1350
 
3.8%
I1350
 
3.8%
P1039
 
2.9%
E311
 
0.9%
O271
 
0.8%
Space Separator
ValueCountFrequency (%)
32561
100.0%
Dash Punctuation
ValueCountFrequency (%)
-2700
100.0%

Most occurring scripts

ValueCountFrequency (%)
Latin177655
83.4%
Common35261
 
16.6%

Most frequent character per script

Latin
ValueCountFrequency (%)
i29477
16.6%
e29437
16.6%
h28087
15.8%
t28087
15.8%
W27816
15.7%
a6552
 
3.7%
l4163
 
2.3%
c4163
 
2.3%
k3435
 
1.9%
B3124
 
1.8%
Other values (11)13314
7.5%
Common
ValueCountFrequency (%)
32561
92.3%
-2700
 
7.7%

Most occurring blocks

ValueCountFrequency (%)
ASCII212916
100.0%

Most frequent character per block

ASCII
ValueCountFrequency (%)
32561
15.3%
i29477
13.8%
e29437
13.8%
h28087
13.2%
t28087
13.2%
W27816
13.1%
a6552
 
3.1%
l4163
 
2.0%
c4163
 
2.0%
k3435
 
1.6%
Other values (13)19138
9.0%

sex
Categorical

HIGH CORRELATION

Distinct: 2
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 1.9 MiB

Category  Count
Male      21790
Female    10771

Length

Max length: 7
Median length: 5
Mean length: 5.661589018
Min length: 5

Characters and Unicode

Total characters: 184347
Distinct characters: 7
Distinct categories: 3
Distinct scripts: 2
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row Male
2nd row Male
3rd row Male
4th row Male
5th row Female

Common Values

Value   Count  Frequency (%)
Male    21790  66.9%
Female  10771  33.1%

Length

Histogram of lengths of the category

Category Frequency Plot

Value   Count  Frequency (%)
male    21790  66.9%
female  10771  33.1%

Most occurring characters

ValueCountFrequency (%)
e43332
23.5%
32561
17.7%
a32561
17.7%
l32561
17.7%
M21790
11.8%
F10771
 
5.8%
m10771
 
5.8%

Most occurring categories

ValueCountFrequency (%)
Lowercase Letter119225
64.7%
Space Separator32561
 
17.7%
Uppercase Letter32561
 
17.7%

Most frequent character per category

Lowercase Letter
ValueCountFrequency (%)
e43332
36.3%
a32561
27.3%
l32561
27.3%
m10771
 
9.0%
Uppercase Letter
ValueCountFrequency (%)
M21790
66.9%
F10771
33.1%
Space Separator
ValueCountFrequency (%)
32561
100.0%

Most occurring scripts

ValueCountFrequency (%)
Latin151786
82.3%
Common32561
 
17.7%

Most frequent character per script

Latin
ValueCountFrequency (%)
e43332
28.5%
a32561
21.5%
l32561
21.5%
M21790
14.4%
F10771
 
7.1%
m10771
 
7.1%
Common
ValueCountFrequency (%)
32561
100.0%

Most occurring blocks

ValueCountFrequency (%)
ASCII184347
100.0%

Most frequent character per block

ASCII
ValueCountFrequency (%)
e43332
23.5%
32561
17.7%
a32561
17.7%
l32561
17.7%
M21790
11.8%
F10771
 
5.8%
m10771
 
5.8%

capital-gain
Real number (ℝ≥0)

ZEROS

Distinct: 119
Distinct (%): 0.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1077.648844
Minimum: 0
Maximum: 99999
Zeros: 29849
Zeros (%): 91.7%
Negative: 0
Negative (%): 0.0%
Memory size: 254.5 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
median: 0
Q3: 0
95-th percentile: 5013
Maximum: 99999
Range: 99999
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 7385.292085
Coefficient of variation (CV): 6.853152702
Kurtosis: 154.7994379
Mean: 1077.648844
Median Absolute Deviation (MAD): 0
Skewness: 11.95384769
Sum: 35089324
Variance: 54542539.18
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Value               Count  Frequency (%)
0                   29849  91.7%
15024               347    1.1%
7688                284    0.9%
7298                246    0.8%
99999               159    0.5%
5178                97     0.3%
3103                97     0.3%
4386                70     0.2%
5013                69     0.2%
8614                55     0.2%
Other values (109)  1288   4.0%

Value  Count  Frequency (%)
0      29849  91.7%
114    6      < 0.1%
401    2      < 0.1%
594    34     0.1%
914    8      < 0.1%
991    5      < 0.1%
1055   25     0.1%
1086   4      < 0.1%
1111   1      < 0.1%
1151   8      < 0.1%

Value  Count  Frequency (%)
99999  159    0.5%
41310  2      < 0.1%
34095  5      < 0.1%
27828  34     0.1%
25236  11     < 0.1%
25124  4      < 0.1%
22040  1      < 0.1%
20051  37     0.1%
18481  2      < 0.1%
15831  6      < 0.1%

capital-loss
Real number (ℝ≥0)

ZEROS

Distinct: 92
Distinct (%): 0.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 87.30382973
Minimum: 0
Maximum: 4356
Zeros: 31042
Zeros (%): 95.3%
Negative: 0
Negative (%): 0.0%
Memory size: 254.5 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
median: 0
Q3: 0
95-th percentile: 0
Maximum: 4356
Range: 4356
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 402.9602186
Coefficient of variation (CV): 4.615607584
Kurtosis: 20.37680171
Mean: 87.30382973
Median Absolute Deviation (MAD): 0
Skewness: 4.594629122
Sum: 2842700
Variance: 162376.9378
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Value              Count  Frequency (%)
0                  31042  95.3%
1902               202    0.6%
1977               168    0.5%
1887               159    0.5%
1848               51     0.2%
1485               51     0.2%
2415               49     0.2%
1602               47     0.1%
1740               42     0.1%
1590               40     0.1%
Other values (82)  710    2.2%

Value  Count  Frequency (%)
0      31042  95.3%
155    1      < 0.1%
213    4      < 0.1%
323    3      < 0.1%
419    3      < 0.1%
625    12     < 0.1%
653    3      < 0.1%
810    2      < 0.1%
880    6      < 0.1%
974    2      < 0.1%

Value  Count  Frequency (%)
4356   3      < 0.1%
3900   2      < 0.1%
3770   2      < 0.1%
3683   2      < 0.1%
3004   2      < 0.1%
2824   10     < 0.1%
2754   2      < 0.1%
2603   5      < 0.1%
2559   12     < 0.1%
2547   4      < 0.1%

hours-per-week
Real number (ℝ≥0)

Distinct: 94
Distinct (%): 0.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 40.43745585
Minimum: 1
Maximum: 99
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 254.5 KiB

Quantile statistics

Minimum: 1
5-th percentile: 18
Q1: 40
median: 40
Q3: 45
95-th percentile: 60
Maximum: 99
Range: 98
Interquartile range (IQR): 5

Descriptive statistics

Standard deviation: 12.34742868
Coefficient of variation (CV): 0.3053463286
Kurtosis: 2.916686796
Mean: 40.43745585
Median Absolute Deviation (MAD): 3
Skewness: 0.2276425368
Sum: 1316684
Variance: 152.4589951
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Value              Count  Frequency (%)
40                 15217  46.7%
50                 2819   8.7%
45                 1824   5.6%
60                 1475   4.5%
35                 1297   4.0%
20                 1224   3.8%
30                 1149   3.5%
55                 694    2.1%
25                 674    2.1%
48                 517    1.6%
Other values (84)  5671   17.4%

Value  Count  Frequency (%)
1      20     0.1%
2      32     0.1%
3      39     0.1%
4      54     0.2%
5      60     0.2%
6      64     0.2%
7      26     0.1%
8      145    0.4%
9      18     0.1%
10     278    0.9%

Value  Count  Frequency (%)
99     85     0.3%
98     11     < 0.1%
97     2      < 0.1%
96     5      < 0.1%
95     2      < 0.1%
94     1      < 0.1%
92     1      < 0.1%
91     3      < 0.1%
90     29     0.1%
89     2      < 0.1%

native-country
Categorical

HIGH CORRELATION
MISSING

Distinct: 41
Distinct (%): 0.1%
Missing: 583
Missing (%): 1.8%
Memory size: 2.2 MiB

Category           Count
United-States      29170
Mexico             643
Philippines        198
Germany            137
Canada             121
Other values (36)  1709

Length

Max length: 27
Median length: 14
Mean length: 13.49974983
Min length: 5

Characters and Unicode

Total characters: 431695
Distinct characters: 45
Distinct categories: 7
Distinct scripts: 2
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 1
Unique (%): < 0.1%

Sample

1st row United-States
2nd row United-States
3rd row United-States
4th row United-States
5th row Cuba

Common Values

Value              Count  Frequency (%)
United-States      29170  89.6%
Mexico             643    2.0%
Philippines        198    0.6%
Germany            137    0.4%
Canada             121    0.4%
Puerto-Rico        114    0.4%
El-Salvador        106    0.3%
India              100    0.3%
Cuba               95     0.3%
England            90     0.3%
Other values (31)  1204   3.7%
(Missing)          583    1.8%

Length

Histogram of lengths of the category
Value              Count  Frequency (%)
united-states      29170  91.2%
mexico             643    2.0%
philippines        198    0.6%
germany            137    0.4%
canada             121    0.4%
puerto-rico        114    0.4%
el-salvador        106    0.3%
india              100    0.3%
cuba               95     0.3%
england            90     0.3%
Other values (31)  1204   3.8%

Most occurring characters

ValueCountFrequency (%)
t88030
20.4%
e59820
13.9%
31978
 
7.4%
a31774
 
7.4%
i31372
 
7.3%
n30568
 
7.1%
d29801
 
6.9%
-29503
 
6.8%
s29416
 
6.8%
S29396
 
6.8%
Other values (35)40037
9.3%

Most occurring categories

ValueCountFrequency (%)
Lowercase Letter308611
71.5%
Uppercase Letter61556
 
14.3%
Space Separator31978
 
7.4%
Dash Punctuation29503
 
6.8%
Other Punctuation19
 
< 0.1%
Open Punctuation14
 
< 0.1%
Close Punctuation14
 
< 0.1%

Most frequent character per category

Lowercase Letter
ValueCountFrequency (%)
t88030
28.5%
e59820
19.4%
a31774
 
10.3%
i31372
 
10.2%
n30568
 
9.9%
d29801
 
9.7%
s29416
 
9.5%
o1448
 
0.5%
c1124
 
0.4%
l949
 
0.3%
Other values (11)4309
 
1.4%
Uppercase Letter
ValueCountFrequency (%)
S29396
47.8%
U29198
47.4%
M643
 
1.0%
P440
 
0.7%
C369
 
0.6%
I254
 
0.4%
G244
 
0.4%
E224
 
0.4%
R184
 
0.3%
J143
 
0.2%
Other values (9)461
 
0.7%
Space Separator
ValueCountFrequency (%)
31978
100.0%
Dash Punctuation
ValueCountFrequency (%)
-29503
100.0%
Other Punctuation
ValueCountFrequency (%)
&19
100.0%
Open Punctuation
ValueCountFrequency (%)
(14
100.0%
Close Punctuation
ValueCountFrequency (%)
)14
100.0%

Most occurring scripts

ValueCountFrequency (%)
Latin370167
85.7%
Common61528
 
14.3%

Most frequent character per script

Latin
ValueCountFrequency (%)
t88030
23.8%
e59820
16.2%
a31774
 
8.6%
i31372
 
8.5%
n30568
 
8.3%
d29801
 
8.1%
s29416
 
7.9%
S29396
 
7.9%
U29198
 
7.9%
o1448
 
0.4%
Other values (30)9344
 
2.5%
Common
ValueCountFrequency (%)
31978
52.0%
-29503
48.0%
&19
 
< 0.1%
(14
 
< 0.1%
)14
 
< 0.1%

Most occurring blocks

ValueCountFrequency (%)
ASCII431695
100.0%

Most frequent character per block

ASCII
ValueCountFrequency (%)
t88030
20.4%
e59820
13.9%
31978
 
7.4%
a31774
 
7.4%
i31372
 
7.3%
n30568
 
7.1%
d29801
 
6.9%
-29503
 
6.8%
s29416
 
6.8%
S29396
 
6.8%
Other values (35)40037
9.3%

Interactions

[Figures: five interaction plots, rendered as images in the original report]