Overview

Dataset statistics

Number of variables: 14
Number of observations: 32561
Missing cells: 4262
Missing cells (%): 0.9%
Duplicate rows: 24
Duplicate rows (%): 0.1%
Total size in memory: 18.1 MiB
Average record size in memory: 583.0 B
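
The percentages in this table follow from the counts. A minimal sketch of the arithmetic, using only figures reported in this document; the variable names are illustrative, and rounding to one decimal place is an assumption about how the report formats percentages:

```python
# Counts as reported under "Dataset statistics".
n_variables = 14
n_observations = 32561
missing_cells = 4262
duplicate_rows = 24

# Per-variable missing counts as reported under "Alerts".
missing_by_variable = {"workclass": 1836, "occupation": 1843, "native-country": 583}

# Missing cells are counted over every cell (rows x columns).
total_cells = n_variables * n_observations
missing_pct = 100 * missing_cells / total_cells

# Duplicate rows are counted over rows only.
duplicate_pct = 100 * duplicate_rows / n_observations

print(f"Missing cells (%): {missing_pct:.1f}%")
print(f"Duplicate rows (%): {duplicate_pct:.1f}%")
print("Per-variable missing counts add up:",
      sum(missing_by_variable.values()) == missing_cells)
```

The three variables flagged as Missing account for all 4262 missing cells.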

Variable types

Numeric: 6
Categorical: 8

Dataset

Description: Predict whether income exceeds $50K/yr based on census data. Also known as the "Census Income" dataset. Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over 50K a year.
Creator: Barry Becker
Author: Ronny Kohavi and Barry Becker
URL: https://archive.ics.uci.edu/ml/datasets/adult
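
The extraction conditions quoted in the description can be written as a simple predicate. This is only a sketch: the function name and the sample records are hypothetical, and the four arguments simply mirror the field names AAGE, AGI, AFNLWGT and HRSWK as they appear in the description.

```python
def keep_record(aage: float, agi: float, afnlwgt: float, hrswk: float) -> bool:
    """Return True when a record satisfies all four extraction conditions."""
    # Direct translation of ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)).
    return aage > 16 and agi > 100 and afnlwgt > 1 and hrswk > 0


# Hypothetical records, for illustration only (not taken from the dataset).
records = [
    {"AAGE": 39, "AGI": 40000, "AFNLWGT": 77516, "HRSWK": 40},
    {"AAGE": 16, "AGI": 40000, "AFNLWGT": 77516, "HRSWK": 40},  # fails AAGE > 16
    {"AAGE": 45, "AGI": 40000, "AFNLWGT": 77516, "HRSWK": 0},   # fails HRSWK > 0
]

kept = [
    r for r in records
    if keep_record(r["AAGE"], r["AGI"], r["AFNLWGT"], r["HRSWK"])
]
print(len(kept), "of", len(records), "records kept")
```

The conditions are consistent with the statistics further down: the minimum age is 17 and the minimum hours-per-week is 1.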

Variable descriptions

age: definition 0
workclass: definition 1
fnlwgt: definition 2
education: definition 3
education-num: definition 4
marital-status: definition 5
occupation: definition 6
relationship: definition 7
race: definition 8
sex: definition 9
capital-gain: definition 10
capital-loss: definition 11
hours-per-week: definition 12
native-country: definition 13

Alerts

Dataset has 24 (0.1%) duplicate rows [Duplicates]
education-num is highly overall correlated with education [High correlation]
education is highly overall correlated with education-num [High correlation]
relationship is highly overall correlated with sex [High correlation]
sex is highly overall correlated with relationship [High correlation]
workclass is highly imbalanced (52.8%) [Imbalance]
race is highly imbalanced (65.6%) [Imbalance]
native-country is highly imbalanced (84.5%) [Imbalance]
workclass has 1836 (5.6%) missing values [Missing]
occupation has 1843 (5.7%) missing values [Missing]
native-country has 583 (1.8%) missing values [Missing]
capital-gain has 29849 (91.7%) zeros [Zeros]
capital-loss has 31042 (95.3%) zeros [Zeros]
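
The report does not say how the imbalance percentages are computed. The sketch below assumes the score is one minus the normalized Shannon entropy of the category counts. That assumption reproduces the reported values for workclass and race from the counts in their "Common Values" tables, but it is not confirmed by this document. The function name is illustrative.

```python
import math


def imbalance_score(counts):
    """Return 1 - H/log2(k), where H is the Shannon entropy (in bits) of the
    category distribution and k is the number of categories.

    0 means perfectly balanced, 1 means a single category holds every value.
    """
    k = len(counts)
    if k < 2:
        return 0.0
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    return 1 - entropy / math.log2(k)


# Non-missing category counts taken from the "Common Values" tables.
workclass_counts = [22696, 2541, 2093, 1298, 1116, 960, 14, 7]
race_counts = [27816, 3124, 1039, 311, 271]

print(f"workclass: {imbalance_score(workclass_counts):.1%}")  # report: 52.8%
print(f"race:      {imbalance_score(race_counts):.1%}")       # report: 65.6%
```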

Reproduction

Analysis started: 2023-01-25 14:19:24.118415
Analysis finished: 2023-01-25 14:19:31.401280
Duration: 7.28 seconds
Software version: pandas-profiling v0.0.dev0
Download configuration: config.json

Variables

age
Real number (ℝ)

Distinct: 73
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 38.581647
Minimum: 17
Maximum: 90
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 254.5 KiB

Quantile statistics

Minimum: 17
5-th percentile: 19
Q1: 28
median: 37
Q3: 48
95-th percentile: 63
Maximum: 90
Range: 73
Interquartile range (IQR): 20

Descriptive statistics

Standard deviation: 13.640433
Coefficient of variation (CV): 0.35354718
Kurtosis: -0.16612746
Mean: 38.581647
Median Absolute Deviation (MAD): 10
Skewness: 0.55874337
Sum: 1256257
Variance: 186.0614
Monotonicity: Not monotonic
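
Several of the figures above are derived from others in the same tables. The following sketch recomputes them from the reported values for age; the variable names are illustrative, and small differences are expected because the inputs are already rounded.

```python
# Values reported for `age` in this section.
n_observations = 32561
mean = 38.581647
std = 13.640433
minimum, q1, q3, maximum = 17, 28, 48, 90

value_range = maximum - minimum   # Range
iqr = q3 - q1                     # Interquartile range (IQR)
cv = std / mean                   # Coefficient of variation (CV)
variance = std ** 2               # Variance
total = mean * n_observations     # Sum

print(f"Range:    {value_range}")
print(f"IQR:      {iqr}")
print(f"CV:       {cv:.8f}")
print(f"Variance: {variance:.4f}")
print(f"Sum:      {total:.0f}")
```

The same relationships hold for every numeric variable in the report.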
Histogram with fixed size bins (bins=50)
Common values

Value              Count  Frequency (%)
36                   898  2.8%
31                   888  2.7%
34                   886  2.7%
23                   877  2.7%
35                   876  2.7%
33                   875  2.7%
28                   867  2.7%
30                   861  2.6%
37                   858  2.6%
25                   841  2.6%
Other values (63)  23834  73.2%

Minimum 10 values

Value  Count  Frequency (%)
17       395  1.2%
18       550  1.7%
19       712  2.2%
20       753  2.3%
21       720  2.2%
22       765  2.3%
23       877  2.7%
24       798  2.5%
25       841  2.6%
26       785  2.4%

Maximum 10 values

Value  Count  Frequency (%)
90        43  0.1%
88         3  < 0.1%
87         1  < 0.1%
86         1  < 0.1%
85         3  < 0.1%
84        10  < 0.1%
83         6  < 0.1%
82        12  < 0.1%
81        20  0.1%
80        22  0.1%

workclass
Categorical

Alerts: IMBALANCE, MISSING

Distinct: 8
Distinct (%): < 0.1%
Missing: 1836
Missing (%): 5.6%
Memory size: 2.0 MiB

Value             Count
Private           22696
Self-emp-not-inc   2541
Local-gov          2093
State-gov          1298
Self-emp-inc       1116
Other values (3)    981

Length

Max length: 17
Median length: 8
Mean length: 9.2745972
Min length: 8

Characters and Unicode

Total characters: 284962
Distinct characters: 28
Distinct categories: 4
Distinct scripts: 2
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: State-gov
2nd row: Self-emp-not-inc
3rd row: Private
4th row: Private
5th row: Private

Common Values

Value             Count  Frequency (%)
Private           22696  69.7%
Self-emp-not-inc   2541  7.8%
Local-gov          2093  6.4%
State-gov          1298  4.0%
Self-emp-inc       1116  3.4%
Federal-gov         960  2.9%
Without-pay          14  < 0.1%
Never-worked          7  < 0.1%
(Missing)          1836  5.6%
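
The length statistics for workclass (minimum 8, maximum 17, 284962 characters in total) are one character longer per value than the labels shown here. The sketch below assumes each raw value carries a single leading space; this is an inference from the Space Separator count (30725, equal to the number of non-missing values), not something the report states. Under that assumption the totals can be rebuilt from the table above.

```python
# Non-missing counts from the workclass "Common Values" table.
workclass_counts = {
    "Private": 22696,
    "Self-emp-not-inc": 2541,
    "Local-gov": 2093,
    "State-gov": 1298,
    "Self-emp-inc": 1116,
    "Federal-gov": 960,
    "Without-pay": 14,
    "Never-worked": 7,
}

LEADING_SPACES = 1  # assumption: one leading space per raw value

lengths = {name: len(name) + LEADING_SPACES for name in workclass_counts}
n_values = sum(workclass_counts.values())
total_chars = sum(lengths[name] * count for name, count in workclass_counts.items())

print("Non-missing values:", n_values)
print("Total characters:  ", total_chars)
print("Mean length:       ", round(total_chars / n_values, 7))
print("Min / max length:  ", min(lengths.values()), "/", max(lengths.values()))
```

If the assumption holds, stripping whitespace (for example with `str.strip`) before modelling would avoid treating `" Private"` and `"Private"` as different categories.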

Length

Histogram of lengths of the category

Common Values (Plot)

Value             Count  Frequency (%)
private           22696  73.9%
self-emp-not-inc   2541  8.3%
local-gov          2093  6.8%
state-gov          1298  4.2%
self-emp-inc       1116  3.6%
federal-gov         960  3.1%
without-pay          14  < 0.1%
never-worked          7  < 0.1%

Most occurring characters

Value               Count  Frequency (%)
e                   33249  11.7%
(space)             30725  10.8%
t                   27861  9.8%
a                   27061  9.5%
v                   27054  9.5%
i                   26367  9.3%
r                   23670  8.3%
P                   22696  8.0%
-                   14227  5.0%
o                    9006  3.2%
Other values (18)   43046  15.1%

Most occurring categories

Value              Count  Frequency (%)
Lowercase Letter  209285  73.4%
Space Separator    30725  10.8%
Uppercase Letter   30725  10.8%
Dash Punctuation   14227  5.0%
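
The category names in this table are Unicode general categories. As a rough illustration (not necessarily how pandas-profiling computes them), Python's standard `unicodedata` module returns the two-letter codes for the same categories: `Ll` (Lowercase Letter), `Lu` (Uppercase Letter), `Zs` (Space Separator) and `Pd` (Dash Punctuation). The sample string assumes a leading space in the raw value.

```python
import unicodedata
from collections import Counter


def category_counts(text: str) -> Counter:
    """Count the Unicode general category of every character in `text`."""
    return Counter(unicodedata.category(ch) for ch in text)


# Sample value; the leading space is an assumption about the raw data.
sample = " Self-emp-not-inc"
counts = category_counts(sample)

for code, n in counts.most_common():
    print(code, n)
```

Summing such counts over every value of a variable gives a table of the same shape as the one above.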

Most frequent character per category

Lowercase Letter
Value               Count  Frequency (%)
e                   33249  15.9%
t                   27861  13.3%
a                   27061  12.9%
v                   27054  12.9%
i                   26367  12.6%
r                   23670  11.3%
o                    9006  4.3%
l                    6710  3.2%
n                    6198  3.0%
c                    5750  2.7%
Other values (10)   16359  7.8%

Uppercase Letter
Value  Count  Frequency (%)
P      22696  73.9%
S       4955  16.1%
L       2093  6.8%
F        960  3.1%
W         14  < 0.1%
N          7  < 0.1%

Space Separator
Value    Count  Frequency (%)
(space)  30725  100.0%

Dash Punctuation
Value  Count  Frequency (%)
-      14227  100.0%

Most occurring scripts

Value    Count  Frequency (%)
Latin   240010  84.2%
Common   44952  15.8%

Most frequent character per script

Latin
Value               Count  Frequency (%)
e                   33249  13.9%
t                   27861  11.6%
a                   27061  11.3%
v                   27054  11.3%
i                   26367  11.0%
r                   23670  9.9%
P                   22696  9.5%
o                    9006  3.8%
l                    6710  2.8%
n                    6198  2.6%
Other values (16)   30138  12.6%

Common
Value    Count  Frequency (%)
(space)  30725  68.4%
-        14227  31.6%

Most occurring blocks

Value   Count  Frequency (%)
ASCII  284962  100.0%

Most frequent character per block

ASCII
Value               Count  Frequency (%)
e                   33249  11.7%
(space)             30725  10.8%
t                   27861  9.8%
a                   27061  9.5%
v                   27054  9.5%
i                   26367  9.3%
r                   23670  8.3%
P                   22696  8.0%
-                   14227  5.0%
o                    9006  3.2%
Other values (18)   43046  15.1%

fnlwgt
Real number (ℝ)

Distinct: 21648
Distinct (%): 66.5%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 189778.37
Minimum: 12285
Maximum: 1484705
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 254.5 KiB

Quantile statistics

Minimum: 12285
5-th percentile: 39460
Q1: 117827
median: 178356
Q3: 237051
95-th percentile: 379682
Maximum: 1484705
Range: 1472420
Interquartile range (IQR): 119224

Descriptive statistics

Standard deviation: 105549.98
Coefficient of variation (CV): 0.55617497
Kurtosis: 6.218811
Mean: 189778.37
Median Absolute Deviation (MAD): 59894
Skewness: 1.4469801
Sum: 6.1793734 × 10^9
Variance: 1.1140798 × 10^10
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Common values

Value                 Count  Frequency (%)
164190                   13  < 0.1%
203488                   13  < 0.1%
123011                   13  < 0.1%
148995                   12  < 0.1%
121124                   12  < 0.1%
113364                   12  < 0.1%
126675                   12  < 0.1%
111483                   11  < 0.1%
120277                   11  < 0.1%
123983                   11  < 0.1%
Other values (21638)  32441  99.6%

Minimum 10 values

Value  Count  Frequency (%)
12285      1  < 0.1%
13769      1  < 0.1%
14878      1  < 0.1%
18827      1  < 0.1%
19214      1  < 0.1%
19302      5  < 0.1%
19395      2  < 0.1%
19410      1  < 0.1%
19491      1  < 0.1%
19520      1  < 0.1%

Maximum 10 values

Value    Count  Frequency (%)
1484705      1  < 0.1%
1455435      1  < 0.1%
1366120      1  < 0.1%
1268339      1  < 0.1%
1226583      1  < 0.1%
1184622      1  < 0.1%
1161363      1  < 0.1%
1125613      1  < 0.1%
1097453      1  < 0.1%
1085515      1  < 0.1%

education
Categorical

Distinct: 16
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 2.1 MiB

Value              Count
HS-grad            10501
Some-college        7291
Bachelors           5355
Masters             1723
Assoc-voc           1382
Other values (11)   6309

Length

Max length: 13
Median length: 12
Mean length: 9.433709
Min length: 4

Characters and Unicode

Total characters: 307171
Distinct characters: 32
Distinct categories: 5
Distinct scripts: 2
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: Bachelors
2nd row: Bachelors
3rd row: HS-grad
4th row: 11th
5th row: Bachelors

Common Values

Value             Count  Frequency (%)
HS-grad           10501  32.3%
Some-college       7291  22.4%
Bachelors          5355  16.4%
Masters            1723  5.3%
Assoc-voc          1382  4.2%
11th               1175  3.6%
Assoc-acdm         1067  3.3%
10th                933  2.9%
7th-8th             646  2.0%
Prof-school         576  1.8%
Other values (6)   1912  5.9%

Length

Histogram of lengths of the category

Value             Count  Frequency (%)
hs-grad           10501  32.3%
some-college       7291  22.4%
bachelors          5355  16.4%
masters            1723  5.3%
assoc-voc          1382  4.2%
11th               1175  3.6%
assoc-acdm         1067  3.3%
10th                933  2.9%
7th-8th             646  2.0%
prof-school         576  1.8%
Other values (6)   1912  5.9%

Most occurring characters

Value               Count  Frequency (%)
(space)             32561  10.6%
e                   29415  9.6%
o                   26424  8.6%
-                   21964  7.2%
l                   20564  6.7%
a                   19059  6.2%
r                   18619  6.1%
c                   18584  6.1%
S                   17792  5.8%
g                   17792  5.8%
Other values (22)   84397  27.5%

Most occurring categories

Value              Count  Frequency (%)
Lowercase Letter  205896  67.0%
Uppercase Letter   38860  12.7%
Space Separator    32561  10.6%
Dash Punctuation   21964  7.2%
Decimal Number      7890  2.6%

Most frequent character per category

Lowercase Letter
Value              Count  Frequency (%)
e                  29415  14.3%
o                  26424  12.8%
l                  20564  10.0%
a                  19059  9.3%
r                  18619  9.0%
c                  18584  9.0%
g                  17792  8.6%
s                  14494  7.0%
d                  11568  5.6%
h                  11163  5.4%
Other values (4)   18214  8.8%

Decimal Number
Value  Count  Frequency (%)
1       3884  49.2%
0        933  11.8%
7        646  8.2%
8        646  8.2%
9        514  6.5%
2        433  5.5%
5        333  4.2%
6        333  4.2%
4        168  2.1%

Uppercase Letter
Value  Count  Frequency (%)
S      17792  45.8%
H      10501  27.0%
B       5355  13.8%
A       2449  6.3%
M       1723  4.4%
P        627  1.6%
D        413  1.1%

Space Separator
Value    Count  Frequency (%)
(space)  32561  100.0%

Dash Punctuation
Value  Count  Frequency (%)
-      21964  100.0%

Most occurring scripts

Value    Count  Frequency (%)
Latin   244756  79.7%
Common   62415  20.3%

Most frequent character per script

Latin
Value               Count  Frequency (%)
e                   29415  12.0%
o                   26424  10.8%
l                   20564  8.4%
a                   19059  7.8%
r                   18619  7.6%
c                   18584  7.6%
S                   17792  7.3%
g                   17792  7.3%
s                   14494  5.9%
d                   11568  4.7%
Other values (11)   50445  20.6%

Common
Value    Count  Frequency (%)
(space)  32561  52.2%
-        21964  35.2%
1         3884  6.2%
0          933  1.5%
7          646  1.0%
8          646  1.0%
9          514  0.8%
2          433  0.7%
5          333  0.5%
6          333  0.5%

Most occurring blocks

Value   Count  Frequency (%)
ASCII  307171  100.0%

Most frequent character per block

ASCII
Value               Count  Frequency (%)
(space)             32561  10.6%
e                   29415  9.6%
o                   26424  8.6%
-                   21964  7.2%
l                   20564  6.7%
a                   19059  6.2%
r                   18619  6.1%
c                   18584  6.1%
S                   17792  5.8%
g                   17792  5.8%
Other values (22)   84397  27.5%

education-num
Real number (ℝ)

Distinct: 16
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 10.080679
Minimum: 1
Maximum: 16
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 254.5 KiB

Quantile statistics

Minimum: 1
5-th percentile: 5
Q1: 9
median: 10
Q3: 12
95-th percentile: 14
Maximum: 16
Range: 15
Interquartile range (IQR): 3

Descriptive statistics

Standard deviation: 2.5727203
Coefficient of variation (CV): 0.25521299
Kurtosis: 0.62344407
Mean: 10.080679
Median Absolute Deviation (MAD): 1
Skewness: -0.31167587
Sum: 328237
Variance: 6.6188899
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=16)
Common values

Value             Count  Frequency (%)
9                 10501  32.3%
10                 7291  22.4%
13                 5355  16.4%
14                 1723  5.3%
11                 1382  4.2%
7                  1175  3.6%
12                 1067  3.3%
6                   933  2.9%
4                   646  2.0%
15                  576  1.8%
Other values (6)   1912  5.9%

Minimum 10 values

Value  Count  Frequency (%)
1         51  0.2%
2        168  0.5%
3        333  1.0%
4        646  2.0%
5        514  1.6%
6        933  2.9%
7       1175  3.6%
8        433  1.3%
9      10501  32.3%
10      7291  22.4%

Maximum 10 values

Value  Count  Frequency (%)
16       413  1.3%
15       576  1.8%
14      1723  5.3%
13      5355  16.4%
12      1067  3.3%
11      1382  4.2%
10      7291  22.4%
9      10501  32.3%
8        433  1.3%
7       1175  3.6%
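
The counts above match the counts of the education labels one for one, which fits the High correlation alert between education and education-num. The sketch below pairs labels with codes purely by equal counts. This is a heuristic that suggests, but does not prove, a one-to-one encoding; a row-level check on the data itself would be needed to confirm it.

```python
# Top-10 counts from the `education` "Common Values" table.
education_counts = {
    "HS-grad": 10501, "Some-college": 7291, "Bachelors": 5355,
    "Masters": 1723, "Assoc-voc": 1382, "11th": 1175,
    "Assoc-acdm": 1067, "10th": 933, "7th-8th": 646, "Prof-school": 576,
}

# Top-10 counts from the `education-num` common values.
education_num_counts = {
    9: 10501, 10: 7291, 13: 5355, 14: 1723, 11: 1382,
    7: 1175, 12: 1067, 6: 933, 4: 646, 15: 576,
}

# Pair each label with the code that has the same count.
# This works here only because all ten counts are distinct.
code_by_count = {count: code for code, count in education_num_counts.items()}
inferred_mapping = {
    label: code_by_count[count] for label, count in education_counts.items()
}

for label, code in sorted(inferred_mapping.items(), key=lambda item: item[1]):
    print(f"{code:>2}  {label}")
```

If the mapping is confirmed on the rows, one of the two variables is redundant for modelling.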

marital-status
Categorical

Distinct: 7
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 2.2 MiB

Value               Count
Married-civ-spouse  14976
Never-married       10683
Divorced             4443
Separated            1025
Widowed               993
Other values (2)      441

Length

Max length: 22
Median length: 19
Mean length: 15.414054
Min length: 8

Characters and Unicode

Total characters: 501897
Distinct characters: 25
Distinct categories: 4
Distinct scripts: 2
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: Never-married
2nd row: Married-civ-spouse
3rd row: Divorced
4th row: Married-civ-spouse
5th row: Married-civ-spouse

Common Values

Value                  Count  Frequency (%)
Married-civ-spouse     14976  46.0%
Never-married          10683  32.8%
Divorced                4443  13.6%
Separated               1025  3.1%
Widowed                  993  3.0%
Married-spouse-absent    418  1.3%
Married-AF-spouse         23  0.1%

Length

Histogram of lengths of the category

Common Values (Plot)

Value                  Count  Frequency (%)
married-civ-spouse     14976  46.0%
never-married          10683  32.8%
divorced                4443  13.6%
separated               1025  3.1%
widowed                  993  3.0%
married-spouse-absent    418  1.3%
married-af-spouse         23  0.1%

Most occurring characters

Value               Count  Frequency (%)
e                   70787  14.1%
r                   68351  13.6%
i                   46512  9.3%
-                   41517  8.3%
d                   33554  6.7%
(space)             32561  6.5%
s                   31252  6.2%
v                   30102  6.0%
a                   28568  5.7%
o                   20853  4.2%
Other values (15)   97840  19.5%

Most occurring categories

Value              Count  Frequency (%)
Lowercase Letter  395212  78.7%
Dash Punctuation   41517  8.3%
Uppercase Letter   32607  6.5%
Space Separator    32561  6.5%

Most frequent character per category

Lowercase Letter
Value              Count  Frequency (%)
e                  70787  17.9%
r                  68351  17.3%
i                  46512  11.8%
d                  33554  8.5%
s                  31252  7.9%
v                  30102  7.6%
a                  28568  7.2%
o                  20853  5.3%
c                  19419  4.9%
p                  16442  4.2%
Other values (6)   29372  7.4%

Uppercase Letter
Value  Count  Frequency (%)
M      15417  47.3%
N      10683  32.8%
D       4443  13.6%
S       1025  3.1%
W        993  3.0%
A         23  0.1%
F         23  0.1%

Dash Punctuation
Value  Count  Frequency (%)
-      41517  100.0%

Space Separator
Value    Count  Frequency (%)
(space)  32561  100.0%

Most occurring scripts

Value    Count  Frequency (%)
Latin   427819  85.2%
Common   74078  14.8%

Most frequent character per script

Latin
Value               Count  Frequency (%)
e                   70787  16.5%
r                   68351  16.0%
i                   46512  10.9%
d                   33554  7.8%
s                   31252  7.3%
v                   30102  7.0%
a                   28568  6.7%
o                   20853  4.9%
c                   19419  4.5%
p                   16442  3.8%
Other values (13)   61979  14.5%

Common
Value    Count  Frequency (%)
-        41517  56.0%
(space)  32561  44.0%

Most occurring blocks

Value   Count  Frequency (%)
ASCII  501897  100.0%

Most frequent character per block

ASCII
Value               Count  Frequency (%)
e                   70787  14.1%
r                   68351  13.6%
i                   46512  9.3%
-                   41517  8.3%
d                   33554  6.7%
(space)             32561  6.5%
s                   31252  6.2%
v                   30102  6.0%
a                   28568  5.7%
o                   20853  4.2%
Other values (15)   97840  19.5%

occupation
Categorical

Distinct: 14
Distinct (%): < 0.1%
Missing: 1843
Missing (%): 5.7%
Memory size: 2.1 MiB

Value             Count
Prof-specialty     4140
Craft-repair       4099
Exec-managerial    4066
Adm-clerical       3770
Sales              3650
Other values (9)  10993

Length

Max length: 18
Median length: 16
Mean length: 13.873983
Min length: 6

Characters and Unicode

Total characters: 426181
Distinct characters: 32
Distinct categories: 4
Distinct scripts: 2
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: Adm-clerical
2nd row: Exec-managerial
3rd row: Handlers-cleaners
4th row: Handlers-cleaners
5th row: Prof-specialty

Common Values

Value              Count  Frequency (%)
Prof-specialty      4140  12.7%
Craft-repair        4099  12.6%
Exec-managerial     4066  12.5%
Adm-clerical        3770  11.6%
Sales               3650  11.2%
Other-service       3295  10.1%
Machine-op-inspct   2002  6.1%
Transport-moving    1597  4.9%
Handlers-cleaners   1370  4.2%
Farming-fishing      994  3.1%
Other values (4)    1735  5.3%
(Missing)           1843  5.7%

Length

Histogram of lengths of the category

Value              Count  Frequency (%)
prof-specialty      4140  13.5%
craft-repair        4099  13.3%
exec-managerial     4066  13.2%
adm-clerical        3770  12.3%
sales               3650  11.9%
other-service       3295  10.7%
machine-op-inspct   2002  6.5%
transport-moving    1597  5.2%
handlers-cleaners   1370  4.5%
farming-fishing      994  3.2%
Other values (4)    1735  5.6%

Most occurring characters

Value               Count  Frequency (%)
e                   42979  10.1%
r                   40333  9.5%
a                   39289  9.2%
(space)             30718  7.2%
-                   29219  6.9%
i                   28751  6.7%
c                   26001  6.1%
l                   22136  5.2%
s                   20302  4.8%
t                   17359  4.1%
Other values (22)  129094  30.3%

Most occurring categories

Value              Count  Frequency (%)
Lowercase Letter  335517  78.7%
Uppercase Letter   30727  7.2%
Space Separator    30718  7.2%
Dash Punctuation   29219  6.9%

Most frequent character per category

Lowercase Letter
Value               Count  Frequency (%)
e                   42979  12.8%
r                   40333  12.0%
a                   39289  11.7%
i                   28751  8.6%
c                   26001  7.7%
l                   22136  6.6%
s                   20302  6.1%
t                   17359  5.2%
n                   15992  4.8%
p                   15696  4.7%
Other values (10)   66679  19.9%

Uppercase Letter
Value  Count  Frequency (%)
P       4938  16.1%
C       4099  13.3%
E       4066  13.2%
A       3779  12.3%
S       3650  11.9%
O       3295  10.7%
T       2525  8.2%
M       2002  6.5%
H       1370  4.5%
F       1003  3.3%

Space Separator
Value    Count  Frequency (%)
(space)  30718  100.0%

Dash Punctuation
Value  Count  Frequency (%)
-      29219  100.0%

Most occurring scripts

Value    Count  Frequency (%)
Latin   366244  85.9%
Common   59937  14.1%

Most frequent character per script

Latin
Value               Count  Frequency (%)
e                   42979  11.7%
r                   40333  11.0%
a                   39289  10.7%
i                   28751  7.9%
c                   26001  7.1%
l                   22136  6.0%
s                   20302  5.5%
t                   17359  4.7%
n                   15992  4.4%
p                   15696  4.3%
Other values (20)   97406  26.6%

Common
Value    Count  Frequency (%)
(space)  30718  51.3%
-        29219  48.7%

Most occurring blocks

Value   Count  Frequency (%)
ASCII  426181  100.0%

Most frequent character per block

ASCII
Value               Count  Frequency (%)
e                   42979  10.1%
r                   40333  9.5%
a                   39289  9.2%
(space)             30718  7.2%
-                   29219  6.9%
i                   28751  6.7%
c                   26001  6.1%
l                   22136  5.2%
s                   20302  4.8%
t                   17359  4.1%
Other values (22)  129094  30.3%

relationship
Categorical

Distinct: 6
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 2.1 MiB

Value          Count
Husband        13193
Not-in-family   8305
Own-child       5068
Unmarried       3446
Wife            1568

Length

Max length: 15
Median length: 14
Mean length: 10.119744
Min length: 5

Characters and Unicode

Total characters: 329509
Distinct characters: 26
Distinct categories: 4
Distinct scripts: 2
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: Not-in-family
2nd row: Husband
3rd row: Not-in-family
4th row: Husband
5th row: Wife

Common Values

Value           Count  Frequency (%)
Husband         13193  40.5%
Not-in-family    8305  25.5%
Own-child        5068  15.6%
Unmarried        3446  10.6%
Wife             1568  4.8%
Other-relative    981  3.0%

Length

Histogram of lengths of the category

Common Values (Plot)

Value           Count  Frequency (%)
husband         13193  40.5%
not-in-family    8305  25.5%
own-child        5068  15.6%
unmarried        3446  10.6%
wife             1568  4.8%
other-relative    981  3.0%

Most occurring characters

Value               Count  Frequency (%)
(space)             32561  9.9%
n                   30012  9.1%
i                   27673  8.4%
a                   25925  7.9%
-                   22659  6.9%
d                   21707  6.6%
l                   14354  4.4%
H                   13193  4.0%
u                   13193  4.0%
s                   13193  4.0%
Other values (16)  115039  34.9%

Most occurring categories

Value              Count  Frequency (%)
Lowercase Letter  241728  73.4%
Space Separator    32561  9.9%
Uppercase Letter   32561  9.9%
Dash Punctuation   22659  6.9%

Most frequent character per category

Lowercase Letter
Value              Count  Frequency (%)
n                  30012  12.4%
i                  27673  11.4%
a                  25925  10.7%
d                  21707  9.0%
l                  14354  5.9%
u                  13193  5.5%
s                  13193  5.5%
b                  13193  5.5%
m                  11751  4.9%
t                  10267  4.2%
Other values (9)   60460  25.0%

Uppercase Letter
Value  Count  Frequency (%)
H      13193  40.5%
N       8305  25.5%
O       6049  18.6%
U       3446  10.6%
W       1568  4.8%

Space Separator
Value    Count  Frequency (%)
(space)  32561  100.0%

Dash Punctuation
Value  Count  Frequency (%)
-      22659  100.0%

Most occurring scripts

Value    Count  Frequency (%)
Latin   274289  83.2%
Common   55220  16.8%

Most frequent character per script

Latin
Value               Count  Frequency (%)
n                   30012  10.9%
i                   27673  10.1%
a                   25925  9.5%
d                   21707  7.9%
l                   14354  5.2%
H                   13193  4.8%
u                   13193  4.8%
s                   13193  4.8%
b                   13193  4.8%
m                   11751  4.3%
Other values (14)   90095  32.8%

Common
Value    Count  Frequency (%)
(space)  32561  59.0%
-        22659  41.0%

Most occurring blocks

Value   Count  Frequency (%)
ASCII  329509  100.0%

Most frequent character per block

ASCII
Value               Count  Frequency (%)
(space)             32561  9.9%
n                   30012  9.1%
i                   27673  8.4%
a                   25925  7.9%
-                   22659  6.9%
d                   21707  6.6%
l                   14354  4.4%
H                   13193  4.0%
u                   13193  4.0%
s                   13193  4.0%
Other values (16)  115039  34.9%

race
Categorical

Distinct: 5
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 2.0 MiB

Value               Count
White               27816
Black                3124
Asian-Pac-Islander   1039
Amer-Indian-Eskimo    311
Other                 271

Length

Max length: 19
Median length: 6
Mean length: 6.5389884
Min length: 6

Characters and Unicode

Total characters: 212916
Distinct characters: 23
Distinct categories: 4
Distinct scripts: 2
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: White
2nd row: White
3rd row: White
4th row: Black
5th row: Black

Common Values

Value               Count  Frequency (%)
White               27816  85.4%
Black                3124  9.6%
Asian-Pac-Islander   1039  3.2%
Amer-Indian-Eskimo    311  1.0%
Other                 271  0.8%

Length

Histogram of lengths of the category

Common Values (Plot)

Value               Count  Frequency (%)
white               27816  85.4%
black                3124  9.6%
asian-pac-islander   1039  3.2%
amer-indian-eskimo    311  1.0%
other                 271  0.8%

Most occurring characters

Value              Count  Frequency (%)
(space)            32561  15.3%
i                  29477  13.8%
e                  29437  13.8%
h                  28087  13.2%
t                  28087  13.2%
W                  27816  13.1%
a                   6552  3.1%
l                   4163  2.0%
c                   4163  2.0%
k                   3435  1.6%
Other values (13)  19138  9.0%

Most occurring categories

Value              Count  Frequency (%)
Lowercase Letter  142394  66.9%
Uppercase Letter   35261  16.6%
Space Separator    32561  15.3%
Dash Punctuation    2700  1.3%

Most frequent character per category

Lowercase Letter
Value             Count  Frequency (%)
i                 29477  20.7%
e                 29437  20.7%
h                 28087  19.7%
t                 28087  19.7%
a                  6552  4.6%
l                  4163  2.9%
c                  4163  2.9%
k                  3435  2.4%
n                  2700  1.9%
s                  2389  1.7%
Other values (4)   3904  2.7%

Uppercase Letter
Value  Count  Frequency (%)
W      27816  78.9%
B       3124  8.9%
A       1350  3.8%
I       1350  3.8%
P       1039  2.9%
E        311  0.9%
O        271  0.8%

Space Separator
Value    Count  Frequency (%)
(space)  32561  100.0%

Dash Punctuation
Value  Count  Frequency (%)
-       2700  100.0%

Most occurring scripts

Value    Count  Frequency (%)
Latin   177655  83.4%
Common   35261  16.6%

Most frequent character per script

Latin
Value              Count  Frequency (%)
i                  29477  16.6%
e                  29437  16.6%
h                  28087  15.8%
t                  28087  15.8%
W                  27816  15.7%
a                   6552  3.7%
l                   4163  2.3%
c                   4163  2.3%
k                   3435  1.9%
B                   3124  1.8%
Other values (11)  13314  7.5%

Common
Value    Count  Frequency (%)
(space)  32561  92.3%
-         2700  7.7%

Most occurring blocks

Value   Count  Frequency (%)
ASCII  212916  100.0%

Most frequent character per block

ASCII
Value              Count  Frequency (%)
(space)            32561  15.3%
i                  29477  13.8%
e                  29437  13.8%
h                  28087  13.2%
t                  28087  13.2%
W                  27816  13.1%
a                   6552  3.1%
l                   4163  2.0%
c                   4163  2.0%
k                   3435  1.6%
Other values (13)  19138  9.0%

sex
Categorical

Distinct: 2
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 1.9 MiB

Value   Count
Male    21790
Female  10771

Length

Max length: 7
Median length: 5
Mean length: 5.661589
Min length: 5

Characters and Unicode

Total characters: 184347
Distinct characters: 7
Distinct categories: 3
Distinct scripts: 2
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: Male
2nd row: Male
3rd row: Male
4th row: Male
5th row: Female

Common Values

Value   Count  Frequency (%)
Male    21790  66.9%
Female  10771  33.1%

Length

Histogram of lengths of the category

Common Values (Plot)

Value   Count  Frequency (%)
male    21790  66.9%
female  10771  33.1%

Most occurring characters

Value    Count  Frequency (%)
e        43332  23.5%
(space)  32561  17.7%
a        32561  17.7%
l        32561  17.7%
M        21790  11.8%
F        10771  5.8%
m        10771  5.8%

Most occurring categories

Value              Count  Frequency (%)
Lowercase Letter  119225  64.7%
Space Separator    32561  17.7%
Uppercase Letter   32561  17.7%

Most frequent character per category

Lowercase Letter
Value  Count  Frequency (%)
e      43332  36.3%
a      32561  27.3%
l      32561  27.3%
m      10771  9.0%

Uppercase Letter
Value  Count  Frequency (%)
M      21790  66.9%
F      10771  33.1%

Space Separator
Value    Count  Frequency (%)
(space)  32561  100.0%

Most occurring scripts

Value    Count  Frequency (%)
Latin   151786  82.3%
Common   32561  17.7%

Most frequent character per script

Latin
Value  Count  Frequency (%)
e      43332  28.5%
a      32561  21.5%
l      32561  21.5%
M      21790  14.4%
F      10771  7.1%
m      10771  7.1%

Common
Value    Count  Frequency (%)
(space)  32561  100.0%

Most occurring blocks

Value   Count  Frequency (%)
ASCII  184347  100.0%

Most frequent character per block

ASCII
Value    Count  Frequency (%)
e        43332  23.5%
(space)  32561  17.7%
a        32561  17.7%
l        32561  17.7%
M        21790  11.8%
F        10771  5.8%
m        10771  5.8%

capital-gain
Real number (ℝ)

Distinct: 119
Distinct (%): 0.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1077.6488
Minimum: 0
Maximum: 99999
Zeros: 29849
Zeros (%): 91.7%
Negative: 0
Negative (%): 0.0%
Memory size: 254.5 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
median: 0
Q3: 0
95-th percentile: 5013
Maximum: 99999
Range: 99999
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 7385.2921
Coefficient of variation (CV): 6.8531527
Kurtosis: 154.79944
Mean: 1077.6488
Median Absolute Deviation (MAD): 0
Skewness: 11.953848
Sum: 35089324
Variance: 54542539
Monotonicity: Not monotonic
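
With 91.7% of values equal to zero and no negative values, most quantiles of capital-gain are forced to zero, which is why the median, Q3, IQR and MAD are all 0 while the 95-th percentile is not. The sketch below makes this concrete. It assumes quantiles are computed by linear interpolation between order statistics (the NumPy and pandas default); the report does not state which method it uses, and the function name is illustrative.

```python
def quantile_is_zero(q: float, n_zeros: int, n_total: int) -> bool:
    """Return True if the q-quantile of a non-negative sample must be 0.

    After sorting, the zeros occupy positions 0 .. n_zeros - 1. With linear
    interpolation the q-quantile sits at position q * (n_total - 1), so it is
    exactly 0 whenever that position does not pass the last zero.
    """
    position = q * (n_total - 1)
    return position <= n_zeros - 1


N = 32561                  # Number of observations
CAPITAL_GAIN_ZEROS = 29849
CAPITAL_LOSS_ZEROS = 31042

for q in (0.25, 0.50, 0.75, 0.95):
    print(f"capital-gain q={q:.2f} is zero:",
          quantile_is_zero(q, CAPITAL_GAIN_ZEROS, N))

print("capital-loss q=0.95 is zero:",
      quantile_is_zero(0.95, CAPITAL_LOSS_ZEROS, N))
```

The result agrees with the quantile tables: the 95-th percentile is 5013 for capital-gain but 0 for capital-loss, which has 95.3% zeros.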
Histogram with fixed size bins (bins=50)
Common values

Value               Count  Frequency (%)
0                   29849  91.7%
15024                 347  1.1%
7688                  284  0.9%
7298                  246  0.8%
99999                 159  0.5%
5178                   97  0.3%
3103                   97  0.3%
4386                   70  0.2%
5013                   69  0.2%
8614                   55  0.2%
Other values (109)   1288  4.0%

Minimum 10 values

Value  Count  Frequency (%)
0      29849  91.7%
114        6  < 0.1%
401        2  < 0.1%
594       34  0.1%
914        8  < 0.1%
991        5  < 0.1%
1055      25  0.1%
1086       4  < 0.1%
1111       1  < 0.1%
1151       8  < 0.1%

Maximum 10 values

Value  Count  Frequency (%)
99999    159  0.5%
41310      2  < 0.1%
34095      5  < 0.1%
27828     34  0.1%
25236     11  < 0.1%
25124      4  < 0.1%
22040      1  < 0.1%
20051     37  0.1%
18481      2  < 0.1%
15831      6  < 0.1%

capital-loss
Real number (ℝ)

Distinct: 92
Distinct (%): 0.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 87.30383
Minimum: 0
Maximum: 4356
Zeros: 31042
Zeros (%): 95.3%
Negative: 0
Negative (%): 0.0%
Memory size: 254.5 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
median: 0
Q3: 0
95-th percentile: 0
Maximum: 4356
Range: 4356
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 402.96022
Coefficient of variation (CV): 4.6156076
Kurtosis: 20.376802
Mean: 87.30383
Median Absolute Deviation (MAD): 0
Skewness: 4.5946291
Sum: 2842700
Variance: 162376.94
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Common values

Value              Count  Frequency (%)
0                  31042  95.3%
1902                 202  0.6%
1977                 168  0.5%
1887                 159  0.5%
1848                  51  0.2%
1485                  51  0.2%
2415                  49  0.2%
1602                  47  0.1%
1740                  42  0.1%
1590                  40  0.1%
Other values (82)    710  2.2%

Minimum 10 values

Value  Count  Frequency (%)
0      31042  95.3%
155        1  < 0.1%
213        4  < 0.1%
323        3  < 0.1%
419        3  < 0.1%
625       12  < 0.1%
653        3  < 0.1%
810        2  < 0.1%
880        6  < 0.1%
974        2  < 0.1%

Maximum 10 values

Value  Count  Frequency (%)
4356       3  < 0.1%
3900       2  < 0.1%
3770       2  < 0.1%
3683       2  < 0.1%
3004       2  < 0.1%
2824      10  < 0.1%
2754       2  < 0.1%
2603       5  < 0.1%
2559      12  < 0.1%
2547       4  < 0.1%

hours-per-week
Real number (ℝ)

Distinct: 94
Distinct (%): 0.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 40.437456
Minimum: 1
Maximum: 99
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 254.5 KiB

Quantile statistics

Minimum: 1
5-th percentile: 18
Q1: 40
median: 40
Q3: 45
95-th percentile: 60
Maximum: 99
Range: 98
Interquartile range (IQR): 5

Descriptive statistics

Standard deviation: 12.347429
Coefficient of variation (CV): 0.30534633
Kurtosis: 2.9166868
Mean: 40.437456
Median Absolute Deviation (MAD): 3
Skewness: 0.22764254
Sum: 1316684
Variance: 152.459
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Common values

Value              Count  Frequency (%)
40                 15217  46.7%
50                  2819  8.7%
45                  1824  5.6%
60                  1475  4.5%
35                  1297  4.0%
20                  1224  3.8%
30                  1149  3.5%
55                   694  2.1%
25                   674  2.1%
48                   517  1.6%
Other values (84)   5671  17.4%

Minimum 10 values

Value  Count  Frequency (%)
1         20  0.1%
2         32  0.1%
3         39  0.1%
4         54  0.2%
5         60  0.2%
6         64  0.2%
7         26  0.1%
8        145  0.4%
9         18  0.1%
10       278  0.9%

Maximum 10 values

Value  Count  Frequency (%)
99        85  0.3%
98        11  < 0.1%
97         2  < 0.1%
96         5  < 0.1%
95         2  < 0.1%
94         1  < 0.1%
92         1  < 0.1%
91         3  < 0.1%
90        29  0.1%
89         2  < 0.1%

native-country
Categorical

Alerts: IMBALANCE, MISSING

Distinct: 41
Distinct (%): 0.1%
Missing: 583
Missing (%): 1.8%
Memory size: 2.2 MiB

Value              Count
United-States      29170
Mexico               643
Philippines          198
Germany              137
Canada               121
Other values (36)   1709

Length

Max length: 27
Median length: 14
Mean length: 13.49975
Min length: 5

Characters and Unicode

Total characters: 431695
Distinct characters: 45
Distinct categories: 7
Distinct scripts: 2
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 1
Unique (%): < 0.1%

Sample

1st row: United-States
2nd row: United-States
3rd row: United-States
4th row: United-States
5th row: Cuba

Common Values

Value              Count  Frequency (%)
United-States      29170  89.6%
Mexico               643  2.0%
Philippines          198  0.6%
Germany              137  0.4%
Canada               121  0.4%
Puerto-Rico          114  0.4%
El-Salvador          106  0.3%
India                100  0.3%
Cuba                  95  0.3%
England               90  0.3%
Other values (31)   1204  3.7%
(Missing)            583  1.8%

Length

Histogram of lengths of the category

Value              Count  Frequency (%)
united-states      29170  91.2%
mexico               643  2.0%
philippines          198  0.6%
germany              137  0.4%
canada               121  0.4%
puerto-rico          114  0.4%
el-salvador          106  0.3%
india                100  0.3%
cuba                  95  0.3%
england               90  0.3%
Other values (31)   1204  3.8%

Most occurring characters

Value              Count  Frequency (%)
t                  88030  20.4%
e                  59820  13.9%
(space)            31978  7.4%
a                  31774  7.4%
i                  31372  7.3%
n                  30568  7.1%
d                  29801  6.9%
-                  29503  6.8%
s                  29416  6.8%
S                  29396  6.8%
Other values (35)  40037  9.3%

Most occurring categories

Value               Count  Frequency (%)
Lowercase Letter   308611  71.5%
Uppercase Letter    61556  14.3%
Space Separator     31978  7.4%
Dash Punctuation    29503  6.8%
Other Punctuation      19  < 0.1%
Open Punctuation       14  < 0.1%
Close Punctuation      14  < 0.1%

Most frequent character per category

Lowercase Letter
Value              Count  Frequency (%)
t                  88030  28.5%
e                  59820  19.4%
a                  31774  10.3%
i                  31372  10.2%
n                  30568  9.9%
d                  29801  9.7%
s                  29416  9.5%
o                   1448  0.5%
c                   1124  0.4%
l                    949  0.3%
Other values (11)   4309  1.4%

Uppercase Letter
Value             Count  Frequency (%)
S                 29396  47.8%
U                 29198  47.4%
M                   643  1.0%
P                   440  0.7%
C                   369  0.6%
I                   254  0.4%
G                   244  0.4%
E                   224  0.4%
R                   184  0.3%
J                   143  0.2%
Other values (9)    461  0.7%

Space Separator
Value    Count  Frequency (%)
(space)  31978  100.0%

Dash Punctuation
Value  Count  Frequency (%)
-      29503  100.0%

Other Punctuation
Value  Count  Frequency (%)
&         19  100.0%

Open Punctuation
Value  Count  Frequency (%)
(         14  100.0%

Close Punctuation
Value  Count  Frequency (%)
)         14  100.0%

Most occurring scripts

Value    Count  Frequency (%)
Latin   370167  85.7%
Common   61528  14.3%

Most frequent character per script

Latin
Value              Count  Frequency (%)
t                  88030  23.8%
e                  59820  16.2%
a                  31774  8.6%
i                  31372  8.5%
n                  30568  8.3%
d                  29801  8.1%
s                  29416  7.9%
S                  29396  7.9%
U                  29198  7.9%
o                   1448  0.4%
Other values (30)   9344  2.5%

Common
Value    Count  Frequency (%)
(space)  31978  52.0%
-        29503  48.0%
&           19  < 0.1%
(           14  < 0.1%
)           14  < 0.1%

Most occurring blocks

Value   Count  Frequency (%)
ASCII  431695  100.0%

Most frequent character per block

ASCII
Value              Count  Frequency (%)
t                  88030  20.4%
e                  59820  13.9%
(space)            31978  7.4%
a                  31774  7.4%
i                  31372  7.3%
n                  30568  7.1%
d                  29801  6.9%
-                  29503  6.8%
s                  29416  6.8%
S                  29396  6.8%
Other values (35)  40037  9.3%

Interactions
