Overview

Dataset statistics

Number of variables: 3
Number of observations: 1000
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 23.6 KiB
Average record size in memory: 24.1 B

Variable types

Categorical: 3

Alerts

russian has a high cardinality: 995 distinct values (High cardinality)
english has a high cardinality: 961 distinct values (High cardinality)
russian is uniformly distributed (Uniform)
english is uniformly distributed (Uniform)

Reproduction

Analysis started: 2023-01-25 14:18:56.233239
Analysis finished: 2023-01-25 14:18:56.759987
Duration: 0.53 seconds
Software version: pandas-profiling v0.0.dev0
Download configuration: config.json

Variables

russian
Categorical

HIGH CARDINALITY  UNIFORM 

Distinct: 995
Distinct (%): 99.5%
Missing: 0
Missing (%): 0.0%
Memory size: 7.9 KiB

Value | Count
знать | 2
пора | 2
много | 2
что | 2
мало | 2
Other values (990) | 990

Length

Max length: 19
Median length: 13
Mean length: 6.117
Min length: 1
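
The length figures summarise the number of characters in each value. As a minimal sketch of how such statistics can be derived, the hypothetical helper `length_stats` below uses Python's `statistics` module on a small illustrative list; it is an assumption about the general approach, not the profiling tool's own code.

```python
from statistics import mean, median

def length_stats(values):
    """Summarise string lengths; len() counts Unicode code points."""
    lengths = [len(v) for v in values]
    return {
        "max": max(lengths),
        "median": median(lengths),
        "mean": mean(lengths),
        "min": min(lengths),
    }

# Illustrative values only, not the full column.
sample = ["и", "в", "не", "он", "на"]
print(length_stats(sample))
```

Over the whole column, the mean length equals the total character count divided by the number of observations.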

Characters and Unicode

Total characters: 6117
Distinct characters: 43
Distinct categories: 7
Distinct scripts: 3
Distinct blocks: 2
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
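
As a rough illustration of that idea, the sketch below tallies the Unicode general category of every character in a list of strings using Python's standard `unicodedata` module. The function name `category_counts` and the sample values are illustrative assumptions, and the profiling tool may compute its figures differently.

```python
import unicodedata
from collections import Counter

def category_counts(values):
    """Count characters by Unicode general category code (e.g. 'Ll', 'Zs')."""
    counts = Counter()
    for value in values:
        for ch in value:
            counts[unicodedata.category(ch)] += 1
    return counts

# Illustrative values only, not the full column.
sample = ["знать", "пора", "что"]
print(category_counts(sample))
```

Here `Ll` is the code for Lowercase Letter and `Zs` the code for Space Separator, matching the category names used in the tables of this report.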

Unique

Unique: 990
Unique (%): 99.0%
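
The Distinct and Unique figures measure different things. The numbers in this report are consistent with distinct counting the different values present and unique counting the values that occur exactly once: 990 values seen once plus 5 values seen twice give 995 distinct values over 1000 rows. A minimal sketch under that assumption, using the hypothetical helper `distinct_and_unique`:

```python
from collections import Counter

def distinct_and_unique(values):
    """Return (number of different values, number of values occurring exactly once)."""
    counts = Counter(values)
    distinct = len(counts)
    unique = sum(1 for c in counts.values() if c == 1)
    return distinct, unique

# Toy column with the same shape: 990 values seen once, 5 values seen twice.
column = [f"w{i}" for i in range(990)] + [f"d{i}" for i in range(5)] * 2
print(len(column), distinct_and_unique(column))
```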

Sample

1st row: и
2nd row: в
3rd row: не
4th row: он
5th row: на

Common Values

Value | Count | Frequency (%)
знать | 2 | 0.2%
пора | 2 | 0.2%
много | 2 | 0.2%
что | 2 | 0.2%
мало | 2 | 0.2%
сорок | 1 | 0.1%
привести | 1 | 0.1%
оказываться | 1 | 0.1%
служба | 1 | 0.1%
лейтенант | 1 | 0.1%
Other values (985) | 985 | 98.5%

Length

[Figure: Histogram of lengths of the category]

Words

Value | Count | Frequency (%)
знать | 2 | 0.2%
много | 2 | 0.2%
что | 2 | 0.2%
мало | 2 | 0.2%
пора | 2 | 0.2%
тот | 1 | 0.1%
это | 1 | 0.1%
весь | 1 | 0.1%
а | 1 | 0.1%
с | 1 | 0.1%
Other values (987) | 987 | 98.5%

Most occurring characters

Value | Count | Frequency (%)
о | 645 | 10.5%
т | 526 | 8.6%
а | 484 | 7.9%
е | 395 | 6.5%
с | 364 | 6.0%
и | 345 | 5.6%
н | 339 | 5.5%
ь | 316 | 5.2%
р | 306 | 5.0%
в | 263 | 4.3%
Other values (33) | 2134 | 34.9%

Most occurring categories

Value | Count | Frequency (%)
Lowercase Letter | 6106 | 99.8%
Decimal Number | 3 | < 0.1%
Uppercase Letter | 3 | < 0.1%
Space Separator | 2 | < 0.1%
Open Punctuation | 1 | < 0.1%
Other Punctuation | 1 | < 0.1%
Close Punctuation | 1 | < 0.1%

Most frequent character per category

Lowercase Letter
Value | Count | Frequency (%)
о | 645 | 10.6%
т | 526 | 8.6%
а | 484 | 7.9%
е | 395 | 6.5%
с | 364 | 6.0%
и | 345 | 5.7%
н | 339 | 5.6%
ь | 316 | 5.2%
р | 306 | 5.0%
в | 263 | 4.3%
Other values (24) | 2123 | 34.8%

Uppercase Letter
Value | Count | Frequency (%)
S | 1 | 33.3%
М | 1 | 33.3%
Р | 1 | 33.3%

Decimal Number
Value | Count | Frequency (%)
6 | 2 | 66.7%
3 | 1 | 33.3%

Space Separator
Value | Count | Frequency (%)
(space) | 2 | 100.0%

Open Punctuation
Value | Count | Frequency (%)
( | 1 | 100.0%

Other Punctuation
Value | Count | Frequency (%)
# | 1 | 100.0%

Close Punctuation
Value | Count | Frequency (%)
) | 1 | 100.0%

Most occurring scripts

Value | Count | Frequency (%)
Cyrillic | 6106 | 99.8%
Common | 8 | 0.1%
Latin | 3 | < 0.1%

Most frequent character per script

Cyrillic
Value | Count | Frequency (%)
о | 645 | 10.6%
т | 526 | 8.6%
а | 484 | 7.9%
е | 395 | 6.5%
с | 364 | 6.0%
и | 345 | 5.7%
н | 339 | 5.6%
ь | 316 | 5.2%
р | 306 | 5.0%
в | 263 | 4.3%
Other values (25) | 2123 | 34.8%

Common
Value | Count | Frequency (%)
(space) | 2 | 25.0%
6 | 2 | 25.0%
( | 1 | 12.5%
# | 1 | 12.5%
3 | 1 | 12.5%
) | 1 | 12.5%

Latin
Value | Count | Frequency (%)
e | 2 | 66.7%
S | 1 | 33.3%

Most occurring blocks

Value | Count | Frequency (%)
Cyrillic | 6106 | 99.8%
ASCII | 11 | 0.2%

Most frequent character per block

Cyrillic
Value | Count | Frequency (%)
о | 645 | 10.6%
т | 526 | 8.6%
а | 484 | 7.9%
е | 395 | 6.5%
с | 364 | 6.0%
и | 345 | 5.7%
н | 339 | 5.6%
ь | 316 | 5.2%
р | 306 | 5.0%
в | 263 | 4.3%
Other values (25) | 2123 | 34.8%

ASCII
Value | Count | Frequency (%)
(space) | 2 | 18.2%
e | 2 | 18.2%
6 | 2 | 18.2%
( | 1 | 9.1%
S | 1 | 9.1%
# | 1 | 9.1%
3 | 1 | 9.1%
) | 1 | 9.1%

english
Categorical

HIGH CARDINALITY  UNIFORM 

Distinct: 961
Distinct (%): 96.1%
Missing: 0
Missing (%): 0.0%
Memory size: 7.9 KiB

Value | Count
to ask | 3
to fit, fall; have to | 3
again | 2
long | 2
or | 2
Other values (956) | 988

Length

Max length: 130
Median length: 39
Mean length: 13.169
Min length: 1

Characters and Unicode

Total characters: 13169
Distinct characters: 75
Distinct categories: 9
Distinct scripts: 4
Distinct blocks: 4

Unique

Unique: 924
Unique (%): 92.4%

Sample

1st row: and, though
2nd row: in, at
3rd row: not
4th row: he
5th row: on, it, at, to

Common Values

Value | Count | Frequency (%)
to ask | 3 | 0.3%
to fit, fall; have to | 3 | 0.3%
again | 2 | 0.2%
long | 2 | 0.2%
or | 2 | 0.2%
to see | 2 | 0.2%
this | 2 | 0.2%
to return | 2 | 0.2%
also, as well, too | 2 | 0.2%
to enter, come in | 2 | 0.2%
Other values (951) | 978 | 97.8%

Length

[Figure: Histogram of lengths of the category]

Words

Value | Count | Frequency (%)
to | 256 | 11.0%
see | 33 | 1.4%
be | 20 | 0.9%
in | 20 | 0.9%
as | 18 | 0.8%
for | 16 | 0.7%
come | 15 | 0.6%
of | 14 | 0.6%
the | 13 | 0.6%
by | 12 | 0.5%
Other values (1240) | 1914 | 82.1%

Most occurring characters

Value | Count | Frequency (%)
e | 1370 | 10.4%
(space) | 1334 | 10.1%
t | 1048 | 8.0%
o | 1021 | 7.8%
a | 788 | 6.0%
r | 730 | 5.5%
, | 673 | 5.1%
n | 656 | 5.0%
i | 616 | 4.7%
s | 611 | 4.6%
Other values (65) | 4322 | 32.8%

Most occurring categories

Value | Count | Frequency (%)
Lowercase Letter | 10588 | 80.4%
Space Separator | 1334 | 10.1%
Other Punctuation | 834 | 6.3%
Decimal Number | 191 | 1.5%
Close Punctuation | 75 | 0.6%
Open Punctuation | 75 | 0.6%
Uppercase Letter | 59 | 0.4%
Dash Punctuation | 8 | 0.1%
Nonspacing Mark | 5 | < 0.1%

Most frequent character per category

Lowercase Letter
Value | Count | Frequency (%)
e | 1370 | 12.9%
t | 1048 | 9.9%
o | 1021 | 9.6%
a | 788 | 7.4%
r | 730 | 6.9%
n | 656 | 6.2%
i | 616 | 5.8%
s | 611 | 5.8%
l | 534 | 5.0%
h | 350 | 3.3%
Other values (34) | 2864 | 27.0%

Other Punctuation
Value | Count | Frequency (%)
, | 673 | 80.7%
; | 67 | 8.0%
# | 65 | 7.8%
' | 11 | 1.3%
" | 6 | 0.7%
(character not shown) | 4 | 0.5%
. | 3 | 0.4%
! | 2 | 0.2%
? | 2 | 0.2%
: | 1 | 0.1%

Decimal Number
Value | Count | Frequency (%)
3 | 28 | 14.7%
1 | 24 | 12.6%
9 | 24 | 12.6%
6 | 20 | 10.5%
4 | 20 | 10.5%
7 | 20 | 10.5%
5 | 18 | 9.4%
8 | 15 | 7.9%
2 | 13 | 6.8%
0 | 9 | 4.7%

Uppercase Letter
Value | Count | Frequency (%)
S | 49 | 83.1%
R | 3 | 5.1%
M | 3 | 5.1%
G | 2 | 3.4%
A | 1 | 1.7%
I | 1 | 1.7%

Space Separator
Value | Count | Frequency (%)
(space) | 1334 | 100.0%

Close Punctuation
Value | Count | Frequency (%)
) | 75 | 100.0%

Open Punctuation
Value | Count | Frequency (%)
( | 75 | 100.0%

Dash Punctuation
Value | Count | Frequency (%)
- | 8 | 100.0%

Nonspacing Mark
Value | Count | Frequency (%)
́ | 5 | 100.0%

Most occurring scripts

Value | Count | Frequency (%)
Latin | 10604 | 80.5%
Common | 2517 | 19.1%
Cyrillic | 43 | 0.3%
Inherited | 5 | < 0.1%

Most frequent character per script

Latin
Value | Count | Frequency (%)
e | 1370 | 12.9%
t | 1048 | 9.9%
o | 1021 | 9.6%
a | 788 | 7.4%
r | 730 | 6.9%
n | 656 | 6.2%
i | 616 | 5.8%
s | 611 | 5.8%
l | 534 | 5.0%
h | 350 | 3.3%
Other values (22) | 2880 | 27.2%

Common
Value | Count | Frequency (%)
(space) | 1334 | 53.0%
, | 673 | 26.7%
) | 75 | 3.0%
( | 75 | 3.0%
; | 67 | 2.7%
# | 65 | 2.6%
3 | 28 | 1.1%
1 | 24 | 1.0%
9 | 24 | 1.0%
6 | 20 | 0.8%
Other values (14) | 132 | 5.2%

Cyrillic
Value | Count | Frequency (%)
о | 6 | 14.0%
к | 5 | 11.6%
и | 5 | 11.6%
н | 3 | 7.0%
а | 3 | 7.0%
м | 3 | 7.0%
е | 3 | 7.0%
в | 3 | 7.0%
р | 2 | 4.7%
ч | 2 | 4.7%
Other values (8) | 8 | 18.6%

Inherited
Value | Count | Frequency (%)
́ | 5 | 100.0%

Most occurring blocks

Value | Count | Frequency (%)
ASCII | 13117 | 99.6%
Cyrillic | 43 | 0.3%
Diacriticals | 5 | < 0.1%
Punctuation | 4 | < 0.1%

Most frequent character per block

ASCII
Value | Count | Frequency (%)
e | 1370 | 10.4%
(space) | 1334 | 10.2%
t | 1048 | 8.0%
o | 1021 | 7.8%
a | 788 | 6.0%
r | 730 | 5.6%
, | 673 | 5.1%
n | 656 | 5.0%
i | 616 | 4.7%
s | 611 | 4.7%
Other values (45) | 4270 | 32.6%

Cyrillic
Value | Count | Frequency (%)
о | 6 | 14.0%
к | 5 | 11.6%
и | 5 | 11.6%
н | 3 | 7.0%
а | 3 | 7.0%
м | 3 | 7.0%
е | 3 | 7.0%
в | 3 | 7.0%
р | 2 | 4.7%
ч | 2 | 4.7%
Other values (8) | 8 | 18.6%

Diacriticals
Value | Count | Frequency (%)
́ | 5 | 100.0%

Punctuation
Value | Count | Frequency (%)
(character not shown) | 4 | 100.0%

part of speech
Categorical

Distinct: 37
Distinct (%): 3.7%
Missing: 0
Missing (%): 0.0%
Memory size: 7.9 KiB

Value | Count
noun | 374
verb | 232
adjective | 127
adverb | 112
preposition | 37
Other values (32) | 118

Length

Max length: 26
Median length: 4
Mean length: 5.885
Min length: 3

Characters and Unicode

Total characters: 5885
Distinct characters: 24
Distinct categories: 5
Distinct scripts: 3
Distinct blocks: 2

Unique

Unique: 20
Unique (%): 2.0%

Sample

1st row: conjunction
2nd row: preposition
3rd row: particle
4th row: pronoun
5th row: preposition

Common Values

Value | Count | Frequency (%)
noun | 374 | 37.4%
verb | 232 | 23.2%
adjective | 127 | 12.7%
adverb | 112 | 11.2%
preposition | 37 | 3.7%
pronoun | 36 | 3.6%
misc | 12 | 1.2%
conjunction | 12 | 1.2%
cardinal number | 11 | 1.1%
particle | 7 | 0.7%
Other values (27) | 40 | 4.0%

Length

[Figure: Histogram of lengths of the category]

Words

Value | Count | Frequency (%)
noun | 378 | 36.0%
verb | 234 | 22.3%
adjective | 129 | 12.3%
adverb | 118 | 11.2%
pronoun | 40 | 3.8%
preposition | 39 | 3.7%
particle | 19 | 1.8%
number | 18 | 1.7%
cardinal | 16 | 1.5%
conjunction | 15 | 1.4%
Other values (15) | 43 | 4.1%

Most occurring characters

Value | Count | Frequency (%)
n | 984 | 16.7%
e | 698 | 11.9%
o | 588 | 10.0%
r | 497 | 8.4%
v | 481 | 8.2%
u | 456 | 7.7%
b | 373 | 6.3%
a | 309 | 5.3%
i | 283 | 4.8%
d | 268 | 4.6%
Other values (14) | 948 | 16.1%

Most occurring categories

Value | Count | Frequency (%)
Lowercase Letter | 5806 | 98.7%
Space Separator | 51 | 0.9%
Other Punctuation | 26 | 0.4%
Open Punctuation | 1 | < 0.1%
Close Punctuation | 1 | < 0.1%

Most frequent character per category

Lowercase Letter
Value | Count | Frequency (%)
n | 984 | 16.9%
e | 698 | 12.0%
o | 588 | 10.1%
r | 497 | 8.6%
v | 481 | 8.3%
u | 456 | 7.9%
b | 373 | 6.4%
a | 309 | 5.3%
i | 283 | 4.9%
d | 268 | 4.6%
Other values (10) | 869 | 15.0%

Space Separator
Value | Count | Frequency (%)
(space) | 51 | 100.0%

Other Punctuation
Value | Count | Frequency (%)
, | 26 | 100.0%

Open Punctuation
Value | Count | Frequency (%)
( | 1 | 100.0%

Close Punctuation
Value | Count | Frequency (%)
) | 1 | 100.0%

Most occurring scripts

Value | Count | Frequency (%)
Latin | 5805 | 98.6%
Common | 79 | 1.3%
Cyrillic | 1 | < 0.1%

Most frequent character per script

Latin
Value | Count | Frequency (%)
n | 984 | 17.0%
e | 698 | 12.0%
o | 588 | 10.1%
r | 497 | 8.6%
v | 481 | 8.3%
u | 456 | 7.9%
b | 373 | 6.4%
a | 309 | 5.3%
i | 283 | 4.9%
d | 268 | 4.6%
Other values (9) | 868 | 15.0%

Common
Value | Count | Frequency (%)
(space) | 51 | 64.6%
, | 26 | 32.9%
( | 1 | 1.3%
) | 1 | 1.3%

Cyrillic
Value | Count | Frequency (%)
с | 1 | 100.0%

Most occurring blocks

Value | Count | Frequency (%)
ASCII | 5884 | > 99.9%
Cyrillic | 1 | < 0.1%

Most frequent character per block

ASCII
Value | Count | Frequency (%)
n | 984 | 16.7%
e | 698 | 11.9%
o | 588 | 10.0%
r | 497 | 8.4%
v | 481 | 8.2%
u | 456 | 7.7%
b | 373 | 6.3%
a | 309 | 5.3%
i | 283 | 4.8%
d | 268 | 4.6%
Other values (13) | 947 | 16.1%

Cyrillic
Value | Count | Frequency (%)
с | 1 | 100.0%
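
A single Cyrillic character in an otherwise Latin column usually points to a look-alike letter, such as a Cyrillic "с" standing in for a Latin "c". The sketch below is one hedged way to flag such values. Python's standard library has no Script property, so the hypothetical helper `letter_scripts` approximates the script by taking the first word of each letter's Unicode name; that heuristic is an assumption, not the method used by the profiling tool.

```python
import unicodedata

def letter_scripts(text):
    """Return the set of script-like name prefixes for the letters in text.

    Heuristic: the first word of a letter's Unicode name, e.g.
    'LATIN' from 'LATIN SMALL LETTER O'.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

# "\u0441" is CYRILLIC SMALL LETTER ES, which resembles Latin "c".
value = "\u0441onjunction"
mixed = len(letter_scripts(value)) > 1
print(sorted(letter_scripts(value)), mixed)
```

Values for which the returned set has more than one element are candidates for manual review.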

Missing values

[Figure] A simple visualization of nullity by column.
[Figure] The nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion.
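
The per-column counts behind a nullity chart can be sketched in a few lines. The helper `non_missing_by_column` below is hypothetical and treats `None` as the only missing marker, which is an assumption; the rows are a small illustrative subset of the dataset.

```python
def non_missing_by_column(rows, columns):
    """Count, for each column, the rows whose value is not None."""
    return {
        col: sum(1 for row in rows if row.get(col) is not None)
        for col in columns
    }

# Illustrative subset of the data.
rows = [
    {"russian": "и", "english": "and, though", "part of speech": "conjunction"},
    {"russian": "в", "english": "in, at", "part of speech": "preposition"},
]
columns = ["russian", "english", "part of speech"]
print(non_missing_by_column(rows, columns))
```

With no missing cells, as reported for this dataset, every column's count equals the number of rows.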

Sample

First rows

index | russian | english | part of speech
0 | и | and, though | conjunction
1 | в | in, at | preposition
2 | не | not | particle
3 | он | he | pronoun
4 | на | on, it, at, to | preposition
5 | я | I | pronoun
6 | что | what, that, why | сonjunction, pronoun
7 | тот | that | adjective, pronoun
8 | быть | to be | verb
9 | с | with, and, from, of | preposition

Last rows

index | russian | english | part of speech
990 | художник | painter, artist | noun
991 | знак | sign | noun
992 | завод | factory | noun
993 | кулак | fist | noun
994 | использовать | to use, utilize, make use of | verb
995 | стакан | glass | noun
996 | пахнуть | to smell | verb
997 | отсюда | from here | adverb
998 | рот | mouth | noun
999 | пора | it's time;at times, now and then(See #279) | misc