Running The First Program – Frequency Distribution





As mentioned in the last post, I am analyzing the GapMinder dataset. I’ve selected five variables of interest: life expectancy, breast cancer per 100,000, suicide per 100,000, internet use rate, and employment rate. The first three variables do not need to be normalized, while the last two do, so their frequencies are reported as proportions of all rows rather than raw counts. The code below loads the data, builds two lists of column names (one to normalize, one to leave as counts), and then prints a frequency distribution for each column along with its five most common values. Jump to the bottom to see the analysis of blank values and the most common values.

import pandas as pd

# load data
dataframe = pd.read_csv("gapminder.csv")

#print(dataframe.head())
#print(dataframe.isnull().values.any())

# split into two lists: columns reported as raw counts,
# and columns reported as proportions (normalized)
cols = ['lifeexpectancy', 'breastcancerper100th', 'suicideper100th']

norm_cols = ['internetuserate', 'employrate']

# print frequency distribution and 5 most common occurrences
def freq_dist(dataframe, cols, norm_cols):

    for col in cols:
        print("Freq dist for: {}".format(col))

        # raw counts, not sorted by frequency
        count = dataframe[col].value_counts(sort=False)
        # value_counts() sorts by frequency, so the first five
        # index entries are the five most common values
        top_5 = dataframe[col].value_counts()[:5].index.tolist()
        print(count)
        print("Top 5 most common:")
        print(top_5)
        print("-----")

    for col in norm_cols:
        print("Freq dist for: {}".format(col))

        # proportions of all rows instead of raw counts
        count = dataframe[col].value_counts(sort=False, normalize=True)
        top_5 = dataframe[col].value_counts()[:5].index.tolist()
        print(count)
        print("Top 5 most common:")
        print(top_5)
        print("-----")


freq_dist(dataframe, cols, norm_cols)
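To make the two kinds of frequency distribution concrete, here is a minimal sketch on a tiny made-up Series (the values are invented for illustration and are not from gapminder.csv). It shows raw counts versus normalized proportions, and how many entries a slice of four returns compared with asking for five.

```python
import pandas as pd

# Made-up values for illustration only; not taken from gapminder.csv.
toy = pd.Series(['a', 'b', 'a', 'c', 'a', 'b', 'd', 'e', 'f'])

# Raw counts, most frequent value first.
counts = toy.value_counts()

# Same distribution expressed as fractions of all rows.
proportions = toy.value_counts(normalize=True)

# A slice of [:4] keeps four entries; head(5) keeps five.
top_4 = toy.value_counts()[:4].index.tolist()
top_5 = toy.value_counts().head(5).index.tolist()

print(counts)
print(proportions)
print(len(top_4), len(top_5))
```

The proportions always sum to 1, which is what makes them easier to compare across columns than raw counts.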

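Here are the results of running the frequency distributions on the GapMinder data. Each "Top 5" list in this output contains only four values, because the run that produced it sliced the first four entries:

Freq dist for: lifeexpectancy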
62.465 1
73.127 1
76.918 1
62.475 1
79.839 1
67.484 1
55.442 1
73.126 1
81.907 1
51.61 1
56.081 1
22
74.522 1
58.199 1
79.158 1
72.64 1
75.62 1
75.181 1
76.142 1
67.529 1
74.221 1
76.64 1
51.444 1
73.371 1
57.937 1
48.397 1
79.977 1
79.915 1
68.978 1
50.411 1
..
57.379 1
64.666 1
49.025 1
72.283 1
77.005 1
68.749 1
74.573 1
80.17 1
55.439 1
81.539 1
75.246 1
79.311 1
54.116 1
68.846 1
77.685 1
79.499 1
72.15 1
75.901 1
65.438 1
80.642 1
73.403 1
73.99 1
68.823 1
73.488 1
58.582 1
81.618 1
72.231 1
81.097 1
51.879 1
73.979 2
Name: lifeexpectancy, Length: 190, dtype: int64
Top 5 most common:

[' ', '72.974', '73.979', '73.911']

Freq dist for: breastcancerper100th
50.9 1
25.2 1
31.7 1
18.3 1
48 1
84.3 1
19.6 1
17.9 1
21.5 1
30 1
33.3 1
74.4 1
23.9 1
24 1
24.1 1
20.4 2
38.7 1
46.2 1
12.3 1
38.8 1
29.7 1
91.9 2
34.3 1
62.5 1
31.2 3
43.9 1
33.4 1
29 1
88.7 1
62.1 1
..
51.8 1
55.5 1
82.5 1
52.5 1
67.2 1
51.1 1
26.4 1
23 1
30.9 1
86.7 1
73.9 1
23.1 1
23.5 2
19.5 6
50.4 2
101.1 1
90 1
18.4 1
52.1 1
44.8 1
49.6 1
63 1
43.5 1
29.8 2
19 1
13.6 1
50.1 1
13.2 2
17.1 1
30.6 1
Name: breastcancerper100th, Length: 137, dtype: int64
Top 5 most common:

[' ', '28.1', '19.5', '24.7']

Freq dist for: suicideper100th
6.02188205718994 1
6.26578903198242 1
11.9804973602295 1
6.1052818998346 1
14.09153 1
6.44915676116943 1
2.0341784954071 1
7.765584 1
5.83525085449219 1
4.409532 1
14.77625 1
11.11583 1
35.752872467041 1
11.9569406509399 1
22
3.94025897979736 1
8.21094799041748 1
4.37336492538452 1
20.3179302215576 1
9.127511 1
14.4696483612061 1
9.211085 1
11.2139701843262 1
11.6533222198486 1
3.74158787727356 1
7.44382619857788 1
5.36217880249023 1
10.0719423294067 1
10.1149969100952 1
3.10860252380371 1
..
1.65890777111053 1
4.8487696647644 1
9.6331148147583 1
7.69932985305786 1
12.2167692184448 1
8.28307056427002 1
1.37000155448914 1
10.823 1
12.17976 1
7.87687826156616 1
.20144872367382 1
11.4261808395386 1
12.8698148727417 1
9.927033 1
6.68438529968262 1
20.16201 1
13.23981 1
7.06018447875976 1
5.931845 1
7.7450647354126 1
3.56332468986511 1
29.864164352417 1
9.70955562591553 1
15.5426025390625 1
4.7510838508606 1
4.52785158157349 1
5.55427646636963 1
4.41499042510986 1
6.3698878288269 1
13.1179485321045 1
Name: suicideper100th, Length: 192, dtype: int64
Top 5 most common:

[' ', '13.1179485321045', '1.57435011863708', '14.5546770095825']

Freq dist for: internetuserate
44.5853546903272 0.004695
47.2804360294118 0.004695
40.0200948796529 0.004695
62.8119000060846 0.004695
76.5875384615385 0.004695
90.0161900191939 0.004695
8.37020688381867 0.004695
31.0043782824698 0.004695
77.6385351546869 0.004695
0.098592
11.0000554403336 0.004695
42.7478120557293 0.004695
7.0002138207311 0.004695
12.5002554302298 0.004695
44.9899469578783 0.004695
69.3399707174231 0.004695
12.3348932627873 0.004695
47.8674686327078 0.004695
81.3383926859286 0.004695
2.25997588483994 0.004695
82.166659877332 0.004695
14.0002467348544 0.004695
15.9996499919575 0.004695
15.8999820280962 0.004695
95.6381132075472 0.004695
42.9845801749271 0.004695
29.999939516129 0.004695
2.10021270579814 0.004695
1.25993360916614 0.004695
38.2602335526316 0.004695
..
2.45036224422442 0.004695
90.0795266272189 0.004695
1.4000606995385 0.004695
20.0017101420083 0.004695
3.70000325975843 0.004695
25.8997967022931 0.004695
14.8307358837209 0.004695
61.987412863816 0.004695
51.2804783981952 0.004695
2.99980317919075 0.004695
65.3877859391396 0.004695
39.8201778851441 0.004695
40.7728505747126 0.004695
65.163250915 0.004695
16.7800370218845 0.004695
77.9967811501598 0.004695
9.00773590909091 0.004695
80 0.004695
73.7339344713656 0.004695
12.3497504635596 0.004695
.99995892606692 0.004695
77.4986193497188 0.004695
6.00343714285714 0.004695
49.0006318425088 0.004695
31.050012866438 0.004695
82.526897905279 0.004695
9.99995388324075 0.004695
83.0025842490842 0.004695
2.19999781832606 0.004695
36.5625529623661 0.004695
Name: internetuserate, Length: 193, dtype: float64
Top 5 most common:

[' ', '36.5625529623661', '36.4991147241897', '15.8999703410908']

Freq dist for: employrate
78.1999969482422 0.009390
57.5 0.009390
50.7000007629394 0.004695
48.7000007629394 0.009390
44.2999992370606 0.004695
47.0999984741211 0.004695
58.2000007629394 0.009390
55.9000015258789 0.014085
58.5 0.004695
52.0999984741211 0.004695
59.9000015258789 0.014085
0.164319
46.4000015258789 0.004695
58.4000015258789 0.009390
66.8000030517578 0.004695
78.9000015258789 0.004695
68.0999984741211 0.004695
65.0999984741211 0.004695
56 0.009390
57.2000007629394 0.004695
81.3000030517578 0.004695
50.5 0.004695
56.9000015258789 0.004695
41.5999984741211 0.004695
83 0.004695
42 0.004695
49.5 0.004695
39 0.004695
63.7999992370606 0.009390
63.7000007629394 0.004695
..
54.4000015258789 0.004695
56.2999992370606 0.009390
63.5 0.004695
42.7999992370606 0.004695
57.2999992370606 0.004695
46.7999992370606 0.004695
43.0999984741211 0.004695
80.6999969482422 0.004695
62.7000007629394 0.004695
65.6999969482422 0.004695
56.4000015258789 0.004695
72 0.004695
53.5 0.014085
71.6999969482422 0.004695
71.8000030517578 0.004695
41.0999984741211 0.004695
74.6999969482422 0.004695
51 0.009390
57.9000015258789 0.004695
45.7000007629394 0.004695
63.5999984741211 0.004695
71 0.004695
59.7999992370606 0.004695
63.0999984741211 0.004695
42.5 0.004695
83.1999969482422 0.009390
53.4000015258789 0.009390
76 0.004695
65.9000015258789 0.004695
46 0.009390
Name: employrate, Length: 140, dtype: float64
Top 5 most common:
[' ', '58.9000015258789', '55.9000015258789', '59.9000015258789']

Here are the most common values for each variable:

Freq dist for: lifeexpectancy
Top 5 most common:

[' ', '73.979', '72.974', '77.653']

Freq dist for: breastcancerper100th
Top 5 most common:

[' ', '28.1', '19.5', '24.7']

Freq dist for: suicideper100th
Top 5 most common:

[' ', '12.8722219467163', '3.1468141078949', '7.7450647354126']

Freq dist for: internetuserate
Top 5 most common:

[' ', '11.0907646263158', '82.526897905279', '42.7478120557293']

Freq dist for: employrate
Top 5 most common:
[' ', '65', '55.9000015258789', '61.5']

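Notice that blank is the most common value for all of them, which means quite a few cells contain nothing but a blank string. In the full distributions above, the rows that show a count or proportion with no value in front of it (for example, the 22 under lifeexpectancy) are these blank entries. Finally, the code below counts the blank cells in each column. Blank does not mean missing: a blank cell holds the string ' ' rather than NaN, so isnull() does not flag it.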

# count cells that contain only a single blank character
for col in cols:
    print((dataframe[col].values == ' ').sum())

for col in norm_cols:
    print((dataframe[col].values == ' ').sum())

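Every column has blank values. The counts below are in the same order as the loops: lifeexpectancy, breastcancerper100th, suicideper100th, internetuserate, employrate.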

22
40
22
21
35
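Because a blank cell is the string ' ' rather than a missing value, the affected columns are read as text, which is why the values in the lists above print in quotes. One possible way to deal with this, which is an assumption on my part and not something done in the code above, is to convert each column with pd.to_numeric(..., errors='coerce') so that blanks become NaN. The small frame below is made up for illustration; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Made-up stand-in for two GapMinder columns; ' ' marks a blank cell.
toy = pd.DataFrame({
    'lifeexpectancy': ['62.465', ' ', '73.979', '73.979'],
    'employrate': [' ', '57.5', ' ', '56'],
})

# Blank strings are not treated as missing, so this reports zero everywhere.
print(toy.isnull().sum())

# Count blanks the same way as the check above.
blank_counts = {col: int((toy[col].values == ' ').sum()) for col in toy.columns}
print(blank_counts)

# Coerce to numbers: anything that cannot be parsed, including ' ', becomes NaN.
numeric = toy.apply(pd.to_numeric, errors='coerce')
print(numeric.isnull().sum())

# value_counts() skips NaN by default, so blanks no longer top the list.
print(numeric['lifeexpectancy'].value_counts())
```

After the conversion, the blank entries show up in isnull() counts and drop out of the frequency distribution unless value_counts is called with dropna=False.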
