Generating data in Python with Pandas

Last updated on Feb 6, 2021

I’ve found myself with another free morning to practice Python. One of the things I’ve always heard about learning any new skill is that you should leave as long as possible after you learn anything before you practice it again. Helps it to settle. With that in mind, I’m going to keep on practicing with Pandas. This time I’ll be looking into how to 1) generate random numbers and 2) put them into a dataframe. That way next time I can start with a new dataframe to play around with.

import pandas as pd
import numpy as np

The first step is deciding on the variables I want. I’m imagining a survey of voters. I think I’ll go for a uniform distribution of region (meaning every region has equal probability). A random normal distribution of age. A Bernoulli trial for political party. Finally another normal distribution of income. If I can do all of this without too much trouble then I’ll try to make it so the different political parties have different mean income levels. Although that sounds difficult.

I’ll start with regions because that should be the easiest.

np.random.seed(12345)
regions = ["North", "South", "East", "West"]
region = np.random.choice(regions, replace = True, size = 10000)
print(region[1:10])  # elements 1 to 9; index 0 is skipped

['South' 'South' 'South' 'North' 'South' 'East' 'East' 'South' 'East']

This looks good to me: 10,000 draws from the four regions. I’ll check how often each one appeared. You can plot a bar chart of the counts from a pandas series. I don’t know if you can then add a pandas series to a dataframe. I assume so.

region = pd.Series(region)
region.value_counts().plot(kind = "bar", color='#36b33a')

<AxesSubplot:>

[Figure: bar chart of the number of observations in each region]

This looks pretty good. They all have around the same number of observations. Now to add it to the dataframe.

df = pd.DataFrame({"region": region})
df

	region
0	East
1	South
2	South
3	South
4	North
...	...
9995	West
9996	East
9997	North
9998	West
9999	East

10000 rows × 1 columns
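As a numeric complement to the bar chart, the shares can be checked directly. This is only a sketch: it rebuilds the region series from the same seed, and `region_shares` is a name made up for the example.

```python
import numpy as np
import pandas as pd

np.random.seed(12345)
regions = ["North", "South", "East", "West"]
region = pd.Series(np.random.choice(regions, replace=True, size=10000))

# Share of observations in each region. With equal probabilities,
# each share should sit close to 1 / 4 = 0.25.
region_shares = region.value_counts(normalize=True)
print(region_shares)
```

`normalize=True` turns the counts into proportions, which is easier to compare against the 0.25 you would expect from a uniform draw.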

Not bad. Next I’ll make the first of the normal distributions: age.

age = np.random.normal(loc = 45, scale = 15, size = 10000)
age = pd.Series(age)
age.plot.hist(color='#36b33a', bins = 20)

<AxesSubplot:ylabel='Frequency'>

[Figure: histogram of age, 20 bins]

Not great. It’s definitely normally distributed, but I want it to start at 18, and I really don’t want anyone to have a negative age. I think it might be worth using a for loop to replace all the values below 18 with another random number.

for i in range(len(age)):
    # Redraw any age below 18 until it is at least 18
    while age[i] < 18:
        age[i] = np.random.normal(loc = 45, scale = 15)

And check to see if that worked.

min(age)

18.006346512121425

The minimum is good.

age.plot.hist(color="#36b33a", bins = 20)

<AxesSubplot:ylabel='Frequency'>

[Figure: histogram of age after redrawing values below 18, 20 bins]

The histogram is also decent. Looks like what you would expect.

len(age)

And the length is the same. It could be worth defining a function that does this, because I’ll have to do the same with income in a minute. I’ll come back to that. Next, though: age is a bit too specific, so I’ll round the numbers.

age

0       61.195320
1       69.635091
2       43.444925
3       31.755761
4       42.039641
          ...    
9995    38.561477
9996    63.562329
9997    35.922581
9998    34.018789
9999    45.871190
Length: 10000, dtype: float64

age = round(age)
age

0       61.0
1       70.0
2       43.0
3       32.0
4       42.0
        ... 
9995    39.0
9996    64.0
9997    36.0
9998    34.0
9999    46.0
Length: 10000, dtype: float64
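The rounded ages are still stored as floats (61.0 rather than 61). If whole numbers are preferred, one option is to cast after rounding. A small sketch using the first three ages from the output above; `age_example` and `age_whole` are names invented for the illustration.

```python
import pandas as pd

# First three unrounded ages from the series shown earlier
age_example = pd.Series([61.195320, 69.635091, 43.444925])

# Round to the nearest whole number, then store as integers
age_whole = age_example.round().astype(int)
print(age_whole.tolist())  # [61, 70, 43]
```

This is optional: the float version works just as well for everything that follows.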

I’m happy with this now. I’ll add it to the existing dataframe.

df["age"] = age
df

	region	age
0	East	61.0
1	South	70.0
2	South	43.0
3	South	32.0
4	North	42.0
...	...	...
9995	West	39.0
9996	East	64.0
9997	North	36.0
9998	West	34.0
9999	East	46.0

10000 rows × 2 columns
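Coming back to the idea of a reusable function for the redraw step: here is one possible sketch. The name `normal_with_minimum` and its parameters are made up for this example, and it redraws all the too-low values at once instead of looping row by row, so it will not reproduce exactly the same numbers as the loop above.

```python
import numpy as np

def normal_with_minimum(loc, scale, size, minimum):
    """Draw `size` normal values, redrawing any that fall below `minimum`."""
    values = np.random.normal(loc=loc, scale=scale, size=size)
    too_low = values < minimum
    while too_low.any():
        # Redraw only the offending entries, then check them again
        values[too_low] = np.random.normal(loc=loc, scale=scale, size=too_low.sum())
        too_low = values < minimum
    return values

np.random.seed(12345)
age_sketch = normal_with_minimum(loc=45, scale=15, size=10000, minimum=18)
income_sketch = normal_with_minimum(loc=40000, scale=4000, size=10000, minimum=0)
print(age_sketch.min(), income_sketch.min())
```

One thing to keep in mind is that throwing away the low values gives a truncated normal distribution, so the mean of the result ends up a little above the `loc` that was asked for.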

Bernoulli trials next. I’m imagining a circumstance where there are only two political parties so I can use binary values to represent them. In this case a 1 will mean a vote for the party of the tenants, and a 0 will mean a vote for the party of the landlords.

party = np.random.binomial(n = 1,size = 10000, p = 0.6)
party

array([0, 1, 1, ..., 1, 1, 1])

party = pd.Series(party)
party.value_counts().plot(kind = "bar", color="#36b33a")

<AxesSubplot:>

[Figure: bar chart of the number of voters for each party]

This looks like what I wanted. I specified that any one voter had a 0.6 probability of voting for the tenants, and it looks like around 6,000 out of the 10,000 did, which is what we would expect. Let’s add it to the dataframe.

df["party"] = party
df

	region	age	party
0	East	61.0	0
1	South	70.0	1
2	South	43.0	1
3	South	32.0	0
4	North	42.0	1
...	...	...	...
9995	West	39.0	0
9996	East	64.0	1
9997	North	36.0	1
9998	West	34.0	1
9999	East	46.0	1

10000 rows × 3 columns
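The 0.6 probability can also be checked with a single number instead of reading it off the chart. A sketch, assuming a freshly seeded series (`party_sketch` and `tenant_share` are names invented here), so the exact value will differ from the column above:

```python
import numpy as np
import pandas as pd

np.random.seed(12345)
party_sketch = pd.Series(np.random.binomial(n=1, p=0.6, size=10000))

# The values are only 0 and 1, so the mean is the share of 1s (tenant voters)
tenant_share = party_sketch.mean()
print(round(tenant_share, 3))
```

With 10,000 draws the share should land within a percentage point or two of 0.6.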

Now for the difficult bit. I want to generate income as two different normal distributions. One with a higher mean for the landlord voters. How do I do this? I’ll start by adding a placeholder column to the dataframe.

df["income"] = -99
df

	region	age	party	income
0	East	61.0	0	-99
1	South	70.0	1	-99
2	South	43.0	1	-99
3	South	32.0	0	-99
4	North	42.0	1	-99
...	...	...	...	...
9995	West	39.0	0	-99
9996	East	64.0	1	-99
9997	North	36.0	1	-99
9998	West	34.0	1	-99
9999	East	46.0	1	-99

10000 rows × 4 columns

Then I have to generate these two different normal distributions, keeping in mind that nobody can have a negative income (unlike in real life). A major disclaimer on this bit of code: I do not know how best to do this. For convenience I set the placeholder column to -99 so that a while loop redraws the random value for as long as income < 0. Looping over rows like this is probably unnecessary and is definitely slow. I’ll try to find a faster way to do this.

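df["income"] = df["income"].astype(float)  # the -99 placeholder is an integer column
for i in range(len(df)):
    # The -99 placeholder forces a first draw; negative draws are repeated
    while df.loc[i, "income"] < 0:
        if df.loc[i, "party"] == 0:
            df.loc[i, "income"] = np.random.normal(loc = 40000, scale = 4000)
        elif df.loc[i, "party"] == 1:
            df.loc[i, "income"] = np.random.normal(loc = 28000, scale = 6000)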

df

	region	age	party	income
0	East	61.0	0	41240.175985
1	South	70.0	1	30820.712654
2	South	43.0	1	35118.509739
3	South	32.0	0	40804.859910
4	North	42.0	1	32241.494249
...	...	...	...	...
9995	West	39.0	0	42027.557719
9996	East	64.0	1	25218.748753
9997	North	36.0	1	31797.389325
9998	West	34.0	1	27783.427853
9999	East	46.0	1	27967.796060

10000 rows × 4 columns
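On the faster way mentioned above: one possibility is to let NumPy draw every income in one call, picking the mean and standard deviation per row with `np.where`. This is a sketch under assumptions, not the method used for the table above. It uses a stand-in party series, the names `party_sketch` and `income_sketch` are made up, and the numbers it produces will differ from the loop.

```python
import numpy as np
import pandas as pd

np.random.seed(12345)
# Stand-in for the party column (1 = tenants, 0 = landlords)
party_sketch = pd.Series(np.random.binomial(n=1, p=0.6, size=10000))

# Per-row mean and standard deviation, chosen by party
loc = np.where(party_sketch == 0, 40000, 28000)
scale = np.where(party_sketch == 0, 4000, 6000)

# One draw per row; array-valued loc and scale are applied element by element
income_sketch = np.random.normal(loc=loc, scale=scale)

# Redraw any negative incomes, using each row's own parameters
negative = income_sketch < 0
while negative.any():
    income_sketch[negative] = np.random.normal(loc=loc[negative], scale=scale[negative])
    negative = income_sketch < 0

print(income_sketch[:5].round(2))
```

No placeholder column is needed here, because the first draw happens for every row at once and only the negative values are revisited.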

And rounding these to two decimal places:

df["income"] = df["income"].round(2)
df

	region	age	party	income
0	East	61.0	0	41240.18
1	South	70.0	1	30820.71
2	South	43.0	1	35118.51
3	South	32.0	0	40804.86
4	North	42.0	1	32241.49
...	...	...	...	...
9995	West	39.0	0	42027.56
9996	East	64.0	1	25218.75
9997	North	36.0	1	31797.39
9998	West	34.0	1	27783.43
9999	East	46.0	1	27967.80

10000 rows × 4 columns

This all looks good to me. I can look at the mean values of income for each party now, just to check. I can do this using a pivot table which works roughly the same as tapply() in R.

df.pivot_table(columns = "party", values = "income", aggfunc = "mean")

party	0	1
income	39965.636696	27913.368541
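For comparison, a `groupby` gets the same per-party means and is arguably the closer match to `tapply()`. A sketch on a tiny made-up frame (`toy` and its values are invented purely to show the shape of the call):

```python
import pandas as pd

# Tiny made-up frame, only to illustrate the call
toy = pd.DataFrame({
    "party": [0, 0, 1, 1],
    "income": [40000.0, 42000.0, 27000.0, 29000.0],
})

# Mean income within each party
mean_income = toy.groupby("party")["income"].mean()
print(mean_income)
```

The result is a series indexed by party, where the pivot table gives a one-row dataframe with a column per party.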

This looks almost exactly right. Now to check out the histograms. Unfortunately, just like with ggplot2 this will involve reshaping the dataframe.

df_wide = df.pivot(columns = "party", values = "income")
df_wide

party	0	1
0	41240.18	NaN
1	NaN	30820.71
2	NaN	35118.51
3	40804.86	NaN
4	NaN	32241.49
...	...	...
9995	42027.56	NaN
9996	NaN	25218.75
9997	NaN	31797.39
9998	NaN	27783.43
9999	NaN	27967.80

10000 rows × 2 columns

This now has the incomes for each party in a separate column, with NaN where a row belongs to the other party.

df_wide.plot.hist(bins=100, alpha=0.7, color=["#36b33a", "blue"])

<AxesSubplot:ylabel='Frequency'>

[Figure: overlaid histograms of income for party 0 and party 1, 100 bins]

That looks pretty reasonable to me. And a good place to stop.

Dr Greg Stride

Researcher