Plotting Distributions

Let us first learn how to read a file line by line in Python. Working line by line is fast and gives a lot of control over how each record is parsed. As an example we will use the data set:

http://lib.stat.cmu.edu/DASL/Datafiles/USTemperatures.html

The data gives the normal average January minimum temperature in degrees Fahrenheit with the latitude and longitude of 56 U.S. cities. (For each year from 1931 to 1960, the daily minimum temperatures in January were added together and divided by 31. Then, the averages for each year were averaged over the 30 years.)
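
Written out, the averaging described above is as follows, where $T_{y,d}$ (a symbol introduced here only for illustration) denotes the minimum temperature on day $d$ of January in year $y$:

$$
\bar{T} = \frac{1}{30} \sum_{y=1931}^{1960} \left( \frac{1}{31} \sum_{d=1}^{31} T_{y,d} \right)
$$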

In [1]:
# Import all the packages we will use
import numpy as np
import scipy as sc
import matplotlib.pyplot as plt 
In [2]:
with open('temp1.txt','r') as f:
    lines = f.readlines()
# No f.close() is needed: the with block closes the file automatically
In [3]:
print(lines)
['City\tJanTemp\tLat\tLong\n', 'Mobile, AL\t44\t31.2\t88.5\n', 'Montgomery, AL\t38\t32.9\t86.8\n', 'Phoenix, AZ\t35\t33.6\t112.5\n', 'Little Rock, AR\t31\t35.4\t92.8\n', 'Los Angeles, CA\t47\t34.3\t118.7\n', 'San Francisco, CA\t42\t38.4\t123.0\n', 'Denver, CO\t15\t40.7\t105.3\n', 'New Haven, CT\t22\t41.7\t73.4\n', 'Wilmington, DE\t26\t40.5\t76.3\n', 'Washington, DC\t30\t39.7\t77.5\n', 'Jacksonville, FL\t45\t31.0\t82.3\n', 'Key West, FL\t65\t25.0\t82.0\n', 'Miami, FL\t58\t26.3\t80.7\n', 'Atlanta, GA\t37\t33.9\t85.0\n', 'Boise, ID\t22\t43.7\t117.1\n', 'Chicago, IL\t19\t42.3\t88.0\n', 'Indianapolis, IN\t21\t39.8\t86.9\n', 'Des Moines, IA\t11\t41.8\t93.6\n', 'Wichita, KS\t22\t38.1\t97.6\n', 'Louisville, KY\t27\t39.0\t86.5\n', 'New Orleans, LA\t45\t30.8\t90.2\n', 'Portland, ME\t12\t44.2\t70.5\n', 'Baltimore, MD\t25\t39.7\t77.3\n', 'Boston, MA\t23\t42.7\t71.4\n', 'Detroit, MI\t21\t43.1\t83.9\n', 'Minneapolis, MN\t2\t45.9\t93.9\n', 'St. Louis, MO\t24\t39.3\t90.5\n', 'Helena, MT\t8\t47.1\t112.4\n', 'Omaha, NE\t13\t41.9\t96.1\n', 'Concord, NH\t11\t43.5\t71.9\n', 'Atlantic City, NJ\t27\t39.8\t75.3\n', 'Albuquerque, NM\t24\t35.1\t106.7\n', 'Albany, NY\t14\t42.6\t73.7\n', 'New York, NY\t27\t40.8\t74.6\n', 'Charlotte, NC\t34\t35.9\t81.5\n', 'Raleigh, NC\t31\t36.4\t78.9\n', 'Bismarck, ND\t0\t47.1\t101.0\n', 'Cincinnati, OH\t26\t39.2\t85.0\n', 'Cleveland, OH\t21\t42.3\t82.5\n', 'Oklahoma City, OK\t28\t35.9\t97.5\n', 'Portland, OR\t33\t45.6\t123.2\n', 'Harrisburg, PA\t24\t40.9\t77.8\n', 'Philadelphia, PA\t24\t40.9\t75.5\n', 'Charleston, SC\t38\t33.3\t80.8\n', 'Nashville, TN\t31\t36.7\t87.6\n', 'Amarillo, TX\t24\t35.6\t101.9\n', 'Galveston, TX\t49\t29.4\t95.5\n', 'Houston, TX\t44\t30.1\t95.9\n', 'Salt Lake City, UT\t18\t41.1\t112.3\n', 'Burlington, VT\t7\t45.0\t73.9\n', 'Norfolk, VA\t32\t37.0\t76.6\n', 'Seattle, WA\t33\t48.1\t122.5\n', 'Spokane, WA\t19\t48.1\t117.9\n', 'Madison, WI\t9\t43.4\t90.2\n', 'Milwaukee, WI\t13\t43.3\t88.1\n', 'Cheyenne, WY\t14\t41.2\t104.9\n']
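
The cell above loads the whole file into a list with readlines(). A file object can also be iterated over directly, which yields one line at a time and may be preferable for larger files. The following is a minimal sketch, assuming the same tab-separated layout; the helper name iter_rows is hypothetical, and an in-memory io.StringIO holding the header and the first three data rows stands in for temp1.txt so that the example runs on its own.

```python
import io

def iter_rows(fileobj, delimiter='\t'):
    """Yield each line of an open text file as a list of column strings."""
    for raw in fileobj:                      # a file object yields one line per iteration
        yield raw.rstrip('\n').split(delimiter)

# Stand-in for open('temp1.txt'): header plus the first three data rows
sample = io.StringIO(
    'City\tJanTemp\tLat\tLong\n'
    'Mobile, AL\t44\t31.2\t88.5\n'
    'Montgomery, AL\t38\t32.9\t86.8\n'
    'Phoenix, AZ\t35\t33.6\t112.5\n'
)

rows = iter_rows(sample)
header = next(rows)                          # consume the header line first
records = list(rows)                         # the remaining lines are data rows

print(header)
print(records[0])
```

With a real file, the same helper would be used as `with open('temp1.txt') as f: records = list(iter_rows(f))`.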
In [4]:
# Remove '\n' and split a line into columns (remember, '\t' is the delimiter)
linesX = lines[1].strip('\n')
line = linesX.split('\t')
print('line=', line)
print('line[0]=', line[0])
line= ['Mobile, AL', '44', '31.2', '88.5']
line[0]= Mobile, AL
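
The strip-and-split steps can also be delegated to the standard-library csv module. This is a sketch rather than the approach used in this notebook: it assumes the header row shown earlier and uses an in-memory sample of the first three data rows in place of temp1.txt.

```python
import csv
import io

# Stand-in for open('temp1.txt', newline=''): header plus three data rows
sample = io.StringIO(
    'City\tJanTemp\tLat\tLong\n'
    'Mobile, AL\t44\t31.2\t88.5\n'
    'Montgomery, AL\t38\t32.9\t86.8\n'
    'Phoenix, AZ\t35\t33.6\t112.5\n'
)

# DictReader uses the first line as field names and splits on the tab delimiter
reader = csv.DictReader(sample, delimiter='\t')

# Map each city to its January temperature, converting the text to an integer
temps = {row['City']: int(row['JanTemp']) for row in reader}
print(temps)
```

Because the delimiter is a tab, the comma inside each city name ("Mobile, AL") stays part of the field.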
In [5]:
# Let us read just column 1 (city, state) and column 2 (temperature)

# lines[0] is the header, so there are len(lines)-1 data rows
# ('U32' stores each city name as text of up to 32 characters)
xdata = np.zeros(len(lines)-1, dtype={'names':['city', 'temp'], 'formats':['U32','i4']})
for j in range(1,len(lines)):
    line = lines[j].strip('\n').split('\t')
    xdata[j-1] = (line[0], int(line[1]))
In [6]:
print(xdata['city'], xdata['temp'])
['Mobile, AL' 'Montgomery, AL' 'Phoenix, AZ' 'Little Rock, AR'
 'Los Angeles, CA' 'San Francisco, CA' 'Denver, CO' 'New Haven, CT'
 'Wilmington, DE' 'Washington, DC' 'Jacksonville, FL' 'Key West, FL'
 'Miami, FL' 'Atlanta, GA' 'Boise, ID' 'Chicago, IL' 'Indianapolis, IN'
 'Des Moines, IA' 'Wichita, KS' 'Louisville, KY' 'New Orleans, LA'
 'Portland, ME' 'Baltimore, MD' 'Boston, MA' 'Detroit, MI'
 'Minneapolis, MN' 'St. Louis, MO' 'Helena, MT' 'Omaha, NE' 'Concord, NH'
 'Atlantic City, NJ' 'Albuquerque, NM' 'Albany, NY' 'New York, NY'
 'Charlotte, NC' 'Raleigh, NC' 'Bismarck, ND' 'Cincinnati, OH'
 'Cleveland, OH' 'Oklahoma City, OK' 'Portland, OR' 'Harrisburg, PA'
 'Philadelphia, PA' 'Charleston, SC' 'Nashville, TN' 'Amarillo, TX'
 'Galveston, TX' 'Houston, TX' 'Salt Lake City, UT' 'Burlington, VT'
 'Norfolk, VA' 'Seattle, WA' 'Spokane, WA' 'Madison, WI' 'Milwaukee, WI'
 'Cheyenne, WY'] [44 38 35 31 47 42 15 22 26 30 45 65 58 37 22 19 21 11 22 27 45 12 25 23 21
  2 24  8 13 11 27 24 14 27 34 31  0 26 21 28 33 24 24 38 31 24 49 44 18  7
 32 33 19  9 13 14]
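
Storing the columns in a structured array means one field can be used to index another. The sketch below illustrates this on four rows taken from the data; the variable names are chosen here for illustration, and the 32 degree threshold is simply the freezing point in Fahrenheit.

```python
import numpy as np

# Four (city, temperature) rows copied from the data set
rows = [('Mobile, AL', 44), ('Denver, CO', 15),
        ('Key West, FL', 65), ('Bismarck, ND', 0)]
sample = np.array(rows, dtype=[('city', 'U32'), ('temp', 'i4')])

# argmin on one field gives an index that can be used to look up another field
coldest = sample['city'][np.argmin(sample['temp'])]

# A boolean mask built from 'temp' selects the matching entries of 'city'
below_freezing = sample['city'][sample['temp'] < 32]

print(coldest)
print(below_freezing)
```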
In [19]:
%matplotlib inline
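# Bins are 1 degree wide and extend to 66 so that the warmest city (Key West, 65) is included
plt.hist(xdata['temp'],bins=range(-1,67),color='g');
plt.xlabel('Temperature (degrees Fahrenheit)')
plt.ylabel('Frequency');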
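
The counts behind a histogram can also be inspected without drawing anything, using np.histogram. The sketch below uses the 56 temperatures from the data set; the 10-degree bin width is an arbitrary choice made for this illustration and is not the binning used in the plot.

```python
import numpy as np

# January temperatures of the 56 cities, in file order
temps = np.array([
    44, 38, 35, 31, 47, 42, 15, 22, 26, 30, 45, 65, 58, 37, 22, 19, 21, 11, 22,
    27, 45, 12, 25, 23, 21, 2, 24, 8, 13, 11, 27, 24, 14, 27, 34, 31, 0, 26, 21,
    28, 33, 24, 24, 38, 31, 24, 49, 44, 18, 7, 32, 33, 19, 9, 13, 14,
])

# Bin edges 0, 10, ..., 70; every bin is half-open except the last, which is closed
counts, edges = np.histogram(temps, bins=np.arange(0, 71, 10))

for lo, hi, n in zip(edges[:-1], edges[1:], counts):
    print('{:>2} to {:>2}: {}'.format(int(lo), int(hi), int(n)))
```

Checking that the counts sum to the number of cities is a quick way to confirm that no value fell outside the chosen bin range.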

Random numbers from different distributions

In [10]:
# Random numbers between 0 and 1 from a uniform distribution 
ur=np.random.rand(5000)

# Random numbers from a Gaussian (or normal) distribution with mean=0 and standard deviation=1 (so variance=1)
gr=np.random.normal(0,1,5000)

# Random numbers from exponential distribution with mean=1
ex=np.random.exponential(1,1000)
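
The draws above change on every run. For reproducible results, NumPy also provides a seeded generator through np.random.default_rng. The sketch below repeats the three draws with it and compares sample moments against the theoretical values (uniform: mean 1/2, variance 1/12; standard normal: mean 0, variance 1; exponential with scale 1: mean 1, variance 1). The seed value is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(12345)   # arbitrary seed, fixed for reproducibility

ur = rng.random(5000)                # uniform on [0, 1)
gr = rng.normal(0, 1, 5000)          # mean 0, standard deviation 1
ex = rng.exponential(1, 1000)        # scale (mean) 1

# Sample mean and variance should be close to the theoretical values
for name, sample in [('uniform', ur), ('normal', gr), ('exponential', ex)]:
    print('{:12s} mean={:.3f} var={:.3f}'.format(name, sample.mean(), sample.var()))
```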
In [16]:
# Plot the three distributions above as side-by-side subplots

plt.subplot(1,3,1)
plt.hist(ur,20,color='m')

plt.subplot(1,3,2)
plt.hist(gr,20,color='b')

plt.subplot(1,3,3)
plt.hist(ex,20,color='c')

plt.tight_layout(); # improves the spacing between subplots 

Defining a density function

Let us define the density function of the normal (Gaussian) distribution.
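
For reference, the density of a normal distribution with mean $\mu$ and standard deviation $\sigma$ is:

$$
f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)
$$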

In [13]:
def gaussianpdf(mu,sigma,x):
    fdensity=1/(sigma*np.sqrt(2 * np.pi))*np.exp(-(x-mu)**2 / (2 * sigma**2))
    return fdensity
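
A quick sanity check on a density function is that it integrates to 1. The sketch below repeats the definition so that it runs on its own and approximates the integral with a simple Riemann sum; the grid limits (-8 to 8) and the number of points are assumptions chosen to be wide and fine enough for a standard normal.

```python
import numpy as np

def gaussianpdf(mu, sigma, x):
    """Normal density with mean mu and standard deviation sigma, evaluated at x."""
    return 1/(sigma*np.sqrt(2*np.pi))*np.exp(-(x-mu)**2/(2*sigma**2))

x = np.linspace(-8, 8, 2001)
dx = x[1] - x[0]

# Riemann-sum approximation of the integral of the density over [-8, 8]
area = np.sum(gaussianpdf(0, 1, x))*dx

# The peak of the standard normal density is 1/sqrt(2*pi)
peak = gaussianpdf(0, 1, 0.0)

print(area, peak)
```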
In [20]:
plt.subplot(1,2,2)
plt.hist(gr,20,color='k',density=True)  # density=True scales the histogram to unit area
x=np.linspace(-4,4,num=125)
fx=gaussianpdf(0,1,x)
plt.plot(x,fx,'r-',linewidth=3);
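
The agreement between the normalized histogram and the density curve can also be checked numerically. The sketch below is one way to do so, assuming a seeded generator in place of the unseeded draw used earlier: it computes the density-normalized bar heights with np.histogram and compares them with the density evaluated at the bin centers.

```python
import numpy as np

def gaussianpdf(mu, sigma, x):
    """Normal density with mean mu and standard deviation sigma, evaluated at x."""
    return 1/(sigma*np.sqrt(2*np.pi))*np.exp(-(x-mu)**2/(2*sigma**2))

rng = np.random.default_rng(0)       # arbitrary seed, fixed for reproducibility
gr = rng.normal(0, 1, 5000)

# density=True rescales the counts so that the bars have total area 1
heights, edges = np.histogram(gr, bins=20, density=True)
centers = 0.5*(edges[:-1] + edges[1:])

area = np.sum(heights*np.diff(edges))
max_gap = np.max(np.abs(heights - gaussianpdf(0, 1, centers)))

print(area, max_gap)
```

The gap shrinks as the sample size grows, since each bar height is an estimate of the average density over its bin.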