We have 3 types of Distribution which we have talked about:-
A Uniform Distribution will look like this:-
The probability density is defined as 1/(b - a) for values between a and b.
We can plot two types of graph for a uniform distribution; their respective plots are:-
Both graphs look quite alike; the only difference is the values on the X axis and the Y axis.
A Normal Distribution will look like this, with some empirical rules:-
A Normal Distribution has these properties:-
The function for the normal distribution is:-
$$f(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
The parameter $\mu$ in this definition is the mean or expectation of the distribution (and also its median and mode). The parameter $\sigma$ is its standard deviation; its variance is therefore $\sigma^2$. A random variable with a Gaussian distribution is said to be normally distributed and is called a normal deviate.
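As a rough illustration of the density and the empirical rules mentioned above, the sketch below evaluates the normal density directly and uses the standard library's `math.erf` to compute the probability of landing within k standard deviations of the mean. The function names `normal_pdf` and `prob_within` are hypothetical, chosen only for this example.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and std. dev. sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def prob_within(k):
    """Probability that a normal value lies within k std. devs. of the mean."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```

The three printed probabilities come out at roughly 68%, 95% and 99.7%, which is the usual statement of the empirical rule.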
In probability theory and statistics, the exponential distribution (a.k.a. negative exponential distribution) is the probability distribution that describes the time between events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate.
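The function for the Exponential distribution, with rate parameter $\lambda$, will be:-
$$f(x; \lambda) = \lambda e^{-\lambda x} \quad \text{for } x \ge 0, \text{ and } 0 \text{ otherwise}$$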
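To make the "time between events" idea concrete, here is a minimal sketch that evaluates the exponential density and draws waiting times with the standard library's `random.expovariate`. The names `exponential_pdf` and `sample_mean_gap`, the rate of 2.0 and the seed are assumptions made for this example only.

```python
import math
import random

def exponential_pdf(x, lam):
    """Density of an exponential distribution with rate lam."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def sample_mean_gap(lam, num_samples, seed=0):
    """Average of num_samples simulated waiting times with rate lam."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += rng.expovariate(lam)
    return total / num_samples

print(sample_mean_gap(2.0, 100000))  # should be close to 1/lam = 0.5
```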
Monte Carlo methods (or Monte Carlo experiments) are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results.
One good application of Monte Carlo simulation is the Monty Hall Problem, as the code below shows:-
    import random

    def chooseDoor():
        return random.choice([1, 2, 3])

    def playMontyHall(numTrails=1000):
        stayWins = 0
        switchWins = 0
        for trails in range(numTrails):
            prizeDoor = chooseDoor()
            playerDoor = chooseDoor()
            # Monty always opens a goat door, so staying wins only when the
            # first pick was right; in every other case switching wins.
            if prizeDoor == playerDoor:
                stayWins += 1
            else:
                switchWins += 1
        print("Stay Wins: ", stayWins / float(numTrails))
        print("Switch Wins: ", switchWins / float(numTrails))

    playMontyHall()
We can estimate PI using the code below:-
    import math
    import random
    import pylab

    #####
    #
    # Computing Pi
    #
    # Square with sides of length 2r
    # Inscribe a circle with radius r
    # Area of square = (2r)^2 = 4r^2
    # Area of circle = pi*r^2
    # Ratio of circle area to square area is (pi*r^2)/(4r^2) = pi/4
    #
    # Implication: Of N random points picked from inside the square,
    #              about N*pi/4 will be inside the circle
    #
    # So if M = number of points inside circle
    # M = N*pi/4
    # pi = M/N * 4

    def randomPoints(r):
        x = random.uniform(-r, r)
        y = random.uniform(-r, r)
        return (x, y)

    def makePoints(r, n):
        """ Make n random points inside square with 2*r side. """
        points = []
        for i in range(n):
            points.append(randomPoints(r))
        return points

    def inCircle(r, points):
        x = points[0]
        y = points[1]
        return x ** 2 + y ** 2 <= r ** 2

    def numInCircle(r, points):
        """ Figure out no of points inside circle of radius r """
        count = 0
        for point in points:
            if inCircle(r, point):
                count += 1
        return count

    def computePi(numPoints, points=None):
        """ Computes Pi using Monte Carlo simulation of n points """
        if points is None:
            points = makePoints(1.0, numPoints)
        numInside = numInCircle(1.0, points)
        return float(numInside) / float(numPoints) * 4.0

    def runTrails(numTrailsPerPoints, numPointsList):
        results = []
        for numPoints in numPointsList:
            print(numPoints)
            for trails in range(numTrailsPerPoints):
                results.append((numPoints, computePi(numPoints)))
        return results

    def plotPi(trails, trailsResults):
        numPoints = []
        results = []
        for result in trailsResults:
            numPoints.append(result[0])
            results.append(result[1])
        pylab.figure()
        pylab.clf()
        # Red dots: individual estimates; blue line: the true value of pi
        pylab.scatter(numPoints, results, c="r")
        pylab.plot(trails, [math.pi for _ in trails], c="b")
        pylab.xlabel("Number of Points")
        pylab.ylabel("Pi")
        pylab.title("Pi Vs Number of Points")
        pylab.show()

    numTrailsPerPoints = 50
    numPointsList = list(range(10, 10000, 1000))
    trailsResults = runTrails(numTrailsPerPoints, numPointsList)
    plotPi(numPointsList, trailsResults)
The plot will look like this:-
We can also check where the needles are dropped.
Here is the code:-
    # Reuses makePoints, computePi and inCircle from the code above
    def PlotPiScatter(r, n):
        points = makePoints(1.0, n)
        piEstimate = computePi(n, points)
        squarePoints_X = []
        squarePoints_Y = []
        circlePoints_X = []
        circlePoints_Y = []
        for point in points:
            if inCircle(r, point):
                circlePoints_X.append(point[0])
                circlePoints_Y.append(point[1])
            else:
                squarePoints_X.append(point[0])
                squarePoints_Y.append(point[1])
        pylab.figure()
        pylab.clf()
        pylab.scatter(squarePoints_X, squarePoints_Y, c="r")
        pylab.scatter(circlePoints_X, circlePoints_Y, c="b")
        pylab.title("With " + str(n) + " points ")
        pylab.axis([-1.5, 1.5, -1.5, 1.5])
        pylab.text(-1.4, -1.4, "Pi is estimated to be " + str(piEstimate))
        pylab.show()
We can use the same method of needle drop to find the area of a function using integration.
Kindly understand this code:-
    import math
    import random
    import numpy
    import pylab

    # float range
    def frange(start, stop, step):
        l = []
        for i in range(int((stop - start) / step)):
            l.append(start + step * i)
        return l

    #
    # rSquared
    #
    def mse(measured, predicted):
        """Compute Sum of Residual Square"""
        sum_sq = 0
        for i in range(len(measured)):
            # Accumulate the squared residual of every point
            sum_sq += (measured[i] - predicted[i]) ** 2
        return float(sum_sq)

    def sstot(measured):
        """Compute total of sum square"""
        nMean = sum(measured) / float(len(measured))
        ntot = 0
        for m in measured:
            ntot += (m - nMean) ** 2
        return ntot

    def rSquared(measured, predicted):
        """
        measured: list of measured values
        predicted: list of predicted values
        """
        SSerr = mse(measured, predicted)
        SStot = sstot(measured)
        return 1.0 - SSerr / SStot

    def makeCurve(f, xs):
        ys = []
        for x in xs:
            ys.append(f(x))
        return ys

    def addNoise(ys, stddev=1):
        ysp = []
        for y in ys:
            ysp.append(random.gauss(y, stddev))
        return ysp

    def makeObservations(xs, f, noise):
        # Make a theoretical curve
        ys = makeCurve(f, xs)
        # Add Gaussian noise with standard deviation `noise` to simulate the environment
        nys = addNoise(ys, noise)
        return nys

    def showExamplePolyFit(xs, ys, fitDegree1=1, fitDegree2=2):
        pylab.figure()
        pylab.plot(xs, ys, 'r.', ms=2.0, label="measured")
        # Poly fit to the noisy data
        coeeff = numpy.polyfit(xs, ys, fitDegree1)
        # Predict the curve
        pys = numpy.polyval(numpy.poly1d(coeeff), xs)
        se = mse(ys, pys)
        r2 = rSquared(ys, pys)
        pylab.plot(xs, pys, 'g--', lw=5,
                   label="%d degree fit, SE = %0.10f, R2 = %0.10f" % (fitDegree1, se, r2))
        # Poly fit to the noisy data
        coeeffs = numpy.polyfit(xs, ys, fitDegree2)
        # Predict the curve
        pys = numpy.polyval(numpy.poly1d(coeeffs), xs)
        se = mse(ys, pys)
        r2 = rSquared(ys, pys)
        pylab.plot(xs, pys, 'b--', lw=5,
                   label="%d degree fit, SE = %0.10f, R2 = %0.10f" % (fitDegree2, se, r2))
        pylab.legend()

    def f(x):
        return x**3 + 5*x

    xs = frange(-2, 2, 0.001)
    ys = makeObservations(xs, f, 3)
    showExamplePolyFit(xs, ys, 1, 2)
    showExamplePolyFit(xs, ys, 2, 3)
    showExamplePolyFit(xs, ys, 3, 4)
    showExamplePolyFit(xs, ys, 4, 5)
    pylab.show()
These are the graphs which we will get.
The standard errors are not very good for the range -2 to 2; change it to -5 to 5 and see the difference.
In the last lecture, when we were finding the value of Pi, we started with the assumption that if we drop a large enough number of pins and repeat the experiment multiple times, a very small standard deviation means we have a correct value of PI.
But this assumption is not correct, because we were treating a statistically sound argument as equivalent to the truth.
The use of every statistical test is based on some ground rules:-
So in the last lecture, when we calculated PI, we based our calculation on the Buffon-Laplace math, did some algebra, and derived our code for estimating PI from that algebra. We accepted our estimate of PI because of its small standard deviation, as shown below:-
    import random

    def stdDev(X):
        mean = sum(X) / float(len(X))
        total = 0.0
        for x in X:
            total += (x - mean) ** 2
        return (total / len(X)) ** 0.5

    def throwNeedles(numNeedles):
        inCircle = 0
        for needle in range(1, numNeedles + 1, 1):
            x = random.random()
            y = random.random()
            if (x * x + y * y) ** 0.5 <= 1.0:
                inCircle += 1
        # After the loop, needle equals numNeedles
        return 4*(inCircle/float(needle))

    def estPi(precision=0.01, numTrials=20):
        numNeedles = 1000
        sDev = precision
        # Keep doubling the needles until the estimates agree closely enough
        while sDev >= (precision / 4.0):
            estimates = []
            for t in range(numTrials):
                piGuess = throwNeedles(numNeedles)
                estimates.append(piGuess)
            sDev = stdDev(estimates)
            curEst = sum(estimates) / len(estimates)
            print('Est. = ' + str(curEst) +
                  ', Std. dev. = ' + str(sDev) +
                  ', Needles = ' + str(numNeedles))
            numNeedles *= 2
        return curEst

    estPi()
But suppose we made a mistake in our code: in the throwNeedles() method, we changed the line return 4*(inCircle/float(needle)) to return 2*(inCircle/float(needle)). We would then get our estimate of PI as 1.56969296875, which is wrong.
So in the above code there is nothing wrong with the statistics, but the result is still nowhere close to the real value of PI.
The moral of the story is:-
How can we achieve this:-
To test our result, we can calculate the circumference of a circle from our estimate of PI; we would then have seen immediately that the estimate is nowhere close to the actual value.
When real scientists derive a simulation model, they run some experiments to verify that their model gives results which are near reality, or at least plausible. Statistics tell us that we have got the minute details right, but we should also do a sanity check.
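One possible sanity check of this kind, sketched under simple geometric assumptions: for a circle of radius 1, the circumference must be longer than the perimeter of an inscribed regular hexagon (6) and shorter than the perimeter of a circumscribed square (8), so any believable estimate of PI lies between 3 and 4. The function name `plausible_pi` and the value 3.14 are hypothetical; 1.56969296875 is the faulty estimate quoted above.

```python
def plausible_pi(estimate):
    """Coarse sanity check: pi must lie strictly between 3 and 4.

    For a circle of radius 1 (diameter 2), an inscribed regular hexagon has
    perimeter 6 and a circumscribed square has perimeter 8. The circumference
    2*pi lies between the two, so 3 < pi < 4.
    """
    return 3.0 < estimate < 4.0

print(plausible_pi(3.14))           # a typical estimate from the correct code
print(plausible_pi(1.56969296875))  # the estimate from the faulty code
```

A check like this says nothing about precision, but it catches a wrong constant that no amount of extra sampling would reveal.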
In real engineering we have to find a cohesive balance between these 3.
To understand this concept we can relive our high school days and the practical exams for Physics, Chemistry or Biology. We have studied the theory, and when our experiment gives results which are nowhere close to that theory, we know something is wrong. We have three possibilities.
The smart solution is the best thing to do, but to model this case we also have to model experimental errors.
We can model the errors which are introduced in experiments, provided we know how to model the error in addition to modelling reality.
The best way to model experimental error is to assume there is some sort of perturbation, i.e. a deviation of the measured data from its true value. As per Gauss’s analysis, these errors are normally distributed.
We can see how to model error with the help of Hooke’s Law.
Hooke’s Law states that:-
Hooke’s law is a principle of physics that states that the force needed to extend or compress a spring by some distance is proportional to that distance. That is, $F = -kx$, where
$k$ is a constant factor characteristic of the spring, its stiffness. It is also called the Spring Constant.
Hooke’s law holds for a wide variety of materials, but it does not hold for arbitrarily large forces. All materials have an elastic limit, and if we stretch beyond this limit, the law fails.
The spring constant tells us how stiff a material is; for example, in the suspension of an automobile the spring constant is very high.
The negative sign in the Hooke’s law equation means that the force exerted by the spring is in the opposite direction to the displacement.
We can calculate the spring constant using this experiment.
So now we can do some algebra:-
    F = -kx ---- (1)
    F = ma  ---- (2)   # mass * acceleration
    From (1) and (2), taking magnitudes:
    k = (m * g)/x      # a is changed to g, which is acceleration due to gravity
So if we know m, the mass suspended on the spring, and x, the displacement caused by the mass, we can calculate k, because g is 9.81 m/s².
This is a straightforward calculation and we can get the result fairly easily, but the problem is that we have experimental error caused by the environment. So we hang different weights on the spring and measure the displacement for each one, which gives us a series of (weight, displacement) points. Since the errors are assumed to be normally distributed around the true value, a large enough number of experiments gives us a nice estimate of the spring constant.
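The idea of averaging over many noisy measurements can be sketched as follows. Everything specific here is an assumption for illustration: the "true" spring constant of 30 N/m, the noise level of 2 mm, the list of masses, the seed, and the function names `simulate_displacements` and `estimate_k`. The data is simulated, not the experimental data used in these notes.

```python
import random

G = 9.81  # acceleration due to gravity, m/s^2

def simulate_displacements(masses, k_true, noise_sd, seed=0):
    """Displacement m*g/k for each mass, plus Gaussian measurement error."""
    rng = random.Random(seed)
    return [m * G / k_true + rng.gauss(0.0, noise_sd) for m in masses]

def estimate_k(masses, displacements):
    """Average the per-measurement estimates k = m*g/x."""
    estimates = [m * G / x for m, x in zip(masses, displacements)]
    return sum(estimates) / len(estimates)

masses = [0.1 + 0.05 * i for i in range(19)]  # 0.10 kg up to 1.00 kg
xs = simulate_displacements(masses, k_true=30.0, noise_sd=0.002)
print(estimate_k(masses, xs))  # should land near the assumed 30 N/m
```

Each individual estimate is thrown off by its own error, but the errors partly cancel in the average.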
Calculation of spring constant
* We have the experimental data, shared at Spring Constant.
* We have an algebraic equation, as shown above.
* We will make a computational model based on above data.
Here is the code for computational model:-
    import pylab

    def getData(fileName):
        dataFile = open(fileName, 'r')
        distances = []
        masses = []
        discardHeader = dataFile.readline()
        print(discardHeader)
        for line in dataFile:
            d, m = line.split()
            distances.append(float(d))
            masses.append(float(m))
        dataFile.close()
        return (masses, distances)

    def plotData(fileName):
        xVals, yVals = getData(fileName)
        xVals = pylab.array(xVals)
        yVals = pylab.array(yVals)
        xVals = xVals * 9.81  # acc. due to gravity
        pylab.plot(xVals, yVals, 'bo', label='Measured displacements')
        pylab.title('Measured Displacement of Spring')
        pylab.xlabel('|Force| (Newtons)')
        pylab.ylabel('Distance (meters)')

    plotData('../data/springData.txt')
    pylab.show()
Let's dissect the code mentioned above:-
* getData(): This is the method which reads from a file, here ../data/springData.txt, and returns the masses and distances lists.
    * dataFile = open(fileName,'r') opens the file in read-only mode.
    * discardHeader = dataFile.readline() just removes the header, i.e. the first line, from the text file.
    * line.split() splits each line with whitespace as the separator, which gives us d (distance) and m (mass) for each iteration of the experiment. These are stored in the lists distances & masses, which are returned as a tuple.
    * dataFile.close() closes the file.
* plotData(): uses the information from the file and plots some interesting statistics, as shown below:-
    * masses & distances are saved into xVals and yVals.
    * xVals and yVals are converted to a pylab.array(), which helps us do a lot of manipulation on each element of the array. This pylab.array() is built on top of numpy.array.
    * An array has no append method but has some other valuable methods. So we build the lists first using the append() method etc., and once we have the lists we convert them to an array() so we can do maths on them.
    * xVals = xVals*9.81 multiplies each item in xVals by 9.81; this works because we changed it to a pylab.array().
    * Finally, xVals and yVals are plotted.
The plot will look like this:-
So the big question here is how to calculate the spring constant?
To find the spring constant, we have to fit a line through the points above; the slope of that line gives us k. Since the plot has force on the X axis and displacement on the Y axis, the slope is 1/k.
Now you might be wondering what just happened: we have a simple formula, so why not use it? And what is this slope, and how is it related?
To explain this, let's do a little bit of maths.
From Hooke’s law we know that:-
F = -kx
We have plotted the graph with |F| on the X axis and the displacement x on the Y axis.
Now look at the plot shown above and remember that the equation of a straight line is:-
    y = mx + b
    b = Y intercept; if b = 0, then y = mx
Since the plot shows no Y intercept, the line has the form y = mx. Here y is the displacement and x is the force, and Hooke’s law gives displacement = |F|/k, so m = 1/k. So if we can find the slope of the line, we get k = 1/m.
Now we have to get that line. Getting the line passing through two points is easy; we can simply use this formula:-
    Y - Y1 = m(X - X1), where m = (Y2 - Y1)/(X2 - X1)
Using the equation given above, we have a fit line connecting 2 points.
When we have a bunch of points scattered around, we have to find the line which comes closest to fitting these points. So to find the proper fit for a line through multiple points, we need a measure of the goodness of the fit. We have to choose a Best Fit. To find the Best Fit we take the help of an Objective Function, which measures the goodness of the fit.
One possible objective function would be to draw a line which touches most or all of the points; the problem with that is that it is very hard to find such a line.
There is a standard measure for finding the best fit, which is called the Least Squares Fit.
This is the most commonly used objective function to identify how well a curve fits a set of points.
The problem statement from Wikipedia is:-
The objective consists of adjusting the parameters of a model function to best fit a data set. A simple data set consists of n points (data pairs) $(x_i, y_i)$, i = 1, …, n, where $x_i$ is an independent variable and $y_i$ is a dependent variable whose value is found by observation. The model function has the form $f(x, \beta)$, where the m adjustable parameters are held in the vector $\beta$. The goal is to find the parameter values for the model which “best” fits the data. The least squares method finds its optimum when the sum, S, of squared residuals
$$S = \sum_{i=1}^{n} r_i^2$$
is a minimum. A residual is defined as the difference between the actual value of the dependent variable and the value predicted by the model: $r_i = y_i - f(x_i, \beta)$.
So using the above least squares fit, we will get a graph something like this:-
In our situation, the line is generated from the independent variable, i.e. the mass, and the predicted value of the dependent variable, i.e. the displacement. The points scattered around the line are the observed values of mass and displacement. We calculate the difference between the predicted and the observed value and square it. By squaring, we discard whether a point is above or below the line; we are only interested in how far away it is. Then we sum these squared differences, and the smaller this sum is, the better our fit.
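The sum of squared differences described above, and the line that minimises it, can be sketched in plain Python using the standard closed-form solution for a straight-line fit. The function names and the five data points are made up for this illustration; they are not the spring data.

```python
def sum_squared_residuals(xs, ys, a, b):
    """S = sum of (observed y - predicted y)^2 for the line y = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

def least_squares_line(xs, ys):
    """Slope a and intercept b that minimise S, via the closed-form solution."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]  # made-up points scattered around y = 2x + 1
a, b = least_squares_line(xs, ys)
print(a, b, sum_squared_residuals(xs, ys, a, b))
```

Any other choice of slope and intercept, including the "true" y = 2x + 1, gives a sum of squared residuals at least as large as the fitted line's on these points.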
We have a way to validate the fit, but how do we actually make the fit? We can calculate the best fit using Newton's method, or we can use a simple analytic model which will give us the answer. The good news is that Pylab has a built-in method, polyfit, which does this.
polyfit() takes 3 arguments:- the observed X values, the observed Y values, and the degree of the polynomial. With a degree of 1, it will return a and b, which we can use in the equation y = ax + b, where x is the independent value; in our case it will be the mass (scaled by 9.81 to give the force).
So let's use these concepts to draw the fit line:-
    import pylab

    def getData(fileName):
        dataFile = open(fileName, 'r')
        distances = []
        masses = []
        discardHeader = dataFile.readline()
        print(discardHeader)
        for line in dataFile:
            d, m = line.split()
            distances.append(float(d))
            masses.append(float(m))
        dataFile.close()
        return (masses, distances)

    def fitData(fileName):
        xVals, yVals = getData(fileName)
        xVals = pylab.array(xVals)
        yVals = pylab.array(yVals)
        xVals = xVals * 9.81  # convert mass to force
        pylab.plot(xVals, yVals, 'bo', label='Measured displacements')
        pylab.title('Measured Displacement of Spring')
        pylab.xlabel('|Force| (Newtons)')
        pylab.ylabel('Distance (meters)')
        # Degree-1 fit: distance = a*force + b
        a, b = pylab.polyfit(xVals, yVals, 1)
        estYVals = a * pylab.array(xVals) + b
        # The slope is distance/force, so the spring constant is its inverse
        k = 1 / a
        pylab.plot(xVals, estYVals, label='Linear fit, k = ' + str(round(k, 5)))
        pylab.legend(loc='best')

    fitData('../data/springData.txt')
    pylab.show()
The plot of the above code will be:-
The polyfit() method uses a concept called Linear Regression to compute the values.
It is called linear not because we are using a line to represent the fit, but because the predicted value of the dependent variable Y is a linear function of the coefficients: even if we had the equation of a parabola, Y is a sum of terms, each of which is a coefficient multiplied by a power of the independent variable X.
Now can we say that the plot which we just got gives the best fit? If we look closely, we find that the points are pretty far away from the best-fit line, and I am beginning to have my concerns, because if this fit is not correct, it will directly impact the calculation of the spring constant.
Now let's check another variation.
We will make a cubic fit:-
    import pylab

    def getData(fileName):
        dataFile = open(fileName, 'r')
        distances = []
        masses = []
        discardHeader = dataFile.readline()
        print(discardHeader)
        for line in dataFile:
            d, m = line.split()
            distances.append(float(d))
            masses.append(float(m))
        dataFile.close()
        return (masses, distances)

    def fitData(fileName):
        xVals, yVals = getData(fileName)
        xVals = pylab.array(xVals)
        yVals = pylab.array(yVals)
        xVals = xVals * 9.81
        pylab.plot(xVals, yVals, 'bo', label='Measured displacements')
        pylab.title('Measured Displacement of Spring')
        pylab.xlabel('|Force| (Newtons)')
        pylab.ylabel('Distance (meters)')
        a, b = pylab.polyfit(xVals, yVals, 1)
        estYVals = a * pylab.array(xVals) + b
        pylab.plot(xVals, estYVals, label='Linear fit')
        a, b, c, d = pylab.polyfit(xVals, yVals, 3)
        estYVals = a * (xVals**3) + b * xVals**2 + c * xVals + d
        # The spring constant is only meaningful for the linear fit
        # (k = 1/slope), so it is not computed from the cubic coefficients.
        pylab.plot(xVals, estYVals, label='Cubic fit')
        pylab.legend(loc='best')

    fitData('../data/springData.txt')
    pylab.show()
The plot of the graph will look like this:-
So looking at the plot, the cubic fit looks like a better fit, but we should always remember that we build a model so that we can predict values for which we cannot run experiments.
So let's use our model to predict some values.
    import pylab

    def getData(fileName):
        dataFile = open(fileName, 'r')
        distances = []
        masses = []
        discardHeader = dataFile.readline()
        print(discardHeader)
        for line in dataFile:
            d, m = line.split()
            distances.append(float(d))
            masses.append(float(m))
        dataFile.close()
        return (masses, distances)

    def fitData(fileName):
        xVals, yVals = getData(fileName)
        # Extend the X values with one extra mass of 1.5 for prediction
        extX = pylab.array(xVals + [1.5])
        xVals = pylab.array(xVals)
        yVals = pylab.array(yVals)
        xVals = xVals * 9.81
        extX = extX * 9.81
        pylab.plot(xVals, yVals, 'bo', label='Measured displacements')
        pylab.title('Measured Displacement of Spring')
        pylab.xlabel('|Force| (Newtons)')
        pylab.ylabel('Distance (meters)')
        a, b = pylab.polyfit(xVals, yVals, 1)
        extY = a * pylab.array(extX) + b
        pylab.plot(extX, extY, label='Linear fit')
        a, b, c, d = pylab.polyfit(xVals, yVals, 3)
        extY = a * (extX**3) + b * extX**2 + c * extX + d
        pylab.plot(extX, extY, label='Cubic fit')
        pylab.legend(loc='best')

    fitData('../data/springData.txt')
    pylab.show()
So in the above code, we have added one extra point for the calculation, which is a weight of 1.5.
The plot of the graph will look like this:-
From the above plot we can see that the cubic curve fitted our existing data very well, but it failed to predict the displacement for the additional weight correctly. What could be the reason for this?
The curve fits the existing data very nicely, but it has very bad predictive value.
If we look at the raw data, we find that it flattens out at the end, which is not what Hooke’s law predicts, because the data should be linear as per Hooke’s Law. The most plausible reason is that we have crossed the elastic limit of the spring.
Now we can discard the last 6 points, which are beyond the elastic limit, and check.
    import pylab

    def getData(fileName):
        dataFile = open(fileName, 'r')
        distances = []
        masses = []
        discardHeader = dataFile.readline()
        print(discardHeader)
        for line in dataFile:
            d, m = line.split()
            distances.append(float(d))
            masses.append(float(m))
        dataFile.close()
        return (masses, distances)

    def fitData(fileName):
        xVals, yVals = getData(fileName)
        # Drop the last 6 points, which lie beyond the elastic limit
        xVals = pylab.array(xVals[:-6])
        yVals = pylab.array(yVals[:-6])
        xVals = xVals * 9.81
        pylab.plot(xVals, yVals, 'bo', label='Measured displacements')
        pylab.title('Measured Displacement of Spring')
        pylab.xlabel('|Force| (Newtons)')
        pylab.ylabel('Distance (meters)')
        a, b = pylab.polyfit(xVals, yVals, 1)
        estYVals = a * pylab.array(xVals) + b
        k = 1 / a
        pylab.plot(xVals, estYVals, label='Linear fit, k = ' + str(round(k, 5)))
        pylab.legend(loc='best')

    fitData('../data/springData.txt')
    pylab.show()
And the plot will look like this:-
Now this looks like a much better fit, and even if we fit a cubic it will be pretty close to the line.
Now we have the problem of identifying which line is the best fit. Were we correct in deleting the last 6 points? After all, we could also delete all the points except two and get a perfect line.
The justification is that we deleted the last 6 points based on theory: Hooke’s law suggests we might have exceeded the elastic limit. We do not have a theoretical justification for deleting arbitrary points which do not fit our curve.
This proves the point with which we started, i.e. there is an interplay between the 3 important pillars:-
So let's use the above concept on a different kind of spring, the bow and arrow.
Now when we draw two curves, a line and a hyperbola, we find that the line is nowhere close to the data and the theory, but the hyperbola fits the data very well; if we calculate the mean squared error, we also find that the hyperbola is the better fit.
So the mean squared error is a very good measure for comparing two fits, but it is very bad at validating a fit in absolute terms, because the mean squared error has no upper bound.
The mean squared error is closely related to the coefficient of determination, which is explained in the next lecture.
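As a small preview of why the coefficient of determination is easier to read in absolute terms, the sketch below computes it as 1 minus the ratio of the squared error to the total variation, the same formula as the rSquared function in the earlier polynomial-fit code. The function name `r_squared` and the sample values are assumptions for illustration.

```python
def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_err / SS_tot."""
    mean = sum(measured) / len(measured)
    ss_err = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_err / ss_tot

measured = [1.0, 2.0, 3.0, 4.0]
print(r_squared(measured, measured))              # perfect prediction
print(r_squared(measured, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean
```

A perfect prediction scores 1 and a model that only ever predicts the mean scores 0, so the value has a fixed scale, which the raw squared error lacks.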
We have seen some models of the real world in past lectures. The Gaussian or Normal Distribution is a great way to represent real-world phenomena with the help of a mean and a standard deviation.
A Normal Distribution can be fully characterized by its mean and standard deviation. Being able to characterize a curve with a few parameters helps in modelling real-world systems.
Most of the time we would like to build computational models based on the normal distribution, because of how nicely it can be characterized and because the standard deviation tells us how closely values lie to the mean.
We have to take care, though: if something is not normally distributed and we try to model it that way, we can get misleading results. Not all distributions are normal.
Ex:-
Consider rolling a single die: each of the six outcomes is equally probable, which means we cannot represent it as a normal distribution.
Similarly for any fair lottery, where each number is equally likely to be drawn.
So for both of the above cases we will have a flat line representing the distribution.
These distributions are called Uniform Distributions.
In a uniform distribution, each result is equally probable. It can be fully characterized by a single parameter, its range.
Uniform distributions mostly occur in games devised by humans and rarely, if ever, in nature, and they are not very useful for modelling complex systems.
Another distribution which occurs very frequently in nature is the Exponential Distribution.
The key thing about exponential distributions is that they are memoryless, and they are the only continuous distributions which are memoryless.
Consider an example where we check the concentration of a drug in the human body.
Assume that at each time step a molecule of the drug has a probability p of being cleared from the body. The system is memoryless, which means the probability of a molecule being cleared at any step is independent of what happened before that step.
So at time t = 1, the probability that a molecule is still in the body is 1 - p.
The probability that the molecule is still there at time t = 2 is (1-p)^2, because these are independent events.
In general, the probability that the molecule is still there at time t is (1-p)^t.
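The (1-p)^t formula can be checked against a direct simulation of independent steps. This is only a sketch: the function names, the clearing probability of 0.01, the 100 steps, the 20000 simulated molecules and the seed are all assumptions chosen for illustration.

```python
import random

def survival_prob(p, t):
    """Probability that a molecule is still present after t steps."""
    return (1 - p) ** t

def simulated_survival(p, t, num_molecules, seed=0):
    """Fraction of simulated molecules that survive t independent steps."""
    rng = random.Random(seed)
    survivors = 0
    for _ in range(num_molecules):
        # all() stops at the first step in which the molecule is cleared
        if all(rng.random() >= p for _ in range(t)):
            survivors += 1
    return survivors / num_molecules

print(survival_prob(0.01, 100), simulated_survival(0.01, 100, 20000))
```

The two printed numbers should be close to each other, with the simulated value wobbling slightly from run to run if the seed is changed.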
Consider a problem where at time t = 0 there are n molecules; how many molecules will be present at time t?
See the below code to look into this:-
    import pylab

    def clear(n, clearProb, steps):
        numRemaning = [n]  # n molecules at time 0
        for t in range(steps):
            # Expected number left after t + 1 steps
            numRemaning.append(n * ((1 - clearProb) ** (t + 1)))
        pylab.plot(numRemaning, label='Exponential Decay')

    clear(1000, 0.01, 500)
    pylab.semilogy()
    pylab.show()
The above code will give us a figure like this:-
We get a straight line in the graph because we are using semilogy, which makes the Y axis logarithmic.
If we remove the semilogy call we get a curve, as shown below.
The graph shows an exponential decay, which drops sharply and then asymptotically approaches 0 but never quite gets there in a continuous model. In a discrete model we may reach 0, because the last molecule either gets cleared or not.
So if we plot an exponential curve on a log axis we get a straight line.
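The reason for the straight line is that the logarithm of n(1-p)^t equals log(n) + t*log(1-p), which is linear in t. A minimal numeric check, with the hypothetical helper `log_remaining` and illustrative values of n and p:

```python
import math

def log_remaining(n, clear_prob, steps):
    """log10 of the expected number of molecules left after each step."""
    return [math.log10(n * (1 - clear_prob) ** t) for t in range(steps + 1)]

logs = log_remaining(1000, 0.01, 5)
diffs = [later - earlier for earlier, later in zip(logs, logs[1:])]
print(diffs)  # every difference is (up to rounding) log10(0.99)
```

Equal steps in t give equal steps in the logarithm, which is exactly what a straight line on a log axis means.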
In the above code, we compute the result directly from the mathematical equation. We can also use Monte Carlo Simulation to mimic the physical process on which the equation is based.
    import random
    import pylab

    def clearSim(n, clearProb, steps):
        numRemaning = [n]
        for t in range(steps):
            numLeft = numRemaning[-1]
            # Each remaining molecule is cleared with probability clearProb
            for m in range(numRemaning[-1]):
                if random.random() <= clearProb:
                    numLeft -= 1
            numRemaning.append(numLeft)
        pylab.plot(numRemaning, "r", label="simulation")
Now we can check the graph of this:-
Now if you check the graph above, the red curve, which is the Monte Carlo Simulation, closely mimics the mathematical model in blue.
The only difference is that the blue curve is much smoother than the red curve.
So we have two models:-
Which of these two is better?
Both models show the exponential decay, but they are not quite identical; which one should be preferred?
The way to answer this is that, when we are simulating a real-world problem, we should ask the question:-
Consider a scenario with what-if questions, which are very difficult to answer with an analytic model. For example, say the molecules had the property that every 100 time steps they clone themselves. How would we mimic that?
It would be very difficult to create an analytic model, because we do not know the equation which describes this, but a simulation model can help.
The code to do this is here:-
    import random
    import pylab

    def clearSimWhatif(n, clearProb, steps):
        numRemaning = [n]
        for t in range(steps):
            numLeft = numRemaning[-1]
            for m in range(numRemaning[-1]):
                if random.random() <= clearProb:
                    numLeft -= 1
            # Every 100 time steps the remaining molecules clone themselves
            if t != 0 and t % 100 == 0:
                numLeft += numLeft
            numRemaning.append(numLeft)
        pylab.plot(numRemaning, "y", label="simulation")
And the graph will be something like this.
The color of the graphs are like this:-
A lot of physical systems exhibit exponential decay or growth: for example, the half-life of nuclear material is an exponential decay, and algae growth is an exponential growth.
So we will now discuss a very nice problem based on probability, the Monty Hall Problem.
We can understand the Monty Hall problem from these two links:-
As per Wikipedia, the Monty Hall Problem is defined as:-
Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?
So the whole of the Monty Hall problem comes down to whether to switch or not.
We can analyze the probabilities of the Monty Hall problem.
When the person chooses one door, the probability of winning is 1/3, and the probability of the prize being behind one of the other 2 doors is 2/3.
Now the next step is important. Monty Hall knows which door has the prize, and he always opens a door which does not have the prize. So his choice of door is not independent of the choice of the player.
Once that door is opened, if the contestant stays with his original choice the probability of winning is still 1/3, but if he switches the probability increases to 2/3. So if a contestant got multiple chances, he would win about 2 out of 3 times by switching. But alas each contestant gets only 1 chance, so this increase in probability still feels counter-intuitive.
Switching doubles the probability of winning.
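The 1/3 versus 2/3 split can also be confirmed without any randomness by listing every equally likely combination of prize door and first guess. This is a sketch that assumes Monty always opens a goat door; the function name `monty_hall_exact` is hypothetical.

```python
from fractions import Fraction

def monty_hall_exact():
    """Enumerate the 9 equally likely (prize door, first guess) pairs."""
    stick = 0
    switch = 0
    for prize in (1, 2, 3):
        for guess in (1, 2, 3):
            if guess == prize:
                stick += 1   # staying wins only if the first guess was right
            else:
                switch += 1  # Monty removes the other goat, so switching wins
    total = stick + switch
    return Fraction(stick, total), Fraction(switch, total)

print(monty_hall_exact())
```

Sticking wins in 3 of the 9 cases and switching in the other 6, which matches the argument above.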
So let's run a simulation to understand a complex situation like the Monty Hall problem.
    import pylab, random

    def montyChose(guessDoor, prizeDoor):
        # Monty opens the first door that is neither the guess nor the prize
        if 1 != guessDoor and 1 != prizeDoor:
            return 1
        if 2 != guessDoor and 2 != prizeDoor:
            return 2
        return 3

    def randomChose(guessDoor, prizeDoor):
        # A door other than the guess is opened at random (may reveal the prize)
        if guessDoor == 1:
            return random.choice([2, 3])
        if guessDoor == 2:
            return random.choice([1, 3])
        return random.choice([1, 2])

    def simMontyHall(numTrials=100, chooseFcn=montyChose):
        stickWins = 0
        switchWins = 0
        noWin = 0
        prizeDoorChoices = [1, 2, 3]
        guessChoices = [1, 2, 3]
        for t in range(numTrials):
            prizeDoor = random.choice(prizeDoorChoices)
            guess = random.choice(guessChoices)
            toOpen = chooseFcn(guess, prizeDoor)
            if toOpen == prizeDoor:
                noWin += 1
            elif guess == prizeDoor:
                stickWins += 1
            else:
                switchWins += 1
        return (stickWins, switchWins)

    def displayMHSim(simResults):
        stickWins, switchWins = simResults
        pylab.pie([stickWins, switchWins], colors=['r', 'g'],
                  labels=['stick', 'change'], autopct='%.2f%%')
        pylab.title('To Switch or Not to Switch')

    simResults = simMontyHall(100000, montyChose)
    displayMHSim(simResults)
    pylab.figure()
    simResults = simMontyHall(100000, randomChose)
    displayMHSim(simResults)
    pylab.show()
Consider the output graph:-
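* When the door is opened at random
* When Monty chooses
Based on the above 2 graphs: when Monty chooses the door knowingly, switching wins about 2/3 of the time, so we should always switch. When the door is opened at random, sticking and switching win equally often, so switching does not hurt there either.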
From the lectures so far on randomness and Monte Carlo simulation, we might get the idea that Monte Carlo simulation is mostly useful for processes which have an inherent randomness.
We can also use randomness to solve problems where randomness plays no role. The best example is estimating Pi.
Here are some articles to learn more about this:-
Monte Carlo simulation for estimating Pi is based on dropping needles at random onto a square that contains a circle of radius 1:-
$$\frac{\text{needles in circle}}{\text{needles in square}} = \frac{\text{area of circle}}{\text{area of square}}$$
$$\pi = \text{area of circle} = \frac{\text{area of square} \times \text{needles in circle}}{\text{needles in square}}$$
So we will simulate the above equation in code:-
    import random

    def stdDev(X):
        mean = sum(X)/float(len(X))
        total = 0.0
        for x in X:
            total += (x - mean) ** 2
        return (total/len(X))**0.5

    def throwNeedles(numNeedles):
        # x and y lie in [0, 1), so we sample the unit square and count the
        # needles that fall inside the quarter circle of radius 1
        inCircle = 0
        for needle in xrange(1, numNeedles + 1, 1):
            x = random.random()
            y = random.random()
            if (x * x + y * y) ** 0.5 <= 1.0:
                inCircle += 1
        # quarter circle area / square area = pi/4
        return 4*(inCircle/float(numNeedles))

    def estPi(precision = 0.01, numTrials = 20):
        numNeedles = 1000
        sDev = precision
        # double the needles until 2 standard deviations fit within precision/2
        while sDev >= (precision/4.0):
            estimates = []
            for t in range(numTrials):
                piGuess = throwNeedles(numNeedles)
                estimates.append(piGuess)
            sDev = stdDev(estimates)
            curEst = sum(estimates)/len(estimates)
            print 'Est. = ' + str(curEst) +\
                  ', Std. dev. = ' + str(sDev)\
                  + ', Needles = ' + str(numNeedles)
            numNeedles *= 2
        return curEst

    estPi()
The integral of a function is the area under the curve described by its equation. So we can use Monte Carlo simulation to find the integral of an equation.
Single integral code:-
    import random

    def integrate(a, b, f, numPins):
        pinSum = 0.0
        for pin in range(numPins):
            pinSum += f(random.uniform(a, b))
        average = pinSum/numPins
        return average*(b - a)

    def one(x):
        return 1.0

    print integrate(0, 8, one, 100000)
Monte Carlo simulation is not a great way to solve a single integral, but it is of great help in solving a double integral.
Double integral code:-
    import random

    def doubleIntegrate(a, b, c, d, f, numPins):
        pinSum = 0.0
        for pin in range(numPins):
            x = random.uniform(a, b)
            y = random.uniform(c, d)
            pinSum += f(x, y)
        average = pinSum/numPins
        return average*(b - a)*(d - c)

    def f(x, y):
        return 4 - x**2 - y**2

    print doubleIntegrate(0, 1.25, 0, 1.25, f, 100000)
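As a rough sanity check on the double integral, the same integrand over the same square can be worked out by hand and compared with the Monte Carlo estimate. The sketch below is written for Python 3; the helper `exact_value`, the `rng` parameter and the fixed seed are illustrative additions, not part of the lecture code.

```python
import random

def double_integrate(a, b, c, d, f, num_pins, rng=random):
    """Monte Carlo estimate of the integral of f over [a, b] x [c, d]."""
    pin_sum = 0.0
    for _ in range(num_pins):
        x = rng.uniform(a, b)
        y = rng.uniform(c, d)
        pin_sum += f(x, y)
    average = pin_sum / num_pins
    return average * (b - a) * (d - c)

def f(x, y):
    return 4 - x**2 - y**2

def exact_value(s):
    """Closed form of the integral of 4 - x^2 - y^2 over [0, s] x [0, s]."""
    # 4*s^2 for the constant term, minus s * s^3/3 for each squared term
    return 4 * s**2 - 2 * s**4 / 3.0

rng = random.Random(0)  # seeded only so that the run is repeatable
estimate = double_integrate(0, 1.25, 0, 1.25, f, 100000, rng)
exact = exact_value(1.25)
print("estimate =", estimate, "exact =", exact)
```

With 100000 pins the estimate should agree with the closed form to roughly two decimal places; more pins narrow the gap only slowly.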
When the distribution is evenly distributed along the mean.
A binary variable can have only 2 possible outcomes, like heads or tails, 0 or 1, on or off.
Let's say we have a variable `A` which can take the values `Heads` or `Tails`; then `p(A=H)` is the probability of `A` taking the value `H`.
An important thing to note about probability is that it lies between 0 and 1, i.e. `0 <= p(A=H) <= 1`. This range of 0 to 1 is continuous and not discrete.
The value of `p(A)` lies in `[0,1]`, i.e. the probability of A happening is between 0 and 1.
So what is the probability of `A` not happening? `p(A^) = 1 - p(A)`
We have discussed the probability of a single event occurring and not occurring. Now let's check the probability of 2 independent events occurring. Consider the table below.
In this we are throwing 2 coins:-
A | p(A) | B | p(B) | p(A n B) |
---|---|---|---|---|
H | 1/2 | H | 1/2 | 1/2 * 1/2 |
T | 1/2 | T | 1/2 | 1/2 * 1/2 |
H | 1/2 | T | 1/2 | 1/2 * 1/2 |
T | 1/2 | H | 1/2 | 1/2 * 1/2 |
So from the above table, what is the probability `p(A = H and B = T)`? This statement is written as `p(A n B)`, i.e. A intersection B.
Therefore `p(A n B) = p(A) * p(B)`, because `A` and `B` are independent events.
The next thing we should try is `p(A = H or B = T)`, which is written as `p(A u B)`, i.e. A union B.
This is equal to the sum of the probabilities of the outcomes that satisfy it:
p(A = H and B = H) + p(A = H and B = T) + p(A = T and B = T) = 3/4
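The intersection and union for two coins can be checked by listing the sample space and counting. This is a small Python 3 sketch; the helper name `prob` is an illustrative choice.

```python
from fractions import Fraction
from itertools import product

# Sample space for two fair coins: every (A, B) pair is equally likely.
outcomes = list(product("HT", repeat=2))

def prob(event):
    """Fraction of equally likely outcomes for which event(a, b) is True."""
    favorable = [o for o in outcomes if event(*o)]
    return Fraction(len(favorable), len(outcomes))

p_and = prob(lambda a, b: a == "H" and b == "T")  # p(A n B)
p_or = prob(lambda a, b: a == "H" or b == "T")    # p(A u B)

print("p(A = H and B = T) =", p_and)
print("p(A = H or B = T)  =", p_or)
```

The union also equals `p(A) + p(B) - p(A n B)` = 1/2 + 1/2 - 1/4, which matches the count of 3 favorable outcomes out of 4.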
We can represent this in a Venn diagram.
We can show multiple events using these representations.
We represent a tree in this way:-
Similarly we can represent a dice throw in this way:-
But as you can see, as the number of throws increases, the number of branches also grows. So an easier way to represent it is a grid.
A grid for dice will look like this:-
What is the probability of getting the same number on both dice?
As seen in the figure above, the number of outcomes where both dice show the same number is 6, out of 36 possible outcomes, so the probability is 6/36.
What is the probability of getting a sum of 6 in a roll of two dice?
As seen in the figure above, the number of outcomes where the sum is 6 is 5, out of 36 possible outcomes, so the probability is 5/36.
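The grid for two dice is small enough to build in code and count directly. A minimal Python 3 sketch, with variable names chosen for illustration:

```python
from fractions import Fraction
from itertools import product

# Every cell of the 6 x 6 grid is one equally likely (die1, die2) outcome.
grid = list(product(range(1, 7), repeat=2))

same = [(a, b) for a, b in grid if a == b]
sum_six = [(a, b) for a, b in grid if a + b == 6]

p_same = Fraction(len(same), len(grid))
p_sum_six = Fraction(len(sum_six), len(grid))

print("cells:", len(grid))
print("same number:", len(same), "->", p_same)
print("sum of 6:", len(sum_six), "->", p_sum_six)
```

Note that `Fraction` reduces 6/36 to 1/6 when printing.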
So we can represent probabilities using these 3 methods: Venn diagrams, trees and grids.
Suppose we flip 3 coins. How many possible outcomes are there? There are 2^3 = 8, because we have 3 coins and each coin can take 2 values, so together the 3 coins can take 8 values. So the probability of any one outcome is 1/8.
So based on the above understanding, what is the probability of getting exactly 2 heads?
We can solve this using enumeration. The favorable outcomes are:-
HHT HTH THH
So 3 outcomes are favorable, and the probability of any one of these 3 happening is 3/8.
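The enumeration can also be done mechanically, and the count compared with the binomial coefficient "3 choose 2". This is a Python 3 sketch; `n_choose_k` is a hypothetical helper written out here rather than taken from the lecture.

```python
from itertools import product
from math import factorial

def n_choose_k(n, k):
    """Number of ways to pick k positions out of n."""
    return factorial(n) // (factorial(k) * factorial(n - k))

# All 2^3 sequences of three coin flips, e.g. 'HHT'.
sequences = ["".join(s) for s in product("HT", repeat=3)]
two_heads = [s for s in sequences if s.count("H") == 2]

print(two_heads)
print(len(two_heads), "of", len(sequences), "outcomes")
print("3 choose 2 =", n_choose_k(3, 2))
```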
The total number of outcomes will be 4^2, because one die can take 4 values, so with 2 such dice the total number of possible outcomes becomes 4^2 = 16.
So consider the above dice: what is the probability of rolling a 2 and a 3, in either order?
The favorable outcomes are:-
32 23
Each of these has probability 1/16, so the probability of either of them occurring is 2/16.
What is the probability that the sum of 1 roll is odd?
The probability is 1/2, because the sum is odd exactly when one die is odd and the other is even, which covers 8 of the 16 outcomes.
What is the probability that both dice have the same value?
With 4-sided dice, the number of favorable outcomes is 4 and the total number of possible outcomes is 16, so the probability is 4/16.
For n-sided dice, it will be 1/n.
What is the probability of getting an ace?
There are 4 aces in a pack of cards and 52 possible outcomes, so the probability is 4/52 = 1/13.
The probability of not getting an ace = 1 - 1/13 = 12/13.
What is the probability of getting a specific card?
It is 1/52.
If we draw two cards with replacement, the sample size = 52 ^ 2.
What is the probability of getting at least 1 ace?
1/13 + 1/13 - (1/13 * 1/13) = 25/169
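The "at least 1 ace" calculation can be checked two ways with exact fractions: by adding the two draws and subtracting the overlap, and by taking the complement of "no ace at all". The Python 3 sketch below assumes the two cards are drawn with replacement, so the draws are independent.

```python
from fractions import Fraction

p_ace = Fraction(4, 52)  # probability of an ace on a single draw

# p(A u B) = p(A) + p(B) - p(A n B), with p(A n B) = p(A) * p(B)
p_union = p_ace + p_ace - p_ace * p_ace

# Complement route: 1 - p(no ace on either draw)
p_complement = 1 - (1 - p_ace) ** 2

print("single draw:", p_ace)
print("at least one ace (union):", p_union)
print("at least one ace (complement):", p_complement)
```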
Until now we have dealt with problems where we knew some of the probabilities. What happens if we do not know the probability?
In that case we run a simulation.
If the probability is not known in advance, we can run a simulation and count how many heads we get when we do `n` trials.
The probability is then estimated as the number of times we got heads / the total number of trials.
We can find the value of Pi by doing this:-
We can also see the details of this process like this:-
At the end of lecture 14, we were flipping coins and trying to find the number of samples that is enough to safely say that the probability of getting heads or tails is 0.5.
We could flip a coin just 2 times and get one head and one tail; this gives a probability of 0.5, which is the correct answer, but the sample size is not good enough, because if we had flipped it twice and got heads both times, we could not conclude that the probability of heads is 1.
So the important question which we will answer in this lecture is: how many samples do we need to believe the answer?
Fortunately there is a very solid body of mathematics which will help us get to this answer.
Variance is at the root of the answer to the above question. Variance is a measure of how much spread there is in the possible outcomes.
To measure variance we need several different outcomes, which is why we run multiple trials rather than 1 trial with multiple flips in the case of the coin.
Why should we do a million trials of coin flips rather than 1 trial with a million flips? Because each trial gives us one outcome, and many outcomes together give us a fair idea of the spread of outcomes, i.e. the variance.
We can formalize the notion of spread using the standard deviation, which is the square root of the variance.
Informally, it tells us what fraction of the values are close to the mean.
As Wikipedia puts it:
A low standard deviation indicates that the data points tend to be very close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values.
If all values are the same, the standard deviation is 0.
Here is the mathematical formula for standard deviation, where $\mu$ is the mean of $X$:-
$$\sigma(X) = \sqrt{\frac{1}{|X|}\sum_{x \in X}(x - \mu)^2}$$
and the code to calculate standard deviation will be:-
    def stdDev(X):
        mean = sum(X)/float(len(X))
        total = 0.0
        for x in X:
            total += (x - mean)**2
        return (total/len(X)) ** 0.5
To see the steps for finding the standard deviation, see the link Standard Deviation Formulas.
We have now understood what a standard deviation is, but what is the use of it?
The standard deviation will be used to look at the relationship between the number of samples we have looked at and how much confidence we should have in the answer.
Let's understand standard deviation with the help of this code:-
    import random, pylab
    import example01  # local module that provides stdDev() as defined above

    def flipPlot(minExp, maxExp, numTrials):
        meanRatios = []
        meanDiffs = []
        ratiosSDs = []
        diffsSDs = []
        xAxis = []
        for exp in range(minExp, maxExp + 1):
            xAxis.append(2**exp)
        for numFlips in xAxis:
            ratios = []
            diffs = []
            for t in range(numTrials):
                numHeads = 0
                for n in range(numFlips):
                    if random.random() < 0.5:
                        numHeads += 1
                numTails = numFlips - numHeads
                ratios.append(numHeads/float(numTails))
                diffs.append(abs(numHeads - numTails))
            meanRatios.append(sum(ratios)/numTrials)
            # float() avoids integer division of the integer differences
            meanDiffs.append(sum(diffs)/float(numTrials))
            ratiosSDs.append(example01.stdDev(ratios))
            diffsSDs.append(example01.stdDev(diffs))
        pylab.plot(xAxis, meanRatios, 'bo')
        pylab.title('Mean Heads/Tails Ratios (' + str(numTrials) + ' Trials)')
        pylab.xlabel('Number of Flips')
        pylab.ylabel('Mean Heads/Tails')
        pylab.semilogx()
        pylab.figure()
        pylab.plot(xAxis, ratiosSDs, 'bo')
        pylab.title('SD Heads/Tails Ratios (' + str(numTrials) + ' Trials)')
        pylab.xlabel('Number of Flips')
        pylab.ylabel('Standard Deviation')
        pylab.semilogx()
        pylab.semilogy()
        pylab.figure()
        pylab.title('Mean abs(#Heads - #Tails) (' + str(numTrials) + ' Trials)')
        pylab.xlabel('Number of Flips')
        pylab.ylabel('Mean abs(#Heads - #Tails)')
        pylab.plot(xAxis, meanDiffs, 'bo')
        pylab.semilogx()
        pylab.semilogy()
        pylab.figure()
        pylab.plot(xAxis, diffsSDs, 'bo')
        pylab.title('SD abs(#Heads - #Tails) (' + str(numTrials) + ' Trials)')
        pylab.xlabel('Number of Flips')
        pylab.ylabel('Standard Deviation')
        pylab.semilogx()
        pylab.semilogy()

    flipPlot(4, 20, 20)
    pylab.show()
Now let us analyze the plots:-
In the above plot you can see that when the number of flips is small, the points vary significantly, and once the number of flips increases the ratio stabilizes around 1.
The coefficient of variation is simply the ratio of the standard deviation to the mean. If it is < 1, we think of it as low variance. There are some warnings which we should consider when dealing with the coefficient of variation.
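A small sketch of the coefficient of variation, reusing the same standard deviation definition as the notes. It is written for Python 3, and the function names are illustrative. One well-known caveat is built in: the ratio is undefined when the mean is 0, and it becomes very large and unstable when the mean is close to 0.

```python
def std_dev(xs):
    """Population standard deviation, as defined earlier in the notes."""
    mean = sum(xs) / float(len(xs))
    total = 0.0
    for x in xs:
        total += (x - mean) ** 2
    return (total / len(xs)) ** 0.5

def coeff_of_variation(xs):
    """Standard deviation divided by the mean."""
    mean = sum(xs) / float(len(xs))
    if mean == 0:
        # Assumption for this sketch: refuse rather than divide by zero.
        raise ValueError("coefficient of variation is undefined for mean 0")
    return std_dev(xs) / mean

values = [1, 2, 3, 3, 3, 4]
cv = coeff_of_variation(values)
print("std dev =", std_dev(values))
print("coefficient of variation =", cv)
```

Because the result is a ratio, it lets us compare the spread of data sets whose means are very different.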
Consider the below code:-
    import pylab

    L = [1,2,3,3,3,4]
    pylab.figure()
    pylab.hist(L, bins = 6)
    pylab.show()
If we use the code shown in the video as it is, we will not get the plot; we have to add these two calls to get the plot.
pylab.figure() pylab.show()
The sample output will look like this.
Now let us see a more useful piece of code for plotting a histogram:-
    import pylab, random

    def stdDev(X):
        mean = sum(X)/float(len(X))
        total = 0.0
        for x in X:
            total += (x - mean)**2
        return (total/len(X)) ** 0.5

    def flip(numFlips):
        heads = 0.0
        for i in range(numFlips):
            if random.random() < 0.5:
                heads += 1.0
        return heads/numFlips

    def flipSim(numFlipsPerTrails, numTrails):
        fracHeads = []
        for i in range(numTrails):
            fracHeads.append(flip(numFlipsPerTrails))
        return fracHeads

    def labelPlot(nf, nt, mean, sd):
        pylab.title(str(nt) + ' trials of ' + str(nf) + ' flips each')
        pylab.xlabel('Fraction of Heads')
        pylab.ylabel('Number of Trials')
        xmin, xmax = pylab.xlim()
        ymin, ymax = pylab.ylim()
        pylab.text(xmin + (xmax-xmin)*0.02, (ymax-ymin)/2,
                   'Mean = ' + str(round(mean, 6))
                   + '\nSD = ' + str(round(sd, 6)))

    def makePlots(nf1, nf2, nt):
        """nt = number of trials per experiment
           nf1 = number of flips 1st experiment
           nf2 = number of flips 2nd experiment"""
        fracHeads1 = flipSim(nf1, nt)
        mean1 = sum(fracHeads1)/float(len(fracHeads1))
        sd1 = stdDev(fracHeads1)
        pylab.hist(fracHeads1, bins = 20)
        xmin, xmax = pylab.xlim()
        ymin, ymax = pylab.ylim()
        labelPlot(nf1, nt, mean1, sd1)
        pylab.figure()
        fracHeads2 = flipSim(nf2, nt)
        mean2 = sum(fracHeads2)/float(len(fracHeads2))
        sd2 = stdDev(fracHeads2)
        pylab.hist(fracHeads2, bins = 20)
        # reuse the x limits of the first plot so the two can be compared
        pylab.xlim(xmin, xmax)
        ymin, ymax = pylab.ylim()
        labelPlot(nf2, nt, mean2, sd2)

    makePlots(100, 1000, 100000)
    pylab.show()
Some APIs which we should discuss from the above code are:-
`pylab.xlim()`, `pylab.ylim()` and `pylab.xlim(xmin, xmax)`
When called with no parameters, `xlim()` and `ylim()` return the current range of the `x` and `y` axes. When called with parameters, they set the limits of the `x` and `y` axes in place of the default limits.
The histogram will look like this.
Some interesting properties of the normal distribution:-
The above figure is not exactly a normal distribution, because it is not symmetric on both sides of the mean.
The normal distribution is used to create probabilistic models for 2 reasons.
Probability is all about estimating an unknown, such as the probability of getting heads or tails. Until now the estimate was based on the mean.
A confidence interval gives a way of estimating an unknown value by giving a range that is likely to contain it, together with a confidence that the unknown value lies within that range.
For example, consider the poll estimates given in newspapers: they report that the support for a candidate is 48% with +-4%. This says that the candidate's support lies between 44% and 52%. If the confidence level is not mentioned, it is by convention 95%.
When the press makes the above statement, their assumption is that the polls are random trials with a normal distribution.
**Empirical Rule:-** It gives a handy way to estimate confidence intervals, provided we know the mean and the standard deviation.
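The empirical rule is commonly quoted as: about 68% of normally distributed values lie within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3. The Python 3 sketch below checks those figures by sampling with `random.gauss`; the helper name, sample size and seed are illustrative choices.

```python
import random

def fraction_within(samples, mean, sd, k):
    """Fraction of samples that lie within k standard deviations of the mean."""
    inside = sum(1 for x in samples if abs(x - mean) <= k * sd)
    return inside / float(len(samples))

rng = random.Random(0)  # seeded only so that the run is repeatable
samples = [rng.gauss(0.0, 1.0) for _ in range(100000)]

fractions = {}
for k in (1, 2, 3):
    fractions[k] = fraction_within(samples, 0.0, 1.0, k)
    print("within", k, "standard deviation(s):", fractions[k])
```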
So how did the pollster find the standard deviation? Did they perform a massive survey of people? They do not; instead they use a trick to estimate the standard deviation, which is the standard error.
$$SE = \sqrt{\frac{p\,(100 - p)}{n}}$$
p : percentage of the sampled population that gave a particular answer.
n : sample size.
This is an estimate of the standard deviation under these assumptions.
Consider the below example: a pollster samples 1000 voters (n), and 46% of them (p) say they will vote for Abraham Lincoln. The standard error will be 1.58%. We interpret this to mean that 95% of the time, the percentage of votes that Lincoln gets will lie within 2 standard errors of 46%, i.e. between about 42.8% and 49.2%.
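The 1.58% figure can be reproduced with a few lines of arithmetic. This Python 3 sketch assumes the usual standard error formula for a percentage, `sqrt(p * (100 - p) / n)`, which matches the numbers quoted in the example; the function name is illustrative.

```python
import math

def standard_error(p, n):
    """Standard error, in percentage points, for a sampled percentage p
    (between 0 and 100) and a sample size n."""
    return math.sqrt(p * (100.0 - p) / n)

p, n = 46.0, 1000
se = standard_error(p, n)

# Roughly 95% of the time the true value lies within 2 standard errors.
low, high = p - 2 * se, p + 2 * se

print("standard error = %.2f%%" % se)
print("95%% interval = %.2f%% to %.2f%%" % (low, high))
```

Because `n` sits under the square root, quadrupling the sample size only halves the standard error.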
Consider the below code:-
    import pylab, random

    def stdDev(X):
        mean = sum(X)/float(len(X))
        total = 0.0
        for x in X:
            total += (x - mean)**2
        return (total/len(X)) ** 0.5

    def poll(n, p):
        # count how many of n voters say yes, each with probability p percent
        votes = 0.0
        for i in range(n):
            if random.random() < p/100.0:
                votes += 1
        return votes

    def testErr(n = 1000, p = 46.0, numTrials = 1000):
        results = []
        for t in range(numTrials):
            results.append(poll(n, p))
        print 'std = ' + str((stdDev(results)/n)*100) + '%'
        results = pylab.array(results)/n
        pylab.hist(results)
        pylab.xlabel('Fraction of Votes')
        pylab.ylabel('Number of Polls')

    testErr()
    pylab.show()
The output will look like this:-
The standard deviation is 1.58827785982%, and we estimated it to be 1.58% using the standard error, which is very close. So the standard error is a way of predicting the standard deviation.
Consider the below example code:-
    import pylab

    principal = 10000 #initial investment
    interestRate = 0.05
    years = 20
    values = []
    for i in range(years + 1):
        values.append(principal)
        principal += principal*interestRate
    pylab.plot(values)
    pylab.show()
The above code shows the power of compounding in investment, but the graph it plots is not very useful because you cannot tell what it shows: it does not have a title, and it does not say what the `X` and `Y` axes represent.
Fortunately it is easy enough to fix:-
    import pylab

    principal = 10000 #initial investment
    interestRate = 0.05
    years = 20
    values = []
    for i in range(years + 1):
        values.append(principal)
        principal += principal*interestRate
    pylab.plot(values)
    pylab.title('5% Growth, Compounded Annually')
    pylab.xlabel('Years of Compounding')
    pylab.ylabel('Value of Principal ($)')
    pylab.show()
With these 3 methods of `pylab`, `pylab.title()`, `pylab.xlabel()` and `pylab.ylabel()`, it is easy enough to do.
It is better to understand probability with an example: what is the probability of not getting a single 1 when a die is rolled 10 times?
The wrong way of solving this is as follows:-
The probability of not getting a 1 on the first roll is 5/6.
The probability of not getting a 1 on the second roll is 5/6.
Now the wrong way would be to add 5/6 ten times, but this is wrong because the final probability would be about 8.3, and we all know that a probability lies between 0 and 1. It can never be greater than 1.
The right way to solve this problem is as follows.
How many different 10-digit sequences can be made when we roll a 6-sided die 10 times? It is 6 ^ 10, just as the number of binary numbers that can be formed with 10 bits is 2^10.
As we saw in the wrong way,
the probability of not getting a 1 on the first roll = 5/6.
the probability of not getting a 1 on the second roll = 5/6.
The second roll does not depend on the outcome of the first roll; this is what we call independent events, and the probabilities of independent events multiply.
So after 10 rolls the probability will be
5/6 * 5/6 * 5/6 * 5/6 * 5/6 * 5/6 * 5/6 * 5/6 * 5/6 * 5/6 = (5/6) ^ 10 = 0.16150558288984572135006520008806
So the above explanation answers our initial question: what is the probability of not getting a single 1 when a die is rolled 10 times?
Now let's modify the question a little bit: what is the probability of getting at least one 1 when a die is rolled 10 times?
It is 1 - (5/6) ^ 10. We know that the probabilities of all the different outcomes sum to 1, and we have found the probability of not getting any 1, so the probability of getting at least one 1 is 1 - (5/6) ^ 10.
This is a very common trick in computing probability: when someone asks for the probability of x, we can simply calculate the probability of not x and then subtract it from 1.
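The complement trick for the 10 dice rolls can be checked both by arithmetic and by a quick simulation. This is a Python 3 sketch; the function name, trial count and seed are illustrative choices.

```python
import random

def no_ones(num_rolls, rng):
    """Roll a fair die num_rolls times; True if a 1 never shows up."""
    return all(rng.randint(1, 6) != 1 for _ in range(num_rolls))

exact_none = (5.0 / 6) ** 10          # p(no 1 in 10 rolls)
exact_at_least_one = 1 - exact_none   # complement

rng = random.Random(0)  # seeded only so that the run is repeatable
num_trials = 100000
hits = sum(1 for _ in range(num_trials) if no_ones(10, rng))
simulated_none = hits / float(num_trials)

print("exact p(no 1)        =", exact_none)
print("simulated p(no 1)    =", simulated_none)
print("exact p(at least one) =", exact_at_least_one)
```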
So let's solve another interesting problem.
Given a pair of fair dice rolled 24 times, what is the probability of rolling a double 6 at least once?
So let's solve this problem:-
The probability of getting a 6 in a single roll of the first die = 1/6
The probability of getting a 6 in a single roll of the second die = 1/6
So the probability of getting a double 6 in a single roll of both dice is 1/36, and the probability of not getting a double 6 in a single roll of both dice is 1 - 1/36 = 35/36.
So the probability of not getting a double 6, 24 times in a row = (35/36) ^ 24 = 0.50859612386909674041792802317515, and the probability of getting at least one double 6 is about 0.4914.
Now let's solve this using code, which we call a simulation, and see if the calculation is correct:-
    import random

    def rollDie():
        """returns a random int between 1 and 6"""
        return random.choice([1,2,3,4,5,6])

    def checkPascal(numTrials = 100000):
        yes = 0.0
        for i in range(numTrials):
            for j in range(24):
                d1 = rollDie()
                d2 = rollDie()
                if d1 == 6 and d2 == 6:
                    yes += 1
                    break
        print 'Probability of losing = ' + str(1.0 - yes/numTrials)

    checkPascal()
The above code is what we call a Monte Carlo simulation. Monte Carlo simulations are an example of what we call inferential statistics.
Inferential statistics, in brief, is based on one guiding principle: a random sample tends to exhibit the same properties as the population from which it is drawn.
The problem is that this assumption is sometimes wrong, as we will see later.
Consider an example: if we flip a coin 100 times and get 52 heads and 48 tails, what can we infer from this?
Do we infer that if we flipped the coin 100 more times we would get the same ratio of heads to tails? Probably not; we would not even be comfortable saying that there will be more heads than tails based on this sample.
A proper way to infer something is to ask how many tests were done and how close the result is to what we would get if things were purely random. This is related to what is called the null hypothesis.
Consider the below example:-
    import random

    def flipCoin(numFlips):
        heads = 0
        for i in range(numFlips):
            if random.random() < 0.5:
                heads += 1
        return heads/float(numFlips)
The above code is a simple coin flip simulation, which flips a coin `numFlips` times and tells us what fraction of the flips were heads. We already know that we should get an answer near 0.5. Here is the output when it is run with different numbers of flips:-
    Flip 100 times
    CoinFlip i = 0 flipCoin(100): 0.49
    CoinFlip i = 1 flipCoin(100): 0.44
    CoinFlip i = 2 flipCoin(100): 0.5
    CoinFlip i = 3 flipCoin(100): 0.46
    CoinFlip i = 4 flipCoin(100): 0.53
    CoinFlip i = 5 flipCoin(100): 0.45
    CoinFlip i = 6 flipCoin(100): 0.56
    CoinFlip i = 7 flipCoin(100): 0.51
    CoinFlip i = 8 flipCoin(100): 0.44
    CoinFlip i = 9 flipCoin(100): 0.56
    Flip 1000000 times
    CoinFlip i = 0 flipCoin(1000000): 0.499925
    CoinFlip i = 1 flipCoin(1000000): 0.499876
    CoinFlip i = 2 flipCoin(1000000): 0.500231
    CoinFlip i = 3 flipCoin(1000000): 0.499441
    CoinFlip i = 4 flipCoin(1000000): 0.499637
    CoinFlip i = 5 flipCoin(1000000): 0.49971
    CoinFlip i = 6 flipCoin(1000000): 0.49968
    CoinFlip i = 7 flipCoin(1000000): 0.50031
    CoinFlip i = 8 flipCoin(1000000): 0.499946
So as we can see, when `numFlips` is large, like 1000000, we get answers very close to 0.5.
The above is an example of the law of large numbers, or Bernoulli's law of large numbers.
The law states: in repeated independent tests with the same actual probability `P` of an outcome, the fraction of times that outcome occurs converges to `P` as the number of tests goes to infinity.
This law does not state that if we start out with deviations from the expected behavior, those deviations will eventually be evened out. That is, if in a series of coin flips we initially get a run of heads, it does not mean we will get more tails towards the end, because as mentioned above the tests are independent. Believing otherwise is what we call the gambler's fallacy.
This law also does not state that, as the number of trials increases, the absolute difference between the number of heads and tails will get smaller.
Now consider an example of this:-
    import random
    import pylab

    def flipPlot(minExp, maxExp):
        ratios = []
        diffs = []
        xAxis = []
        for exp in range(minExp, maxExp+1):
            xAxis.append(2 ** exp)
        print "xAxis: ", xAxis
        for numFlips in xAxis:
            numHeads = 0
            for n in range(numFlips):
                if random.random() < 0.5:
                    numHeads += 1
            numTails = numFlips - numHeads
            ratios.append(numHeads/float(numTails))
            diffs.append(abs(numHeads - numTails))
        pylab.title('Difference Between Heads and Tails')
        pylab.xlabel('Number of Flips')
        pylab.ylabel('Abs(#Heads - #Tails)')
        pylab.plot(xAxis, diffs)
        pylab.figure()
        pylab.plot(xAxis, ratios)
        pylab.title('Heads/Tails Ratios')
        pylab.xlabel('Number of Flips')
        pylab.ylabel('Heads/Tails')

    flipPlot(4,20)
    pylab.show()
Here is the output of the same:-
From the graph of the difference between heads and tails, it looks like we have a trend in which the difference shoots up dramatically. But is this really happening?
It appears this way because the default behavior of pylab is to connect the dots with lines. This can be a problem, because as a user you may think you have a trend when it may really be an outlier, since the number of points can be as low as 2 or 3.
So we modify the code to plot points:-
    import random
    import pylab

    def flipPlot(minExp, maxExp):
        ratios = []
        diffs = []
        xAxis = []
        for exp in range(minExp, maxExp+1):
            xAxis.append(2 ** exp)
        print "xAxis: ", xAxis
        for numFlips in xAxis:
            numHeads = 0
            for n in range(numFlips):
                if random.random() < 0.5:
                    numHeads += 1
            numTails = numFlips - numHeads
            ratios.append(numHeads/float(numTails))
            diffs.append(abs(numHeads - numTails))
        pylab.figure()
        pylab.title('Difference Between Heads and Tails')
        pylab.xlabel('Number of Flips')
        pylab.ylabel('Abs(#Heads - #Tails)')
        pylab.plot(xAxis, diffs, 'bo') #do not connect, show dot
        pylab.figure()
        pylab.plot(xAxis, ratios, 'bo') #do not connect, show dot
        pylab.title('Heads/Tails Ratios')
        pylab.xlabel('Number of Flips')
        pylab.ylabel('Heads/Tails')
        # pylab.semilogx()

    flipPlot(4,20)
    pylab.show()
The output will be:-
If we look at the "Difference between heads and tails" figure, we will see that the dots are pretty sparse, which makes it difficult to see a trend.
Since the scales of the x and y axes are pretty vast, change them to log axes (for example with `pylab.semilogx()` and `pylab.semilogy()`) and see the difference.
In the last lecture we had an error in our code: the random walk code did not output the correct values for small samples, which we had checked manually.
The problem was in the `simWalks()` method, which used the wrong argument. We had used this:-
distances.append(walk(f, homer, numTrials))
But we should have used:-
distances.append(walk(f, homer, numSteps))
So the complete corrected code is:-
    import random

    class Location(object):
        def __init__(self, x, y):
            """x and y are floats"""
            self.x = x
            self.y = y
        def move(self, deltaX, deltaY):
            """deltaX and deltaY are floats"""
            return Location(self.x + deltaX, self.y + deltaY)
        def getX(self):
            return self.x
        def getY(self):
            return self.y
        def distFrom(self, other):
            ox = other.x
            oy = other.y
            xDist = self.x - ox
            yDist = self.y - oy
            return (xDist**2 + yDist**2) ** 0.5
        def __str__(self):
            return '<' + str(self.x) + ', ' + str(self.y) + '>'

    class Field(object):
        def __init__(self):
            self.drunks = {}
        def addDrunk(self, drunk, loc):
            if drunk in self.drunks:
                raise ValueError('Duplicate Drunk')
            else:
                self.drunks[drunk] = loc
        def moveDrunk(self, drunk):
            if not drunk in self.drunks:
                raise ValueError('Drunk not in field')
            xDist, yDist = drunk.takeStep()
            self.drunks[drunk] = self.drunks[drunk].move(xDist, yDist)
        def getLoc(self, drunk):
            if not drunk in self.drunks:
                raise ValueError('Drunk not in field')
            return self.drunks[drunk]

    class Drunk(object):
        def __init__(self, name):
            self.name = name
        def takeStep(self):
            stepChoices = [(0,1), (0,-1), (1, 0), (-1, 0)]
            return random.choice(stepChoices)
        def __str__(self):
            return 'This drunk is named ' + self.name

    def walk(f, d, numSteps):
        start = f.getLoc(d)
        for s in range(numSteps):
            f.moveDrunk(d)
        return(start.distFrom(f.getLoc(d)))

    def simWalks(numSteps, numTrials):
        homer = Drunk('Homer')
        origin = Location(0, 0)
        distances = []
        for t in range(numTrials):
            f = Field()
            f.addDrunk(homer, origin)
            distances.append(walk(f, homer, numSteps))
        return distances

    def drunkTest(numTrials):
        for numSteps in [10, 100, 1000, 10000, 100000]:
        # for numSteps in [0,1]:
            distances = simWalks(numSteps, numTrials)
            print 'Random walk of ' + str(numSteps) + ' steps'
            print ' Mean =', sum(distances)/len(distances)
            print ' Max =', max(distances), 'Min =', min(distances)

    homer = Drunk("homer")
    origin = Location(0,0)
    field = Field()
    field.addDrunk(homer,origin)
    print "walk(field,homer,10): ", walk(field,homer,10)
    drunkTest(10)
The above random walk gives a different output every time you run it, and we cannot infer much from a single run. Problems of this type are called stochastic problems.
Newtonian physics is very comforting. For every cause there is an effect. Everything is deterministic.
Then came the Copenhagen Doctrine, associated with quantum physics, which changed this deterministic view of the world. It argued that,
natural change is necessarily by way of indeterministic physically discontinuous transitions between discrete stationary states
One can make probabilistic statements of the form "X is highly likely to happen", but not statements of the form "X is certain to happen." What they meant was: the world is stochastic.
But Einstein and Schrodinger disagreed with the Copenhagen Doctrine.
This disagreement practically divided the physics world, and at the heart of it was causal nondeterminism.
Causal nondeterminism is the belief that not every event is caused by a previous event. Einstein and Schrodinger disagreed with this. As Einstein famously said, "God does not play dice."
The alternative view is that our inability to make accurate measurements of the physical world makes it impossible to predict the future. So basically this means that things are not unpredictable; they just look unpredictable because we do not have enough information.
A process is stochastic if its next step depends on both the previous state and some random element.
The random elements in Python are introduced with the help of `random.random()`, which generates a random value that is at least 0.0 and less than 1.0.
Consider the example of rolling a die. If the die is rolled 10 times, which one of the following sequences is more likely?
1111111111 5442462412
The answer is that both sequences are equally likely, as each roll of the die is independent of the previous rolls.
In a stochastic process, two events are independent if the outcome of one event has no influence on the outcome of the other.
Consider one more example, this time with a coin. On a coin flip, how many different outcomes can we get? It is 2, heads or tails. So if we flip the coin 10 times, the total number of different sequences of 1 and 0 is 2^10.
So when flipping a coin 10 times, the sequence of ten 0s and the sequence of ten 1s are equally likely to happen.
The probability of getting all 1s is 1/2^10.
Probability = the fraction of the possible results that have the property we are testing for.
Probability lies between 0 and 1: 0 means the event will never happen, 1 means it is certain to happen.
Another interesting question is: what is the probability of getting anything other than all 1s?
1 - 1/2^10
Data visualization is very important, because a visual representation of data is usually more useful than simple print or log statements. We know this, but we rarely draw or plot graphs in a programming language because it is often difficult to do.
In Python it is very easy because of PyLab, which provides much of the functionality of MATLAB.
Consider this example:-
    import pylab

    pylab.plot([1,2,3,4], [1,2,3,4])
    pylab.plot([1,4,2,3], [5,6,7,8])
    pylab.show()
In the above code, the `show()` method finally displays the result as a plot. Often we save intermediate plots to a file and only display them at the end. Also, `show()` should be used only once in a program, and it is mostly placed at the end of the program.
Consider the below example:-
    import pylab

    pylab.figure(1)
    pylab.plot([1,2,3,4], [1,2,3,4])
    pylab.figure(2)
    pylab.plot([1,4,2,3], [5,6,7,8])
    pylab.savefig('firstSaved')
    pylab.figure(1)
    pylab.plot([5,6,7,10])
    pylab.savefig('secondSaved')
    pylab.show()
We can also label the x and y axes and give the graph a title:-
    pylab.title('5% Growth, Compounded Annually')
    pylab.xlabel('Years of Compounding')
    pylab.ylabel('Value of Principal ($)')
We will start with the example from the previous lecture; see the code here:-
    import datetime

    class Person(object):
        def __init__(self, name):
            #Create a person with name
            self.name = name
            try:
                firstBlank = name.rindex(' ')
                # print "__init__: firstBlank: ", firstBlank
                self.lastName = name[firstBlank+1:]
            except ValueError:
                # no blank in the name, so the whole name is the last name
                self.lastName = name
            self.birthDay = None
        def getLastName(self):
            #returns self's last name
            return self.lastName
        def setBirthday(self, birthDate):
            #assumes that birthDate is of type datetime.date
            #sets self's birthday to birthDate
            assert type(birthDate) == datetime.date
            self.birthDay = birthDate
        def getAge(self):
            #assumes that self's birthday is set
            #returns self's age in days
            assert self.birthDay != None
            return (datetime.date.today() - self.birthDay).days
        def __lt__(self, other):
            #returns True if self's name is lexicographically less
            #than other's name, and False otherwise
            if self.lastName == other.lastName:
                return self.name < other.name
            return self.lastName < other.lastName
        def __str__(self):
            return self.name

    class MITPerson(Person):
        nextIDNum = 0
        def __init__(self, name):
            #super(MITPerson, self).__init__()
            Person.__init__(self, name)
            self.idNum = MITPerson.nextIDNum
            MITPerson.nextIDNum += 1
        def getIdNum(self):
            return self.idNum
        def __lt__(self, other):
            return self.idNum < other.idNum
        def isStudent(self):
            return type(self)==UG or type(self)==G

    class UG(MITPerson):
        def __init__(self, name):
            MITPerson.__init__(self, name)
            self.year = None
        def setYear(self, year):
            if year > 5:
                raise OverflowError('Too many')
            self.year = year
        def getYear(self):
            return self.year

    class G(MITPerson):
        pass

    class CourseList(object):
        def __init__(self, number):
            self.number = number
            self.students = []
        def addStudent(self, who):
            if not who.isStudent():
                raise TypeError('Not a student')
            if who in self.students:
                raise ValueError('Duplicate student')
            self.students.append(who)
        def remStudent(self, who):
            try:
                self.students.remove(who)
            except ValueError:
                print str(who) + ' not in ' + self.number
        def allStudents(self):
            for s in self.students:
                yield s
        def ugs(self):
            indx = 0
            while indx < len(self.students):
                if type(self.students[indx]) == UG:
                    yield self.students[indx]
                indx += 1
As you can see, the `MITPerson` class is a specialization of the base class `Person`. The `Person` class has methods such as `__lt__`, which helps in comparison, and `__str__`, which gives the string representation of an object. The `MITPerson` class adds a unique `idNum` for each object, assigned from a class-level variable. This class also has a method `isStudent()`, which checks whether the object is a `UG` or a `G`.
We also had 2 more specialized classes: `UG` for undergraduates and `G` for graduates.
Now we can create a course list using this code:-

    m1 = MITPerson('Barbara Beaver')
    ug1 = UG('Jane Doe')
    ug2 = UG('John Doe')
    g1 = G('Mitch Peabody')
    g2 = G('Ryan Jackson')
    g3 = G('Jenny Liu')
    SixHundred = CourseList('6.00')
    SixHundred.addStudent(ug1)
    SixHundred.addStudent(g1)
    SixHundred.addStudent(ug2)

Now, to get the students' names in the course list, we can do this:-

    for student in SixHundred.students:
        print "student: ", student

But this is not the right way to do it, because we are accessing the instance variable `students` directly.
So if we check the `CourseList` class, we have a method `allStudents()`, which is implemented like this:-

    def allStudents(self):
        for s in self.students:
            yield s

    def ugs(self):
        indx = 0
        while indx < len(self.students):
            if type(self.students[indx]) == UG:
                yield self.students[indx]
            indx += 1

Looking at the above 2 methods, we can see that we are not building a list anywhere to return; instead we are using `yield`.
A function that contains `yield` is a generator. `yield` is like `return`, but with one big difference: when a function executes `return`, all the computation it did before the `return` is thrown away, so unless we first save every student in a list, we can hand back only the first student that matches.
A generator is a function which remembers the point in its body where it was when it last yielded, together with all its local variables, and resumes from there when the next value is requested.
So `yield` helps us run a loop like this without creating the list:

    print 'Students Squared'
    for s in SixHundred.allStudents():
        for s1 in SixHundred.allStudents():
            print "s= ", s, " s1: ", s1

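To make the difference between `return` and `yield` concrete, here is a minimal sketch that does not depend on the `CourseList` class. The function names `first_even` and `evens` and the sample data are made up for illustration, and the code is written to run under Python 3.

```python
def first_even(numbers):
    """Uses return: the function ends at the first match."""
    for n in numbers:
        if n % 2 == 0:
            return n


def evens(numbers):
    """Uses yield: the function pauses at each match and resumes later."""
    for n in numbers:
        if n % 2 == 0:
            yield n


data = [1, 2, 3, 4, 5, 6]

only_first = first_even(data)  # the loop state is lost after the return
gen = evens(data)              # creating the generator runs no code yet
first = next(gen)              # runs until the first yield
second = next(gen)             # resumes right after the first yield
remaining = list(gen)          # drains whatever is left
```

Each call to `next` continues from where the generator stopped, which is why no intermediate list of matches is ever built.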
From now on, the focus will be on using computers to solve computational problems. Historically, such problems were solved with analytic methods.
An analytic model predicts the behavior of a system from some initial conditions and a set of parameters. Analytic methods led to the development of calculus and probability theory.
This is a very nice approach, but it does not always work. As the amount of information increased, analytic models became insufficient.
There were systems for which building an analytic model was not possible, and this is the reason simulation was much more useful, like:-
The idea of simulation is to build a model with the following properties.
A simulation will not give an exact answer, and it may not give the same result every time we run it. So we run the simulation enough times to help us understand and predict the real behavior of the system.
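As a small sketch of this idea, the code below simulates flipping a fair coin and reports the fraction of heads. The coin-flip example and the names `fraction_of_heads`, `small_runs` and `large_run` are assumptions chosen for illustration; they are not part of the lecture code. The random number generator is seeded only so that the sketch is repeatable.

```python
import random


def fraction_of_heads(num_flips, rng):
    """Simulate num_flips fair coin flips and return the fraction of heads."""
    heads = 0
    for _ in range(num_flips):
        if rng.random() < 0.5:
            heads += 1
    return heads / float(num_flips)


rng = random.Random(0)  # seeded only to make the sketch repeatable

# Five short simulations: each one may give a different answer.
small_runs = [fraction_of_heads(10, rng) for _ in range(5)]

# One long simulation: the estimate settles near the true value 0.5.
large_run = fraction_of_heads(100000, rng)
```

Short runs scatter around the true value, while a long run lands close to it, which is why a simulation has to be run enough times before its result is trusted.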
Start with this Wiki
Brownian motion is an example of a random walk.
The basic idea behind a random walk is this: if we have a system of interacting objects, we model a simulation under the assumption that each of those objects moves some number of steps according to some random distribution. This is useful in these scenarios.
Consider the below scenario:-
Consider a drunken student out in a big field. They start in the middle of the field, and every second the student takes 1 step in 1 of the 4 cardinal directions (North, South, East, West), with each direction equally likely. After 1000 steps, how far away is the student from the initial position?
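Here is a table that explains the steps. For each step, the probability is the probability of the distance after the previous step multiplied by the probability of the move:-
No of Steps | Distance after previous step | Distance | Probability |
---|---|---|---|
1 Step | 0 | 1 | 1 |
2 Steps | 1 | 0 | 1/4 |
2 Steps | 1 | √2 | 1/2 |
2 Steps | 1 | 2 | 1/4 |
3 Steps | 0 | 1 | 1/4 * 1 = 1/4 |
3 Steps | √2 | 1 | 1/2 * 1/2 = 1/4 |
3 Steps | √2 | √5 | 1/2 * 1/2 = 1/4 |
3 Steps | 2 | 1 | 1/4 * 1/4 = 1/16 |
3 Steps | 2 | √5 | 1/4 * 1/2 = 1/8 |
3 Steps | 2 | 3 | 1/4 * 1/4 = 1/16 |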
Summing it up, the expected distance after each number of steps is:-
No of Steps | Expected Distance |
---|---|
1 Step | 1 |
2 Steps | (0 * 1/4) + (√2 * 1/2) + (2 * 1/4) ≈ 1.2 |
3 Steps | (1 * 9/16) + (√5 * 3/8) + (3 * 1/16) ≈ 1.6 |
The reason for doing the above steps is to get a feel for how things are going to happen. What we derive from the above is that, as the number of steps grows, the drunk student tends to be a little farther away from the initial position.
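The hand computation above can be checked by brute force. The sketch below enumerates every possible walk of a given length (all of them equally likely) and averages the final distance from the origin. The name `expected_distance` is made up for illustration, the code is written for Python 3, and the approach is only practical for small step counts because the number of walks is 4 to the power of the number of steps.

```python
import itertools
import math

# The four equally likely unit steps: North, South, East, West.
STEPS = [(0, 1), (0, -1), (1, 0), (-1, 0)]


def expected_distance(num_steps):
    """Average distance from the origin over all walks of num_steps steps."""
    total = 0.0
    count = 0
    for walk in itertools.product(STEPS, repeat=num_steps):
        x = sum(dx for dx, _ in walk)
        y = sum(dy for _, dy in walk)
        total += math.hypot(x, y)
        count += 1
    return total / count


for n in [1, 2, 3]:
    print(n, round(expected_distance(n), 3))
```

For larger step counts, exhaustive enumeration quickly becomes infeasible, which is one motivation for estimating the answer by simulation instead.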
So let us make a computational model based on our findings so far.
We have understood that we should model classes on the things we are most likely to see in the problem. Here these are the location, the field and the drunk.
So here is the code for these classes:-

    import random

    class Location(object):
        def __init__(self, x, y):
            """x and y are float"""
            self.x = x
            self.y = y

        def move(self, deltaX, deltaY):
            """deltaX and deltaY are float"""
            return Location(self.x + deltaX, self.y + deltaY)

        def getX(self):
            return self.x

        def getY(self):
            return self.y

        def distFrom(self, other):
            ox = other.x
            oy = other.y
            xDist = self.x - ox
            yDist = self.y - oy
            return (xDist**2 + yDist**2) ** 0.5

        def __str__(self):
            return '<' + str(self.x) + ', ' + str(self.y) + '>'

    class Field(object):
        def __init__(self):
            self.drunks = {}

        def addDrunk(self, drunk, loc):
            if drunk in self.drunks:
                raise ValueError('Duplicate Drunk')
            else:
                self.drunks[drunk] = loc

        def moveDrunk(self, drunk):
            if drunk not in self.drunks:
                raise ValueError('Drunk not in field')
            xDist, yDist = drunk.takeStep()
            self.drunks[drunk] = self.drunks[drunk].move(xDist, yDist)

        def getLoc(self, drunk):
            if drunk not in self.drunks:
                raise ValueError('Drunk not in field')
            return self.drunks[drunk]

    class Drunk(object):
        def __init__(self, name):
            self.name = name

        def takeStep(self):
            stepChoices = [(0, 1), (0, -1), (1, 0), (-1, 0)]
            return random.choice(stepChoices)

        def __str__(self):
            return 'This drunk is named ' + self.name

So let's dissect this code:-
* `Location` class:
– `def __init__(self, x, y)`: takes two values, `x` and `y`, which are stored in instance variables.
– `getX()` and `getY()`: return the `x` and `y` coordinates of the instance.
– `distFrom()`: returns the distance between this location and another location, for example between the initial position and the new position.
– `move()`: returns a new `Location` shifted by `deltaX` and `deltaY`; the instance itself is not modified.
– `deltaX` and `deltaY` are floats, meaning that in future we can move along a diagonal and not only in a cardinal direction.
* `Field` class: maps drunks to locations.
– `def __init__(self):` creates a dictionary of `drunks`.
– `addDrunk(self, drunk, loc)`: adds the drunk at the given location, after checking that `drunk` is not a duplicate.
– `moveDrunk(self, drunk):`
– `xDist, yDist = drunk.takeStep()`: gets the x and y components of the step.
– `self.drunks[drunk] = self.drunks[drunk].move(xDist, yDist)`: `self.drunks[drunk]` gets the location of the particular drunk, calls `move()` on it, and updates the drunk's location.
– `getLoc(self, drunk):` returns the location of the drunk.
* `Drunk` class:
– `name`
– `takeStep()`: the only method, which picks a random step from a set of steps, where every step is equally likely.

There is a bug in the above code, which we will see next time. Here is the test code for the above:-

    def walk(f, d, numSteps):
        start = f.getLoc(d)
        for s in range(numSteps):
            f.moveDrunk(d)
        return(start.distFrom(f.getLoc(d)))

    def simWalks(numSteps, numTrials):
        homer = Drunk('Homer')
        origin = Location(0, 0)
        distances = []
        for t in range(numTrials):
            f = Field()
            f.addDrunk(homer, origin)
            distances.append(walk(f, homer, numTrials))
        return distances

    def drunkTest(numTrials):
        # for numSteps in [10, 100, 1000, 10000, 100000]:
        for numSteps in [0, 1]:
            distances = simWalks(numSteps, numTrials)
            print 'Random walk of ' + str(numSteps) + ' steps'
            print ' Mean =', sum(distances)/len(distances)
            print ' Max =', max(distances), 'Min =', min(distances)

    drunkTest(10)

Classes allow us to define custom types. A class groups data (attributes) and methods together, which is called encapsulation. We have already been using built-in classes like `int`, `float`, `dict` etc.
Classes also allow for inheritance and polymorphism. Polymorphism lets us expect the same functionality from a subclass object as from the base class.
We can implement a class with a dictionary, but will it provide all the functionality of a class? Consider the below class implemented as a dict.

    def makePerson(name, age, height, weight):
        person = {}
        person['name'] = name
        person['age'] = age
        person['height'] = height
        person['weight'] = weight
        return person

    def getName(person):
        return person['name']

    def setName(person, name):
        person['name'] = name

    def getAge(person):
        return person['age']

    def setAge(person, age):
        person['age'] = age

    def getHeight(person):
        return person['height']

    def setHeight(person, height):
        person['height'] = height

    def getWeight(person):
        return person['weight']

    def setWeight(person, weight):
        person['weight'] = weight

    def printPerson(person):
        print 'Name: ', getName(person), " Age: ", getAge(person), " height: ", getHeight(person), " weight: ", getWeight(person)

    def equalPerson(person1, person2):
        return getName(person1) == getName(person2)

A few things to remember about the dict-based implementation above: the functions `getName()`, `getAge()`, `getHeight()` and `getWeight()` are called getter methods.
The functions `setName()`, `setAge()`, `setHeight()` and `setWeight()` are called setter methods.
Together these are called accessors and mutators, for accessing values and mutating values.
We can do most of the things which a class can do using the above implementation, but in a few places it fails.
Like:-

    print "type(mitch): ", type(mitch)

will return

    <type 'dict'>

So a lot of things which we can do using a class can also be done using a dict, but it is not intuitive, i.e. it does not behave like a built-in data type.
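A short sketch of where the dict-based approach falls short. The helper `make_person` is a cut-down, hypothetical version of `makePerson` above, and all names and values are made up for illustration (written for Python 3).

```python
def make_person(name, age):
    """Cut-down stand-in for makePerson: a person is just a dict."""
    return {'name': name, 'age': age}


mitch = make_person('Mitch', 30)
fruit = {'name': 'apple', 'age': 2}  # not a person, but the same shape

# Python cannot tell a "person" from any other dict.
type_is_dict = type(mitch) is dict
same_type = type(mitch) == type(fruit)

# The built-in == compares every key and value, so two records for the
# same name with different ages are unequal unless we remember to call
# a helper such as equalPerson() ourselves.
a = make_person('Sarina', 25)
b = make_person('Sarina', 40)
equal_by_default = (a == b)
```

With a real class, the type carries the name `Person`, and `==` can be made to compare names by defining `__eq__`.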
So let's see what a class implementation looks like.

    class Person(object):
        def __init__(self, name, age, height, weight):
            self.name = name
            self.age = age
            self.height = height
            self.weight = weight

        # This type of method is called accessor (or getter)
        def getName(self):
            return self.name

        # This type of method is called mutator (or setter)
        def setName(self, name):
            self.name = name

        # accessor: returns the age (not the name)
        def getAge(self):
            return self.age

        # mutator
        def setAge(self, age):
            self.age = age

        # accessor
        def getHeight(self):
            return self.height

        # mutator
        def setHeight(self, height):
            self.height = height

        # accessor
        def getWeight(self):
            return self.weight

        # mutator
        def setWeight(self, weight):
            self.weight = weight

        # UnderBar methods have special significance in python
        def __str__(self):
            return 'Name: ' + self.name + ' Age: ' + str(self.age) + ' height: ' + str(self.height) + ' weight: ' + str(self.weight)

        def __eq__(self, other):
            return self.name == other.name

Now the major difference comes in these two lines of code:-

    print "type(mitch): ", type(mitch)
    print "mitch == sarina: ", mitch == sarina

The first one will print `<class '__main__.Person'>`, so now it is not a `dict`, and the second line makes the comparison work the same way as for a built-in data type.
One more interesting thing which we should note is this code:-

    print "Person.getAge(mitch): ", Person.getAge(mitch)

This will also print `mitch`'s age, but the catch here is that we are invoking the `getAge()` method on the `Person` class and passing `mitch` as the `self` parameter.
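The two ways of calling a method can be put side by side. This is a minimal sketch with a cut-down `Person` class; the values are made up for illustration (written for Python 3).

```python
class Person(object):
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def getAge(self):
        return self.age


mitch = Person('Mitch', 30)  # hypothetical values

via_instance = mitch.getAge()     # Python passes mitch as self for us
via_class = Person.getAge(mitch)  # we pass mitch as self explicitly
```

Both calls run the same function; the dot notation on an instance is just the convenient form that fills in `self` automatically.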
A better illustration of inheritance and polymorphism is shown below:-

    class Shapes(object):
        def area(self):
            raise NotImplementedError

        def perimeter(self):
            raise NotImplementedError

        def __eq__(self, other):
            return self.area() == other.area()

        def __lt__(self, other):
            # a shape is "less than" another if its area is smaller
            return self.area() < other.area()

    class Rectangle(Shapes):
        def __init__(self, side1, side2):
            self.side1 = side1
            self.side2 = side2

        def area(self):
            return self.side1 * self.side2

        def perimeter(self):
            return 2 * self.side1 + 2 * self.side2

        def __str__(self):
            return 'Rectangle(' + str(self.side1) + ', ' + str(self.side2) + ')'

    class Circle(Shapes):
        def __init__(self, radius):
            self.radius = radius

        def area(self):
            return 3.14159 * (self.radius ** 2)

        def perimeter(self):
            return 2.0 * 3.14159 * self.radius

        def __str__(self):
            return "Circle(" + str(self.radius) + ")"

    class Square(Rectangle):
        def __init__(self, side):
            self.side = side
            Rectangle.__init__(self, side, side)

        def __str__(self):
            return "Square(" + str(self.side) + ")"

    s = Shapes()
    #print s.area() # raises NotImplementedError
    r = Rectangle(2, 4)
    sq = Square(4)
    c = Circle(10)
    print "Rectangle area: ", r.area()
    print "Square area: ", sq.area()
    print "Circle area: ", c.area()
    print "Rectangle(2,4) == Square(4): ", r == sq
    print "Rectangle(2,4) < Square(4): ", r < sq
    print "Circle(10) == Square(4): ", c == sq
    print "Circle(10) < Square(4): ", c < sq

    # Because of polymorphism and inheritance, we don't need to
    # be concerned with which Shapes we are calling area() on
    listOfShapes = [c, sq, r]
    for shapes in listOfShapes:
        print "type(shapes): ", type(shapes), "shapes.area(): ", shapes.area()

    listOfShapes.sort()
    print "----------------"
    for shapes in listOfShapes:
        print shapes, "shapes.area(): ", shapes.area()

The only thing in this code which we have not studied so far is shown below:-

    class Shapes(object):
        def area(self):
            raise NotImplementedError

        def perimeter(self):
            raise NotImplementedError

This code forces each subclass to implement the `area()` and `perimeter()` methods; if a subclass does not, calling the method raises `NotImplementedError`.
Also we can do this:-

    listOfShapes = [c, sq, r]
    for shapes in listOfShapes:
        print "type(shapes): ", type(shapes), "shapes.area(): ", shapes.area()

Because of polymorphism, we can call `area()` on any of these objects, relying only on the methods available in the base class, and it will invoke the correct `area()` method.
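A minimal sketch of what happens when a subclass forgets to implement a required method. The classes `Shape`, `Square` and `Blob` and the method `describe` are made up for illustration (written for Python 3).

```python
class Shape(object):
    def area(self):
        raise NotImplementedError

    def describe(self):
        # Works for any subclass that provides its own area().
        return type(self).__name__ + ' with area ' + str(self.area())


class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side * self.side


class Blob(Shape):
    pass  # forgot to implement area()


ok = Square(3).describe()

try:
    Blob().describe()
    blob_failed = False
except NotImplementedError:
    blob_failed = True
```

Note that the error appears only when the missing method is actually called; defining or instantiating `Blob` does not fail by itself.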
Object-oriented programming started with the development of these 2 programming languages.
These languages did not bring OOP concepts into the mainstream; OOP became widely popular with the advent of Java. C++ and Python are other widely used languages that support it.
The core of all object-oriented programming is the abstract data type. The fundamental use of an abstract data type is that we can add user-defined data types which behave just like built-in data types.
The question arises: why do we call it an abstract data type, and not just a data type?
The reason is that we define
Let's see an example of this:-

    class intSet(object):
        """An intSet is a set of integer."""
        def __init__(self):
            """Create an empty set of integer"""
            self.numBuckets = 47
            self.vals = []
            for i in range(self.numBuckets):
                self.vals.append([])

        def hashE(self, e):
            # Private function, should not be used outside the class.
            return abs(e) % len(self.vals)

        def insert(self, e):
            """Assumes e is an integer, and insert e into self"""
            for i in self.vals[self.hashE(e)]:
                if i == e:
                    return
            self.vals[self.hashE(e)].append(e)

        def member(self, e):
            """Assumes e is an integer, Returns True, if e is in self, else False"""
            return e in self.vals[self.hashE(e)]

        def __str__(self):
            """Returns a string representation of self"""
            elem = []
            for buckets in self.vals:
                for e in buckets:
                    elem.append(e)
            elem.sort()
            result = ''
            for e in elem:
                result = result + str(e) + ','
            return '{' + result[:-1] + "}"

Let's dissect the code written above.
* `class intSet(object)`
– This defines a new type, `intSet`, which is done by using the keyword `class`.
– The class is derived from `object`. We will discuss this `object` when we discuss inheritance.
* `def __init__(self):`
– When a function has `__` in its name, like `__init__`, it has a special meaning. In particular, `__init__` is called when an object of `intSet` is initialized, and it initializes two variables/attributes: `self.numBuckets = 47` and `self.vals = []`.
– `self`: what does this mean? To understand this, consider the below example:-

    s = intSet()
    print self.numBuckets

The above code gives an error.

    NameError: name 'self' is not defined

The reason is that `self` is not available in the global scope. This code will however work.

    print s.numBuckets

To understand this, we can say that `numBuckets` and `vals` are attributes of the object `s`, which is an instance of the class `intSet`. `self` is used to refer to the object being created.
We can also note that `__init__` takes a parameter `self`, but when we created `intSet()` we did not pass any object. The reason is that Python passes the newly created object as `self` automatically, even though we do not write it in the call.
* `def hashE(self,e):`
– This is a private function, as specified by the comment in the function; it is not exposed as part of the interface. This is a convention and is not enforced by the language, which means that we can call it from outside just like any other method. But good programmers always program to the specification and not to the implementation.
* `def insert(self,e):`
– This is an exposed method. If we look at its formal parameters, there are two: `self` and `e`.
– We call `insert` like this: `s.insert(i)`. What this does is pass `s`, the object before the dot, to the formal parameter `self`, and `i` to `e`.
– By convention, the first parameter is always called `self` in Python. It is not enforced, but it is a convention.
* `def __str__(self):`
– This is again a special method: when we call `print "s: ", s`, it will invoke `__str__`.
* We should not reference the data attributes of the class directly, as in `print "s.vals: ", s.vals`, because they are not part of the specification. If we reference them, our program might break in future: `vals` may not be present in a later implementation, and the program should still work. Always remember the golden words: program to the specification and not to the implementation. This concept is called data hiding.
Data hiding is the fundamental idea which makes abstract data types useful. Java provides a mechanism to enforce data hiding, but Python does not.
The things which we are hiding are:-
* Instance variables: `numBuckets` and `vals`.
* Class variables.
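To illustrate why programming to the specification matters, the sketch below gives two interchangeable implementations of an integer set. `IntSetBuckets` is a cut-down variant of `intSet`, while `IntSetBuiltin` and the client function `count_members` are hypothetical names invented for this example (written for Python 3). The client touches only `insert()` and `member()`, so it works with either implementation.

```python
class IntSetBuckets(object):
    """Stores integers in hash buckets, in the style of intSet."""
    def __init__(self):
        self.vals = [[] for _ in range(47)]

    def insert(self, e):
        bucket = self.vals[abs(e) % len(self.vals)]
        if e not in bucket:
            bucket.append(e)

    def member(self, e):
        return e in self.vals[abs(e) % len(self.vals)]


class IntSetBuiltin(object):
    """Same interface, but backed by Python's built-in set (no vals)."""
    def __init__(self):
        self.items = set()

    def insert(self, e):
        self.items.add(e)

    def member(self, e):
        return e in self.items


def count_members(int_set, candidates):
    """Client code: relies only on the member() method."""
    return sum(1 for c in candidates if int_set.member(c))


results = []
for cls in (IntSetBuckets, IntSetBuiltin):
    s = cls()
    for i in [3, 50, 3, -4]:
        s.insert(i)
    results.append(count_members(s, [3, 4, -4, 50, 97]))
```

A client that had read `s.vals` directly would break with the second implementation, because that attribute no longer exists there.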
When we design programs using OOP, we do not look at the implementation details; we should be able to think in terms of abstractions. Inheritance is an important feature in this regard.
Inheritance is used to set up a hierarchy of abstractions. The idea is to identify what most of the objects have in common and make a class which holds this common functionality. We should be able to identify the attributes shared between different classes.
So, to design an MIT database, the base class will look like this:-

    import datetime

    class Person(object):
        def __init__(self, name):
            # Create a person with name
            self.name = name
            try:
                firstBlank = name.rindex(' ')
                self.lastName = name[firstBlank+1:]
            except ValueError:
                # name contains no blank, so the whole name is the last name
                self.lastName = name
            self.birthDay = None

        def getLastName(self):
            # returns self's last name
            return self.lastName

        def setBirthday(self, birthDate):
            # assumes that birthDate is of type datetime.date
            # sets self's birthday to birthDate
            assert type(birthDate) == datetime.date
            self.birthDay = birthDate

        def getAge(self):
            # assumes that self's birthday is set
            # returns self's age in days
            assert self.birthDay != None
            return (datetime.date.today() - self.birthDay).days

        def __lt__(self, other):
            # returns True if self's name is lexicographically less
            # than other's name, and False otherwise
            if self.lastName == other.lastName:
                return self.name < other.name
            return self.lastName < other.lastName

        def __str__(self):
            return self.name

Now let's dissect this code:-
* `def __init__(self, name):` is the same as before, nothing new; it will be called when the object is created, with the parameter `name`.
* `def getLastName(self):` is a method which returns the value of the attribute `lastName`. The reason this method is present is that we do not want users of this abstraction to have direct access to the data attribute `lastName`, so that we can put some checks in place.
– Such methods are named `get` something, and return information about the instance of the class.
* `def setBirthday(self,birthDate):` is a set method; the reason for its existence is similar to that of the get methods.
* `def __lt__(self,other):` is another of those magic methods; it is used to compare two objects of type `Person`. The reason we use `__lt__` and not an ordinary function name such as `less` is that `__lt__` is predefined in Python, just like `__str__`, and it is called automatically when we do the comparison `<` on Person objects. So the Person class behaves exactly like a built-in data type; a specially named function like `less` would get the job done, but it would not work the same way as a built-in data type.

Let's consider this class:-

    class MITPerson(Person):
        nextIDNum = 0

        def __init__(self, name):
            # equivalent: super(MITPerson, self).__init__(name)
            Person.__init__(self, name)
            self.idNum = MITPerson.nextIDNum
            MITPerson.nextIDNum += 1

        def getIdNum(self):
            return self.idNum

        def __lt__(self, other):
            return self.idNum < other.idNum

        def isStudent(self):
            return type(self) == UG or type(self) == G

* `MITPerson` is a subclass of `Person`, because in the line `class MITPerson(Person):` we pass `Person` as the argument in place of `object`, which we were passing till now. This means that `MITPerson` gets all the attributes of class `Person`, and adds some attributes and functionality of its own.
* `nextIDNum = 0` is a class variable, which means it belongs to the class and not to just one object of the class. So it is shared across all objects of this class type.
* `def __lt__(self, other):` The base class, `Person`, also has an `__lt__` method, which is overridden here. Overriding lets a subclass modify a base class method to suit its own attributes. In the case of `MITPerson`, we want the comparison to be based on `idNum` and not on the name, as is done in the `Person` class.
– `Person.__lt__(p1, p2)` calls the `Person` version of the comparison explicitly.
– `p4 < p3`: when we do a comparison like this, Python takes `p4`, looks up its `__lt__` method, and passes `p3` as the argument.

Now consider this new class:-

    class UG(MITPerson):
        def __init__(self, name):
            MITPerson.__init__(self, name)
            self.year = None

        def setYear(self, year):
            if year > 5:
                raise OverflowError('Too many')
            self.year = year

        def getYear(self):
            return self.year

Let's dissect this code:-
* `class UG(MITPerson):` means `UG` is a subclass of `MITPerson`.
* `MITPerson.__init__(self, name)`: this piece of code calls `MITPerson`'s `__init__`, after which `UG` continues to initialize itself.
* What happens when we do `ug2 < ug1`? There is no `__lt__` function in the `UG` class, so since it inherits from `MITPerson`, it will invoke the `__lt__` of `MITPerson`. If it were not present in `MITPerson`, the lookup would go further up the hierarchy until it reaches `object`.

Now consider this class:-

    class G(MITPerson):
        pass

    g1 = G('Mitch Peabody')

* `G` is not doing anything, it just says `pass`; this means `G` is a `MITPerson` with no special properties.
* The reason for having it is its `type`: `type(g1)` will be `G`.
The answers to the above question can be found here.
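The points about class variables, overriding and method lookup can be tried out with a cut-down version of the hierarchy. This is only a sketch: the classes below keep just the parts needed for the demonstration, and the names 'Zoe' and 'Adam' are made up (written for Python 3).

```python
class Person(object):
    def __init__(self, name):
        self.name = name

    def __lt__(self, other):
        return self.name < other.name


class MITPerson(Person):
    nextIDNum = 0  # class variable, shared by all instances

    def __init__(self, name):
        Person.__init__(self, name)
        self.idNum = MITPerson.nextIDNum
        MITPerson.nextIDNum += 1

    def __lt__(self, other):
        # overrides Person.__lt__: compare by id instead of by name
        return self.idNum < other.idNum


class G(MITPerson):
    pass  # no new behavior, but a distinct type


g1 = G('Zoe')   # gets idNum 0
g2 = G('Adam')  # gets idNum 1

by_id = g1 < g2                  # G has no __lt__, so MITPerson's is used
by_name = Person.__lt__(g1, g2)  # explicitly call the Person version
ordered = [p.name for p in sorted([g2, g1])]  # sorting uses __lt__ too
```

Here `by_id` is true because 0 < 1, while `by_name` is false because 'Zoe' does not come before 'Adam'; `type(g1)` is `G`, even though `G` adds nothing to `MITPerson`.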