Neural Network Models in Excel
for Prediction and Classification
Here are two little toys in Excel. Using them you can build your own
Neural Network models for prediction and classification in Excel.
Also available here is a small powerpoint based
tutorial
on neural network based modeling for the beginners.
If you happen to be interested in Self Organizing Maps (SOM), click
here to see a small
clustering tool in Excel
using SOMs. Also, if you are interested in tree based Classification
Model, here is a
tree based Classifier in Excel.
Once you download and unzip the files, you will find one
Excel file inside each of these zip files. Before running the
applications, I suggest that you go through the ReadMe worksheet of
these Excel files. They contain brief instruction on how to run the
application.
A few words about the application ...
There must be quite a few Neural Network implementations in Excel - freely available
all over the web.
So why another one ? I tell you why. ....I don't know!
May be, you should check out whether this is useful, whether this is
any good, whether this is any better than the others. Here is a very
basic implementation of FeedForward - BackPropagation Neural Network,
used for prediction and classification problems. You need to be a
little bit conversant with the lingo here. I am expecting, you are at
least familiar with the Neural Network terminology and have some vague
idea about what it is, what it does, how it is trained and used .. etc.
etc.
If you are not very familiar, search for Neural Network tutorials in Google. I bet, you will
get hundreds of them. However, nothing stops you from downloading and playing with the toy -
just for fun. Hey ! it is free ..
If you are a real pro in Neural Network modeling, I doubt you would be looking at this.
I bet you have powerful specialized software doing Neural Network. You would probably have
SAS, SPSS or a whole bunch of other stuff which will do the job for you. This is, however,
a poor man's version of the things. Not very big, doesn't need any other stuff to be installed
on your machine. Here I am guessing that like most of the mere mortals (e.g. me), you
will have MS-Excel on your machine. If not, pardon me for an uneducated guess.
Kind of uses I can think of ....
May be you have read and heard quite a bit about Neural Network, you
would like to play with it yourself to find out what it is all about.
Then, if you have Excel, just download the files and go.
May be you are a student and would like to learn more about Neural Net
by experimenting with it.
May be you are a teacher and thought that a small application in Excel would
have come real handy for your classroom demo.
May be you are an analyst in the industry - heard a lot of hype about
Data Mining and stuff, heard quite a lot about what Neural Nets can do
for you, your business and hence your salary. You wish you could have a
feel for the stuff without actually buying a costly software.
Well... A lot of May Be's. Or may be you are none of the above. Whoever you are, if you
have some spare time and a bit of curiosity, go ahead and check it out. Who knows? You might find
it useful.
Here comes the techy part ....
Here are some nice features of the application ......
The columns for input and output variables in your data need not reside in contiguous columns.
- Let me explain this with an example. Suppose you have specified
that you want to use two output variables and four input variables. Now
assume that you have eight columns in your data set. You can tell the
application that columns 7 and 8 are your output variables and columns
1,3, 5 and 6 are your input variables and you don't want to use the
columns 2 and 4. This comes in very handy if you want to try out
various
models on the same data set keeping and droping different input
variables for different
models.
You can fit a multilayered network.
You can break up your data set into
training and
validation set. See the model performance
on training and validation set as the model trains. You can either
randomly select a portion
of your data as validation set or you can specify that the last few rows of your data
should be treated as the validation set.
You can train the net either in
sequential mode or in
batch mode.
By default, the application generates the random starting weights. However, You have the option to specify your own choice of starting weights.
Once you have trained the network with a particular architecture, the
application saves those
weights. The next time you want to train the same network, you have the
option of choosing those weights, which are already saved.
You have the option of presenting the
training data in a random order
while training the net.
Application can handle
categorical data . The usual method of
using categorical data in a neural network model is to convert them to
dummy variables. A categorical variable with k categories in your data is replaced by k binary
variables (taking values 0 or 1).
Conversion from this categorical to binary variables is automatic in
the application. You don't have to convert and then present it to the
application. You just say, which column of your data is categorical and
the application will do the conversion.
Application has some capability of handling
missing data . For continuous
variable, any non-number in any of your data column will be treated as
missing value. Application will replace all the missing
values in a column by respective column means. For categorical
variables, any blank cell or any cell containing an Excel error will be
treated as missing value and application will replace it with the most
frequently occuring category.
For Prediction problems -
the application tracks two kinds of error. It tracks MSE
or Mean Squared error- which is the squared difference between the actual output and the predicted output. It also tracks ARE or Absolute relative
error - which is the absolute value of
[( Actual Output - Predicted Output) / Actual Output ].
ARE expressed in terms of % , tells you the average size of prediction error, relative to the
actual output.
For Classification problems -
the application tracks Mis-Classification error.
That is, in what percent of records in your training / validation data, the model predicted the class wrongly.
It also reports the Confusion Matrix on your training / validation data.
Application generates Profile Plots for the fitted model.
Here is what it's all about. For prediction problem, you are predicting
a response varible Y based on predictors X1, X2,...,Xp.
The fitted model is actually a surface in p-dimension. Unless p is 2 or
less, we cannot depict this surface graphically. Profile Plot is the
next best way to do this. To generate the Profile plot, we vary only one predictor, say X1 between two values a and b and keep all the other predictors fixed at certain pre-specified values. We look at the values Y as predicted by the model and plot it against X1. The curve we get is the profile of Y against X1.
Profile plot is really a one dimesional cross section of a
p-dimensional surface. Note that the curve changes as we fix the the
other predictors at some other values.
Profile plot tells you the follwoing two things -
(1) The nature of replation ship between the individual predictors and
the response. It tells you graphically whether the relationship is
linear, increasing, decreasing, non-linear or whatever the case may be.
(2) Whether there is any interaction between two predictors. Suppose you are looking at the profile plot of Y against X1 by fixing X2 at a value 10. Then you look at the same profile plot by fixing X2
at some other value, say 20. Now suppose you find that the shape of the
profile plot changes drastically (say from increasing to decreasing,
from linear to non-linear etc.) as you change the X2 from 10 to 20. This observation tell you that there is an interaction between the predictors X1 and X2. The effect of X1 on Y is not same at all levels of X2. It varies depending on at which value the predictor X2 is set.
Profile plots for Classification problem is slightly different. Suppose your class variable has k categories. Then model predicts scores
for each of these k categories. The final category predicted by the model is the one with highest score. Here each of these k
score functions are p-dimensional surfaces. So here we plot k profiles,
one for each of the class categories. The plots taken together tell you
how the scores vary with the predictor.
If you are generating profile against a categorical predictor, then
instead of a curve the application generates bar chart - one bar for
each category of the categorical predictor. The height of the bar is
predicted Y for prediction problems and the predicted Scores for the Classification problem.
In case of Classification problems, application generates Lift Chart for the fitted model. As we described above - for Class variable with k categories - model simulataneously predicts k Scores - one for each category.
If we take a paricular Class category, say category i
and look at its Score, we would expect (for a good model), for
observation receiving high scores are the ones which actually belong to
Class category i. So we sort the data according to this score from high score to low score. After sorting we look at top n observations and count what % of total "Category-i" observations we have captured. We plot the % of "Non- Category-i" against
% of "Category-i" as we vary n from 1 to the total
number of datapoints. This is the Lift Chart for "Category-i".
Lift chart essentially shows the discriminatory power of the fitted model in discriminating between Non- Category-i and "Category-i". Lift chart is also knows as Receiver's Operating Curve (ROC) . It tells you- if you are trying to capture - say 90% of Category-i of your sample, then what %
of Non-Category-i is also captured by the model by mistake.
The best possible scenario occurs when the model is able to
capture all (100%) Category-i without capturing any (0%) Non-Category-i. However for large real life Classification problems, that happens rarely.
Last but not the least, once the training is over, you have the option of
saving the model in a separate workbook.
This might not seem like a big deal. After all we all know how to
cut-paste
from one workbook to another. However, this option means that you have
a neural network model nicely set up in an Excel file, cells connected
with formulae, all ready to go. You can just feed-in a new input in a
specific range of cells and you get the model output in another range
of cells. Though it's not a big deal, setting up these formulae
could be a bit of nuisance - which the application performs happily. I
guess you will better appreciate this feature, when you actually use
it.
When you save the model in a separate workbook, along with the model
the data following two things are also saved.
The data on which the network was trained.
Along with the training data - the model predictions are also saved.
The different settings that were used to train the model.
Stuff that are not so nice ......
Size restriction - The application can
handle at most 10 output
and 50 input variables. It can fit the model with at most two hidden
layers. Each of the hidden layers should have less than 20 neurons in
it. The Classification tool can handle categorical data, however you
cannot use more than 40 categorical variables in your model. Also
maximum size of the Input Layer of your network
is set to 50. This means, counting all the continuous inputs and all
the dummy variables for categorical inputs - you cannot exceed the
limit of 50.
At one time you can run at most
500 training cycles. This is however, not such a big handicap since you can start from
the weights you have reached after 500 cycles and run it for another 500 cycles ..... and keep on doing that.
Speed - This is an implementation of the very basic training
algorithm, which is known to be slow for complicated task. On top of that, VBA
implementation is perhaps a lot slower than C / C++ implementation. It has been
developed and tested in a machine with 128MB RAM and in Excel 97. I have tested
it for 5 to 10 inputs and about 50 to 500 data points. I am yet to test it with
very large data sets. However for such small experiments, performance is not too bad.
I could run 100 training cycles in less than a minute.
Graphics - There are possibilities of a whole
lot of innovative graphics that might come with this. Unfortunately,
this application does not have that. I started out with an aim of
building lightweight basic NN package in Excel. I started with
bare essentials. And then I have managed to add some 'nice to have' features.
I feel right now, it's not so lightweight any more.
I would love to see nice graphs and plots with this, but may be in some
distant future, when I have more time and energy.
Stuff that this application does not do for you ......
The application does not do any kind of
data preprocessing for you.
Before copying your data to the Data worksheet of the Excel files,
you should really check the following things. They will improve
the quality of the model predictions.
The Input columns of your data should be independent,
which means - you should not be able to derive any of the input columns
from linear combinations of other input columns. For example, one input
column should not be sum or
difference of other input columns, it should not be constant multiple
of other input columns etc.
Presence of outliers in your
data, may badly affect the predictions. So if you have outliers in your
data - you should remove them before submitting the data to the
application.
Presence of too many categories
in a categorical input may make the predictions less robust. So if you
should reduce the number of categories by merging some of the
categories. How to merge and how many categories are 'not too many'
categories - are tricky questions, for which you need to lookup
relavent materials.
These are some of the features which are nice to have and most commercial neural network packages perhaps would have. Given my
limited time and resources - I could not put them in my application.
Having said all those things, I still feel that it is not too bad an application for
small experiments and definitely pretty good just for fun. So wish you a happy
experience with it.
I would really appreciate if you could
mail me
your comments and feedback.