Look at one of the sample files to
see what your file should look like.
If you have interesting strings, you must quote them. Spaces and
letters are fine, but many other fancy symbols (commas, hyphens,
quotation marks) need quotes.
Things that you must quote: "Foster, Dean", "Allston-Brighton",
"The whale said, \"Call me Ishmael,\" no that was the whale killer.", etc.
The first row must contain the names of the variables.
The "Y" variable should come before the X's.
Don't include any X's that you don't want used in the regression!
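As a rough illustration of the quoting rule above, here is a minimal Python sketch. The helper names quote_field and format_row are hypothetical, and so is the exact set of "safe" characters (letters, digits, and spaces). Escaping embedded quotes with a backslash follows the whale example; escaping backslashes themselves is an assumption. Compare the result with a sample file before relying on it.

```python
import re

# Assumption: letters, digits, and spaces are safe without quotes;
# any other character triggers quoting.
_SAFE = re.compile(r"[A-Za-z0-9 ]*")


def quote_field(value):
    """Return value as text, wrapped in double quotes if it needs them."""
    if isinstance(value, (int, float)):
        return str(value)  # numbers are written as-is
    text = str(value)
    if _SAFE.fullmatch(text):
        return text  # plain names such as Y or X1 stay unquoted
    # Escape backslashes first, then embedded double quotes.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def format_row(values):
    """Join one row of fields with commas, quoting where needed."""
    return ", ".join(quote_field(v) for v in values)
```

For example, format_row(["Foster, Dean", 10]) gives "Foster, Dean", 10 with the name quoted and the number left bare.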
If you want to fit the entire data set, use the variable order: Y,
X1, X2, ...
Example:
Y, X1, X2
10, 2, 3
20, 4, 5
...
To do cross validation:
Put the indicator of which observations are in sample in the first
column (use the values in / out).
Variable order: CROSS, Y, X1, X2, X3, ...
Example:
CROSS, Y, X1, X2
in, 10, 2, 3
in, 20, 4, 5
out, , 5, 4
out, , 10, 3
You can include values for the Y's that are out of sample--they
will be ignored.
The system will make predictions for the out-of-sample Y's.
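The cross-validation layout above can be produced with a short Python sketch like the following. The function name cross_validation_text and its arguments are hypothetical; it only assembles the text in the order CROSS, Y, X1, X2, ... and leaves Y blank for out-of-sample rows. It does no quoting, so it assumes the names and values are plain letters and numbers.

```python
def cross_validation_text(names, in_sample, out_of_sample):
    """Build the file text with CROSS first, then Y, then the X's.

    names:          column names with Y first, e.g. ["Y", "X1", "X2"]
    in_sample:      rows of [y, x1, x2, ...]
    out_of_sample:  rows of [x1, x2, ...]; Y is left blank
    """
    lines = [", ".join(["CROSS"] + list(names))]
    for row in in_sample:
        lines.append(", ".join(["in"] + [str(v) for v in row]))
    for row in out_of_sample:
        # The empty string is the blank Y to be predicted.
        lines.append(", ".join(["out", ""] + [str(v) for v in row]))
    return "\n".join(lines) + "\n"


text = cross_validation_text(
    ["Y", "X1", "X2"],
    in_sample=[[10, 2, 3], [20, 4, 5]],
    out_of_sample=[[5, 4], [10, 3]],
)
```

With these inputs, text reproduces the four-row example shown above, header included.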
The bankruptcy data is discussed in our JASA paper.
The baseball data is from here.
It has no particular significance, except I was discussing it in class
when I wrote this page.
The Hadamard data set (simulated) is motivated by Manfred's paper, "Leaving the
Span". He argues that linear methods can't find the fit here. So
if our code finds it, we must be non-linear! (We search the space of
linear models, but our method is clearly non-linear.)
Code
If you want the code, please
email either Bob or Dean. We will try to keep a .tgz file online and up to date, but it is
best to email for a fresh copy.
What happens:
This page grabs the file.
Then a Perl script reads the file and passes it to the C++ code
(which runs in the background).
All output for this run is saved in a unique directory under this directory.