Ace123 Posted January 19, 2008
I've got a dataset of 8120 matches and I want to predict an away win. I'm building a logistic model and there are 2339 away wins in my dataset, so my other 2339 should consist of home wins and draws. But how do I choose the split? Should it be 2339 away wins, 1170 draws and 1170 home wins? Please help, thanks.
Mr Intensity Posted January 19, 2008
Re: Random Sampling
I don't understand. Why do you need the same number of homes/draws as aways?
Ace123 Posted January 19, 2008
What should my population be, then, if I'm trying to model away wins?
Mr Intensity Posted January 21, 2008
As you've got a pretty decent number of games I'd start by using half of the data for training and half for validation.
Ace123 Posted January 22, 2008
The whole point of doing logistic regression is that your "goods" are the same volume as your "bads" so that there is no bias. I'm just wondering whether in football modelling you should follow this standard statistical practice or include everybody in your sample? In a season, on average, there are 50% home wins and the other 50% is made up of draws and away wins. My question is whether your sample should model all observations even though there may be a bias, or whether you should split the population evenly so that you're dealing with equal volumes.
Mr Intensity Posted January 22, 2008
Quoting Ace123: "The whole point of doing logistic regression is that your "goods" are the same volume as your "bads" so that there is no bias. I'm just wondering whether in football modelling you should follow this standard statistical practice or include everybody in your sample? In a season, on average, there are 50% home wins and the other 50% is made up of draws and away wins. My question is whether your sample should model all observations even though there may be a bias, or whether you should split the population evenly so that you're dealing with equal volumes."
It's been over a year, but as far as I remember the whole point of logistic regression is to work out the probability of a "good" from binomially distributed data using an appropriate link function. You know n, so you're wanting to work out p. If you sample the data so you know p = 0.5, then what's the point?
slapdash Posted January 22, 2008
If you artificially force your sample of "bads" to have 50% home wins and 50% draws, then you'll be introducing much more of a bias. Since home wins are actually more frequent than draws, you'll probably be heavily biasing the "bads" in favour of factors that correlate with the home team doing badly. I know what logistic regression is about, more or less, though I don't know much about the nuts and bolts. But I don't understand why you need the samples of goods and bads to be the same size.
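To put rough numbers on that point, here is a minimal sketch. The "natural" counts are illustrative round figures (4000 home wins, 2000 draws), assumed for the example and not taken from the real 8120-match dataset; the "forced" counts are the 1170/1170 split proposed at the top of the thread.

```python
def share_of_home_wins_among_bads(home_wins: int, draws: int) -> float:
    """Fraction of non-away results (the "bads") that are home wins."""
    return home_wins / (home_wins + draws)

# Illustrative population counts (an assumption, not the real data).
natural = share_of_home_wins_among_bads(home_wins=4000, draws=2000)

# The forced split proposed in the first post: equal numbers of each.
forced = share_of_home_wins_among_bads(home_wins=1170, draws=1170)

print(f"natural share of home wins among bads: {natural:.3f}")
print(f"forced share of home wins among bads:  {forced:.3f}")
```

Under these assumed counts the forced sample shows the model far fewer home wins, relative to draws, than it would meet in practice, which is the distortion described above.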
slapdash Posted January 22, 2008
Quoting Mr Intensity: "It's been over a year, but as far as I remember the whole point of logistic regression is to work out the probability of a "good" from binomially distributed data using an appropriate link function. You know n, so you're wanting to work out p. If you sample the data so you know p = 0.5, then what's the point?"
Isn't the point of logistic regression to work out how the probability of a "good" varies when you have knowledge of other factors? So fixing the total sample so that the overall probability is 0.5 doesn't necessarily prejudge the answer.
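A side note that goes beyond what the posters say: when the "bads" are thinned at random (not by a forced home/draw quota), a standard result for outcome-dependent sampling is that the fitted log-odds are shifted by a constant, the log of the ratio of the two keep rates, while the effects of the other factors are left alone. The sketch below undoes that shift for a single predicted probability. The function name and arguments are made up for illustration; the counts are the ones quoted in the first post.

```python
import math

def correct_probability(p_sample: float, keep_goods: float, keep_bads: float) -> float:
    """Map a probability fitted on a rebalanced sample back to the full population.

    keep_goods / keep_bads are the fractions of goods and bads that were
    kept when the sample was drawn (random sampling within each group).
    """
    logit_sample = math.log(p_sample / (1.0 - p_sample))
    # Rebalancing adds log(keep_goods / keep_bads) to the log-odds; remove it.
    logit_true = logit_sample - math.log(keep_goods / keep_bads)
    return 1.0 / (1.0 + math.exp(-logit_true))

# All 2339 away wins kept; 2339 of the 8120 - 2339 = 5781 other matches kept.
keep_goods = 1.0
keep_bads = 2339 / 5781

p_population = correct_probability(0.5, keep_goods, keep_bads)
print(f"{p_population:.3f}")
```

A prediction of 0.5 on the balanced sample maps back to the overall away-win rate of 2339/8120, so balancing changes the baseline and has to be corrected for before the numbers are read as real probabilities.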
Mr Intensity Posted January 22, 2008
Quoting slapdash: "Isn't the point of logistic regression to work out how the probability of a "good" varies when you have knowledge of other factors? So fixing the total sample so that the overall probability is 0.5 doesn't necessarily prejudge the answer."
Just re-read what I wrote and it sounds really dumb. Setting the sample with 50% home wins is wrong, but for the reasons you've stated.
Ace123 Posted January 22, 2008
So basically for each sample (build set and validation set) you should have the true proportions of home wins, draws and away wins?
Mr Intensity Posted January 22, 2008
Quoting Ace123: "So basically for each sample (build set and validation set) you should have the true proportions of home wins, draws and away wins?"
No, then you're forcing the data, which is bad. Do as the thread title says: choose randomly.
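A minimal sketch of "choose randomly", using synthetic labels that only reproduce the counts quoted in the thread (8120 matches, 2339 away wins). The seed value is arbitrary and is there only to make the split repeatable.

```python
import random

# Synthetic labels matching the thread's counts: 1 = away win, 0 = anything else.
labels = [1] * 2339 + [0] * (8120 - 2339)

rng = random.Random(2008)  # arbitrary fixed seed so the split can be reproduced
rng.shuffle(labels)

half = len(labels) // 2
build, validation = labels[:half], labels[half:]

def away_rate(sample):
    """Proportion of away wins in a list of 0/1 labels."""
    return sum(sample) / len(sample)

print(f"overall:    {away_rate(labels):.3f}")
print(f"build:      {away_rate(build):.3f}")
print(f"validation: {away_rate(validation):.3f}")
```

With halves this large, a plain random split lands close to the overall away-win rate in both sets without any proportions being imposed.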
Ace123 Posted January 22, 2008
OK, so what should my "good" and "bad" outcomes be? I still don't get what the splits should be. Let's take an example. Say I want to model the probability of a home win and my dataset size is 8000: 4000 are home wins, 2000 are draws and 2000 are away wins. Could you possibly explain to me how I would build a logistic model based on the above info? Thanks.
Mr Intensity Posted January 22, 2008
What software are you using? The easiest way to do it is to take your data, order it by date, take the first 4000 results and use them as your training data. You need to decide which factors you want to include. This is easy: I'd start by including everything you might want to include. You then want to create your model using the software and do an Analysis of Deviance. Your software should add terms sequentially, so you have a forward stepwise approach and can do chi-squared tests to get a p-value and use hypothesis tests to determine which factors to keep in. Then, when you have decided which factors to keep in, you can run the model again using different link functions to determine which is the best. That should give you a model to start with. Then you can start messing around and use the testing data ;) Sorry if that's a bit patronizing, from your posts I'm not sure how much you know ;)
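The chi-squared step in an Analysis of Deviance can be sketched as follows for the simplest case, a term that adds exactly one parameter. The deviance values are invented for illustration; in practice they come from the fitted models your software reports. A factor with several levels adds more than one parameter and needs the chi-squared distribution with that many degrees of freedom, which this sketch does not cover.

```python
import math

def deviance_test_p_value(deviance_without: float, deviance_with: float) -> float:
    """P-value for adding ONE parameter, from the drop in deviance (chi-squared, 1 df)."""
    drop = deviance_without - deviance_with
    if drop < 0:
        raise ValueError("adding a term should not increase the deviance")
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(drop / 2.0))

# Made-up deviances for a model without and with one candidate factor.
p_value = deviance_test_p_value(deviance_without=9750.0, deviance_with=9740.0)
print(f"p-value for the added term: {p_value:.4f}")
```

A small p-value suggests the term earns its place; a drop in deviance of about 3.84 corresponds to the usual 5% level for one parameter.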
Ace123 Posted January 22, 2008
Thanks very much for that advice, Mr Intensity. I'm actually using SAS. So what would be my target variable, and how would you define it? As in, what would the "1" represent and what would the "0" represent?
Mr Intensity Posted January 22, 2008
Replied on MSN ;)
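The thread does not record the answer. One common convention for an away-win model, offered here as an assumption and not as what was said off-thread, is to code the target as 1 for an away win and 0 for a home win or a draw, then split by date as suggested earlier. The field names and the "H"/"D"/"A" result codes below are hypothetical.

```python
from datetime import date

# Hypothetical rows: field names and result codes are assumptions for this sketch.
matches = [
    {"date": date(2007, 9, 1), "result": "A"},
    {"date": date(2007, 8, 11), "result": "H"},
    {"date": date(2007, 8, 25), "result": "D"},
    {"date": date(2007, 8, 18), "result": "A"},
]

def away_win_target(result: str) -> int:
    """1 = away win (the "good"), 0 = home win or draw (the "bad")."""
    return 1 if result == "A" else 0

# Order by date and use the earlier half for training, the later half for testing.
ordered = sorted(matches, key=lambda m: m["date"])
cut = len(ordered) // 2
training, testing = ordered[:cut], ordered[cut:]

training_targets = [away_win_target(m["result"]) for m in training]
testing_targets = [away_win_target(m["result"]) for m in testing]
print(training_targets, testing_targets)
```

Home wins and draws go into the 0 group in whatever proportion they occur, so nothing about their mix is forced.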