Random Sampling


Ace123

I've got a dataset of 8120 matches and I want to predict an away win. I'm building a logistic model and there are 2339 away wins in my dataset, so my other 2339 should consist of home wins and draws. But how do I choose the split? Should it be 2339 away wins, 1170 draws and 1170 home wins? Please help, thanks.


Re: Random Sampling

The whole point of doing logistic regression is that your "goods" are the same volume as your "bads" so that there is no bias. I'm just wondering whether in football modelling you should follow this standard statistical practice or include everybody in your sample? In a season, on average, 50% of matches are home wins and the other 50% are draws and away wins. My question is whether your sample should include all observations even though there may be a bias, or whether you should split the population evenly so that you're dealing with equal volumes.


Re: Random Sampling

It's been over a year, but as far as I remember the whole point of logistic regression is to work out the probability of a "good" from binomially distributed data using an appropriate link function. You know n, so you're wanting to work out p. If you sample the data so you know p=0.5, then what's the point?

Re: Random Sampling

If you artificially force your sample of "bads" to have 50% home wins and 50% draws, then you'll be introducing much more of a bias. Since home wins are actually more frequent than draws, you'll probably be heavily biasing the "bads" in favour of factors that correlate with the home team doing badly. I know what logistic regression is about, more or less, though I don't know much about the nuts and bolts. But I don't understand why you need the samples of goods and bads to have the same size?
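To illustrate the point about forced quotas, here is a minimal Python sketch. The thread only gives the totals (8120 matches, 2339 away wins), so the split of the remaining 5781 matches into 3700 home wins and 2081 draws is a made-up assumption, as are the variable names. It shows that drawing the "bads" at random keeps home wins and draws roughly in their natural proportion, whereas a fixed 1170/1170 quota would not.

```python
import random

# Assumed counts for illustration only: the thread does not give the
# home-win/draw breakdown of the 5781 matches that are not away wins.
non_away = ["H"] * 3700 + ["D"] * 2081

random.seed(1)
# Draw as many "bads" as there are away wins, without any per-class quota.
bads = random.sample(non_away, 2339)

natural_share = 3700 / 5781              # share of home wins among all "bads"
sampled_share = bads.count("H") / len(bads)
quota_share = 1170 / (1170 + 1170)       # share under the proposed 50/50 quota

print(round(natural_share, 3), round(sampled_share, 3), quota_share)
```

The random draw stays close to the natural share of about 0.64, while the quota pins it at 0.5.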


Re: Random Sampling

Isn't the point of logistic regression to work out how the probability of a "good" varies when you have knowledge of other factors? So fixing the total sample so that the overall probability is 0.5 doesn't necessarily prejudge the answer.
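As a rough sketch of why balancing need not prejudge the answer: if matches are kept or dropped based only on the outcome (at random within each class), the odds predicted by a logistic model fitted on the balanced sample differ from the full-data odds by a constant factor, so only the intercept shifts. The helper below is hypothetical (its name and arguments are not from the thread) and uses the 2339 out of 8120 away-win rate from the first post.

```python
import math

def correct_probability(p_sample, sample_rate, population_rate):
    """Map a probability from a model fitted on a resampled data set
    back to the original base rate (outcome-only sampling assumed)."""
    sample_odds = p_sample / (1.0 - p_sample)
    # Ratio of population prior odds to resampled prior odds.
    factor = (population_rate / (1.0 - population_rate)) / (
        sample_rate / (1.0 - sample_rate)
    )
    true_odds = sample_odds * factor
    return true_odds / (1.0 + true_odds)

population_rate = 2339 / 8120   # away wins in the full data set
sample_rate = 0.5               # away wins in a balanced sample

# A "coin flip" prediction from the balanced model maps back to the base rate.
p = correct_probability(0.5, sample_rate, population_rate)

# Equivalent shift to apply to the fitted intercept on the logit scale.
intercept_shift = math.log(population_rate / (1.0 - population_rate)) - math.log(
    sample_rate / (1.0 - sample_rate)
)
print(round(p, 4), round(intercept_shift, 4))
```

The slopes on the other factors are unaffected under this assumption, which is why the balanced sample still tells you how the probability varies with them.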

Re: Random Sampling

Just re-read what I wrote and it sounds really dumb. Setting the sample with 50% home wins is wrong, but for the reasons you've stated.

Re: Random Sampling

OK, so what should my "good" and "bad" outcomes be? I still don't get what the splits should be. Let's take an example. Say I want to model the probability of a home win and my dataset has 8000 matches: 4000 are home wins, 2000 are draws and 2000 are away wins. Could you possibly explain to me how I would build a logistic model based on the above info? Thanks.
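One common way to set up the example above, sketched in Python rather than any particular stats package: code the event being modelled as 1 and everything else as 0, using all 8000 matches. The result codes "H", "D", "A" and the helper name `encode_target` are assumptions made for illustration.

```python
def encode_target(results, event):
    """Return 1 where the result equals the event being modelled, else 0."""
    return [1 if r == event else 0 for r in results]

# Counts from the example: 4000 home wins, 2000 draws, 2000 away wins.
results = ["H"] * 4000 + ["D"] * 2000 + ["A"] * 2000

# Home-win model: 1 = home win ("good"), 0 = draw or away win ("bad").
home_target = encode_target(results, "H")

# An away-win model would use the same data with a different event.
away_target = encode_target(results, "A")

home_rate = sum(home_target) / len(home_target)
away_rate = sum(away_target) / len(away_target)
print(home_rate, away_rate)  # 0.5 0.25
```

Draws and away wins are simply pooled into the 0 class in whatever proportion they occur; no separate split needs to be chosen.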


Re: Random Sampling

What software are you using? The easiest way to do it is to take your data, order it by date, and use the first 4000 results as your training data. You need to decide which factors you want to include. This is easy: I'd start by including everything you might want. You then create your model using the software and do an Analysis of Deviance. Your software should add terms sequentially, so you have a forward stepwise approach and can do chi-squared tests to get a p-value and use hypothesis tests to determine which factors to keep in. Once you have decided which factors to keep, you can run the model again using different link functions to determine which is best. That should give you a model to start with. Then you can start messing around and use the testing data ;)

Sorry if that's a bit patronizing, from your posts I'm not sure how much you know ;)
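The steps above can be sketched in plain Python, as a toy stand-in for what a stats package would do. Everything here is an assumption for illustration: the data are synthetic, `form_diff` is a made-up single factor, and `fit_logistic` is a small hand-written Newton-Raphson fit with a logit link. It splits by date order, fits on the first 4000 matches, and tests the factor with a one-term analysis of deviance (chi-squared, 1 degree of freedom).

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, iters=25):
    """Fit P(y=1) = sigmoid(a + b*x) by Newton-Raphson; return (a, b)."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient w.r.t. intercept
            g1 += (y - p) * x      # gradient w.r.t. slope
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det == 0.0:
            break
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

def deviance(probs, ys):
    return -2.0 * sum(
        y * math.log(p) + (1 - y) * math.log(1.0 - p) for p, y in zip(probs, ys)
    )

# Synthetic stand-in for 8000 matches already sorted by date.
random.seed(0)
form_diff = [random.gauss(0.0, 1.0) for _ in range(8000)]
home_win = [1 if random.random() < sigmoid(0.8 * x) else 0 for x in form_diff]

# Chronological split: first 4000 for training, the rest for testing.
train_x, test_x = form_diff[:4000], form_diff[4000:]
train_y, test_y = home_win[:4000], home_win[4000:]

a, b = fit_logistic(train_x, train_y)
base = sum(train_y) / len(train_y)
null_dev = deviance([base] * len(train_y), train_y)
model_dev = deviance([sigmoid(a + b * x) for x in train_x], train_y)

# Drop in deviance from adding one term, tested against chi-squared with 1 df.
drop = max(null_dev - model_dev, 0.0)
p_value = math.erfc(math.sqrt(drop / 2.0))

# Held-out check on the later matches.
test_dev = deviance([sigmoid(a + b * x) for x in test_x], test_y)
print(round(b, 3), round(drop, 1), p_value, round(test_dev, 1))
```

With several candidate factors, the same deviance comparison would be repeated as each term is added in turn.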


Re: Random Sampling

Thanks very much for that advice, mr intensity. I'm actually using SAS. So what would be my target variable, and how would you define it? As in, what would the "1" represent and what would the "0" represent?

