Help! Can I normalize statistical coefficients?
August 13, 2014 8:13 AM
With deadline looming, stats consultant has bailed. Simple queries need resolution. Help?
I am working on a data graphic that involves statistical calculations about survival rates for startup businesses, correlated with certain tangible and intangible factors. The raw data (about survival/closure/merger outcomes) has already been investigated, and the original researchers (who are awesome) have generated some interesting correlations using univariate regressions and Cox regressions. For my output I am relying on their statistically significant findings, wanting to create comparisons among the univariate coefficients. Not sure my methods are kosher and would appreciate consultation. Avalanche inside.
To make the information clearer to a lay audience, editors and I want to present the figures in a comparative framing. Ideally, we're aiming to say "firms with factor [foo] have a XX% greater likelihood of survival than firms with factor [not-foo1] and a YY% greater likelihood of survival than firms with factor [not-foo2]."
Univariate:
The factors are characteristics of the owner and/or firm that survives/closes/merges. Specifically, I have univariate coefficients for owner age (binned categorical), race (categorical), owner years of experience (continuous), owner's experience in same industry (dummy: yes/no), owner's college degree (dummy: yes/no), firm diversification (dummy: yes/no), possession of IP (dummy: yes/no), and startup capital (binned categorical). I have firms categorized as high-, medium- and non-tech.
What I am hoping to do is normalize the coefficients. Simple example:
female-owned: survive 8.5 close 17.81 merger or sale 12.09
male-owned: survive 91.5 close 82.19 merger or sale 87.91
(If it matters, t-test was used to determine statistical significance; all of the above are stat-sig.)
Can I say that female-owned businesses in this category are more than twice as likely to close as they are to survive [(17.81/8.5) = 209%]? And, similarly, 42.2% [(12.09 - 8.5)/8.5] more likely to merge or be sold than to survive?
If I calculate similarly for male-owned businesses (89.8% as likely to close as to survive), can I compare the 209%-as-likely women's closure likelihood to the 89.8%-as-likely men's closure likelihood?
And does the 89.8% comparative likelihood of closure mean that men's firms are actually 11% more likely to survive? Or do I need to use a different denominator in calculating that?
Cox - competing risks:
With Cox regression tables, the researchers calculated the comparative likelihood of competing risks (closure vs. M&A -- both are considered business exits).
It seems clear that these figures are meant to be compared in pairs - closure vs. M&A. As I understand it, the coefficients represent positive/negative correlation (with each type of exit) and the absolute value (intensity) of the influence...?
For example:
Duration regression analysis - Cox regression (competing risks)
White-owned high-tech firms:
closure 0.56/m&a 0.45
White-owned medium-tech firms:
closure 0.69/m&a 0.8
White-owned non-tech firms:
closure 0.67/m&a 0.64
I suppose I can compare the closure figures to one another, but how? The scale is not necessarily -1 to 1, as some of these coefficients are upwards of 3 (for factors other than race, such as whether the business is a franchise).
And given this type of info, can any information about survival likelihoods possibly be derived (via...umm...subtraction or something)?
I feel a bit sheepish about these simple-minded queries, but for that reason I was sure they would be no-brainers for someone here. Sincere thanks to anyone who can shed some light.
Response by poster: Thank you, muhonnin! What you've said above has already been quite helpful and clarifying. I will certainly take you up on your kind memail offer as soon as I've untangled some of these knots. Very much obliged.
posted by GrammarMoses at 4:46 PM on August 13, 2014
1) It looks like the data from your example of female v. male-owned businesses are column percents (91.5 + 8.5 = 100, 17.81 + 82.19 = 100, etc.). But you are treating them as if they are row percents. If that's actual data, you can't infer anything at all from those figures about whether, for example, female-owned businesses are more likely to close than to survive, because these figures don't give us any information about the overall likelihood of businesses to close or survive: 17.81% of a small number could well be less than 8.5% of a big number.
2) I'm not sure that Cox regression is required, or even appropriate, for the claims you and your editors are hoping to make. Cox regression focuses specifically on the time that it takes for something to happen. So it only makes sense if you're specifying a particular time frame (survive/close/merge within the first n years, for example). There are also a lot of technical issues around whether and how your data are censored (i.e., if you've specified 5 years as your horizon, but you have some businesses in your sample that have only been around for 3 years, that's going to be a problem for any regression).
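To make point 1 concrete, here is a rough Python sketch. The female-owned shares (8.5%, 17.81%, 12.09%) are the ones from your question; the totals per outcome are numbers I made up purely for illustration, since the question doesn't give them. The function name and scenario names are likewise just mine.

```python
# Column percents from the question: female-owned share of each outcome group.
female_share = {"survive": 0.085, "close": 0.1781, "merge": 0.1209}


def female_row_percents(outcome_totals):
    """Turn column shares into row percents for female-owned firms.

    outcome_totals is the number of firms (all owners) in each outcome
    group. These totals are HYPOTHETICAL; the question does not report them.
    """
    counts = {k: female_share[k] * outcome_totals[k] for k in female_share}
    total = sum(counts.values())
    return {k: 100 * v / total for k, v in counts.items()}


# Scenario A (made up): most firms in the sample survive.
scenario_a = {"survive": 8000, "close": 1500, "merge": 500}
# Scenario B (made up): most firms in the sample close.
scenario_b = {"survive": 1500, "close": 8000, "merge": 500}

rows_a = female_row_percents(scenario_a)
rows_b = female_row_percents(scenario_b)

for name, rows in (("A", rows_a), ("B", rows_b)):
    print(name, {k: round(v, 1) for k, v in rows.items()})
```

With the same column percents, female-owned firms come out more likely to survive than to close in scenario A and the reverse in scenario B. That is why the 17.81/8.5 ratio can't be read as "twice as likely to close as to survive" without the underlying counts.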
The good news is that the comparative framing you and your editors are looking for is probably possible without requiring techniques nearly as sophisticated as Cox regression. Faced with this task, I might do simple pairwise chi-square analyses among your factors to see which ones have a significant impact on the survive/close/merge distribution, and then report the ones that are significant in an indexed way (i.e., set the % of survival among all businesses in your sample to 100, and then compare the %s of survival among significantly differing subgroups to that baseline).
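A minimal sketch of the chi-square-then-index approach, in plain Python so there's nothing to install. The counts and the subgroup names (`has_ip`, `no_ip`) are invented for illustration and are not from the study. The only outside fact used is the standard 5% critical value for chi-square with 2 degrees of freedom (5.991), which is what a 2 x 3 table gives you.

```python
# HYPOTHETICAL counts (not from the study): rows are subgroups, columns are outcomes.
counts = {
    "has_ip": {"survive": 300, "close": 150, "merge": 50},
    "no_ip": {"survive": 400, "close": 500, "merge": 100},
}
outcomes = ["survive", "close", "merge"]


def chi_square(table):
    """Pearson chi-square statistic for a subgroup-by-outcome count table."""
    row_tot = {g: sum(row.values()) for g, row in table.items()}
    col_tot = {o: sum(table[g][o] for g in table) for o in outcomes}
    n = sum(row_tot.values())
    stat = 0.0
    for g in table:
        for o in outcomes:
            expected = row_tot[g] * col_tot[o] / n
            stat += (table[g][o] - expected) ** 2 / expected
    return stat


def survival_index(table):
    """Index each subgroup's survival rate against the all-firm rate (= 100)."""
    n = sum(sum(row.values()) for row in table.values())
    overall = sum(row["survive"] for row in table.values()) / n
    return {
        g: 100 * (row["survive"] / sum(row.values())) / overall
        for g, row in table.items()
    }


stat = chi_square(counts)
index = survival_index(counts)

# 5.991 is the 5% critical value for 2 degrees of freedom: (2 - 1) * (3 - 1).
significant = stat > 5.991
print(round(stat, 2), significant, {g: round(v, 1) for g, v in index.items()})
```

With these made-up counts, the subgroups differ significantly and firms with IP index at roughly 129 against an all-firm baseline of 100, which reads to a lay audience as "about 29% more likely to survive than the average firm in the sample." For real work you'd want a stats package to give you exact p-values rather than a single critical-value cutoff.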
Others may be able to chime in with alternative suggestions. I hope that helps; I'm away from my desk for the next few hours but please feel free to memail me if I can be of any further use.
posted by muhonnin at 9:59 AM on August 13, 2014 [1 favorite]