What statistical measure compares a subset and its larger group?
November 28, 2017 3:01 PM

I'm looking to compare an employer's racial/ethnic representation to the population of the geographic region where it is located. I know nothing about statistics. Is there a statistical measure I should use to characterize the size / significance of the employer's deviation from the larger population?

Basically I have two things.

1. I have the racial/ethnic breakdown of the County where the employer is located.
2. I have the racial/ethnic breakdown of the people who work for this employer, by department.

I could, of course, just say, "Just 7% of [Department A] is African American, which is lower than the County population, which is 15% African American. Therefore African Americans are under-represented." But this doesn't really tell me how significant that under-representation is.

Keeping with the example of African American employees, and a County where African Americans are 15% of the population. Maybe there are 3 million County residents and 30,000 employees at this company.

If African-Americans are 17% of Department A, 15% of Department B, 13% of Department C, 11% of Department D, 7% of Department E, 5% of Department F, and 3% of Department G, what would help me describe the significance of this variation from the 15% County-level number, for each department?

I had assumed standard deviation but recall now that this is actually about deviation from a mean, not from a percentage of a population.
posted by kensington314 to Science & Nature (8 answers total) 5 users marked this as a favorite
You're looking for chi-square!
posted by augustimagination at 3:17 PM on November 28, 2017 [1 favorite]

You just need a test of proportion to determine whether the difference is statistically significant or not. Here are some links about that:

Test of proportion

Hypothesis Testing for Population Proportions
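
For a concrete picture, here is a minimal sketch of a one-sample test of proportion using only Python's standard library. It treats the County's 15% as a fixed, known baseline (reasonable with about 3 million residents) and uses the normal approximation. The department size of 1,000 is a made-up assumption, since the question doesn't give department headcounts, and the function name is just illustrative.

```python
import math

def one_sample_proportion_z_test(successes, n, p0):
    """Two-sided z-test of an observed proportion against a fixed baseline p0."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)            # standard error under the null
    z = (p_hat - p0) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail area
    return p_hat, z, p_value

# Hypothetical: Department E has 1,000 employees, 70 of them African American (7%).
p_hat, z, p_value = one_sample_proportion_z_test(successes=70, n=1000, p0=0.15)
print(f"observed share={p_hat:.3f}  z={z:.2f}  p={p_value:.2g}")
```

The z value is the gap between the department and the County measured in standard errors, so it depends on the department's size as well as on the percentage gap.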
posted by paco758 at 3:40 PM on November 28, 2017 [1 favorite]

Yes, chi-squared is what you're looking for. It will tell you how likely it is that the sub-population was actually chosen at random from the overall population. But watch out for two things:
  1. You may be surprised by how large a disparity between the sub-population and the overall population chi-squared is willing to accept as "random" deviation, especially if the sub-population is quite small compared to the overall population.
  2. Employment is not a random selection. Just because a population contains X people does not necessarily mean it contains X people suitable for a specific job. So you can use chi-squared to find out if there is a potential problem, but you probably can't prove it with chi-squared alone.
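
To illustrate the first point, here is a rough sketch of a chi-squared goodness-of-fit test on a two-way split (African American / everyone else) against the 15% County share. The two department sizes (40 and 1,000) are hypothetical, and the p-value shortcut below only works because a two-category split has 1 degree of freedom.

```python
import math

def chi_square_gof_two_groups(observed_in_group, n, expected_share):
    """Chi-squared goodness-of-fit for a two-category split (in group / not in group)."""
    observed = [observed_in_group, n - observed_in_group]
    expected = [n * expected_share, n * (1 - expected_share)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Roughly the same share (7 to 7.5%) at two hypothetical department sizes.
small = chi_square_gof_two_groups(observed_in_group=3, n=40, expected_share=0.15)
large = chi_square_gof_two_groups(observed_in_group=70, n=1000, expected_share=0.15)

print(f"n=40:   chi2={small[0]:.2f}  p={small[1]:.3f}")
print(f"n=1000: chi2={large[0]:.2f}  p={large[1]:.2g}")
```

Under these assumptions the small department's gap is well within what chance alone would produce, while the same percentage gap in the large department is not.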

posted by ubiquity at 3:42 PM on November 28, 2017 [4 favorites]

Yup, you need a test of proportions, aka chi-square. If you're looking for an easy calculator, MedCalc's is pretty good, free, and matches what I've gotten from fancy stats programs. Sample 1 would be the individual department, and Sample 2 would be county-level information. (Unless you want to compare Dept A directly to Dept B, but it doesn't sound like that's what you are looking for.)

Keep in mind also that statistical significance (the "p value") can be different from actual relevance, especially when dealing with large samples. I just finished a paper on gender disparities where something like 24.6% of men were positive compared to 23.9% of women. Because we had ~100k of each, it was statistically significant! But not actually relevant.
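
Here's a minimal sketch of that situation as a two-sample test of proportions, using the approximate figures above (24.6% vs 23.9%, with exactly 100,000 in each group assumed for illustration). It reports the raw difference next to the p-value, which is the number to look at when judging relevance.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided pooled z-test for the difference between two sample proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p1 - p2, z, p_value

# Assumed counts: 24,600 of 100,000 men vs 23,900 of 100,000 women.
diff, z, p_value = two_proportion_z_test(24_600, 100_000, 23_900, 100_000)
print(f"difference={diff * 100:.1f} points  z={z:.2f}  p={p_value:.2g}")
```

The p-value is tiny, but the difference is still under one percentage point.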
posted by basalganglia at 4:07 PM on November 28, 2017

2. Employment is not a random selection. Just because a population contains X people does not necessarily mean it contains X people suitable for a specific job. So you can use chi-squared to find out if there is a potential problem, but you probably can't prove it with chi-squared alone.

Seconding this. I had to do something similar for Aboriginal and Torres Strait Islander employee representation in an Australian Government context, and it's pretty fraught. If you're using this to commence a line of enquiry that will examine root causes / the practical significance of underrepresentation, then good. If you're using it to make some sort of ladder / shit list that's going to be published without any other context, then maybe don't, because the leap from 'there are fewer people here than there are in the general population' to 'therefore, there is underrepresentation' (and presumably, '...and that's bad') needs more work than that. A lot more work.

It's easy to say 'the [affected minority] population of [area] is x per cent, but for your department it's only [less than x] percent', but this doesn't say anything about why that's the case, and it can give departments a slap for something that's completely outside their control, and ignores that some minority groups might have clear preferences that drive those results. As noted above, employment isn't random, and we shouldn't expect a workforce to match a random but representative slice of the entire population.

One example was a small agency that only required people with a very narrow mathematics specialisation that almost nobody in the country has, but least of all people who are far less likely to finish high school and attend university. Telling this agency to up its game on the diversity front was futile. It had tried offering scholarships (not recently) but the answer was pretty much 'Who the hell would want to do that job?' They could've advertised ten Indigenous-only positions the next day - they would've stayed vacant for a very, very long time.

Central agencies in general fared poorly, not just because of the genuine requirement for people with qualifications in finance, economics, etc., but because Indigenous people tended to prefer working with their communities in service delivery roles. So there was numerical underrepresentation, but also no demand for higher representation. 'I sit in Canberra and wear a suit and tie and make really complex financial models and look out the window at Parliament House' wasn't something they were looking for. If anything, they risked being labelled as a sellout.

Another was a small agency where just a few people made a tremendous difference in their comparative percentage, so they were way over or way under depending on what time period I selected.
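
One way to see how shaky a small agency's percentage is: put a confidence interval around it. The sketch below uses the Wilson score interval, with a made-up agency of 40 staff where the Indigenous headcount goes from 2 to 4. Those numbers are purely illustrative, not from the agencies I looked at.

```python
import math

def wilson_interval(count, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    p_hat = count / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Hypothetical agency of 40 staff: two extra people double the share from 5% to 10%.
before = wilson_interval(2, 40)
after = wilson_interval(4, 40)

print(f"2 of 40: {before[0]:.1%} to {before[1]:.1%}")
print(f"4 of 40: {after[0]:.1%} to {after[1]:.1%}")
```

Both intervals are wide and overlap heavily, so the doubling on paper says very little on its own.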

Another was a large department that had heaps of Indigenous representation, but almost none were in senior roles. They had a lot of 'cadets' - recruited pretty much entirely to hit a target, it seemed, because there wasn't really an operational requirement for more Indigenous staff - who ended up sitting in junior positions for the rest of their careers.

Another was underrepresented numerically, but the Indigenous people who did work there said it was a fantastic place to work, and the reason there wasn't higher representation was that nobody ever left the department so there were no vacancies.

Conversely, another had a lot of Indigenous representation following the closure of a particular agency and the transfer of staff, but not many people were happy about working there (in part because it happened to be a central agency).

One agency had underrepresentation, but this was because it had really shit public transport options and parking was super expensive. This made it unappealing for people in lower socioeconomic brackets, and therefore disproportionately affected Indigenous people.
posted by obiwanwasabi at 4:15 PM on November 28, 2017 [8 favorites]

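If you are drawing conclusions from these tests you should be doing multiple endpoint correction (a multiple-comparisons correction), since running one test per department raises the odds that at least one comes up "significant" purely by chance.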
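
As a rough sketch, here is the simplest such correction (Bonferroni) applied to the seven departments from the question. It assumes every department has 500 employees, which is a made-up figure, and tests each one against the 15% County share with a normal approximation. Less conservative corrections exist; Bonferroni is used here only because it fits in one line.

```python
import math

def proportion_p_value(count, n, baseline):
    """Two-sided p-value for one department's share against a fixed baseline."""
    se = math.sqrt(baseline * (1 - baseline) / n)
    z = (count / n - baseline) / se
    return math.erfc(abs(z) / math.sqrt(2))

COUNTY_SHARE = 0.15
DEPT_SIZE = 500  # hypothetical; the question does not give department sizes

# African American headcounts implied by 17%, 15%, 13%, 11%, 7%, 5%, 3% of 500.
counts = {"A": 85, "B": 75, "C": 65, "D": 55, "E": 35, "F": 25, "G": 15}

raw = {d: proportion_p_value(c, DEPT_SIZE, COUNTY_SHARE) for d, c in counts.items()}
# Bonferroni: multiply each p-value by the number of tests, capped at 1.
adjusted = {d: min(1.0, p * len(raw)) for d, p in raw.items()}

for d in counts:
    print(f"Dept {d}: raw p={raw[d]:.4f}  adjusted p={adjusted[d]:.4f}")
```

With these assumed sizes, Department D (11%) clears the usual 0.05 bar before the correction but not after it.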

Also, remember "differences in statistical significance are seldom statistically significant." If department A *is* significantly under representing minorities relative to the county by your test, and department B is not, you cannot automatically conclude that department B is better than department A. You'd need to compare them directly.
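
A small sketch of what that looks like, using the 13% and 11% departments from the question and assuming 500 employees in each (a made-up size). Each department is tested against the 15% County share, then the two are tested against each other.

```python
import math

def z_to_p(z):
    """Two-sided p-value from a z statistic (normal approximation)."""
    return math.erfc(abs(z) / math.sqrt(2))

def vs_baseline(count, n, baseline):
    """One department against a fixed baseline share."""
    se = math.sqrt(baseline * (1 - baseline) / n)
    return z_to_p((count / n - baseline) / se)

def vs_each_other(x1, n1, x2, n2):
    """Two departments compared directly (pooled two-proportion z-test)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return z_to_p((x1 / n1 - x2 / n2) / se)

p_c = vs_baseline(65, 500, 0.15)         # 13% department vs County
p_d = vs_baseline(55, 500, 0.15)         # 11% department vs County
p_cd = vs_each_other(65, 500, 55, 500)   # the two departments head to head

print(f"C vs County p={p_c:.3f}  D vs County p={p_d:.3f}  C vs D p={p_cd:.3f}")
```

Under these assumptions one department is "significant" against the County and the other isn't, yet the direct comparison between them is nowhere near significant.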
posted by mark k at 11:57 PM on November 28, 2017 [2 favorites]

You might consider focusing on the demographics of the local population if you have the bandwidth. If you, say, live in a county in some warm place where half the white people are retired seniors, and, I dunno, a third of the Latino population is underage children (I can't guess why that would be, just making stuff up), and neither are really eligible to work at your company, then the demographics of the working-age population in your county would be your local-population baseline, rather than all residents.
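
Here's a tiny sketch of how much that can move the baseline. Every number in it is invented to match the made-up scenario above (3 million residents, 15% African American overall); real figures would have to come from census age-by-race tables.

```python
# Hypothetical county: (all residents, working-age residents) per group.
county = {
    "White":            (1_500_000, 750_000),   # half are retired seniors
    "Latino":           (900_000, 600_000),     # a third are underage children
    "African American": (450_000, 300_000),
    "Other":            (150_000, 100_000),
}

total_residents = sum(residents for residents, _ in county.values())
total_working_age = sum(working for _, working in county.values())

group = "African American"
share_all = county[group][0] / total_residents
share_working_age = county[group][1] / total_working_age

print(f"{group}: {share_all:.1%} of all residents, "
      f"{share_working_age:.1%} of working-age residents")
```

In this invented county the baseline for comparison rises from 15% to about 17% once you restrict it to people who could plausibly be employees.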
posted by Sunburnt at 11:14 AM on November 29, 2017 [1 favorite]

Statistics is a very deep subject. Computing statistical measures is easy (see the recommendations for various software above). This can lead to a search for metrics that "show" whatever the researcher wants. If all you're looking for is propaganda, you don't really care, but if you're sincerely asking a question and want an answer, I suggest you contact someone who is actually a statistics expert and ask them for help. Local math departments would be a good place to check.

Even if you are just looking for support for your position (and I'm not being judgemental here--there are lots of ways to know whether an employer is acting badly that don't rely on statistics), having an expert on tap will be super helpful when they trot out their expert.
posted by Gilgamesh's Chauffeur at 8:49 AM on December 2, 2017
