What are your favorite beginner resources for exploratory data analysis?
March 29, 2023 9:30 AM
I don't have a strong quantitative background (have taken a couple of basic stats classes many years ago) but recently was asked at work to take a look at a Qualtrics data set from around 150 surveys (pre- and post-test surveys). I'm not really sure how to start off looking at the data in an organized, best practices sort of way.
The surveys are quite complex with 50 different questions and include responses in several different formats ranging from Likert scales to multiple choices to open text responses. The pre-test surveys have some of the same questions as the post-test surveys but they are not identical.
I would like to try to think through a workflow before even really getting into the data, because I've never done this sort of exploratory survey analysis at all before.
TLDR: you have a bunch of pre/post survey data and no hypothesis - how do you proceed with data cleaning and exploratory analysis in an organized, step-by-step way?
Any recs for books, articles, videos on good practices for exploratory data analysis are welcome!
Response by poster: Excel, R, and I could potentially get access to a STATA license
posted by forkisbetter at 10:52 AM on March 29, 2023
Best answer: So my job is this, and without knowing a lot about your data types and the like, here is generally how I start:
- First I must know my experiment and my question. What am I trying to answer? Is it a before/after change? A relationship between variables? Interesting trends among populations? What is your changed variable (X) and what’s the response variable (Y)?
- I take some time to just look at the raw data table so I understand how it is organized. Is it by category (men/women, across temperature, different populations of parts made on different dates)? Is the data binary (pass/fail, high/low) or is something measured?
- then I do some general plots first such as distributions of each Y variable (response) to see how it looks - normal distribution? Cluster of outliers? Weird skinny tail?
- I do variability charts of Y vs. X where X is a category variable. Anything pop out?
- I plot my expected X vs Y in an “xy” plot to see if normal things are happening like power consumption increasing with temperature.
- if there’s a before/after change you can also calculate the difference and do some plotting on that if your software doesn’t do it for you.
That gets me oriented to my data.
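A minimal sketch of two of the orientation steps above (the distribution of a response, and the before/after difference), using only the Python standard library. The respondent IDs and scores are made up for illustration, and the text histogram is just a stand-in for whatever plotting tool you end up using.

```python
from collections import Counter
from statistics import mean

# Hypothetical paired responses on a 1-5 Likert item, keyed by respondent ID.
pre = {"r1": 2, "r2": 3, "r3": 3, "r4": 4, "r5": 1}
post = {"r1": 4, "r2": 3, "r3": 5, "r4": 4, "r5": 2}


def text_histogram(values):
    """Return one line per observed value, with a bar as long as its count."""
    counts = Counter(values)
    return [f"{value}: {'#' * counts[value]}" for value in sorted(counts)]


# Distribution of the response (Y) before and after.
for label, responses in (("pre", pre), ("post", post)):
    print(label)
    print("\n".join(text_histogram(responses.values())))

# Before/after change per respondent, using only people who answered both surveys.
paired_ids = sorted(pre.keys() & post.keys())
differences = {rid: post[rid] - pre[rid] for rid in paired_ids}
mean_change = mean(differences.values())
print("change per respondent:", differences)
print("mean change:", mean_change)
```

Restricting the difference to respondents present in both surveys matters here, because a pre/post comparison only makes sense for matched pairs.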
Then I go back to my fundamental questions and try to answer them. And if I see interesting subpopulations misbehaving, I dig in to see why.
You could also spend some time with practice data (simple distributions and xy relationships, I’m sure it’s available online or just fake a dataset) and get used to the basic chart options in R. Distribution, trend charts, variability charts and xy (response, correlation) plots
I do a TON of this so feel free to pm me.
posted by St. Peepsburg at 12:46 PM on March 29, 2023 [6 favorites]
In your case, without a concrete hypothesis, you could just categorize groups (men, women, age, location, income, job, etc.) and then plot responses vs. category and see if anything jumps out. You could also compare before/after for these groups.
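One way to sketch the group comparison, assuming each respondent is a row with a grouping variable and matched pre/post scores. The group labels, field names, and numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: one per respondent, with a grouping variable and pre/post scores.
rows = [
    {"group": "staff", "pre": 2, "post": 4},
    {"group": "staff", "pre": 3, "post": 4},
    {"group": "manager", "pre": 4, "post": 4},
    {"group": "manager", "pre": 2, "post": 3},
]


def mean_by_group(rows, group_key, value_key):
    """Average value_key within each level of group_key."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[group_key]].append(row[value_key])
    return {group: mean(values) for group, values in grouped.items()}


pre_means = mean_by_group(rows, "group", "pre")
post_means = mean_by_group(rows, "group", "post")

# Change in the group mean from pre to post; large gaps between groups are worth a closer look.
change = {group: post_means[group] - pre_means[group] for group in pre_means}
print(change)
```

With only around 150 surveys, some groups will be small, so treat a big difference in a tiny group as a prompt to look closer rather than as a finding.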
posted by St. Peepsburg at 1:02 PM on March 29, 2023
Best answer: Working through some of Hadley Wickham’s R for Data Science is probably a good place to start. The chapter on EDA starts out by saying it “will show you how to use visualization and transformation to explore your data in a systematic way”, which seems like what you’re after!
posted by thebots at 1:40 PM on March 29, 2023 [2 favorites]
Best answer: Like St. Peepsburg, I work with data all the time. The most important step that so many people skip is data cleaning: look at the raw data and see if there are any missing values, or unexpected, extreme, or biologically impossible values (e.g., birthdates in 1900), and see whether there are any patterns to these elements (e.g., missingness is more frequent in women than men). If you don't check all this stuff and just go ahead and bang out some tables and figures, they are likely to be incorrect, or at least not as correct as they could be, because they will be diluted by the inclusion of bad/missing/incorrect data.
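A rough sketch of those two checks: flagging implausible values and looking for patterns in missingness. The field names, the toy rows, and the plausible birth-year bounds are all assumptions to adapt to the real survey.

```python
# Hypothetical survey rows; None marks a question the respondent left blank.
rows = [
    {"id": 1, "gender": "F", "birth_year": 1985, "q1": 4},
    {"id": 2, "gender": "F", "birth_year": 1900, "q1": None},
    {"id": 3, "gender": "M", "birth_year": 1992, "q1": 3},
    {"id": 4, "gender": "F", "birth_year": 1978, "q1": None},
    {"id": 5, "gender": "M", "birth_year": 2001, "q1": 5},
]

# Assumed plausibility bounds; pick ones that fit your respondents.
PLAUSIBLE_BIRTH_YEARS = range(1920, 2010)

# Flag rows for review rather than deleting them.
suspect_ids = [row["id"] for row in rows if row["birth_year"] not in PLAUSIBLE_BIRTH_YEARS]


def missing_rate_by(rows, group_key, value_key):
    """Share of rows in each group where value_key is missing."""
    totals, missing = {}, {}
    for row in rows:
        group = row[group_key]
        totals[group] = totals.get(group, 0) + 1
        missing[group] = missing.get(group, 0) + (row[value_key] is None)
    return {group: missing[group] / totals[group] for group in totals}


q1_missing = missing_rate_by(rows, "gender", "q1")
print("rows to review:", suspect_ids)
print("q1 missing rate by gender:", q1_missing)
```

Keeping this as a script, instead of fixing cells by hand, means the checks can be rerun whenever the export changes.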
In terms of best practices, here is a twitter thread from a statistician about programming and data analysis. Tldr: document, check, document, check and never do anything manually, always program it using a line or two of code. Annotate your code for the future.
And have fun!
posted by lulu68 at 2:52 PM on March 29, 2023 [4 favorites]
Gonna push back on the above comment and say that you do not want to delete any observations with missing values; just make sure missing values are coded as missing (NA in R, NaN in Python/pandas). If, say, some column takes a value of -9 when the data is missing, you’ll want to replace values of -9 with NA. If you use R and ggplot2, it will not plot NA values.
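To illustrate why the sentinel matters, here is a small sketch in plain Python, where None plays the role that NA plays in R. The -9 code comes from the example above; the responses themselves are invented.

```python
from statistics import mean

# Assumed sentinel(s) that the export uses for "no answer".
MISSING_CODES = {-9}

# Hypothetical responses to one 1-5 Likert item.
responses = [4, -9, 3, 5, -9, 2]

# Recode sentinels as missing instead of dropping the respondent.
cleaned = [None if value in MISSING_CODES else value for value in responses]

# Summaries skip the missing entries explicitly.
answered = [value for value in cleaned if value is not None]

naive_mean = mean(responses)  # distorted by the -9 codes
clean_mean = mean(answered)   # based only on real answers
print(f"naive mean: {naive_mean:.2f}, cleaned mean: {clean_mean:.2f}")
print(f"answered {len(answered)} of {len(cleaned)}")
```

The respondent count stays the same after recoding, so the same person's other answers remain available for analysis.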
posted by MisantropicPainforest at 4:53 PM on March 29, 2023
Best answer: I would recommend Stephanie Evergreen's books, and her blog (ex).
You can accomplish your objective in Excel, R, Python, or Stata. It is certainly not necessary to use a paid option. For a programming language, Stata does have a user interface that makes it relatively friendly to use survey data (among your programming language options). On the other hand, Stata is a niche programming language (really only common in the social sciences) and some of the niceties it does have might require "unlearning some concepts" to learn other programming languages in the future.
As your first step, I would suggest reviewing the actual survey (as it was shown to respondents and as it was shown on the backend). Was there any skip logic? Skip logic affects missingness. No matter which option you choose, I would suggest creating a survey codebook. If you choose a programming approach, and if your dataset doesn't already do this, you will almost certainly want to give each question a short unique name (one that follows the variable naming conventions of the language) so you can refer to the questions in a convenient programmatic manner. Even if Qualtrics already provided a short unique name, you might want to create human-friendly names like would_recommend_pre and would_recommend_post. The codebook is a good way to identify any inconsistencies in how your data was structured. E.g., I've seen the "same question" asked in the pre survey and the post survey, but with the response options going in opposite directions (e.g. 1 = bad, 2 = ok, 3 = great vs. 1 = great, 2 = ok, 3 = bad). You would want to "recode" one of the variables so that the questions' relative goodness goes in the same direction. Eventually you might also want to recode variables so that the relative goodness scale always goes in the same direction (e.g. the worst option is always on the left side of a graph's scale while the best option is always on the right side).
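A sketch of how a codebook can drive both the renaming and the recoding. The raw column names (Q12_pre, Q7_post) are hypothetical stand-ins for whatever the export uses, and the 1-3 scale matches the example above.

```python
# Hypothetical codebook: raw export column -> (friendly name, needs reverse coding?)
CODEBOOK = {
    "Q12_pre": ("would_recommend_pre", False),   # 1 = bad ... 3 = great
    "Q7_post": ("would_recommend_post", True),   # 1 = great ... 3 = bad
}

SCALE_MIN, SCALE_MAX = 1, 3


def reverse_code(value, low=SCALE_MIN, high=SCALE_MAX):
    """Flip a scale so that low becomes high and vice versa; keep missing as missing."""
    if value is None:
        return None
    return low + high - value


def apply_codebook(raw_row):
    """Rename columns and align scale direction according to the codebook."""
    tidy = {}
    for raw_name, (friendly_name, needs_reverse) in CODEBOOK.items():
        value = raw_row.get(raw_name)
        tidy[friendly_name] = reverse_code(value) if needs_reverse else value
    return tidy


# A respondent who picked option 1 both times: "bad" before, "great" after.
raw_row = {"Q12_pre": 1, "Q7_post": 1}
tidy_row = apply_codebook(raw_row)
print(tidy_row)
```

After recoding, a higher number means a better rating in both columns, so the pre/post difference has a consistent sign.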
You might want to consider cleaning your data using Open Refine or Power Query.
Regarding R vs. Python either option is up to the task. Individuals often have strong preferences for one over the other. My most neutral TLDR is the following: R was designed for statisticians as a programming language for statistics. Python was designed to be a user friendly general programming language that has gradually added statistical capability over time. Python is the more popular programming language overall, but R has dominance in a few specific industries.
If you choose R or Python, I think you can get a lot of bang for your buck using an EDA package/library when starting EDA.
I think you might find it helpful to review resources about working with survey data. (You can probably ignore parts about weighting, since you aren't working with a complex survey design). FYI you don't need to do statistical significance testing if the survey doesn't do any sampling.
posted by oceano at 9:28 PM on March 29, 2023 [2 favorites]
Best answer: One of the few big shots in Math to face this problem head-on was John Tukey. He wrote a book called Exploratory Data Analysis. It's dated and idiosyncratic but the basic ideas remain. He starts by noting the first things you want to know about a set of numbers is how big they are (mean and median) and how spread out they are (max, min, standard deviation). He goes on to look at the relationships between pairs of sets.
It might be worth a look if you feel entirely lost.
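Those first-look numbers (how big, how spread out) take only a few lines to compute. A sketch with the Python standard library, on invented scores that include one extreme value:

```python
from statistics import mean, median, stdev

# Hypothetical scores; the 13 is a deliberate outlier.
scores = [2, 3, 3, 4, 5, 5, 13]

summary = {
    "n": len(scores),
    "mean": mean(scores),      # how big, sensitive to outliers
    "median": median(scores),  # how big, robust to outliers
    "min": min(scores),
    "max": max(scores),
    "stdev": stdev(scores),    # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

A mean noticeably above the median, as here, is a quick hint that a few large values are pulling the average up.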
posted by SemiSalt at 5:04 AM on March 30, 2023
Adding that R is really easy to set up and the RStudio IDE is a gift from heaven. Python has no easy equivalent.
posted by MisantropicPainforest at 5:34 AM on March 30, 2023
I concur that RStudio is great. Python's closest analog to R/RStudio is running a Jupyter Notebook with Anaconda. It's been my experience that Python can be fussier than R for installing packages/libraries. IMHO, for your use case, using Anaconda/Python won't necessarily be that much harder to set up than R/RStudio, since Anaconda already comes with the basic packages needed for EDA. It's also possible to run a Jupyter Notebook in Google Colab, where everything is already set up for you, but it may not be permissible to connect your data with Google's services.
I think the (Jupyter) notebook approach is particularly friendly to beginning coders since one can have formatted text alongside the code. (Think formatted text in Microsoft Word vs. a basic text editor.) I do appreciate that the notebook approach prevents the overwhelming block of text, because it's easy to make section headings and readable text that doesn't have to be formatted as comments. RStudio's approach to formatted code is IMHO not as easy to use as a Jupyter Notebook.
For the record, I like both R and Python. At different times, I have found both gratifying and frustrating. It took me a while to realize that I often didn't appreciate the advantages of one of the languages until I was frustrated with the other.
posted by oceano at 7:43 AM on March 30, 2023
This thread is closed to new comments.
posted by St. Peepsburg at 10:12 AM on March 29, 2023