How to implement Random Forest regression models in practice?
May 31, 2024 3:23 AM Subscribe

When using linear regressions to predict data, there are equations like Y = a X + b Y + c that can be easily implemented in Excel or elsewhere. When creating Random Forest models for regression in R or Python, can those models be implemented outside the tools used to create them?

To give an example, when I have a linear regression model, using it to make predictions just consists in creating an Excel table with the predictors values in columns and adding an extra column for the regression formula, each row being an individual with its own set of predictive values.

How does this work when using Random Forest models? I'm just discovering those and what I've read so far takes certain things for granted, so I don't quite understand how it works in practice. I can load the training data, write code to create the model, and then ask the model to predict things, but how can I implement this in practice, for instance to predict with the same model hundreds of individual values, with the predictors stored in an Excel file?

posted by elgilito to Computers & Internet (16 answers total)

prediction from an excel set? try python's approach to complex numbers
posted by HearHere at 3:59 AM on May 31

I had not heard of Random Forrest until now, but from a quick look at Wikipedia, I'd guess the two possibilities are 1) an enormous amount of tedious logic formulas in cells hidden away on separate worksheets, or 2. learning how to do the tedious logic programming in Visual Basic.
posted by SemiSalt at 4:41 AM on May 31

I think you are probably going to have to use python to run your predictions. If you have a python script that can predict from one value then it's easy enough to extend that to read a .csv file and output predictions for all the values. With a bit more coding you could have a script that read and updated the excel sheet for you.

If you are doing a lot of data science with python then it might be worth learning how to use jupyter notebooks which are code, data and often documentation all on one page - using the free Google Colab online is an easy way to get started. (Using dataframes you can get a spreadsheet like view too).
posted by samj at 5:22 AM on May 31

Can you say more about why you’d like to implement this in Excel? When I fit a more complicated model in R, I tend to prefer to let R handle predicting new values using its predict() function because reimplementing elsewhere seems pretty tedious. It’s easy enough to read in an Excel file in R, do computations on its contents (eg predict new values using a stored model object), and spit the results back out also in Excel. But my guess is you have some use case or constraint I have not thought of?
posted by eirias at 5:38 AM on May 31 [1 favorite]

Fundamentally a tree is just a pile of if statements and a forest is just a bunch of trees with an aggregation, so you could implement it in Excel, though I wouldn't recommend it. You would have to generate the Excel formula on the Python/R side by traversing the forest. I have DIYed a random forest serialization with JSON to get from Spark to Golang. It's not particularly difficult except for the part where you're talking about Excel. Now that I think about it, you might be able to do something with Visual Basic rather than just codegen-ing a heinous formula.

There's a model serialization format called PMML, but I can't imagine Excel supports it.
posted by hoyland at 5:52 AM on May 31 [1 favorite]

But my guess is you have some use case or constraint I have not thought of?

Thanks! The use case is that we're making tables of nutritional values for certain products. Traditionally, some of these values are obtained in the laboratory by chemistry and some are predicted from the former using linear regressions (for instance Energy = f(protein, fibre, fat)). We can disseminate those values (predictors and predicted) to other people in Excel tables where the calculations are dynamic: the user input their own predictor values and get the predicted values.

As I was investigating the data analysis features of ChatGPT 4 and julius.ai, both apps started pushing R-based Random Forest regressions by default (they can do linear regression too of course). The results seem promising, but I don't know whether it is possible to "encapsulate" the models like I do in Excel (where I can just send the file to the users), or if the models absolutely require R or Python to run. Ideally, the model could shipped in a blob that could be read and used by Excel, or there could be a workflow requiring the installation of the modelling package, but still allowing the user to work with Excel and change predictor values in columns A-F and see the updated predicted values in columns G-L.
posted by elgilito at 6:06 AM on May 31 [1 favorite]

Given that context, my instinct is a visual basic macro is the way to go.
posted by hoyland at 6:18 AM on May 31

One removed. Please remember that AskMetaFiler is more tightly moderated than the rest of MetaFilter and answers that do not answer the question will often be removed.
posted by Brandon Blatcher (staff) at 6:42 AM on May 31 [2 favorites]

I’ve never done it but I know I’ve seen references to calling R from Excel. Looks like there’s Microsoft support for calling python from Excel but I’m not sure how you’d bring in the model logic.

You could also explore using Shiny to build a simple python based web app that would let users upload their inputs, run the model, and download the predictions.
posted by yarrow at 7:23 AM on May 31 [1 favorite]

I second Shiny app; this can be R-based and is straightforward to set up (for some value of straightforward depending on your R experience).
posted by quadrilaterals at 7:43 AM on May 31 [1 favorite]

How important are the results of these inquiries? Last time I checked, Excel’s mathematics was sufficiently unreliable that NIST officially didn’t review it as statistical software. (Before that, when they did, it failed. )

If you can set up Excel as the GUI and Python/R doing the computation, best of both fields.
posted by clew at 8:00 AM on May 31 [1 favorite]

I would under no circumstances try to do this in excel, this is complicated enough that you want to rely on an established and well tested data science package that implements many regression models. I would personally start with scikit-learn's implementation.

Also, I gotta ask, are you choosing these models for a principled reason? E.g., are you choosing these models because you want an ensemble model where (as the scikit-learn docs put it) "Ensemble methods combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator.", where random forests specifically inject randomness over regular decision tree methods because " By taking an average of those predictions, some errors can cancel out. Random forests achieve a reduced variance by combining diverse trees, sometimes at the cost of a slight increase in bias."? Blindly choosing a model because a LLM suggests it to you and for no other reason is a recipe for ... bad results. If you need the things that an ensemble model and specifically random forests are good for, then go forth, but otherwise...regression models do not form a linear scale where fancier / more complicated = better for all purposes.
posted by advil at 8:05 AM on May 31 [8 favorites]

As Hoyland says, random forest models are huge for anything moderately complicated. For one I've deployed, the actual calculation is fast, but the first prediction takes 20-30 seconds as we read the massive decision tree into memory.

The size would make me shy away from Excel. If I needed results to be available in Excel, my first approach would be by having the model run in R or Python and making it accessible via a web server (in an intranet situation) or trying to install a local copy of the programs to power the model.

RF models have excellent general performance. But (I'm sure you know this) take the time to validate performance vs. linear regression before you go through the deployment. From the way you described the problem my gut instinct is MLR will work just as well. It might not, but it's going to be easier to centralize the data on your computer and run the comparison vs RF there, rather than deploy the RF model widely.
posted by mark k at 9:03 AM on May 31

Theoretically you can, in practice you cannot do this in excel. I would just write an R or python script to read in the data from excel, and then save the output to an excel file.

A the predictive output (without any associated standard error) of a normal linear model is just a vector of parameters for however wide the design matrix is. I could send these outputs here in a metafilter comment, theres no way to do that with a random forest model. Its just too complicated.

How many covariates do you have?
posted by MisantropicPainforest at 11:36 AM on May 31

Thanks for your answers. At this stage this was more exploratory than anything. I've been using linear regression for decades so it was interesting to test alternative methods. But yes, it seems that it's not really a practical solution for this particular use case.
posted by elgilito at 12:32 AM on June 4

You mention llinear regression, but Excel has other techniques built in such as principally component analysis and various clustering methods.
posted by SemiSalt at 4:47 AM on June 4

« Older Collecting/'Treasure hunting' activities? | Favorite online social puzzles? Newer »

You are not logged in, either login or create an account to post comments

Ask MetaFilter

How to implement Random Forest regression models in practice?
May 31, 2024 3:23 AM Subscribe

Tags

Share

How to implement Random Forest regression models in practice? May 31, 2024 3:23 AM Subscribe

Tags

Share

How to implement Random Forest regression models in practice?
May 31, 2024 3:23 AM Subscribe