One of these with one of those...but how?
September 19, 2024 8:07 AM

I need help understanding how to optimize the matching of items for column A with items from column B based on two characteristics.

I have a set of items in column A that have two characteristics, x1 and y1, and another set of items in Column B, with matching characteristics x2 and y2.

x and y are numbers, and they vary over some small range.

My matching criterion is that x2 - x1 must be within some range, and I can generally say that ideally it should be inside an even tighter range within the acceptable range (i.e. "it can be between 16 and 21, but I'd like it to be 20"). The same goes for y2 - y1, just with different numbers.

One of my tasks is to match things from column A with things from column B. I find this task very difficult, partly because I have a mental hang-up that there's a 'best' way to do this, even though I know that optimizing for two variables isn't always trivial. A second problem is that getting the physical items in column A and column B is time consuming, labor intensive, and expensive, and I hate to leave parts on the table because my matching is non-ideal.

A third problem is that even if I choose an 'ideal' match between available items, there's no guarantee that the two things will actually work together for reasons. But that's not what this question is about.

Is there a tool (mental model, algorithm, computer software, spirit I can consult with runes, etc) that will help me come up with a color map or ranking or...something...of pairings that approaches an ideal match between items in my sets?

I've tried hand drawn maps with arrows, I've tried some limited Excel color coding, I've tried avoiding the problem for days on end while my boss asks me about it regularly. The first two create visual messes, and the third has its own problems.

I have some limited coding experience in Python, if that helps. I could, for instance, compute x2 - x1 and y2 - y1 for all possible pairings, and then find the best pairing for each item in column A out of B, and for each item in B out of A, but what is the metric for ranking all of these possibilities?
posted by AbelMelveny to Grab Bag (24 answers total)
Could you post ten lines of sample data?
posted by Tell Me No Lies at 8:40 AM on September 19


If you have coding experience, I would solve this using Google Sheets and its Apps Script functionality. It's JavaScript used to manipulate your spreadsheet data. I do this at work a lot lately. Get ChatGPT to help you write the script and tweak it as needed. When you have a difficult problem like this, you don't have to do it all in one go. Get one function that establishes your ranges. Test it. Then add another function to colour code cells, another to highlight your ideal values. It's totally doable!
posted by kitcat at 8:49 AM on September 19


How many items are you choosing from, and then how many items are you choosing?

For example, is it the best match out of a dozen or so? Or the best 10 matches out of several hundred, or the best 50 matches out of several thousands or tens of thousands, etc?
posted by flug at 8:49 AM on September 19


To answer some questions:
1. I have a rolling inventory of ~30 - 50 of each. So as I make a match between A14 and B22, A14 and B22 both leave my inventory. Sometimes B22 fails later on, so I have to put A14 back in the mix and find another match. I may sit down one day and need to find the 3 best matches out of my As and Bs, or the 10 best, or just rematch one that had a partner fail. I'd probably never do more than 5 or 6 at a time, realistically, and then more As and more Bs would show up in time and I'd do that again.

2. Here's a csv of 15 lines of data:
A,x1,y1,,B,x2,y2
A1,86.63,50.12,,B1,86.81,50.38
A2,86.65,50.14,,B2,86.83,50.38
A3,86.66,50.13,,B3,86.84,50.38
A4,86.66,50.13,,B4,86.84,50.39
A5,86.67,50.17,,B5,86.86,50.41
A6,86.67,50.16,,B6,86.86,50.40
A7,86.70,50.20,,B7,86.86,50.39
A8,86.71,50.21,,B8,86.86,50.42
A9,86.71,50.17,,B9,86.86,50.42
A10,86.71,50.19,,B10,86.87,50.43
A11,86.72,50.22,,B11,86.89,50.46
A12,86.72,50.23,,B12,86.89,50.45
A13,86.72,50.20,,B13,86.90,50.44
A14,86.72,50.23,,B14,86.90,50.45
A15,86.73,50.22,,B15,86.90,50.46

The criteria using these obfuscated numbers for x2-x1 is something like .17 - .22, where I'd prefer .21, and the criteria for y2-y1 is .24 - .29, but I'd prefer .27. And I guess I understand whatever I'm doing here would probably involve some kind of ranking or weighting of possibilities, but I'm not sure how (like, is it preferable for x2-x1 to be further from ideal or y2-y1 to be further from ideal? I could make these decisions).

Finally, a complicating factor I didn't want to mention, but I'M IN TOO DEEP:
The characteristics in column B are immutable; I cannot change them.
x1 (for column A) can be made smaller, with some risk. It cannot be made bigger. If it's to be done, there's a certain range where it can be made smaller safely (like, the magnitude of change can't be bigger than 10, and we'd prefer no more than 5).
y1 can also be made smaller, but not bigger, AND we'd prefer not to do it at all. If we do it, its magnitude of change should be <3, to continue my mostly made up numbering scheme.

I suppose all of that would be part of my ranking of outcomes? I'm not sure.
posted by AbelMelveny at 8:58 AM on September 19


I am not sure which of the following is the task you are trying to do:

1. For every A, find the B that is the 'best' match.
2. For every A, find the list of Bs that don't fail any of my hard constraints (range outside of 16-21) and rank them by how closely they match on the softer constraints.
3. For a set of As and a set of Bs, find the optimal pairing of unique As and unique Bs, even though a given B might be the optimal match for more than one A (ie, you'd have to determine which A to assign that B to and also find a different B for the other A with the least loss of optimization).
posted by jacquilynne at 9:02 AM on September 19


I hope it goes without saying that the reason I'm doing any of this is that the difference between y2 - y1 = .24 and y2 - y1 = .27 is significant enough to me and my employer to warrant the effort. There's a world inside that .03.

And also my obfuscated data does NOT represent the full possible range of x1, x2, y1, or y2. I just picked the top 15 in my data set and did some obfuscation.

De-sitting the thread starting in 3, 2, 1...
posted by AbelMelveny at 9:03 AM on September 19


Say you have N of each. If N is small enough that it's not too computationally intensive, you can search among all possible sets of pairings. What you need then is to find some sort of objective function O that you want to optimize. Then you compute that function for each assignment of pairings, and take the smallest (or biggest, whatever) value and use that pairing scheme.

Given the way you talk about it, I suspect this is possible to mostly brute force. The hard thing to do is come up with an objective function that suits your needs. I'm also assuming order is not important. So say you have a set of pairs {[Aj, Bf(j)]}, where f(j) is the pair mapping, i.e. it tells you, for each j in (1,...,N), which B gets paired with Aj.
One candidate that seems to make sense is to take O = sum( |xj - xf(j)| + |yj - yf(j)| ), j = 1...N, and pick the pairing with the smallest total value. But that might make you miss a pairing scheme where you get lots of fantastic matches and a few horrible matches. So you could also look at an O of the form O = sum( indicator( cost([Aj, Bf(j)]) < Threshold ) ), j = 1...N, which counts the number of really good matches and ignores bad ones.

The main thing is, I suggest you/your team need to come up with what an objective function O should really look like.

(On preview: I assume you mean jacquilynne's case 3, and for 30-50 items this is totally brute-forcible. You just need to think about what it is you're really going for: best average case, least terrible matches, most great matches, etc. You can even throw on weights and perks to give extra value to really good matches, or extra penalty to really bad ones, etc)
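To make that concrete, here's a rough sketch of the brute-force version in Python. This is toy data; the absolute-difference objective and the 0.21/0.27 ideals are stand-ins from the sample numbers. And since the number of pairings grows factorially, full enumeration only stays cheap up to a dozen or so items per side; beyond that you'd want a proper assignment algorithm rather than enumerating everything.

from itertools import permutations

# Toy data: each item is an (x, y) tuple; real lists would come from inventory.
A = [(86.63, 50.12), (86.65, 50.14), (86.66, 50.13)]
B = [(86.84, 50.39), (86.86, 50.41), (86.83, 50.38)]

def objective(pairing):
    # Sum of absolute deviations from the ideal offsets (0.21 in x, 0.27 in y).
    return sum(abs((bx - ax) - 0.21) + abs((by - ay) - 0.27)
               for (ax, ay), (bx, by) in pairing)

# Try every way of assigning each A a distinct B; keep the pairing with smallest O.
best = min((list(zip(A, perm)) for perm in permutations(B)), key=objective)
print(best)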
posted by SaltySalticid at 9:07 AM on September 19


jacquilynne -- #3
posted by AbelMelveny at 9:07 AM on September 19


One question I had was what is the number of these items vs how many you can process and how often do they get restocked? I see you just answered that question.

It sounds like if you only need to do a few per day and there’s a supply about 10x larger and they get restocked, then you might be able to just have an algorithm which computes the best fits for the X you need that day and recompute tomorrow’s best fits tomorrow.

This actually sounds like a problem that unsupervised machine learning could help with but I’m not that fresh on that topic so I’m not sure.

What I might try for an algorithm is to put in the values of the As and then figure out which Bs even possibly work. So for A1, you've got B3, B8, B23 as possibilities. Use a factor in your calculation so that you're trying to get as close to 0 as possible, like x2 - x1 - 0.21 = 0. You want to minimize that amount.

Then honestly you could probably just brute force it. Start with the pieces which have the fewest potential matches, then just randomly pair up matches and see which get you the X number of best fits you need that day.

I’m sure I’m not explaining this that well. And again on preview it seems like SaltySalticid may have explained what I’m getting at a bit clearer.
posted by cali59 at 9:10 AM on September 19


Conceive of these as XY coordinates on a plane and then calculate the pairwise Euclidean distance and take the closest one. If you store these in a k-d tree (I think) in Python you can do this for millions of observations very quickly.
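A rough sketch with SciPy's cKDTree. Shifting the A points by the ideal offsets (0.21, 0.27) is my own assumption, so that "nearest" means "closest to ideal"; note this greedy lookup can also hand the same B to several As.

import numpy as np
from scipy.spatial import cKDTree

a_pts = np.array([[86.63, 50.12], [86.65, 50.14], [86.66, 50.13]])  # A items
b_pts = np.array([[86.81, 50.38], [86.84, 50.39], [86.86, 50.41]])  # B items

tree = cKDTree(b_pts)
# For each A, find the B nearest to (A + ideal offset).
dist, idx = tree.query(a_pts + np.array([0.21, 0.27]))
for i, (d, j) in enumerate(zip(dist, idx)):
    print(f"A{i+1} -> B{j+1}, distance from ideal: {d:.3f}")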
posted by MisantropicPainforest at 9:12 AM on September 19


A programmatic approach to solving this might go something like this.

#1. Go through the entire list A and compare each member of A to each member of B.

- If the A/B pair fails either the x test (x1-x2 is not in the range given) or the y test (y1-y2 is not in the range given) then simply discard that pair
- If the A/B pair passes both tests, save the pair in a list.

#2. Now go through this list of A/B pairs and calculate the "goodness" of the match x1 to x2 given your more restrictive criteria (see below). Save that number. Similarly, calculate the goodness of the match y1 to y2 and save that number as well. Also add the two "goodness numbers" together to create the overall "goodness ranking" for this pair of items.

- The "goodness number" for each pair is 0 if the match is perfect and grows larger as the match becomes worse.
- A typical formula is something like: goodness number = abs (x1 - x2 - 20)
- This will only work if x1 is always bigger than x2 and your ideal solution is x1-x2 = 20
- If you only want the "distance" between x1 & x2 to be 20 (so x2-x1=20 is fine, and so is x1-x2=20), then you can use a formula like: abs(abs(x1-x2)-20)

- You end up with quite a set of numbers for each A/B pair: The original x1,y1 and x2,y2 values; the x1-x2 goodness ranking number, the y1-y2 goodness ranking number, and the overall goodness ranking number for the pair.
- You save all these numbers for each A/B pair into a giant list or array.

#3. Now you simply sort the giant list of A/B according to the overall goodness ranking and there is your answer - the best one(s) will bubble right up to the top in this sort (assuming your goodness numbers are all zero or positive with best match closest to zero, and you sort with smallest overall goodness numbers first on the list).

If you want to examine other "almost best" answers you can also sort the list by x1-x2 goodness number as top priority and y1-y2 goodness number as 2nd priority and just examine all the pairs near the top to see if you like one of them better for any reason.

You can reverse the priority to see other possibilities.

(This also works if it happens that the x1-x2 criteria is more important, and the other one only needs to be an OK match - or vice versa.)

You could do all this programmatically in python or whatever - or you could use your programming language only to generate all the candidate pairs and the goodness ranking numbers, and then import that list into Excel and do the sorting by Goodness ranking/numbers there. The advantage of using Excel at the final step is it allows you to look at all the rankings & information of the best matches and so exercise your judgement to the degree you like.

It is also really easy to sort columns in the different ways outlined above, and so look at the results from these different perspectives.
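A rough Python sketch of steps #1 through #3, using the asker's sample ranges and ideals rather than my made-up 20 (and the x2-x1 direction from the sample data):

X_LOW, X_HIGH, X_IDEAL = 0.17, 0.22, 0.21
Y_LOW, Y_HIGH, Y_IDEAL = 0.24, 0.29, 0.27

A = {"A1": (86.63, 50.12), "A2": (86.65, 50.14)}   # item: (x1, y1)
B = {"B4": (86.84, 50.39), "B5": (86.86, 50.41)}   # item: (x2, y2)

candidates = []
for a_name, (x1, y1) in A.items():
    for b_name, (x2, y2) in B.items():
        dx, dy = x2 - x1, y2 - y1
        # Step 1: discard any pair that fails either hard range test.
        if not (X_LOW <= dx <= X_HIGH and Y_LOW <= dy <= Y_HIGH):
            continue
        # Step 2: goodness numbers, where 0 is perfect and larger is worse.
        gx, gy = abs(dx - X_IDEAL), abs(dy - Y_IDEAL)
        candidates.append((a_name, b_name, dx, dy, gx, gy, gx + gy))

# Step 3: sort by overall goodness so the best pairs bubble to the top.
candidates.sort(key=lambda row: row[-1])
for row in candidates:
    print(row)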
posted by flug at 9:12 AM on September 19


If you can define the 'cost' for matching a pair, as in a quantity that is higher when the combination is worse and lower when it's better, then you may be able to write some python to calculate the cost for all possible pairs as a 2D table, then feed that into an algorithm like this:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.linear_sum_assignment.html#scipy.optimize.linear_sum_assignment
to compute the exact optimal matching, under the assumptions described on that page. (It seems like it might be a close fit to your problem but I might be wrong.)
It should be possible to incorporate your x and y values as well as the desired tolerances into the cost definition but the results will depend on how exactly you define this, so you'd probably want to experiment a bit and sanity check the results.
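Here's a rough sketch of how that could look, if I'm reading your numbers right; the 1e6 penalty for out-of-range pairs is one common trick for folding hard constraints into the cost matrix, not the only way.

import numpy as np
from scipy.optimize import linear_sum_assignment

ax = np.array([86.63, 86.65]); ay = np.array([50.12, 50.14])   # column A
bx = np.array([86.84, 86.86]); by = np.array([50.38, 50.41])   # column B

dx = bx[None, :] - ax[:, None]   # x2 - x1 for every A/B pair
dy = by[None, :] - ay[:, None]   # y2 - y1 for every A/B pair
cost = (dx - 0.21) ** 2 + (dy - 0.27) ** 2

# Hard constraints: make out-of-range pairs prohibitively expensive.
BIG = 1e6
cost[(dx < 0.17) | (dx > 0.22) | (dy < 0.24) | (dy > 0.29)] = BIG

rows, cols = linear_sum_assignment(cost)   # exact optimal one-to-one matching
matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] < BIG]
print(matches)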
posted by d11 at 9:14 AM on September 19


Can you explain what the algorithm to determine the best match is? Is it just a minimum x2-x1/y2-y1?

And it is unclear if you want the solution that minimizes the total ratio (or whatever the metric is) across your whole set of As and Bs or a solution that minimizes the metric for the best N matches, ignoring the rest, or you want to just be able to find the optimal matches for a specific set of A or B. I think you want the latter because of the rolling inventory situation, but that's not quite clear to me. A solution that matches all inventory might not be optimal if you don't actually need to match all of it now and better matches might be added in the future. Fundamentally, if this is the situation, there is no perfect solution based only on what you currently have in your spreadsheet and you would need to use statistical methods based on past flows to fully optimize.

That said, the simplest solution of just picking the best matches for the inventory you need to match today might be very reasonable.
posted by ssg at 9:31 AM on September 19


I am assuming you want to do this in a spreadsheet and not with a custom coded optimizing algorithm. I would start by making As a column and Bs a row so that you can calculate the entire grid of results.

This is how I would think of it conceptually:
Do the X2-X1 calculation and if it is outside the range, return an error code. This pair won't work. Use conditional formatting to black it out.
Do the Y2-Y1 calculation and if it is outside the range, return an error code. This pair won't work. Use conditional formatting to black it out.

Compare the X2-X1 calculation to the ideal. If it matches, give it a significant weighting factor. If it doesn't match, but X1 could be adjusted to make it match better, give it a less significant weighting factor, based on risk with making that change to X1. If it doesn't match, but the mismatch is in the other direction, so it couldn't be corrected, give it the lowest weighting factor.
Compare the Y2-Y1 calculation to the ideal. As above with regard to weighting and risk.

You can adjust the actual weighting factors to reflect Y being more important than X or vice versa. So maybe a perfect match is a 0 (differences on both X and Y are 0), but a .1 difference on X gets weighted as a .2 difference while on Y it gets weighted as a .3 difference, and if the difference is in the correctable direction, you lower the weighting a little based on the risk of correcting it and if it's in the non-correctable direction you increase it a little because it can't be corrected for.

Add the weighting factors from the two steps above together to get an overall rating of how good a match is.
Use conditional formatting to highlight the best matches and tbh, I'd probably eyeball the rest to make a list manually, but you could have various functions that pull out the best matches for each A and B.

This is where a better coded solution would use optimization logic to make the decisions for you on which A gets which B when the same B is the best match for more than one A, based on your larger goals for the list. Like, is it better to have 5 perfect matches if it means you will be left with mediocre or failing matches for the other 15 items on a list, or do you need to find a list that gives each of 20 items a working match, even if it means they might all be kind of mediocre?

But as a manual version, you can be the optimizing algorithm. Build a list to the side of your grid of what you think will be the best list and have it calculate the total and average goodness scores for your list. If you swap out a perfect match for two slightly less perfect matches, does your total goodness get better or worse? Is that the kind of optimization that makes sense for your conditions?
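(If you did go the coded route instead, the direction-dependent weighting might look something like this sketch; the direction convention and the weight values here are placeholders to adjust:)

def axis_score(delta, ideal, w_fixable, w_unfixable):
    # delta is x2 - x1 (or y2 - y1). Shrinking x1/y1 makes delta bigger,
    # so a delta *below* ideal is the correctable direction.
    miss = delta - ideal
    return abs(miss) * (w_fixable if miss < 0 else w_unfixable)

# Example: x misses weighted 2 (fixable) / 3 (unfixable); y misses weighted
# 3 / 4 since we'd rather not touch y1 at all.
dx, dy = 0.19, 0.28
total = axis_score(dx, 0.21, 2.0, 3.0) + axis_score(dy, 0.27, 3.0, 4.0)
print(total)   # 0.02*2 + 0.01*4 = 0.08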
posted by jacquilynne at 9:35 AM on September 19


To clarify my comment after all the updates: as others have alluded this is really about how to pick an objective function O that scores each pairing scheme. ("Objective" is the general terminology, it could be a cost function, fit function, loss function, profit function, whatever). Once you have that figured out, asking how to implement a brute-force solution in Python is pretty straightforward.

If you really don't know how to form O, you can explore families of O's to see how useful they are with your real stuff and data. E.g. in my indicator example, you can think of that as a family of O's, parameterized by the threshold T. Decreasing T helps you find the better matches, but fewer of them, with more rejects. Increasing T gives you more matches when you optimize O, and fewer items to re-tool or shelve until next round. But your real data may indicate a huge change in performance just by changing T a little. A threshold of thresholds, if you will. This is pretty common, and a reason why this kind of optimization problem is usually best tackled iteratively or interactively, rather than picking one solution and sticking to it.
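E.g., a toy sweep; the cost triples here are made up, and in practice they'd come from scoring your real pairs:

# Hypothetical (A, B, cost) triples from some scoring pass over real pairs.
scored_pairs = [("A1", "B4", 0.02), ("A2", "B5", 0.04), ("A3", "B9", 0.11)]

for T in (0.05, 0.10, 0.20):
    kept = [(a, b) for a, b, cost in scored_pairs if cost < T]
    print(f"T={T}: {len(kept)} kept, {len(scored_pairs) - len(kept)} rejected")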
posted by SaltySalticid at 9:37 AM on September 19


What d11 suggested is a straightforward way to model and solve this type of thing as a standard operations-research optimisation problem. I also recommend an approach like this. The docs for scipy.optimize.linear_sum_assignment describe it as a solver for the assignment problem (wikipedia). If you prefer to work in Excel rather than in Python, try searching for keywords like: excel operations research solver "assignment problem".

To be able to encode your problem as an instance of a linear sum assignment problem, as d11 suggests, you need to define a cost for each proposed match between an item from column A and another item from column B.

Here's a naive way to define a cost. Fix some positive constants alpha > 0, beta > 0; you could start with them both set to 1.

Define the "cost" of a proposed match as cost := alpha * ((x2-x1) - 0.21)^2 + beta * ((y2-y1) - 0.27)^2

If you make alpha larger than beta, then you're making a trade-off, saying it's better to reduce the x-error than the y-error.

This isn't the best cost function for your problem; from what you've described, your situation is more complicated than that.

But, as long as you define some cost function that spits out a non-negative number, and assigns larger numbers to inferior proposed matches, then you can define a cost matrix to feed into a solver like linear_sum_assignment, and get it to spit out a subset of matches that minimises the total cost.
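In code, that naive cost is just the following; the 0.21/0.27 ideals are taken from your sample numbers.

ALPHA, BETA = 1.0, 1.0   # raise ALPHA relative to BETA to prioritise the x-fit

def cost(x1, y1, x2, y2, alpha=ALPHA, beta=BETA):
    return alpha * ((x2 - x1) - 0.21) ** 2 + beta * ((y2 - y1) - 0.27) ** 2

# e.g. A1 vs B1 from the sample data:
print(cost(86.63, 50.12, 86.81, 50.38))   # (-0.03)^2 + (-0.01)^2 ~= 0.001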
posted by are-coral-made at 9:50 AM on September 19


It sounds as though you have "ideal" numbers for both x2-x1 and y2-y1. So, you could compute a distance function:

SQRT( ((x2-x1) - idealx)^2 + ((y2-y1) - idealy)^2 )

This then becomes the score for each match between a B item and an A item: it shows the distance of the x and y values for that match from your ideal numbers, with smaller distances being better. As some have stated, one of the columns of items will need to be transposed so that you end up with an n x m matrix, where n is the number of A items and m is the number of B items. You can then calculate the value that should be at each entry in the matrix using the distance function.
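If you use numpy, broadcasting builds that whole matrix in one shot. A rough sketch, with the ideal offsets taken from the thread's sample numbers:

import numpy as np

a = np.array([[86.63, 50.12], [86.65, 50.14]])   # A items, one row each
b = np.array([[86.81, 50.38], [86.83, 50.38]])   # B items, one row each
ideal = np.array([0.21, 0.27])

diff = b[None, :, :] - a[:, None, :] - ideal      # (n, m, 2) grid of offsets
dist = np.sqrt((diff ** 2).sum(axis=2))           # dist[i, j] scores Ai with Bj
print(dist)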
posted by TimHare at 10:58 AM on September 19


If all you're doing with the outputs of a distance function is sorting them, there's no need for the SQRT because the raw sum of squares will yield the same sort order.

Also, you might want something a little more involved than ((x2-x1)-idealx) in the guts of it, given that there are hard acceptability limits on (x2-x1) and that the ideal value is not half way between these.
posted by flabdablet at 11:14 AM on September 19


You are all kind and generous people, and I so appreciate the time you've taken today to answer my questions. I apologize for places where I was not as clear as I could've been. There's a number of things I'm going to have to do some reading on -- and that alone is very much appreciated, especially the talk about objective functions. I appreciate the specific suggestions for things to try, and I'll work on making some progress there.
posted by AbelMelveny at 12:20 PM on September 19


There's no need for the squares either; absolute value will suffice. There's no a priori reason to think Euclidean distance is the best. In fact, given that decreasing xj is acceptable in cases where decreasing yj is not, it might make sense to use some weights Wx and Wy to reflect which one is more important to get close. E.g. O = sum( Wx*|xj - xf(j)| + Wy*|yj - yf(j)| ), and again you can run the routine for several values of the weights to see which ones align better with the kinds of pairings you want. I'm probably biased, but SciPy/NumPy makes all this straightforward to do, if you can write what you want in pseudocode. That will give you much better performance in the long run, in run time, exploratory analysis ability, tweakability, and final results. Especially since it seems like you want to do a lot of this over the next year. In contrast, dicking around with Excel is like using a hammer when a mallet is called for: it might work, but you also could muck things up by not having the capacity for subtlety. If this really is something worth good money for the company, demand good support/money/assistance in doing it right.
posted by SaltySalticid at 12:35 PM on September 19


Dragged out my rusty Python semi-skills and came up with this, using your sample data:

a.csv
"item","x","y"
"A1",86.63,50.12
"A2",86.65,50.14
"A3",86.66,50.13
"A4",86.66,50.13
"A5",86.67,50.17
"A6",86.67,50.16
"A7",86.70,50.20
"A8",86.71,50.21
"A9",86.71,50.17
"A10",86.71,50.19
"A11",86.72,50.22
"A12",86.72,50.23
"A13",86.72,50.20
"A14",86.72,50.23
"A15",86.73,50.22
b.csv
"item","x","y"
"B1",86.81,50.38
"B2",86.83,50.38
"B3",86.84,50.38
"B4",86.84,50.39
"B5",86.86,50.41
"B6",86.86,50.40
"B7",86.86,50.39
"B8",86.86,50.42
"B9",86.86,50.42
"B10",86.87,50.43
"B11",86.89,50.46
"B12",86.89,50.45
"B13",86.90,50.44
"B14",86.90,50.45
"B15",86.90,50.46
config.csv
"x_low", "x_opt", "x_high", "x_weight", "y_low", "y_opt", "y_high", "y_weight"
   0.17,    0.21,     0.22,        1.0,    0.24,    0.27,     0.29,        1.0
optimatch.py
#!/usr/bin/python3
import sys
import csv

# Fit is negative for clearance outside bounds, zero at
# either bound, varying smoothly up to 1 at optimum.
def fit(clearance, lower, optimum, upper):
    if (clearance < optimum):
        return (clearance - lower) / (optimum - lower)
    else:
        return (upper - clearance) / (upper - optimum)

with open(sys.argv[1]) as f1:
    a = list(csv.DictReader(f1, skipinitialspace = True, quoting = csv.QUOTE_NONNUMERIC))

with open(sys.argv[2]) as f2:
    b = list(csv.DictReader(f2, skipinitialspace = True, quoting = csv.QUOTE_NONNUMERIC))

with open(sys.argv[3]) as config:
    c = next(csv.DictReader(config, skipinitialspace = True, quoting = csv.QUOTE_NONNUMERIC))

# Build a list of pairs, each including a weighted sum of fits.
pairs = [
    (
        i['item'],
        j['item'],
        fit(j['x'] - i['x'], c['x_low'], c['x_opt'], c['x_high']) * c['x_weight'] +
        fit(j['y'] - i['y'], c['y_low'], c['y_opt'], c['y_high']) * c['y_weight']
    ) for i in a for j in b
]

# Emit pairs in order of best fit.
pairs.sort(key = lambda pair: pair[2], reverse = True)
for pair in pairs: print(pair)
./optimatch.py a.csv b.csv config.csv | head
('A2', 'B5', 1.9999999999997107)
('A1', 'B4', 1.9999999999990479)
('A3', 'B6', 1.7499999999999378)
('A4', 'B6', 1.7499999999999378)
('A6', 'B10', 1.7499999999999156)
('A2', 'B6', 1.666666666666444)
('A1', 'B3', 1.6666666666660404)
('A9', 'B13', 1.5000000000001652)
('A10', 'B15', 1.500000000000143)
('A2', 'B8', 1.499999999999787)
It does something. I'll leave you to be the judge of whether it's something useful.
posted by flabdablet at 1:24 PM on September 19


Also just noticed that it could probably use some logic to remove output lines that contain items that were already picked as parts of better pairings further up the sorted list, like items A2, A1 and B6 above. If the number of pairings you need on any given day is small and/or there are other criteria that would prevent certain pairings from working as you mentioned, you could probably just eyeball that. And of course there's the horrendous lack of any kind of error checking. But I think the principle is sound.
posted by flabdablet at 1:38 PM on September 19


Actually it's not good. A negative fit result should instantly disqualify a pairing, so the step that builds the list of pairs should have some filtering logic to handle that. Revising:

optimatch.py
#!/usr/bin/python3
import sys
import csv

# Fit is negative for clearance outside bounds, zero at
# either bound, varying smoothly up to 1 at optimum.
def fit(clearance, lower, optimum, upper):
    if (clearance < optimum):
        return (clearance - lower) / (optimum - lower)
    else:
        return (upper - clearance) / (upper - optimum)

with open(sys.argv[1]) as f1:
    a = list(csv.DictReader(f1, skipinitialspace = True, quoting = csv.QUOTE_NONNUMERIC))

with open(sys.argv[2]) as f2:
    b = list(csv.DictReader(f2, skipinitialspace = True, quoting = csv.QUOTE_NONNUMERIC))

with open(sys.argv[3]) as config:
    c = next(csv.DictReader(config, skipinitialspace = True, quoting = csv.QUOTE_NONNUMERIC))

allpairs = [
    (
        i['item'],
        j['item'],
        fit(j['x'] - i['x'], c['x_low'], c['x_opt'], c['x_high']) * c['x_weight'],
        fit(j['y'] - i['y'], c['y_low'], c['y_opt'], c['y_high']) * c['y_weight']
    ) for i in a for j in b
]

acceptable = [pair for pair in allpairs if pair[2] >= 0 and pair[3] >= 0]
acceptable.sort(key = lambda pair: pair[2] + pair[3], reverse = True)
for pair in acceptable: print(pair)
./optimatch.py a.csv b.csv config.csv | head -n15
('A2', 'B5', 0.9999999999998439, 0.9999999999998668)
('A1', 'B4', 0.9999999999992034, 0.9999999999998446)
('A3', 'B6', 0.7500000000000712, 0.9999999999998668)
('A4', 'B6', 0.7500000000000712, 0.9999999999998668)
('A6', 'B10', 0.7500000000000712, 0.9999999999998446)
('A2', 'B6', 0.9999999999998439, 0.6666666666666)
('A1', 'B3', 0.9999999999992034, 0.6666666666668369)
('A9', 'B13', 0.5000000000002984, 0.9999999999998668)
('A10', 'B15', 0.5000000000002984, 0.9999999999998446)
('A2', 'B8', 0.9999999999998439, 0.4999999999999431)
('A2', 'B9', 0.9999999999998439, 0.4999999999999431)
('A1', 'B2', 0.7500000000000712, 0.6666666666668369)
('A3', 'B7', 0.7500000000000712, 0.6666666666666)
('A4', 'B7', 0.7500000000000712, 0.6666666666666)
('A5', 'B10', 0.7500000000000712, 0.6666666666666)

posted by flabdablet at 2:01 PM on September 19


Here's another version that remembers which parts have already been picked so as not to re-pick the same ones. Interestingly, for the example data supplied, this actually cuts the resulting pairs list down enough that there's no longer any need to use head to truncate it.

This version also interprets the weights in config.csv as percentages reflecting the relative importance of fitting on x and fitting on y, and rounds the weighted fit results so that the final fit scores become simple integer quality-of-fit percentages instead of weird-ass decimals. The config file weights should sum to 100 to make this work properly.

The a.csv and b.csv files are still as above.

config.csv
"x_low", "x_opt", "x_high", "x_weight", "y_low", "y_opt", "y_high", "y_weight"
  0.17,    0.21,     0.22,         50,    0.24,    0.27,     0.29,         50
optimatch.py
#!/usr/bin/python3
import sys
import csv

# Fit is negative for clearance outside bounds, zero at
# either bound, varying smoothly up to 1 at optimum.
def fit(clearance, lower, optimum, upper):
    if (clearance < optimum):
        return (clearance - lower) / (optimum - lower)
    else:
        return (upper - clearance) / (upper - optimum)

with open(sys.argv[1]) as f1:
    a = list(csv.DictReader(f1, skipinitialspace = True, quoting = csv.QUOTE_NONNUMERIC))

with open(sys.argv[2]) as f2:
    b = list(csv.DictReader(f2, skipinitialspace = True, quoting = csv.QUOTE_NONNUMERIC))

with open(sys.argv[3]) as config:
    c = next(csv.DictReader(config, skipinitialspace = True, quoting = csv.QUOTE_NONNUMERIC))

pairs = [
    (i['item'], j['item'], round(x_fit * c['x_weight'] + y_fit * c['y_weight']))
    for i in a for j in b
    if  (x_fit := fit(j['x'] - i['x'], c['x_low'], c['x_opt'], c['x_high'])) >= 0
    and (y_fit := fit(j['y'] - i['y'], c['y_low'], c['y_opt'], c['y_high'])) >= 0
]

pairs.sort(key = lambda pair: pair[2], reverse = True)
picked = {}
for pair in pairs:
    if not (pair[0] in picked or pair[1] in picked):
        picked[pair[0]] = picked[pair[1]] = True
        print(pair)
./optimatch.py a.csv b.csv config.csv
('A1', 'B4', 100)
('A2', 'B5', 100)
('A3', 'B6', 87)
('A6', 'B10', 87)
('A9', 'B13', 75)
('A10', 'B15', 75)
('A4', 'B7', 71)
('A7', 'B11', 58)
('A5', 'B8', 42)
('A13', 'B14', 29)
('A8', 'B12', 13)

posted by flabdablet at 11:04 AM on September 22

