Math equation / algorithm for a database
July 3, 2008 8:05 PM Subscribe
I'm looking for some help with a math equation / algorithm I'm trying to create. I'm a programmer by trade and not very good at math so I'd like some help with this or perhaps a site where I could pay someone to come up with the equation I'm looking for. I considered rentacoder or elance however I'd be concerned if the companies there had the proper math knowledge. A lot of claims get made on those sites that turn out to be false.
I'm trying to do is calculate the value or profitability of keywords based on some data that I have. I have a list of about 35 categories and in each category I have 500 keywords that belong to that category. I've provided a small sampling of the data I have (at the end of this post). The sampling has 5 categories and in each category it has the top 5 and bottom 5 from the lists of 500.
The category records have a '~' starting the row. The first field is the category name and the second field is the total dollar amount of products sold in that category in the last 2 weeks.
Under each category are the keyword records that belong to that category. Normally this list has 500 records however I’m only displaying the top and bottom five (records 1-5 and 495-500). These records are always placed in order of popularity. Meaning record one is searched for more often than record two. The first field if the record is the keyword itself. The second field is the number of products sold in the past two weeks that contain that keyword (raw number not dollar amount however I might be able to get that data). Field three is the number of products currently being offered that contain that keyword. In other words how many products are competing for that keyword at any given moment. The last field is a number representing the popularity of the keyword in the regular search engines (Google, Yahoo, etc.). It's a weight figure for how many people search for that term in a given month. I'm not sure how they come up with this number however here is a sampling for reference (sex: 1415948, harry potter: 242471, will smith: 91542, java books 378). I don't have this data but I was also thinking of including how many web sites compete for that keyword in Google to do a supply/demand comparison with field four. I can easily get that data and would like to include that in the computation.
Some thoughts I have about the data. The keywords are ordered by popularity like I said. This alone is only mildly helpful. I was thinking that the amount of product sold in a category could bring some value to that ranking. Meaning position 1 in a category that sells 1 million every 2 weeks is more valuable than position 1 in a category that sells $200,000 every 2 weeks. I also thought the search engine popularity weight number might also bring some value to that order. If the product ranks high in both areas then perhaps it is of greater overall value.
This is the data I have at the moment however I'm open to thoughts or suggestions. I might be able to get additional data if it's important however I'd like to stay with this if I can. I'm certain I have access to this data.
What I would like to do is take this data and compute a value for each keyword that tells me its profitability or value. I'd like to find out what keywords in this massive list are the best. I would place this 'profitability value' in the final field of each keyword record for someone to review. The category data (dollar amount sold in that category), keyword order (record 1 is more popular than record 2), and the data in record field 2 and 3 are the most important. The search engine information is less important (lets say 3 times less important).
Any help would be greatly appreciated. Thanks so much for your time. Have a great day.
~Antiques,6155459.31
victorian,50984,23124,313
sterling,506010,153977,327
furniture,54095,9618,3605
art deco,18234,13062,424
lamps,182309,56646,580
...
bedroom,11594,3241,709
chest of drawers,194,242,67
cottage,13730,6258,214
photo,1131344,159891,1765
pre columbian,673,246,10
~Art,33651005.64
aceo,4554,3924,40
obey,3365,1072,205
highwaymen,1875,233,136
thomas kinkade,1566,1379,320
florida,51435,11457,4226
...
boston,76336,20618,3296
mother,120329,19159,1857
vermont,7714,1723,593
poland,10018,3604,1210
cherry blossom,3410,1428,497
~Baby,1887275.00
clothes,78160,40211,2226
baby,405388,139070,4956
shoes,465883,245167,6373
winnie the pooh,17721,8714,2899
pottery barn,7237,6054,4391
...
nuk,276,126,38
cover,1884691,262594,1030
cradles,861446,30529,41
butterflies,100523,37077,5900
graco car seat,402,224,51
~Books,99187955.26
easton press,1826,907,63
stephen king,6545,3419,1282
harry potter,22159,7573,7007
nancy drew,4963,1447,757
bible,201307,11721,3081
...
electrical,14574,2904,235
blackjack,334947,9839,967
diary,66830,3756,347
judo,3198,307,356
jane eyre,2981,256,290
~Business & Industrial,203912902.60
john deere,22686,11873,4646
backhoe,979,572,180
tractor,23708,15907,669
bobcat,3999,980,1521
loader,5620,2906,33
...
buffer,2415,1075,64
boring bar,656,262,10
tig welder,763,422,96
rake,2649,625,62
flow meter,856,485,63
I'm trying to do is calculate the value or profitability of keywords based on some data that I have. I have a list of about 35 categories and in each category I have 500 keywords that belong to that category. I've provided a small sampling of the data I have (at the end of this post). The sampling has 5 categories and in each category it has the top 5 and bottom 5 from the lists of 500.
The category records have a '~' starting the row. The first field is the category name and the second field is the total dollar amount of products sold in that category in the last 2 weeks.
Under each category are the keyword records that belong to that category. Normally this list has 500 records however I’m only displaying the top and bottom five (records 1-5 and 495-500). These records are always placed in order of popularity. Meaning record one is searched for more often than record two. The first field if the record is the keyword itself. The second field is the number of products sold in the past two weeks that contain that keyword (raw number not dollar amount however I might be able to get that data). Field three is the number of products currently being offered that contain that keyword. In other words how many products are competing for that keyword at any given moment. The last field is a number representing the popularity of the keyword in the regular search engines (Google, Yahoo, etc.). It's a weight figure for how many people search for that term in a given month. I'm not sure how they come up with this number however here is a sampling for reference (sex: 1415948, harry potter: 242471, will smith: 91542, java books 378). I don't have this data but I was also thinking of including how many web sites compete for that keyword in Google to do a supply/demand comparison with field four. I can easily get that data and would like to include that in the computation.
Some thoughts I have about the data. The keywords are ordered by popularity like I said. This alone is only mildly helpful. I was thinking that the amount of product sold in a category could bring some value to that ranking. Meaning position 1 in a category that sells 1 million every 2 weeks is more valuable than position 1 in a category that sells $200,000 every 2 weeks. I also thought the search engine popularity weight number might also bring some value to that order. If the product ranks high in both areas then perhaps it is of greater overall value.
This is the data I have at the moment however I'm open to thoughts or suggestions. I might be able to get additional data if it's important however I'd like to stay with this if I can. I'm certain I have access to this data.
What I would like to do is take this data and compute a value for each keyword that tells me its profitability or value. I'd like to find out what keywords in this massive list are the best. I would place this 'profitability value' in the final field of each keyword record for someone to review. The category data (dollar amount sold in that category), keyword order (record 1 is more popular than record 2), and the data in record field 2 and 3 are the most important. The search engine information is less important (lets say 3 times less important).
Any help would be greatly appreciated. Thanks so much for your time. Have a great day.
~Antiques,6155459.31
victorian,50984,23124,313
sterling,506010,153977,327
furniture,54095,9618,3605
art deco,18234,13062,424
lamps,182309,56646,580
...
bedroom,11594,3241,709
chest of drawers,194,242,67
cottage,13730,6258,214
photo,1131344,159891,1765
pre columbian,673,246,10
~Art,33651005.64
aceo,4554,3924,40
obey,3365,1072,205
highwaymen,1875,233,136
thomas kinkade,1566,1379,320
florida,51435,11457,4226
...
boston,76336,20618,3296
mother,120329,19159,1857
vermont,7714,1723,593
poland,10018,3604,1210
cherry blossom,3410,1428,497
~Baby,1887275.00
clothes,78160,40211,2226
baby,405388,139070,4956
shoes,465883,245167,6373
winnie the pooh,17721,8714,2899
pottery barn,7237,6054,4391
...
nuk,276,126,38
cover,1884691,262594,1030
cradles,861446,30529,41
butterflies,100523,37077,5900
graco car seat,402,224,51
~Books,99187955.26
easton press,1826,907,63
stephen king,6545,3419,1282
harry potter,22159,7573,7007
nancy drew,4963,1447,757
bible,201307,11721,3081
...
electrical,14574,2904,235
blackjack,334947,9839,967
diary,66830,3756,347
judo,3198,307,356
jane eyre,2981,256,290
~Business & Industrial,203912902.60
john deere,22686,11873,4646
backhoe,979,572,180
tractor,23708,15907,669
bobcat,3999,980,1521
loader,5620,2906,33
...
buffer,2415,1075,64
boring bar,656,262,10
tig welder,763,422,96
rake,2649,625,62
flow meter,856,485,63
well, if you can get the dollar value of the items sold corresponding to a particular keyword, you can calculate the dollar value of an average sale (keyword's total dollar sales / keyword's number of sales).
then multiplying that by the # of items for sale under that keyword, you could have some kind of measure of the potential value of all the items that are currently listed with that keyword.
this would be more meaningful for certain keywords, though (e.g. antiques::victorian probably has a much larger variance in their distribution of prices, compared to say, business & industrial::rakes, so it might not be as useful for keywords which can vary widely. can you get ahold of any statistics about the distribution of prices - std deviation say? that might come in handy for weighting.)
posted by sergeant sandwich at 8:35 PM on July 3, 2008
then multiplying that by the # of items for sale under that keyword, you could have some kind of measure of the potential value of all the items that are currently listed with that keyword.
this would be more meaningful for certain keywords, though (e.g. antiques::victorian probably has a much larger variance in their distribution of prices, compared to say, business & industrial::rakes, so it might not be as useful for keywords which can vary widely. can you get ahold of any statistics about the distribution of prices - std deviation say? that might come in handy for weighting.)
posted by sergeant sandwich at 8:35 PM on July 3, 2008
Well, if you can get the dollar value, the most valuable keyword is the one with the most dollars made that month. It's not exactly clear what you want to know beyond that. The number offered for sale tells you nothing, except maybe how long it takes to sell something. (so if you have 30k sales a month, and 1k items for sale, it only takes a day to sell an item, unless the sales or items for sale number is changing rapidly, which you don't know)
posted by delmoi at 8:47 PM on July 3, 2008
posted by delmoi at 8:47 PM on July 3, 2008
Well, what may help you get started is to put a very simple example into excel. Don't make any of the numbers big, so you can kinda follow along in your head, and just play with it a bit.
~Bikes, $10,
wheels, 50,60,1
hipster, 20, 30, 2
~Farming supplies, $3
vitamins, 100, 200,3
green, 300,300,4
hipster, 20, 30, 2
So hipster would be related to Bikes' $10 and have 20 hits; wheels is related as well and has 50 hits. But...who bought the bikes? Was it the people who searched hipster or wheels? And can keywords show up in two categories, like hipster above?
Starting with some assumptions, you may simplify your problem to the point that you understand it, and then work backwards relaxing/removing the assumptions. Some I suggest starting with:
Assume all profits in each category are due to the top ranking keyword.
Then try assuming each keyword in the category gets the same amount of profit.
Then assume it is proportional to the count on the keyword.
Then try assuming a distribution, like a normal spread over the rankings (gets more complicated).
Do you have this data over time? If the number of hits on wheels and the $ total on bikes is increasing at the same rate *over time*, you may want to assume that the increase in $bikes is because of the increase in #(hits) wheels. data over time here would help a great deal. And, yes, I think you are right that you are going to want a professional to help you. Know any grad students in math or engineering or statistics? Or you can try posting on the jobs board here.
posted by Eringatang at 8:51 PM on July 3, 2008
~Bikes, $10,
wheels, 50,60,1
hipster, 20, 30, 2
~Farming supplies, $3
vitamins, 100, 200,3
green, 300,300,4
hipster, 20, 30, 2
So hipster would be related to Bikes' $10 and have 20 hits; wheels is related as well and has 50 hits. But...who bought the bikes? Was it the people who searched hipster or wheels? And can keywords show up in two categories, like hipster above?
Starting with some assumptions, you may simplify your problem to the point that you understand it, and then work backwards relaxing/removing the assumptions. Some I suggest starting with:
Assume all profits in each category are due to the top ranking keyword.
Then try assuming each keyword in the category gets the same amount of profit.
Then assume it is proportional to the count on the keyword.
Then try assuming a distribution, like a normal spread over the rankings (gets more complicated).
Do you have this data over time? If the number of hits on wheels and the $ total on bikes is increasing at the same rate *over time*, you may want to assume that the increase in $bikes is because of the increase in #(hits) wheels. data over time here would help a great deal. And, yes, I think you are right that you are going to want a professional to help you. Know any grad students in math or engineering or statistics? Or you can try posting on the jobs board here.
posted by Eringatang at 8:51 PM on July 3, 2008
Well, I see I misread a bit (sorry, its late), but still the general idea holds. Make some simple assumptions and play around. Then you will know what other data you want.
posted by Eringatang at 9:16 PM on July 3, 2008
posted by Eringatang at 9:16 PM on July 3, 2008
JohnGL,
The solution to most similar problems I have seen is usually found after you thoroughly understand both the problem, and what you are trying to determine.
What you have is data, in which you sense buried useful information. For such a specific type of problem, you are not going to find an 'equation' in the sense of the general solution to quadratics or such that just pops out the answer. What you have to do is puzzle out which of these pieces of data contain the info you want and in what proportion. You may find out that your data won't support the kind of analysis that you want to do and it may need augmentation, either with more fields of descriptive data, or with some derived or assumed characteristic. On the other hand, maybe you have all the info you need to do what you are trying to do.
A rental programmer might be able to help you, but since you code, and since you know what you are trying to answer, with all its nuances and shortcomings, I think you are better off developing the analysis skills than in communicating the problem to someone less interested in the solution. Hell, by the time you explain it well to someone, you probably understand it better than they will and the solution will present itself to you. That's happened to me a lot.
I'd agree with egingatang.... use something like excel to play with your data until you answer your own problem. Hose around with simple data to enhance your understanding of the problem and the shortcomings of the source data. Do this obsessively until you dream about it at night and you'll increase the likelihood of an "AHA!" moment where it all comes together.
This data is difficult to crunch for you now, but in a career, it'll be one of many problems that pops up and the solution of hiring an expert can get costly if it's your standard approach. Really, there isn't much math in this.. but there is a lot of room for innovation. Try a bunch of different stuff and keep what works.
Good luck.
posted by FauxScot at 2:21 AM on July 4, 2008
The solution to most similar problems I have seen is usually found after you thoroughly understand both the problem, and what you are trying to determine.
What you have is data, in which you sense buried useful information. For such a specific type of problem, you are not going to find an 'equation' in the sense of the general solution to quadratics or such that just pops out the answer. What you have to do is puzzle out which of these pieces of data contain the info you want and in what proportion. You may find out that your data won't support the kind of analysis that you want to do and it may need augmentation, either with more fields of descriptive data, or with some derived or assumed characteristic. On the other hand, maybe you have all the info you need to do what you are trying to do.
A rental programmer might be able to help you, but since you code, and since you know what you are trying to answer, with all its nuances and shortcomings, I think you are better off developing the analysis skills than in communicating the problem to someone less interested in the solution. Hell, by the time you explain it well to someone, you probably understand it better than they will and the solution will present itself to you. That's happened to me a lot.
I'd agree with egingatang.... use something like excel to play with your data until you answer your own problem. Hose around with simple data to enhance your understanding of the problem and the shortcomings of the source data. Do this obsessively until you dream about it at night and you'll increase the likelihood of an "AHA!" moment where it all comes together.
This data is difficult to crunch for you now, but in a career, it'll be one of many problems that pops up and the solution of hiring an expert can get costly if it's your standard approach. Really, there isn't much math in this.. but there is a lot of room for innovation. Try a bunch of different stuff and keep what works.
Good luck.
posted by FauxScot at 2:21 AM on July 4, 2008
This thread is closed to new comments.
Secondly, you can't determine the total dollar amount for a keyword; for all you know, 90% of the money spent in Antiques might be for a couple dozen very expensive pre-Colombian items.
So I guess I don't really know what you're expecting to compute. You say that total sales for the category is most important (higher number = better), followed by search rank within the category (lower number = better), so you might try:
keyword "value" = sum of (category sales / search rank) for all categories keyword appears in
or
keyword "value" = sum (category sales * (500 - search rank)) likewise
...either will give you a value that increases as category sales increase and as search rank improves. But I don't know how useful that's gonna be...
posted by equalpants at 8:30 PM on July 3, 2008