mysql hope
August 14, 2006 12:40 PM   Subscribe

MySQL refuses to index my table properly, because the data I'm looking for is only three letters long.

I wrote a little database postcode app, and wondered for ages why it wasn't working properly. Then I realised it would return results perfectly well when I searched for a four-character postcode, like BA16 -- it was just ignoring the three-character ones, like BA5. Is there any way to turn off (or work around) this idiotic MySQL behaviour?

If it matters, I'm using fulltext indexes with this SQL:
SELECT * FROM 'pc' WHERE MATCH (town,county,postcodes,area,county) AGAINST ('BA5').
posted by reklaw to Computers & Internet (16 answers total) 1 user marked this as a favorite
I will tell you how to fix this for my standard consulting fee, or you can look it up in the MySQL docs.

MySQL 5.1 Reference Manual :: 12.7 Full-Text Search Functions:
Some words are ignored in full-text searches:

* Any word that is too short is ignored. The default minimum length of words that are found by full-text searches is four characters.
* Words in the stopword list are ignored. A stopword is a word such as “the” or “some” that is so common that it is considered to have zero semantic value. There is a built-in stopword list, but it can be overwritten by a user-defined list.

The default stopword list is given in Section 12.7.3, “Full-Text Stopwords”. The default minimum word length and stopword list can be changed as described in Section 12.7.5, “Fine-Tuning MySQL Full-Text Search”.
MySQL 5.1 Reference Manual :: 12.7.5 Fine-Tuning MySQL Full-Text Search:
The minimum and maximum lengths of words to be indexed are defined by the ft_min_word_len and ft_max_word_len system variables. (See Section 5.2.2, “Server System Variables”.) The default minimum value is four characters; the default maximum is version dependent. If you change either value, you must rebuild your FULLTEXT indexes. For example, if you want three-character words to be searchable, you can set the ft_min_word_len variable by putting the following lines in an option file:


(The MySQL documention is your friend.)
posted by orthogonality at 12:50 PM on August 14, 2006

What are the results when you EXPLAIN that query, such that you're certain it's not using your index?

MySQL has a built-in optimizer that will examine the structure of a query and attempt to estimate the time saved by using an index. If it determines that it's better off ignoring the index and just going at it, it'll do just that.

How many different postcodes do you have, or what other information is contained in the field that you're using a FULLTEXT index for (which is somewhat unreliable with single strings)?
posted by disillusioned at 12:51 PM on August 14, 2006

orthogonality: How would you go about changing that on a shared web hosting server? Would you ever be able to?
posted by reklaw at 12:56 PM on August 14, 2006

This doesn't seem like the sort of incidence that you'd want to use FULLTEXT. Just index the first 4 chars of postcode and use LIKE (or a LEFT() and =). It'll still use the indexes (although you can check this). The amount of postcodes in the United Kingdom is trivial. Do you need to search county, town, etc, for postcodes?
posted by wackybrit at 1:02 PM on August 14, 2006

reklaw writes "orthogonality: How would you go about changing that on a shared web hosting server? Would you ever be able to?"

If you've got your own instance of the mysql server running on the shared host, you're good to go. If not, you're out of luck, because other users of the database won't want to make the min length 3.

But the real answer is the one wackybrit gave you; there's unlikely to be a good reason to be using fulltext to look for postal codes.
posted by orthogonality at 1:06 PM on August 14, 2006

The reason I'm using fulltext is that I've got someone else's silly dirty database with comma-separated postcodes in each column, like this:

column one: name of place
column two: postcode, postcode, postcode, postcode

Daft, I know.

I think I do have my own instance of the mysql server running, though, although I've no idea how to edit the config. Do I have to mess around with a unix shell or something?
posted by reklaw at 1:13 PM on August 14, 2006

reklaw writes "I think I do have my own instance of the mysql server running, though, although I've no idea how to edit the config. Do I have to mess around with a unix shell or something?"

This way will take you to hell. You'd have to change the value, and then re-index. And make sure that anyone who indexes in the future with myisamchk passes the same min length parameter -- if they donm't, the query breaks. MySQL Full-Text Search is half-baked, and its re-indexing is pretty bad too. (It's often faster to drop indices than to repair them, as a project I'm working on just completed for disillusioned demonstrates.)

Why not just fix the database? In the long run, this will be easier and more maintainable.
posted by orthogonality at 1:25 PM on August 14, 2006

Note for future Google searchers etc.: I couldn't be bothered to fix this properly, so I just did a find-and-replace to make all my commas into "ZZZ" -- MySQL indexes "BA5ZZZ", and it's trivial to strip all occurances of ZZZ from the output. OMG HAX. Don't worry, this script is just for my own use anyway...
posted by reklaw at 3:09 PM on August 14, 2006

Oh dear god in heaven.
posted by orthogonality at 4:24 PM on August 14, 2006

Yeah, I'm right there with you ortho. Oh dear god.
posted by delfuego at 4:58 PM on August 14, 2006

Now watch that kluge turn up in Microsoft Postcodes :-)
posted by flabdablet at 6:54 PM on August 14, 2006

ortho + wackybrit, good advice going unheeded.
posted by ijoshua at 7:05 PM on August 14, 2006

Orthogonality: in the interests of learning, what would be the right way to do this?

1) If the data is just simply (locality, csv_postcodes), create the table in the form of (unique_postcode, locality).

2) If it's any more complex e.g. actual address data, use linked tables to a (unique_postcode, locality) table and set indexes based on the most common query type.

3) Something else?
posted by Pinback at 7:15 PM on August 14, 2006

Pinback writes "in the interests of learning, what would be the right way to do this?"

Yeah, you pretty much have it. The "rule" is, every entity has a single row in some table.

An entity is any thing you want to represent in the database, possibly including relationships among other things, or even consisting of metadata. That's the essence of database normalization. If you don't normalize, you have duplicated entities, which means you can simultaneously represent one real thing in two different, contradictory states.
posted by orthogonality at 10:47 AM on August 15, 2006

Yeah, totally. The moral of the story is, whoever made the database originally doesn't know shit about databases. Comma-separated (or pipe-separated, or whatever) fields break the ENTIRE POINT of them.

I guess the truly-right thing to do would be to write a script that would take each postcode and make it its own row -- but that would, of course, require both writing the script to begin with, and then running it over the entire db. Still, I feel a little guilty about my kludge, even though I'm the only one to have to suffer with it.
posted by reklaw at 12:15 PM on August 15, 2006

Thanks, ortho - I knew reklaw's "solution" was horribly, horribly wrong, and was just curious to know if my thinking was on the right track (and to elicit a more generically useful explanation for future AskMeologists).

Thankfully, reklaw knows his solution is pretty nasty. But, as a non-programmer who's just good enough to create ugly 1000-line hacks in PHP or JavaScript, it's the sort of sideways solution to a problem that would cross my mind for ... oh, maybe 5 seconds? ... before I disgusted even myself ;-)

Though, I admit, I might use something like it if necessary as part of a quick-and-dirty to initially populate a new table...
posted by Pinback at 11:53 PM on August 15, 2006

« Older What's the cheapest cell phone plan?   |   Which bike tires do I need? Newer »
This thread is closed to new comments.