Speeding Up MySQL LIKE "%%" Query Using FULLTEXT Index.
So I often find myself in a spot where I need to run a query with a where column like "%search here%". But in MySQL these queries are slow, and you can't use a traditional index to speed them up. In this writeup I walk through a real-life example of how I added a FULLTEXT index to speed up a query and still get the same results as the like "%search here%" query would return.
So on Codegrepper users will search code snippets based on the content of the code snippet. The query looks like this:
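The original query is not reproduced here, so the following is only a minimal sketch of its likely shape. The table name `answers` and the function name are assumptions for illustration; only the `code_answer` column comes from this writeup.

```php
<?php
// Sketch only: `answers` is a hypothetical table name; `code_answer` is the column discussed in the article.
function buildLikeSearch(string $search): array
{
    // Escape LIKE wildcards so the user's input is matched literally.
    $escaped = addcslashes($search, '%_\\');
    $sql = 'SELECT * FROM answers WHERE code_answer LIKE ?';

    // Returns the SQL plus the values to bind to its placeholders.
    return [$sql, ['%' . $escaped . '%']];
}
```

The returned pair could be passed to a prepared statement, for example with PDO's `prepare()` and `execute()`.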
So it happens that the `code_answer` column is a VARCHAR(5000), which is too big to add a traditional index to (Specified key was too long; max key length is 3072 bytes). But even if I could add an index to that `code_answer` field, I happen to know that a traditional index does not work with like '%%'. It would work if I was just doing a like 'my search%', but I'm not doing that... and no one ever really is...
So the above query is using no indexes and is taking a couple seconds. This is totally unacceptable for me. Hrmmm, this has to be a common spot right? I google around and can't find any really simple or elegant fixes so we got to "first principles" this bitch.
I find some people saying "FULLTEXT" indexing is the way to go, but after trying some basic queries using FULLTEXT I realize it's not behaving like the like '%search%' does, at least not out of the box.
But I'm hopeful that with some tweaking I can get it to work how I want. So first things first: to get full text working, I add a FULLTEXT index to the column.
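Adding the index would look roughly like the statement below. This is a sketch: the table name `answers` and the index name are hypothetical, and `$pdo` in the comment stands for an assumed PDO connection.

```php
<?php
// Sketch only: `answers` and `code_answer_fulltext` are hypothetical names.
$addIndexSql = 'ALTER TABLE answers ADD FULLTEXT INDEX code_answer_fulltext (code_answer)';

// With an open PDO connection in $pdo (assumed), this would be run once:
// $pdo->exec($addIndexSql);
```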
Then I change the query to:
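The rewritten query is not reproduced here. Judging by the later description (the search term wrapped in double quotes), it presumably looked something like this sketch, where `answers` and the function name are again assumptions.

```php
<?php
// Sketch only: builds a quoted-phrase FULLTEXT search in boolean mode.
function buildPhraseSearch(string $search): array
{
    // Double quotes delimit the phrase, so any in the input are replaced with spaces.
    $phrase = '"' . str_replace('"', ' ', $search) . '"';
    $sql = 'SELECT * FROM answers WHERE MATCH(code_answer) AGAINST (? IN BOOLEAN MODE)';

    return [$sql, [$phrase]];
}
```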
Now my query takes about 7ms, that's much better. But after trying a few queries I quickly notice I'm getting different results than when I run like "%the car crashed hard%".
Getting FULLTEXT to work the same as like '%term%' is going to be some work, so make sure you really need it and can't just use the above query.
Alright, so first things first: how does the above match syntax work, and how are the results different? Here is my understanding of how the FULLTEXT index works. Say we have a row with this text: "I was driving poorly and the car crashed hard". The fulltext index will split the text into words, and each word gets its own index entry that can be searched on.
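As a rough mental model (not MySQL's actual parser), the splitting can be pictured like the sketch below. The function name is made up, and it ignores stopwords and minimum word length, which come up later.

```php
<?php
// Simplified illustration of word splitting; MySQL's real tokenizer is more involved.
function roughTokenize(string $text): array
{
    // Split on anything that is not a letter, digit, or underscore.
    return preg_split('/[^A-Za-z0-9_]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}
```

For the example row this gives the words "i", "was", "driving", "poorly", "and", "the", "car", "crashed", "hard", each of which can be looked up on its own.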
So in that last query we actually used what the team at MySQL calls a "phrase match" by surrounding the term with quotes "". But in my testing it did not really behave like a strict phrase match; it behaved more like this query:
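The comparison query is not reproduced here. One plausible reading is a LIKE per word joined with AND, sketched below; the table name `answers` and the function name are assumptions, and the sketch expects a non-empty search.

```php
<?php
// Sketch only: one LIKE condition per word, all of which must match.
function buildAllWordsLike(string $search): array
{
    $words = preg_split('/\s+/', trim($search), -1, PREG_SPLIT_NO_EMPTY);
    $clauses = [];
    $params = [];
    foreach ($words as $word) {
        $clauses[] = 'code_answer LIKE ?';
        $params[] = '%' . addcslashes($word, '%_\\') . '%';
    }

    // Assumes at least one word; an empty search would need separate handling.
    return ['SELECT * FROM answers WHERE ' . implode(' AND ', $clauses), $params];
}
```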
Another way to write that using fulltext matching is:
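For the example search, that fulltext form would look roughly like this (table name `answers` is hypothetical):

```php
<?php
// Sketch only: every word is prefixed with + in boolean mode.
$allWordsSql = <<<'SQL'
SELECT * FROM answers
WHERE MATCH(code_answer) AGAINST ('+the +car +crashed +hard' IN BOOLEAN MODE)
SQL;
```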
The + prefix on a word basically means AND, so all those words need to be present in the row. There are still a few problems though:
First of all, what if the user searches something like "the car crashed h"? They missed the "ard". Our like "%%" query would return our row, but the full text query will not, because it's matching everything at the word level and h != hard.
That is when the asterisk * comes to the rescue. It turns out we can use * as a "wildcard like thing", similar to how we would use % in a like. Unfortunately we can only use it at the end of a word. So now our query turns into:
Ok, we are getting closer, but we still have a couple more problems. If the user searches "hard car crashed", it will still match, but we really don't want it to because we want to match the exact phrase. The easiest way around this I found is to just run the original like condition too. In most cases the fulltext condition will filter out most rows, so our original like won't have to look at that many rows. So our new query is:
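For the search "the car crashed h", the match would then look something like this sketch, with the wildcard applied to the last word (my reading of the example; `answers` is a hypothetical table name):

```php
<?php
// Sketch only: the trailing * lets "h" match any indexed word starting with "h".
$wildcardSql = <<<'SQL'
SELECT * FROM answers
WHERE MATCH(code_answer) AGAINST ('+the +car +crashed +h*' IN BOOLEAN MODE)
SQL;
```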
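A sketch of the combined query, assuming the hypothetical table name `answers`: the MATCH narrows the candidate rows using the index, and the LIKE then enforces the exact phrase on what is left.

```php
<?php
// Sketch only: FULLTEXT condition first, exact-phrase LIKE as the final filter.
$combinedSql = <<<'SQL'
SELECT * FROM answers
WHERE MATCH(code_answer) AGAINST ('+the +car +crashed +hard*' IN BOOLEAN MODE)
  AND code_answer LIKE '%the car crashed hard%'
SQL;
```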
Ok cool, so one thing I have not shown is how to convert the search term to the formatted match. We could do this in SQL, but I prefer to do it in whatever backend language we are using. Because I'm smart and cool my language of preference is PHP.
I can then compose the query in PHP with something like:
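The original PHP is not reproduced here, so this is a minimal sketch of the conversion and the composed query. Function names and the `answers` table are assumptions, and special characters in the input are not handled yet (that comes up under the caveats).

```php
<?php
// Sketch only: turns "the car crashed h" into "+the +car +crashed +h*".
function toMatchString(string $search): string
{
    $words = preg_split('/\s+/', trim($search), -1, PREG_SPLIT_NO_EMPTY);
    if ($words === []) {
        return '';
    }
    $terms = array_map(fn (string $w): string => '+' . $w, $words);
    // Only the last word gets the trailing wildcard.
    $terms[count($terms) - 1] .= '*';

    return implode(' ', $terms);
}

// Composes the FULLTEXT + LIKE query with bound parameters.
function buildFulltextSearch(string $search): array
{
    $sql = 'SELECT * FROM answers'
         . ' WHERE MATCH(code_answer) AGAINST (? IN BOOLEAN MODE)'
         . ' AND code_answer LIKE ?';

    return [$sql, [toMatchString($search), '%' . addcslashes($search, '%_\\') . '%']];
}
```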
Ok, hopefully this is all making sense, because now we are getting pretty close to having a query that works the same way like "%the car crashed hard%" would, but is much faster. We are still missing one obvious thing, but before I get there I want to talk about a few caveats of FULLTEXT matching that will impact our results. We will have to deal with these caveats in some way.
Caveat 1 - Stopwords (To Ignore Them or Handle)
First off, just to mention it, FULLTEXT search has some default "stopwords", and they differ depending on whether you use MyISAM or InnoDB (I'm using InnoDB). Basically, special words like "the", "and", "not", "in" will simply not be indexed. So if you decide to keep stopwords enabled, you will need to leave them out when forming your match string; otherwise, disable stopwords. I disabled stopwords prior to creating the FULLTEXT index with SET @@SESSION.innodb_ft_enable_stopword = 'OFF'; If you did not do this before creating your FULLTEXT index, or are using MyISAM, you will need to disable stopwords on the FULLTEXT index (or handle them in a different manner). Here is a good resource for disabling stopwords in fulltext search: https://stackoverflow.com/questions/12678920/ignoring-mysql-fulltext-stopwords-in-query
Caveat 2 - Minimum Length of Indexed words
Ok so this is actually kinda important. FULLTEXT will only index words of a certain length or more. You can see what your minimum is with:
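For InnoDB the relevant setting is `innodb_ft_min_token_size` (MyISAM uses `ft_min_word_len` instead). A sketch of the lookup, with `$pdo` standing in for an assumed PDO connection:

```php
<?php
// InnoDB setting; for MyISAM the variable to check is ft_min_word_len.
$minTokenSql = "SHOW VARIABLES LIKE 'innodb_ft_min_token_size'";

// With an open PDO connection in $pdo (assumed):
// $row = $pdo->query($minTokenSql)->fetch();
```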
Mine was set to 3, and I'm ok with that. That means FULLTEXT will index any word that is 3 or more characters long; if a word is 2 characters or less, it won't be indexed.
Decreasing the minimum word length makes the index bigger and will likely hurt performance; increasing it should help. I would bet that using 4 would actually be better, but I don't want to fuss with changing the default. If you do, here is how: https://dev.mysql.com/doc/refman/5.7/en/fulltext-fine-tuning.html . If you do try different values, let me know what works best.
So now we need to remove any words less than 3 characters from our matchString. In PHP I will do that like so:
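The original snippet is not reproduced here; a minimal sketch of the filtering, with a made-up function name, could be:

```php
<?php
// Sketch only: drops words shorter than the minimum indexed length (3 in my setup).
function dropShortWords(array $words, int $minLength = 3): array
{
    // strlen counts bytes, which is fine for plain ASCII words.
    return array_values(array_filter($words, fn (string $w): bool => strlen($w) >= $minLength));
}
```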
Caveat 3 - Handling non alphanumeric characters.
So what about all the non alphanumeric characters like !)%#$!@#%,.][ ?
Well, after some testing it seems that the FULLTEXT index basically treats them like they are a space (i.e. a word delimiter). So if your column contains "the car crashed hard", it would be treated the same as "the car(crashed)hard" or "the car crashed-hard" or "the#car crashed hard". You get the point, right...
So we need to account for special characters when creating our matching string. We basically want to treat all non-alphanumeric characters as if they are a space character. I do that in PHP like so:
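A minimal sketch of that replacement follows; the function name is made up. One deliberate deviation, flagged as my assumption: underscores are kept, because as far as I know MySQL's built-in parser counts them as word characters.

```php
<?php
// Sketch only: collapses runs of non-word characters into single spaces.
function normalizeSearch(string $search): string
{
    // Underscore is kept on the assumption that MySQL treats it as part of a word.
    $spaced = preg_replace('/[^A-Za-z0-9_]+/', ' ', $search);

    return trim($spaced);
}
```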
Putting it together our (almost) final code will look like this:
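The original code is not reproduced here, so what follows is a sketch that combines the steps described so far. The function name and the `answers` table are assumptions; stopwords are assumed to be disabled, as in my setup.

```php
<?php
// Sketch only: FULLTEXT pre-filter plus exact-phrase LIKE, with the caveats applied.
function buildSearchQuery(string $search, int $minLength = 3): array
{
    $likeParam = '%' . addcslashes($search, '%_\\') . '%';

    // Non-word characters act as delimiters in the index, so split on them.
    $words = preg_split('/[^A-Za-z0-9_]+/', $search, -1, PREG_SPLIT_NO_EMPTY);
    // Words shorter than the minimum token size are never indexed.
    $words = array_values(array_filter($words, fn (string $w): bool => strlen($w) >= $minLength));

    if ($words === []) {
        // Nothing indexable: fall back to the plain LIKE query.
        return ['SELECT * FROM answers WHERE code_answer LIKE ?', [$likeParam]];
    }

    // Require every word, and allow the last one to be incomplete.
    $terms = array_map(fn (string $w): string => '+' . $w, $words);
    $terms[count($terms) - 1] .= '*';

    $sql = 'SELECT * FROM answers'
         . ' WHERE MATCH(code_answer) AGAINST (? IN BOOLEAN MODE)'
         . ' AND code_answer LIKE ?';

    return [$sql, [implode(' ', $terms), $likeParam]];
}
```

The LIKE always runs against the untouched search string, so the final result set is never looser than the original like '%search%' query.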
Ok, this will get you almost there, and a quick test shows me this query goes from about 2.5 seconds (using like) to 5ms (using our new FULLTEXT index + like query).
Note: Performance improvements vary based on the query. For example queries with words all less than 3 characters won't get any improvement because no words are indexed so we just fall back to like '%search%'.
Solving the Final Edge Case (Don't do it...)
So, there is one last difference in this query from the like "%search term%" style query that I would encourage you not to solve for, but if you must, then I may have an idea. Can you spot the difference?
There is this one spot where if the user searches "rashed hard" , (they miss the first part of the word in the query) we will miss the row that contains "crashed hard" while like "%rashed hard%" would return the row.
Note: I present a solution below, but I did not actually implement this solution in production. I think this problem is simply too much of an edge case to solve for, especially given the complexity of the solution.
This is basically the opposite of when they missed the end of a word and we were able to solve with the * operator. So what if we just flipped everything?
I think this would work, so I'll give it a try. First we need to create a whole new column that will hold the code_answer value, except backwards.
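Setting that up might look like the statements below. This is an untested sketch: the table name `answers`, the column name `code_answer_reversed`, and the index name are all hypothetical, and new or updated rows would also need the reversed column kept in sync somehow.

```php
<?php
// Sketch only: add a reversed copy of the column, fill it, and index it.
$reverseColumnSql = [
    'ALTER TABLE answers ADD COLUMN code_answer_reversed VARCHAR(5000)',
    'UPDATE answers SET code_answer_reversed = REVERSE(code_answer)',
    'ALTER TABLE answers ADD FULLTEXT INDEX code_answer_reversed_fulltext (code_answer_reversed)',
];

// With an open PDO connection in $pdo (assumed):
// foreach ($reverseColumnSql as $statement) { $pdo->exec($statement); }
```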
So now we will basically want to run two MATCH conditions: one against the normal column and one against the reversed column. So our final query for the search term "rashed hard" will look like this:
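For the search "rashed hard", a sketch of such a query could look like this. The first word is reversed ("rashed" becomes "dehsar") and matched with a trailing wildcard against the hypothetical `code_answer_reversed` column; `answers` is also a hypothetical name.

```php
<?php
// Sketch only: one MATCH per column, plus the exact-phrase LIKE as the final filter.
$twoSidedSql = <<<'SQL'
SELECT * FROM answers
WHERE MATCH(code_answer) AGAINST ('+hard*' IN BOOLEAN MODE)
  AND MATCH(code_answer_reversed) AGAINST ('+dehsar*' IN BOOLEAN MODE)
  AND code_answer LIKE '%rashed hard%'
SQL;
```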
So here is the final PHP logic that would build this query:
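The original logic is not reproduced here, and I never ran this in production, so treat the following as a sketch under assumptions: the `answers` table and `code_answer_reversed` column are hypothetical, and a single-word search is checked against both columns because it may be missing either end.

```php
<?php
// Sketch only: builds the two-column FULLTEXT query plus the exact-phrase LIKE.
function buildTwoSidedSearch(string $search, int $minLength = 3): array
{
    $match = 'MATCH(code_answer) AGAINST (? IN BOOLEAN MODE)';
    $matchReversed = 'MATCH(code_answer_reversed) AGAINST (? IN BOOLEAN MODE)';

    $words = preg_split('/[^A-Za-z0-9_]+/', $search, -1, PREG_SPLIT_NO_EMPTY);
    $words = array_values(array_filter($words, fn (string $w): bool => strlen($w) >= $minLength));

    $conditions = [];
    $params = [];

    if (count($words) === 1) {
        // A lone word may be cut off at either end, so accept a hit in either index.
        $conditions[] = "($match OR $matchReversed)";
        $params[] = '+' . $words[0] . '*';
        $params[] = '+' . strrev($words[0]) . '*';
    } elseif (count($words) > 1) {
        // The first word may be missing its start: match it reversed.
        $first = array_shift($words);
        $conditions[] = $matchReversed;
        $params[] = '+' . strrev($first) . '*';

        // Remaining words are required; the last may be missing its end.
        $terms = array_map(fn (string $w): string => '+' . $w, $words);
        $terms[count($terms) - 1] .= '*';
        $conditions[] = $match;
        $params[] = implode(' ', $terms);
    }

    // The LIKE always runs last to enforce the exact phrase.
    $conditions[] = 'code_answer LIKE ?';
    $params[] = '%' . addcslashes($search, '%_\\') . '%';

    return ['SELECT * FROM answers WHERE ' . implode(' AND ', $conditions), $params];
}
```

A single word that is cut off at both ends (say "rashe") would still be missed by both indexes, which is one more reason I consider this edge case not worth solving.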
Holy cow, that is a lot of work just to speed up the like "%%" query, but my final result is good: I'm getting around a 100x to 200x speed improvement on most queries. For a heavy traffic website I think this is really necessary.
I hope this helps some other people and let me know if you see any issues or improvements to be had. Hit me up @taylorhawkes on twitter or email me email@example.com