What Problems Do Search Engine Algorithms Have to Tackle?

I found a very clear overview of how search engines work in Will Fitzgerald’s post.

He doesn’t go into detail about the machine learning algorithms, but he gives an idea of the kinds of problems they have to tackle 🙂 Enjoy the reading!

When a search engine, like Bing’s or Google’s, says the results are algorithmically chosen, this primarily means that machine learning algorithms are used for selecting a collection of results to potentially display, and then other machine learning algorithms are used for ranking the results for display (which goes first, second, etc.). Of course, there are a lot of other things that go on: algorithms are used for spelling correction, query alterations (for example, noticing that “SF” is a common way of saying “San Francisco”), query classification and answer type selection (for example, should “365/24/7” return a direct answer of 2.17261905?), and so on. But what Google is accusing Bing of doing is “cheating” by copying their answers directly from Google, which is to say that the usual selection and ranking steps are being bypassed in favor of direct copying.
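
To make the select-then-rank idea concrete, here is a minimal sketch in Python. It is only an illustration, not how Bing or Google actually do it: the alias table, the toy corpus, and the function names (`alter_query`, `select_candidates`, `rank`, `search`) are all made up for this example, and plain term overlap stands in for the machine-learned models the post talks about.

```python
# Hypothetical alias table standing in for learned query alterations.
ALIASES = {"sf": "san francisco"}

# Toy corpus (an assumption for this sketch): document id -> text.
DOCS = {
    "d1": "san francisco weather forecast",
    "d2": "san francisco giants schedule",
    "d3": "new york weather forecast",
}


def alter_query(query):
    """Lowercase the query and expand known abbreviations, e.g. 'SF'."""
    terms = []
    for token in query.lower().split():
        terms.extend(ALIASES.get(token, token).split())
    return terms


def select_candidates(terms, docs):
    """Stage 1 (selection): keep documents sharing at least one query term."""
    term_set = set(terms)
    return [doc_id for doc_id, text in docs.items()
            if term_set & set(text.split())]


def rank(terms, candidates, docs):
    """Stage 2 (ranking): order candidates by number of matching terms.

    Term overlap is only a stand-in for a learned ranking model.
    Ties are broken by document id so the order is deterministic.
    """
    term_set = set(terms)

    def score(doc_id):
        return len(term_set & set(docs[doc_id].split()))

    return sorted(candidates, key=lambda doc_id: (-score(doc_id), doc_id))


def search(query, docs=DOCS):
    """Run query alteration, then selection, then ranking."""
    terms = alter_query(query)
    return rank(terms, select_candidates(terms, docs), docs)
```

With this toy corpus, `search("SF weather")` first rewrites the query to the terms `san`, `francisco`, `weather`, then selects all three documents, and finally ranks `d1` ahead of `d2` and `d3` because it matches the most terms.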

[…] The clickstream (requisite Wikipedia article) is the record of searches, displayed results, and selected results collected by “our customers,” as Shum said. Clickstream analysis (pioneered by Google, I say, admiringly) is an extremely powerful source of data for search result selection and ranking. It tells you, implicitly, what people think of the results presented to them. Given a ranked selection of A, B, C, they click on C and then B, but leave A alone. Given enough of these data, the machine learning models can learn to present C and B (in that order) and downrank A (if A is presented at all). And there is a lot of clickstream data, both in the direct logs to the search providers’ services and in the “opt-in” data mentioned by Shum. Obviously, Bing can’t inspect Google’s clickstream logs, but when customers allow it, Bing can use their searches made to Google to approximate this. I don’t know the details of what is collected (nor could I tell you, I suppose, if I did), but these are the data Shum is referring to.
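
The A, B, C example can be sketched in a few lines of Python. This is a deliberately naive illustration under my own assumptions: the session format (a list of shown results plus a list of clicked ones), the function names, and the use of a raw click-through rate as the score are all hypothetical. Real systems feed such signals into learned models and have to correct for things like position bias, which this sketch ignores.

```python
from collections import Counter


def click_through_rates(sessions):
    """Compute clicks / impressions for every result seen in the log.

    Each session is a (shown, clicked) pair of lists; this format is an
    assumption made for the sketch, not a real log schema.
    """
    impressions = Counter()
    clicks = Counter()
    for shown, clicked in sessions:
        impressions.update(shown)
        clicks.update(set(clicked))  # at most one click per result per session
    return {result: clicks[result] / impressions[result]
            for result in impressions}


def rerank(results, ctr, min_ctr=0.0):
    """Reorder results by click-through rate, highest first.

    Results whose rate falls below min_ctr are dropped entirely.
    sorted() is stable, so ties keep their original relative order.
    """
    kept = [r for r in results if ctr.get(r, 0.0) >= min_ctr]
    return sorted(kept, key=lambda r: -ctr.get(r, 0.0))


# Three made-up sessions: users always click C, usually B, never A.
sessions = [
    (["A", "B", "C"], ["C", "B"]),
    (["A", "B", "C"], ["C"]),
    (["A", "B", "C"], ["C", "B"]),
]
ctr = click_through_rates(sessions)

print(rerank(["A", "B", "C"], ctr))               # ['C', 'B', 'A']
print(rerank(["A", "B", "C"], ctr, min_ctr=0.1))  # ['C', 'B']
```

The first call downranks A to last place; the second, with a minimum rate, stops presenting A at all, which mirrors the two outcomes described in the quote.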