Navigational structure of large web sites seems to skew probabilities returned by Web Language Model API
Category: project oxford
Alex V Orlov on Mon, 18 Apr 2016 21:01:35
I've been playing around with the WLM API, trying to recognize meaningful two- or three-word combinations (as in, "white elephant" is a thing in its own right, while "red elephant" is not). I find that in a lot of cases the frequency data provided by the WLM API is not very useful. I think that's because the corpus treats each page as one long sentence, ignoring the page structure. Here's an example: the most likely word to follow "christianity", according to the "generateNextWords" method of the API, is "arqade". I'm pretty sure that's because the navigation menu that appears at the bottom of all Stack Exchange sites (example: http://russian.stackexchange.com/questions/12295/what-is-the-russian-for-a-coffee-sleeve) has links to all the other Stack Exchange sites, and in that list Christianity is followed immediately by Arqade. Stack Exchange has so many pages that this one repeated menu seems to completely skew the probabilities of word sequences.
I wonder if I'm missing something. Would a different way of querying the API correct the skew? I've been using "body" as "model" and "2" as "order" for word pairs.
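To illustrate the suspected effect, here is a minimal sketch of how a repeated footer can dominate next-word counts when every page is read as one flat token stream. The toy corpus, the footer text, and the function name are all made up for illustration; this is not how the WLM corpus is actually built.

```python
from collections import Counter


def next_word_counts(pages, word):
    """Count the words that directly follow `word`, reading each page as one flat token stream."""
    counts = Counter()
    for page in pages:
        tokens = page.lower().split()
        for left, right in zip(tokens, tokens[1:]):
            if left == word:
                counts[right] += 1
    return counts


# Hypothetical toy corpus: a few pages of ordinary prose...
prose_pages = [
    "christianity spread through the roman empire",
    "early christianity spread along trade routes",
    "christianity and judaism share many texts",
]

# ...plus many pages that all end with the same navigation footer.
footer = "bicycles role-playing anime christianity arqade puzzling"
footer_pages = ["some question text " + footer for _ in range(50)]

counts = next_word_counts(prose_pages + footer_pages, "christianity")
top_word, top_count = counts.most_common(1)[0]
print(top_word, top_count)  # arqade 50
```

In this toy setup, 50 copies of one footer outweigh every pair that comes from real prose, which is the same pattern as the "christianity" / "arqade" result.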
Ryan Galgon - MSFT on Mon, 25 Apr 2016 17:33:36
Alex, we’ve confirmed that the service gives “arqade” as the most likely word to follow “christianity” when using the body model. Your theory about the cause seems likely. We will try to fix this.
Most customers use the title model instead. It gives less noisy results than the body model, probably because web page titles are easier to parse correctly than the body text. We verified that “arqade” does not appear in the next word lists using the title, query, and anchor models.
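For anyone who wants to compare the models side by side, here is a rough sketch that builds one generateNextWords query per model. The base URL is a placeholder, and the "words" parameter name is an assumption; only "model", "order", and the four model names come from this thread. Check the API reference for the real endpoint, parameter names, and authentication before sending anything.

```python
from urllib.parse import urlencode

# Placeholder, NOT the real endpoint: substitute the URL from the API reference.
BASE_URL = "https://example.invalid/weblm/generateNextWords"

# Model names mentioned in this thread.
MODELS = ("body", "title", "query", "anchor")


def build_next_words_url(words, model="body", order=2):
    """Build a query URL; the 'words' parameter name is an assumption."""
    params = {"model": model, "order": order, "words": words}
    return BASE_URL + "?" + urlencode(params)


# One URL per model, all asking for the word that follows "christianity".
urls = {model: build_next_words_url("christianity", model=model) for model in MODELS}
for model, url in urls.items():
    print(model, url)
```

Running the same word through each model makes it easy to see whether an odd result such as "arqade" is specific to the body model.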
Please let us know if you find other problems with the service.