One of the techniques that AI researchers use to make inferences from input such as text or speech is the Hidden Markov Model (HMM) – “hidden” because the state of the process is not directly visible, so state transitions must be inferred from how they influence the observed variables. HMMs are essentially simple dynamic Bayesian networks. A Markov chain over text can be built by parsing the input into consecutive pairs, counting how often each item follows another, and then predicting the likely next item from those transition probabilities. Amit Gruber and his colleagues have taken this idea and applied it to topic searches within a search engine.
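To make the Markov-chain idea concrete, here is a minimal sketch in Python of a bigram model over text: it counts consecutive word pairs and uses the resulting transition probabilities to predict the most likely next word. The function names and the toy sentence are illustrative, not from Gruber's work.

```python
from collections import Counter, defaultdict

def build_bigram_model(text):
    """Count word-pair transitions to estimate P(next word | current word)."""
    words = text.split()
    counts = defaultdict(Counter)
    for cur, nxt in zip(words, words[1:]):
        counts[cur][nxt] += 1
    # Normalize the raw counts into transition probabilities.
    model = {}
    for cur, successors in counts.items():
        total = sum(successors.values())
        model[cur] = {w: c / total for w, c in successors.items()}
    return model

def most_likely_next(model, word):
    """Return the most probable successor of `word`, or None if unseen."""
    if word not in model:
        return None
    return max(model[word], key=model[word].get)

model = build_bigram_model("the cat sat on the mat and the cat ran")
print(most_likely_next(model, "the"))  # "cat" follows "the" twice, "mat" once
```

An HMM adds a layer on top of this: the chain runs over hidden states rather than the observed words themselves, which is the gap Gruber's topic-modeling work exploits.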
The algorithm is called Hidden Topic Markov Models (HTMM), and the approach is to model the “topics of words in the document as a Markov chain”, according to Gruber. He explains that the algorithm assumes that all words in the same sentence share the same topic, and that successive sentences are more likely to share topics. This approach produces “better perplexity in modeling documents with only a modest increase in learning and inference complexity.” Gruber is currently engaged in a research internship with Google Research, where he has produced a new implementation of the algorithm in C++.
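The two HTMM assumptions – one topic per sentence, and sticky topics across successive sentences – can be illustrated with a toy generative sketch. Everything here (the two topics, their word lists, and the switch probability `EPSILON`) is a made-up assumption for illustration, not Gruber's actual model or parameters, and real HTMM inference (EM over the hidden topic chain) is not shown.

```python
import random

random.seed(0)

# Hypothetical toy topics: each topic is a small vocabulary (assumed values).
TOPICS = {
    "sports": ["goal", "team", "score", "match"],
    "finance": ["stock", "market", "price", "trade"],
}
EPSILON = 0.2  # assumed probability that a new sentence switches topic

def generate_document(num_sentences, words_per_sentence=4):
    """Generate sentences whose topics follow a Markov chain:
    every word in a sentence shares its topic, and each new sentence
    keeps the previous topic with probability 1 - EPSILON."""
    topic = random.choice(list(TOPICS))
    doc = []
    for _ in range(num_sentences):
        if random.random() < EPSILON:  # occasional topic transition
            topic = random.choice([t for t in TOPICS if t != topic])
        sentence = [random.choice(TOPICS[topic]) for _ in range(words_per_sentence)]
        doc.append((topic, " ".join(sentence)))
    return doc

for topic, sentence in generate_document(5):
    print(f"[{topic}] {sentence}")
```

Inverting this generative story – recovering the hidden topic sequence from the words alone – is exactly the HMM-style inference problem HTMM solves.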
See an explanation of the HTMM approach and a video of Amit’s seminar here.