<div class="notebook">

<div class="nb-cell html">
<script type="text/javascript">
$.getScript("https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js"+
             "?config=TeX-MML-AM_CHTML",function() {MathJax.Hub.Queue(["Typeset",MathJax.Hub]);});
</script>
<h2>Latent Dirichlet Allocation</h2>
<p>From <a href="https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation</a>:</p>
<blockquote cite="https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation" style="font-size: inherit;"><p>
In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics. LDA is an example of a topic model...
</p>
<p>The generative process is as follows.  Documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. LDA assumes the following generative process for a corpus \(D\) consisting of \(M\) documents each of length \(N_i\):
</p><ol>
<li> Choose \( \theta_i \, \sim \, \mathrm{Dir}(\alpha) \), where \( i \in \{ 1,\dots,M \} \) and
\( \mathrm{Dir}(\alpha) \) is the <a href="https://en.wikipedia.org/wiki/Dirichlet_distribution" target="_blank">Dirichlet distribution</a> for parameter \(\alpha\)
</li>
<li>
Choose \( \varphi_k \, \sim \, \mathrm{Dir}(\beta) \), where \( k \in \{ 1,\dots,K \} \)
</li>
<li>
For each of the word positions \(i, j\), where \( j \in \{ 1,\dots,N_i \} \), and  \( i \in \{ 1,\dots,M \} \)
<ol>
<li> Choose a topic \(z_{i,j} \,\sim\, \mathrm{Categorical}(\theta_i). \)
</li>
<li>Choose a word \(w_{i,j} \,\sim\, \mathrm{Categorical}( \varphi_{z_{i,j}}) \).
</li>
</ol>
</li>
</ol>
</blockquote>
<p>This is, to be precise, the smoothed LDA model. The subscript of the document lengths \(N_i\) is often dropped, as in the plate diagram shown below.
</p>
<center>
<a href="https://en.wikipedia.org/wiki/File:Smoothed_LDA.png">
<img src="https://upload.wikimedia.org/wikipedia/commons/4/4d/Smoothed_LDA.png" width="300"></a>
</center>
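<p>Putting the three steps together, the joint distribution of the topic mixtures \(\theta\), the topic word distributions \(\varphi\), the topic assignments \(z\) and the words \(w\) factorizes as
\[
P(w,z,\theta,\varphi\mid\alpha,\beta)=\prod_{k=1}^{K}P(\varphi_k\mid\beta)\,\prod_{i=1}^{M}P(\theta_i\mid\alpha)\,\prod_{j=1}^{N_i}P(z_{i,j}\mid\theta_i)\,P(w_{i,j}\mid\varphi_{z_{i,j}}).
\]
</p>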
<p>The aim is to  compute the word probabilities of each topic, the topic of
each word, and the particular topic mixture of each document. This can be done
with Bayesian inference: the documents in the dataset represent the
observations (evidence) and we want to compute the posterior distribution of
the above quantities.
</p>
<p>Suppose we have two topics, indicated with integers 1 and 2, and 10 words, indicated with integers \(1,\ldots,10\).
</p>
<p>Predicate <code>word(Doc,Position,Word)</code> states that document
<code>Doc</code> has word <code>Word</code> at position <code>Position</code>
(positions range from 1 to the number of words in the document).</p>
<p>Predicate <code>topic(Doc,Position,Topic)</code> states that, in document
<code>Doc</code>, the word at position <code>Position</code> is assigned topic <code>Topic</code>.</p>
<p>We can use the model generatively and sample values for the word in position 1 of document 1:</p>
</div>

<div class="nb-cell query">
mc_sample_arg(word(1,1,W),100,W,G),argbar(G,C).
</div>
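
<div class="nb-cell html">
<p>We can also estimate the prior probability that position 1 of document 1 holds a specific word, say word 1, by plain Monte Carlo sampling. This is a minimal sketch using <code>mc_sample/3</code>; since the prior over the 10 words is symmetric, the estimate should be close to 0.1.</p>
</div>

<div class="nb-cell query">
mc_sample(word(1,1,1),1000,P).
</div>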

<div class="nb-cell html">
<p>Also generatively, we can sample values for the pair (word, topic) in position 1 of document 1:</p>
</div>

<div class="nb-cell query">
mc_sample_arg((word(1,1,W),topic(1,1,T)),100,(W,T),G),argbar(G,C).
</div>

<div class="nb-cell html">
<p>We can use the model to classify words into topics, that is, to assign a topic to each word. Here we use conditional inference with Metropolis-Hastings.
</p>
<p>
A priori, both topics are approximately equally probable for the word in position 1 of document 1.
</p>
</div>

<div class="nb-cell query">
mc_sample_arg(topic(1,1,T),100,T,G),argbar(G,C).
</div>
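
<div class="nb-cell html">
<p>The prior probability that this word is assigned topic 1 can also be estimated directly. A minimal sketch using <code>mc_sample/3</code>; by symmetry of the two topics, the estimate should be close to 0.5.</p>
</div>

<div class="nb-cell query">
mc_sample(topic(1,1,1),1000,P).
</div>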

<div class="nb-cell html">
<p>After observing that the words in positions 1 and 2 of document 1 are both word 1, one of the
topics becomes more probable.
</p>
</div>

<div class="nb-cell query">
mc_mh_sample_arg(topic(1,1,T),(word(1,1,1),word(1,2,1)),100,T,G,[lag(2)]),argbar(G,C).
</div>
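
<div class="nb-cell html">
<p>The same Metropolis-Hastings machinery can also return the estimated conditional probability of a single topic directly. This is a sketch assuming <code>mc_mh_sample/5</code> with an options list is available, analogous to <code>mc_mh_sample_arg/6</code> above:</p>
</div>

<div class="nb-cell query">
mc_mh_sample(topic(1,1,1),(word(1,1,1),word(1,2,1)),1000,P,[lag(2)]).
</div>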

<div class="nb-cell html">
<p>You can also see this by looking at the distribution of the probability
of topic 1 before and after observing that the words in positions 1 and 2 of document 1 are equal:
the observation makes the distribution less uniform.
</p>
</div>

<div class="nb-cell query">
prob_topic_1(G).
</div>

<div class="nb-cell html">
<script type="text/javascript">
$.getScript("https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML",function() {MathJax.Hub.Queue(["Typeset",MathJax.Hub]);});
</script>
<h3>Code</h3>
<p>You can change the number of words in the vocabulary and the number
of topics by modifying the facts for <code>n_words/1</code> and
<code>n_topics/1</code>.
</p>
<p>The distributions for both \(\theta_i\) and \(\varphi_k\) are
symmetric Dirichlet distributions with scalar concentration parameter \(\eta\),
set by a fact for the predicate <code>eta/1</code>. In other words,
\(\alpha=\beta=[\eta,\ldots,\eta]\).
</p>
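<p>For example, a hypothetical configuration with 3 topics, a 20-word vocabulary and \(\eta=1\) would replace those facts with <code>n_topics(3).</code>, <code>n_words(20).</code> and <code>eta(1).</code></p>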
</div>

<div class="nb-cell program" data-background="true">
:- use_module(library(mcintyre)).

:- if(current_predicate(use_rendering/1)).
:- use_rendering(c3).
:- endif.

:- mc.

:- begin_lpad.

% theta(DocumentID,Theta): the topic mixture Theta of a document is sampled
% from a Dirichlet distribution with parameter vector Alpha
theta(_,Theta):dirichlet(Theta,Alpha):-
  alpha(Alpha).

% topic(DocumentID,Position,Topic): the topic assigned to the word in a given
% position of document DocumentID is drawn from a discrete distribution over
% the topics with probabilities Theta
topic(DocumentID,_,Topic):discrete(Topic,Dist):-
  theta(DocumentID,Theta),
  topic_list(Topics),
  maplist(pair,Topics,Theta,Dist).

% word(DocumentID,Position,Word): the word in position Position is drawn from
% a discrete distribution over the vocabulary with probabilities Beta, the
% word distribution of the topic assigned to that position
word(DocumentID,Position,Word):discrete(Word,Dist):-
  topic(DocumentID,Position,Topic),
  beta_par(Topic,Beta),
  word_list(Words),
  maplist(pair,Words,Beta,Dist).

% beta_par(Topic,Beta): the word distribution Beta of a topic is sampled from
% a symmetric Dirichlet distribution with all parameters equal to Eta
beta_par(_,Beta):dirichlet(Beta,Parameters):-
  n_words(N),
  eta(Eta),
  findall(Eta,between(1,N,_),Parameters).

% alpha(Alpha): symmetric Dirichlet parameter vector [Eta,...,Eta] with length
% equal to the number of topics
alpha(Alpha):-
  eta(Eta),
  n_topics(N),
  findall(Eta,between(1,N,_),Alpha).

% concentration parameter of the symmetric Dirichlet priors
eta(2).

% pair(V,P,V:P): builds the Value:Prob couples required by discrete/2
pair(V,P,V:P).

% list of topic identifiers 1..n_topics
topic_list(L):-
  n_topics(N),
  numlist(1,N,L).

% list of word identifiers 1..n_words
word_list(L):-
  n_words(N),
  numlist(1,N,L).

% number of topics
n_topics(2).

% vocabulary size
n_words(10).

:-end_lpad.

% prob_topic_1(G): compares the prior and posterior density of the probability
% of topic 1 in document 1 (the first component of theta(1,_)), conditioning
% on the words in positions 1 and 2 of document 1 both being word 1
prob_topic_1(G):-
  mc_sample_arg(theta(1,[T0|_]),400,T0,L0),
  mc_mh_sample_arg(theta(1,[T1|_]),(word(1,1,1),word(1,2,1)),400,
    T1,L1,[lag(2)]),
  densities(L0,L1,G,[nbins(30)]).
</div>

<div class="nb-cell html">
<h3>References</h3>
<p>[1]
Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet Allocation." Journal of Machine Learning Research 3 (2003): 993-1022.
</p>
<p>
[2]
<a href="https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation</a>
</p>
</div>

</div>