[gensim:11723] summarizer.py uses an undirected graph instead of a directed one

Discussion:

Nicolas Tastevin

2018-10-29 15:31:26 UTC

The summarizer.py script implements TextRank algorithm to extract most
important sentences from a document.

The core concept of the algorithm relies on PageRank algorithm. The
sumarizer.py script creates an adjacency graph to fill in input of the
PageRank algorithm. The script uses a BM25 score to set weights on edges of
the graph. This score is not symmetric [ BM25(a,b) can be different from
BM25(b,a) ]. Thus the logic of the code is coded in order to preserve this
dissymmetry.

However, the non symmetric BM25 scores matrix generates a symmetric
adjacency matrix after applying the _set_graph_edge_weights function and we
loose the dissymmetry of BM25. This is caused by the use of the
gensim.summarization.graph module. Indeed as documented in the module: *this
module contains abstract class IGraph represents graphs interface and class
Graph (based on IGraph) which implements **undirected** graph*

As a consequence in the function def _set_graph_edge_weights(graph) (from
summarizer.py), the following lines of codes do not generated 2 edges but
only one because of the undirected property of the graph module.

edge_1 = (sentence_1, sentence_2)

edge_2 = (sentence_2, sentence_1)

if not graph.has_edge(edge_1):

graph.add_edge(edge_1, weights[i][j])

if not graph.has_edge(edge_2):

graph.add_edge(edge_2, weights[j][i])

So the code does not implement what it expects to do.

PS: I have already sent an e-mail to warn about another mistake in the
summarization module concerning the implementation of BM25.py

You can find it at this link:

https://groups.google.com/forum/#!topic/gensim/dBEX182kXbg

--
You received this message because you are subscribed to the Google Groups "Gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Radim Řehůřek

2018-10-29 17:57:05 UTC

Permalink

Hi Nicolas,

thanks for the reports! It does sound like a bug to me, from your
description.

I informed the authors of that module, let's see what they think about this
functionality.

By the way, how did you discover those issues? What motivated your
investigations and reports?

Cheers,
Radim

Post by Nicolas Tastevin
The summarizer.py script implements TextRank algorithm to extract most
important sentences from a document.
The core concept of the algorithm relies on PageRank algorithm. The
sumarizer.py script creates an adjacency graph to fill in input of the
PageRank algorithm. The script uses a BM25 score to set weights on edges of
the graph. This score is not symmetric [ BM25(a,b) can be different from
BM25(b,a) ]. Thus the logic of the code is coded in order to preserve this
dissymmetry.
However, the non symmetric BM25 scores matrix generates a symmetric
adjacency matrix after applying the _set_graph_edge_weights function and
we loose the dissymmetry of BM25. This is caused by the use of the
gensim.summarization.graph module. Indeed as documented in the module: *this
module contains abstract class IGraph represents graphs interface and class
Graph (based on IGraph) which implements **undirected** graph*
As a consequence in the function def _set_graph_edge_weights(graph) (from
summarizer.py), the following lines of codes do not generated 2 edges but
only one because of the undirected property of the graph module.
edge_1 = (sentence_1, sentence_2)
edge_2 = (sentence_2, sentence_1)
graph.add_edge(edge_1, weights[i][j])
graph.add_edge(edge_2, weights[j][i])
So the code does not implement what it expects to do.
PS: I have already sent an e-mail to warn about another mistake in the
summarization module concerning the implementation of BM25.py
https://groups.google.com/forum/#!topic/gensim/dBEX182kXbg

Nicolas Tastevin

2018-10-30 09:13:42 UTC

Permalink

Hi Radim,

Thank you for replying.

I am working on a research subject around Text Summarization. This led me
to the TextRank algorithm as a benchmark. I wanted to test the Gensim
implementation. However on a huge document, using the Graph module to build
adjacency matrix seems to be computationally and memory expensive. Thus I
tweaked the code to make the preparation of the adjacency matrix rely on
numpy matrix only in order to run vectorized computation.

By comparing my new outputs with the one of gensim, results appeared to be
different. After some debug and break points, i noticed than the input
adjacency matrix for pagerank was always symmetric for Gensim (it should
not) while the output of the BM25 score (weight matrix) is in most cases a
non symmetric matrix. This conducts me to have a glance at
summarization.graph module because I suspected this module to build
undirected Graph.

I hope authors of the module will confirm that.

Please let me know what they think about it.

Best regards,

Nicolas

Ps: As a reminder, I pointed another little bug in a previous post:
https://groups.google.com/forum/#!topic/gensim/dBEX182kXbg

Post by Radim ÅehÅ¯Åek
Hi Nicolas,
thanks for the reports! It does sound like a bug to me, from your
description.
I informed the authors of that module, let's see what they think about
this functionality.
By the way, how did you discover those issues? What motivated your
investigations and reports?
Cheers,
Radim