tl;dr: I implemented a knowledge extraction algorithm able to extract hierarchies of concepts from a reasonably high volume of text. The work presented here is mostly a mashup, but this article details how it can be applied to extract an ontology or to synthesize a high volume of textual content.
The result is visible here: Metarticle.
A few years ago, while in engineering school, Olivier Favre presented a very interesting piece of work in which he built a content hierarchy from a content graph using the Louvain algorithm for fast community unfolding (clustering within a graph). The most interesting aspect of this method is that it outputs not just communities, but a hierarchy of communities.
I have always thought that this approach could be useful to uncover a hierarchy of concepts, and thus some simple form of ontology.
While getting in shape to start a new venture, KonnectR, I bookmarked a CBS article listing more than 70 startup post-mortem analyses. Much as I like reading online content, 70+ is way too much.
I then thought that adapting the idea and adding some sentiment analysis could be interesting. My expected output would be clusters of concepts, with each concept labeled with a sentiment (positive, negative or neutral).
With my friend and now former colleague Dan, I spent one Saturday building a working backend for this idea: content scraping, context extraction, concept extraction, sentiment analysis, co-occurrence graph and community extraction.
I thought the GitHub Data Challenge was a good opportunity to present this work. The most popular repositories for a given query provide a good, coherent corpus to work with, and the speed at which technologies evolve makes this kind of work relevant. It took me a couple of extra days of work to scrape data from GitHub, find a decent way of presenting the results and do some fine tuning.
The scraping is the most trivial part of the process. The GitHub API does not require any API key for these simple operations (and I am grateful for that). One of its services lets you search repositories; text retrieval then relies on the assumption that a README.md (or a similar file) exists and is written in Markdown.
A bit of cleanup is required (mostly to remove code in those READMEs) but this step is rather straightforward.
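A minimal sketch of that cleanup step, assuming the README is plain Markdown and that code lives in fenced blocks or inline backtick spans. The function name `strip_code` and the regular expressions are illustrative, not the code used for this project; indented code blocks and license sections are not handled.

```python
import re

# Fenced blocks opened by ``` or ~~~ at the start of a line and closed by the same marker.
FENCED_BLOCK = re.compile(r"^(```|~~~).*?^\1[ \t]*$", re.DOTALL | re.MULTILINE)
# Inline code spans such as `npm install`.
INLINE_CODE = re.compile(r"`[^`\n]+`")


def strip_code(markdown: str) -> str:
    """Return the README text with fenced blocks and inline code spans removed."""
    text = FENCED_BLOCK.sub("", markdown)
    text = INLINE_CODE.sub("", text)
    # Collapse the runs of blank lines left behind by the removals.
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```

Fenced blocks are removed before inline spans so that the backticks of a fence are never mistaken for an inline span.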
Possible improvement: some code is still not correctly removed, which causes some variable names to appear in the output. License text should also be excluded.
Context extraction is about extracting meaningful chunks from the global text. We mostly extracted sentences, the easiest and most common way to break a text into small pieces.
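As an illustration, sentence-level contexts can be approximated with a single regular expression. This is a rough sketch, not the splitter used for this project: the boundary rule (end punctuation, whitespace, then an uppercase letter or a digit) is an assumption and will mis-split abbreviations such as "e.g. Redis".

```python
import re

# A boundary is ., ! or ? followed by whitespace and then an uppercase letter or a digit.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9])")


def split_sentences(text: str) -> list[str]:
    """Split a text into sentence-like contexts."""
    # Normalize line breaks and repeated spaces before splitting.
    normalized = " ".join(text.split())
    return [s for s in SENTENCE_BOUNDARY.split(normalized) if s]
```

Each returned string is then treated as one context in the following steps.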
Possible improvement: working at paragraph level could provide different results, though it may deeply impact the density of the co-occurrence graph and thus most of the process output.
In simple words, concept extraction means extracting the meaning-bearing words from a sentence. The basic process, which is what was implemented, is to remove stop words from a sentence and return each remaining word as a single concept.
Before this step we build a list of possibly relevant words: a word is considered relevant if it appears in a number of texts large enough to support the analysis and small enough to be discriminating.
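The two steps above could look like the sketch below. The stop word list, the tokenizer and the thresholds `min_docs` and `max_docs` are placeholders chosen for the example, not the values used for this project.

```python
import re
from collections import Counter

# A deliberately tiny stop word list; a real one would be much longer.
STOP_WORDS = frozenset({
    "a", "an", "the", "is", "are", "was", "of", "to", "in",
    "and", "or", "for", "that", "this", "it", "with", "on",
})


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def relevant_words(documents: list[str], min_docs: int, max_docs: int) -> set[str]:
    """Keep words that appear in at least min_docs and at most max_docs documents."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))  # count each word once per document
    return {
        word for word, count in doc_freq.items()
        if min_docs <= count <= max_docs and word not in STOP_WORDS
    }


def extract_concepts(sentence: str, vocabulary: set[str]) -> list[str]:
    """Return the words of a sentence that belong to the relevant vocabulary."""
    return [word for word in tokenize(sentence) if word in vocabulary]
```

Counting documents rather than raw occurrences keeps a single very long README from pushing its own vocabulary above the lower threshold.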
Possible improvement: taken on its own, the sentence "They said that this was a great idea." does not explain much. Who are "they"? What is "this"? Anaphora resolution could really help to rebuild the semantic meaning that is lost through pronouns.
Possible improvement: stop word removal is great but does not really account for concepts that span more than one word, like "data mining" or "user experience". A phase handling concept discovery could add real value to the overall result.
Sentiment analysis was irrelevant for the GitHub Data Challenge: most README files are purely descriptive and do not express opinions. It was relevant for my article example, but not for this particular repurposing.
Sentiment analysis is about labelling text with a sentiment; it answers questions like "is this tweet about our brand positive?". The biggest problem here may be finding a training set to learn from. We used one built from movie ratings.
Once the model is trained, each concept receives an aggregate rating according to the sentiment of the contexts in which it appears.
Possible improvement: movie ratings have their own vocabulary, and business has a completely different one. I think this biases the results a lot: words like "scene", "character" and "scenario" can have a completely different meaning depending on the context.
Possible improvement: before display, sentiments could be recomputed so that they are analyzed in the context in which they are shown. This may be easier if sentiment is stored along graph edges.
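One possible way to aggregate, shown as a sketch: average the scores of the contexts where a concept appears and map the average to a label. The score range of -1 to 1, the use of a plain mean and the `neutral_band` threshold are all assumptions made for this example; the scores themselves would come from the trained classifier.

```python
from collections import defaultdict
from statistics import mean


def aggregate_sentiment(scored_contexts, neutral_band=0.1):
    """Map each concept to (label, average score).

    scored_contexts is an iterable of (concepts, score) pairs, where score is
    the classifier output for one context, assumed to lie between -1 and 1.
    """
    scores = defaultdict(list)
    for concepts, score in scored_contexts:
        for concept in set(concepts):  # a concept counts once per context
            scores[concept].append(score)

    result = {}
    for concept, values in scores.items():
        average = mean(values)
        if average > neutral_band:
            label = "positive"
        elif average < -neutral_band:
            label = "negative"
        else:
            label = "neutral"
        result[concept] = (label, average)
    return result
```

Keeping the average next to the label preserves how strongly a concept leans one way or the other.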
If two concepts appear in the same context, we build an edge between them in our co-occurrence graph. Easy, ain't it?
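A minimal sketch of that construction, assuming each context has already been reduced to its list of concepts. Weighting each edge by the number of shared contexts is an assumption added for the example.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(contexts):
    """Return a Counter mapping (concept_a, concept_b) to a co-occurrence count.

    contexts is an iterable of concept lists, one list per context.
    Each edge key is a sorted pair, so (a, b) and (b, a) share one entry.
    """
    edges = Counter()
    for concepts in contexts:
        # Deduplicate and sort so every pair has a single canonical form.
        for a, b in combinations(sorted(set(concepts)), 2):
            edges[(a, b)] += 1
    return edges
```

A context with n distinct concepts contributes n(n-1)/2 edges, which is why longer contexts such as paragraphs make the graph much denser.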
The next step, community extraction, is the most expensive one and depends on the topology of the graph. To reduce the complexity of the computations, we dropped concepts that did not appear often enough (not significant enough) and those that appeared too often (we are looking for discriminating concepts).
Community extraction can follow agglomerative or divisive approaches to find communities in a graph. The goal is to build node clusters (communities) in a way that minimizes a metric correlated with the number of edges that connect different communities (and possibly takes into account the intensity of the edges inside a community).
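One widely used metric of this kind is modularity, the quantity the Louvain algorithm optimizes. It is maximized rather than minimized: it rewards edge weight inside communities and penalizes what a random graph with the same degrees would be expected to contain. For each community c it adds L_c / m - (d_c / 2m)^2, where m is the total edge weight, L_c the weight inside c and d_c the sum of the degrees of its nodes. The sketch below computes it for a weighted edge dictionary; the function name and input shapes are chosen for the example.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Compute the modularity of a partition of an undirected weighted graph.

    edges maps (node_a, node_b) pairs to a weight, one entry per edge.
    community_of maps every node to its community identifier.
    """
    total_weight = sum(edges.values())
    if total_weight == 0:
        return 0.0

    internal = defaultdict(float)  # edge weight fully inside each community
    degree = defaultdict(float)    # sum of node degrees per community
    for (a, b), weight in edges.items():
        degree[community_of[a]] += weight
        degree[community_of[b]] += weight
        if community_of[a] == community_of[b]:
            internal[community_of[a]] += weight

    return sum(
        internal[c] / total_weight - (degree[c] / (2 * total_weight)) ** 2
        for c in degree
    )
```

Putting every node in a single community always scores 0, so a positive value means the partition captures more internal structure than chance would.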
Community extraction is done here using the Louvain algorithm, an agglomerative approach with the interesting property of extracting a hierarchy of communities: the communities found at one step of the algorithm are merged to build those of the next step.
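To make the hierarchy concrete, here is a sketch that nests successive levels into a tree. It does not run Louvain itself. The input shape is hypothetical: `levels[0]` maps each concept to a community identifier, and every later level maps the communities of the previous level to the merged communities of the next one. The generated node names are placeholders.

```python
def build_hierarchy(levels):
    """Nest successive partition levels into a tree of {"name", "children"} dicts.

    levels must contain at least one mapping. levels[0] maps concepts to
    community ids; levels[i] maps community ids of level i-1 to ids of level i.
    """
    # Start with one leaf node per concept.
    nodes = {concept: {"name": concept} for concept in levels[0]}
    for depth, mapping in enumerate(levels):
        parents = {}
        for child_id, parent_id in mapping.items():
            parent = parents.setdefault(
                parent_id,
                {"name": f"level{depth}/{parent_id}", "children": []},
            )
            parent["children"].append(nodes[child_id])
        # The communities of this level become the children of the next one.
        nodes = parents
    return {"name": "root", "children": list(nodes.values())}
```

With two levels, the root holds the coarsest communities, each of which holds finer communities, which in turn hold the concepts.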
Possible improvement: for some configurations and some cases, the algorithm output only one level. Instead of tweaking parameters for each example, finding a heuristic to adjust them would bring more automation to the process.
I reused Mike Bostock's D3.js Zoomable Treemap example to add a display layer on top of everything.
Possible improvement: right now all cells are the same color. Color could be used to represent the computed sentiment and also how much it is disputed (if the concept is positive but not by much, the color would be less saturated).
What It Uncovers
As a reminder, the visual result is available here: Metarticle.
The displayed results group related terms together. There is a lot of noise, but key concepts can be highlighted. For example, under "list/reddis", terms related to high performance and scalability such as "worker", "job", "queue" and "delay" are grouped together. Similarly, under "list/async" most terms are linked with asynchronous processing. This kind of approach could be very useful for discovering new topics, helping end users connect the dots.
Some fine tuning remains to be done, which suggests these results can be improved.
The biggest hierarchies seem to be the garbage ones. The clustering algorithm might be focusing too much on grouping elements; I would suggest tweaking it to discard some low-degree concepts in order to improve the overall results.
On the performance side, this program is very slow but could easily be optimised. It's a quick mashup, and huge improvements could be achieved by parallelizing everything and cleaning up the many quick fixes that were made.
I have discovered that this practice is called topic modeling.