How we built a search engine for UK MP tweets with Solr, Python & StanfordNLP

Matt Pearce writes:

We recently released UKMP, a search application built on work done on last year’s Enterprise Search hack day. This presents the tweets of UK Members of Parliament with search options including filtering by party, retweet and favourite count, and entities (people, locations and organisations) extracted from the tweet text. This is obviously its first incarnation, so there are still a number of features in development, but I thought I would comment on some of the decisions taken while developing the site.

I started off by deciding which bits of the hack day code would be most useful, from both the Solr set-up side and the web application we were hoping to build. During the hack day, the group had split into a number of smaller teams, with two of them working on a set of data downloaded from Twitter, containing the original set of UK MP tweets. I took the basic Solr setup and indexing code from one group, and the initial web application from the other.

Obviously we couldn’t work with a completely static data set, so I set about putting together a Python script to grab the tweets. This was where I met the first hurdle: I was trying to grab tweets from individual MPs’ feeds, but kept getting blocked by the Twitter API, even though I didn’t think I was over-stepping the limits set on the calls. With 200-plus MPs to track, a different approach would be required to avoid being blocked. Eventually, I took a different approach, and started using the lists compiled by Tweetminster, who track politicians tweets themselves. This worked much better, and I could soon start building a useful data set.

I chose the second group’s web application because it already used the Stanford NLP software to extract entities from the tweet text. The indexer script, also written in Python, calls the web app to extract the entities before indexing the tweets. We spent some time trying to incorporate the Stanford sentiment analysis as well, but found it wasn’t practical – the response time was too slow, and we didn’t have time to train the dataset to provide a more useful analysis of the content (almost all tweets were rated as either “negative” or “neutral”, which didn’t accurately reflect the sentiments in the data).

Since this was an entirely new project, and because it was being done outside the main client workflow, I took the opportunity to try out AngularJS, an MVC-oriented JavaScript front-end framework. This runs on top of, and calls back to, the DropWizard web application, which provides the Model part of the Model-View-Controller system. AngularJS itself provides the Controller, while the Views are all written in fairly standard HTML, with some AngularJS frosting to fill in the content.

AngularJS itself generally made development very easy and fast, and I was pleased by how little JavaScript I had to write to build a working application (there is also a Bootstrap crossover module, providing AngularJS directives to work with the UI layout tools Bootstrap provides). As a small site, there are only two controllers in play: one for each page. AngularJS also makes it very easy to plug in other script modules, such as that used to generate the word cloud on the About page. However, I did come across a few sticking points as I built the app, as one might expect from a first-time user. The principle one was handling the search box at the top of the page, which had to be independent of the view while needing to modify it to display the search results. I am still not sure that I ended up with the best approach – the search form fires an event when submitted, which then percolates up the AngularJS control hierarchy until caught and dealt with: within the search page, the search is handled normally; from other pages, we redirect to the search page and pass in the term. It doesn’t feel as smooth as it should do, which is why I remain unconvinced this is the best solution.

All in all, this was an interesting sideline project, and provided a good excuse to try out some new technology. The code itself, along with some notes on how to get the system up and running, is in our github repository – feel free to try it out, and make suggestions for improvements or better ways to use the code.

Leave a Reply

Your email address will not be published. Required fields are marked *