entity recognition – Flax http://www.flax.co.uk The Open Source Search Specialists Thu, 10 Oct 2019 09:03:26 +0000 en-GB hourly 1 https://wordpress.org/?v=4.9.8 London Text Analytics Meetup – Making sense of text with Lumi, Signal & Bloomberg http://www.flax.co.uk/blog/2015/12/16/london-text-analytics-meetup-making-sense-text-lumi-signal-bloomberg/ http://www.flax.co.uk/blog/2015/12/16/london-text-analytics-meetup-making-sense-text-lumi-signal-bloomberg/#respond Wed, 16 Dec 2015 16:21:32 +0000 http://www.flax.co.uk/?p=2860 This month’s London Text Analytics Meetup, hosted by Bloomberg in their spectacular Finsbury Square offices, was only the second such event this year, but crammed in three great talks and attracted a wide range of people from both academia and … More

The post London Text Analytics Meetup – Making sense of text with Lumi, Signal & Bloomberg appeared first on Flax.

This month’s London Text Analytics Meetup, hosted by Bloomberg in their spectacular Finsbury Square offices, was only the second such event this year, but crammed in three great talks and attracted a wide range of people from both academia and business. We started with Gabriella Kazai of Lumi, talking about how they have built a crowd-curated content platform for around 80,000 users whose interests and recommendations are mined to recommend content to others. Using Elasticsearch as a base, the system ingests around 100 million tweets a day and follows links to any quoted content, which is then filtered and analysed using a variety of techniques including NLP and NER to produce a content pool of around 60,000 articles. I’ve been aware of Lumi since our ex-colleague Richard Boulton worked there, but it was good to understand more about their software stack.

Next was Miguel Martinez-Alvarez of Signal, who are also dealing with huge amounts of data on a daily basis – over a million documents a day from over 100,000 sources plus millions of blogs. Their ambition is to analyse “all the world’s news” and allow their users to create complex queries over this – “all startups in London working on Machine Learning” being one example. Their challenges include the fact that around two-thirds of their ingested news articles are duplicates (due to syndicated content, for example), and they have built a highly scalable platform, again with Elasticsearch as a major part. Miguel talked in particular about how Signal work closely with academic researchers (including Professor Udo Kruschwitz of the University of Essex, with whom I will be collaborating next year) to develop cutting-edge analytics, with an Agile Data Science approach that includes some key evaluation questions, e.g. Will it scale? Will the accuracy gain be worth the extra computing power?
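
The talk as reported here doesn't say how Signal spot their duplicates, so the following is only a sketch of one common approach, not their method: break each article into overlapping word 'shingles' and compare the sets with Jaccard similarity. The function names, the shingle size of 3 and the 0.6 threshold are all assumptions chosen for illustration.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word n-grams ('shingles') in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < size:
        # Very short texts become a single shingle (or nothing at all).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0  # assumption: two empty texts count as identical
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, threshold=0.6):
    """True if the two texts share enough shingles (threshold is illustrative)."""
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold


original = "Bloomberg signs a deal with Twitter to ingest all tweets"
syndicated = "Bloomberg signs a deal with Twitter to ingest all tweets today"
unrelated = "Google launches a new version of its search appliance"

print(is_near_duplicate(original, syndicated))
print(is_near_duplicate(original, unrelated))
```

Comparing every pair of articles like this would not scale to a million documents a day; at that volume you would typically hash the shingles (MinHash, for example) rather than compare raw sets.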

Our last talk was from Miles Osborne of our hosts Bloomberg, who have recently signed a deal with Twitter to be able to ingest all past and forthcoming tweets – now that’s Big Data! The object of Miles’ research is to identify tweets that might affect a market and can thus be traded on, as early as possible after an event happens. His team have noticed that these tweets are often well-written (as opposed to the noise and abbreviations in most tweets) and seldom re-tweeted (no point letting your competitors know what you’ve spotted). Dealing with 500m tweets a day, they have developed systems to filter and route tweets into topic streams (which might represent a subject, location or bespoke category) using machine learning. One approach has been to build models using ‘found’ data (i.e. data that Bloomberg already has available) and to pursue a ‘simple is best’ methodology – although one model has 258 million features! Encouragingly, the systems they have built are now ‘good enough’ to react quickly enough to a crisis event that might significantly affect world markets.

We finished with networking, drinks and snacks (amply provided by our generous hosts) and I had a chance to catch up with a few old contacts and friends. Thanks to the organisers for a very interesting evening and the last event of this year for me – see you in 2016!

Google Search Appliance version 7 – too little too late? http://www.flax.co.uk/blog/2012/10/10/google-search-appliance-version-7-too-little-too-late/ http://www.flax.co.uk/blog/2012/10/10/google-search-appliance-version-7-too-little-too-late/#respond Wed, 10 Oct 2012 12:26:35 +0000 http://www.flax.co.uk/blog/?p=862 Google have launched a new version of their search appliance this week – this is the GSA of course, not the Google Mini which was canned in summer 2012 (someone hasn’t told Google UK it seems – try buying one … More

The post Google Search Appliance version 7 – too little too late? appeared first on Flax.

Google have launched a new version of their search appliance this week – this is the GSA of course, not the Google Mini which was canned in summer 2012 (someone hasn’t told Google UK it seems – try buying one though).

Although there’s a raft of new features, most of them have been introduced by the GSA’s competitors over the last few years or are available as open source (entity recognition or document preview, for example). The GSA is also not a particularly cheap option, as commentators including Stephen Arnold have noticed: we’ve had clients tell us of six-figure licence fees for reasonably sized collections of a few million documents – and that’s for two years, after which time you have to buy it again. Not surprisingly, some people have migrated to other solutions.

However there’s another question that seems to have been missed by Google’s strategists: how a physical appliance can compete with cloud-based search. I can’t think of a single prospective client over the last year or so who hasn’t considered this latter option on both cost and scalability grounds (and we’ll shortly be able to talk about a very large client who have chosen this route). Although there may well be a hard core of GSA customers who want a real box in reassuring Google yellow, one wonders why Google haven’t considered a ‘virtual’ GSA to compete with Amazon’s CloudSearch amongst others.

It will be interesting to see if this version of the GSA is the last…

Outside the search box – when you need more than just a search engine http://www.flax.co.uk/blog/2011/12/06/search-plus-when-you-need-more-than-just-a-search-engine/ http://www.flax.co.uk/blog/2011/12/06/search-plus-when-you-need-more-than-just-a-search-engine/#respond Tue, 06 Dec 2011 14:40:55 +0000 http://www.flax.co.uk/blog/?p=676 Core search features are increasingly a commodity – you can knock up some indexing scripts in whatever scripting language you like in a short time, build a searchable inverted index with freely available open source software, and hook up your … More

The post Outside the search box – when you need more than just a search engine appeared first on Flax.

Core search features are increasingly a commodity – you can knock up some indexing scripts in whatever scripting language you like in a short time, build a searchable inverted index with freely available open source software, and hook up your search UI quickly via HTTP – this all used to be a lot harder than it is now (unfortunately some vendors would have you believe this is still the case, which is reflected in their hefty price tags).
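
To give a feel for how little code the core idea needs, here is a minimal sketch of an inverted index with boolean AND search. It is a toy, not how any particular open source engine is implemented: the tokeniser, function names and sample documents are made up for illustration, and it has no ranking, stemming or persistence.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents
docs = {
    1: "Open source search with an inverted index",
    2: "Search appliance pricing",
    3: "Indexing scripts in a scripting language",
}
index = build_index(docs)
print(search(index, "search index"))
```

Note that document 3 is not returned for "index" because the toy tokeniser does no stemming, so "indexing" and "index" are different terms; this is the kind of detail a real search engine handles for you.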

However we’re increasingly asked to develop features outside the traditional search stack, to make this standard search a lot more accurate/relevant or to apply ‘search’ to non-traditional areas. For example, Named Entity Recognition (NER) is a powerful technique to extract entities such as proper names from text – these can then be fed back into the indexing process as metadata for each document. Part of Speech (POS) tagging tells you which words are nouns, verbs etc. Sentiment Analysis promises to give you some idea of the ‘tone’ of a comment or news piece – positive, negative or neutral for example, very useful in e-commerce applications (did customers like your product?). Word Sense Disambiguation (WSD) attempts to tell you the context a word is being used in (did you mean pen for writing or pen for livestock?).
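
As a rough illustration of feeding entities back into indexing as metadata, the sketch below adds an `entities` field to a document before it would be sent to the index. The 'recogniser' here is a deliberately crude stand-in (runs of two or more capitalised words), not real NER, which relies on trained models; the field names and sample text are assumptions.

```python
import re

# Crude stand-in for NER: two or more consecutive capitalised words.
# A real recogniser uses trained models and handles far more cases.
ENTITY_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def extract_entities(text):
    """Return candidate entities in order of first appearance, without repeats."""
    seen = []
    for match in ENTITY_PATTERN.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


def enrich(doc):
    """Return a copy of the document with an 'entities' metadata field added."""
    return {**doc, "entities": extract_entities(doc["body"])}


# Hypothetical document; 'id' and 'body' are illustrative field names.
doc = {
    "id": 1,
    "body": "Yesterday we met Miles Osborne near Finsbury Square in London.",
}
enriched = enrich(doc)
print(enriched["entities"])
```

The heuristic misses single-word entities such as "London" here, which is exactly why a proper NER tool is worth using; the point is only that the extracted values end up as an extra field that can be searched or faceted on.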

There are commercial offerings from companies such as Nstein and Lexalytics that provide some of these features. An increasing number of companies provide their services as APIs, where you pay per use – for example Thomson Reuters’ OpenCalais service, Pingar from New Zealand and WSD specialists SpringSense. We’ve also worked with open source tools such as Stanford NLP, which perform very well when compared to commercial offerings (and can certainly compete on cost grounds). Gensim is a powerful package that allows for semantic modelling of topics. The Apache Mahout machine learning library allows these techniques to be scaled to very large data sets.

These techniques can be used to build systems that provide not just powerful, enhanced search but also automatic categorisation and classification into taxonomies, document clustering, recommendation engines and automatic identification of similar documents. It’s great to be thinking outside the box – the search box that is!
