entity extraction – Flax
http://www.flax.co.uk
The Open Source Search Specialists

London Text Analytics Meetup – Making sense of text with Lumi, Signal & Bloomberg
http://www.flax.co.uk/blog/2015/12/16/london-text-analytics-meetup-making-sense-text-lumi-signal-bloomberg/
Wed, 16 Dec 2015 16:21:32 +0000

The post London Text Analytics Meetup – Making sense of text with Lumi, Signal & Bloomberg appeared first on Flax.

This month’s London Text Analytics Meetup, hosted by Bloomberg in their spectacular Finsbury Square offices, was only the second such event this year, but crammed in three great talks and attracted a wide range of people from both academia and business. We started with Gabriella Kazai of Lumi, talking about how they have built a crowd-curated content platform for around 80,000 users, whose interests and recommendations are mined so as to recommend content to others. Using Elasticsearch as a base, the system ingests around 100 million tweets a day and follows links to any quoted content, which is then filtered and analysed using a variety of techniques including NLP and NER to produce a content pool of around 60,000 articles. I’ve been aware of Lumi since our ex-colleague Richard Boulton worked there, but it was good to understand more about their software stack.

Next was Miguel Martinez-Alvarez of Signal, who are also dealing with huge amounts of data on a daily basis – over a million documents a day from over 100,000 sources, plus millions of blogs. Their ambition is to analyse “all the world’s news” and allow their users to create complex queries over this – “all startups in London working on Machine Learning” being one example. Their challenges include the fact that around two-thirds of their ingested news articles are duplicates (due to syndicated content, for example), and they have built a highly scalable platform, again with Elasticsearch as a major part. Miguel talked in particular about how Signal work closely with academic researchers (including Professor Udo Kruschwitz of the University of Essex, with whom I will be collaborating next year) to develop cutting-edge analytics, with an Agile Data Science approach that includes some key evaluation questions, e.g. Will it scale? Will the accuracy gain be worth the extra computing power?
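
To give a feel for the duplicate problem, one common way to catch syndicated near-duplicates is to compare overlapping word windows ('shingles') using Jaccard similarity. The sketch below is purely illustrative and is not Signal's method: the function names, the three-word shingle size and the 0.8 threshold are all assumptions of mine.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0  # assumption: two empty documents count as identical
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a, doc_b, threshold=0.8):
    """Treat two documents as duplicates if their shingle sets overlap enough."""
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold
```

At a million documents a day, comparing every pair like this would be far too slow, so a production system would need some form of hashing or indexing on top; the sketch only shows the similarity measure itself.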

Our last talk was from Miles Osborne of our hosts Bloomberg, who have recently signed a deal with Twitter to be able to ingest all past and forthcoming tweets – now that’s Big Data! The object of Miles’ research is to identify tweets that might affect a market and can thus be traded on, as early as possible after an event happens. His team have noticed that these tweets are often well-written (as opposed to the noise and abbreviations in most tweets) and seldom re-tweeted (no point letting your competitors know what you’ve spotted). Dealing with 500m tweets a day, they have developed systems to filter and route tweets into topic streams (which might represent a subject, location or bespoke category) using machine learning. One approach has been to build models using ‘found’ data (i.e. data that Bloomberg already has available) and to pursue a ‘simple is best’ methodology – although one model has 258 million features! Encouragingly, the systems they have built are now ‘good enough’ to react rapidly to a crisis event that might significantly affect world markets.

We finished with networking, drinks and snacks (amply provided by our generous hosts) and I had a chance to catch up with a few old contacts and friends. Thanks to the organisers for a very interesting evening and the last event of this year for me – see you in 2016!

Cambridge Search Meetup – Knowledge Discovery & Wayfinding
http://www.flax.co.uk/blog/2014/07/03/cambridge-search-meetup-knowledge-discovery-wayfinding/
Thu, 03 Jul 2014 11:53:56 +0000

The post Cambridge Search Meetup – Knowledge Discovery & Wayfinding appeared first on Flax.

We were lucky enough to have two speakers from Cambridge text mining company Linguamatics at last night’s Meetup. Robin Newton kicked us off with an amusing and idiosyncratic view of the uses and misuses of search – starting with the problem that when you have text search software, every problem can look like search might solve it. He gave an example of his recent search for a new job: although matching his skills on paper with a potential employer’s needs is one thing, he also wants to be sure the employer ‘isn’t a crook’! With reference to Tyler Tate’s talks on Information Wayfinding, which in turn quote urban planner Kevin Lynch, Robin told us how he felt that search ‘journeys’ weren’t always the most efficient way to discover an answer: his assertion was that finding a person who could tell you was more useful. Since, even in the most efficient and well-run organisation, not all information is held in documents, one might agree that finding an ‘expert’ is the best way to get the answers one needs. He finished with a welcome message that informal networking in pubs and cafes (much like our Meetup) helps share a lot of very useful information – and this is how he eventually decided that Linguamatics was going to be a great place to work.

Next was CTO and co-founder of Linguamatics, Dr David Milward, who described his company’s capability in text mining, Natural Language Processing (NLP) and search. He described the challenges of extracting ‘concepts’ from text – how words and acronyms with multiple potential meanings are difficult to parse automatically without contextual knowledge. Linguamatics’ approach has been described as ‘Agile NLP’ and allows the quick development of new patterns for concept extraction. A powerful example he gave was how, by specifying a relationship between two entities (in this case one company acquiring another), structured data can be extracted from unstructured text. Other examples focused on the medical and bioscience field (a particular interest of ours at present due to the upcoming BioSolr project) and showed how their software can cluster facts and find connections between disparate pieces of data (‘which X relates to Y via Z’). This process can also be used to generate new facets for searching from free text, including for numeric ranges, and these can even be tailored for different user groups. It’s clear that Linguamatics are experts in this area and David’s talk was of great interest to many in the room, including several from the European Bioinformatics Institute.
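
To illustrate the general idea of turning a stated relationship into structured data, here is a deliberately naive sketch using a regular expression. It is not Linguamatics' approach (which relies on proper linguistic analysis rather than surface patterns); the pattern, the function name, the output fields and the company names in the usage note are all hypothetical.

```python
import re

# Assumption: a company name is a run of capitalised words, e.g. "Acme Corp".
COMPANY = r"[A-Z][\w&]*(?:\s+[A-Z][\w&]*)*"

# Assumption: only simple active-voice phrasings are covered.
ACQUISITION = re.compile(
    rf"(?P<acquirer>{COMPANY})\s+(?:has\s+|have\s+)?"
    rf"(?:acquired|acquires|bought|buys)\s+(?P<target>{COMPANY})"
)


def extract_acquisitions(text):
    """Return one structured record per 'X acquired Y' style statement."""
    return [
        {
            "acquirer": m.group("acquirer"),
            "relation": "acquired",
            "target": m.group("target"),
        }
        for m in ACQUISITION.finditer(text)
    ]
```

Given the made-up sentence "Last week Acme Corp acquired Widget Works for an undisclosed sum.", this yields a single record linking Acme Corp to Widget Works. Passive constructions ("was acquired by") and ambiguous names are exactly where a pattern this simple falls over, which is the problem David was describing.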

We finished with the usual chat, networking and drinks. Thanks to both our speakers – and do let me know if you have a suggestion for a presentation at a future event!

ISKO UK – Taming the News Beast
http://www.flax.co.uk/blog/2014/04/02/isko-uk-taming-the-news-beast/
Wed, 02 Apr 2014 11:55:31 +0000

The post ISKO UK – Taming the News Beast appeared first on Flax.

I spent yesterday afternoon at UCL for ISKO UK’s event on Taming the News Beast – I’m not sure if we found out how to tame it but we certainly heard how to festoon it with metadata and lock it up in a nice secure ontology. There were around 90 people attending from news, content, technology and academic organisations, including quite a few young journalism students visiting London from Missouri.

The first talk was by Matt Shearer of BBC News Labs, who described how they are working on automatically extracting entities from video/audio content (including verbatim transcripts, contributors using face/voice recognition, objects using audio/image recognition, topics, actions and non-verbal events including clapping). Their prototype ‘Juicer’ extractor currently works with around 680,000 source items and applies 5.7 million tags – which represents around 9 man-years for a manual tagger. They are using Stanford NLP and DBpedia heavily, as well as an internal BBC project ‘Mango’ – I hope that some of the software they are developing is eventually open sourced as, after all, this is a publicly funded broadcaster. His colleague Jeremy Tarling was next and described a News Storyline concept they had been working on as a new basis for the BBC News website (which apparently hasn’t changed much in 17 years, and still depends on a lot of manual tagging by journalists). The central concept of a storyline (e.g. ‘US spy scandal’) can form a knowledge graph, linked to events (‘Snowden leaves airport’), videos, ‘explainer’ stories, background items etc. Topics can be used to link storylines together. This was a fascinating idea, well explained and something other news organisations should certainly take note of.

Next was Rob Corrao of LAC Group, describing how they had helped ABC News revolutionize their existing video library, which contains over 2 million assets. They streamlined the digitization process, moved little-used analogue assets out of expensive physical storage, re-organised teams and shift patterns and created a portal application to ease access to the new ‘video library as a service’. There was a focus on deep reviews of existing behaviour and a pragmatic approach to what did and didn’t need to be digitized. This was a talk more about process and management than technology, but the numbers were impressive: at the end of the project they were handling twice the volume with half the people.

Ian Roberts from the University of Sheffield then described AnnoMarket, a cloud-based market platform for text analytics, which wraps the rather over-complex open source GATE project in an API with easy scalability. As they have focused on precision over recall, AnnoMarket beats other cloud-based NLP services such as OpenCalais and TextRazor in terms of accuracy, and can process impressive volumes of documents (10 million in a few hours was quoted). They have developed custom pipelines for news, biomedical and Twitter content, with the news pipeline linked into the Press Association’s ontology (PA is a partner in AnnoMarket). For those wanting to carry out entity extraction and similar processes on large volumes of content at low cost, AnnoMarket certainly looks attractive.

Next was Pete Sowerbutts of PA on the prototype interface he had helped develop for tagging all of PA’s 3,000 daily news stories with entity information. I hadn’t known how influential PA is in the UK news sector – apparently 30% of all UK news is a direct copy of a PA feed and they estimate 70% is influenced by PA’s content. The UI showed how entities that have been automatically extracted can be easily confirmed by PA’s staff, making sure that the right entity is being used (the example being Chris Evans, who could be a UK MP, a television personality or an American actor). One would assume the extractor produces some kind of confidence measure, which raises the question of whether every single entity must be manually confirmed – but then again, PA must retain their reputation for high quality.
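
If the extractor does expose a confidence score (my assumption, as noted above, not something PA confirmed), the manual workload could in principle be cut by only sending uncertain entities for review. A minimal sketch of that triage, with a hypothetical record type, made-up scores and an arbitrary 0.95 threshold:

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    mention: str       # text as it appears in the story
    entity_id: str     # hypothetical identifier for the resolved entity
    confidence: float  # assumed extractor score between 0 and 1


def triage(candidates, auto_accept=0.95):
    """Split candidates into auto-accepted ones and ones needing human review."""
    accepted, review = [], []
    for c in candidates:
        (accepted if c.confidence >= auto_accept else review).append(c)
    return accepted, review
```

An ambiguous mention such as ‘Chris Evans’ would presumably score low and land in the review list, while an unambiguous organisation name could be accepted automatically. Where to set the threshold is an editorial decision about acceptable error rates rather than a technical one.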

The event finished with a brief open discussion featuring some of the speakers on an informal panel, followed by networking over drinks and snacks. Thanks to all at ISKO, especially Helen Lippell, for organising what proved to be a very interesting day.

Search Meetup Cambridge – Challenges of Unstructured Data
http://www.flax.co.uk/blog/2012/03/15/search-meetup-cambridge-challenges-of-unstructured-data/
Thu, 15 Mar 2012 09:52:12 +0000

The post Search Meetup Cambridge – Challenges of Unstructured Data appeared first on Flax.

Another Cambridge Search Meetup this week, with two speakers on unstructured data, plus the usual networking, beer and snacks. We started with Dean Yearsley of Pingar talking about, and bravely attempting a live demo of, their API, which amongst other things has facilities for entity extraction in multiple languages including English, Chinese and Japanese. The Pingar system is written in .NET and thus unsurprisingly plays well with SharePoint: Dean demonstrated it automatically providing extra metadata for SharePoint items, which is especially useful if a new column has been added to a SharePoint store, as it would be tedious for operators to have to add data for this column to each item manually.

Jordan Hrycaj of 7Safe, recently acquired by PA Consulting, was up next to talk about what he described as ‘ad-hoc’ search – for use in digital forensics or digital discovery applications. The application he described can be used to search the hard disks of suspect PCs or servers for information such as credit card numbers extremely quickly, working at a low level to avoid leaving any impression on the data (i.e., no file timestamps are altered) and usually working on live systems. This system is command-line based, tiny in size and portable across operating systems and is an impressive way to cut down the likely candidates for a data security breach. It was fascinating to hear about a way to search that doesn’t depend on indexing, and the compromises made for performance reasons (e.g., regular expressions can be used but without wildcards).
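
As a rough illustration of index-free searching, the sketch below scans a raw byte buffer for 16-digit runs using a fixed-length pattern with no wildcards, then discards anything that fails the Luhn checksum used by payment card numbers. This is not 7Safe's tool: the function names are mine, real card numbers are not always 16 digits, and the sketch makes no attempt at the low-level, timestamp-preserving disk access Jordan described.

```python
import re

# Assumption: 16 consecutive ASCII digits, not part of a longer digit run.
CARD_RE = re.compile(rb"(?<!\d)\d{16}(?!\d)")


def luhn_ok(digits: bytes) -> bool:
    """Luhn check: double every second digit from the right, sum, test mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = ch - 48  # convert an ASCII byte to its integer digit
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(data: bytes):
    """Yield (offset, number) for each Luhn-valid 16-digit run in the buffer."""
    for m in CARD_RE.finditer(data):
        if luhn_ok(m.group()):
            yield m.start(), m.group().decode("ascii")
```

Run over a buffer containing the widely used test number 4111111111111111, this reports its byte offset, while an arbitrary digit run such as 1234567890123456 is rejected by the checksum. The checksum step is what keeps the candidate list short without needing a more expressive (and slower) pattern.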

Thanks to both speakers and to all who came to hear them. We already have some more talks lined up so we expect the next Meetup to be sooner rather than later!
