geolocation – Flax

Solr geolocation searches using WKT – latitude or longitude first?
12 September 2014

Matt Pearce writes:

We have been working with a client who needs to search for documents based on location, either using a single point or (sometimes very) complex polygons. They supplied the location data in WKT (Well-Known Text) format, which we assumed we could feed directly into our search engine (in this case Solr) without modification.

Then we started testing the location searches using parameters in lat, long format. These were translated into a Solr filter query such as:

{!geofilt sfield=location pt=53.45,-0.25 d=20}

which produced no results, even though we knew there were documents well within the search radius. Reversing the coordinates did produce results, though, and that seemed like a quick solution, so we assumed there was a problem in Solr that needed to be flagged.
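For context, here's roughly how a filter like that is sent to Solr, as a minimal sketch in Python – the core name ('places') and field name ('location') are illustrative assumptions, not the client's actual setup:

import requests

# Hypothetical core and field names, for illustration only.
SOLR_URL = "http://localhost:8983/solr/places/select"

params = {
    "q": "*:*",
    # geofilt expects pt as "latitude,longitude" and d (radius) in km
    "fq": "{!geofilt sfield=location pt=53.45,-0.25 d=20}",
    "wt": "json",
}

response = requests.get(SOLR_URL, params=params)
print(response.json()["response"]["numFound"])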

This seemed like a problem that other Solr users would have come across, so I checked in JIRA, but nobody had raised it as an issue. That was a red flag to me, so I took a look at the code, and discovered that in the situation above, the first number is taken to be the y-coordinate, while the second is the x-coordinate. Very strange. I still didn’t want to raise a new issue, since it was looking increasingly like a problem with either our data or the request.

It turns out that in WKT format, the longitude coordinate comes first. We could get away with reversing the coordinates in our search string because all our locations were in the UK, where longitudes happen to fall within the valid latitude range of -90 to 90; it wouldn't work for points in the US, for example, where longitudes go beyond -90. The coordinate order is mentioned in the GeoJSON specification, and on the Elasticsearch Geo Shape Type page, although I initially found it in some helper pages for SQL Server 2008! Unfortunately, as far as I can see it is not mentioned in the Solr documentation, nor in the Wikipedia entry for WKT.

In short, if you are representing geographical location data in WKT (and storing it in Solr or Elasticsearch), longitude comes first!
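To make the ordering concrete, here's a minimal sketch (the helper functions are invented for illustration) showing WKT construction with longitude first, and why coordinate-swapping only appears to work for places like the UK:

def to_wkt_point(lat, lon):
    # WKT is "x y", i.e. longitude first, then latitude.
    return f"POINT({lon} {lat})"

def swap_would_be_rejected(lat, lon):
    # Reversing lat/long puts the longitude where a latitude is expected;
    # that only goes unnoticed while the longitude is within -90..90.
    # True for the UK, not for e.g. Seattle at longitude -122.3.
    return not (-90 <= lon <= 90)

print(to_wkt_point(53.45, -0.25))           # POINT(-0.25 53.45)
print(swap_would_be_rejected(53.45, -0.25))   # False - the swap slips through
print(swap_would_be_rejected(47.6, -122.3))   # True - out of latitude range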

Cambridge Search Meetup: free postcodes from old maps & visualising fish
18 July 2013

Last night was a particularly hot Cambridge Search Meetup: someone suggested that next time we lash four punts together and float down the river – it would certainly be a little cooler, though I'm still not sure how to rig a projector!

Our first speaker was Nick Burch, who told us about a fascinating past project of his to source freely available postcode data for the UK. His team collected out-of-copyright maps, scanned them and created a website to crowd-source knowledge of the postcode of individual features (say a childhood home or a church). In 6-9 months they had a database of the first four characters of all UK postcodes (e.g. CB1 1xx) and their locations – good enough for many location-based services to take advantage of their free feeds. Shortly afterwards the UK's Ordnance Survey released their own data for free – partly as a result of projects like Nick's and pressure from the burgeoning Open Data movement. Nick suggested that the best way to approach projects such as this is to look for data similar to what you require, find a way to interest people on the Internet in it, provide an API for corrections and feedback, and release all your data under a permissive licence.

Next up was Craig Mills, who provided some background on his past projects monitoring cod stocks (apparently it's good to offer fishermen a £500 bounty for returning your tagging hardware!) and more recently on tools for monitoring and visualising ecology. He mentioned the open source Sphinx search engine and visualisation tool CartoDB as two key technologies, and talked about how a clickable map interface is often preferred to the traditional search box. An interesting technique was to crowdsource photos from around the world and use an algorithm to spot the relative amount of 'nature' and 'man-made' textures in them – a potentially powerful way to measure how humans are changing the planet.

We finished as ever with beers, snacks and chat in the thankfully cooler downstairs bar. Thanks to both our speakers and all who came – next week we have a fantastic opportunity to join Grant Ingersoll on a free Apache Lucene/Solr hack day – do let us know if you’re coming as space is limited.

ECIR 2011 Industry day – part 2 of 2
28 April 2011

Here’s the second writeup.

We started after lunch with a talk from Flavio Junqueira of Yahoo! on web search engine caching. He talked both about the various things that can be cached (query results, term lists and document data) and the pros and cons of dynamic versus static caching. His work has focused on the former, with a decoupled approach – i.e. the cache doesn't automatically know what's changed in the index. Instead, data in the cache is given a 'time to live' (TTL), after which it is refreshed – an acceptable approach, as search engines don't have a 'perfect' view of the web at any one point in time anyway. As he mentioned, this method is less useful for 'real-time' data such as news.
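As a toy illustration of the TTL idea (my own sketch, not Yahoo!'s implementation), a decoupled result cache just timestamps each entry and refreshes anything older than its time-to-live:

import time

class TTLCache:
    """A toy query-result cache where entries expire after ttl seconds."""

    def __init__(self, ttl=60.0):
        self.ttl = ttl
        self.store = {}  # query -> (timestamp, results)

    def get(self, query):
        entry = self.store.get(query)
        if entry is None:
            return None
        stored_at, results = entry
        if time.monotonic() - stored_at > self.ttl:
            # Stale: the cache doesn't know whether the index changed,
            # it simply assumes results older than the TTL need refreshing.
            del self.store[query]
            return None
        return results

    def put(self, query, results):
        self.store[query] = (time.monotonic(), results)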

Francesco Calabrese followed, talking about his work at the IBM Smarter Cities Technology Centre in Dublin itself. Using data from mobile devices, his group has looked at 'digital footprints' and how they might be used to better understand such things as public transport provision. An interesting effect they have noticed is that they can predict the type of an event (say, a football match) from the points of origin of the attendees. This talk wasn't really about search, although the data gathered would be useful in search applications with geolocation features.

Gery Ducatel from BT was next, with a description of a search application for their mobile workforce, allowing searches over a job database as well as reference and health & safety information. This had some interesting aspects, not least with the user interface – you can’t type long strings wearing heavy gloves while halfway up a telegraph pole! The system uses various NLP features such as a part-of-speech tagger to break down a query and provide easy-to-use dropdown options for potential results. The user interface, while not the prettiest I’ve seen, also made good use of geolocation to show where other engineers had carried out nearby jobs.
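As a rough sketch of that kind of query breakdown – the talk didn't name BT's tooling, so using NLTK's tagger here is purely my assumption:

import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data are downloaded

def candidate_filters(query):
    # Tag each token with its part of speech, then keep the nouns
    # as candidates for easy-to-select dropdown options.
    tokens = nltk.word_tokenize(query)
    return [word for word, tag in nltk.pos_tag(tokens) if tag.startswith("NN")]

print(candidate_filters("overhead cable fault near telegraph pole"))
# e.g. ['cable', 'fault', 'telegraph', 'pole']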

I followed with my talk on Unexpected Search, which I’ll detail in a future blog post. We then moved onto a panel discussion on the IBM Watson project – suffice it to say that although I’ve been asked about this a lot in the last few months, it seems to me that this was a great PR coup for IBM rather than a huge leap forward in the technology (which by the way includes the open source Lucene search engine).

Thanks again to Udo and Tony for organising the day, and for inviting me to speak – there was a fascinating range of speakers and topics, and it was great to catch up with others working in the industry.
