User:The Anome/Geodata initiative

Wikipedia:WikiProject Geographical coordinates is a great success. As of April 2023, about 1.18 million pages now contain linked, machine-readable geographical coordinates, and the geodata is generally of high quality.

The backlog

However, there is still a backlog of articles that are eligible for coordinates but do not yet have them.

As I see it, there are three top-level tasks that need to be performed to help improve this situation:

  • reducing the backlog of articles that are eligible for geodata but currently lack it, by finding new data
  • validation of existing geodata
  • de-tagging articles for which it will never be possible to assign geodata

And one more task that would be of great use going forward:

  • Integration with OpenStreetMap and Wikidata

Reducing the backlog

There are a number of sources that can be used to fill in the backlog on enwiki.

Discovery of more license-compliant free data sources

At the moment, I'm using two public domain data sources created by the U.S. federal government: the USGS's Domestic Names dataset for the United States, and the National Geospatial-Intelligence Agency's GEOnet Names Server (GNS) data for the rest of the world.

There are other sources out there which are not license-compliant, most notably OpenStreetMap's data, and these definitely cannot be used.

However, it would be interesting to see if there have been any more recent additions to the world of open geodata which might be useful for geocoding on Wikipedia.

Candidates:

GNS multimatches

The Anomebot draws upon the public domain GEOnet Names Server (GNS) data, and uses an algorithm that matches unique data points (features of a given type and name for which GNS has exactly one record in a given country) to single Wikipedia articles that describe a feature of the same type, with the same name, in the same country. Combined with a large number of rejection heuristics, this seems to provide entity resolution that is well in excess of 99% accurate.

There are still a considerable number of "multimatches", in which more than one GNS data point can be found to match a single Wikipedia article. If these multimatches could be resolved, I estimate that about 20,000 more articles could be geocoded. Doing this is not a simple exercise.

In some cases, multimatches may simply be multiple points for the same entity, which could then be merged to generate a single point. In other cases, different points might actually represent different entities. There is no easy way to program this; this is a complex entity-resolution task.
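To make the matching rule above concrete, here is a minimal sketch in Python, assuming the GNS data and the article metadata have already been reduced to records with name, country and feature-type fields (the field names and data structures are illustrative assumptions, not the bot's actual code):

from collections import defaultdict

def match_articles(gns_records, articles):
    # Group GNS points by (name, country, feature type); a key with exactly
    # one point is the "unique data point" case described above.
    points_by_key = defaultdict(list)
    for rec in gns_records:
        key = (rec["name"].lower(), rec["country"], rec["feature_type"])
        points_by_key[key].append(rec)

    matches, multimatches = {}, {}
    for art in articles:
        key = (art["name"].lower(), art["country"], art["feature_type"])
        candidates = points_by_key.get(key, [])
        if len(candidates) == 1:
            matches[art["title"]] = candidates[0]    # unique: safe to geocode (after rejection heuristics)
        elif len(candidates) > 1:
            multimatches[art["title"]] = candidates  # needs further entity resolution
    return matches, multimatches

A first-pass heuristic for the "multiple points, same entity" case might be to merge candidate points whose coordinates all lie within some small radius of their centroid, but as noted above, the general case is a genuinely hard entity-resolution problem.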

Pulling from Wikidata

Pulling data from Wikidata, or from other Wikipedia language editions, seems like another obvious way to go. But before this can be done, care needs to be taken to avoid pulling in bad data. In particular, this means not importing data from bot-generated sources that might be of dubious provenance, either because of poor entity resolution (e.g. LSJbot?) or because of wholesale import of tainted data from external sources. Also, in some cases, manually maintained data sourced from low-traffic Wikipedias may not have had sufficient exposure to peer review to catch erroneous data.
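As an illustration only, the sketch below pulls coordinate statements for items that have an enwiki article, and records whether each statement's reference is an import from another Wikimedia project. P625 (coordinate location) and P143 (imported from Wikimedia project) are real Wikidata properties, but the provenance heuristic and the code itself are just an assumption about how such a filter might look:

import requests

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Coordinate statements for items with an English Wikipedia article, plus the
# project they were imported from, if the reference says so.
QUERY = """
SELECT ?item ?article ?coord ?importedFrom WHERE {
  ?article schema:about ?item ;
           schema:isPartOf <https://en.wikipedia.org/> .
  ?item p:P625 ?stmt .
  ?stmt ps:P625 ?coord .
  OPTIONAL { ?stmt prov:wasDerivedFrom/pr:P143 ?importedFrom . }
}
LIMIT 500
"""

def candidate_coords():
    r = requests.get(WDQS_ENDPOINT,
                     params={"query": QUERY, "format": "json"},
                     headers={"User-Agent": "geodata-backlog-sketch/0.1 (example only)"})
    r.raise_for_status()
    for row in r.json()["results"]["bindings"]:
        # Treat "referenced only by a wiki import" as dubious provenance; a real
        # filter would also need to look at which bot added the statement, the
        # source wiki's edit history, and so on.
        dubious = "importedFrom" in row
        yield row["item"]["value"], row["coord"]["value"], dubious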

Language resolution

There are lots of places whose local names are written in Arabic or another non-Latin script, where multiple competing transliterations make entity resolution difficult. Finding native-language resources for data on these places would be very useful. Algorithms exist for matching Arabic and Latin-script names, but they are complex, and I'm not sure they're reliable enough.
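As a toy illustration of what even the crudest approach involves, the sketch below reduces romanised names to a rough consonant skeleton before comparing them. The variant rules are invented for this example; a real matcher would need to be far more sophisticated, and even then would conflate some genuinely distinct names:

import re
import unicodedata

def skeleton(name: str) -> str:
    """Reduce a romanised name to a crude consonant skeleton for fuzzy matching."""
    s = unicodedata.normalize("NFKD", name.lower())
    s = "".join(c for c in s if not unicodedata.combining(c))   # strip diacritics
    s = re.sub(r"^(al|el|ad|ed)[- ]", "", s)                    # drop the definite article
    s = re.sub(r"[^a-z]", "", s)                                # letters only
    s = s.translate(str.maketrans("qcg", "kkj"))                # merge common variant consonants
    s = re.sub(r"[aeiouwyh]", "", s)                            # drop vowels and weak letters
    return re.sub(r"(.)\1+", r"\1", s)                          # collapse repeated letters

# e.g. skeleton("Al-Qāhirah") == skeleton("El Kahira")  ->  both reduce to "kr"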

Cross-wiki collaboration

At the moment, the major formalized geodata tagging efforts have been confined to enwiki. It would be useful to have something like the {{coord missing}} mechanism available on other language Wikipedia editions.

Outreach to WikiProjects

There are large pools of articles without geotags that are within the remit of particular WikiProjects; for example, Mexico articles missing coordinate data will be within the remit of Wikipedia:WikiProject Mexico. It might be useful to reach out (possibly in an automated way?) to those WikiProjects by providing them with information about eligible pages that might need their help. However, doing this in a machine-generated way will need community approval well before any kind of talk page spamming effort is considered.

Outreach to individual editors

Another idea would be to reach out to recent editors of articles (in particular to article creators) to let them know that an article they've edited could do with having coordinates added, and to give them information on how to do that. Community approval would also be crucial here; I would suggest that editors be alerted either at most once a year, or simply once and never again.

Outreach to Wikidata

It would be interesting to consider adding a "geocodable place" class to Wikidata, or a property which could be used to mark individual articles as geocodable but not yet coded.

This might be possible to do by adding subclass relationships to classes such as "structure", "administrative region", "populated place", "natural feature", and so on.
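As a sketch of what this would enable, a query along the following lines (runnable at https://query.wikidata.org/sparql) would enumerate items that look geocodable but carry no coordinate yet. Q486972 ("human settlement") is used here purely as an example starting class; a dedicated "geocodable place" class would replace it:

# SPARQL sketch (P31 = instance of, P279 = subclass of, P625 = coordinate
# location): items reachable from the example class that have an enwiki
# article but no coordinate on Wikidata.
GEOCODABLE_BUT_UNCODED = """
SELECT ?item ?article WHERE {
  ?item wdt:P31/wdt:P279* wd:Q486972 .
  FILTER NOT EXISTS { ?item wdt:P625 ?coord . }
  ?article schema:about ?item ;
           schema:isPartOf <https://en.wikipedia.org/> .
}
LIMIT 500
"""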

Validation of existing data

There are numerous sources of geodata that are not eligible for inclusion in Wikipedia because of data licensing problems. It should, however, be permissible to use them to validate Wikipedia's existing data, so that errors can be flagged for manual resolution.

This could also allow the detection of any previous automated imports of data with licensing problems (for example, by looking for coordinates that match an external source to implausibly high precision), which could then potentially be replaced or deleted.
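A minimal sketch of both checks, assuming the external reference dataset has already been joined to articles by some other means; the threshold, the precision test and the function names are arbitrary illustrative choices:

import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi, dlam = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def flag_discrepancy(article_coord, reference_coord, threshold_km=5.0):
    """Flag articles whose tagged point is implausibly far from the reference
    point; the threshold would need tuning per feature type (a large lake
    tolerates far more offset than a building)."""
    return haversine_km(*article_coord, *reference_coord) > threshold_km

def suspiciously_exact(article_coord, reference_coord, decimals=5):
    """Agreement with a licence-encumbered source to roughly metre precision
    may indicate a past wholesale import rather than independent derivation."""
    return all(round(a, decimals) == round(r, decimals)
               for a, r in zip(article_coord, reference_coord))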

Splitting out of fields within coord templates

Instead of:

{{coord|51|30|26|N|0|7|39|W|region:GB-ENG_type:city|display=inline,title}}

have

{{coord|51|30|26|N|0|7|39|W|region=GB-ENG|type=city|display=inline,title}}

which seems semantically nicer to me, and makes it easier to add new geodata fields like "elevation" or "scale".

Q: Would this actually be useful to anyone? Will it break existing tools? Is it worth the huge effort of bot-editing hundreds of thousands of pages?
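For what it's worth, the mechanical part of the conversion looks straightforward. The sketch below rewrites the packed parameter with a regex; a production bot would want a proper wikitext parser (e.g. mwparserfromhell) and far more edge-case handling, so treat this as an assumption-laden illustration only:

import re

COORD_PARAMS = re.compile(r"(\{\{coord\|[^{}]*?\|)([a-z]+:[^|{}]+(?:_[a-z]+:[^|{}]+)*)")

def split_coord_fields(wikitext: str) -> str:
    """Rewrite region:GB-ENG_type:city style parameters as |region=GB-ENG|type=city."""
    def repl(m):
        head, packed = m.groups()
        named = "|".join(p.replace(":", "=", 1) for p in packed.split("_"))
        return head + named
    return COORD_PARAMS.sub(repl, wikitext)

# split_coord_fields("{{coord|51|30|26|N|0|7|39|W|region:GB-ENG_type:city|display=inline,title}}")
# -> "{{coord|51|30|26|N|0|7|39|W|region=GB-ENG|type=city|display=inline,title}}"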

Integration with OpenStreetMap and Wikidata

OSM is the other pole of the geodata world: free to use, but license-incompatible with Wikipedia. Linking Wikipedia to OSM and vice versa is really the next phase for the geodata project; ideally, every single geotag for a mappable entity should have an OSM ID attached.

While various efforts have already been made in this direction, it would be great if this could be drawn into the existing geodata project, which has been so successful to date.

This can be achieved both from the Wikipedia side and from outreach to the OpenStreetMap community.

One idea: put OSM IDs in geotags (Wikidata IDs are covered under "Big idea" below). While most articles with geotags have exactly one tag, which locates the subject of the article and corresponds directly to that article's Wikidata item, the reverse is not necessarily true. Some pages have many geotags; for example, pages containing lists of geographical features or listed buildings. In many cases, each of these features will have its own OSM ID and Wikidata ID, distinct from those of the listing page itself.

So London's tag would now look like

{{coord|51|30|26|N|0|7|39|W|region=GB-ENG|type=city|OSM=65606|display=inline,title}}

Question: would this serve a useful purpose? Who would be the target data consumer for this?
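One hypothetical consumer, sketched below, would be a checker that takes the OSM ID from the tag and pulls the object's tags from the Overpass API, so that names and any wikipedia/wikidata tags on the OSM side can be cross-checked against the article. The endpoint and query language are real; the workflow is only an assumption:

import requests

def fetch_osm_relation_tags(osm_id: int) -> dict:
    """Fetch the tags of an OSM relation (e.g. 65606 for London) for cross-checking."""
    query = f"[out:json];relation({osm_id});out tags;"
    r = requests.post("https://overpass-api.de/api/interpreter",
                      data={"data": query}, timeout=60)
    r.raise_for_status()
    elements = r.json().get("elements", [])
    return elements[0].get("tags", {}) if elements else {}

# fetch_osm_relation_tags(65606).get("wikidata")  ->  "Q84" (London), if the tag is present

A lookup in the other direction already exists on the OSM side, since OSM objects can carry wikipedia= and wikidata= tags, which is one hook for outreach to the OpenStreetMap community.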

Big idea

One geotag, one Wikidata entity, one OSM entity

At the moment, only whole Wikipedia pages can link to Wikidata entries. But often there are buildings that are notable enough for Wikidata, for example by being on a historic sites registry, yet on Wikipedia only notable enough to have an entry in a table or list within an article, with no way to link the two.

This could be done by adding a QID field to geotags in these pages. Note that pages which have only one geotag would not need QIDs in the tags, as that would be redundant.

So, for example, the entry for "Number 7" in Grade II* listed buildings in Coventry would have a corresponding Wikidata item, and a link to its QID in the coord tag.
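Schematically, such an entry's tag might then look something like the following (the coordinates and QID shown are placeholders, and qid is the proposed new field, not an existing one):

{{coord|<lat>|<lon>|type=landmark|qid=Q12345678|display=inline}}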

Question: would this serve a useful purpose? Who would be the target data consumer for this?