User:The Anome/Wikidata geodata import

Things to be imported:

  • geodata from Wikidata for the ~20,000 articles that have been identified as eligible for geocoordinates but currently do not have them: see Category:Articles missing coordinates with coordinates on Wikidata
  • only if the Wikidata item is a descendant of http://www.wikidata.org/wiki/Q123349660 -- "Geolocatable item", the top of the hierarchy of geolocatable objects on Wikidata (see d:User:The_Anome for why this isn't useful)
  • only if sourced on Wikidata from a Wikipedia edition (see the list of eligible editions and the checking sketch below)
  • only from Wikipedias with > 1,000,000 articles, so that we know there's an active editing community to provide oversight
  • not from any of the Swedish, Waray or Cebuano Wikipedias, because their articles are likely to have been auto-generated by Lsjbot
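
A minimal sketch of the sourcing check, querying the Wikidata API directly. It assumes the P625 reference is recorded with P143 ("imported from Wikimedia project"), which is how Wikipedia-sourced coordinates are usually cited; the function name is illustrative, not part of any existing tooling.

    # Sketch: find which Wikimedia projects are cited as the source of an item's
    # P625 claim. Assumes the references use P143 ("imported from Wikimedia project").
    import requests

    WIKIDATA_API = "https://www.wikidata.org/w/api.php"

    def p625_source_projects(qid):
        """Return the QIDs of the projects cited in the references of an item's P625 claims."""
        r = requests.get(WIKIDATA_API, params={
            "action": "wbgetclaims",
            "entity": qid,
            "property": "P625",
            "format": "json",
        }, timeout=30)
        r.raise_for_status()
        projects = set()
        for claim in r.json().get("claims", {}).get("P625", []):
            for ref in claim.get("references", []):
                for snak in ref.get("snaks", {}).get("P143", []):
                    if snak.get("snaktype") == "value":
                        projects.add(snak["datavalue"]["value"]["id"])
        return projects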

Wikipedia editions with > 1,000,000 articles, excepting English; the three editions excluded above are listed here but not counted:

  • Polish: pl
  • Arabic: ar
  • German: de
  • Spanish: es
  • French: fr
  • Italian: it
  • Egyptian Arabic: arz
  • Dutch: nl
  • Japanese: ja
  • Portuguese: pt
  • Cebuano (Sinugboanong Binisaya): ceb (excluded, see above)
  • Swedish (Svenska): sv (excluded, see above)
  • Ukrainian: uk
  • Vietnamese: vi
  • Waray (Winaray): war (excluded, see above)
  • Chinese: zh
  • Russian: ru

Totalling 14 eligible Wikipedia editions.
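
For reference, the eligible editions as a Python set of language codes, plus a helper that maps a project QID (as returned by the P143 check above) back to a language code. The helper assumes the Wikipedia-edition items carry P424 ("Wikimedia language code"); names are illustrative.

    # The 14 eligible editions as language codes; the three Lsjbot-heavy editions
    # are kept separately so they can be reported rather than silently dropped.
    ELIGIBLE_WIKIS = {"pl", "ar", "de", "es", "fr", "it", "arz", "nl", "ja",
                      "pt", "uk", "vi", "zh", "ru"}
    EXCLUDED_WIKIS = {"ceb", "sv", "war"}

    import requests

    def project_language_code(project_qid):
        """Map a Wikipedia-edition item (e.g. the value of a P143 reference) to its
        language code, assuming the item has P424 ("Wikimedia language code")."""
        r = requests.get("https://www.wikidata.org/w/api.php", params={
            "action": "wbgetentities", "ids": project_qid,
            "props": "claims", "format": "json"}, timeout=30)
        r.raise_for_status()
        claims = r.json()["entities"][project_qid].get("claims", {})
        for statement in claims.get("P424", []):
            mainsnak = statement["mainsnak"]
            if mainsnak.get("snaktype") == "value":
                return mainsnak["datavalue"]["value"]
        return None

    def source_is_eligible(project_qid):
        return project_language_code(project_qid) in ELIGIBLE_WIKIS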

Petscan query for articles with coords on Wikidata but not on Wikipedia: here

Problem: at least some Wikidata items have P625 coordinates, but no reference for that coordinate. The only practical option is to inspect the pages on every candidate wiki. Doing that live could mean a substantial fraction of 20,000 x 14 = 280,000 page fetches, which is mad. Go to the dumps? The geo_tags dumps are tiny compared to the usual page and categorylinks tables, so downloading 14 of them seems quite practical.
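
A sketch of fetching the geo_tags dump for each eligible wiki. The file-name pattern below follows the usual dumps.wikimedia.org layout but should be verified against the actual dump listings before use; ELIGIBLE_WIKIS is the set defined above.

    # Fetch the (small) geo_tags dumps for the 14 eligible wikis.
    # Assumes the usual dump naming convention, e.g. arwiki-latest-geo_tags.sql.gz.
    import urllib.request

    DUMP_URL = "https://dumps.wikimedia.org/{db}/latest/{db}-latest-geo_tags.sql.gz"

    for lang in sorted(ELIGIBLE_WIKIS):
        db = lang + "wiki"          # "ar" -> "arwiki", etc.
        url = DUMP_URL.format(db=db)
        print("fetching", url)
        urllib.request.urlretrieve(url, f"{db}-latest-geo_tags.sql.gz")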

Format of the geo_tags SQL files is documented here: https://www.mediawiki.org/wiki/Extension:GeoData/geo_tags_table -- but the table has no cross-reference to QID or page name, only to page_id, and interpreting those requires downloading the entire page.sql file for each wiki. At 14 lots of ~2 Gbytes, that feels like a bit too much data to download, but not as mad as doing > 200k page loads.
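
A rough sketch of pulling the tagged page_ids out of one of those dumps. The regex assumes the INSERT rows start with (gt_id, gt_page_id, ...); check the CREATE TABLE statement in the dump against the column order documented on the Extension:GeoData page before relying on it.

    # Extract the set of page_ids that have geo_tags rows in a downloaded dump.
    # Assumes the row layout starts (gt_id, gt_page_id, ...); verify against the
    # CREATE TABLE statement in the dump itself.
    import gzip
    import re

    ROW_START = re.compile(r"\((\d+),(\d+),")   # captures gt_id, gt_page_id

    def tagged_page_ids(dump_path):
        pages = set()
        with gzip.open(dump_path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                if line.startswith("INSERT INTO `geo_tags`"):
                    for _gt_id, page_id in ROW_START.findall(line):
                        pages.add(int(page_id))
        return pages

    # These page_ids still need joining to page titles / QIDs, hence the
    # page.sql question above.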