
Wikipedia:Wikipedia Signpost/2024-05-16/Op-Ed

File:Wikidata 6th Birthday cake Wikimedia Norge.jpg (photo by Alicia Fagerving, CC BY-SA 3.0)

Wikidata to split as sheer volume of information overloads infrastructure

The Wikimedia Foundation will soon split parts of the WikiCite dataset off from the main Wikidata dataset. Both data collections will remain available through the Wikidata Query Service, but by default queries will return content from the main graph, and users will have to take extra steps to request WikiCite content. This is the start of query federation for Wikidata content, and is a consequence of Wikidata holding so much content that the servers hosting the Wikidata Query Service are under strain.
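The "extra effort" of federation can be sketched as follows. The SERVICE keyword is standard SPARQL 1.1 federation, which the Wikidata Query Service already supports; the WikiCite endpoint URL below is hypothetical, since the actual address of the split graph has not been fixed at the time of writing.

```python
# Sketch of a federated SPARQL query after the split.
# NOTE: the WikiCite endpoint URL is a placeholder, not an announced address.

WIKICITE_ENDPOINT = "https://query-wikicite.example.org/sparql"  # hypothetical

def federated_query(author_qid: str) -> str:
    """Build a query that starts in the main graph (author items remain
    there after the split) and pulls the author's articles from the
    WikiCite graph via SPARQL 1.1 federation (the SERVICE keyword)."""
    return f"""
    SELECT ?article ?title WHERE {{
      # Main graph: the author item (P31 = instance of, Q5 = human).
      wd:{author_qid} wdt:P31 wd:Q5 .
      # Extra step: explicitly federate out to the WikiCite graph.
      SERVICE <{WIKICITE_ENDPOINT}> {{
        ?article wdt:P50 wd:{author_qid} ;   # P50 = author
                 rdfs:label ?title .
      }}
    }}
    """

query = federated_query("Q80")  # Q80 = Tim Berners-Lee
```

A query that omits the SERVICE block would, after the split, simply find no article items in the main graph; that is the reduced default accessibility the split trades for stability.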

I support this as a WikiCite editor, because WikiCite is consuming considerable resources, and the split preserves the content, albeit by reducing its accessibility. This split could also be the start of dedicated support for Wikimedia citation data products.

I am wary of the split, because it only buys about three more years to look for another solution, and we have already been seeking one since 2018. The complete scholarly citation corpus of ~300 million citations is not a large dataset by contemporary standards, yet our Blazegraph backend strains to hold the 40 million we have right now. Even after a split, Wikidata will fill with content again. Fear of the split has been slowing and deterring Wikidata content creation for years, and we do not have long-term plans for splitting and federating Wikibase instances repeatedly.

The split will create a WikiCite graph separate from the main Wikidata graph. The main Wikidata graph will retain content of broader interest, including items for authors, journals, publishers, and anything with a page in a Wikimedia project.

This challenge does not have an obvious solution. I have tried to identify experts who could describe the barriers at d:Wikidata:WikiProject Limits of Wikidata, but have not been able to do so. I asked whether Wikidata could usefully expand its capacity with US$10 million of development, and got uncertainty in return. I have no request of the Wikimedia community members who read this, except to remain aware of how technical development decisions determine the content we can host, the partnerships we can make, and the editors we can attract. The Wikimedia Foundation team managing the split has documentation which invites comment. Visit, and ponder the extent to which it is possible to discuss the scope of Wikidata.

That is the summary, and all that casual readers may wish to know. For more details, read on!

I am writing this article as an opinion or personal statement. I am unable to present this as fact-checked investigative journalism, and am presenting this from my own perspective as a long-term WikiCite contributor who has incorporated this project into many sponsored projects in my role as Wikimedian in residence at the School of Data Science at the University of Virginia. I have a professional stake in this content, and wish to be able to anticipate its future level of stability.

Why split WikiCite from Wikidata?

About 1/3 of Wikidata items are WikiCite content

Wikipedia is a prose encyclopedia established in 2001. Over the years, the Wikipedia community spun off parts of Wikipedia into Wikimedia sister projects, one of which was Wikidata, established in 2012. Wikidata is designed so that, to the casual observer, it seems to work magic in addressing multiple billion-dollar challenges facing humanity. Soon after its establishment, though, its infrastructure hit technical limitations. Those limits prevent current (and especially prospective) projects from importing more content into Wikidata.

The only large Wikidata project to continue limited development was WikiCite, an effort to index all scholarly publications. WikiCite grew, and is currently a third of the content on Wikidata. Users access WikiCite content through multiple tools; the tool I watch is Scholia, a scholarly profiling service which serves this content 100,000 times a day. The point of Wikimedia projects is to be popular and present content that people want, and there is agreement that WikiCite is a worthwhile project. But because Wikidata is overstuffed with content, use of the Wikidata Query Service strains Wikidata's computational resources and causes downtime. Splitting off WikiCite is one way to manage those costs and that strain.

Scholia gets 100,000 views a day. Many apparent outages since 2022 are actually outages of the toolforge:Toolviews tool that records these statistics, an unrelated issue.

The problem is that Wikidata is facing an existential crisis, having reached many of the limits reported at WikiProject Limits of Wikidata. Users must be able to access Wikidata content through database queries, and the amount of content in Wikidata is now large enough that more queries are failing, and more frequently. The short-term solution happening right now is the first Wikidata graph split, which will separate the WikiCite dataset from the main Wikidata graph. This is not a long-term solution, because Wikidata will fill up with data again. If users had their way, Wikidata would expand to index all public information on people, maps, the Sum of All Paintings, video games, climate, sports, civics, species, and every other concept which is familiar to Wikipedia editors and which could be further described with small – meaning not big data – general reference datasets.

In 1873 optical engineer Ernst Abbe discovered a formula describing the optical limit of microscopes. Limits can be surpassed when understood. Describe Wikidata's limits at d:Wikidata:WikiProject Limits of Wikidata

Here is a timeline of the discussions:

Here are some questions to explore. Ideally, answers to these questions could be comprehensible to Wikimedia community members, technology journalists, and computer scientists. I do not believe that published attempts at answering these questions for those audiences exist.

  1. If we could predict Wikidata's future capacity, then editors could strategically plan to acquire data at the rate of growth. Will Wikidata's capacity in three years be greater than, or the same as, its current capacity?
  2. WikiCite hit many upload limits in 2018. In the 6 years since, we have not identified a solution. What could we have done differently to develop appropriate discussion at the time the problem was identified?
  3. Suppose that the Wikimedia community developed a successful product – like WikiCite and Scholia – which also came with expenses. How can the Wikimedia community assess the value of such things and determine what support is appropriate?

Scholarly profiling

Google Scholar is a popular scholarly profiling service, but it is also non-free, which is why Wikipedia cannot post screenshots of its outputs.

Scholarly profiling is the process of summarizing scholarly metadata from publications, researchers, institutions, research resources including software and datasets, and grants to give the user enough information to gain useful insights and to tell accurate stories about a topic. For example, a scholarly profile of a researcher would identify the topics they research, their social network of co-authors, history of institutional relationships, and the tools they use to do their research. Such data could be rearranged to make any of these elements the subject of a profile, so for example, a profile of a university would identify its researchers and what they study; a profile of software would identify who uses it and for what work; and a profile of a funder would tell what impact their investments make.

The easiest way to understand scholarly profiling is to use and experience popular scholarly profiling services.

Google Scholar is the most popular service and is a free Google product. It presents a search engine results page based on topics and authors. Scopus is the Elsevier product and Web of Science is the Clarivate product. Many universities in Western countries pay for subscriptions to these, with typical subscription costs being US$100,000-200,000 per year.

A search for "influenza" in Internet Archive Scholar suggests papers from 1971, 2007, and 1890. Semantic Scholar claims copyright of the website, and OpenAlex is ambiguous with licensing.

Free and nonprofit comparable products include Semantic Scholar developed by the Allen Institute for AI, OpenAlex developed by OurResearch, and the scrappy Internet Archive Scholar developed by Wikimedia friend Internet Archive.

Other tools with scholarly profiling features include ResearchGate, which is a commercial scientific social networking platform, and ORCID, which compiles bibliographies of researchers.

OpenAlex, Semantic Scholar, and Internet Archive Scholar designate their data as openly licensed and allow export, but all of them have ambiguous open licensing terms for elements of their platforms. Google Scholar, Scopus, and Web of Science slurp up data that they find and encourage crowdsourced upload of data, but their terms of use do not allow others to export it as open data. It has been a recurring thought that WikiCite and Scholia could meet institutional needs at a fraction of the Scopus and Web of Science subscription costs. ORCID also encourages data upload, and entire universities do this, but only for living people, and the data is only public with the consent of the individual profiled.

Statements such as the Barcelona Declaration on Open Research Information seek to gather a collaboration that could manifest an ideal profiling platform: one which would be open data, be exportable, allow crowdsourced curation, encourage public community discussion of the many social and ethical issues which arise from presenting such a platform, and of course be sustainable as a tool that uses computing resources. Scholia is all of these things, except that it has hit technical limits.

WikiCite and Scholia

WikiCite is a collection of scholarly metadata in Wikidata, the WikiProject to curate that data, and the name of the Wikimedia community who engage in that curation. Scholia is a tool on Toolforge which generates scholarly profiles by combining WikiCite and general Wikidata content into a reader-friendly format. Scholia is preloaded with about 400 Wikidata queries, so instead of every new user needing to learn to write queries, they can use the Scholia interface to run queries answering common questions in academic literature research.
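The preloaded-query pattern amounts to substituting an item identifier into a canned SPARQL template. A minimal sketch of that pattern follows; the template is illustrative, not one of Scholia's actual ~400 queries.

```python
# Sketch of a Scholia-style query template: a canned SPARQL query with a
# placeholder that is filled in with the Wikidata item the user is viewing.
# The template below is illustrative, not taken from Scholia's source code.

TOPIC_PUBLICATIONS_TEMPLATE = """
SELECT ?work ?workLabel WHERE {{
  ?work wdt:P921 wd:{qid} .   # P921 = main subject
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT 100
"""

def render_query(qid: str) -> str:
    """Fill the template with a concrete item ID, as a profile page
    would when a user visits a topic."""
    if not qid.startswith("Q"):
        raise ValueError(f"not a Wikidata item ID: {qid}")
    return TOPIC_PUBLICATIONS_TEMPLATE.format(qid=qid)

query = render_query("Q2013")  # Q2013 = Wikidata
```

The rendered query can then be posted to the Wikidata Query Service endpoint; the user never has to read or write SPARQL themselves.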

Anyone can use WikiCite content for applications on or off wiki. Histropedia emphasizes timelines, and uses WikiCite content to visualize research development over time.

WikiCite is the single most popular project in Wikidata in terms of amount of content, number of participants, depth of participant engagement, count of institutional collaborations, and donation of in-kind labor from paid staff subject-matter experts contributing to the project. In terms of content, WikiCite accounts for about 40 million of Wikidata's 110 million items. Because it is openly licensed, many other applications ingest this content, including the other scholarly profiling services, but also free and open services such as Histropedia. Four WikiCite conferences have each convened 100 participants. WikiCite presentations have been part of many other Wikimedia conferences for some years. The largest WikiCite project in terms of participants was the WikiProject Program for Cooperative Cataloging, which recruited staff at about 50 schools to make substantial WikiCite contributions about their own research output. While the Wikimedia Foundation invests in outreach, projects like this sit outside that investment, yet attract investors, new editors, and institutional partnerships.

The promise of WikiCite is to collect research metadata, confirm its openness, then enrich it with further metadata including topic tagging and deconstruction of the source material to note use of research resources, such as software, datasets, protocols, or anything else which could be reusable. Scholia presents all this content. Example Scholia applications are shown here, with links to the queries and pages which present such results.

What next?

The Wikidata Query Service is failing more often. 99% of the time it works, but a 1% failure rate for a tool central to accessing Wikidata is an emergency to address immediately. To ensure continued access to Wikidata content, the Foundation has responded with a recently refined plan, announced here, incorporating everyone's best ideas for what to do.

It is challenging to coordinate the Wikipedia and Wikimedia community. The above-mentioned Barcelona Declaration asks organizations to commit to "make openness the default", "enable open research information", "support the sustainability of infrastructures for open research information", and "support collective action to accelerate the transition to openness of research information". These are all aims of WikiCite, Scholia, and Wikimedia more broadly, but in my view Wikimedia projects have been too independent to join such public consortia. If we could reach community consensus to join such programs, then I think experts in those groups could advise us on technical needs, and funders would consider sponsoring our proposals to develop our technical infrastructure. If the Wikimedia Movement had money, then, based on my incomplete understanding of the limit problems, I would recommend investing in Wikidata now so that we can better recruit expert partnerships and contributors. Since we lack money, the best idea I have is to find the world's best experts on comparable problems and explore options for collaborating with them. I wish that "Wikipedia" could sign the Barcelona Declaration or a similar effort, and get more outside support.