Wikipedia:Wikipedia Signpost/2017-02-06/Technology report
Better PDFs, backup plans, and birthday wishes
Better PDFs are coming
A new way to export pages to PDF files has been developed. The current method of creating PDFs uses the Offline Content Generator (OCG) service, which can be quite problematic for many articles: tables, including infoboxes, are omitted entirely.
There have been multiple requests for table support since the OCG was introduced in 2014. The issue was also raised in 2015 as part of that year's Community Wishlist Survey and the German community's technical wishlist. Since then, the German Wikimedia chapter (WMDE) has been leading the effort to improve table rendering in PDFs. It was discussed at the 2016 Wikimania Hackathon, where a solution was proposed: offer an alternative PDF download that replicates the look of the website, using browser-based rendering instead of the OCG's LaTeX-based rendering.
Image comparison: the current (OCG) PDF rendering versus the new browser-based PDF rendering.
The new PDF creator uses the Electron service to render pages, using the Chromium web browser as a back end. When enabled on a wiki that already uses the OCG service, clicking "Download as PDF" in the side menu displays a choice of which service to use. The Electron service was enabled by default on Meta and the German Wikipedia last week, and is planned to be deployed to more wikis later.
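To give a rough idea of how the browser-based approach works, here is a minimal sketch, not the actual Electron service code: a hidden Chromium window, driven through Electron, loads the page and prints it using Chromium's built-in print-to-PDF pipeline. The article URL and options below are purely illustrative.

// Minimal sketch of browser-based PDF rendering with Electron (illustrative only).
const { app, BrowserWindow } = require( 'electron' );
const fs = require( 'fs' );

app.on( 'ready', function () {
	const win = new BrowserWindow( { show: false } ); // hidden Chromium window
	win.loadURL( 'https://en.wikipedia.org/wiki/Portable_Document_Format' );
	win.webContents.on( 'did-finish-load', function () {
		// printBackground keeps table shading and other CSS backgrounds
		win.webContents.printToPDF( { printBackground: true }, function ( error, data ) {
			if ( error ) { throw error; }
			fs.writeFile( 'article.pdf', data, function () { app.quit(); } );
		} );
	} );
} );

Because the rendering happens in a real browser engine, tables and infoboxes come out looking much the same as they do on screen.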
A community consultation is open on MediaWiki.org regarding the future of PDF rendering. It is proposed to retire OCG by August this year, once its "core" features are available with the Electron service. One such feature is the book creator, which collates multiple articles into a single PDF via the Collection extension. However, there are no plans to provide a two-column option, nor to support conversion to plain text or other file formats.
Backing up Wikimedia
Concerns were raised earlier this week on the wikimedia-l mailing list about the "back-up plan" for Wikimedia.
The most well-known backups are the data dumps of MediaWiki content. Operations Engineer Ariel Glenn, who focuses on the dumps, doesn't consider them a form of backup, though: the dumps contain only public data that is viewable by all, and are run just twice a month.
Glenn further explained that the dumps are currently stored on two servers in the Virginia datacenter, and the most recent ones are also on a third server. They are also mirrored by other organizations, placing copies in California, Illinois, Sweden, and Brazil.
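The dumps themselves are ordinary web downloads, so anyone can fetch them from dumps.wikimedia.org or from one of the mirrors. As a rough sketch rather than an official tool, the following Node.js snippet streams one English Wikipedia dump file to disk; the exact files on offer vary by wiki and by dump run.

// Rough sketch: stream one public dump file to disk with Node.js.
// Swap the host for a mirror if you prefer; the filename is illustrative.
const https = require( 'https' );
const fs = require( 'fs' );

const url = 'https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2';
const file = fs.createWriteStream( 'enwiki-latest-pages-articles.xml.bz2' );

https.get( url, function ( response ) {
	response.pipe( file );
	file.on( 'finish', function () {
		console.log( 'Download finished' );
	} );
} ).on( 'error', function ( err ) {
	console.error( 'Download failed:', err.message );
} );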
Glenn noted that there are no dumps of images currently. Operations Engineer Filippo Giunchedi said, "We're looking at 120 terabytes of original [files] today." Giunchedi added that files are stored in both the Virginia datacenter and one of the Texas datacenters, so there is some redundancy.
The databases themselves have a high level of redundancy, according to Database Administrator Jaime Crespo. The servers use RAID 10, and there are about 20 active database replicas with the same content across the Virginia and Texas datacenters that can be cloned if one server goes down. For cases of accidental data loss, each datacenter also has one server whose replication is delayed by 24 hours.
As for actual backups, Wikimedia uses Bacula as its backup software.
"As far as content goes, we do perform weekly database dumps and store them in an encrypted format in order to provide a pretty good guarantee we will avoid data leak issues via the backups," Operations Engineer Alexandros Kosiaris said. "We've had no such issues yet, but better safe than sorry."
The backups are stored in the Virginia and Texas datacenters, and are deleted after about 45–50 days for privacy policy compliance, Kosiaris explained.
As for improvements, Glenn has been looking for new mirrors for the dumps. Crespo noted that work on selecting a location for a new Asia datacenter is in progress, including discussions with the legal team.
Ten years of Twinkling
The popular Twinkle tool (available as a gadget in Special:Preferences) celebrated its tenth birthday on January 21. Originally started as the rollback script "Twinklefluff" by AzaToth, it now automates or simplifies a plethora of common maintenance tasks, including responding to vandalism, tagging articles, welcoming new users, and admin duties. It is likely that millions of edits have been made using Twinkle over the past decade. Thank you to everyone who has made Twinkle possible; your efforts are very much appreciated!
In brief
New user scripts to customise your Wikipedia experience
- Megawatch[1] (source) by User:NKohli (WMF) – Watch or unwatch all pages in a category (for large categories, restricted to the top 50 pages only); a rough sketch of the underlying API calls follows this list.
- Watchlist-openUnread[2] (source) by User:Evad37 – Open multiple unread watchlist pages with a single button. Various options can be set; see the documentation.
- References Consolidator[3] (source) by User:Cumbril – Converts all references in an article to list-defined format.
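As a hedged illustration of the kind of API calls a script like Megawatch has to make (this is not its actual source, which is linked above), a user script can ask the MediaWiki API for a category's members and then add them to the watchlist. The category name below is just an example, and the snippet assumes the standard mediawiki.api modules are loaded.

// Rough sketch: watch the first 50 pages in a category via the MediaWiki API.
var api = new mw.Api();
api.get( {
	action: 'query',
	list: 'categorymembers',
	cmtitle: 'Category:Wikipedia information pages', // example category
	cmlimit: 50 // mirrors the 50-page cap for large categories
} ).done( function ( data ) {
	var titles = data.query.categorymembers.map( function ( member ) {
		return member.title;
	} );
	// mw.Api's watch() helper posts the request with the right token
	api.watch( titles ).done( function () {
		mw.notify( 'Now watching ' + titles.length + ' pages from the category.' );
	} );
} );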
Newly approved bot tasks
- MusikBot (task 10) – Move BLPs created by Sander.v.Ginkel to the draftspace.
- Ramaksoud2000Bot (task 2) – Tag Wikipedia files that shadow a Commons file with {{ShadowsCommons}}.
- Dexbot (task 11) – Clean up incorrect section names.
- PrimeBOT (task 9) – Replace a template being deleted with a "See also" link.
- BU RoBOT (task 31) – Replace hyphens with en dashes in the relevant years parameters of {{Infobox football biography}}, per MOS:DASH.
- TheMagikBOT (task 2) – Add the {{pp}} template to protected pages that do not have it.
- JJMC89 bot (task 8) – Replace {{Don't edit this line {{{machine code|}}}|{{{1}}} with {{Don't edit this line {{{machine code|}}} in taxonomy templates.
- DatBot (task 5) – Replace deprecated WikiProject Chinese-language entertainment template.
- EnterpriseyBot (task 10) – Comment out the class parameter in WikiProject banners on the talk pages of redirects.
- JJMC89 bot II (approval) – Deploy {{Wikipedia information pages talk page editnotice}} as an editnotice for talk pages of Wikipedia and Help pages in Category:Wikipedia information pages.
- Bender the Bot (task 7) – Replace http:// with https:// for the New York Times domain (sketched below).
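To give a sense of what the New York Times task involves at its core (this is not the bot's actual code, and a production bot also has to handle edge cases, rate limits and approval conditions), the substitution can be sketched as a single regular expression over the wikitext:

// Rough sketch of an http to https substitution for one domain in wikitext.
// The domain pattern is illustrative; the real bot is far more careful.
function upgradeNytLinks( wikitext ) {
	return wikitext.replace(
		/http:\/\/((?:www\.)?nytimes\.com)/g,
		'https://$1'
	);
}

// upgradeNytLinks( 'See http://www.nytimes.com/2017/02/06/example.html' )
// returns 'See https://www.nytimes.com/2017/02/06/example.html'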
Latest tech news from the Wikimedia technical community: 2017 #3, #4 & #5. Please tell other users about these changes. Not all changes will affect you. Translations are available on Meta.
- Recent changes
- You can now upload WebP files to Commons. (Phabricator task T27397)
- There is a new magic word called {{PAGELANGUAGE}}. It returns the language of the page you are on. This can be used on wikis with more than one language to make things easier for translators; a JavaScript counterpart for gadget authors is sketched after this list. (Phabricator task T59603)
- When an admin blocks a user or deletes or protects a page, they give a reason why. They can now get suggestions as they write. The suggestions are based on the messages in the dropdown menu. (Phabricator task T34950)
- You can now use <chem> to write chemical formulas. Previously you could use <ce>; <ce> should be replaced by <chem>. (Phabricator task T153606)
- You can now add exceptions for categories which shouldn't be shown on Special:UncategorizedCategories. The list is at MediaWiki:Uncategorized-categories-exceptionlist. (Phabricator task T126117)
- The "Columns" and "Rows" settings have been removed from the Editing tab in Preferences. If you wish to keep what the "Rows" setting did you can add this code to your personal CSS:
#wpTextbox1 { height: 50em; }
You can change the number50
to make it look like you want to. (Phabricator task T26430) - Sometimes edits in MediaWiki by mistake are shown coming from private IP addresses such as 127.0.0.1. Edits and other contributions logged to these IP addresses will be blocked and shown the reason from MediaWiki:Softblockrangesreason. This should not affect most users. Bots and other tools running on Wikimedia Labs, including Tool Labs will receive a "blocked" error if they try to edit without being logged in. (Phabricator task T154698)
- When you edit with the visual editor, categories will be at the top of the page options menu. (Phabricator task T74399)
- You can see a list of the templates on a page you edit with the visual editor. (Phabricator task T149009)
- The OAuth management interfaces now look slightly different. (Phabricator task T96154)
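A related aside for gadget authors, not part of the Tech News announcement itself: the value that the new {{PAGELANGUAGE}} magic word exposes to wikitext is also available to JavaScript through mw.config, so a minimal (illustrative) script could read it like this:

// Minimal illustration: read the page's content language in a user script.
var pageLanguage = mw.config.get( 'wgPageContentLanguage' );
mw.notify( 'This page is written in: ' + pageLanguage );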
- Problems
- video2commons was down for two weeks. This was because of a problem with Commons video transcoders. It is now back up. (Phabricator task T153488)
- Future changes
- The Community Tech team will develop more tools to handle harassment of Wikimedia editors. The goal is to give the communities better tools to find, report and evaluate harassment. They will also work on more effective blocking tools. (Meta page, Wikimedia email list)
- The Wikimedia technical community is doing a Developer Wishlist survey. The call for wishes closed on 31 January, but discussions are continuing, and the voting phase will be open between 6 and 14 February.
Installation code
- ^ Copy the following code, click here, then paste:
importScript( 'User:NKohli (WMF)/megawatch.js' ); // Backlink: User:NKohli (WMF)/megawatch.js
- ^ Copy the following code, click here, then paste:
importScript( 'User:Evad37/Watchlist-openUnread.js' ); // Backlink: User:Evad37/Watchlist-openUnread.js
- ^ Copy the following code, click here, then paste:
importScript( 'User:Cumbril/RefConsolidate_start.js' ); // Backlink: User:Cumbril/RefConsolidate_start.js
Discuss this story
A couple of clarifications; it was probably my fault for not expressing them clearly when I was asked. There are about 20 English Wikipedia core MediaWiki replicas (the number is not fixed; newer servers are continuously being added/upgraded and others decommissioned). There are around 130 core db servers in total for all projects serving wiki traffic, to maximize high availability and performance, and their topology can be seen at: http://dbtree.wikimedia.org/ Some auxiliary (non-core) servers are hidden for clarity. Should a meteorite hit the west coast of the US, we could have all wiki projects running on the secondary datacenter in 30 minutes (?) – we are trying to get faster and better there. http://blog.wikimedia.org/2016/04/11/wikimedia-failover-test/ https://www.mediawiki.org/wiki/Wikimedia_Engineering/2016-17_Q3_Goals#Technical_Operations
Also, there are 2 (not 1) db servers delayed by 24 hours, one per main datacenter; one just happens to be temporarily (for a few weeks) under maintenance and is up but not "delayed" after a hardware renewal (redundancy helps not only as a recovery method, but also for easier maintenance and less user impact).
In general, backups are something one never stops working on – there is always room for faster backups, faster recovery, more backups, better verification, more redundancy, etc.
--jynus (talk) 14:55, 6 February 2017 (UTC)[reply]
PDF rendering
I'm really excited to learn that better PDF rendering is on the way -- this will be enormously helpful to many projects. I'm curious, will the rendering respect little customizations, e.g. whether one has chosen to show or hide the Table of Contents or collapsed text, or the sort order chosen in sortable tables? Also, is there any related progress on ODT or ePub output? -Pete Forsyth (talk) 21:41, 6 February 2017 (UTC)[reply]
- It will render exactly the same as a printout of the page would look if you were an anonymous user (basically, it works just like "Print to PDF" in any modern OS's print dialog). There is no progress on ODT or ePub output (as a matter of fact, it could be argued that we will be further from such a solution, by choosing maintainable simplicity over unmaintainable complexity). —TheDJ (talk • contribs) 12:55, 7 February 2017 (UTC)[reply]
- TheDJ, currently, "print to PDF" does respect whether or not the TOC is expanded. But if the browser engine doing the rendering is on the server side, will that still be the case? -Pete Forsyth (talk) 19:10, 7 February 2017 (UTC)[reply]
- No, the rendering is serverside, so it has no idea about the context that your browser keeps. —TheDJ (talk • contribs) 20:07, 7 February 2017 (UTC)[reply]
- OK. As I suspected...and unfortunate, but difficult to change, I'd imagine. Thanks for the clarification! -Pete Forsyth (talk) 20:08, 7 February 2017 (UTC)[reply]
- Glad to see the ability to have proper tables in PDFs is now likely. Quite a lot of my work in recent years has been tables and lists, and it's been a real pain not to be able to render them. It means WP readers can't easily access them for study off the web. I have had to place the texts in my word processor and format them there for my private use. Apwoolrich (talk) 11:16, 8 February 2017 (UTC)[reply]
I too am looking forward to seeing the PDF function restored to its former glory! Can't wait for books to be a thing again! Headbomb {talk / contribs / physics / books} 17:17, 13 February 2017 (UTC)[reply]