Wikipedia:Bots/Noticeboard/Attribution bot proposal

This is a proposal for an Attribution bot or other automatic or semi-automatic procedure to accompany the discussion at Wikipedia:Bots/Noticeboard#Copy attribution bot question or proposal. Its goal is to remedy missing copy or translation attribution in numerous articles, by adding the attribution to the edit summary after the fact, working off a list of article names and other metadata.

Context and background

Editors are welcome to copy or translate material from other Wikipedias (or wikis with compatible licenses) as long as they comply with our licensing requirements, which require that credit be given in the history to the authors of the original content.[a] The suggested wording to use in the edit summary is given in the editing guideline WP:Copying within Wikipedia, at WP:TFOLWP. When an editor is unaware of the requirement, or forgets or is unwilling to add it, the required attribution can still be added after the fact, as described at WP:CWW#Repairing insufficient attribution.

Standard text for the edit summary when a user individually repairs missing attribution looks like this:

For copied material:

NOTE: The previous edit as of 22:31, October 14, 2015, copied content from the Wikipedia page at [[Exact name of page copied from]]; see its history for attribution.

For translated material:

NOTE: Content in the edit of 01:25, January 25, 2023 was translated from the existing French Wikipedia article at [[:fr:Exact name of French article]]; see its history for attribution.

The automated procedure uses a modified version of this text: it links the NOTE, adds the username of the user who made the edit,[b] and substitutes data line and runtime parameters into the attribution statement as needed. The modified translation example might look like this:

[[WP:RIA|NOTE]]: Content in the edit of 01:25, January 25, 2023 was translated by [[Special:Contributions/User:Example1|Example1]] from the existing French Wikipedia article at [[:fr:Exact name of French article]]; see its history for attribution. (added by [[WP:AttriBot|AttriBot]])

Task

The bot's task would be to add the proper retroactive attribution wording to the article history, fulfilling the licensing requirement. Input to the bot would be a page containing a list of articles, where each article would be accompanied by parameters identifying the source of the copy or translation, and the timestamp of the edit which added unattributed content from a Wikimedia sister project. An additional optional parameter would be available for customizing the edit summary of the attribution edit. A few run-time parameters would be available to avoid the need for constant repetition in the input file. The output would be a dummy edit to each article, along with an edit summary using the wording given at WP:RIA, substituting in the correct wording per the parameters. An output log itemizing the activity would be optional, as would a dry run that outputs attribution statements to a log without changing any articles.

Let's provisionally call it 'AttriBot', because it seems like a natural choice.

Input file

The input file consists of a required section header, followed by multiple data lines, each one describing one edit requiring repair of missing copy or translate attribution. Users may add comment lines that are ignored by the procedure.

Section header

Each input file should have a level two (H2) section header identifying the username of the user, and the type of attribution required, which may be either copy or translate. The two fields are delimited by a semicolon separator character. Examples:

== Jimbo; translate ==
== Example1; copy ==

For users without special permissions, there should be exactly one Section header on the page, as shown. All of the edits identified in the § Data lines that follow must have been performed by the given user, and all must involve text either copied or translated from another Wikipedia or sister project, according to the second token in the header. In addition, the Username token should match the {{ROOTPAGENAME}} of the submitted file (that is, a user should submit files only from their own userspace).[c]

For admins, there may be any number of Section headers on the page, and they may specify different users and/or different edit types (i.e., copy or translate). In addition, there is no restriction on the username token matching {{ROOTPAGENAME}}; admins may, at their discretion, submit one input file from their own userspace with three section headers identifying different users, or three input files from the user subspace of three different users, each with one section header.
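
The header rules above could be checked along the following lines. This is only a sketch, not the procedure's actual code: the names HEADER_RE, parse_section_header and header_allowed are hypothetical, and it assumes usernames contain no semicolon or equals sign and that the username is compared to {{ROOTPAGENAME}} as an exact string.

```python
import re

# Hypothetical pattern for a level-two header of the form "== Username; type ==".
HEADER_RE = re.compile(
    r"^==\s*(?P<user>[^;=]+?)\s*;\s*(?P<type>copy|translate)\s*==\s*$"
)


def parse_section_header(line):
    """Return (username, edit_type), or None if the line is not a valid header."""
    match = HEADER_RE.match(line)
    if match is None:
        return None
    return match.group("user"), match.group("type")


def header_allowed(username, root_page_name, is_admin):
    """Non-admins may only name the user whose userspace holds the input file."""
    return is_admin or username == root_page_name
```

For example, parse_section_header("== Jimbo; translate ==") yields the pair ("Jimbo", "translate"), while a header naming any type other than copy or translate is rejected.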

Data lines

Each data line contains data identifying one article at Wikipedia requiring retroactive attribution, the foreign Wikipedia source of the copy or translate operation, and the timestamp of the edit. Each data line is in SSV (semicolon-separated variable) format (see § Choice of delimiter), containing three required fields and one optional field:

* [[ArticleTitle]]; SourceTitle; Timestamp; Comment

where:

  • * – leading asterisk (or colon) in column one to stop wrapping when user views their file (optional blank(s) after it)
  • [[ArticleTitle]] – title of the page at en-wiki containing unattributed text copied or translated from a foreign Wikipedia or compatible wiki (required; linked; not a redirect)
  • SourceTitle – title of the source page; may contain a prefix with optional leading colon, a WP code, and another colon; e.g. :de:Schutzstaffel. (required; unlinked; not a redirect)
  • Timestamp – a string representing a timestamp as shown in the revision history (required) e.g., 02:45, 8 February 2021
  • Comment – a user-given string to be appended to the bot-generated summary for this line (optional).

Each data line must represent a single edit that lacks required attribution. All data lines must correspond to a single user, whose username is given in the § Section header above the data lines (but see § Runtime params).
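
As an illustration of the data line format described above, a parser might look like the sketch below. The names DataLine and parse_data_line are hypothetical, and the sketch follows this section's definition of three required fields plus an optional comment; it does not check that titles exist or that they are not redirects.

```python
from typing import NamedTuple, Optional


class DataLine(NamedTuple):
    article_title: str
    source_title: str
    timestamp: str
    comment: str


def parse_data_line(line: str) -> Optional[DataLine]:
    """Split one SSV data line into its fields; return None if it is malformed."""
    text = line.strip()
    # Drop the optional leading asterisk or colon and any blanks after it.
    if text[:1] in ("*", ":"):
        text = text[1:].lstrip()
    # Split at most three times so that semicolons inside the comment survive.
    fields = [field.strip() for field in text.split(";", 3)]
    if len(fields) < 3 or not all(fields[:3]):
        return None
    article = fields[0]
    # The article title is expected to be linked; strip the brackets if present.
    if article.startswith("[[") and article.endswith("]]"):
        article = article[2:-2].strip()
    comment = fields[3] if len(fields) > 3 else ""
    return DataLine(article, fields[1], fields[2], comment)
```

Splitting with a maximum of three splits is a design choice of this sketch: it keeps a comment that itself contains semicolons in one piece.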

Comment lines

Within the input file, there is no formal definition of a "Comment line", as there is in some programming languages. By appropriate use of inclusion control tags, the input file may contain lines that are effectively comments. Surround any material that is not part of an input file data line with paired <noinclude>...</noinclude> tags.

An example:

== Jimbo; translate ==
<noinclude> My German translations:</noinclude>
[[Article1]]; :de:German Article1; timestamp1
[[Article2]]; :de:German Article2; timestamp2
[[Article3]]; :de:German Article3; timestamp3
<noinclude> My French translations:</noinclude>
[[Article A]]; :fr:French articleA; timestampA
[[Article B]]; :fr:French articleB; timestampB

Lines within paired noinclude tags are skipped by the bot, and therefore do not appear in § debug or log output; they are strictly for the convenience of the creator of the input file.

Note: Blank lines are seen by the bot and output to the log.
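
One way to skip noinclude material before parsing is sketched here. It is an assumption-laden illustration rather than the bot's method: the function name strip_comments is hypothetical, nested noinclude tags are not handled, and a line that consists only of a noinclude span is dropped entirely, while blank lines elsewhere are kept.

```python
import re

# A noinclude span that occupies whole lines; the line break after it is removed too.
WHOLE_LINE_COMMENT = re.compile(
    r"^[ \t]*<noinclude>(?:(?!</noinclude>).)*</noinclude>[ \t]*(?:\n|\Z)",
    re.DOTALL | re.MULTILINE,
)
# A noinclude span that shares its line with other text.
INLINE_COMMENT = re.compile(
    r"<noinclude>(?:(?!</noinclude>).)*</noinclude>", re.DOTALL
)


def strip_comments(wikitext):
    """Remove <noinclude>...</noinclude> spans, keeping all other lines intact."""
    text = WHOLE_LINE_COMMENT.sub("", wikitext)
    return INLINE_COMMENT.sub("", text)
```

Applied to the example above, the two "My ... translations:" lines disappear and the header and data lines remain in their original order.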

Special considerations

Right-to-left scripts

Pay attention when including a SourceTitle (param 2) that is from a language with a right-to-left script such as Hebrew or Arabic. These may benefit from a trailing left-to-right mark character (HTML entity &lrm;) to prevent the next parameter in the input line from being garbled in page view mode. It is not required for proper functioning of the attribution repair procedure, just for human readers.
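
A helper for preparing input files could add the mark automatically. The sketch below is only an illustration: the function names are hypothetical, and it appends the literal U+200E character rather than the &lrm; entity, using the Unicode bidirectional class of the last strongly directional character to decide.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK, the character behind the &lrm; entity


def ends_with_rtl(text):
    """True if the last strongly directional character is right-to-left."""
    for char in reversed(text):
        bidi_class = unicodedata.bidirectional(char)
        if bidi_class in ("R", "AL"):  # Hebrew-type or Arabic-type letter
            return True
        if bidi_class == "L":  # left-to-right letter, or an LRM already present
            return False
    return False


def with_lrm(source_title):
    """Append a left-to-right mark when the title ends in right-to-left text."""
    return source_title + LRM if ends_with_rtl(source_title) else source_title
```

Because the mark itself counts as a left-to-right character, applying with_lrm twice does not add a second mark.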

Location

Normally, the input file should be a subpage in your user space. See § User requests and administration for details.

Runtime params

  • user – specifies the username for the edit summary, e.g. 'earlier edits by USER were copied/translated...'; if missing, taken from the username part of the input filename, if found in Userspace, otherwise error: missing user. A run cannot proceed without a single identified user.
  • type – one of copy or translate; overridden by param type in the input line
  • log – when = y, copies lines from the input file to the specified log, and appends the RIA edit summary line; optional; default=y; set to n to turn off logging.
  • debug – when = y, just produces the log, but doesn't edit any pages
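
The fallback rules for these parameters might be resolved as in the following sketch. The function names are hypothetical; it assumes the input page name is given with its "User:" namespace prefix, and that debug is off unless requested, which the list above implies but does not state.

```python
def resolve_user(user_param, input_page):
    """Pick the username from the runtime parameter or the input page name."""
    if user_param:
        return user_param
    if input_page.startswith("User:"):
        # "User:Myuser/Attrib set one" -> "Myuser"
        root = input_page[len("User:"):].split("/", 1)[0]
        if root:
            return root
    raise ValueError("missing user")


def flag_enabled(value, default):
    """Interpret a y/n runtime flag, falling back to its default when unset."""
    if value is None:
        return default
    return value.strip().lower() == "y"


def resolve_flags(log_param=None, debug_param=None):
    """Logging defaults to on; debug is assumed here to default to off."""
    return {
        "log": flag_enabled(log_param, default=True),
        "debug": flag_enabled(debug_param, default=False),
    }
```

With no user parameter and an input file at User:Myuser/Attrib set one, resolve_user returns "Myuser"; outside userspace it raises the "missing user" error.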

Output format

Generates a RIA-style attribution summary of the form:

[[WP:RIA|NOTE]]: Content in the edit of $TIMESTAMP was $TYPEd by [[Special:Contributions/USER|USER]] from the existing LANGUAGE Wikipedia article at [[$SOURCETITLE]]; see its history for attribution. $USERCOMMENT

Caps in the model RIA edit summary above show substitutable items ('$' indicates a field in the data line):

  • TIMESTAMP is the value from the Input file data line parameter 3. If it is missing, "in the edit of $TIMESTAMP" changes to "in previous edit(s)".
  • LANGUAGE is derived from the WP code in the data line SourceTitle (param 2); omitted if code = en.
  • TYPEd is either copied or translated, and comes either from the runtime parameter type, or the data line parameter Type (param 4).
  • USER is either from the {{ROOTPAGENAME}} of the input file, or from the runtime parameter user.
  • SOURCETITLE is from the Input file data line parameter 2, SourceTitle.
  • USERCOMMENT is an optional free-form user comment, from data line parameter Comment (param 5).

Note: LANGUAGE is derived from the prefix in the SourceTitle (if any), where prefix is a WP code as in the table at List of Wikipedias#Wikipedia editions

The generated edit summary is added to the article whose title is given in data line param 1, ArticleTitle.

Note: If this becomes an approved bot, a linked bot id should be appended to the end of the generated edit summary.
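
The substitutions listed in this section can be sketched as a small function. Everything here is illustrative: build_summary and language_of are hypothetical names, and LANGUAGE_NAMES is a three-entry stand-in for the full table at List of Wikipedias#Wikipedia editions.

```python
# Illustrative subset only; the real mapping is the List of Wikipedias table.
LANGUAGE_NAMES = {"fr": "French", "de": "German", "pt": "Portuguese"}
PAST_TENSE = {"copy": "copied", "translate": "translated"}


def language_of(source_title):
    """Derive the language name from an optional ':xx:' or 'xx:' prefix."""
    parts = source_title.lstrip(":").split(":", 1)
    if len(parts) == 2 and parts[0] in LANGUAGE_NAMES:
        return LANGUAGE_NAMES[parts[0]]
    return ""  # no known prefix (or en): LANGUAGE is omitted


def build_summary(timestamp, edit_type, user, source_title, comment=""):
    """Fill in the RIA-style template from the substitutable items."""
    language = language_of(source_title)
    when = f"the edit of {timestamp}" if timestamp else "previous edit(s)"
    wiki = f"{language} Wikipedia" if language else "Wikipedia"
    summary = (
        f"[[WP:RIA|NOTE]]: Content in {when} was {PAST_TENSE[edit_type]} by "
        f"[[Special:Contributions/{user}|{user}]] from the existing {wiki} "
        f"article at [[{source_title}]]; see its history for attribution."
    )
    return f"{summary} {comment}".rstrip()
```

A source title without a recognized prefix produces "the existing Wikipedia article", matching the rule that LANGUAGE is omitted for English.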

Logging

Each data line is echoed to the log, followed by the edit summary line, indented and preceded by an increasing integer count value, starting at 1 for the first data line. Logging is enabled by default, but may be disabled via a runtime parameter.

The log file is written to subpage /log of the input file given by the user. But at operator discretion, the log file may be created locally to the run location, with the /log page as a redirect to it, or otherwise.

Alternatively, the log may be written to subpage /log/RUNTIMESTAMP, if desired.
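
The log layout described in this section might be produced as follows. This is a sketch with a hypothetical function name; it assumes the caller supplies pairs of an echoed data line and its generated summary, and it uses a leading colon for the indented summary line.

```python
def format_log(entries):
    """Render (data_line, summary) pairs as numbered log text."""
    lines = []
    for count, (data_line, summary) in enumerate(entries, start=1):
        lines.append(f"{count}. {data_line}")  # echoed data line with its count
        lines.append(f": {summary}")  # wiki-style indented summary line
    return "\n".join(lines)
```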

Debugging

If § debug mode is requested or enabled via a runtime parameter, the log is generated, but no articles are modified.

Issues

Choice of delimiter

The input file should be an SSV (semicolon-separated variable) file. CSV (comma-separated variable) format is the usual standard, but the comma is a very common character in titles (especially in place names), whereas the semicolon is not, which makes the semicolon the better field separator. Only approximately 72 article titles contain a semicolon (and most of those are redirects), so a collision with the title of an article needing attribution is very unlikely.

Examples

A run with this sample input file:

== Example_user1; translate ==
* Liberation of France; fr:Élections constituantes françaises de 1945;  00:28, 4 February 2021
* Liberation of France; fr:Assemblée consultative provisoire; 02:45, 8 February 2021
* Liberation of France; Battle of Gabon;  02:50, 3 February 2021; From rev 1003043161.
/* pt */
* Caixa 2; pt:Caixa dois; 22:09, 26 November 2019
* Brazilian criminal justice; :pt:Prisão preventiva; 01:43, 22 April 2024; 
* Brazilian criminal justice; :pt:Direito penal
* Brazilian criminal justice; :pt:Justiça Militar do Brasil; 4:54, 15 April 2024
* Brazilian criminal justice; :pt:Código de Processo Penal brasileiro; 08:55, 8 July 2023;
/* de */
* Anti-gender movement; :de:Anti-Gender-Bewegung; 04:14, 29 August 2021; ; From rev. [[:de:Special:Permalink/214798358#Deutschland|214798358]]
* Weimar Republic; :de:Weimarer_Republik#Frühe Krisenjahre (1919–1923);  01:40, 21 December 2020;
* War guilt question; de:Kriegschuldfrage; 19:10, 25 February 2021;  From rev. 207898075
* War guilt question; Color book; 21:13, 25 February 2021; 

would generate the following output log:

1. Liberation of France; fr:Élections constituantes françaises de 1945;   00:28, 4 February 2021 
: [[WP:RIA|NOTE]]: Content in the previous edit of 00:28, 4 February 2021 by [[Special:Contributions/Example_user1|Example_user1]] was translated from the French Wikipedia article [[fr:Élections constituantes françaises de 1945]]; see that article's history for attribution. 
2. Liberation of France; fr:Assemblée consultative provisoire;  02:45, 8 February 2021
: [[WP:RIA|NOTE]]: Content in the previous edit of 02:45, 8 February 2021 by [[Special:Contributions/Example_user1|Example_user1]] was translated from the French Wikipedia article [[fr:Assemblée consultative provisoire]]; see that article's history for attribution. 
3. Liberation of France; Battle of Gabon;  02:50, 3 February 2021; From rev 1003043161.
: [[WP:RIA|NOTE]]: Content in the previous edit of 02:50, 3 February 2021 by [[Special:Contributions/Example_user1|Example_user1]] was translated from the Wikipedia article [[Battle of Gabon]]; see that article's history for attribution.  From rev 1003043161.
/* pt */
4. Caixa 2; pt:Caixa dois;  22:09, 26 November 2019
: [[WP:RIA|NOTE]]: Content in the previous edit of 22:09, 26 November 2019 by [[Special:Contributions/Example_user1|Example_user1]] was translated from the Portuguese Wikipedia article [[pt:Caixa dois]]; see that article's history for attribution. 
5. Brazilian criminal justice; :pt:Prisão preventiva;  01:43, 22 April 2024;  
: [[WP:RIA|NOTE]]: Content in the previous edit of  01:43, 22 April 2024 by [[Special:Contributions/Example_user1|Example_user1]] was translated from the Portuguese Wikipedia article [[:pt:Prisão preventiva]]; see that article's history for attribution. 
6. Brazilian criminal justice; :pt:Direito penal
: Content in previous edit(s) by [[Special:Contributions/Example_user1|Example_user1]] were translated from the Portuguese Wikipedia article [[:pt:Direito penal]]; see that article's history for attribution. 
7. Brazilian criminal justice; :pt:Justiça Militar do Brasil;  4:54, 15 April 2024
: [[WP:RIA|NOTE]]: Content in the previous edit of 4:54, 15 April 2024 by [[Special:Contributions/Example_user1|Example_user1]] was translated from the Portuguese Wikipedia article [[:pt:Justiça Militar do Brasil]]; see that article's history for attribution. 
8. Brazilian criminal justice; :pt:Código de Processo Penal brasileiro;  08:55, 8 July 2023;  
: [[WP:RIA|NOTE]]: Content in the previous edit of  08:55, 8 July 2023 by [[Special:Contributions/Example_user1|Example_user1]] was translated from the Portuguese Wikipedia article [[:pt:Código de Processo Penal brasileiro]]; see that article's history for attribution. 
/* de */
9. Anti-gender movement; :de:Anti-Gender-Bewegung; 04:14, 29 August 2021; ; From rev. [[:de:Special:Permalink/214798358#Deutschland|214798358]]
: [[WP:RIA|NOTE]]: Content in the previous edit of 04:14, 29 August 2021 by [[Special:Contributions/Example_user1|Example_user1]] was translated from the German Wikipedia article [[:de:Anti-Gender-Bewegung]]; see that article's history for attribution.  From rev. [[:de:Special:Permalink/214798358#Deutschland|214798358]]
10. Weimar Republic; :de:Weimarer_Republik#Frühe Krisenjahre (1919–1923); 01:40, 21 December 2020;   
: [[WP:RIA|NOTE]]: Content in the previous edit of 01:40, 21 December 2020 by [[Special:Contributions/Example_user1|Example_user1]] was translated from the German Wikipedia article [[:de:Weimarer_Republik#Frühe Krisenjahre (1919–1923)]]; see that article's history for attribution. 
11. War guilt question; de:Kriegschuldfrage; 19:10, 25 February 2021;  From rev. 207898075
: [[WP:RIA|NOTE]]: Content in the previous edit of 19:10, 25 February 2021  by [[Special:Contributions/Example_user1|Example_user1]] was translated from the German Wikipedia article [[de:Kriegschuldfrage]]; see that article's history for attribution.  From rev. 207898075
12. War guilt question; Color book; 21:13, 25 February 2021; 
: [[WP:RIA|NOTE]]: Content in the previous edit of 21:13, 25 February 2021  by [[Special:Contributions/Example_user1|Example_user1]] was translated from the Wikipedia article [[Color book]]; see that article's history for attribution. 

Example notes:

  • 1: Simplest case: local article; foreign article; timestamp. Also #2, 4, 5, 7 et al.
  • 3: no foreign prefix in source title, so: "...from the Wikipedia article" (not: "the English Wikipedia article...")
  • 4: the comment line above it is just echoed to log.
  • 6: No timestamp: "previous edit of hh:mm, dd Month yyyy by USER was..." ⟶ "previous edit(s) by USER were"
  • 9: two consecutive semicolons indicate an empty 'type' field in arg4, so output is still the default "translated from" language. The arg5 value is a trailing comment to be echoed to the log.
  • 12: same as #3.

Requests

Single user

Users wishing to have a list of their articles repaired for missing copy or translate attribution should create the list in their own userspace. Suggestion: use a WP:User subpage, like Special:Mypage/Attribution set 1 (or ...2, etc.).

All data lines in the file correspond to edits by a single user, as specified in the § Section header.[c]

Admins and multiple users

Non-admins must create one file for each user. If you are an admin requesting a run for multiple users, note that they may be placed in a single input file using multiple § Section headers to identify each user. An equivalent operation would be to have multiple input files with one section header per user, each file describing edits by a single user.[d] If choosing the multiple files option, the files may be created as a user subpage of each user in turn, or they may all be in admin user space, at your discretion. For example, in the latter case as:[e]

  • User:Admin_user/Attribution requests
    • /Example_user_1
    • /Example_user_2
    • /Example_user_2/Set_2
    • /Three users/Set_1

and so on. When submitting files pertaining to other users, whether in user subspace or in your own user space, please ensure that the § Section header specifies the proper user in each case before submitting your request.

Where to file

Requests for attribution runs should be made at User talk:DreamRimmer, and should include the filename(s) of your input file(s). You can request a § debug run, meaning you will get back a list of all of the edit summaries that would be applied, but no files will actually be changed. If your input file is in User:Myuser/Attrib set one then your log may be copied to (or redirected from) User:Myuser/Attrib set one/log,[f] or elsewhere upon request, or at the discretion of the operator.

Operations

Due diligence

This procedure adds content to the edit summary that is stored in the revision history. Care must be taken that the correct summary is generated and written, as erroneous summaries added to the history cannot be undone and leave a permanent record in the history. An incorrect attribution added to history would have to be followed up by another one to leave a second edit summary in the history, negating or correcting the first one.[g] All users planning to submit an § Input file to generate automated attributions are strongly encouraged to perform § Dry runs first, and to carefully examine the returned log file before resubmitting the input file for a live run. First-time users are encouraged to do small runs at first, until familiar with the process.

Semi-automated operation

For the time being, runs are semi-automated and performed by an individual. The procedures and suggestions below are likely to change as the process matures.

Dry runs

Users may submit an § Input file requesting a dry run (also known as debug mode) instead of a live run. In debug mode, the user submits the same § Input file and gets back a link to a log file consisting of a line-numbered copy of the input file,[h] where each echoed and enumerated data line is followed by the edit summary that would have been generated, had this been a live run. This may be used for vetting the Input file before submitting it.

Error handling

Invalid input line

All lines are echoed to the log, unless logging is turned off. Lines in the input file must conform to the Input file format. Invalid input lines are echoed, but article processing is skipped, and an indented error message (e.g., Invalid; skipped, or similar) is added to the log after the echoed input.
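
The skip-and-report behaviour could be sketched as below. The names has_required_fields and run are hypothetical, the validity check is a deliberately simple stand-in (three non-empty semicolon-separated fields), and the caller is assumed to pass only candidate data lines, not headers or blank lines.

```python
def has_required_fields(line):
    """Stand-in validity check: at least three non-empty fields."""
    fields = [field.strip() for field in line.lstrip("*: ").split(";")]
    return len(fields) >= 3 and all(fields[:3])


def run(lines, apply_edit, logging=True):
    """Echo each line, apply valid ones, and flag invalid ones in the log."""
    log = []
    for line in lines:
        if logging:
            log.append(line)
        if not has_required_fields(line):
            if logging:
                log.append(": Invalid; skipped")  # indented error message
            continue
        apply_edit(line)
    return log
```

Passing a function that merely records its argument as apply_edit gives the effect of a dry run: the log is produced, but nothing is edited.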

Considerations

Do a large number of articles all appear to be about the same topic, or is there some reason to believe that a run might hit many articles on individual watchlists? (How do other bots deal with this question?)

Throttling

Note that exponential backoff is mentioned and linked from the § Best practices section of Help:Creating a bot.
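
For reference, exponential backoff in its simplest form looks like the sketch below. It is generic and not tied to any bot framework: call_with_backoff is a hypothetical name, RuntimeError stands in for whatever transient failure the real edit call would raise, and the base delay and cap are arbitrary.

```python
import time


def call_with_backoff(action, attempts=5, base=1.0, cap=60.0, sleep=time.sleep):
    """Retry action(), doubling the wait after each failure up to a cap."""
    for attempt in range(attempts):
        try:
            return action()
        except RuntimeError:  # stand-in for a transient failure
            if attempt == attempts - 1:
                raise  # out of retries: let the caller see the error
            sleep(min(cap, base * 2 ** attempt))
```

With the defaults, the waits between attempts are 1, 2, 4 and 8 seconds; the sleep argument can be replaced in tests so that no real waiting occurs.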

Notes

  1. ^ Providing attribution is not optional; it is required by the Wikimedia Terms of Use.
  2. ^ If and when approved as a bot, the edit summary should also include an appended bot id with a link back to the bot page.
  3. ^ a b Still t.b.d.: whether user A is allowed to submit a list of unattributed translation edits performed by user B, or only regarding their own edits; probably the former.
  4. ^ "Non-admins must create one file for each user"—not yet decided whether non-admin users may file for other editors or not.
  5. ^ Suggested (but not required) subpage hierarchy for submitting multiple files about different users.
  6. ^ Or maybe better: User:Myuser/Attrib set one/log-YYYY-MM-DD-hh-mm-ss?
  7. ^ The scope of responsibilities of the requesting user, the bot/procedure operator, and the bot/procedure design and implementation should be delineated.
  8. ^ In the returned log file, only the § Data lines are echoed and enumerated. The § Section header and § Comment lines and empty lines are not included in the log.