Wikipedia:Bots/Requests for approval/CopyvioHelperBot
- The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was Request Expired.
Operator: Chris is me
Automatic or Manually Assisted: Entirely manual
Programming Language(s): Perl
Function Summary: Finds in-article copyvios and notifies operator.
Edit period(s) (e.g. Continuous, daily, one time run): No actual editing; any edits under the account are made by the operator, prompted by the script's reports.
Edit rate requested: 1 edit per minute
Already has a bot flag (Y/N): No
Function Details: The script Googles the first 15 words of each paragraph and lists any matching URLs, excluding a whitelist of known mirrors. The operator then checks which direction the copying goes (if it is a violation at all) and makes appropriate changes.
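For clarity, a minimal Perl sketch of that check (the mirror whitelist entries and the google_search() stub below are illustrative assumptions; the original script's search code is not reproduced here):

    #!/usr/bin/perl
    # Sketch of the described check: the first 15 words of each paragraph
    # are searched, and hits not on the mirror whitelist are reported to
    # the operator. google_search() is a hypothetical stand-in.
    use strict;
    use warnings;

    my @mirror_whitelist = ('answers.com', 'thefreedictionary.com');   # illustrative entries

    sub google_search {
        my ($query) = @_;
        # A real implementation would query a web-search API and return
        # matching URLs; this stub returns nothing.
        return ();
    }

    sub check_article {
        my ($wikitext) = @_;
        my @suspects;
        for my $para (split /\n\s*\n/, $wikitext) {          # paragraphs = blank-line-separated blocks
            my @words = grep { length } split /\s+/, $para;
            next if @words < 15;                             # skip headings and very short paragraphs
            my $phrase = join ' ', @words[0 .. 14];          # first 15 words
            for my $url (google_search(qq{"$phrase"})) {
                next if grep { index($url, $_) >= 0 } @mirror_whitelist;
                push @suspects, { phrase => $phrase, url => $url };
            }
        }
        return @suspects;    # the operator reviews these and decides which way the copying went
    }

The script only reports; any removal or {{db-copyvio}} tagging is done by the operator.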
Discussion
Where does it get the list of pages to check? Does it just iterate through all pages here? That's a ton of requests. Also, Wikipedia:CP tends to be backlogged already. On the other hand, it would be much more server-hoggish than my second bot, and our caching and slaves do go a long way, and the data could be useful. Still, it would be nice if the bot only looked at non-stub smaller pages (the ones that aren't so active, which tend to have lifted text) or maybe focused more on patrolling new pages (from the log perhaps?) in real time for incoming copyright vios. Voice-of-All 18:24, 20 December 2006 (UTC)[reply]
- Sorry. Each time I run it, it checks one page (found with Special:Random). I then delete the problem sections (or if it's the whole article, then I {{db-copyvio}} it). -- Chris is me 18:34, 20 December 2006 (UTC)[reply]
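As a sketch of that per-run workflow (assuming LWP::UserAgent; the actual script may fetch pages differently):

    #!/usr/bin/perl
    # Fetch one random article's wikitext via Special:Random, then hand it
    # to the checker. No edits are made by the script itself.
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new(agent => 'CopyvioHelperBot/0.1');

    # Special:Random redirects to a random article; LWP follows the redirect.
    my $resp = $ua->get('http://en.wikipedia.org/wiki/Special:Random');
    die 'Fetch failed: ' . $resp->status_line unless $resp->is_success;

    my $article_url = $resp->request->uri;            # the article we were redirected to
    my $raw = $ua->get("$article_url?action=raw");    # raw wikitext of that article
    print "Checking: $article_url\n";
    # ... pass $raw->decoded_content to the check_article() sketch above ...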
- There are over 1.5M articles. How many of them contain blatant copyvios that can be detected by your bot? Perhaps the bot should mostly check Special:Newpages and "problem" categories such as Wikipedia:WFY Wikipedia:CBM CAT:NOCAT? Or will it process those quite easily, leaving you time to check random pages? MaxSem 20:43, 21 December 2006 (UTC)[reply]
- You'd be surprised how many in-article copyvios there are. I could make it scan newpages, but (1) I don't know very much Perl (I didn't write the thing) and (2) it's not automatic; I just run it when I feel like it, and if there's a copyvio, I remove it. Wait, I need a bot flag for that? 66.82.9.80 04:08, 22 December 2006 (UTC) This post was made by -- Chris is me (user/review/talk) when he was unable to log in[reply]
- I've been running the scanner sporadically and currently average something like 1 copyvio per 15 articles. This is bad, really bad. -- Chris is me 04:18, 26 December 2006 (UTC)[reply]
- It would be useful to have some results to review. Also, you might consider downloading a database dump and driving your bot off that; then the server issues vanish. Rich Farmbrough, 22:45 28 December 2006 (GMT).
- Indeed, either just check newpages or download a dump and iterate (not random) through allpages. Voice-of-All 04:26, 31 December 2006 (UTC)[reply]
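A rough shape for the dump-driven variant, assuming XML::Twig and a local pages-articles dump (the filename is a placeholder, and check_article() is the checker sketched under the function details):

    #!/usr/bin/perl
    # Iterate every page in a downloaded database dump instead of hitting
    # the live site with random-page requests.
    use strict;
    use warnings;
    use XML::Twig;

    my $twig = XML::Twig->new(
        twig_handlers => {
            page => sub {
                my ($t, $page) = @_;
                my $title = $page->first_child_text('title');
                my $text  = $page->first_descendant('text');
                if ($text) {
                    my @hits = check_article($text->text);   # checker from the earlier sketch
                    print "Possible copyvio in '$title': $_->{url}\n" for @hits;
                }
                $t->purge;   # discard the parsed page to keep memory flat while streaming
            },
        },
    );
    $twig->parsefile('enwiki-pages-articles.xml');   # placeholder dump filename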
- I agree with Voice-of-All on what to use this bot for. I actually don't think you need bot approval if you're manually making and monitoring every edit anyway. If you're not sure how to modify your bot to run automatically in the way people are referring to, you could ask for help at Wikipedia:Bot requests. Vicarious 03:27, 1 January 2007 (UTC)[reply]
What about situations where, say, a reporter for a major newspaper copies a wiki article without giving credit? Wouldn't this bot, even user-assisted, have a high chance of flagging the article in that case? --Measure 21:14, 9 January 2007 (UTC)[reply]
- Never mind. I see my misunderstanding comes from a poor reading of how this bot would be used. --Measure 18:34, 10 January 2007 (UTC)[reply]
- A good approach to this would be to collect a list of 25-50 new pages and use Special:Export to download them all at once (for newpages). A version could also run on a database dump to find old violations. I like the idea of a continuous version of this that posts reports people can look through, with the suspected text and the URL. If you need any help coding, you can ask me. HighInBC (Need help? Ask me) 17:07, 25 January 2007 (UTC)[reply]
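One possible shape for that batch fetch, assuming LWP::UserAgent (the titles are placeholders; the parameter names follow the standard Special:Export form):

    #!/usr/bin/perl
    # Fetch a batch of new pages in a single request via Special:Export,
    # instead of one HTTP request per article.
    use strict;
    use warnings;
    use LWP::UserAgent;

    my @titles = ('Example new page one', 'Example new page two');   # e.g. 25-50 titles from Special:Newpages

    my $ua = LWP::UserAgent->new(agent => 'CopyvioHelperBot/0.1');
    my $resp = $ua->post(
        'http://en.wikipedia.org/wiki/Special:Export',
        {
            pages   => join("\n", @titles),   # one title per line
            curonly => 1,                     # current revision only
        },
    );
    die 'Export failed: ' . $resp->status_line unless $resp->is_success;

    my $xml = $resp->decoded_content;   # one XML document containing every requested page
    # ... extract each <text> element and feed it to the copyvio checker ...

A single Special:Export request for the whole batch keeps the load well below one request per article, which also addresses the server concerns raised above.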
A bot checking just newpages for copyvios already exists (and has for months) at Wikipedia:SCV. This is really the best way to do it... trust me, purging copyvios from old articles is a very difficult and time-consuming task... one of the most challenging things I've ever done on Wikipedia was a project to remove maybe 100 deeply embedded copyvios. I can't imagine trying to look at all 1.6 million articles. Anyway, if we can just have a bot monitor newpages at all times, there should be few new copyvios slipping through the cracks. --W.marsh 05:06, 1 February 2007 (UTC)[reply]
- Request Expired. —Mets501 (talk) 01:31, 4 February 2007 (UTC)[reply]
- The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made in a new section.