Like many other Wikis, Bookshelved is being hit by spam. My reaction is partly annoyance, partly "c'est la vie". One reason I'm putting up a page to open discussion on the topic is that, truth be told, I'm slightly more annoyed at the reactions of some Wiki users to spam than to the spam itself: "someone should do something about it" more or less sums it up, followed by various suggestions as to the shoulds available to the someones: password protection, blacklisting, CAPTCHA schemes, etc.
I'd like to take a different tack and recommend that we first characterize the problem before diving into solution space. Is spam a problem ? Whose problem is it ? What makes it a problem ?
I think link spam is a problem in the sense that it's a 'broken window'. The spammer shows disrespect to the site by using it to create 'google juice' rather than to discuss books. Those of us who like to read and contribute to a site about books feel that the link spammer doesn't follow the simple rules.
Until and unless the spam affects limited resources (like bandwidth or disk space), link spam is only the problem of the irritated editor. I define 'irritated editor' as a person who 'fixes' the wiki, rather than 'contributing' to it. I fully admit to being an irritated editor about 1/5 of the time. (It used to be more when I was at Ward's Wiki, but I've been concentrating on my little sphere over at Wikipedia lately.) Being an irritated editor is fun only until you catch yourself being one. Then you feel like a curmudgeon.
[Warning: Paranoid Diatribe] However, what if a link spammer decides to spread links throughout the wiki? Instead of sticking to the SandBox, what if the spammer trashes the FrontPage! There oughta be a law![End Rant]
See, it's the irritated editor in me. I remember back in 1994 being irate over spam in usenet groups. Then I got fierce when spam started hitting my inbox. I suppose in a year or two I won't care about linkspam either. -- SeanO'Leary
Sean, "irritated editor" is a wonderful framing of the problem - at least one part of it. I was never on Usenet enough to care, but I remember being concerned with email spam for that reason - it took up precious brain cycles, until my filtering rules started taking care of most of it. Now here's a thought - I may be filtering and classifying more of my incoming email, more rigorously, than I would in the absence of spam.
The disrespect issue is a serious one. Kind of like having total strangers wipe their shoes on your doormat, then be off again; no serious material harm, but whose home does this guy think this is ? -- lb
Disrespect might be the essence of all the irritations on the web. Flamebait is a version of disrespect, so is trolling. Usenet and email spam is classic disrespect as well. Disrespect in broadcast media (radio and television) is handled by changing the channel, but disrespect in an interactive environment like the web allows for people to take action (individually or collectively). The Napster phenomenon can be cast as a respect issue: music lovers felt that high CD prices were disrespectful to the fans. Labels feel that music sharing is disrespectful to their efforts in cultivating talent. iTunes seems to balance the two: easy access to music you want, fair price for each track. Google became popular because they respected their users as much as their advertisers right when the other search technologies were touting 'pay for placement'.
"Maximize respect" (aka the golden rule) seems to be a credible business practice. Who'da thunk it. -- SeanO'Leary.
Respect is all well and good, but what exactly do we do when these people see this wiki as a resource to be exploited for pagerank and nothing else? There's nothing else here they need or want, so they have no disincentives to prevent them from escalating (there goes that martial metaphor again) their efforts till they're using scripts to replace dozens of pages with advertising. Of course we could just see this as an attempt to change this infinite game into a finite one wherein they post links till the wiki's pagerank drops low enough to be useless or the wiki gets shut down. Ultimately the pressure is upon us to counteract their moves with more effective ones. Manually reverting changes is only a stopgap and at some point they will start winning. When playing an opponent who seems to have no weaknesses we can either surrender immediately or change the game. So how can we change the game in ways that will mean they cannot play or can only play with a handicap?--ade
Some technological solutions:
Their escalating may actually have the opposite effect. Since I know Bookshelved is under attack, I check the site more often. I'm certainly reading more, and I think I'm contributing more (if only to this page, haha). I recall a few years ago on Ward's Wiki some folks did some widespread changes (in retrospect they were deleting a bunch of junk). The hue and cry over that turned some of us (ok, me) into a recent changes junkie of the highest order.
Now, if the spammer(s) decides to use scripts to replace dozens of pages at once, I would support a ban on that IP address. Another possible way is to require a login to edit pages and 'twit' users that spam. But it seems like an awful lot of trouble for a problem of this magnitude. -- SeanO'Leary
Nothing I can recall. Perhaps it's just the time of year. :) --lb
Ah, yes, August in France. Paris is empty and the beaches are full. Meanwhile, here in Chicago, work work work.
Considering the links the spammers are posting, perhaps it's summer in Shanghai as well. -- SeanO
Google's index appears to have removed most of the previous revisions of Sandbox thanks to the change made to the robots.txt file several weeks ago. I wonder if the spammers are smart enough to realize that Google now sees zero pages linking to their sites instead of an increasing number of pages when their links get deleted from the current revision of a Bookshelved page. -- sn
Aha. Interesting. I didn't think the "robots.txt" change would make a difference - I'm sure glad it did ! I'll be doing the same to other Wikis I host. -- lb
Cleaning the current revision of the Sandbox keeps it looking nice, but an odor remains.
The older revisions of this page, including those with spam links, are still being indexed by Google et al via the "View other revisions" links: http://www.google.com/search?as_q=sandbox&num=100&as_sitesearch=bookshelved.org
Is it difficult to have the Sandbox (and probably older revisions of all pages) ignored by search engines (e.g. by using robots.txt and/or adding the page header tag <meta name="ROBOTS" content="NOINDEX,NOFOLLOW" />)?
Couldn't the robots.txt file just include the line:
The robots wouldn't get to the Sandbox, therefore they wouldn't get to the previous versions.
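The line proposed above is not preserved on this page. Purely as an illustration, here is a minimal sketch of how such a rule could be checked before deploying it, using Python's standard robots.txt parser. The /wiki/SandBox path and the bookshelved.example host are hypothetical; the real Bookshelved URL layout, and how old revisions are addressed, may well differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: the path prefix is an assumption,
# not the line that was actually proposed or deployed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wiki/SandBox
"""


def allowed(url, robots_txt=ROBOTS_TXT, agent="*"):
    """Return True if a well-behaved robot may fetch the given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for page in ("SandBox", "FrontPage"):
        url = "http://bookshelved.example/wiki/" + page
        print(page, "allowed" if allowed(url) else "blocked")
```

Disallow rules match by prefix, so this only covers old revisions if their URLs start with the same prefix as the page itself. It also only restrains robots that choose to honour robots.txt.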
But given the mindset of the people who are spamming the sandbox, wouldn't they just spam the entire wiki instead? The nearest thing to a solution is this idea used in the MoinMoin wiki: http://moinmoin.wikiwikiweb.de/RedirectingExternalLinks where all their redirects go through google and then external links don't get a boost to their pagerank. Having said that, we really don't want to get into an arms race with these people.--ade
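A rough sketch of the redirecting idea, under stated assumptions: every external URL in the rendered page is rewritten to pass through an intermediate redirector, so the spammer's site is never linked directly. The redirector address below is a made-up placeholder, not the one MoinMoin uses, and the URL pattern is deliberately simple.

```python
import re
from urllib.parse import quote, urlsplit

# Hypothetical redirector endpoint; a real setup would substitute its own.
REDIRECT_PREFIX = "https://redirector.example/?url="
# Hosts treated as "ours" and left untouched.
LOCAL_HOSTS = ("bookshelved.org",)

URL_RE = re.compile(r"https?://[^\s<>\"']+")


def is_local(host, local_hosts=LOCAL_HOSTS):
    """True for a local host or any of its subdomains."""
    return any(host == h or host.endswith("." + h) for h in local_hosts)


def redirect_external_links(text, prefix=REDIRECT_PREFIX):
    """Rewrite each external URL so that it points at the redirector."""
    def rewrite(match):
        url = match.group(0)
        host = urlsplit(url).hostname or ""
        if is_local(host):
            return url
        return prefix + quote(url, safe="")
    return URL_RE.sub(rewrite, text)
```

For example, `redirect_external_links("see http://spam.example/x now")` leaves the sentence intact but replaces the URL with the redirector address followed by the percent-encoded original.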
After intense spamming of another site that I steward - http://xp-france.net/ - I broke down and, in spite of my reluctance to tweak code that doesn't have unit tests, implemented a UseMod patch that aims to counter such attacks. If it works well over there, I'll patch the Bookshelved instance as well. -- lb
This wiki spam is a problem all over. But do we really need to discuss it all over? What happened to Wiki:OnceAndOnlyOnce ? Since spam has very little to do with books, http://CommunityWiki.org/WikiSpam would be a more appropriate wiki to discuss spam. -- DavidCary
Ah, but David this page is not about other wikis. It is about our, currently successful, attempts to resolve our problem with wiki spam. And by the way have you considered creating a homepage and adding your name to TheVisitorsBook?--ade
I added my name to TheVisitorsBook some time ago. I suppose I should create a homepage now. Please forgive me for hoping that a solution for our problem would also solve the "same" problem on other wikis. -- DavidCary 2006-09-28.
A possible hint as to what the problem actually is: http://www.edge.org/q2006/q06_5.html#goleman
If there is a MirrorMirror, I don't know where it is. I'm not sure about the email registration business. It strikes me as one more step in the arms race. I have this niggling intuition that if we could just clearly nail down what the problem with spam is, we'd find an effective way to solve it for good. Could be a mere fantasy, of course. -- lb
Well, there are two sides to the issue. From the side of the spammer, there is no problem. Other people provide giant billboards upon which the spammer can advertise his/her wares at little cost. So the spammer, who is a perfectly fine human being, makes the only rational decision possible within the capitalist economic framework and uses these low-cost resources. On the other hand, those of us who feel a certain emotional attachment to these giant billboards (because we 'incorrectly' see them as something other than under-utilised economic resources) find it offensive that someone wants to create 100s of pages that serve no useful purpose. I suppose it could be said that those of us who are uncomfortable with the idea of a wiki filled mostly with porn adverts (and where people like BenHogan find that their 'home' page is now a porn advert) are the ones with the problem.
I suppose we could learn to live with it. We could start keeping local copies of our favourite sections of the wiki and using scripts to monitor the wiki, but you know something? I think that's wrong. I think we know what the problem is, and I don't think that it's a lack of emotional intelligence on the part of the spammers. They're acting in their own best interests. I don't think this is the kind of problem you solve for good; instead you find temporary solutions that increase the cost of spamming, or you surrender and let the spammers have the site. You seem to be opting for the latter approach.
After all, if the spammers create enough pages then people will give up trying to revert the spam (note that both ElizabethWiethoff and KeithBraithwaite have already given up trying to manually revert the changes (didn't give up, was felled by food poisoning--KB)). I've personally given up trying to manually pseudo-delete all those spam pages and I'm currently resisting the strong temptation to write a script to delete/revert those pages affected by the last attack. So there's really only one important question: are we prepared to cling to our quaint ideas about this wiki or do we give the spammers their billboard? It's up to you.--ade
So what's the deal? Should I keep trying to revert by hand? At least I "deleted" all the NrTkwfOx pages and BambiBigrack pages I could spot. The vast majority of the other spam pages appear to be brand new pages created from legitimate Bookshelved links. -- Eliz
Thanks, Eliz. I'll try to keep an eye on it during my waking hours. But seriously, I think we might benefit from a MeatBall:ShotgunSpam solution like MeatBall:LinkThrottling. In short, when an edit exceeds a fixed number of links, or when it seems to consist of nothing but links, deny it. I think we could use real deletion as well, to clean up some of the mess. --ATS
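A minimal sketch of that rule, not taken from MeatBall or from any wiki engine: an edit is denied when it adds more than a fixed number of links compared with the previous revision, or when most of the submitted text is links. The function names and both thresholds are assumptions chosen for illustration.

```python
import re
from collections import Counter

LINK_RE = re.compile(r"https?://\S+", re.IGNORECASE)


def added_links(old_text, new_text):
    """Links present in the new revision but not in the old one."""
    old = Counter(LINK_RE.findall(old_text))
    new = Counter(LINK_RE.findall(new_text))
    return list((new - old).elements())


def should_deny_edit(old_text, new_text, max_new_links=5, max_link_ratio=0.5):
    """Deny edits that add too many links or are little more than links."""
    if len(added_links(old_text, new_text)) > max_new_links:
        return True
    words = new_text.split()
    if not words:
        return False
    link_words = sum(1 for word in words if LINK_RE.search(word))
    return link_words / len(words) > max_link_ratio
```

Counting only the links that are new in this revision means a page that legitimately carries many links can still be corrected without tripping the limit.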
Thanks, Aalbert, for cleaning up the rest of the mess. I had to leave for an XP meeting. What's eerie about the latest spam attack is, the external links were not real links. Scanning for 'http' as part of a link throttling measure would not have helped, though scanning for 'ttp' or 'tp://' would. Nevertheless, I'm in favor of link throttling and real deletion. -- Eliz
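To illustrate the point about 'http' versus 'tp://': a sketch of a looser pattern that keys on the "://" separator with any scheme-like prefix, next to a strict http-only pattern. The sample URLs are invented; the exact form of the links in the attack is not recorded here.

```python
import re

# Strict: only real http/https links.
STRICT_LINK_RE = re.compile(r"https?://\S+", re.IGNORECASE)
# Loose: anything shaped like scheme://rest, whatever the scheme claims to be.
LOOSE_LINK_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*://\S+")


def count_strict(text):
    """Number of ordinary http/https links in the text."""
    return len(STRICT_LINK_RE.findall(text))


def count_loose(text):
    """Number of link-like tokens, including mangled or unusual schemes."""
    return len(LOOSE_LINK_RE.findall(text))


if __name__ == "__main__":
    sample = "ttp://spam.example/a hxxp://spam.example/b ftp://spam.example/c"
    print(count_strict(sample), count_loose(sample))
```

The loose pattern will also count legitimate non-http links such as ftp, so any limit built on it would need to allow for that.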
You know I never really noticed the weird 'URLs' -- I noticed the different protocol but thought nothing of it and just mindlessly started reverting.
Hmm... Some sort of protection seems to be in place now, since I had difficulties editing this page. I was told that I wasn't allowed to post external links (even though it was just an interwiki link to Meatball). Seems a bit harsh, but I guess (or rather I hope) it's just a temporary measure.--ATS
Ah, got it. It seems as if that specific 'protocol' is disallowed. So referencing MeatBall:EditThrottling should be allowed, then? Good. I think we could use some form of that as well, since I don't think any of us has any use for massive page creation.
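For reference, a sketch of what edit throttling could look like: each client (for instance, an IP address) gets at most a fixed number of edits within a sliding time window. This is not UseMod's implementation; the class name, the limits, and the choice of client identifier are all assumptions.

```python
import time
from collections import defaultdict, deque


class EditThrottle:
    """Allow at most max_edits per client within window_seconds."""

    def __init__(self, max_edits=10, window_seconds=600.0, clock=time.monotonic):
        self.max_edits = max_edits
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self._history = defaultdict(deque)  # client id -> recent edit times

    def allow(self, client_id):
        """Record and allow the edit, or return False if over the limit."""
        now = self.clock()
        recent = self._history[client_id]
        # Forget edits that have fallen out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_edits:
            return False
        recent.append(now)
        return True
```

A human editor rarely saves ten pages in ten minutes, while a script creating pages in bulk hits the limit almost at once, which is the asymmetry the idea relies on.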
I'm after something subtler than "surrender". Bookshelved has countermeasures in place: "link throttling" of a sort is one. I'll be looking at edit throttling. Others, like email registration, I'd be willing to consider only as a last resort; they seem to me to go counter to the spirit in which this wiki was created. My primary goal is that legitimate users of this site be minimally inconvenienced; that takes precedence over deterring spammers. -- lb
Quick update: I've been unable to find the actual Perl code for edit throttling in UseMod. Can someone point me to it ? -- lb
Laurent, even if you don't wish to upgrade the public wiki to UseMod 1.0, may I suggest installing a private copy somewhere, running against the same page database, so that you can delete the spam pages marked with DeleteDeleteDelete. -- EarleMartin
As of yesterday we are running 1.0. :) David, thanks for the pointer. -- lb
After a little research I've discovered this is actually a very popular spam attack. They're trying to embed lots of invisible spam links in the page, hence the "display: none" CSS style. If you take a look at the 'recent changes' page for the official usemod wiki you will see that someone seems to be experimenting with variations on this CSS-based attack.--ade
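One way to spot this kind of submission, sketched under the assumption that the wiki accepts raw HTML in page text: flag any edit whose text contains a "display: none" style declaration, whatever the spacing or letter case. The function name is hypothetical, and a determined spammer can hide content in other ways this does not catch.

```python
import re

# Matches "display:none", "display: none", "DISPLAY : NONE", and so on.
HIDDEN_STYLE_RE = re.compile(r"display\s*:\s*none", re.IGNORECASE)


def has_hidden_content(text):
    """True if the submitted text tries to hide something with CSS."""
    return HIDDEN_STYLE_RE.search(text) is not None


if __name__ == "__main__":
    edit = '<div style="display: none"><a href="http://spam.example/">x</a></div>'
    print(has_hidden_content(edit))
```

Since a book wiki has little legitimate use for invisible text, rejecting such edits outright should cost ordinary contributors almost nothing.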
I can understand trying to embed invisible links to porn and shopping sites. But I don't see the point of trying to embed an invisible message that just says "no changes". -- ElizabethWiethoff
It looks to me like someone who doesn't know what they're doing CargoCulting somebody else's code and failing. "No changes" could just be a randomly-selected English string; the attacker may not be able to read English (as in the Chinese spammers that appear from time to time). -- EarleMartin
I've added the above IPs to the ban list. I'm also going to enable edit throttling, if I can figure out how that works. -- lb