Why do we block Google?

Why do we block Google?
by tepples on 2008-03-23 (#32000)

The robots.txt file of this domain is currently configured to exclude almost all pages of this domain, including the entire forum. Why is this? The policy as of now appears to hurt 1. the visibility of nesdev and 2. the ability of members to search the board more efficiently.

by blargg on 2008-03-23 (#32001)

Maybe bandwidth? I notice a lot of boards have a "low-bandwidth" version that search engines index. Aside from that, every page here has about 7K of inline style-sheet, which is surely a waste of bandwidth. The style sheet even has a comment about this:
Quote:
NOTE: These CSS definitions are stored within the main page body so that you can use the phpBB2 theme administration centre. When you have finalised your style you could cut the final CSS code and place it in an external file, deleting this section to save bandwidth.

by tepples on 2008-03-23 (#32010)

This board is less active than Pocket Heaven or even tetrisconcept.com, yet those boards get indexed. If you're worried about server load, use Crawl-delay: .

by Dwedit on 2008-03-24 (#32020)

If you care about server load, use a better caching system and/or accelarated php.

Or maybe use Punbb.

by Bregalad on 2008-03-25 (#32041)

Well, the main Nesdev page, who links to this forum, can be found in Google, but unfotunately it hasn't been updated for almost 3 years or so, and 75% of links in the page are broken by now.

by NotTheCommonDose on 2008-03-25 (#32042)

It would be nice if Google was able to find this place.

by koitsu on 2008-03-29 (#32124)

It isn't specific to this site, it's specific to all the sites on the Parodius Network. It's a global robots.txt.

I did this because archival bots -- including Googles -- gets stuck in infinite loops when fetching content from message boards, and have awful problems dealing with sites that have multiple Host entries which point to the same content (nesdev.com vs. nesdev.com, etc.). The bandwidth usage is the main one.

We just had this happen tonight with a bunch of Chinese IPs. For the past 10 hours they've been pounding the nesdev site, and we've been spitting out up to 2mbit/sec worth of traffic for about 10 hours. I would have noticed sooner except I was sleeping. Did they honour robots.txt? Nope. It was a distributed leech session across multiple IPs within China, which means it was probably compromised machines being used to download content.

This is very likely going to cost me hundreds of dollars in 95th percentile overusage payments with the co-location provider.

EDIT: I've uploaded pictures of the incident tonight, to give you some idea what happens to the server, to the network, and to the firewall state tables when leeching or webcrawler bot software encounters message boards. As you can see, I had to firewall off portions of China to alleviate the problem (using deny statements in Apache doesn't help -- they don't honour anything other than HTTP 404, so the only way to stop them is to block their packets).

http://jdc.parodius.com/lj/china_incident/

by blargg on 2008-03-29 (#32125)

Too bad there's no way to have something watch for excessive usage like this and simply shut down the entire site until a human can figure out what to block. Makes me angry hearing about it.

by NotTheCommonDose on 2008-03-29 (#32126)

I fell every site on earth should be on Google always because everyone has the right to knowledge and information.

by Zepper on 2008-03-29 (#32127)

NotTheCommonDose wrote:
I fell every site on earth should be on Google always because everyone has the right to knowledge and information.

...and downloads? And leeching?

by NotTheCommonDose on 2008-03-29 (#32129)

Yes.

by koitsu on 2008-03-29 (#32131)

blargg wrote:
Too bad there's no way to have something watch for excessive usage like this and simply shut down the entire site until a human can figure out what to block. Makes me angry hearing about it.

The part I'm still trying to figure out is how they managed to get that amount of network I/O out of us.

The nesdev site is rate-limited to ~50KBytes/sec (shared across all visitors -- yes, that's why the site seems slow sometimes), which means technically it shouldn't have exceeded 384kbit/sec. I'm thinking there's a bug in the bandwidth limiting module we use, and if that's the case, I have another I can try -- or I'll just end up sticking the site on it's own IP and use ALTQ in the network stack to do the rate-limiting.

Alternatives I've come up with, none of which are very user-friendly:

1) Use a module which limits the number of requests-per-second submit per IP address; if they exceed the limit, they're blocked for something like 5-10 minutes. The problem with that method is that it can sometimes go awry (and I've seen it happen on sites I've visited), especially if someone loads different pages of the site in multiple tabs or windows.

It also doesn't solve issues like what happened this morning, because the requests being made by the leechers still come in and hit the webserver, and it still has to spit back some brief HTML saying they've been blocked temporarily. This doesn't stop the requests.

2) Use a module which limits the total site bandwidth to X number of kilobytes per minute/hour/day/week/month. If this number is exceeded, the site essentially shuts down hard until the limit is reset (by me). You might've seen this on some web pages out there, where you get a brief HTML message saying "Bandwidth Exceeded".

The problem with this is that all it takes is some prick downloading the entire site (which happens regularly) with wget or some *zilla downloader, and then the site goes offline for everyone until I get around to noticing or someone contacts me to reset the limit.

There's really no decent solution to this problem, folks, at least not one that's ultimately user-friendly, while still being resource-friendly and won't financially screw me into oblivion.

P.S. -- http://jdc.parodius.com/lj/china_incident/dropped_packets_0329_0950.png shows that the leechers *still* have not shut off their leeching programs.

EDIT: I figured out how the leechers managed to get past the bandwidth limit. The bandwidth limiting module we were using was setting the total amount of bandwidth per user to 384kbit, not for the entire site. Thus, multiple simultaneously connections could indeed reach 2mbit. For those who are technical, the module I was using was mod_bw. The documentation for this module is badly written; once I went back and re-read the docs for the directive, I realised "Oh, so THAT'S what they mean... ugh."

I've addressed this by switching to mod_cband, which lets you set a maximum bandwidth limit for a site as a total, not per-client.

Also, looks like some of the leechers have finally noticed and stopped fetching data: http://jdc.parodius.com/lj/china_incident/dropped_packets_0329_1054.png. Now all I'm left wondering is if they're just going to find other machines to do this from...

by MottZilla on 2008-03-29 (#32137)

What's the point of these people leeching? They just want to download the entire site for some reason? Just greedy/selfish people?

by koitsu on 2008-03-29 (#32140)

MottZilla wrote:
What's the point of these people leeching? They just want to download the entire site for some reason? Just greedy/selfish people?

I don't know, you'd have to ask them.

There's a **lot** of people who have done this over the years; it's why Memblers put the "Do not download full copies of the site through the webserver. Use the FTP mirror" note on the main page. That ZIP file is updated weekly, and automatically. I did it solely so people would stop leeching the site, but I guess that's wishful thinking on my part.

by tepples on 2008-03-29 (#32141)

koitsu wrote:
There's a **lot** of people who have done this over the years; it's why Memblers put the "Do not download full copies of the site through the webserver. Use the FTP mirror" note on the main page. That ZIP file is updated weekly, and automatically.

As far as I can tell, the forum is the most interesting part of the site, especially because the front page is years out of date. But I just downloaded all 70 MB of the site's archive over FTP five minutes ago, and a static copy of /bbs/ isn't in there.

by koitsu on 2008-03-29 (#32145)

tepples wrote:
koitsu wrote:
There's a **lot** of people who have done this over the years; it's why Memblers put the "Do not download full copies of the site through the webserver. Use the FTP mirror" note on the main page. That ZIP file is updated weekly, and automatically.

As far as I can tell, the forum is the most interesting part of the site, especially because the front page is years out of date. But I just downloaded all 70 MB of the site's archive over FTP five minutes ago, and a static copy of /bbs/ isn't in there.

Correct, it's not there, because there's no easy way possible to archive that in a way that will be user-friendly, nor a way to easily automate the process. The web board isn't stored in flat files, it's all SQL-based. What exactly do you want me to do about it?

If I remove robots.txt, are you willing to pay the 95th-percentile overusage fees from my co-location provider? Because honestly that's the only way this is going to work in a way that makes you happy.

by tepples on 2008-03-30 (#32154)

There are ways to automate turning the SQL into static data, and I know some boards do this for some sort of "lo-fi version" of each thread that gets sent to the search engines.

But at this point, never mind.

by koitsu on 2008-03-30 (#32167)

tepples wrote:
There are ways to automate turning the SQL into static data, and I know some boards do this for some sort of "lo-fi version" of each thread that gets sent to the search engines.

But at this point, never mind.

Yep, I'm well aware that there's ways to do what you've described, but 1) I doubt the existing board software here has native capability for such (or if it does, in a way that can be done automatically on a nightly basis), 2) I'm unaware of any software that can scrape the SQL tables for an old phpBB forum and create static files for it all. If you know of such software, awesome -- or if you feel like writing it, equally as awesome -- sounds like a good project for you to take on. Pass the idea by Memblers. He runs the site, I just run the server / network. :)

I would be more than happy to incorporate the data into the cronjob we have that makes the weekly ZIP (or even offer a separate ZIP file for just the board).

The only reason I'm offering the weekly ZIP is because Memblers and I both felt it might keep people from downloading the entire sites' content on a regular basis. It has helped -- there's even someone that mirrors that ZIP file on a weekly basis (they fetch it once a week a few hours after we automatically update it). But sadly it doesn't stop people who use leech software (not the mention ones which forge their User-Agent!) against http://nesdev.com/ and then don't bother to look at it what the client is doing for 24+ hours (since they all get stuck in infinite loops once they hit the web boards).

EDIT: I thought about this a little bit more. I'm going to try an experiment for you, tepples. I'm going to remove the robots.txt for the next 3-4 weeks and watch our bandwidth usage closely. I put the robots.txt in place back in February of 2007, so it's possible Google and others have fixed their crawling software within the past year. If things get bad, obviously I'll put it back, but I'm willing to try an experiment for now.

Let me know how things look in a few days (I forget how often Google runs their scrapes).

P.S. -- Looks like they finally stopped: http://jdc.parodius.com/lj/china_incident/dropped_packets_0330_1523.png

by Xkeeper on 2008-05-06 (#33293)

fwiw, you could block certain pages that are known to eat bandwidth, like viewtopic.php?p=* (and other things, and only let it archive forums and threads.

by koitsu on 2008-05-12 (#33382)

Xkeeper wrote:
fwiw, you could block certain pages that are known to eat bandwidth, like viewtopic.php?p=* (and other things, and only let it archive forums and threads.

That doesn't work in the long-term. The way things are currently work fine, and Google should be caching the forums (tepples can verify the robots.txt is gone). The amount of bandwidth this site gets is pretty astounding considering how "niche" it is. People don't understand how expensive bandwidth is.