

Searching full text of stories


vagrant5


Hi,

I think it's a pretty serious issue that we are unable to search the full text of stories in the archive; only the author, title and summary can be searched. From reading some previous posts on the forum, I understand that such a feature could not be supported by the database/server due to likely performance issues (it would be too resource-heavy a task).

The alternative is just to allow Google to index the site. That way we could search on Google without any impact on the AFF server. But right now the site is not indexed by Google at all: this simple search turns up no results. Looking at the robots.txt, it looks like Google should be able to index the site, yet there are absolutely no results. I am pretty sure the reason is that there is no way for the Google crawler to get past the warning page (see attached). Maybe you could expose some kind of index of the site to Google without requiring acceptance of this warning? I am pretty sure there are ways to recognize the crawler and let it past while making everyone else click through the warning.
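(For illustration only: a fully permissive robots.txt is just a couple of lines like the ones below. This is not AFF's actual file, and the sitemap URL is made up.)

```
# Hypothetical permissive robots.txt: nothing disallowed for any crawler,
# plus an optional sitemap hint. Illustrative only, not AFF's real file.
User-agent: *
Disallow:

Sitemap: http://www.adult-fanfiction.org/sitemap.xml
```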

In any case, I just wanted to hear the admin's take on this. I think it's pretty important to be able to search full text. Sometimes I want to search for obscure keywords or something else that nobody would ever think about putting in the story "Summary".

I definitely think AFF and the authors would get more exposure if the stories appeared on Google.

[Attached screenshot of the warning page: post-20025-0-65292500-1324016669_thumb.png]


Full text searching would put a huge strain on the db itself, no matter whether we did it with the current search engine or with Sphinx search. The database is currently 7.9GB, with the BULK of that storage being chapter data. This is why it can't be implemented.

If this were all HTML pages, rather than scripted content, sure, we could use Google search to do exactly what you say. However, since the content is database driven, and getting to it requires form use to begin with (invisible, but still a form of a kind), Google search CAN'T index it. The crawler has to be able to LINK to specific (static) pages to crawl.

If what you're really wanting is for us to eliminate that page itself, sorry, no can do at this point.


If this were all HTML pages, rather than scripted content, sure, we could use Google search to do exactly what you say. However, since the content is database driven, and getting to it requires form use to begin with (invisible, but still a form of a kind), Google search CAN'T index it. The crawler has to be able to LINK to specific (static) pages to crawl.

Thank you very much for the response. I understand that the internal search would not be able to handle full text search, and we should not consider it as an option.

Also I understand and agree that we cannot consider the option of removing the warning page when a visitor first visits AFF.

However, it IS possible to have Google index AFF nonetheless. I'm not sure if this is what you were trying to say, but dynamically generated pages (non-static HTML) CAN be indexed by Google. Blogs (e.g. WordPress, LiveJournal) are all dynamically generated from a database too, and they get indexed.

The issue is that Google cannot get past the warning page. Here are instructions from Google about how to recognize Google's crawlers, which should enable AFF to let them through without displaying a warning page. In other words, check whether the user-agent is Googlebot, and if so, let it through. Everyone else would still see the warning page. I know a lot of sites do this kind of thing.
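Just to make the idea concrete, here is a rough PHP sketch of what I mean (the function name is made up, and this is only an illustration, not AFF's code). It checks the User-Agent and then uses the reverse-DNS verification that Google itself documents, so a regular visitor faking the Googlebot User-Agent string would not slip past the warning:

```php
<?php
// Illustrative sketch only: spot a visitor claiming to be Googlebot, then
// confirm it really is Google via reverse DNS plus a forward lookup.
function is_verified_googlebot() {
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    if (stripos($ua, 'Googlebot') === false) {
        return false;                          // doesn't even claim to be Googlebot
    }

    $ip   = $_SERVER['REMOTE_ADDR'];
    $host = gethostbyaddr($ip);                // reverse DNS lookup
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;                          // hostname isn't one of Google's
    }

    return gethostbyname($host) === $ip;       // hostname must resolve back to the same IP
}

// Hypothetical use in the warning-page check:
// if (is_verified_googlebot()) { /* skip the warning and serve the page */ }
```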

If you are concerned about the impact on performance as Google continues to index, there is a way to tell their crawler to slow down (in Google Webmaster Tools).

I really think getting indexed would increase AFF's visibility on the web. Please let me know what you think of the above.


Google actually does crawl this site, daily. One thing that would enhance the performance of it is to generate a sitemap, which is on my VERY long to-do list. As I'm in the middle of doing more work stuff (and have been for a month solid), the RL stuff has to and always does take precedence.
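(For anyone curious, a sitemap itself is nothing exotic: it's just an XML list of URLs. A minimal PHP sketch of generating one might look like the following; the story URLs and output location are made up for illustration, not how the archive actually builds it.)

```php
<?php
// Minimal sitemap-generation sketch. $storyUrls stands in for whatever
// query actually pulls story/listing links out of the database; the URLs
// below are examples only.
$storyUrls = array(
    'http://comics.adult-fanfiction.org/story.php?no=1234',   // example only
    'http://anime.adult-fanfiction.org/story.php?no=5678',    // example only
);

$xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
$xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
foreach ($storyUrls as $url) {
    $xml .= '  <url><loc>' . htmlspecialchars($url) . '</loc></url>' . "\n";
}
$xml .= '</urlset>' . "\n";

file_put_contents('sitemap.xml', $xml);        // then submit this file to Google
```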


*reads* Full text search of chapters is not going to happen, but in the recode there will be full text search of story summaries, which is the closest you will get. Trying to full text search chapter data would crash the whole database; even with Sphinx, the load is just too big.


From Google (the Webmaster Tools documentation):

Googlebot processes each of the pages it crawls in order to compile a massive index of all the words it sees and their location on each page. In addition, we process information included in key content tags and attributes, such as Title tags and ALT attributes. Googlebot can process many, but not all, content types. For example, we cannot process the content of some rich media files or dynamic pages.

So as you can see, even with a sitemap, it's STILL not going to index the chapter data.


*reads* Full text search of chapters is not going to happen, but in the recode there will be full text search of story summaries, which is the closest you will get. Trying to full text search chapter data would crash the whole database; even with Sphinx, the load is just too big.

Yes, I agree and understand that it is not feasible to implement full-text search on the AFF search engine. But it is quite feasible (and very desirable) to let people search for AFF stories on Google, by letting Google index the AFF archive.

I'm aware of this. This is why I need to generate a site map, so it crawls the subdomains as well.

So as you can see, even with a sitemap, it's STILL not going to index the chapter data.

I looked into sitemaps before posting my suggestion about detecting Googlebot via the user-agent. Sitemaps are not going to help. Believe me, Google knows that the subdomains (e.g. comics.adult-fanfiction.org) exist. They know from links to the archive in posts on this forum, from DNS records, and from incoming links from elsewhere on the web. The issue is that Googlebot cannot get to any of the content. As you mentioned before, the problem is the hidden "form" submission(s). Googlebot will not submit "forms" and it won't store session information. This is what they are referring to when they say "dynamic pages" in the passage you quoted. In technical terms, this means that Google probably won't crawl anything that requires cookies and/or an HTTP POST request. (A "POST request" is a form submission.)

I checked how comics.adult-fanfiction.org works in this respect, using Firefox's private browsing mode and the Live HTTP Headers add-on. Basically, when a new visitor arrives who has never visited the site before, they have to get past the WARNING page by submitting a form (a POST request) with their date of birth etc., after which they receive and keep a cookie that identifies them to the AFF server as having submitted that form. This makes sure they don't get asked to submit the same WARNING form again for some time. This is the ONLY form (POST request) required to get to the stories and to navigate from chapter to chapter.

Every time a user requests any page in the archives (e.g. a story chapter), the cookie identifies them to your server, so your server knows they accepted and signed the Warning page. If the cookie is missing, they see the warning page instead of the story. THIS is the mechanism that prevents Googlebot from seeing and indexing the AFF archives: it cannot submit the WARNING form and won't keep cookies, so every time it follows any link into the archive (e.g. to a story), all it sees is the warning page.
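In pseudo-PHP, the gate I observed behaves roughly like this (this is my reconstruction from the headers; the cookie name, field name and file name are invented, not AFF's actual code):

```php
<?php
// Sketch of the observed behaviour, run at the top of every archive page.
if (!isset($_COOKIE['aff_disclaimer'])) {                           // no acceptance cookie yet
    if ($_SERVER['REQUEST_METHOD'] === 'POST' && isset($_POST['dob'])) {
        // visitor just submitted the WARNING form: remember the acceptance
        setcookie('aff_disclaimer', '1', time() + 30 * 24 * 3600);  // expiry guessed
    } else {
        include 'warning_page.php';                                 // show the WARNING form instead
        exit;                                                       // never reach the chapter
    }
}
// Cookie present (or form just accepted): serve the requested chapter as normal.
// A Googlebot check like the one sketched earlier could be added to the first
// "if" so the crawler is let through without the cookie.
```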

So do not waste time with sitemaps, because they won't help. What is necessary is a way to detect Googlebot and let it browse the archive without checking its cookie (and therefore without displaying the WARNING page/form).

There are a ton of resources on the web about detecting Googlebot, this one being the first result for PHP. I am not sure what the backend of the AFF db interface is, but I am pretty sure someone out there has figured out something similar that you can use.

Please let me know what you think.


1. The function you referenced can get us blacklisted by Google. This is why they tell us to generate a sitemap.

2. It takes time for an index to be done once a site map has been generated.

3. The site map has been generated and submitted for indexing.

4. No, it WON'T index the form, but if I tell it to look at the index file of a given subdomain, that's what it'll index.


Thank you very much for your reply and for submitting the sitemap. I hope it works. I know indexing takes time, so we'll have to check status in a couple of months. But I want to clarify a couple of things:

  1. The function I referred to would not get us blacklisted by Google. It would only do that if your server presented a completely different site and information to Googlebot than it does to normal visitors (cloaking or spamdexing). This would not be the case. You would just be using this approach to let Googlebot past a specific page that blocks its crawling. Lots of sites use a similar approach, including major newspapers, journal publishers, and other content providers, especially those who put up a paywall. Google welcomes this and has a page that addresses a similar issue (in the context of paywalls). Especially note how they say "Keep in mind that Googlebot cannot access pages behind registration or login forms". They have some very relevant suggestions, especially in the "implementation" subsection.
  2. For Google to index the entire site, it needs to move around the site, follow all the links, and index all of that content. Because of the form it can't access the site at all, and therefore doesn't index it. I am not sure if having an index file would help.


Pointing to an index file ALWAYS helps. As I said, the sitemap has been submitted, but not yet indexed. The subdomains I've been submitting (direct link to index page) are also starting to show in search results. However, that takes TIME.

Each subdomain has to be submitted to the indexes as a site. As far as I know, up until recently only the forum was submitted that way, because I did it. Mind you, I got an earful from Jaxxy at the time when I did, but that's beside the point. Long story short, she put the screws to my submitting anything else for the domain to the indexes.


We appreciate all the suggestions that you have submitted and will take them, as well as Google's information about how to allow this, into consideration. As time allows, it may be something that we do in the future; however, at this point our focus is on getting the site code up to date and fixing the word order issues from the last crash. Thank you for the time spent on this, and we will take it from here.


  • 2 weeks later...

Okay, we've done a slight workaround. Now, due to the size of the db itself, crawling chapter data is really not a good thing for us. However, the point at which the form is triggered has been moved to the first time a user tries to look at a chapter. This then sets the cookie, just like it did when trying to access anything at all.

We've replaced that with a simple cookie, which involves no form and is set by clicking a link. The downside to this particular cookie is that it must be set per subdomain. It does last for 30 days, so it only needs to be done once a month. Not a big deal, really. It still covers the site by clearly stating the age requirements needed for access, which is important.
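In rough terms, the link target just does something like the following sketch (simplified, with made-up names, not the literal code):

```php
<?php
// Sketch of the link-click handler: set a plain 30-day cookie for the
// current subdomain, then send the reader back into the listings.
// Cookie name and redirect target are illustrative only.
setcookie(
    'aff_age_ok',                  // hypothetical cookie name
    '1',
    time() + 30 * 24 * 60 * 60,    // lasts 30 days, as described above
    '/',                           // valid for the whole path...
    $_SERVER['HTTP_HOST']          // ...but only on this subdomain
);

header('Location: /');             // the real thing returns the reader to the page they came from
exit;
```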

Hopefully within the next week we'll see different results from the crawlers, where they can index the members, the story listings, etc. Just not the chapter content.


If the archive code were a physical thing that we could see, I'd honestly be afraid of what it looks like right now... I think we've got 10 industrial-size rolls of duct tape in place and enough prayers to keep the Greek Parthenon going for a few centuries at least... You and Manta are geniuses!
