Archive Format Problems and Broken HTML

anonusr · July 15, 2012

Hello,

Since the new version of the archive was deployed, I've noticed that many older stories have broken HTML tags that get mixed into the story text. The usual culprit is a broken span tag that looks like <span \r\n, which causes the style attribute (often mso-spacerun:yes) to show up in the story text. Here's an example of a fic that has this problem (it's not the only one, just one I stumbled upon):
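A rough way to spot the broken span described above, as a minimal sketch: it assumes Python 3 and that the damage always takes the form of "<span", optional spaces, then a line break. The name count_broken_spans is hypothetical and is not part of the tool mentioned in this thread.

```python
import re

# Assumed shape of the damage: "<span", optional spaces or tabs, then a line break.
BROKEN_SPAN = re.compile(r'<span[ \t]*\r?\n')

def count_broken_spans(html):
    """Return how many span openings are split from their attributes by a line break."""
    return len(BROKEN_SPAN.findall(html))

sample = '<p>Text<span \r\n style="mso-spacerun:yes"> </span>more</p>'
print(count_broken_spans(sample))  # 1
```

A count above zero only flags a story for a closer look; it says nothing about the line break problems.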

http://anime.adultfa...hp?no=600042685

Additionally, these stories often have line break problems. Because the text is divided into paragraph tags and also contains additional, unusual line break tags, it appears to have strange breaks mid-paragraph and large amounts of white space between paragraphs, where a line break tag and the end of a paragraph combine.

Finally, there are broken Unicode whitespace characters in the text (usually as part of an mso-spacerun span that otherwise appears empty). Though they do not show up in a browser, they do manifest themselves in the HTML code.
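The empty mso-spacerun spans mentioned above could be collapsed along these lines. This is only a sketch in Python 3: which invisible characters actually occur in the archive is not stated in this thread, so the set below (no-break space, zero-width space, byte order mark) is an assumption, and collapse_empty_spacerun is a hypothetical name.

```python
import re

# Assumed set of invisible characters: no-break space, zero-width space, BOM.
INVISIBLE = '\u00a0\u200b\ufeff'

# A span whose only style is mso-spacerun:yes and whose body holds nothing
# but whitespace or the invisible characters above.
EMPTY_SPACERUN = re.compile(
    r'<span\s+style="mso-spacerun:\s*yes;?"\s*>[\s' + INVISIBLE + r']*</span>',
    re.IGNORECASE,
)

def collapse_empty_spacerun(html):
    """Replace each empty mso-spacerun span with a single plain space."""
    return EMPTY_SPACERUN.sub(' ', html)

sample = 'one<span style="mso-spacerun:yes">\u00a0\u00a0</span>two'
print(collapse_empty_spacerun(sample))  # one two
```

Spans that contain visible text are left alone, so the substitution should not remove any story content.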

I've written a tool to download a story and clean up the bad HTML (I was having trouble reading some of the works on the site). The problem is that it's difficult for me to automatically figure out which stories have problems and which don't: many stories have the entire story in a single span or paragraph and use line break tags to break correctly, whereas others have the undesired behavior. Therefore the tool isn't perfect. Though it will clean the dangling HTML in story text, you have to tell it explicitly to clean the line break tags (I tried to make the tool so that it can only help, and won't hurt an already well-formatted story).

I was wondering, would the archive like a copy of my code (it's written in Python 2)? I'm not sure how widespread the problem is, but I know I've seen it on quite a few stories, and I thought it might be helpful for someone.

Regards,

anonusr

RogueMudblood · July 15, 2012

Please read this and this.

anonusr · July 16, 2012

Thank you for the links, but I'm somewhat confused...

I'm aware an issue exists, and I know it was brought on during the deployment of the RTE editor. However, it looks like you're referring to a different bug than I am (I'd assume "wall of text" means a lack of breaks rather than breaks appearing in the wrong place).

Regardless, as DemonGoddess061 mentioned, the fixes are being applied manually. The primary point of my post was to ask whether you wanted my code, so the stories affected by the broken span bug could be cleaned up in a more automated fashion (assuming you can directly modify the HTML text for stories in your database and can run a script against that HTML). That's all.

DemonGoddess · July 16, 2012

I'd have the coder look it over, and she can tell me if we can use it. As the html is actually a direct db injection and not an actual page, it would be modifying records, not actual html pages.
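Running a cleaner over database records rather than pages might look roughly like the sketch below. Everything here is assumed for illustration: the archive's real schema is not shown in this thread, so the stories table and its id and body columns are hypothetical, an in-memory sqlite3 database stands in for the real one, and clean is a one-line placeholder for a fuller cleaner.

```python
import sqlite3

def clean(html):
    # Placeholder cleaner: rejoin a span opening that was split by \r\n.
    return html.replace('<span \r\n', '<span')

def clean_all(conn):
    """Rewrite only the rows whose cleaned text differs; return how many changed."""
    rows = conn.execute('SELECT id, body FROM stories').fetchall()
    changed = [(clean(body), sid) for sid, body in rows if clean(body) != body]
    conn.executemany('UPDATE stories SET body = ? WHERE id = ?', changed)
    conn.commit()
    return len(changed)

# Hypothetical table with one damaged and one healthy record.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE stories (id INTEGER PRIMARY KEY, body TEXT)')
conn.executemany(
    'INSERT INTO stories (body) VALUES (?)',
    [('<span \r\n style="mso-spacerun:yes"> </span>',), ('<p>fine</p>',)],
)
updated = clean_all(conn)
print(updated)  # 1
```

Updating only the rows that actually change keeps untouched records out of the write path, which matters if the fix is run against a live table.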

anonusr · July 21, 2012

Sorry for the delayed reply - here's the code I used. Depending on whether fix_lines_global is set, the code will either just fix the spans, or will remove the line break tags as well. If fix_lines_global is not set, this should not have a negative impact on any story (including one that is already fixed), since it should only clear the bad unicode and spans.

Upon looking at it again, this code really isn't efficient (it does several passes of replace), but, as I mentioned, I suspect it might be better than going over stuff by hand. Regardless, the code is free for you to use if you want it.
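For what it's worth, several literal replace passes can be folded into a single pass with one alternation regex and a lookup table. This is a sketch in Python 3, not the script from this post; the table entries are illustrative assumptions and replace_once is a hypothetical name.

```python
import re

# Illustrative replacement table (assumed values).
REPLACEMENTS = {
    '&gt;': '>',
    '\r': '',
    '<br>': '<br />',
}

# One alternation that matches any key; re.escape keeps the keys literal.
PATTERN = re.compile('|'.join(re.escape(key) for key in REPLACEMENTS))

def replace_once(text):
    """Apply every literal replacement in a single scan of the text."""
    return PATTERN.sub(lambda match: REPLACEMENTS[match.group(0)], text)

print(replace_once('a &gt; b<br>\r\n'))  # a > b<br />
```

One caveat: a single pass never rescans its own output, so text like "<br&gt;" becomes "<br>" and stays that way, whereas chained replace calls would go on to turn it into "<br />".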

Regards,

anonusr

PS: Python is a whitespace sensitive language, and the forum text editor seems to be removing all the indentation when I post. If you need a version with correct indentation, I can email you a copy, or I can give you a link to the code stored elsewhere. I've tried to mark in this post where if statements begin and end.

import re

#Set to True to also strip stand-alone line break tags (see below).
fix_lines_global = False

def clean_story(html_page):
    #Clean the story.
    #NOTE: the forum stripped the tag, entity and whitespace literals from
    #this post. '&gt;', the no-break space, '<br>' and '<br />' below are
    #reconstructed from the comments and should be treated as assumptions.
    #This page is unicode - use unicode strings.
    #Also, ignore any unicode bytes that are corrupted.
    clean_page = unicode(html_page, 'UTF-8', errors='ignore')
    #First, replace all &gt; with > (so later substitutions
    #work as expected), and drop the bad whitespace character
    #(assumed here to be the no-break space).
    clean_page = clean_page.replace(u'&gt;', u'>').replace(u'\xa0', u'')
    #Remove \r (so we only have standard \n line breaks).
    clean_page = clean_page.replace('\r', '')
    #Though not technically correct, replace all <br> with <br />
    #so the page is consistent.
    clean_page = re.sub(r'<br>', r'<br />', clean_page)
    #Next, fix all problems with broken spans: rejoin a span opening
    #that was split from its attributes by a line break.
    clean_page = re.sub(r'<span\s+', '<span ', clean_page)
    #Sometimes stories are broken into proper paragraph tags,
    #whereas other times they are broken by <br />. It's
    #impossible to know when the breaks help, and when they
    #hurt. Therefore, this can be turned on with fix_lines_global.
    if fix_lines_global:
        #Less aggressive version only removes line breaks if they stand alone:
        #clean_page = re.sub(r'(?<!<br />\n)<br />(?!\n<br />)', '', clean_page)
        #More aggressive version will still remove single line breaks, but,
        #if there are multiple line breaks, this will take off one.
        #We are using this one because stories that have to be reflowed
        #also seem to have extra space.
        clean_page = re.sub(r'<br />(?!\n<br />)', '', clean_page)
    #if fix_lines_global ends here
    #Remove extra breaks from a page (regardless of whether fix_lines_global
    #is set): runs of three or more breaks are cut down to two (assumed).
    clean_page = re.sub(r'(<br />\n){3,}', '<br />\n<br />\n', clean_page)
    return clean_page

Edit: Added comments to denote where blocks end, and added import statement

Edited July 21, 2012 by anonusr
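To make the difference between the "less aggressive" and "more aggressive" line break removal concrete, here is a small Python 3 sketch. It assumes the tag the forum stripped from the posted patterns is "<br />" followed by "\n"; the two pattern names are hypothetical.

```python
import re

# Less aggressive: remove a break only if no other break sits directly
# before it or directly after it.
STANDALONE_ONLY = re.compile(r'(?<!<br />\n)<br />(?!\n<br />)')

# More aggressive: remove every break that is not followed by another break,
# so a single break disappears and a run of breaks loses its last one.
TRIM_ONE = re.compile(r'<br />(?!\n<br />)')

sample = 'one<br />\ntwo<br />\n<br />\nthree'

less = STANDALONE_ONLY.sub('', sample)
more = TRIM_ONE.sub('', sample)

print(repr(less))  # 'one\ntwo<br />\n<br />\nthree'
print(repr(more))  # 'one\ntwo<br />\n\nthree'
```

Both variants drop the mid-paragraph break after "one"; only the second also thins out the double break before "three".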

DemonGoddess · July 22, 2012

I'm actually replacing \r\n with a line break tag, as that is what it represents in the tables. The coding is php5, mysqli; so I'm not sure if the Python script will actually work for what we're doing. As I said though, I'll have her look and she'll tell me.

The other thing is, Python is not installed on our dedicated server. So, while I certainly appreciate the offer, it's not something we can use.
