Regarding chapter data repair:
I thought I'd give an example of each. This all came with the addition of the rich text editor: it showed code (often), screwed up spacing (100% of the time), changed fonts throughout (often), and other such things.
This particular story that I just fixed is 29 chapters long. BEFORE I stripped out all the garbage from the Word-exported HTML, the collective size of the thing was 8MB. Once I was done fixing it, it had shrunk to 1.5MB.
This is before I fixed things:
This is after:
Another example of the RTE stuff to fix:
As you can see, there is quite a difference.
These are the steps I have to take to fix these particular stories:
Check the text encoding. If it's not universal, convert it.
REMOVE all the extra HTML that makes illegal function calls, is just sort of ...there..., and whatnot. (This is done line by line by line by line...)
Convert the PHP line break to a space. For these records, the inserted line break actually stands for a space, not a paragraph end or line break.
Double-check each record to make sure I missed nothing.
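For anyone curious what those steps amount to, here is a minimal sketch in Python. It is not the tool used on the site (the real cleanup is done by hand, record by record). The function names, the cp1252 fallback encoding, the list of tags to simplify, and the reading of "PHP line break" as a `<br />` tag are all assumptions made for illustration.

```python
import re

def to_utf8_text(raw: bytes, fallback: str = "cp1252") -> str:
    """Step 1: decode as UTF-8 if possible, else assume a legacy encoding.
    The cp1252 fallback is an assumption, not something the records declare."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace")

def strip_word_markup(html: str) -> str:
    """Step 2: remove the markup a Word export typically leaves behind."""
    # Drop HTML comments, including Word's conditional comments.
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    # Drop Office namespace tags such as <o:p> and </o:p>.
    html = re.sub(r"</?o:[^>]*>", "", html)
    # Drop span and font wrappers but keep the text inside them.
    html = re.sub(r"</?(?:span|font)\b[^>]*>", "", html, flags=re.IGNORECASE)
    # Reduce opening tags like <p class=... style=...> to a bare <p>.
    html = re.sub(r"<(p|div|b|i|em|strong)\b[^>]*>", r"<\1>", html,
                  flags=re.IGNORECASE)
    return html

def line_breaks_to_spaces(html: str) -> str:
    """Step 3: turn each <br> (and the whitespace around it) into one space."""
    return re.sub(r"\s*<br\s*/?>\s*", " ", html, flags=re.IGNORECASE)

def clean_record(raw: bytes) -> str:
    """Run the three automatic steps; step 4 (double-checking) stays manual."""
    return line_breaks_to_spaces(strip_word_markup(to_utf8_text(raw)))

if __name__ == "__main__":
    sample = (b"<p class=MsoNormal style='margin:0in'>"
              b"<span lang=EN-US>Hello<br />\nworld</span><o:p></o:p></p>")
    print(clean_record(sample))
```

Regular expressions are fragile on real-world HTML, which is a big part of why the final double-check of every record can't be skipped.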
Mind you, it doesn't SOUND like a lot, but consider that a simple paragraph opening tag, which is NORMALLY 3 characters (<p>), can often run upwards of 50 characters. Not only that, there are EXTRA paragraphs added, which rely on a nonexistent .css file for their formatting. I run across this with each HTML element in the document. This is why this particular story shrank so drastically: the sheer file size came from the extra, useless HTML embedded within each record.
What causes the normally "invisible" coding elements to appear, along with the oddball spacing and whatnot, is that the rich text editor attempts to strip all this garbage out. It can only do part of this until a user actually opens the chapter and clicks "edit chapter" to resave it. It then finishes stripping out most of it. The only thing I've seen it have difficulty removing is the extra paragraphs Word likes to insert in the converted document.
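To put numbers on that, here is a small Python sketch. The bloated tag is an illustrative example of the kind of thing Word writes, not one copied from an actual record, and the helper names are made up for this example. The 8MB and 1.5MB figures are the ones from the story above.

```python
import re

# An illustrative Word-style paragraph tag next to the plain one.
BLOATED_TAG = "<p class=MsoNormal style='margin-bottom:0in;line-height:normal'>"
PLAIN_TAG = "<p>"

# Matches a paragraph that holds nothing but whitespace or &nbsp; entities.
EMPTY_PARAGRAPH = re.compile(r"<p\b[^>]*>(?:\s|&nbsp;)*</p>", re.IGNORECASE)

def drop_empty_paragraphs(html: str) -> str:
    """Remove the filler paragraphs that carry no text at all."""
    return EMPTY_PARAGRAPH.sub("", html)

def percent_saved(before_bytes: int, after_bytes: int) -> float:
    """Share of the original size that the cleanup removed, in percent."""
    return (before_bytes - after_bytes) / before_bytes * 100

if __name__ == "__main__":
    print(len(PLAIN_TAG), "characters vs", len(BLOATED_TAG), "characters")
    print(drop_empty_paragraphs("<p>Text</p><p class=MsoNormal>&nbsp;</p>"))
    # The story above went from roughly 8MB to roughly 1.5MB.
    print(percent_saved(8_000_000, 1_500_000), "percent smaller")
```

Multiply that per-tag overhead by every paragraph in 29 chapters and the drop in size stops being surprising.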
So, what would take a user maybe a minute or two a chapter (unless they want to make additional changes) by simply resaving it, takes ME roughly two to four hours per chapter, depending upon just how much garbage was inserted into the record. The more crap code, the longer it takes to fix it.
Which is why I am grateful to the users who've taken the time, not waited for me, and gone ahead and FIXED their stories where it was a Word file exported to HTML. Seriously. Those of you who've already done this have saved me untold hours of work.