Great Western Coffee Shop

Sideshoots - associated subjects => News, Help and Assistance => Topic started by: grahame on November 22, 2015, 17:31:03

Title: Sorting out special characters in older posts
Post by: grahame on November 22, 2015, 17:31:03
I'm looking at a potential fix to the spewing character problem (technically it's an issue with character sets) that's been bugging us for a while, where special characters in older posts become damaged when we back up / restore.  The problem is with old forum software, a database very much more recent than it was ever designed / tested for, and a technical admin who looks after this site as a hobby, let it happen and let it get a bit out of hand (you can also blame his lack of technical expertise in one of the technologies).   However, before applying a fix I would like to check with members.

Here are the options on offer:

Option 1: Status Quo - make no change, which means that on every backup / restore cycle (something that happens every few months) the number of special characters in each rubbish sequence goes up by between 33% and 50%

Option 2: Replace special character sequences on past posts (and on future posts) with a single character such as ^.  This would show up as follows:


(Example not reproduced here; see the original post.)

Option 3: Neither of the above options is acceptable - so have someone else take a look at the problem / see what they can come up with, if anything.  I can't know / predict any outcome from this option, but the issue is a significant one.  Members of other forums may have noted that they have gone as far as closing old forums and opening new ones to overcome issues; considerable time and effort and loss of historic data comes from that, with no guarantee (from me) that I would have the time / resource to be anything like as involved in the extra peak of work as I have been in the past.  Potentially there are cost / hosting issues too - but then if the vote goes for this option, I would suggest a further discussion (or perhaps we can have it here?) as to what the terms of reference of the "take a look" commission would be and how we might appoint and potentially pay for such a look / see if it's not something that's in the volunteer sector.

For once, I will be voting in my own poll - and that will be for option 2. It seems to provide a solution to an ongoing problem that has the potential to get worse to the extent it could damage the forum's content and - in time - its very future. Indeed I think we may have already lost (but I can potentially roll back) one block of posts from about 7 or 8 years back.   Option 1 is driving on towards a wall into which we could crash rather nastily in the medium term.   Option 3 could potentially be a brave new perfect start, but you (members) would lose a lot of traction along the way and would need additions to (or a new) technical support person depending on how it went.

Title: Re: Sorting out special characters in older posts
Post by: bobm on November 22, 2015, 18:10:03
I too back option 2 - seems a happy compromise between what we have now and a very expensive and time consuming operation which may not provide a solution.

(Anyone who starts comparing Status Quo with other rock bands will be given a thread to manually replace each extraneous character one by one!   ;D )

Title: Re: Sorting out special characters in older posts
Post by: Bmblbzzz on November 22, 2015, 18:23:47
It's not just old posts, I've seen it on some quite recent ones. But I don't really understand why it happens. This is on SMF, isn't it? I know that other forums running on (ancient versions of) SMF don't have the same problem.

Anyway, Option 2 seems better than Option 1.

Title: Re: Sorting out special characters in older posts
Post by: grahame on November 22, 2015, 18:32:27
It's not just old posts, I've seen it on some quite recent ones. But I don't really understand why it happens. This is on SMF, isn't it? I know that other forums running on (ancient versions of) SMF don't have the same problem.

Anyway, Option 2 seems better than Option 1.

Thank you.

It's not really an SMF issue ... it's basically my workings on backup and restore, on the occasions that the database has been flakey, mixing up character sets.  If the vote suggests I go ahead, then I'll also take a look and see what I can do to avoid the problem breeding again now that I know what it is ...  I HAVE considered reverse engineering to unwrap it, but it's so wrapped up that it's beyond me, frankly.
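
For readers wondering how a character-set mix-up can make characters "breed", here is a minimal Python sketch of one common mechanism: text stored as UTF-8 is read back as Latin-1 and then saved as UTF-8 again on each cycle. This is an assumption for illustration only; the thread does not say which character sets were involved, and this version roughly doubles each sequence per cycle rather than growing it by the 33% to 50% quoted above. The function name mangle is made up for the example.

```python
def mangle(text: str, cycles: int) -> str:
    """Simulate `cycles` backup/restore rounds, each misreading UTF-8 bytes as Latin-1."""
    for _ in range(cycles):
        # Bytes written as UTF-8 are wrongly interpreted as Latin-1 on the way back in.
        text = text.encode("utf-8").decode("latin-1")
    return text


original = "£5"
for n in range(4):
    damaged = mangle(original, n)
    # Character count grows every cycle: 2, 3, 5, 9
    print(n, len(damaged), repr(damaged))

# A plain ASCII marker such as ^ is encoded identically in both character sets,
# so under this mechanism it passes through unchanged.
print(mangle("^", 5))
```

Under this assumed mechanism only characters outside plain ASCII grow, which is why a pound sign is affected and ordinary letters are not.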

Title: Re: Sorting out special characters in older posts
Post by: trainer on November 22, 2015, 19:22:06
As an ordinary 'punter' on the Forum, with no understanding of the technicalities, but annoyed with the disrupted messages from time-to-time, I simply wish you to take the most straightforward action to minimise the issue. Much voluntary time is spent allowing people like me to dip in and out with no responsibility other than to make comments, for which I am (and I am sure most others are) very grateful. I accept that this is a challenging matter to change and am reluctant to ask for any course of action which would mean someone else spending hours of time making my occasional time on here marginally better.

I will be grateful for any improvement.  Thanks all you technical people for what we do have.

I also have a sneaking satisfaction when those who 'know all about IT' are beaten by it.   ;D

Title: Re: Sorting out special characters in older posts
Post by: Chris from Nailsea on November 22, 2015, 23:19:58
Thanks for your comments, trainer.  ;) :D ;D

On that basis, my vote is also for option 2.  :-X

Title: Re: Sorting out special characters in older posts
Post by: GBM on November 23, 2015, 10:53:20
Totally agree with trainer.
Option 2 for me please.

Great pleasure for me to read and occasionally post.

Thank you all "admin" who run the forum

Title: Re: Sorting out special characters in older posts
Post by: Rhydgaled on November 23, 2015, 11:51:31
Option 2 for me as well, I think. Just a couple of questions though:
  • Would this replace only the garbage sequences or would the code (I assume this would be an automated replacement) not be able to distinguish them from potentially useful characters (in other words, which characters would be replaced)?
  • Would/could the replacement character also start to breed?

Title: Re: Sorting out special characters in older posts
Post by: grahame on November 23, 2015, 12:04:58
Would this replace only the garbage sequences or would the code (I assume this would be an automated replacement) not be able to distinguish them from potentially useful characters (in other words, which characters would be replaced)?

Testing has shown that we are unlikely to lose anything useful ... I have tried out much more than the sample before I suggested it

Would/could the replacement character also start to breed?


Title: Re: Sorting out special characters in older posts
Post by: Red Squirrel on November 23, 2015, 13:35:32
Just a thought: Don't get rid of all the 'special characters' from this forum - some of us have nowhere else to go...  :)

Title: Re: Sorting out special characters in older posts
Post by: Rhydgaled on November 23, 2015, 14:05:49
Would this replace only the garbage sequences or would the code (I assume this would be an automated replacement) not be able to distinguish them from potentially useful characters (in other words, which characters would be replaced)?

Testing has shown that we are unlikely to lose anything useful ... I have tried out much more than the sample before I suggested it
Would/could the replacement character also start to breed?
Sounds good.

Title: Re: Sorting out special characters in older posts
Post by: grahame on December 22, 2015, 07:04:18
I am aware that I've not actioned this yet ... I'm seeing it as important but not time-critical and have had my plate rather full.  It also needs doing at a time that I've got excellent and continuous net access, when the forum is quiet, and when I'm feeling bright, well and awake enough to do it without significant interruptions.   I'm writing this update from (!) yet another hotel room, with a stinking cold and a course to give ... not the morning to give it a go.  Besides - it's a commuter morning and if I take the site down, chances are that signalling will pop in the Thames Valley, or the windscreen wipers will fail on 153369 again!

P.S. Well done to the Great Western operational team who got another 153 down to Westbury in time to run the busiest TransWilts service of the day yesterday evening.  Units will occasionally break down, and it's really appreciated when actions are taken to get a replacement in sooner rather than later. 

Title: Re: Sorting out special characters in older posts
Post by: grahame on December 25, 2015, 04:06:51
OK - let's see how that worked ... process went well / almost too well.   I am a bit concerned at just how much the database table dropped in size - it may be that there was an awful lot of corruption in old posts, or I may have done some damage.  If something turns up that's unfortunate, I do have backups.  PLEASE let me know of any issues.

Title: Re: Sorting out special characters in older posts
Post by: Chris from Nailsea on December 25, 2015, 20:43:47
May I refer the honourable member to the comment made in post number 9 above?  :P ::) ;D

Title: Re: Sorting out special characters in older posts
Post by: grahame on December 26, 2015, 08:07:51
OK - looks good after a further 24 hours, and overnight I came up with a further test to check that we hadn't lost, at least, whole messages:

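mysql> select count(id_msg) from smf_old_messages;
+---------------+
| count(id_msg) |
+---------------+
|        185432 |
+---------------+
1 row in set (0.39 sec)

mysql> select count(id_msg) from smf_messages;
+---------------+
| count(id_msg) |
+---------------+
|        185459 |
+---------------+
1 row in set (0.11 sec)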

Also tells me we've had 27 posts since the early hours of Christmas morning.
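
Comparing row counts is a coarse check, since one lost message and one new message would cancel out. A slightly stricter test is to list any id_msg that exists in the old table but not in the new one. The sketch below uses Python's built-in sqlite3 module with a few made-up rows purely as a stand-in for the real MySQL database; the table and column names come from the queries above, while the helper name missing_ids and the sample data are assumptions.

```python
import sqlite3


def missing_ids(conn, old_table="smf_old_messages", new_table="smf_messages"):
    """Return id_msg values present in old_table but absent from new_table."""
    query = (
        f"SELECT id_msg FROM {old_table} "
        f"WHERE id_msg NOT IN (SELECT id_msg FROM {new_table}) "
        "ORDER BY id_msg"
    )
    return [row[0] for row in conn.execute(query)]


# Tiny in-memory stand-in for the two message tables (sample rows are invented).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE smf_old_messages (id_msg INTEGER PRIMARY KEY, body TEXT)")
conn.execute("CREATE TABLE smf_messages (id_msg INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO smf_old_messages VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c")])
conn.executemany("INSERT INTO smf_messages VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c"), (4, "d")])

# An empty list means every old message still exists after the cleanup.
print(missing_ids(conn))
```

The same SELECT ... NOT IN query should run unchanged at the mysql prompt against the real tables.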

We can probably consider the matter concluded ... we may see a few re-appear in new posts that use special characters, but having fixed the issue once it's very easy to do it again if it has to be - and next time would be very quick and easy.   For (my) record later - code used:

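# Replace each run of damaged high-bit bytes with a single ^ (result goes to STDOUT)
use strict; use warnings;
open(my $fh, '<', 'preclean.sql') or die "Cannot open preclean.sql: $!";
while (my $line = <$fh>) {
    $line =~ s/[\x80-\xff]{2,}(?:[\x00-\x7f][\x80-\xff]{2,})*/^/g;
    print $line;
}
close($fh);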
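
For anyone who wants to try the pattern on sample text, here is a rough Python equivalent of the substitution above, working on raw bytes. It is a sketch rather than the script actually run on the forum; the names GARBAGE and clean are made up for the example. It also illustrates which characters get replaced: any run of two or more bytes in the range 0x80 to 0xff becomes ^, so a pound sign stored as two UTF-8 bytes is replaced as well, while a lone high byte (such as a Latin-1 accented letter) is left alone.

```python
import re

# Same pattern as the Perl substitution, applied to bytes:
# a run of 2+ high bytes, optionally continued by (one ASCII byte + 2+ high bytes).
GARBAGE = re.compile(rb"[\x80-\xff]{2,}(?:[\x00-\x7f][\x80-\xff]{2,})*")


def clean(line: bytes) -> bytes:
    """Replace each damaged sequence in `line` with a single ^."""
    return GARBAGE.sub(b"^", line)


print(clean(b"cost \xc3\x82\xc2\xa35"))   # damaged pound sign collapses to ^
print(clean(b"caf\xe9"))                  # lone high byte is kept as-is
print(clean("£5".encode("utf-8")))        # valid two-byte UTF-8 is replaced too
```

Because ^ is plain ASCII, the pattern never matches it, so running the filter again on already-cleaned text changes nothing.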

Title: Re: Sorting out special characters in older posts
Post by: JayMac on December 27, 2015, 17:33:03
I note a few spurious characters appeared in a post on the Fare's Fair board yesterday when the pound sign was used. I'm wondering if any ongoing issue will be from posts that come from mobile devices. Testing...


Title: Re: Sorting out special characters in older posts
Post by: grahame on December 27, 2015, 17:55:55
From this morning:

Another of those occasions this morning - fixed.   Any posts from 10:18 to about 10:40 may need to be re-submitted - sorry about that.   A couple of strange characters have crept back in, but I can fix them next time and they won't overwhelm us again.

I need to run restores through my filter script in future ...

Title: Re: Sorting out special characters in older posts
Post by: Adelante_CCT on December 27, 2015, 20:57:54
Whether it's of any use or not, my post from that thread was posted using a laptop.
I note a few spurious characters appeared in a post on the Fare's Fair board yesterday when the pound sign was used. I'm wondering if any ongoing issue will be from posts that come from mobile devices.

Title: Re: Sorting out special characters in older posts
Post by: grahame on December 27, 2015, 21:19:49
Whether it's of any use or not, my post from that thread was posted using a laptop.
I note a few spurious characters appeared in a post on the Fare's Fair board yesterday when the pound sign was used. I'm wondering if any ongoing issue will be from posts that come from mobile devices.

I know exactly what the issue is ... just rushed it back up this morning when the database got corrupted without going through the extra filter.  However - thanks for the clues ... 90% of the time such clues are pure gold.

In my defence for the error, the orange juicer had just failed, there was water pouring through the ceiling at reception and setting the fire alarms off ... then the database issue.  Things go in threes, so it got easier thereafter.

Title: Re: Sorting out special characters in older posts
Post by: Chris from Nailsea on December 27, 2015, 21:31:14
My personal view is that you should have instructed the excellent Phil to deal with the first two of those issues (after all, he doesn't do much else all day :P), and deal with the third yourself.  ;D

Only joking, Phil.  ;)

Title: Re: Sorting out special characters in older posts
Post by: grahame on December 27, 2015, 21:51:14
Ah - I took on Christmas and others have the pleasure of the New Year, which I am rapidly selling to make it far busier than Christmas was  ;D

Title: Re: Sorting out special characters in older posts
Post by: Chris from Nailsea on December 27, 2015, 23:19:25
In which case, Phil, please accept my abject apologies: that orange juicer at WellHouse has fazed me, more than once.  :P :-[ ::)

Title: Re: Sorting out special characters in older posts
Post by: stuving on February 04, 2016, 14:32:14
You may have seen stories today about a French spelling reform, promising "the end of the circumflex" or the like. Well, don't get your hopes up - it's much more limited, and confusing, than that. And they are adding a load of accents too, notably grave ones (some replacing acute with no change in pronunciation).

In any case the change is introduced first in schools, and is optional for those who are old enough to have learnt the old rules (even if they didn't). So it may be some time before it is visible - presumably first on line, where news items are (and may still be in twenty years' time) cobbled together by kids.

What I suspect may be most noticeable to us is the change in the rules for "immigrant" words, a lot of which of course come from English. In general these will become single words (no hyphen), and will pluralise with an added -s (usually silent) whatever happens in the source language. Thus week-end becomes weekend, jazzmen becomes jazzmans, (matches is already matchs) and lieder becomes lieds.

Incidentally, grahame reported in another thread:
Sorry about the 10 minute outage - just back.    You may find special characters "breeding" again - don't worry; I'll take them out next time. I only realised once I'd taken the database down that the trick code was ... there in the database!

What exactly is that "trick code"? Is it what's needed to trap the sequences to be removed, or is it the code that produced them (either during read or write of backups) in the first place? Or something else altogether?

Title: Re: Sorting out special characters in older posts
Post by: froggycat on February 05, 2016, 10:14:51
You may have seen stories today about a French spelling reform, promising "the end of the circumflex" or the like. Well, don't get your hopes up - it's much more limited, and confusing, than that. And they are adding a load of accents too, notably grave ones (some replacing acute with no change in pronunciation).

In any case the change is introduced first in schools, and is optional for those who are old enough to have learnt the old rules (even if they didn't). So it may be some time before it is visible - presumably first on line, where news items are (and may still be in twenty years' time) cobbled together by kids.

As a French expat, in love with both my native language and the language of this beautiful island, I was horrified about this reform of the language. And all because poor darlings at school need to be spared learning anything a bit too challenging.
And simplifying the spelling at a time when spell checkers and other such validating tools are getting better and better sounds ludicrous.
Let's just hope English does not go down the same route and remains the amazing, complex language that it is! And at least you don't have to worry about too many special characters  :D

Title: Re: Sorting out special characters in older posts
Post by: trainer on February 05, 2016, 11:20:36
And all because poor darlings at school need to be spared learning anything a bit too challenging.
Let's just hope English does not go down the same route ...

It already's called American English.  ;)

Title: Re: Sorting out special characters in older posts
Post by: Chris from Nailsea on February 06, 2016, 19:54:04
Let's just hope English does not go down the same route ...

It already's called American English.  ;)

That doesn't make it right.  ::)

Title: Re: Sorting out special characters in older posts
Post by: grahame on June 02, 2016, 06:35:53
Database was out for a few minutes ... I have re-run the naughty character cleanup script as they were starting to breed again in one or two places.   Should be OK now.  Anticipate this cleanup every six months or so.

This page is printed from the "Coffee Shop" forum, which is provided by a customer of Great Western Railway. Views expressed are those of the individual posters concerned. Please contact the administrators of this site if you feel that content provided contravenes our posting rules. The forum is hosted by Well House Consultants.