UTF-8 database executing an SQL file saved in ANSI

I hope someone here is smarter than me and knows more about UTF-8.

We are going to UTF-8 because our ERP vendor has gone to UTF-8 and we need to
upgrade.

One of the things I tested is a job that contains the following WHERE clause:

where first_name like '%Ê%'

Now an E with a hat (Ê) in ANSI (Latin-1 / Windows-1252) is hex CA, bin
1100 1010

Here is where it gets interesting

If the first bit is 0, it is a single-byte character.

If the first 3 bits are 110, it is a two-byte character.

If the first 4 bits are 1110, it is a three-byte character.

If the first 5 bits are 11110, it is a four-byte character.
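Those lead-byte rules can be sketched in a few lines of Python (my own illustration, not part of the original question):

```python
def utf8_length_from_lead_byte(b: int) -> int:
    """Return the expected UTF-8 sequence length for a lead byte."""
    if b & 0b1000_0000 == 0:            # 0xxxxxxx -> single byte (ASCII)
        return 1
    if b & 0b1110_0000 == 0b1100_0000:  # 110xxxxx -> two bytes
        return 2
    if b & 0b1111_0000 == 0b1110_0000:  # 1110xxxx -> three bytes
        return 3
    if b & 0b1111_1000 == 0b1111_0000:  # 11110xxx -> four bytes
        return 4
    raise ValueError(f"0x{b:02X} is not a valid UTF-8 lead byte")

print(utf8_length_from_lead_byte(0x45))  # plain 'E' -> 1
print(utf8_length_from_lead_byte(0xCA))  # 110x xxxx -> 2
```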

Now back to the example. Ê = bin 1100 1010

So Oracle thinks it is a two byte character so the bit pattern for the % sign is
considered the second byte of the two byte character. So of course nothing is
found. This is BAD.
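The failure can be reproduced outside Oracle. In this quick Python sketch (my own illustration), the single-byte encoding of the query fragment is simply not valid UTF-8:

```python
# '%Ê%' saved in a single-byte ("ANSI" / Latin-1) file contains these bytes:
ansi_bytes = "%Ê%".encode("latin-1")
print(ansi_bytes)  # b'%\xca%'

# A UTF-8 reader sees 0xCA (110x xxxx), expects a continuation byte
# (10xx xxxx), and finds '%' (0x25 = 0010 0101) instead.
try:
    ansi_bytes.decode("utf-8")
except UnicodeDecodeError as e:
    print("invalid UTF-8:", e.reason)
```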

I am hoping someone has been through this and has some advice.

Erwin,

I hope someone here is smarter than me and knows more about UTF-8.
I'll try!

We are going to UTF-8 because our ERP vendor has gone to
UTF-8 and we need to upgrade.
Have you run the character set conversion utility to check if any problems will be detected?

where first_name like '%Ê%'

Now an E with a hat (Ê) in ANSI is hex CA,
bin 1100 1010
This is fine in whatever SINGLE BYTE character set you are using in the database right now, but it is not a valid UTF-8 sequence, for the very reasons you give below.

...

Now back to the example. Ê = bin 1100 1010

So Oracle thinks it is a two byte character so the bit
pattern for the % sign is considered the second byte of the
two byte character. So of course nothing is found. This is BAD.
This is because your data are not in UTF8.
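For comparison, the same character in properly converted UTF-8 data occupies two bytes with the required lead/continuation pattern. A quick Python check (my own addition to the thread):

```python
utf8_bytes = "Ê".encode("utf-8")
print(utf8_bytes.hex())  # 'c38a'

# Lead byte 0xC3 = 1100 0011 (110xxxxx),
# continuation 0x8A = 1000 1010 (10xxxxxx) -- a valid two-byte sequence.
print(utf8_bytes.decode("utf-8"))  # round-trips back to 'Ê'
```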

I am hoping someone has been through this and has some advice.
The globalisation manual has the details (chapter 10, I think) on using the character set conversion utility.

You must run a validation first to check whether converting between your from and to character sets would lose data; if all is OK, you can then run the conversion for real.
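A sketch of what that validation pass might look like with the Oracle Character Set Scanner (CSSCAN); the connect string and character set names here are assumptions for illustration, so check the Globalization Support Guide for your release:

```
csscan system/password FULL=Y FROMCHAR=WE8MSWIN1252 TOCHAR=AL32UTF8 LOG=utf8scan CAPTURE=Y
```

The generated log reports which rows would be converted losslessly, which would change size, and which would lose data.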

I presume that you won't be doing this testing in production! :wink:

HTH

Cheers,
Norm. [TeamT]


The data in the database has been converted correctly. I missed the point that if my workstation is single-byte, then Oracle's middleware will convert it appropriately. So if I compare a single-byte E-accent to a UTF-8 E-accent in the database, everything is OK.

The problem is when I save the query as an ANSI file and run it on the Unix box, whose default NLS_LANG is UTF8: SQL*Plus then expects to do no conversion, and so it croaks on the single-byte E-accent, because that byte looks like the start of a multi-byte character. And of course our job scheduling system runs on Unix and submits via SQL*Plus, and it MUST be UTF8, because that is what the ERP vendor requires.
The problem is when I save the query as an ANSI file and have it run on the UNIX box whose default NLS_LANG is UTF8 then it expects to do no conversion and therefore croaks on the single byte E accent because it sort of looks like a multi-byte. And of course our job scheduling system runs on Unix and submits via SQL*Plus and it MUST be UTF8 because that is what the ERP vendor requires.