Convert UTF16 to Latin1

From Cuis CookBook



Latest revision as of 20:16, 12 May 2025

Problem. We have an XML data file, but it is encoded in UTF-16. Cuis is based on Latin1 characters, so, if possible, we should convert the file to this character set before doing anything else with it.

Solution. Provided by Juan on the mailing list on 11-Jun-2022.

"Read the raw bytes and look at the first two for a byte order mark (BOM)."
utf16 := 'expo-test-IT-UTF16.xml' asFileEntry binaryContents.
possibleBOM := utf16 copyFrom: 1 to: 2.
isLittleEndian := true. "use your best guess"
"BOM FF FE means little-endian: remember it and drop the BOM."
possibleBOM = #[255 254] ifTrue: [
    isLittleEndian := true.
    utf16 := utf16 copyFrom: 3 to: utf16 size ].
"BOM FE FF means big-endian: remember it and drop the BOM."
possibleBOM = #[254 255] ifTrue: [
    isLittleEndian := false.
    utf16 := utf16 copyFrom: 3 to: utf16 size ].
"Turn each 16-bit unit into one Character; a trailing odd byte is skipped."
String streamContents: [ :out |
    index := 1.
    [index < utf16 size] whileTrue: [
        codePoint := utf16 unsignedShortAt: index bigEndian: isLittleEndian not.
        out nextPut: (Character codePoint: codePoint).
        index := index + 2 ]].
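
To try the conversion without a file on disk, the same steps can be run on a literal ByteArray. The snippet below is only a sketch: the sample bytes and the variable names bytes and result are made up for illustration and are not part of Juan's solution. It uses only the messages already shown above.

```smalltalk
"Hypothetical sample: BOM FF FE (little-endian), then 'H' and 'i' as 16-bit units."
| bytes isLittleEndian index codePoint result |
bytes := #[255 254 72 0 105 0].
isLittleEndian := (bytes copyFrom: 1 to: 2) = #[255 254].
"Drop the two BOM bytes before decoding."
bytes := bytes copyFrom: 3 to: bytes size.
result := String streamContents: [ :out |
	index := 1.
	[ index < bytes size ] whileTrue: [
		codePoint := bytes unsignedShortAt: index bigEndian: isLittleEndian not.
		out nextPut: (Character codePoint: codePoint).
		index := index + 2 ]].
result
```

Evaluating this in a Workspace with "Print it" should show 'Hi'. The sample always starts with a little-endian BOM, so the "best guess" fallback from the recipe is not needed here.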