Reading UTF-8 Encoding xml file sqlserver

Category: sql server xml

Question

Chakradhar Masineni on Wed, 05 Nov 2014 04:44:49


Hi ,

I am recieving a xml file from a third party vendor. it is encoded in UTF-8. while i am reading it i am getting the below error.

Msg 9420, Level 16, State 1, Line 3

XML parsing: line 30117390, character 33, illegal xml character

the characters causing the problem are like è,Ö,è.

 My database default collation is ‘SQL_Latin1_General_CP1_CI_AS’

I am using the below query to read the xmlfile.

declare @xml xml

SELECT 
	@xml= CAST(x AS XML)
FROM 
	OPENROWSET(BULK 'D:\sample.xml',SINGLE_BLOB) AS T(x)  
	
select  
	 X.product.value('(ID/text())[1]', 'varchar(50)') as ID ,
	 X.product.value('(Name/text())[1]', 'varchar(50)') as Name  
	   
from 
	@xml.nodes('Students/Student') AS X(product) 
		
		 

how can i read the file successfully. any help is appreciated.

Thanks in advance.

 

Replies

Ch. Rajen Singh on Wed, 05 Nov 2014 08:29:35


This issue normally happens when the XML file is not in the correct format. To save in the correct format open the xml file and click save as. Choose the encoding option as "UTF-8".

Ravish Shah on Tue, 07 Aug 2018 19:26:13


I confirmed the file i am trying to read is in UTF-8 format, but still could not get to read it correctly, while reading i noticed it converts characters to ANSI Format. E.g. My File has word "Télécom" i.e. in UTF-8 Format, however with SINGLE_BLOB it was read as "Télécom" i.e. in ANSI. Database DEfault Collation is "SQL_Latin1_General_CP1_CI_AS".  Any further instructions will be helpful.

Yitzhak Khabinsky on Tue, 07 Aug 2018 20:07:02


Hi Ravish,

Is it possible if you can share with us the XML file in question on MS OneDrive?

Before doing that please obfuscate the values in the file.


Erland Sommarskog on Tue, 07 Aug 2018 22:02:41


I don't this is possible with SINGLE_xLOB. I tried specifying the CODEPAGE option, but it was ignored.

However this worked to for me. This was my test document:

<Root>
   <Räksmörgås>Lantliv</Räksmörgås>
   <Lantliv>Räksmörgås</Lantliv>
</Root>

And I had saved this in UTF-8.

I had then used this format file:

13.0
1
1      SQLCHAR  0  0    "</Root>"    1    LINE  ""

And I rand this query:

SELECT convert(xml, convert(varchar(MAX), c) + '</Root>')
FROM  OPENROWSET(BULK 'C:\temp\slask.xml', FORMATFILE = 'C:\temp\slask2.fmt', CODEPAGE=65001) AS T(c)

Codepage 65001 is UTF-8

A few notes:

  • First of all this requires SQL 2014 SP2. Earlier versions does not support UTF-8 at all.
  • You need to know the closing delimiter in advance.
  • Furthermore, it needs to be a proper XML document with a single top node.
  • The field terminator in a format file can be up to ten characters. If root node is longer, you may be able to take only the last nine characters + the >. But it only works if that sequence does not appear elsewhere in the document.

Yitzhak Khabinsky on Wed, 08 Aug 2018 19:36:26


Hi Erland,

I decided to test it on the latest MS SQL Server 2017, build 14.0.3025.34.

My SSMS is v.17.8.1

And it worked!!!

Microsoft documentation says the following: "...** Important *\* Versions prior to SQL Server 2016 (13.x) do not support code page 65001 (UTF-8 encoding)..." is here:

Openrowset

XML file on the file system (explicitly specifies the UTF-8 encoding, but it is not mandatory): d:\Temp\ErlandUTF-8.xml 

<?xml version="1.0" encoding="UTF-8"?>
<Root>
   <Räksmörgås>Lantliv</Räksmörgås>
   <Lantliv>Räksmörgås</Lantliv>
</Root>

SQL:

DECLARE @xml XML;

SELECT @xml = XmlDoc   
FROM OPENROWSET (BULK N'd:\Temp\ErlandUTF-8.xml', SINGLE_BLOB) AS Tab(XmlDoc);

-- just to see if it is loaded correctly from the file system
SELECT @xml;

;WITH rs AS
(
	SELECT col.value('(Lantliv)[1]','VARCHAR(255)') AS Lantliv
	   , col.value('(Räksmörgås)[1]','VARCHAR(255)') AS Räksmörgås
	FROM @xml.nodes('/Root') AS tab(col)
)
SELECT * FROM rs;

Output:

<Root>
  <Räksmörgås>Lantliv</Räksmörgås>
  <Lantliv>Räksmörgås</Lantliv>
</Root>
Grid output:
Lantliv	Räksmörgås
Räksmörgås	Lantliv







Erland Sommarskog on Wed, 08 Aug 2018 21:38:39


Great! Then the horrible kludge I suggested can be forgotten.

I did not test with blob, because I thought it would interpret the bits as the internal represenation of the xml data type, but I was wrong.