
C# and CSV: Garbled Chinese Under ISO 8859-1, and Some Tips

news 2025/10/14 6:07:05

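1. Preface

This post does not mean I have solved the problem; it is mainly a set of notes.

The reason is that a CSV can come out garbled for many different causes.
I cannot list every possibility, so I can only go through them one issue at a time.
Most of them come down to whether the file has a BOM and which encoding it uses.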

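1.1 The problem, and why it gets complicated

The problem is garbled Chinese text when reading a CSV. These notes mainly cover the ISO 8859-1 case.
Why it gets complicated:
(1) Limitations on the C# side.
(2) The file has to be read with CsvHelper's CsvReader. If you parse the file by hand, you may well never hit this problem.
(3) The 8859-1 encoding itself. Generally speaking, a standard UTF-8 file with a BOM does not run into this problem.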

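1.2 How the garbling arises

CSV stands for Comma-Separated Values.
So the ',' matters a great deal: if the reader cannot tell whether a comma is a delimiter or part of a field's content, the columns come out wrong. The garbled characters themselves come from decoding the file's bytes with the wrong encoding.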
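To make the delimiter-versus-content point concrete, here is a minimal sketch of splitting one CSV line while honoring double quotes. The class and method names (`CsvLineSketch`, `SplitLine`) are made up for illustration, and this is not how CsvHelper is implemented; it only assumes the common convention that a quoted field may contain commas and that `""` inside quotes means a literal quote.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only; not part of CsvHelper.
public static class CsvLineSketch
{
    // Splits one CSV line, treating commas inside double quotes as content.
    public static List<string> SplitLine(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // doubled quote = literal quote
                    i++;
                }
                else if (c == '"')
                {
                    inQuotes = false;    // closing quote
                }
                else
                {
                    current.Append(c);   // commas in here are content
                }
            }
            else if (c == '"')
            {
                inQuotes = true;         // opening quote
            }
            else if (c == ',')
            {
                fields.Add(current.ToString()); // delimiter ends the field
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }
        fields.Add(current.ToString());
        return fields;
    }
}
```

For the line `a,"b,c",d`, a plain `Split(',')` yields four pieces, while `CsvLineSketch.SplitLine` yields the three intended fields.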

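2. ISO 8859-1

Most of what I found online deals with UTF-8.

I am not sure whether it was something I did, but I saved the file from Excel in the standard way and the resulting CSV was still in an ASCII-style format. (As I recall it did not use to be like this, so treat this as an open question.)

Oh, one aside: "computers are really base 8" is something a university teacher once told us, and I never understood it. Presumably it refers to the 8-bit byte: a value such as 0x55 is one byte, written as two hexadecimal digits of 4 bits each. The English-speaking designers of these systems figured that 7 bits, plus some extended characters, would be enough. This was not a hardware constraint, just a simple requirement turned into an implementation. 2 to the 7th power is 128, which gives the 128 ASCII code points (95 of them printable); the eighth bit is what single-byte extensions such as ISO 8859-1 use.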

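General tip: file

Knowing which character encoding the current CSV uses is very important.
If you need to check (in practice you always do, and it should be the first step), install Git (which ships the command in Git Bash), or use a Linux machine, and see what it says: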

file my.csv
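When the `file` command is not at hand, a rough check can be done in C# as well. The sketch below is only an approximation of the idea, not a reimplementation of `file`: it looks for the UTF-8 BOM, then for pure 7-bit content, then tries a strict UTF-8 decode. The class name `EncodingGuessSketch` and the returned labels are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper for illustration; the labels are arbitrary.
public static class EncodingGuessSketch
{
    public static string Guess(byte[] bytes)
    {
        // The UTF-8 BOM is the three bytes EF BB BF.
        byte[] bom = Encoding.UTF8.GetPreamble();
        if (bytes.Length >= bom.Length && bytes.Take(bom.Length).SequenceEqual(bom))
            return "utf-8-bom";

        // Only bytes below 0x80: plain ASCII, which every encoding
        // discussed here decodes the same way.
        if (bytes.All(b => b < 0x80))
            return "ascii";

        try
        {
            // throwOnInvalidBytes: true makes the decoder throw instead of
            // silently substituting replacement characters.
            var strictUtf8 = new UTF8Encoding(
                encoderShouldEmitUTF8Identifier: false,
                throwOnInvalidBytes: true);
            strictUtf8.GetString(bytes);
            return "utf-8";
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8: some 8-bit encoding such as GBK or ISO 8859-x.
            // Telling those apart needs knowledge of where the file came from.
            return "unknown-8bit";
        }
    }
}
```

A call such as `EncodingGuessSketch.Guess(File.ReadAllBytes("my.csv"))` then gives a first hint about which `Encoding` to pass to `StreamReader`.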

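I mention 8859-1 here to tie back to the title. For most programming languages, ASCII-style text is simply not a problem, because it is their native mode.
For C#, though, it turns out to be the hardest case, because the Encoding class only offers a handful of ready-made properties: ASCII, BigEndianUnicode, Default, Unicode, UTF32, UTF7 and UTF8 (plus Latin1 from .NET 5 onward).

Some may think Default or ASCII will do. Default is not UTF-16: on .NET Framework it is the system's ANSI code page, and on .NET Core / .NET 5+ it is UTF-8. And is ASCII the same as 8859? For bytes below 0x80, yes; in practice, no, because Encoding.ASCII is 7-bit only and turns every byte above 0x7F into '?'.
From a certain angle 8859 does look a bit like UTF-8, since both are ASCII-compatible, except that a UTF-8 file can carry a start marker (the BOM) and 8859 has none. Note also that a real ISO 8859-1 file cannot contain Chinese characters at all, so when file reports ISO-8859 for a CSV with Chinese text, the bytes are most likely GBK.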
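To see the difference between these encodings on actual bytes, the sketch below decodes the same four bytes three ways. It assumes .NET Core 3.0 or later (or .NET 5+), where `CodePagesEncodingProvider` is available and must be registered before code page 936 can be used; the class name `DecodeComparisonSketch` is made up for illustration.

```csharp
using System;
using System.Text;

// Hypothetical demo class; assumes .NET Core 3.0+ / .NET 5+.
public static class DecodeComparisonSketch
{
    // Returns the input decoded as ASCII, ISO 8859-1 and GBK, in that order.
    public static string[] DecodeThreeWays(byte[] bytes)
    {
        // Without this registration, GetEncoding(936) throws on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        return new[]
        {
            Encoding.ASCII.GetString(bytes),                     // 7-bit: high bytes become '?'
            Encoding.GetEncoding("iso-8859-1").GetString(bytes), // one byte maps to one character
            Encoding.GetEncoding(936).GetString(bytes),          // GBK: two bytes per Chinese character
        };
    }
}
```

For the bytes `D6 D0 CE C4` (the GBK encoding of "中文"), the three results are `????`, `ÖÐÎÄ` and `中文`. Only the last one is what the file's author intended, which is why picking code page 936 matters here.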
The fix is simple:

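                // Read the file contents
                byte[] fileBytes = File.ReadAllBytes(defectCodeMappingPath);
                using (var stream = new MemoryStream(fileBytes))
                //using (var reader = new StreamReader(stream, Encoding.UTF8)) // wrong: the Chinese text is garbled
                // The file is reported as ISO-8859; the candidates are GBK or UTF-8, and GBK (code page 936) is the one that works.
                // On .NET Core / .NET 5+, call Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) first.
                using (var reader = new StreamReader(stream, Encoding.GetEncoding(936)))  // ok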

In other words, just use Encoding.GetEncoding(936) instead.
Then:

using (var csv = new CsvReader(reader, new CsvConfiguration(CultureInfo.InvariantCulture)))

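3. UTF-8 and the BOM

CSV comes in many variants, but if we take what Microsoft appears to treat as the standard, it is a BOM (3 bytes) followed by UTF-8.
That case needs to be considered too.
For example, here is a piece of code I wrote: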


        public bool ReduceCsv_v2(string inputdataall_file,ref CsvMemObj curCsvObj)
        {
            try
            {
                lstAllSeg.Clear();
                using (var reader = new StreamReader(inputdataall_file))
                using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
                {
                    var inputData = csv.GetRecords<PadItem>();//.ToList(); 

                    var qrySegNgs = from cust in inputData group cust by cust.MountNb;

                    // customerGroup is an IGrouping<string, PadItem>, keyed by MountNb
                    foreach (var customerGroup in qrySegNgs)
                    {
                        //Console.WriteLine(customerGroup.Key);
                        lstAllSeg.Add(int.Parse(customerGroup.Key));
                        //List<int> lstNgSeg = new List<int>();
                    }
                    lstAllSeg.Sort();
                    // reader and csv are closed and disposed by the using statements
                }

                foreach (var item in lstAllSeg)
                {
                    curCsvObj.lstPannels.Add(new Panel(item));
                }
                return true;
            }
            catch (System.Exception ex)
            {
                GenUtils.MyErrExcu.E(ex);
                return false;
            }
        }

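There is no CSV encoding conversion in this code
because I had already handled it beforehand.
A separate program receives the file, then checks and converts it: if there is no BOM, it adds one; if the file is in another encoding, it converts it to UTF-8.
After that, the file can be read directly.
Something like this: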

public class CvsCovertUtf8
    {

        private static Lazy<CvsCovertUtf8> _Instance = new Lazy<CvsCovertUtf8>(() => new CvsCovertUtf8());

        public static CvsCovertUtf8 Instance
        {
            get
            {
                return _Instance.Value;
            }
        }
        public string ScrFolderPath { get; set; }
        public string TargetFolderPath { get; set; }
        // Works: converts an ASCII/ANSI file to UTF-8
        public bool CovertAsc2Utf8(string absSrcFilePath,string absDestFilePath)
        {
            try
            {
                if (!File.Exists(absSrcFilePath))
                {
                    MyLogInfo.LError("FileDoesNotExist:" + absSrcFilePath);
                    return false;
                }
                if (File.Exists(absDestFilePath))
                {
                    MyUtils.DelGenFile(absDestFilePath);
                }
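                // Encoding.Default is the system ANSI code page on .NET Framework (936/GBK on Simplified Chinese Windows).
                // On .NET Core / .NET 5+ it is UTF-8, so pass Encoding.GetEncoding(936) explicitly there.
                using (StreamReader sr = new StreamReader(absSrcFilePath, Encoding.Default))
                // Encoding.UTF8 makes StreamWriter write the 3-byte BOM at the start of the file.
                using (StreamWriter sw = new StreamWriter(absDestFilePath, false, Encoding.UTF8))
                {
                    string strtxt = sr.ReadToEnd();
                    sw.Write(strtxt);
                    // The using statements flush and close both streams.
                }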
                return true;
            }catch(System.Exception ex)
            {
                MyErrExcu.E(ex);
                return false;
            }

        }
    }

4. Detecting and adding a BOM

                // Read the file contents
                byte[] fileBytes = File.ReadAllBytes(defectCodeMappingPath);

                // UTF-8 BOM
                byte[] bom = Encoding.UTF8.GetPreamble();

                // Check whether the file starts with a BOM (Take and SequenceEqual need System.Linq)
                bool hasBom = fileBytes.Length >= bom.Length && fileBytes.Take(bom.Length).SequenceEqual(bom);

                // If the file has no BOM, add one manually
                if (!hasBom)
                {
                    byte[] fileWithBom = new byte[bom.Length + fileBytes.Length];
                    bom.CopyTo(fileWithBom, 0);
                    fileBytes.CopyTo(fileWithBom, bom.Length);
                    fileBytes = fileWithBom;
                } // Do not add the BOM like this: the text is garbled after adding it

                using (var stream = new MemoryStream(fileBytes))
                using (var reader = new StreamReader(stream, Encoding.GetEncoding(936)))  // ok
                using (var csv = new CsvReader(reader, new CsvConfiguration(CultureInfo.InvariantCulture)))

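The code above is the original version. The idea was right, I think, but it is missing one step: converting the content to UTF-8.
In other words, a BOM always goes together with UTF-8 (or possibly UTF-16); it can never coexist with an encoding like 8859.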
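The missing step can be sketched as follows: decode the bytes with their real source encoding first, re-encode the text as UTF-8, and only then prepend the BOM. This is a minimal sketch rather than the converter used in my project; the class and method names (`Utf8BomSketch`, `ToUtf8WithBom`) are made up, and it assumes that a file already starting with `EF BB BF` is UTF-8 and can be left alone.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical helper for illustration only.
public static class Utf8BomSketch
{
    // Returns the content as BOM + UTF-8 bytes.
    // sourceEncoding is the encoding the raw bytes are really in (e.g. code page 936).
    public static byte[] ToUtf8WithBom(byte[] raw, Encoding sourceEncoding)
    {
        byte[] bom = Encoding.UTF8.GetPreamble();
        bool hasBom = raw.Length >= bom.Length && raw.Take(bom.Length).SequenceEqual(bom);
        if (hasBom)
        {
            return raw; // assumed to be BOM + UTF-8 already
        }

        // Step 1: bytes -> text, using the real source encoding.
        string text = sourceEncoding.GetString(raw);

        // Step 2: text -> UTF-8 bytes (GetBytes does not emit a BOM by itself).
        byte[] utf8 = Encoding.UTF8.GetBytes(text);

        // Step 3: prepend the BOM to the UTF-8 bytes.
        return bom.Concat(utf8).ToArray();
    }
}
```

The result can be written with `File.WriteAllBytes` and then read with a plain `new StreamReader(path)`, which detects the BOM by default. On .NET Core / .NET 5+, `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)` has to be called before `Encoding.GetEncoding(936)`.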