Support of partially malformed csv files

Jul 29, 2010 at 9:33 PM
Hello,

first of all: thank you for sharing this very useful library with us. Since the world is not perfect, I sometimes encounter CSV files like this:

first,second,third
"value1","value2","value3"
"value4",
"value6","value7","value8"

One can clearly see that line 2 is malformed, since it should be something like this:

"value4","",""

or at least:

"value4",,

But the trailing value delimiters were, for whatever reason, omitted.

The problem is that when traversing all lines of the file with KBCsv, you get an ArgumentOutOfRangeException when accessing the second row via dataRecord["third"] or dataRecord[2].

A possible solution would be to always fill the DataRecord up to the maximum number of columns expected, i.e. the column count of the header, or, if there is no header line, the maximum column count of the rows read so far. The point in the code to hook in for that could be CsvParser.GetValues() (CsvParser.cs, line 665).

What do you think of that?
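A minimal sketch of the padding idea described above, written against plain lists of strings rather than KBCsv's own types. RecordPadding and PadToColumnCount are hypothetical names used only for illustration; they are not part of the library, and the rows are assumed to be parsed already.

```csharp
using System;
using System.Collections.Generic;

public static class RecordPadding
{
    // Hypothetical helper (not part of KBCsv): returns a copy of 'values'
    // extended with 'fill' until it holds at least 'expectedCount' entries.
    // Rows that are already long enough are copied unchanged.
    public static List<string> PadToColumnCount(IList<string> values, int expectedCount, string fill = "")
    {
        if (values == null) throw new ArgumentNullException(nameof(values));

        var padded = new List<string>(values);
        while (padded.Count < expectedCount)
        {
            padded.Add(fill);
        }
        return padded;
    }
}

public static class PaddingDemo
{
    public static void Main()
    {
        // Assumed input: a header with three columns and one short row.
        var header = new[] { "first", "second", "third" };
        var rows = new List<string[]>
        {
            new[] { "value1", "value2", "value3" },
            new[] { "value4" },
            new[] { "value6", "value7", "value8" },
        };

        foreach (var row in rows)
        {
            // Every row now has header.Length entries, so index 2 is always valid.
            var padded = RecordPadding.PadToColumnCount(row, header.Length);
            Console.WriteLine(string.Join("|", padded));
        }
    }
}
```

Without a header line, the expected count could instead be tracked as a running maximum of the row lengths seen so far, which means earlier rows may have been returned shorter than later ones.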
Coordinator
Aug 26, 2010 at 12:50 PM

Hi there,

This scenario is supported, although it takes a little more work than usual. You can obtain the index of a header column via the HeaderRecord[string] indexer. Then, for each data record, you can check whether the value exists via the DataRecord.Values list. Here's a little test I just threw together to show how:

[Fact]
public void TestReadingDataWithSomeFieldsMissing()
{
    // The second data record ("4,5") is missing its third field.
    _csvReader = CsvReader.FromCsvString(string.Format("first,second,third{0}1,2,3{0}4,5{0}6,7,8", Environment.NewLine));

    // Resolve each column name to its index once, up front.
    var headerRecord = _csvReader.ReadHeaderRecord();
    var firstIndex = headerRecord["first"];
    var secondIndex = headerRecord["second"];
    var thirdIndex = headerRecord["third"];

    while (_csvReader.HasMoreRecords)
    {
        var dataRecord = _csvReader.ReadDataRecord();

        // Only index into the record if it actually has that many values.
        var firstValue = dataRecord.Values.Count > firstIndex ? dataRecord[firstIndex] : null;
        var secondValue = dataRecord.Values.Count > secondIndex ? dataRecord[secondIndex] : null;
        var thirdValue = dataRecord.Values.Count > thirdIndex ? dataRecord[thirdIndex] : null;

        Console.WriteLine("{0} {1} {2}", firstValue, secondValue, thirdValue);
    }
}

Admittedly, it's a little clunky. I'll have a think about whether there's anything I can do to the API that makes this more natural. Perhaps a GetValueOrDefault method on the DataRecord class would suffice:

while (_csvReader.HasMoreRecords)
{
    var dataRecord = _csvReader.ReadDataRecord();
    var firstValue = dataRecord.GetValueOrDefault("first");
    var secondValue = dataRecord.GetValueOrDefault("second");
    var thirdValue = dataRecord.GetValueOrDefault("third");

    Console.WriteLine("{0} {1} {2}", firstValue, secondValue, thirdValue);
}
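Until something like that exists, a helper along these lines could be written in user code. This is only a sketch under assumptions: ValueAtOrDefault is a hypothetical extension method on IList<string>, not a KBCsv API, and a plain dictionary stands in for the header lookup.

```csharp
using System;
using System.Collections.Generic;

public static class RecordValueExtensions
{
    // Hypothetical helper (not part of KBCsv): returns values[index] when the
    // index is within range, otherwise 'defaultValue'.
    public static string ValueAtOrDefault(this IList<string> values, int index, string defaultValue = null)
    {
        if (values == null) throw new ArgumentNullException(nameof(values));

        return index >= 0 && index < values.Count ? values[index] : defaultValue;
    }
}

public static class LookupDemo
{
    public static void Main()
    {
        // Stand-in for the header lookup: column name to column index.
        var columnIndexes = new Dictionary<string, int>
        {
            { "first", 0 },
            { "second", 1 },
            { "third", 2 },
        };

        // Assumed pre-parsed rows; the second one is missing its last field.
        var rows = new List<string[]>
        {
            new[] { "1", "2", "3" },
            new[] { "4", "5" },
            new[] { "6", "7", "8" },
        };

        foreach (var row in rows)
        {
            var firstValue = row.ValueAtOrDefault(columnIndexes["first"]);
            var secondValue = row.ValueAtOrDefault(columnIndexes["second"]);
            var thirdValue = row.ValueAtOrDefault(columnIndexes["third"], "(missing)");

            Console.WriteLine("{0} {1} {2}", firstValue, secondValue, thirdValue);
        }
    }
}
```

If the Values list of a data record implements IList<string>, the same helper could be applied to it directly, which keeps the bounds check in one place instead of repeating it per column.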


Best,
Kent