No, this is not yet another post about Excel’s horrible/wrong default handling of CSV files.
This is a rant about how dealing with Excel files can still be painful in 2022, despite the open XLSX format and long-standing libraries like NPOI that are supposed to make reading them easier.
Being a rant, this post will not be diplomatic, it will not weigh different perspectives against each other, or — god forbid — consider other use-cases but my own ;-).
Most of the issues are on the coding side where NPOI in my opinion either lacks abstraction or leaks implementation details of the XLSX format that are not per se Excel concepts.
One of the issues is on the UX side of Excel.
Don’t get me wrong.
I am sure that even getting to the API that NPOI offers was a lot of work.
But this makes it even sadder, that all this work was done and then the last 10 % of work for getting a half-way sane API is left to the user.
I am aware of ExcelMapper, but that already was a step too far for me.
All I wanted was to be able to read single cells without having to deal with low-level Excel idiosyncracies and even less the idiosyncracies of its file formats.
Ok, here we go.
Blank vs. Empty vs. Non-Existing Cells
Despite Excel’s laissez-faire approach to cell content types, NPOI strictly differentiates between blank cells, empty cells and non-existing cells.
While the first two may have corresponding semantics in Excel, the latter is just an artifact of how the XLSX format works.
Neither of them is abstracted away by NPOI.
In addition, we have to consider cells with formulas in them.
Those have their own type and then another property for the type of the formula result.
When reading data from Excel files in C#, I don’t care how the data was entered, but just what the final cell value is.
The way NPOI works, checking if a cell has no data in it, looks like this:
bool IsCellInRowEmpty(IRow row, int columnIndex) => row.GetCell(columnIndex) is ICell cell ? cell.CellType is CellType.Blank || (((cell.CellType is CellType.Formula && cell.CachedFormulaResultType is CellType.String) || cell.CellType is CellType.String) && string.IsNullOrEmpty(cell.StringCellValue)) : true
And I would not be surprised if this code was still missing some edge-case.
This is not what a sane API looks like.
Multiline Cells
Excel string cells can have multiple lines in them.
When reading the cell content of a multi-line cell with NPOI, you get a .NET string
.
So far, so good.
How do you get the individual lines from that string?
If you guessed multiLineCellContent.Split(Environment.NewLine)
, you guessed wrong.
This happens to work on Linux, but will not work on Windows.
Why?
Because NPOI will always give you a string with \n
(LF) as a line break.
This is surprising.
Why does NPOI do that?
This is an artifact of the XLSX format.
XLSX is a zip file with some folder structure and a bunch of XML files in those folders.
One of those XML files contains the strings that are used in the workbook’s cells.
Our multiline cell content is saved as a multiline string as the content of some XML element.
The line endings used in that file do not matter, they could be either CRLF or LF, because of what comes next.
The XML specification requires XML processors to “behave as if it normalized all line breaks” to LF, see https://www.w3.org/TR/REC-xml/#sec-line-ends.
Apparently, the XML processor employed by NPOI replaces all line-endings with LF (in contrast to merely “behave as if”).
Thus, our multiline string ends up with LF line-endings after XML processing and is returned to us like that.
I raised an issue about this, but it was not considered important.
Changing Cell Types as a User
On top of all the purely technical difficulties come the complexities of making an Excel file a user-friendly input option.
The user may have a cell of type text and enter a number.
The code then upon trying to read a number from a cell should also consider cells of type string and parse them according to some specified format.
In the early stages of my code other things had a higher priority, so it did not do any parsing, but required that numbers be entered in number cells.
This required some users to change the cell type of the cell where they had entered the number from string type to number type
Guess what happens when you change the cell type in the menu and then save the file?
The answer is: nothing.
To actually reflect the cell type change in the saved XLSX file, you have to re-enter the value.
Great job, Excel!
The End
In the end, I solved all of the above and more by establishing yet another layer of abstraction on top of NPOI, extensively covered by XLSX-driven unit tests.
That layer gave me proper enumeration of rows and cells, proper string line-endings and proper checks for empty cells and the actual cell type when using formulas.
Yes, all is well that ends well, I suppose.
I still would rather not have learned those intricate details of XLSX and instead have stayed blissfully ignorant while using a sane XLSX library.