Monday, March 22, 2010

Extracting OpenOffice.org Calc cells from the command line

In theory it should be simple... get cell value from ODS file, sheet x, row y, column z.

With OpenDocument format, content.xml stores spreadsheet data with the following schema:

<table:table>
  <table:table-row>
    <table:table-cell> ....

So a simple XPath expression like //table:table[1]/table:table-row[5]/table:table-cell[3]  should give me "Sheet1:C5".

With xmlstarlet command would be:

xmlstarlet sel -N office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" -N table="urn:oasis:names:tc:opendocument:xmlns:table:1.0" -t -v "//table:table[1]/table:table-row[5]/table:table-cell[3]" content.xml

BUT.. in order to save space, blank rows and columns are not stored.. instead an attribute is used to indicate how many blank rows or columns, so you can see something like <table:table-row table:number-rows-repeated=10/>.

This breaks the easy XPath expression above, and makes it difficult to find specific co-ordinates.

I've developed an XSLT transformation that will take content.xml and expand it to be indexable.


<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
   xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0"
   xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">



<xsl:output method = "xml" indent = "yes" encoding = "UTF-8" omit-xml-declaration = "no"/>

<xsl:template match="table:table-cell">
 
        <xsl:choose>
        <xsl:when test="@table:number-columns-repeated">
          <xsl:variable name="numcols" select="number(@table:number-columns-repeated)"/>
          <xsl:choose>
          <xsl:when test="$numcols > 1000">
              <xsl:call-template name="processcol">
            <xsl:with-param name="currentCol" select="0"/>
            <xsl:with-param name="totalCols" select="1"/>
              </xsl:call-template>
          </xsl:when>
          <xsl:otherwise>
              <xsl:call-template name="processcol">
            <xsl:with-param name="currentCol" select="0"/>
            <xsl:with-param name="totalCols" select="$numcols"/>
              </xsl:call-template>
          </xsl:otherwise>
          </xsl:choose>
        </xsl:when>
        <xsl:otherwise>
          <xsl:copy-of select="."/>
        </xsl:otherwise>
        </xsl:choose>
</xsl:template>

<xsl:template match="table:table-row">
 
        <xsl:choose>
        <xsl:when test="@table:number-rows-repeated">
      <xsl:variable name="numrows" select="number(@table:number-rows-repeated)"/>
          <xsl:choose>
          <xsl:when test="$numrows > 3000">
              <xsl:call-template name="processrow">
            <xsl:with-param name="currentRow" select="0"/>
            <xsl:with-param name="totalRows" select="1"/>
                 </xsl:call-template>
          </xsl:when>
          <xsl:otherwise>
              <xsl:call-template name="processrow">
            <xsl:with-param name="currentRow" select="0"/>
            <xsl:with-param name="totalRows" select="$numrows"/>
              </xsl:call-template>
          </xsl:otherwise>
          </xsl:choose>
    </xsl:when>
        <xsl:otherwise>
          <xsl:call-template name="processrow">
        <xsl:with-param name="currentRow" select="0"/>
        <xsl:with-param name="totalRows" select="1"/>
          </xsl:call-template>
        </xsl:otherwise>
        </xsl:choose>

</xsl:template>

<xsl:template match="@*|*">
   <xsl:copy>
     <xsl:apply-templates select="@*|node()"/>
   </xsl:copy>
</xsl:template>

<xsl:template name="processrow">
    <xsl:param name="currentRow"/>
    <xsl:param name="totalRows"/>
        <xsl:choose>
    <xsl:when test="$currentRow < $totalRows">
<xsl:copy>
<xsl:apply-templates select="table:table-cell"/>
</xsl:copy>
        <xsl:call-template name="processrow">
        <xsl:with-param name="currentRow" select="$currentRow + 1"/>
        <xsl:with-param name="totalRows" select="$totalRows"/>
        </xsl:call-template>
    </xsl:when>
    </xsl:choose>
</xsl:template>

<xsl:template name="processcol">
    <xsl:param name="currentCol"/>
    <xsl:param name="totalCols"/>
        <xsl:choose>
    <xsl:when test="$currentCol < $totalCols">
<xsl:copy-of select="."/>
        <xsl:call-template name="processcol">
        <xsl:with-param name="currentCol" select="$currentCol + 1"/>
        <xsl:with-param name="totalCols" select="$totalCols"/>
        </xsl:call-template>
    </xsl:when>
    </xsl:choose>
</xsl:template>

</xsl:stylesheet>

This can be called simply using : xmlstarlet tr /usr/local/bin/expand.xslt content.xml > output.xml

and then queried using: xmlstarlet sel -N office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" -N table="urn:oasis:names:tc:opendocument:xmlns:table:1.0" -t -v "//table:table[1]/table:table-row[5]/table:table-cell[3]" output.xml

NOTE: Since this is using recursive XSLT, I put a cap on 3000 repeated rows, and 1000 repeated columns.  So this will work *unless* you have a spreadsheet with over 3000 blank rows (and you want to query row 3005), OR over 1000 blank columns (and you want to query column 1001) .


Putting this all together in a bash script we have:
#!/bin/bash

# $1  ods file
# $2  sheet #
# $3  row #
# $4  column #

tmpdir=/tmp/ods.$$
mkdir $tmpdir
cp "$1" $tmpdir
cd $tmpdir
IFS="|"
filename=`basename "$1"`

unzip -qq "$tmpdir/$filename"

xmlstarlet tr /usr/local/bin/expand.xslt content.xml > output.xml

xmlstarlet sel -N office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" -N table="urn:oasis:names:tc:opendocument:xmlns:table:1.0" -t -v "//table:table[$2]/table:table-row[$3]/table:table-cell[$4]" output.xml

cd - &>/dev/null
rm -rf $tmpdir
Invocation looks like this:

oo_extract_cell.sh Test.ods 1 5 3