logo jcomma: help

jcomma converts CSV (and TSV) files to other formats and sanitizes them so that they are not subject to the vagaries of the generator of the CSV.

Live version at https://jcomma.davidearl.uk

CSV is a truly terrible format for data interchange, especially internationally. Yet, it is still ubiquitous in all sorts of scenarios, and it is popular because it is fairly easy for non-programmers to understand (partly because it glosses over the problems!) so we often have to deal with such files when we’d rather not.

For example, banks frequently generate CSVs with:

Furthermore, CSV files don’t have any means to indicate what character set encoding they use, and very often Excel - or the person who used Excel to send it to you - produces one (typically Windows) and your consumer needs another (often UTF-8). Particularly a problem with things like £ signs.

jcomma addresses these problems, especially when you encounter files laid out the same way repeatedly.

Usage

See also installation.

top

As a web app

Fill in the form and click Do It!. Each reload of the page remembers the recipe for interpreting the CSV file, and you can also save and restore a recipe from file or browser storage.

If you give a URL to a recipe on the web, jcomma will load that rather than the most recently used. For example:

https://jcomma.davidearl.uk/?recipe=https%3A%2F%2Fexample.com%2Fsome_recipe.json  

which means you can share a recipe efficiently. (If you put it in Google Drive or Dropbox, we’ll convert the link they give you when you share the file to one which is directly download-able). Don’t forget that the second URL must be percent-encoded when used as a parameter in the main URL (you can do that most easily by loading it in the box provided and then copying the URL)

top

Via API

Either:

For example:

curl -F "[email protected]" -F "[email protected]" "https://jcomma.davidearl.uk/jcomma.php"

You can use the web app to make a recipe file, save it, and then use that programatically; or make one on the fly.

top

As a library

To use as a library when installed with composer, include vendor/autoload.php, or otherwise include lib/JComma.php in your PHP application, and use like this:

$j = new \DavidEarl\JComma\JComma($pathtocsv, $recipeobject /* not JSON: already decoded */);
$errors = $j->validate(); // produces array of error message strings, or empty array if recipe is OK
if (empty($errors)) {
    $result = $j->convert(); // produces array of objects
    $j->output($result, $filename, $cl); /* optional: if you want to actually emit a file or string */
}

$recipe->outputTo can be set to ‘string’ (unlike when used via the API), when it is converted to the output format but returned as a string rather than emitted to stdout. $cl=TRUE (“command line”) is optional, and omits all headers

top

In a shell command line

Like this, for example:

php jcomma.php -s recipe.json my.csv

The result is written to standard output. recipe.json is the file containing a json representation of the options, as shown below. If the input csv path is ‘-’ or omitted, it is read from standard input, so the command can be used in a pipe:

echo my.csv | php jcomma.php -s recipe.json > my.xml

top

Load recipe

You can load the recipe previously saved with the link at the bottom of the page (or prepared elswhere) by opening the JSON file with the choose file button here. In Chrome you can just drop the file onto the button.

You can also load the recipe from the cloud by providing a URL. If successful, the URL is included in the address of the page, so you can distribute that URL to share the recipe pre-loaded in jcomma. Note that shared Google Drive and Dropbox URLs would normally take you to a download page, but jcomma converts these so they can be loaded directly from the sharing URL provided by these apps.

Alternatively, copy the JSON and paste it into the box provided.

top

Reset recipe

Clicking this just resets the page to the simplest possible recipe for you to then exand on (it may be needed since the page remembers each change).

top

Choose recipe

When you give your recipe a name, the browser automatically remembers it using that name, so you can reload it by selecting the name from this menu.

top

Delete recipe

As the browser saves all named recipes separately, there needs to be a way to clear them out. Just load the relevant recipe from the menu and then press the delete link.

top

Save recipe

All the settings described in the form are saved as a jcomma recipe, a JSON file, to a local file. Note that tghe whole recipe is saved on every change so if the page is reloaded, changes are not lost.

Alternatively copy the JSON from the box provided to where you need it, optionally laid out prettily by checking “pretty print layout”.

As well as restoring them to the page, a saved recipe can be used in automated workflows using jcomma, so that the recipes do not need to be hand written in JSON.

Output settings

top

Recipe name

You can give your recipe a name, for your own reference. Named recipes are stored by the browser so can be retrieved by name.

top

Comment

This is just saved with the recipe so when you look in the file there is something to tell you what it is for. It’s not saved with the output, just with the recipe.

top

Output to

Choose whether to save the output to a downloaded file or to display it in a browser tab. Note that csv and xlsx files are always downloaded (because browsers don’t know what to do with them).

This corresponds to the "outputTo": "..." in the recipe, with values ‘inline’ (for display) and ‘attachment’ (for download), and when used as the library ‘string’ to retrieve the result as a string.

top

Output format

jcomma can write its result in a variety of formats ("outputFormat": "json" for example in the recipe). Some of these have additional options:

Input settings

top

CSV file

Select the CSV file to process. In Chrome you can just drag and drop the file onto the Choose File button. Everything that follows then describes how that CSV file should be interpreted to convert it to output records (which are then formatted according to the chosen file format and options).

top

Paste CSV

Instead of choosing a CSV file to be processed, you can paste the data into the box provided. This is most useful when the data source is a table in a web page, which you can then copy and paste directly. In this case, the copy puts a TAB character between each cell, not a comma, so you will need also to set the delimiter character to Tab.

top

Input CSV encoding

CSV files are completely ambiguous when it comes to characters that aren’t represented in the basic ASCII character set, such as accented characters, currency symbols and exotic punctuation, not to mention whole languages like Japanese and Hebrew. While Excel does have an option to choose the character set when it is saved, most people do not know about it and are completely unaware of the consequences, namely that a consumer of the CSV may not see currency symbols or non-English names correctly. Typically Excel exports CSVs using a special Windows character set, while Google Sheets does so using the standard international character set called UTF-8.

This option ("encoding": "whatever" in the recipe) lets you explicitly state which character set to expect in the CSV (we will try to detect this when set to auto, but it is not always possible to be correct).

top

Header rows

Many CSV files have a header row labelling the columns. Some CSV files have additional header rows where people put arbitrary data (for example, a bank statement CSV might use one line for the account number, and another for the sort code).

Specify the number of header rows in the CSV here ("headerRows": 1 for example in the recipe). Zero means there is no header row at all, while 1 is common.

If header rows is 1 or more, the last of them is assumed to contain column headings, and these can be used to identify columns instead or as well as column letters. This can make the settings easier to follow. Column headings and letters are not treated as case-sensitive. Also note that conventionally the 27th and subsequent columns are “AA”, “AB”, “AC” etc, and the 53rd “BA”, “BB”, “BC”, and so on, as in Excel.

top

Each record formed from … rows

When the CSV file is processed, typically each row is used to create one (or sometimes more) output records. However, sometimes a record might be formed from a group of rows if the data has been spread over more than one row.

This number ("rowCount": 1 for example in the recipe) says how many rows to consume (after any rows that are ignored below have been discarded) to construct each record (or group of records). 1 is typical.

top

Delimiter

Columns in the CSV are expected to be separated by this character ("delimiterChar": "," for example in the recipe), usually a comma. If you have tab-separated columns (aka a TSV file), check the “tab” box (as you can’t easily type a tab character).

HTML tables copied and pasted as plain text are usually tab-separated, so if you copy a table off a web page and paste it into the paste csv box, setting the delimiter to TAB, it should be possible to process data without saving it to a spreadsheet first.

top

Enclosing character

Mostly, column data in a CSV file is just listed verbatim, but columns which need to include the delimiter character (e.g. a comma) need to be enclosed by a pair of these characters ("enclosureChar": "\"" for example in the recipe), usually a double quote. This setting allows you to change that character - if the file uses a single quote for example.

Where an enclosed field also contains the enclosure character, it needs to be doubled. For example, with a conventional double quote we might have

   ...,"This field has a "" symbol",...

Or if the delimiter is a slash character and needs to include a single slash:

   ...,/This field has a // symbol/,...

top

Ignore rows…

People will often put random rows in the middle of CSVs which makes them particularly hard to process in an orderly manner. For example, banks often repeat their a header row every “page”, whatever that might mean.

This setting ("ignoreRows": [ ... ] in the recipe) lets you ignore such rows based on one or more criteria. Add a new criterion by pressing the + button, remove an existing one with X, and drag ↕ to change the order. The row is ignored if any one of the criteria is satisfied.

You can ignore rows based on the value of a cell in a specified column in the incoming CSV (provide its name, either the column letter or column heading), or based on the value of one of the fields computed from the incoming CSV (provide its name to match one you specify below).

See conditions for details about the conditions (the same set of conditions is used in several different places).

Note that rows are not ignored in this way when extracting any header row(s) (where exactly the number given are consumed from the CSV).

top

Combine rows…

While you can read more than one row for all the content, sometimes the CSV will pop in ano extra unexpected row. You can combine rows selectively based on some condition (e.g. first column is empty).

This setting ("combineRows": [ ... ] in the recipe) lets you include such rows together based on one or more criteria. Add a new criterion by pressing the + button, remove an existing one with X, and drag ↕ to change the order. The row is combined with the previous if any one of the criteria is satisfied.

See conditions for details about the conditions (the same set of conditions is used in several different places).

Note that rows are not combined in this way when extracting any header row(s) (where exactly the number given are consumed from the CSV).

Note also that rows will continue to be combined until some row does not meet the condition, and that if each record is normally formed from more than one row, that will continue to apply to the following uncombined rows following.

top

Output records

Typically you’ll produce one output record for each input row (that isn’t ignored according to ignore rows) or group of however many rows given by ‘Each record formed from…’. However, occasionally it may be helpful to make more than one record for each row. Each such record would appear consecutively in the output, e.g. as an object in a JSON array or an extra row in an output CSV.

For example, you might produce two records from a PayPal transaction where you want to represent a payment as one record (a credit), and the PayPal fee (which PayPal puts in the same row in its exports) as another (a debit: a bank charge).

Therefore, the + here lets you add one or more recipes of records to be produced, though you’ll often only need one. ("records": [ ... ] in the recipe). Click X to remove one, and drag ↕ to change the order

Each record then comprises a set of fields, and a set of conditions based on those fields when the record will be discarded.

top

Fields of output record

Each field ("fields": [ ... ] in the recipe) of a record comprises a name, one or more columns and bits of verbatim text from which its value is initially composed, and a list of options applied in turn which can transform, omit or reject the value.

Add a new field with +, delete an existing one with X, and drag ↕ to change the order

Fields are delivered in the order they are specified in the output records.

top

Field name

Field names ("name": "whatever" in the recipe) is used in the output, for example as object member names in JSON, as element names in XML or column headings in CSV, HTML and XLSX.

If names contain periods and/or integers in square brackets, then subordinate objects and arrays are constructed for those formats which can support hierarchical output. For example, we might have output fields called “salutation”, “address.city” and “address.postcode”. In JSON output these would appear as:

[
    {
        "salutation": "Miss Doe",
        "address": {
            "city": "Lerwick",
            "postcode": "ZE1 1AA"
        }
    },
    ...
]

and in XML (with elements named “person”) as, for example:

<persons>
    <person>
        <salutation>Miss Doe</salutation>
        <address>
            <city>Lerwick</city>
            <postcode>ZE1 1AA</postcode>
        </address>
    </person>
</persons>

Similarly a field name “people[0].name” might produce

[
    {
        "people": [
            {
                "name": "Jane Doe"
            }
        ]
    },
    ...
]

In this way multiple columns of a spreadsheet can be distributed as array elements in the output. It isn’t currently possible to include arbitrary numbers of array elements.

In table formats (HTML, CSV, XLSX), the values just appear as consecutive columns headed (when requested) by the dotted/indexed name. Column headings are worked out from the field names of the first record.

You can usefully have more than one field with the same name (usually consecutive), providing you set options to omit each in opposite circumstances. For example, say you have debit and credit columns in the original CSV but require a single simple number, positive for credit and negative for debit. So you could make one field from credit and the same field from debit, each of which has the option to omit if zero, and convert to a number, while the debit version of the field also includes the option to negate when converting to a number. (However, if the opposite column is actually always blank rather than zero - the more common case - it would be easier to create a single field comprising both columns and setting the option to prefix the debit column with a minus sign in the concatenation).

don’t include in output

("exclude": true in the recipe) excludes the field from the output altogether. This differs from the “omit” conditions in that the field remains in the output record until it is output and is therefore available to be referenced from other fields. In this way, it behaves a bit like a variable.

top

Field concatenated from

Concatenate several columns, interleaved with verbatim separators, to make one output field, before applying any options. For example, some banks provide several fields which one might usefully put into a single Description field. You would almost always need at least one column. ("comprising": [{"item": "column", "column": "A", ...}, ...] in the recipe

Where there is more than one input row (rowCount is greater than 1), you’ll need to say from which of those rows the CSV cell is obtained (this also allows you to concatenate several values vertically from the same column), by giving the row offset (0 for the first row, 1 for the second in each group of rows, and so on).

You can also include:

For fields and columns, check boxes are available to:

in any combination. This is just a shortcut for including these as verbatim text.

Click + to include a new column, field or text, X to remove one, and drag ↕ to change the order

top

Field options

After a field value is derived by concatenation from one or more cells in the CSV, that value can be transformed in a variety of ways using field options. More than one option can be applied, in the order they are given. Click + to add a new option, X to delete an existing one, and drag ↕ to change the order.

Available options are as follows:

Each of the condition options (skip, omit and stop) can compare the value given with any of:

top

Don’t output record…

Having calculated all the fields for a record, the values computed can be used to determine that their record should not be included in the output at all if any of the conditions given here are satisfied. Press + to add a new condition, X to remove an existing one, and drag ↕ to change the order.

Settings which are used in several places

top

Conditions

top

Regular expressions

Regular expressions are a language for expressing the syntax of a string of text. jcomma uses so-called PCRE-regular expressions, as in PHP. The regular expression must include the delimiters (any suitable pair of characters) and any trailing modifiers.

See the PHP manual for full details of regular expression syntax.

For example, the following would match a string comprising only the letters A, B, C or a, b, c (using tilde as the delimiter, the trailing modifier ‘i’ indicating case-independence, ^ and $ requiring start and end of string, and square brackets to indicate a range of characters):

~^[a-c]$~i

Note that when you need to escape a character, in a PHP string you often find you need two backslashes, one for PHP’s literal string syntax, and another for the regular expression itself. Here, these are not PHP literal strings, so only one is required.

Where replacement is offered, parenthesised matches can be substituted using $1, $2 etc, just as in PHP.

top

JSON recipe (“the recipe”)

The values entered into the form are turned into a JSON object. This can be saved to a file.

When used from the API, as a library, or in a shell command, the recipe is supplied in this structured form. (Yes, I know JSON can’t really have comments like this).

{
    "recipeVersion": 3, # always this for these options
    "outputFormat": "json", # csv, html, xlsx, xml
    "outputStyle": "pretty", # for json
    "outputBulkElastic": "true", # for json, any non empty value
    "outputName": "whatever", # filename, also used to name elements where needed by format
    "outputTo": "inline", # or "attachment", or when used as a library, "string"
    "outputEncoding": "UTF-8", # or "Windows-1251". For CSV files only (the others are fixed by the file format)
    "outXMLElements": "true", # <x><k>v</k>...</x> rather than <x k="v" ...></x>
    "encoding": "UTF-8",
    "headerRows": 8,
    "delimiterChar": ",",
    "enclosureChar": "\"",
    "rowCount": 1, # default 1
    "ignoreRows": [ # one or more of these (row ignored if any is true):
        {"item": "column", "name": "A", "condition": "...", "value": "..."}, # conditions as before
        {"item": "field", "name": "name", "condition": "...", "value": "..."},
        ...
    ],
    "records": [
        {
            "fields": [
                {   "name": "...",
                    "comprising": [
                        {"item": "column",
                         "column": "A",
                         "rowOffset": 0, # optional, N=0 by default, otherwise from row relative to current from the N specified for the record in rowCount
                         "trimSpaces": true, # applied first if set
                         "prefixMinus": true,
                         "appendComma": true,
                         "appendSpace": true # comma first if appendComma also set
                        },
                        {"item": "text",
                         "text": "whatever"
                        },
                        {"item": "field",
                         "field": "name",
                         "trimSpaces": true, # applied first if set
                         "prefixMinus": true,
                         "appendComma": true,
                         "appendSpace": true # comma first if appendComma also set
                        },
                        ...
                    ],
                    "options": [ # any of the following, evaluated in turn:
                        {"item": "ignoreCurrency", "currencies": "pound-sign etc"},
                        {"item": "bookkeepersNegative"}, # (123) or 123- => -123
                        {"item": "trim"}, # trim surrounding white space
                        {"item": "replaceString", "matches": "string", "output": "substitution"},
                        {"item": "replaceRegExp", "matches": "regexp", "output": "stringwithdollarsubstitutions"},
                        {"item": "convertToNumber", "errorOnType": true, "negate": true}, # any non blank value ok for options
                        {"item": "convertToDate", "errorOnType": true, "dateFormatUS": true, "dateFormatTime": true},
                        {"item": "convertToCustomDate", "errorOnType": true, "dateFormatUS": true, "dateFormatStyle": "j M Y"}, # per PHP date function
                        {"item": "omitIf", "test": "value", "condition": "match", "value": "..."},
                        {"item": "omitIf", "test": "field", "field": "name", "condition": "match", "value": "..."},
                        {"item": "skipIf", "test": "value", "condition": "match", "value": "..."},
                        {"item": "skipIf", "test": "field", "field": "name", "condition": "match", "value": "..."},
                        {"item": "skipUnless", "test": "value", "condition": "match", "value": "..."},
                        {"item": "skipUnless", "test": "field", "field": "name", "condition": "match", "value": "..."},
                        {"item": "errorOnValue", "test": "value", "condition": "match", "value": "..."} # match, eq etc, blank, white, nonNumeric
                        {"item": "errorOnValue", "test": "field", "field": "name", "condition": "match", "value": "..."} # match, eq etc, blank, white, nonNumeric
                    ]
                } ,
                ... # more fields
            ],
            "unless": [ # generate record from row unless any condition is true
                { "field": "name", "condition": "eq", "value": "..."},
                ... # more 'unless' conditions, record discarded if any is true
            ]
        },
        ... # more records (occasionally)
    ]
}

top

Installation

Requires PHP >= 5.4. Does not work on older browsers (it’s using a recent version of jQuery). It will probably work on PHP 5.3 except for JSON Pretty Print which would produce a warning in the server error log if used.

Put the files in the document root of your web server, ideally a https website, or as a sub-directory of a website.

cd to top level and run composer to generate autoload and install dependencies:

composer update

If you can’t use composer, include lib/JComma.php in your project. You will need composer to create the website.

You might want to increase the individual file and total file upload limits in your server settings from the PHP default.

Acknowledgments

Apart from PHP and server software, the only dependencies are