Tutorial: Workflow Parallelization

Warning

Optimize workflows on a test set of images before running them over a whole dataset. See the VIS workflow tutorial or the VIS/NIR tutorial. Our download tool, which talks to a LemnaTec database system, produces a specific file structure that may differ from yours unless you use our tool. We also provide instructions for running PlantCV over a flat file directory.

Workflow parallelization step-by-step guide

Running PlantCV workflows over a dataset

We normally execute workflows from a shell script or a cluster scheduler job file. The parallelization tool plantcv-run-workflow has many configuration parameters. To keep them manageable, the parameters are stored in a configuration file that you edit and then pass to the tool.

Configuration-based method

To create a configuration file, run the following:

plantcv-run-workflow --template my_config.json

The command above saves a configuration file in JSON format, populated with the built-in default values for each parameter. The parameters can be modified directly in Python, as demonstrated in the WorkflowConfig documentation, and a configuration can be saved at any time for later use with the save_config method. Alternatively, open the saved configuration file in a text editor and adjust the parameters as needed (refer to the attributes section of the WorkflowConfig documentation for details about each parameter).

Some notes on JSON format:

As in Python, strings (e.g. "VIS") must be quoted, but JSON accepts only double quotes (").
Unlike Python, true and false in JSON are lowercase.
None in Python translates to null in JSON.
\ characters need to be escaped in JSON, e.g. \d in Python becomes \\d in JSON.
There are no comments in JSON.

If you change the configuration in Python and then call save_config, these differences between Python and JSON are converted automatically.
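
The JSON notes above can be checked with Python's standard json module. The following is only a minimal sketch: the dictionary is a hypothetical fragment with illustrative values, not a complete configuration, and it does not use PlantCV or save_config.

```python
import json

# Hypothetical fragment of a configuration, written as a Python dictionary.
# The values are illustrative only.
settings = {
    "imgformat": "jpg",        # Python accepts single or double quotes
    "start_date": None,        # becomes null in JSON
    "writeimg": True,          # becomes true in JSON
    "append": False,           # becomes false in JSON
    "delimiter": r"(\d{4})",   # the backslash is escaped as \\d in JSON
}

# Serialize to JSON text; strings are always written with double quotes.
text = json.dumps(settings, indent=4)
print(text)

# Reading the text back restores the original Python values.
restored = json.loads(text)
```

Printing text shows the lowercase true, false, and null literals and the doubled backslash, while restored compares equal to the original dictionary.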

Once configured, a workflow can be run in parallel over a dataset using the command:

plantcv-run-workflow --config config.json

As noted on the WorkflowConfig page, plantcv-run-workflow can be configured to run PlantCV workflows locally or distribute workflows to a cluster using a scheduler service (e.g. HTCondor, SLURM, etc.).

Running PlantCV workflows over a flat directory of images

Note

PlantCV can analyze images in parallel that are stored in a directory (including subdirectories). Our aim is to make this process as flexible as possible but consistency in naming images is key. Ideally image filenames are constructed of metadata information separated by a consistent delimiter (though we provide a regular expression-based parser if needed). Please follow the instructions below carefully.

In order for PlantCV to extract all of the necessary metadata from the image files, image files need to be named in a particular way.

Image name might include:

Plant ID
Timestamp
Measurement/Experiment Label
Image Type
Camera Label
Zoom

Example Name:

AABA002948_2014-03-14 03-29-45_Pilot-031014_VIS_TV_z3500.png

Plant ID = AABA002948
Timestamp = 2014-03-14 03-29-45
Measurement Label = Pilot-031014
Image Type = VIS
Camera Label = TV
Zoom = z3500

Valid Metadata

Valid metadata terms that can be collected from filenames (basenames) are camera, imgtype, zoom, exposure, gain, frame, rotation, lifter, timestamp, id, barcode, treatment, cartag, measurementlabel, and other. Additionally, the file path starting from the input_dir can be used as filepath, or its individual components as filepath{1:N}. The file name is available through the basename key, which may be useful for regex-based filtering.
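
To make the idea of delimiter-based parsing concrete, here is a minimal sketch that splits the example filename on "_" and pairs each piece with a metadata term. It only illustrates the concept and is not PlantCV's own parser; the assignment of terms to positions (id for the plant ID, measurementlabel for the measurement label, and so on) is an assumption based on the example above.

```python
import os

filename = "AABA002948_2014-03-14 03-29-45_Pilot-031014_VIS_TV_z3500.png"

# Assumed order of metadata terms for this filename (one term per field)
terms = ["id", "timestamp", "measurementlabel", "imgtype", "camera", "zoom"]

# Drop the file extension, then split the basename on the delimiter
stem, _extension = os.path.splitext(filename)
parts = stem.split("_")

# The number of fields must match the number of terms
if len(parts) != len(terms):
    raise ValueError(f"Expected {len(terms)} fields but found {len(parts)}")

metadata = dict(zip(terms, parts))
for term, value in metadata.items():
    print(f"{term} = {value}")
```

The length check shows why a consistent delimiter matters: a filename with an extra or missing "_" would no longer line up with the list of terms.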

Note

If a metadata.json or SnapshotInfo.csv file exists in your config.input_dir directory, that file will be used to supply metadata instead of any parsing specified in the metadata key. If one of those files exists, the filepath, filepath{1:N}, and basename keys are still available to filter your data.

To process timestamps correctly, specify the timestamp format with the timestampformat configuration parameter, using the format codes of the C library strptime function. For the example above you would use "timestampformat": "%Y-%m-%d %H-%M-%S".
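
A format string can be tried out before running a dataset with Python's datetime.strptime, which uses the same style of format codes. This is a small sketch for checking a format by hand; the second format string is a deliberately wrong example added for illustration.

```python
from datetime import datetime

timestamp_text = "2014-03-14 03-29-45"
timestamp_format = "%Y-%m-%d %H-%M-%S"

# A matching format parses into a datetime object
parsed = datetime.strptime(timestamp_text, timestamp_format)
print(parsed.isoformat())

# A format that does not match the text raises ValueError
wrong_format = "%Y-%m-%d %H:%M:%S"  # colons instead of dashes in the time
try:
    datetime.strptime(timestamp_text, wrong_format)
    format_matches = True
except ValueError:
    format_matches = False
print(format_matches)
```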

Example configuration

Sample image filename: cam1_16-08-06-16:45_el1100s1_p19.jpg

{
    "input_dir": "/shares/mgehan_share/raw_data/raw_image/2016-08_pat-edger/data/split-round1/split-cam1",
    "json": "edger-round1-brassica.json",
    "filename_metadata": ["camera", "timestamp", "id", "other"],
    "workflow": "/home/mgehan/pat-edger/round1-python-pipelines/2016-08_pat-edger_brassica-cam1-splitimg.py",
    "img_outdir": "/shares/mgehan_share/raw_data/raw_image/2016-08_pat-edger/data/split-round1/split-cam1/output",
    "tmp_dir": ".",
    "start_date": null,
    "end_date": null,
    "imgformat": "jpg",
    "delimiter": "_",
    "metadata_filters": {"camera": "cam1"},
    "metadata_regex": {},
    "timestampformat": "%y-%m-%d-%H:%M",
    "writeimg": true,
    "other_args": {},
    "groupby": ["filepath"],
    "group_name": "auto",
    "cleanup": true,
    "append": false,
    "cluster": "HTCondorCluster",
    "cluster_config": {
        "n_workers": 16,
        "cores": 1,
        "memory": "1GB",
        "disk": "1GB",
        "log_directory": null,
        "local_directory": null,
        "job_extra_directives": null
    },
    "metadata_terms": {
    ...
    }
}

Running plantcv-run-workflow --config config.json with the example configuration options above will run the PlantCV workflow script 2016-08_pat-edger_brassica-cam1-splitimg.py on the images in the input directory using an HTCondor compute cluster with up to 16 worker jobs checked out of the cluster.
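
As a rough illustration of how the filename-related options in this configuration fit together, the sketch below applies the delimiter, filename_metadata, timestampformat, and metadata_filters values to the sample filename. The helper functions parse_name and passes_filters are hypothetical names written for this example; they are not part of PlantCV, and the exact-match filter is an assumption about how metadata_filters behaves.

```python
from datetime import datetime

# Subset of the example configuration that relates to filename parsing
config = {
    "filename_metadata": ["camera", "timestamp", "id", "other"],
    "delimiter": "_",
    "metadata_filters": {"camera": "cam1"},
    "timestampformat": "%y-%m-%d-%H:%M",
}


def parse_name(filename, config):
    """Hypothetical helper: split a filename into metadata terms."""
    stem = filename.rsplit(".", 1)[0]
    values = stem.split(config["delimiter"])
    meta = dict(zip(config["filename_metadata"], values))
    # Convert the timestamp text using the configured format
    meta["timestamp"] = datetime.strptime(meta["timestamp"], config["timestampformat"])
    return meta


def passes_filters(meta, filters):
    """Hypothetical helper: keep an image only if every filter matches exactly."""
    return all(meta.get(term) == value for term, value in filters.items())


meta = parse_name("cam1_16-08-06-16:45_el1100s1_p19.jpg", config)
print(meta)
print(passes_filters(meta, config["metadata_filters"]))
```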

Using a pattern matching-based filename metadata parser

If image filenames do not use a consistent delimiter (e.g. rgb_plant-1_2019-01-01 10_00_00.png) throughout, then using the delimiter parameter with a single separator character will not split the filename properly into the component metadata parts. An advanced option to extract metadata in this situation is to use pattern matching, or regular expressions. The delimiter parameter will accept a regular expression in place of a delimiter character. Example:

Example filename: rgb_plant-1_2019-01-01 10_00_00.png
Metadata components: camera, plantbarcode, timestamp
Delimiter = "_" will not work because the timestamp contains _ characters.
Regular expression: '(.{3})_(.+)_(\d{4}-\d{2}-\d{2} \d{2}_\d{2}_\d{2})'

Interpreting the example pattern

A key part of the pattern is the use of parentheses because in regular expression syntax these mark the start and end of a group that will be returned from a match (or in other words, parsed for our purposes). Regular expression patterns can be as general or specific as needed. The pattern above reads as:

Group 1 (camera): any 3 characters

Underscore

Group 2 (plantbarcode): 1 or more of any character

Underscore

Group 3 (timestamp): 4 digits, dash, 2 digits, dash, 2 digits, space, 2 digits, underscore, 2 digits, underscore, 2 digits

Note that the number of groups returned by the regular expression must match the number of metadata terms provided in a list to the filename_metadata parameter.
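
The pattern can be tested on the example filename with Python's re module before it goes into a configuration file. This is a minimal sketch of the idea, not PlantCV's internal parser; the term names follow the example configuration in this section.

```python
import re
from datetime import datetime

filename = "rgb_plant-1_2019-01-01 10_00_00.png"
pattern = r"(.{3})_(.+)_(\d{4}-\d{2}-\d{2} \d{2}_\d{2}_\d{2})"
terms = ["camera", "plantbarcode", "timestamp"]

match = re.search(pattern, filename)
if match is None:
    raise ValueError(f"Pattern did not match {filename}")

# Each parenthesized group yields one metadata value
groups = match.groups()
if len(groups) != len(terms):
    raise ValueError("Number of groups must equal the number of metadata terms")

metadata = dict(zip(terms, groups))
print(metadata)

# The captured timestamp can then be parsed with the matching format
timestamp = datetime.strptime(metadata["timestamp"], "%Y-%m-%d %H_%M_%S")
print(timestamp)
```

In Python the pattern is written as a raw string (r"..."), so the backslashes are single; in a JSON file each backslash must be doubled.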

Example configuration:

{
    "input_dir": "input_directory",
    "json": "output.json",
    "filename_metadata": ["camera", "plantbarcode", "timestamp"],
    "workflow": "user-workflow.py",
    "img_outdir": "output_directory",
    "tmp_dir": ".",
    "start_date": null,
    "end_date": null,
    "imgformat": "jpg",
    "delimiter": "(.{3})_(.+)_(\\d{4}-\\d{2}-\\d{2} \\d{2}_\\d{2}_\\d{2})",
    "metadata_filters": {},
    "metadata_regex": {},
    "timestampformat": "%Y-%m-%d %H_%M_%S",
    "writeimg": true,
    "other_args": {},
    "groupby": ["filepath"],
    "group_name": "auto",
    "cleanup": true,
    "append": false,
    "cluster": "HTCondorCluster",
    "cluster_config": {
        "n_workers": 16,
        "cores": 1,
        "memory": "1GB",
        "disk": "1GB",
        "log_directory": null,
        "local_directory": null,
        "job_extra_directives": null
    },
    "metadata_terms": {
    ...
    }
}

If you need help building a regular expression, https://regexr.com/ is a useful site to help build and interpret patterns. Also feel free to post an issue.

Using a pattern matching-based filename metadata filter

Regex patterns can be supplied as a dictionary to the metadata_regex parameter of the configuration JSON.

The filepath key will search for a regex pattern anywhere in the absolute path to an image. The filepath1 key will search for a regex pattern in the first directory starting from the input_dir. Additional keys up to filepathN (the max depth directory containing an image in input_dir) are available as well as basename if filtering with regex is easier than with exact metadata terms.

In this example we filter several levels of the file path and the basename. Here the first component of the file path starting from the input_dir represents a barcode, which we filter to start with "barcode" followed by two 0s and a number between 5 and 9, then matching either AB or AD. We don't filter the second component of the file path, but do filter the third to start with "snapshot". Finally, we filter the basename for top-view RGB images with "TV_VIS.*".

Example configuration:

{
    "input_dir": "input_directory",
    "json": "output.json",
    "filename_metadata": [""],
    "workflow": "user-workflow.py",
    "img_outdir": "output_directory",
    "tmp_dir": ".",
    "start_date": null,
    "end_date": null,
    "imgformat": "jpg",
    "delimiter": "_",
    "metadata_filters": {},
    "metadata_regex": {"filepath1":"^barcode00[5-9]A[BD]", "filepath3":"snapshot.*", "basename":"TV_VIS.*"},
    "timestampformat": "%Y-%m-%d %H_%M_%S",
    "writeimg": true,
    "other_args": {},
    "groupby": ["filepath"],
    "group_name": "auto",
    "cleanup": true,
    "append": false,
    "cluster": "HTCondorCluster",
    "cluster_config": {
        "n_workers": 16,
        "cores": 1,
        "memory": "1GB",
        "disk": "1GB",
        "log_directory": null,
        "local_directory": null,
        "job_extra_directives": null
    },
    "metadata_terms": {
    ...
    }
}
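
The filtering described above can be tried out on a few paths with Python's re module. The sketch below is an approximation under stated assumptions: the helper functions path_keys and keep_image and the sample paths are hypothetical, the paths are taken relative to input_dir, and the filepathN keys are assumed to cover directories only. The character class [BD] is used to match either B or D.

```python
import re
from pathlib import PurePosixPath

metadata_regex = {
    "filepath1": "^barcode00[5-9]A[BD]",
    "filepath3": "snapshot.*",
    "basename": "TV_VIS.*",
}


def path_keys(relative_path):
    """Hypothetical helper: build filepath1..N and basename keys for a path."""
    path = PurePosixPath(relative_path)
    keys = {"basename": path.name}
    for depth, directory in enumerate(path.parent.parts, start=1):
        keys[f"filepath{depth}"] = directory
    return keys


def keep_image(relative_path, patterns):
    """Hypothetical helper: keep a path only if every pattern is found."""
    keys = path_keys(relative_path)
    return all(
        key in keys and re.search(pattern, keys[key]) is not None
        for key, pattern in patterns.items()
    )


# Hypothetical paths relative to input_dir
paths = [
    "barcode007AB/run1/snapshot42/TV_VIS_001.jpg",
    "barcode003AB/run1/snapshot42/TV_VIS_001.jpg",
    "barcode007AD/run1/snapshot42/SV_VIS_001.jpg",
]
kept = [path for path in paths if keep_image(path, metadata_regex)]
print(kept)
```

Only the first path passes all three patterns: the second fails the barcode range and the third is not a top-view image.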

Grouping images for multi-image workflows

Advanced PlantCV workflows can co-analyze multiple images. For example, a dataset containing an RGB and grayscale near-infrared image could be co-analyzed in a single workflow.

Sample image filenames: rgb_16-08-06-16:45_el1100s1_p19.jpg and nir_16-08-06-16:45_el1100s1_p19.jpg

Note that in the example above, the two filenames are identical except for the indicated image type (rgb or nir).

In the example configuration below, we can group these images by timestamp because they share this metadata. To identify each image within our workflow, we will name them based on the imgtype metadata values (rgb and nir).

{
    "input_dir": "/shares/mgehan_share/raw_data/raw_image/2016-08_pat-edger/data/split-round1/split-cam1",
    "json": "edger-round1-brassica.json",
    "filename_metadata": ["imgtype", "timestamp", "id", "other"],
    "workflow": "/home/mgehan/pat-edger/round1-python-pipelines/2016-08_pat-edger_brassica-cam1-splitimg.py",
    "img_outdir": "/shares/mgehan_share/raw_data/raw_image/2016-08_pat-edger/data/split-round1/split-cam1/output",
    "tmp_dir": ".",
    "start_date": null,
    "end_date": null,
    "imgformat": "jpg",
    "delimiter": "_",
    "metadata_filters": {},
    "timestampformat": "%y-%m-%d-%H:%M",
    "writeimg": true,
    "other_args": {},
    "groupby": ["timestamp"],
    "group_name": "imgtype",
    "cleanup": true,
    "append": false,
    "cluster": "HTCondorCluster",
    "cluster_config": {
        "n_workers": 16,
        "cores": 1,
        "memory": "1GB",
        "disk": "1GB",
        "log_directory": null,
        "local_directory": null,
        "job_extra_directives": null
    },
    "metadata_terms": {
    ...
    }
}
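
The effect of grouping by timestamp and naming by imgtype can be pictured with a short sketch. It only illustrates the grouping idea using plain dictionaries; how PlantCV actually passes the grouped images to a workflow is not shown here, and the data structure is an assumption made for illustration.

```python
from collections import defaultdict

filenames = [
    "rgb_16-08-06-16:45_el1100s1_p19.jpg",
    "nir_16-08-06-16:45_el1100s1_p19.jpg",
]
terms = ["imgtype", "timestamp", "id", "other"]

# Outer key: the groupby value (timestamp)
# Inner key: the group_name value (imgtype)
groups = defaultdict(dict)
for filename in filenames:
    stem = filename.rsplit(".", 1)[0]
    meta = dict(zip(terms, stem.split("_")))
    groups[meta["timestamp"]][meta["imgtype"]] = filename

for timestamp, images in groups.items():
    print(timestamp, images)
```

Both files share the same timestamp, so they land in a single group in which each image can be identified by its image type.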

### Convert the output JSON file into CSV tables

```bash
plantcv-utils json2csv -j output.json -c result-table
```

See Accessory Tools for more information.
