Tutorial: Workflow Parallelization¶
Warning
Optimize a workflow on a test set of images before running it over a whole dataset. See the VIS workflow tutorial or VIS/NIR tutorial. Our download tool, which communicates with a LemnaTec database system, produces a specific file structure that may differ from yours unless you also use the tool. Instructions for running PlantCV over a flat file directory are provided below.
Workflow parallelization step-by-step guide¶
Running PlantCV workflows over a dataset¶
We normally execute workflows in a shell script or in a cluster scheduler job file. The parallelization tool
plantcv-run-workflow has many configuration parameters. To make these parameters easier to manage,
they are set in a configuration file that is passed to the tool.
Configuration-based method¶
To create a configuration file, run the following:
plantcv-run-workflow --template my_config.json
The code above saves a text configuration file in JSON format using the built-in defaults for parameters. The parameters can be modified
directly in Python as demonstrated in the WorkflowConfig documentation. A configuration can be
saved at any time for later use with the save_config method. Alternatively, open the saved config
file with your favorite text editor and adjust the parameters as needed (refer to the attributes section of
WorkflowConfig documentation for details about each parameter).
Some notes on JSON format:
- Like Python, string variables (e.g. "VIS") need to be in quotes, but they must be double quotes (").
- Unlike Python, true and false in JSON are lowercase.
- None in Python translates to null in JSON.
- \ characters need to be escaped in JSON, e.g. \d in Python becomes \\d in JSON.
- There are no comments in JSON.
If you change the config in Python and then use save_config, these differences between Python and JSON are converted automatically.
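The notes above can be illustrated with Python's standard json module. This is a minimal sketch that uses only the standard library, not PlantCV itself, and the dictionary holds just a few sample parameter names rather than a complete configuration.

```python
import json

# A few Python values that look different once written as JSON
python_values = {
    "imgformat": "jpg",       # strings are written with double quotes
    "writeimg": True,         # True becomes true
    "start_date": None,       # None becomes null
    "delimiter": r"(\d{4})",  # each backslash is escaped, so \d becomes \\d
}

json_text = json.dumps(python_values)
print(json_text)

# Reading the JSON text back restores the original Python values
round_trip_ok = json.loads(json_text) == python_values

# Single-quoted strings are not valid JSON and are rejected by the parser
try:
    json.loads("{'imgformat': 'jpg'}")
    single_quotes_ok = True
except json.JSONDecodeError:
    single_quotes_ok = False
```

Loading a hand-edited configuration file with json.load in the same way is a quick check that the file is still valid JSON before it is passed to plantcv-run-workflow.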
Once configured, a workflow can be run in parallel over a dataset using the command:
plantcv-run-workflow --config config.json
As noted on the WorkflowConfig page, plantcv-run-workflow can be configured to run PlantCV
workflows locally or distribute workflows to a cluster using a scheduler service (e.g. HTCondor, SLURM, etc.).
Running PlantCV workflows over a flat directory of images¶
Note
PlantCV can analyze images in parallel that are stored in a directory (including subdirectories). Our aim is to make this process as flexible as possible but consistency in naming images is key. Ideally image filenames are constructed of metadata information separated by a consistent delimiter (though we provide a regular expression-based parser if needed). Please follow the instructions below carefully.
In order for PlantCV to extract all of the necessary metadata from the image files, image files need to be named in a particular way.
Image name might include:
- Plant ID
- Timestamp
- Measurement/Experiment Label
- Image Type
- Camera Label
- Zoom
Example Name:
AABA002948_2014-03-14 03-29-45_Pilot-031014_VIS_TV_z3500.png
- Plant ID = AABA002948
- Timestamp = 2014-03-14 03-29-45
- Measurement Label = Pilot-031014
- Image Type = VIS
- Camera Label = TV
- Zoom = z3500
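To make the idea concrete, the sketch below splits the example name on the delimiter and pairs each part with a metadata term. It is an illustration of the concept rather than PlantCV's own parser, and the choice of the id and measurementlabel terms for Plant ID and Measurement Label is an assumption made for this example.

```python
# Example filename from above, without the .png extension
basename = "AABA002948_2014-03-14 03-29-45_Pilot-031014_VIS_TV_z3500"

# Metadata terms listed in the same order as the parts of the filename
# (the mapping of Plant ID to "id" is an assumption for illustration)
filename_metadata = ["id", "timestamp", "measurementlabel", "imgtype", "camera", "zoom"]
delimiter = "_"

parts = basename.split(delimiter)
if len(parts) != len(filename_metadata):
    raise ValueError("The number of filename parts and metadata terms differ")

# Pair each term with the matching part of the filename
metadata = dict(zip(filename_metadata, parts))
print(metadata)
```

The split only works here because the delimiter never appears inside a metadata value: the timestamp uses dashes and a space, not underscores.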
Valid Metadata
Valid metadata that can be collected from filenames (basenames) are camera, imgtype, zoom, exposure, gain, frame, rotation,
lifter, timestamp, id, barcode, treatment, cartag, measurementlabel, and other. Additionally, the file path starting from
the input_dir can be used as filepath, or individual components of it as filepath{1:N}. The file name is available
with the basename key, which may be useful for regex-based filtering.
Note
Note that if a metadata.json or SnapshotInfo.csv file exists in your config.input_dir directory then that file will be used to supply
metadata instead of any parsing specified in the metadata key. If one of those files exists the filepath, filepath{1:N}, and basename
keys are still available to filter your data.
To correctly process timestamps, you need to specify the timestamp format code (timestampformat configuration
parameter) for the strptime C library.
For the example above you would use "timestampformat": "%Y-%m-%d %H-%M-%S".
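A format code can be checked against a sample timestamp before running a whole dataset. The sketch below uses Python's datetime.strptime, which accepts the same format codes; it is a standalone check and does not call PlantCV.

```python
from datetime import datetime

# Format code and timestamp from the example filename above
timestampformat = "%Y-%m-%d %H-%M-%S"
timestamp = datetime.strptime("2014-03-14 03-29-45", timestampformat)
print(timestamp.isoformat())

# A format that does not match the text raises ValueError
try:
    datetime.strptime("2014-03-14 03-29-45", "%Y-%m-%d %H:%M:%S")
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```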
Example configuration¶
Sample image filename: cam1_16-08-06-16:45_el1100s1_p19.jpg
{
"input_dir": "/shares/mgehan_share/raw_data/raw_image/2016-08_pat-edger/data/split-round1/split-cam1",
"json": "edger-round1-brassica.json",
"filename_metadata": ["camera", "timestamp", "id", "other"],
"workflow": "/home/mgehan/pat-edger/round1-python-pipelines/2016-08_pat-edger_brassica-cam1-splitimg.py",
"img_outdir": "/shares/mgehan_share/raw_data/raw_image/2016-08_pat-edger/data/split-round1/split-cam1/output",
"tmp_dir": ".",
"start_date": null,
"end_date": null,
"imgformat": "jpg",
"delimiter": "_",
"metadata_filters": {"camera": "cam1"},
"metadata_regex": {},
"timestampformat": "%y-%m-%d-%H:%M",
"writeimg": true,
"other_args": {},
"groupby": ["filepath"],
"group_name": "auto",
"cleanup": true,
"append": false,
"cluster": "HTCondorCluster",
"cluster_config": {
"n_workers": 16,
"cores": 1,
"memory": "1GB",
"disk": "1GB",
"log_directory": null,
"local_directory": null,
"job_extra_directives": null
},
"metadata_terms": {
...
}
}
Running plantcv-run-workflow --config config.json with the example configuration options above will run the PlantCV
workflow script 2016-08_pat-edger_brassica-cam1-splitimg.py on the images in the input directory using an HTCondor
compute cluster with up to 16 worker jobs checked out of the cluster.
Using a pattern matching-based filename metadata parser¶
If image filenames do not use a consistent delimiter (e.g. rgb_plant-1_2019-01-01 10_00_00.png) throughout,
then using the delimiter parameter with a single separator character will not split the filename properly
into the component metadata parts. An advanced option to extract metadata in this situation is to use pattern
matching, or regular expressions. The delimiter parameter
will accept a regular expression in place of a delimiter character. Example:
Example filename: rgb_plant-1_2019-01-01 10_00_00.png
Metadata components: camera, plantbarcode, timestamp
Delimiter = "_" will not work because the timestamp contains _ characters.
Regular expression: '(.{3})_(.+)_(\d{4}-\d{2}-\d{2} \d{2}_\d{2}_\d{2})'
Interpreting the example pattern
A key part of the pattern is the use of parentheses, because in regular expression syntax these mark the start and end of a group that will be returned from a match (in other words, parsed for our purposes). Regular expression patterns can be as general or specific as needed. The pattern above reads as:
Group 1 (camera): any 3 characters
Underscore
Group 2 (plantbarcode): 1 or more of any character
Underscore
Group 3 (timestamp): 4 digits, dash, 2 digits, dash, 2 digits, space, 2 digits, underscore, 2 digits, underscore, 2 digits
Note that the number of groups returned by the regular expression must match the number of metadata terms provided
in a list to the filename_metadata parameter.
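The pattern can be tried out in Python before it is placed in a configuration file. The sketch below uses the standard re module with the example filename and pattern from above; it illustrates how groups map to metadata terms and is not PlantCV's internal parser.

```python
import re

# Pattern and filename from the example above
pattern = r"(.{3})_(.+)_(\d{4}-\d{2}-\d{2} \d{2}_\d{2}_\d{2})"
filename = "rgb_plant-1_2019-01-01 10_00_00.png"
filename_metadata = ["camera", "plantbarcode", "timestamp"]

# Splitting on "_" yields five parts because the timestamp contains underscores
delimiter_parts = filename.rsplit(".", 1)[0].split("_")
print(delimiter_parts)

match = re.match(pattern, filename)
if match is None:
    raise ValueError("Pattern did not match the filename")

# The number of groups must equal the number of metadata terms
groups = match.groups()
if len(groups) != len(filename_metadata):
    raise ValueError("The number of groups and metadata terms differ")

metadata = dict(zip(filename_metadata, groups))
print(metadata)
```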
Example configuration:
{
"input_dir": "input_directory",
"json": "output.json",
"filename_metadata": ["camera", "plantbarcode", "timestamp"],
"workflow": "user-workflow.py",
"img_outdir": "output_directory",
"tmp_dir": ".",
"start_date": null,
"end_date": null,
"imgformat": "png",
"delimiter": "(.{3})_(.+)_(\\d{4}-\\d{2}-\\d{2} \\d{2}_\\d{2}_\\d{2})",
"metadata_filters": {},
"metadata_regex": {},
"timestampformat": "%Y-%m-%d %H_%M_%S",
"writeimg": true,
"other_args": {},
"groupby": ["filepath"],
"group_name": "auto",
"cleanup": true,
"append": false,
"cluster": "HTCondorCluster",
"cluster_config": {
"n_workers": 16,
"cores": 1,
"memory": "1GB",
"disk": "1GB",
"log_directory": null,
"local_directory": null,
"job_extra_directives": null
},
"metadata_terms": {
...
}
}
If you need help building a regular expression, https://regexr.com/ is a useful site to help build and interpret patterns. Also feel free to post an issue.
Using a pattern matching-based filename metadata filter¶
Regex patterns can be supplied as a dictionary to the metadata_regex parameter of the configuration JSON.
The filepath key will search for a regex pattern anywhere in the absolute path to an image.
The filepath1 key will search for a regex pattern in the first directory starting from the input_dir.
Additional keys up to filepathN (the max depth directory containing an image in input_dir) are available
as well as basename if filtering with regex is easier than with exact metadata terms.
In this example we filter several levels of the file path and the basename.
Here the first component of the file path starting from the input_dir represents a barcode,
which we filter to start with "barcode" followed by two 0s and a number between 5 and 9,
then matching either AB or AD. We don't filter the second component of the filepath,
but do filter the third to start with "snapshot".
Finally, we filter the basename for top view rgb images with "TV_VIS.*".
Example configuration:
{
"input_dir": "input_directory",
"json": "output.json",
"filename_metadata": [""],
"workflow": "user-workflow.py",
"img_outdir": "output_directory",
"tmp_dir": ".",
"start_date": null,
"end_date": null,
"imgformat": "jpg",
"delimiter": "_",
"metadata_filters": {},
"metadata_regex": {"filepath1": "^barcode00[5-9]A[BD]", "filepath3": "snapshot.*", "basename": "TV_VIS.*"},
"timestampformat": "%Y-%m-%d %H_%M_%S",
"writeimg": true,
"other_args": {},
"groupby": ["filepath"],
"group_name": "auto",
"cleanup": true,
"append": false,
"cluster": "HTCondorCluster",
"cluster_config": {
"n_workers": 16,
"cores": 1,
"memory": "1GB",
"disk": "1GB",
"log_directory": null,
"local_directory": null,
"job_extra_directives": null
},
"metadata_terms": {
...
}
}
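The filtering described in this section can be approximated in plain Python to test patterns against a few paths. This is a sketch under several assumptions: the paths are hypothetical and relative to input_dir, patterns are applied with re.search, the basename includes the file extension, and "AB or AD" is written with the character class [BD]. The helper names path_keys and passes_filters are made up for this example and are not part of PlantCV.

```python
import re
from pathlib import PurePosixPath

# Patterns matching the description above ("B or D" written as [BD])
metadata_regex = {
    "filepath1": "^barcode00[5-9]A[BD]",
    "filepath3": "snapshot.*",
    "basename": "TV_VIS.*",
}


def path_keys(relative_path):
    """Split a path relative to input_dir into filepath1..N and basename keys."""
    path = PurePosixPath(relative_path)
    keys = {f"filepath{i}": part for i, part in enumerate(path.parent.parts, start=1)}
    keys["basename"] = path.name
    return keys


def passes_filters(relative_path, filters):
    """Return True only if every pattern is found in its path component."""
    keys = path_keys(relative_path)
    return all(
        key in keys and re.search(pattern, keys[key]) is not None
        for key, pattern in filters.items()
    )


# Hypothetical image paths relative to input_dir
kept = passes_filters("barcode007AB/2020-01-01/snapshot42/TV_VIS_0.jpg", metadata_regex)
dropped = passes_filters("barcode003AB/2020-01-01/snapshot42/TV_VIS_0.jpg", metadata_regex)
print(kept, dropped)
```

The second directory level has no entry in metadata_regex, so any value is accepted there.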
Grouping images for multi-image workflows¶
Advanced PlantCV workflows can co-analyze multiple images. For example, a dataset containing an RGB and grayscale near-infrared image could be co-analyzed in a single workflow.
Sample image filenames: rgb_16-08-06-16:45_el1100s1_p19.jpg and nir_16-08-06-16:45_el1100s1_p19.jpg
Note in the example above, the two filenames are the same other than the indicated image type (rgb or nir).
In the example configuration below, we can group these images by timestamp because they share this metadata.
To identify each image within our workflow, we will name them based on the imgtype metadata values (rgb and nir).
{
"input_dir": "/shares/mgehan_share/raw_data/raw_image/2016-08_pat-edger/data/split-round1/split-cam1",
"json": "edger-round1-brassica.json",
"filename_metadata": ["imgtype", "timestamp", "id", "other"],
"workflow": "/home/mgehan/pat-edger/round1-python-pipelines/2016-08_pat-edger_brassica-cam1-splitimg.py",
"img_outdir": "/shares/mgehan_share/raw_data/raw_image/2016-08_pat-edger/data/split-round1/split-cam1/output",
"tmp_dir": ".",
"start_date": null,
"end_date": null,
"imgformat": "jpg",
"delimiter": "_",
"metadata_filters": {},
"timestampformat": "%y-%m-%d-%H:%M",
"writeimg": true,
"other_args": {},
"groupby": ["timestamp"],
"group_name": "imgtype",
"cleanup": true,
"append": false,
"cluster": "HTCondorCluster",
"cluster_config": {
"n_workers": 16,
"cores": 1,
"memory": "1GB",
"disk": "1GB",
"log_directory": null,
"local_directory": null,
"job_extra_directives": null
},
"metadata_terms": {
...
}
}
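The effect of groupby and group_name in the configuration above can be pictured with a short Python sketch. It illustrates the grouping idea only and is not PlantCV's implementation; the list of parsed metadata dictionaries is a hypothetical intermediate form for the two sample filenames.

```python
from collections import defaultdict

# Metadata parsed from the two sample filenames (hypothetical parsed form)
images = [
    {"filename": "rgb_16-08-06-16:45_el1100s1_p19.jpg", "imgtype": "rgb", "timestamp": "16-08-06-16:45"},
    {"filename": "nir_16-08-06-16:45_el1100s1_p19.jpg", "imgtype": "nir", "timestamp": "16-08-06-16:45"},
]
groupby = ["timestamp"]
group_name = "imgtype"

# Images sharing the same groupby values land in one group,
# and each image in the group is labeled by its group_name value
groups = defaultdict(dict)
for image in images:
    group_key = tuple(image[term] for term in groupby)
    groups[group_key][image[group_name]] = image["filename"]

for key, members in groups.items():
    print(key, members)
```

Both images share a timestamp, so they form a single group whose members are labeled rgb and nir.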
### Convert the output JSON file into CSV tables
```bash
plantcv-utils json2csv -j output.json -c result-table
```
See Accessory Tools for more information.