This document describes how to operate the Harvest-PDAP Tool software to ingest data set metadata into the PDS Search Service. It covers tool execution, the policy file, and the report format.
Note: The command-line examples in this section have been broken into multiple lines for readability. The commands should be reassembled into a single line prior to execution.
Harvest-PDAP Tool can be executed in various ways. This section describes how to run the tool, as well as its behaviors and caveats.
The following table describes the command-line options available:
Command-Line Option | Description |
---|---|
-c, --config | Specify a policy configuration file to set the tool behavior. (This flag is required) |
-l, --log-file | Specify a log file name. Default is standard out. |
-v, --verbose | Specify the message severity level and above to include in the log (0=Debug, 1=Info, 2=Warning, 3=Error). Default is Info and above (level 1). |
-V, --version | Display the release number and copyright information. |
-h, --help | Display harvest usage. |
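As a rough illustration of how these options combine, the following Python sketch assembles an argument list from the flags in the table above. The helper name `build_harvest_command` is hypothetical, and the sketch assumes that `harvest-pdap` is on the PATH and that `-v` takes the numeric severity level as its argument.

```python
from typing import List, Optional

# Severity levels as listed in the options table (0=Debug ... 3=Error).
SEVERITY_LEVELS = {"debug": 0, "info": 1, "warning": 2, "error": 3}


def build_harvest_command(config: str,
                          log_file: Optional[str] = None,
                          severity: Optional[str] = None) -> List[str]:
    """Assemble a harvest-pdap argument list (hypothetical helper)."""
    if not config:
        # The -c flag is the only required option.
        raise ValueError("a policy configuration file (-c) is required")
    command = ["harvest-pdap", "-c", config]
    if log_file is not None:
        # Without -l, the tool logs to standard out.
        command += ["-l", log_file]
    if severity is not None:
        # Assumption: -v is followed by the numeric level.
        command += ["-v", str(SEVERITY_LEVELS[severity.lower()])]
    return command


print(" ".join(build_harvest_command("policy.xml", "output.log", "warning")))
```

The resulting list can be passed to `subprocess.run` unchanged, which avoids shell quoting issues with paths that contain spaces.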
The Harvest-PDAP Tool operates with a policy file to register product metadata. Details on how to create this policy file can be found in the Policy File section.
This section demonstrates some of the ways that the tool can be executed:
The following example demonstrates how to ingest the ESA datasets into the PDS Search Service:
%> harvest-pdap -c ../harvest-conf/examples/harvest-pdap-policy-esa-search.xml -C ../search-conf/defaults -l output.log
In the example above, the -c flag option specifies the example harvest policy configuration file, while the -C flag option specifies the location of the default search policy configuration files. The following command is a MAVEN-specific example:
The above command will register the full product label into a Solr Collection index named .system, where it can be looked up using its Logical Identifier, but with period . characters instead of colon : characters due to limitations with Solr. Additionally, Harvest will write out the search index files for the target bundle into a solr-docs directory in the current working directory. In an environment where multiple bundles will be indexed, that directory should be renamed and moved to a location from which it can be retrieved later, in case the Search Service needs to be re-indexed.
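The identifier change described above can be sketched in a few lines of Python. This assumes the substitution of periods for colons is the only transformation applied; the function name and the sample identifier are hypothetical.

```python
def system_index_id(logical_identifier: str) -> str:
    """Return the identifier form assumed for lookups in the .system collection."""
    # Assumption: the only change is ':' -> '.', as the text above describes.
    return logical_identifier.replace(":", ".")


example_lid = "urn:nasa:pds:example_bundle"  # hypothetical identifier
print(system_index_id(example_lid))
```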
Once the Harvest run is complete, use the Solr Post Tool to ingest the Search documents. Depending on the deployment setup of the Search Service, run the appropriate command below:
For Non-Dockerized Search Service Instances
%> $SOLR_HOME/bin/post -c pds -params tr="add-hierarchy.xsl" $HOME/harvest-pdap-2.0.0/bin/solr-docs
The above command assumes that you have SOLR_HOME defined in your environment. The last parameter assumes that the solr-docs directory was created in the harvest-pdap-2.0.0/bin directory.
For Dockerized Search Service Instances
%> docker exec -it search-service post -c pds -params "tr=add-hierarchy.xsl" /data/solr-docs
The /data/solr-docs directory references a location within the Search Service Docker container that is bind-mounted to the solr-docs directory on the host machine. This path should therefore always be passed to each Docker post command.
In both scenarios above, the Search Documents get ingested into a Solr collection named pds.
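For scripted deployments, the choice between the two post commands can be captured in a small helper. The following is a minimal sketch that only builds the argument lists shown above; the function name is hypothetical, and the `-it` flag is kept as in the original command even though it may need to be dropped when no terminal is attached.

```python
from typing import List, Optional

# Path inside the container that is bind-mounted to the host solr-docs directory.
CONTAINER_DOCS_DIR = "/data/solr-docs"


def build_post_command(dockerized: bool,
                       solr_home: Optional[str] = None,
                       docs_dir: Optional[str] = None) -> List[str]:
    """Build the Solr post command for either deployment (hypothetical helper)."""
    post_args = ["-c", "pds", "-params", "tr=add-hierarchy.xsl"]
    if dockerized:
        # The container path is fixed, so the caller does not supply docs_dir.
        return (["docker", "exec", "-it", "search-service", "post"]
                + post_args + [CONTAINER_DOCS_DIR])
    if not solr_home or not docs_dir:
        raise ValueError("solr_home and docs_dir are required when not dockerized")
    return [f"{solr_home}/bin/post"] + post_args + [docs_dir]


print(" ".join(build_post_command(dockerized=True)))
```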
The Harvest-PDAP policy file is an XML-based configuration file that the tool uses to find products and register their metadata. This section details how to set up the policy file to perform PDS product registration. The following is an example of a policy file to perform registration of ESA/PSA products into the Search Service:
<policy>
  <pdsSearch url="http://localhost:8983/solr">
    <!--
      Specify this attribute to point to the location of the Search Core
      configuration files. Default is to point to the
      harvest-pdap/search-conf/defaults location.
      <searchConfigDirectory>/path/to/search-conf</searchConfigDirectory>
    -->
    <!--
      Specify this attribute to point to a location where Harvest-PDAP will
      write the Solr document files. These document files will be written to
      a directory called solr-docs. Default is to write to the current
      working directory.
      <solrDocsDirectory>/path/to/solr-docs</solrDocsDirectory>
    -->
  </pdsSearch>
  <pdapServices>
    <!-- Currently, the only valid value for 'agency' is 'esa'. -->
    <!-- Can optionally specify a 'startDate' to only get data sets from the given date. -->
    <pdapService agency="esa" url="https://archives.esac.esa.int/psa"/>
  </pdapServices>
  <resourceMetadata>
    <title>The Planetary Science Archive METADATA Query Service</title>
    <type>System.Browse</type>
    <slot name="resource_name">
      <value>The Planetary Science Archive METADATA Query Service</value>
    </slot>
    <slot name="resource_description">
      <value>The Planetary Science Archive METADATA Query Service</value>
    </slot>
  </resourceMetadata>
</policy>
The policy file is made up of the following complex type elements: pdsSearch, pdapServices, productMetadata and resourceMetadata.
pdsSearch
Specify this element to register products into the Search Service. The following table describes the child elements that are allowed:
Element Name | Description |
---|---|
searchConfigDirectory | Specify to point to the top level directory of the Search Core configuration files. If this is not specified, then the default is to point to the search-conf/ folder found in the Harvest-PDAP package. |
solrDocsDirectory | Specify this to set the directory location to output the Solr Document files. If this is not specified, it will write these files to a solr-docs/ folder in the current working directory. |
pdapServices
Specify this element to indicate the PDAP service endpoint for accessing product metadata for registration with the Registry Service. The following table describes the child elements that are allowed:
Element Name | Description |
---|---|
pdapService | Specify the PDAP service endpoint by populating the two required attributes. The attributes are named agency and url. The only valid value for agency at this time is "esa" with the corresponding url value of "https://archives.esac.esa.int/psa". |
An optional attribute, startDate, can be specified within the pdapService element to fine-tune the query. This attribute represents the date on which the data set was released to the public. The format of the date value should be YYYY-MM-DD. As an example, the following specification will return ESA data sets starting from 2015-01-01:
<pdapService agency="esa" url="https://archives.esac.esa.int/psa" startDate="2015-01-01"/>
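When policy files are generated by a script, it can help to check the startDate value before writing it out. The following Python sketch builds a pdapService element and rejects dates that are not in YYYY-MM-DD form; the function name is hypothetical, and how the tool itself reacts to a malformed date is not covered here.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime
from typing import Optional


def pdap_service_element(agency: str, url: str,
                         start_date: Optional[str] = None) -> ET.Element:
    """Build a pdapService element, validating startDate (hypothetical helper)."""
    attributes = {"agency": agency, "url": url}
    if start_date is not None:
        # Require zero-padded YYYY-MM-DD; strptime alone accepts "2015-1-1".
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", start_date):
            raise ValueError(f"startDate must be YYYY-MM-DD: {start_date!r}")
        # Raises ValueError for impossible dates such as 2015-02-30.
        datetime.strptime(start_date, "%Y-%m-%d")
        attributes["startDate"] = start_date
    return ET.Element("pdapService", attributes)


element = pdap_service_element("esa", "https://archives.esac.esa.int/psa", "2015-01-01")
print(ET.tostring(element, encoding="unicode"))
```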
resourceMetadata
Specify this element to include metadata for every resource product. A corresponding resource product is registered for every data set product registered. The following table describes the child elements that are allowed:
Element Name | Description |
---|---|
title | Specify a title for the resource product. |
type | Specify the type of resource product. The most common value is "System.Browse". |
slot | Specify additional metadata for the resource product. This element contains a required name attribute to specify the name of the slot to use in the registry. The value child element specifies the slot value. This child element may be repeated multiple times to indicate multiple values. |
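To check what a policy file will register for each resource product, the resourceMetadata element can be read back with a standard XML parser. The sketch below assumes the policy file uses no XML namespace, as in the example in this document; the function name and the shortened sample values are illustrative only.

```python
import xml.etree.ElementTree as ET

# Shortened, illustrative policy content; values are placeholders.
POLICY_SNIPPET = """<policy>
  <resourceMetadata>
    <title>Example Query Service</title>
    <type>System.Browse</type>
    <slot name="resource_name">
      <value>Example Query Service</value>
    </slot>
  </resourceMetadata>
</policy>"""


def read_resource_metadata(policy_xml: str) -> dict:
    """Return title, type, and slot values from a policy document (hypothetical helper)."""
    root = ET.fromstring(policy_xml)
    resource = root.find("resourceMetadata")
    if resource is None:
        return {}
    # A slot may carry several value children, so collect them as a list.
    slots = {
        slot.get("name"): [value.text for value in slot.findall("value")]
        for slot in resource.findall("slot")
    }
    return {
        "title": resource.findtext("title"),
        "type": resource.findtext("type"),
        "slots": slots,
    }


print(read_resource_metadata(POLICY_SNIPPET))
```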
This section describes the contents of the Harvest-PDAP Tool report. At this time, the tool only outputs a series of log messages. The log reports the success or failure of each attempt to register a discovered product. Each log message consists of a severity level, file name, and a message. The following is an example of some of the log messages that can be expected from the Harvest-PDAP Tool:
PDS Harvest-PDAP Tool Log

Version                     Version 2.0.0-dev
Time                        Fri, Feb 22 2019 at 08:59:35 AM
Severity Level              INFO
PDAP Target(s)              [https://archives.esac.esa.int/psa]
Search Configuration        /Users/mcayanan/harvest-pdap-2.0.0-dev/search-conf
Solr Docs Output Directory  /Users/mcayanan/harvest-pdap-2.0.0-dev/bin/solr-docs

INFO: Connecting to PDAP Service: https://archives.esac.esa.int/psa
******* AdaptiveByteStore default memory limit = 2048M * 0.125 = 256M ************
******* malloc 4861928 bytes ************
INFO: [AIRUB-C-PHOTOCAM-2-EDR-HALLEY-1986-V1.0] \
  Processing dataset.
INFO: [AIRUB-C-PHOTOCAM-2-EDR-HALLEY-1986-V1.0] \
  Additional metadata needed. Getting dataset catalog file.
INFO: [AIRUB-C-PHOTOCAM-2-EDR-HALLEY-1986-V1.0] \
  Attempting to retrieve the catalog file using the file name: DATASET.CAT
INFO: [AIRUB-C-PHOTOCAM-2-EDR-HALLEY-1986-V1.0] \
  Successfully parsed file: \
  https://archives.esac.esa.int/psa/pdap/fileaccess?ID=EARTH/AIRUB-C-PHOTOCAM-2-EDR-HALLEY-1986-V1.0/CATALOG/DATASET.CAT
SUCCESS: [AIRUB-C-PHOTOCAM-2-EDR-HALLEY-1986-V1.0] \
  Successfully created Solr Document for dataset.
SUCCESS: [AIRUB-C-PHOTOCAM-2-EDR-HALLEY-1986-V1.0] \
  Successfully created Solr Document for resource.
INFO: [ESO-C-DFOSC-3-RSA-WIRTANEN-IMG-V1.0] \
  Processing dataset.
INFO: [ESO-C-DFOSC-3-RSA-WIRTANEN-IMG-V1.0] \
  Additional metadata needed. Getting dataset catalog file.
INFO: [ESO-C-DFOSC-3-RSA-WIRTANEN-IMG-V1.0] \
  Attempting to retrieve the catalog file using the file name: DATASET.CAT
INFO: [ESO-C-DFOSC-3-RSA-WIRTANEN-IMG-V1.0] \
  Could not retrieve catalog file using the file name DATASET.CAT
INFO: [ESO-C-DFOSC-3-RSA-WIRTANEN-IMG-V1.0] \
  Retrieving VOLDESC.CAT to look up the data set catalog file name.
INFO: [ESO-C-DFOSC-3-RSA-WIRTANEN-IMG-V1.0] \
  Successfully parsed file: \
  https://archives.esac.esa.int/psa/pdap/fileaccess?ID=EARTH/ESO-C-DFOSC-3-RSA-WIRTANEN-IMG-V1.0/VOLDESC.CAT
INFO: [ESO-C-DFOSC-3-RSA-WIRTANEN-IMG-V1.0] \
  Retrieved the catalog file name from the VOLDESC.CAT: ESO_C_DFOSC_3_RSA_WIRT_DS.CAT
INFO: [ESO-C-DFOSC-3-RSA-WIRTANEN-IMG-V1.0] \
  Retrieving the catalog file 'ESO_C_DFOSC_3_RSA_WIRT_DS.CAT'
INFO: [ESO-C-DFOSC-3-RSA-WIRTANEN-IMG-V1.0] \
  Successfully parsed file: \
  https://archives.esac.esa.int/psa/pdap/fileaccess?ID=EARTH/ESO-C-DFOSC-3-RSA-WIRTANEN-IMG-V1.0/CATALOG/ESO_C_DFOSC_3_RSA_WIRT_DS.CAT
SUCCESS: [ESO-C-DFOSC-3-RSA-WIRTANEN-IMG-V1.0] \
  Successfully created Solr Document for dataset.
SUCCESS: [ESO-C-DFOSC-3-RSA-WIRTANEN-IMG-V1.0] \
  Successfully created Solr Document for resource.
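Because the report is a plain log, a short script can summarize it per data set. The following Python sketch counts messages by severity for each bracketed data set identifier. It assumes the `SEVERITY: [identifier] message` layout seen in the sample above and treats a trailing backslash as a line continuation; the function names are hypothetical and the real log layout may differ.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Matches lines such as "SUCCESS: [DATASET-ID] message text".
LINE_PATTERN = re.compile(r"^([A-Z]+):\s*\[([^\]]+)\]\s*(.*)$")


def join_continuations(lines: Iterable[str]) -> List[str]:
    """Merge lines that end with a backslash into the line that follows."""
    joined, buffer = [], ""
    for line in lines:
        stripped = line.strip()
        if stripped.endswith("\\"):
            buffer += stripped[:-1].strip() + " "
        else:
            joined.append((buffer + stripped).strip())
            buffer = ""
    if buffer:
        joined.append(buffer.strip())
    return joined


def summarize(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Count messages per data set and severity; other lines are ignored."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in join_continuations(lines):
        match = LINE_PATTERN.match(line)
        if match:
            severity, dataset, _message = match.groups()
            counts[dataset][severity] += 1
    return {dataset: dict(by_level) for dataset, by_level in counts.items()}


sample = [
    "INFO: Connecting to PDAP Service: https://archives.esac.esa.int/psa",
    "INFO: [AIRUB-C-PHOTOCAM-2-EDR-HALLEY-1986-V1.0] \\",
    "  Processing dataset.",
    "SUCCESS: [AIRUB-C-PHOTOCAM-2-EDR-HALLEY-1986-V1.0] \\",
    "  Successfully created Solr Document for dataset.",
]
print(summarize(sample))
```

A data set with no SUCCESS entries in the summary is a reasonable candidate for closer inspection of its INFO messages.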