Operation

Note: The command-line examples in this section have been broken into multiple lines for readability. Reassemble each command into a single line before executing it.

Tool Setup

Before the Harvest Tool can be executed, the user's environment must be configured appropriately. This section describes how to set up the user environment on UNIX-based and Windows machines.

UNIX-Based Setup

This section details the environment setup for UNIX-based machines. The preferred method is to specify the shell script, Harvest, on the command line. Adding the location of the script to the PATH environment variable enables the shell script to be executed from any location on the user's machine. The following command demonstrates how to set the PATH environment variable by appending to its current setting:

% setenv PATH ${PATH}:$HOME/harvest-0.4.0/bin

The tool can now be executed via the shell script as demonstrated in the following example:

% Harvest <policy file> <command-line arguments>

Additional methods for setting up a UNIX-based environment can be found in the UNIX Setup Options section. If viewing this document in PDF form, see the appendix for details.

Windows Setup

This section details the environment setup for Windows machines. The preferred method is to specify the batch file, Harvest.bat, on the command line. Adding the location of the file to the PATH environment variable enables the batch file to be executed from any location on the user's machine. The following command demonstrates how to set the PATH environment variable by appending to its current setting (note that there must be no spaces around the equals sign):

C:\> set PATH=%PATH%;C:\harvest-0.4.0\bin

The tool can now be executed via the batch file as demonstrated in the following example:

C:\> Harvest <policy file> <command-line arguments>

Additional methods for setting up a Windows environment can be found in the Windows Setup Options section.
If viewing this document in PDF form, see the appendix for details.

Additional Tool Setup

This section details how to reconfigure the Harvest Tool to interface with another instance of the Registry Service. The Harvest Tool points to the Registry Service via the pds.registry Java System Property. If the target Registry Service instance is secured, the pds.security.keystore Java System Property must also be set. The following table details these two Java System Properties:
By default, the Harvest shell script and batch file point to local installations of the Registry Service. They also automatically point to the keystore file that is included with the Harvest package. The sections below detail how to modify these scripts to point to another instance of the Registry.

UNIX-Based Users

Open the Harvest shell script and go to the last line in the file. It should look like the following:

% java -Dpds.registry="http://localhost:8080/registry-service" -Dpds.security.keystore="${KEYSTORE}" -jar ${HARVEST_JAR} "$@"

Replace the URL value of pds.registry with the URL of the desired instance of the Registry. For example, the following change to the script points Harvest at the secured, operational instance of the Registry at the Engineering Node:

% java -Dpds.registry="https://pdsops2.jpl.nasa.gov/registry-service" -Dpds.security.keystore="${KEYSTORE}" -jar ${HARVEST_JAR} "$@"

Windows-Based Users

Open the Harvest batch file and go to the last line in the file. It should look like the following:

% java -Dpds.registry="http://localhost:8080/registry-service" -Dpds.security.keystore="%KEYSTORE%" -jar "%HARVEST_JAR%" %*

Replace the URL value of pds.registry with the URL of the desired instance of the Registry. For example, the following change to the batch file points Harvest at the secured, operational instance of the Registry at the Engineering Node:

% java -Dpds.registry="https://pdsops2.jpl.nasa.gov/registry-service" -Dpds.security.keystore="%KEYSTORE%" -jar "%HARVEST_JAR%" %*

Tool Execution

The Harvest Tool can be executed in various ways. This section describes how to run the tool, as well as its behaviors and caveats.

Command-Line Options

The following table describes the command-line options available:
Execute Harvest Tool

This section demonstrates execution of the tool using the command-line options. The examples below execute the tool via the batch file or shell script. Alternate methods for executing the tool can be found in the Tool Setup section.

The Harvest Tool operates with a policy file to register product metadata. Details on how to create this policy file can be found in the Harvest Policy File section.

The following command demonstrates how to run the Harvest Tool against a policy file, policy.xml, using a valid username and password, with the output going to standard out:

% Harvest policy.xml -u {username} -p {password}

The following command demonstrates how to run the Harvest Tool with the output going to a log file, log.txt, instead of standard out:

% Harvest policy.xml -u {username} -p {password} -l log.txt

When registering product metadata to a non-secured instance of the Registry (such as one running on your local machine), the -u and -p command-line options do not need to be passed to the tool. The following command demonstrates how to run the Harvest Tool to register product metadata to a non-secured instance of the Registry Service, with the output going to a log file:

% Harvest policy.xml -l log.txt

Persistence Mode

The Harvest Tool can be run in persistence mode through an XML-RPC accessible web service called a daemon. In this mode, the Harvest Tool wakes up periodically, inspects a target directory or directories, and registers the latest products. This section details how to set this up.

To run the tool through the daemon, the command-line options -P and -w must be used. They tell the Harvest Tool the port number to use and how long to sleep in between crawls, respectively. When the daemon is running, it can be accessed through the following URL: http://localhost:{port number}/xmlrpc.
The following command demonstrates launching the Harvest Tool through the daemon on port 9000, where it will wait 120 seconds in between crawls:

% Harvest policy.xml -u {username} -p {password} -l log.txt -P 9000 -w 120

After running the above command, the daemon will be accessible at http://localhost:9000/xmlrpc.

To stop the daemon, a daemon controller is needed. The bin/ directory of the Harvest Tool release package contains a shell script, HarvestController, and a batch file, HarvestController.bat, which are used to gracefully shut down the daemon service on UNIX-like and Windows systems, respectively. They can also report a few statistics about the crawling. The following table describes the command-line options available for the HarvestController:
The following table describes the operation names available to pass into the --operation command-line flag option:
The following examples demonstrate how to run the HarvestController using a few of the different operations. For demonstration purposes, assume that the daemon service is located at the following URL: http://localhost:9000/xmlrpc.

Shut Down the Daemon Service

The following command demonstrates shutting down the daemon service:

% HarvestController --url http://localhost:9000/xmlrpc --operation --stop

Find Out the Status of the Daemon Service

The following command is used to find out if the daemon service is still running:

% HarvestController --url http://localhost:9000/xmlrpc --operation --isRunning

Harvest Policy File

The Harvest policy file is an XML-based configuration file that the tool uses to find products and register their metadata. The schema for the policy file can be found in the Harvest Policy Schema section. If viewing this document in PDF form, see the appendix for details. This section details how to set up the policy file to perform PDS data product registration.

PDS4 Data Product Registration

The following is an example of a policy file to perform registration of PDS4 data products:

<?xml version="1.0" encoding="UTF-8"?>
<policy>
  <bundles>
    <file>/home/pds4/context-bundle/bundle.xml</file>
  </bundles>
  <collections>
    <file>/home/pds4/insthost/collection_instrument_host.xml</file>
  </collections>
  <directories>
    <path>/home/user/pds4/geo/product_files</path>
    <filePattern>*.xml</filePattern>
  </directories>
  <validation>
    <enabled>true</enabled>
  </validation>
  <candidates>
    <namespace prefix="geo" uri="http://pds.nasa.gov/schema/pds4/geo"/>
    <productMetadata objectType="character_table">
      <xPath>//geo:Product_Identification_Area/geo:creation_date_time</xPath>
      <xPath>//geo:Subject_Area/geo:instrument_name</xPath>
      <xPath>//Subject_Area/observing_system_name</xPath>
    </productMetadata>
    <productMetadata objectType="Product_Target">
      <xPath>//alternate_title</xPath>
      <xPath>//creation_date_time</xPath>
      <xPath>//identifier</xPath>
      <xPath>//Subject_Area/target_name</xPath>
    </productMetadata>
  </candidates>
</policy>

This policy file is made up of the following complex type elements: bundles, collections, directories, validation, candidates, and productMetadata.

bundles

Specify this element to tell the Harvest Tool to register and crawl a bundle file. The following table describes the elements that are allowed:
In the example above, the Harvest Tool will register the bundle file named /home/pds4/context-bundle/bundle.xml. It will then crawl the bundle file, looking for collection files to register and process.

collections

Specify this element to tell the Harvest Tool to register and crawl a collection file. Crawling only occurs when the collection file is a primary collection. This is indicated by a value of true in the is_primary_collection element tag within the collection. The following table describes the elements that are allowed:
In the example above, the Harvest Tool will register the collection file named /home/pds4/insthost/collection_instrument_host.xml. It will then crawl the file, looking for products to register if it is a primary collection.

directories

Specify this element to tell the Harvest Tool where to crawl for data products. The following table describes the elements that are allowed:
In the example above, the Harvest Tool will crawl the directory location, /home/user/pds4/geo/product_files, looking for files that have a .xml file extension. If the filePattern element is omitted from the policy file, the default is to process all files in the directory.

validation

Specify this element to tell the Harvest Tool to validate a data product before registering it. If the data product does not pass the validation step, it will not be registered. The following table describes the elements that are allowed:
By default, if the validation element is not specified in the policy file, validation is turned on.

candidates

Specify this element to tell the Harvest Tool what product types to register and what metadata to extract from a data product. This is a required element in the policy file. The following table describes the elements that are allowed:
By default, the Harvest Tool defines the default namespace to be the PDS namespace, http://pds.nasa.gov/schema/pds4/pds. To override this default, specify the default attribute in the namespace element and give it a value of true. The following makes the geo namespace the default namespace:

<candidates>
  <namespace prefix="geo" uri="http://pds.nasa.gov/schema/pds4/geo" default="true"/>
  ...

Namespaces need to be defined in the Harvest policy file only if the metadata to be extracted exists in a namespace other than the PDS namespace. In the example above, a namespace with the prefix geo and URI http://pds.nasa.gov/schema/pds4/geo has been defined. This means that any XPath expression defined in the policy file can use the geo prefix to extract metadata within the geo namespace. XPath expressions are explained in greater detail in the productMetadata section.

productMetadata

Specify this element to tell the Harvest Tool what metadata to register. It requires an attribute called objectType that tells the Harvest Tool what product types to register. The following table describes the elements that are allowed:
In the example above, the policy file tells the Harvest Tool to look for and register the character_table and Product_Target object types. The example also contains a set of xPath elements under each productMetadata element. These define what metadata to extract from the different products.

XPath is a query language that uses path expressions to select nodes in an XML document. These path expressions look very much like paths in a traditional computer file system. In its simplest form, prepending // to a name will find the element no matter where it is in the XML file.

The following XPath expression will find the creation_date_time element within the default namespace, no matter where this element is located in the file:

//creation_date_time

The following XPath expression will find the creation_date_time element within the geo namespace, no matter where this element is located in the file:

//geo:creation_date_time

The following XPath expression will find all target_name elements that are children of Subject_Area within the default namespace:

//Subject_Area/target_name

The following XPath expression will find all target_name elements that are children of Subject_Area and that have a value of MARS:

//Subject_Area/target_name[text()="MARS"]

For a more detailed explanation of XPath, search the web for an XPath tutorial.

PDS3 Product Registration

By default, the tool registers discovered PDS3 products under the Product_Proxy_PDS3 objectType in the registry. Additionally, the tool has to dynamically create certain metadata in order to support ingestion of PDS3 data products into the registry. The Harvesting of PDS3 Data Products section details how the Harvest Tool behaves when registering PDS3 data products. If viewing this document in PDF form, see the appendix for details.
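As an aside on the XPath expressions listed in the productMetadata section, the following minimal Python sketch shows how comparable expressions select elements. It is only an illustration and says nothing about how the Harvest Tool evaluates a policy file internally. The sample label, its values, and the helper find_texts are hypothetical. The sketch uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath: paths are written relative to the root (.// rather than //), the default namespace has to be given an explicit prefix (pds here), and the value test is written [.='MARS'] rather than [text()="MARS"].

```python
import xml.etree.ElementTree as ET

# Hypothetical product label, invented for illustration only.
SAMPLE = """
<Product xmlns="http://pds.nasa.gov/schema/pds4/pds"
         xmlns:geo="http://pds.nasa.gov/schema/pds4/geo">
  <Identification_Area>
    <creation_date_time>2010-09-29T14:02:27</creation_date_time>
  </Identification_Area>
  <geo:Product_Identification_Area>
    <geo:creation_date_time>2010-04-22T00:00:00</geo:creation_date_time>
  </geo:Product_Identification_Area>
  <Subject_Area>
    <target_name>MARS</target_name>
    <target_name>PHOBOS</target_name>
  </Subject_Area>
</Product>
"""

# Prefix-to-URI map; "pds" stands in for the default (PDS) namespace.
NS = {
    "pds": "http://pds.nasa.gov/schema/pds4/pds",
    "geo": "http://pds.nasa.gov/schema/pds4/geo",
}


def find_texts(root, path):
    """Return the text of every element matched by the given path."""
    return [element.text for element in root.findall(path, NS)]


root = ET.fromstring(SAMPLE)

# Counterpart of //creation_date_time (default namespace).
default_dates = find_texts(root, ".//pds:creation_date_time")

# Counterpart of //geo:creation_date_time.
geo_dates = find_texts(root, ".//geo:creation_date_time")

# Counterpart of //Subject_Area/target_name.
targets = find_texts(root, ".//pds:Subject_Area/pds:target_name")

# Counterpart of //Subject_Area/target_name[text()="MARS"].
mars_only = find_texts(root, ".//pds:Subject_Area/pds:target_name[.='MARS']")

print(default_dates, geo_dates, targets, mars_only)
```

Trying expressions against a small sample label like this can help confirm that a path selects the intended elements before it is placed in an xPath element of the policy file.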
The following is an example of a policy file to perform product registration of PDS3 data products:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Example of a Harvest policy configuration file that will do
     PDS3 data product registration -->
<policy>
  <!-- Specify a single directory containing the PDS3 data products
       to register -->
  <pds3Directory>
    <path>/data/pds3/dataset</path>
    <filePattern>*.LBL</filePattern>
  </pds3Directory>
  <candidates>
    <!-- Harvest will register PDS3 data products under the
         objectType 'Product_Proxy_PDS3' -->
    <pds3ProductMetadata>
      <!-- Prefix to add to the LID of a PDS3 product registration -->
      <lidPrefix>URN:JPL:PDS:ENGINEERING</lidPrefix>
      <!-- Associations to register with discovered PDS3 products -->
      <associations>
        <!-- Specify either a LID or LIDVID reference -->
        <association>
          <referenceType>has_Target</referenceType>
          <lidVidReference>URN:NASA:PDS:target.MARS::1.0</lidVidReference>
        </association>
        <association>
          <referenceType>has_Mission</referenceType>
          <lidReference>URN:NASA:PDS:mission.MER</lidReference>
        </association>
      </associations>
      <!-- Register any additional metadata. They will be registered
           as slots with their element names in lowercase form.
           Default is to register metadata defined in the
           identification area of the Product_Proxy_PDS3 schema. -->
      <ancillaryMetadata>
        <elementName>START_DATE_TIME</elementName>
        <elementName>STOP_DATE_TIME</elementName>
      </ancillaryMetadata>
      <includePaths>
        <path>/data/pds3/label</path>
      </includePaths>
    </pds3ProductMetadata>
  </candidates>
</policy>

This policy file is made up of the following complex type elements: pds3Directory, pds3ProductMetadata, association, ancillaryMetadata, and includePaths.

pds3Directory

Specify this element to tell the Harvest Tool the directory location to crawl. The following table describes the elements that are allowed:
In the example above, the Harvest Tool will crawl for PDS3 data products starting at the location /data/pds3/dataset, looking for files with a .LBL file extension.

pds3ProductMetadata

Specify this element to tell the Harvest Tool what metadata to ingest into the registry when registering PDS3 data products. This element must be specified within the candidates tag as shown in the example. The following table describes the elements that are allowed:
In the example above, the logical identifier of every discovered PDS3 data product will be prefixed with URN:JPL:PDS:ENGINEERING.

association

Specify this element to tell the Harvest Tool what associations belong to each discovered PDS3 data product. One or more association elements may be specified, and they must be within the associations tag as shown in the example. The following table describes the elements that are allowed:
Note that lidVidReference and lidReference cannot be used together within the same association tag. Only one can be chosen. In the example above, each discovered PDS3 product will have two associations: one with a LIDVID of URN:NASA:PDS:target.MARS::1.0 and an association type of has_Target, and one with a LID of URN:NASA:PDS:mission.MER and an association type of has_Mission.

ancillaryMetadata

Specify this element to tell the Harvest Tool what additional metadata to register. The following table describes the elements that are allowed:
In the example above, the values of the START_DATE_TIME and STOP_DATE_TIME elements will be extracted from a PDS3 product label. If they are found in the label, they will be registered as slots in the registry, using their element names in lowercase form as the slot names. In this case, start_date_time and stop_date_time will be used as slot names in the registry.

includePaths

Specify this element to tell the Harvest Tool where to find file references specified in a label. By default, the tool will look for the file reference in the location of the label file. The following table describes the elements that are allowed:
In the example above, the tool will look in the /data/pds3/label directory for file references if they cannot be found in the same location as the label file.

Report Format

This section describes the contents of the Harvest Tool report. At this time, the Harvest Tool only outputs a series of log messages. The log reports whether each discovered product was registered successfully. Additionally, any syntactical errors in a discovered product are reported. Each log message consists of a severity level, a file name, and a message. The following is an example of some of the log messages that can be expected from the Harvest Tool:

PDS Harvest Tool Log

Version             Version 0.2.0-dev
Time                Wed, Sep 29 2010 at 02:02:27 PM
Registry Location   http://localhost:8080/registry-service

INFO: [C:\pds4\geo\BUGLAB_Archive_Bundle.xml] Begin processing.
SKIP: [C:\pds4\geo\BUGLAB_Archive_Bundle.xml] 'archive bundle' is not an object type found in the policy file.
INFO: [C:\pds4\geo\schema\BUGLAB_Archive_Bundle.xml] Begin processing.
SKIP: [C:\pds4\geo\schema\BUGLAB_Archive_Bundle.xml] 'XML_Schema' is not an object type found in the policy file.
INFO: [C:\pds4\geo\schema\BUGLAB_Collection.xml] Begin processing.
SKIP: [C:\pds4\geo\schema\BUGLAB_Collection.xml] 'XML_Schema' is not an object type found in the policy file.
INFO: [C:\pds4\geo\schema\BUGLAB_Schema_Collection.xml] Begin processing.
SKIP: [C:\pds4\geo\schema\BUGLAB_Schema_Collection.xml] 'collection' is not an object type found in the policy file.
INFO: [C:\pds4\geo\schema\BUG_BDRF_product.xml] Begin processing.
SKIP: [C:\pds4\geo\schema\BUG_BDRF_product.xml] 'XML_Schema' is not an object type found in the policy file.
INFO: [C:\pds4\geo\schema\BUG_Document_Set.xml] Begin processing.
SKIP: [C:\pds4\geo\schema\BUG_Document_Set.xml] 'XML_Schema' is not an object type found in the policy file.
INFO: [C:\pds4\geo\schema\Data_Dict_2010-04-22f.xml] Begin processing.
SKIP: [C:\pds4\geo\schema\Data_Dict_2010-04-22f.xml] 'XML_Schema' is not an object type found in the policy file.
INFO: [C:\pds4\geo\schema\Data_Dict_commpds3_2010-04-22f.xml] Begin processing.
SKIP: [C:\pds4\geo\schema\Data_Dict_commpds3_2010-04-22f.xml] 'XML_Schema' is not an object type found in the policy file.
INFO: [C:\pds4\geo\schema\Data_Types_2010-04-22f.xml] Begin processing.
SKIP: [C:\pds4\geo\schema\Data_Types_2010-04-22f.xml] 'XML_Schema' is not an object type found in the policy file.
INFO: [C:\pds4\geo\schema\Product_XML_Schema.xml] Begin processing.
SKIP: [C:\pds4\geo\schema\Product_XML_Schema.xml] 'XML_Schema' is not an object type found in the policy file.
INFO: [C:\pds4\geo\mars_analog_data\aref_235_450.xml] Begin processing.
SUCCESS: [C:\pds4\geo\mars_analog_data\aref_235_450.xml] Succesfully registered product: \
  URN:NASA:PDS:BUGLAB-GB:BUGLAB-GB:MARS-ANALOG-SAMPLE-DATA:AREF_235_450::1.0
INFO: [C:\pds4\geo\mars_analog_data\aref_235_480.xml] Begin processing.
SUCCESS: [C:\pds4\geo\mars_analog_data\aref_235_480.xml] Succesfully registered product: \
  URN:NASA:PDS:BUGLAB-GB:BUGLAB-GB:MARS-ANALOG-SAMPLE-DATA:AREF_235_480::1.0
INFO: [C:\pds4\geo\mars_analog_data\aref_235_530.xml] Begin processing.
SUCCESS: [C:\pds4\geo\mars_analog_data\aref_235_530.xml] Succesfully registered product: \
  URN:NASA:PDS:BUGLAB-GB:BUGLAB-GB:MARS-ANALOG-SAMPLE-DATA:AREF_235_530::1.0
INFO: [C:\pds4\geo\mars_analog_data\aref_235_600.xml] Begin processing.
SUCCESS: [C:\pds4\geo\mars_analog_data\aref_235_600.xml] Succesfully registered product: \
  URN:NASA:PDS:BUGLAB-GB:BUGLAB-GB:MARS-ANALOG-SAMPLE-DATA:AREF_235_600::1.0
INFO: [C:\pds4\geo\mars_analog_data\aref_235_670.xml] Begin processing.
SUCCESS: [C:\pds4\geo\mars_analog_data\aref_235_670.xml] Succesfully registered product: \
  URN:NASA:PDS:BUGLAB-GB:BUGLAB-GB:MARS-ANALOG-SAMPLE-DATA:AREF_235_670::1.0
INFO: [C:\pds4\geo\mars_analog_data\aref_235_750.xml] Begin processing.
SUCCESS: [C:\pds4\geo\mars_analog_data\aref_235_750.xml] Succesfully registered product: \
  URN:NASA:PDS:BUGLAB-GB:BUGLAB-GB:MARS-ANALOG-SAMPLE-DATA:AREF_235_750::1.0
INFO: [C:\pds4\geo\mars_analog_data\aref_235_800.xml] Begin processing.
SUCCESS: [C:\pds4\geo\mars_analog_data\aref_235_800.xml] Succesfully registered product: \
  URN:NASA:PDS:BUGLAB-GB:BUGLAB-GB:MARS-ANALOG-SAMPLE-DATA:AREF_235_800::1.0
INFO: [C:\pds4\geo\mars_analog_data\aref_235_860.xml] Begin processing.
SUCCESS: [C:\pds4\geo\mars_analog_data\aref_235_860.xml] Succesfully registered product: \
  URN:NASA:PDS:BUGLAB-GB:BUGLAB-GB:MARS-ANALOG-SAMPLE-DATA:AREF_235_860::1.0
INFO: [C:\pds4\geo\mars_analog_data\aref_235_900.xml] Begin processing.
SUCCESS: [C:\pds4\geo\mars_analog_data\aref_235_900.xml] Succesfully registered product: \
  URN:NASA:PDS:BUGLAB-GB:BUGLAB-GB:MARS-ANALOG-SAMPLE-DATA:AREF_235_900::1.0
INFO: [C:\pds4\geo\mars_analog_data\aref_235_930.xml] Begin processing.
SUCCESS: [C:\pds4\geo\mars_analog_data\aref_235_930.xml] Succesfully registered product: \
  URN:NASA:PDS:BUGLAB-GB:BUGLAB-GB:MARS-ANALOG-SAMPLE-DATA:AREF_235_930::1.0
INFO: [C:\pds4\geo\mars_analog_data\aref_235_990.xml] Begin processing.
SUCCESS: [C:\pds4\geo\mars_analog_data\aref_235_990.xml] Succesfully registered product: \
  URN:NASA:PDS:BUGLAB-GB:BUGLAB-GB:MARS-ANALOG-SAMPLE-DATA:AREF_235_990::1.0
INFO: [C:\pds4\geo\mars_analog_data\MAS_Data_Collection.xml] Begin processing.
SKIP: [C:\pds4\geo\mars_analog_data\MAS_Data_Collection.xml] 'collection' is not an object type found in the policy file.
INFO: [C:\pds4\geo\geometry\BUGLAB_Geometry_Collection.xml] Begin processing.
SKIP: [C:\pds4\geo\geometry\BUGLAB_Geometry_Collection.xml] 'collection' is not an object type found in the policy file.
INFO: [C:\pds4\geo\geometry\geominfo.xml] Begin processing.
SKIP: [C:\pds4\geo\geometry\geominfo.xml] 'document_set' is not an object type found in the policy file.
INFO: [C:\pds4\geo\context\BUGLAB_Context_Collection.xml] Begin processing.
SKIP: [C:\pds4\geo\context\BUGLAB_Context_Collection.xml] 'collection' is not an object type found in the policy file.
INFO: [C:\pds4\geo\context\bug_instrument.xml] Begin processing.
SKIP: [C:\pds4\geo\context\bug_instrument.xml] 'document_set' is not an object type found in the policy file.
INFO: [C:\pds4\geo\context\bug_laboratory.xml] Begin processing.
SKIP: [C:\pds4\geo\context\bug_laboratory.xml] 'document_set' is not an object type found in the policy file.
INFO: [C:\pds4\geo\context\bug_mars_data_set.xml] Begin processing.
SKIP: [C:\pds4\geo\context\bug_mars_data_set.xml] 'document_set' is not an object type found in the policy file.
INFO: [C:\pds4\geo\about\aareadme.xml] Begin processing.
SKIP: [C:\pds4\geo\about\aareadme.xml] 'document_set' is not an object type found in the policy file.
INFO: [C:\pds4\geo\about\BUGLAB_About_Collection.xml] Begin processing.
SKIP: [C:\pds4\geo\about\BUGLAB_About_Collection.xml] 'collection' is not an object type found in the policy file.

Summary:

11 of 30 files are candidate products, 19 skipped
11 of 11 candidate products registered.
0 of 0 associations registered.

End of Log
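Because each log message consists of a severity level, a bracketed file name, and a message, the log lends itself to simple post-processing. The following Python sketch is one hypothetical way to tally messages by severity. The regular expression and the function names parse_line and count_by_severity are assumptions inferred only from the sample log above; they are not part of the Harvest Tool, and the log format is not guaranteed to stay the same across releases.

```python
import re
from collections import Counter

# Pattern inferred from the sample log: "SEVERITY: [file name] message".
LOG_LINE = re.compile(
    r"^(?P<severity>[A-Z]+): \[(?P<file>[^\]]+)\] (?P<message>.*)$"
)


def parse_line(line):
    """Return (severity, file name, message), or None for other lines."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    return match.group("severity"), match.group("file"), match.group("message")


def count_by_severity(lines):
    """Count the parsed log messages per severity level."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            counts[parsed[0]] += 1
    return counts


# Two messages copied from the sample log, plus a non-message line.
sample = [
    r"INFO: [C:\pds4\geo\about\aareadme.xml] Begin processing.",
    r"SKIP: [C:\pds4\geo\about\aareadme.xml] 'document_set' is not an "
    r"object type found in the policy file.",
    "Summary:",
]

counts = count_by_severity(sample)
print(dict(counts))
```

Header and summary lines do not match the pattern and are ignored, so the same functions can be applied to every line of a log file written with the -l option.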