PDS4 Data Release: Discipline Nodes
This tutorial details the steps necessary for a PDS Discipline Node to ingest a PDS4 data delivery locally. This procedure assumes that the Node has received either a new or updated bundle from a Data Provider and that the bundle is staged in an area on disk that is accessible by the locally installed PDS4 software (including but not limited to the Validate Tool and Harvest Tool). If the PDS4 software has not been installed, the PDS4 Releases page for a given PDS4 Build provides a high-level procedure for installing the PDS4 software.
See the PDS4 Data Release Process page for more information on how this procedure fits into the full data release process.
- Install / Upgrade PDS4 Software Components
- Validate the Bundle
- Ingest with Harvest
- Post Data To Search Index
- Generate and Deliver Deep Archive Products
- Notify EN
Install / Upgrade PDS4 Software Components
Go to the PDS4 Releases page and install or upgrade the necessary components per the “Discipline Node Environment” section. Improvements to most PDS4 tools are ongoing, so having the latest versions installed helps ensure a successful data release.
In order for a Discipline Node to complete a PDS4 data release, the following components, at minimum, must be installed:
- Validate Tool
- Harvest Tool
- Registry
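Each component is distributed as a downloadable package. As a minimal sketch of an install, assuming zip distributions downloaded to ~/Downloads and an install area of /usr/local (the archive names, versions, and paths here are assumptions; use the actual artifacts and instructions from the PDS4 Releases page):

# Unpack the downloaded packages into a shared tools area
# (install location, archive names, and versions are assumptions)
$ cd /usr/local
$ unzip ~/Downloads/validate-<version>-bin.zip
$ unzip ~/Downloads/harvest-<version>-bin.zip
# Put the tool launchers on your PATH
$ export PATH=$PATH:/usr/local/validate-<version>/bin:/usr/local/harvest-<version>/bin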
Validate the Bundle
The first step to perform after receiving new or updated data from a data provider is to validate that data. This can be accomplished by executing the Validate Tool. The documentation included in the software download package provides details for installing and operating the validation software. There are a few ways to execute the software depending on the state of development of the data. In this case, we would expect to have a complete bundle with fully resolvable schemaLocation references in the PDS4 labels. Perform the following command to validate the target as a PDS4 bundle and force the tool to validate against the XML Schema and Schematron files specified in a label:
$ validate -t <path-to-bundle-root-dir> -R pds4.bundle -r validate.log
The flags utilized in the above command are described in the Validate Tool Operation Guide. Once the validation run completes, review the log file for any errors. The end of the log for a clean run will look like the following:
...
Summary:

0 error(s)
0 warning(s)

End of Report
Any errors generated during the run should be corrected prior to moving to the next step. Depending on the error(s) encountered, the Data Provider may need to be involved in the correction.
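If the schemaLocation references in the labels cannot be resolved over the network (for example, in an offline environment), the Validate Tool can resolve them from local copies via an XML catalog file. A minimal sketch, assuming you have local copies of the schemas and have prepared a catalog file (my-catalog.xml is a hypothetical name; see the Validate Tool Operation Guide for details on the catalog option):

# Validate against local schema copies resolved through an XML catalog (catalog file name is hypothetical)
$ validate -t <path-to-bundle-root-dir> -R pds4.bundle -C my-catalog.xml -r validate.log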
Ingest with Harvest
NOTE: Registry installation must be completed prior to running Harvest.
The next step is to ingest the data into the Discipline Node Registry using the Harvest Tool, which extracts the metadata from each product in the delivery and ingests it into the Registry. Depending on the Node, this ingestion may be preceded by an ingestion into a Node-specific data system or by simply copying the data to an online directory for public access. The rest of this section focuses on ingesting the data with the locally installed PDS4 software. For additional details on how to install and operate the Harvest Tool, see the documentation included in the software download package.
Configure Harvest
If this is an update to a bundle or collection, no additional configuration is needed and you may proceed to the next step.
For new PDS4 bundles, harvest and search policy files for the Harvest Tool will need to be generated.
For a quick test run, complete the Setup and Minimal Config portions of the Harvest Tool Operation guide to get up and running with minimal configuration changes.
For operational use, you will want to refine the configuration to meet your search needs. That involves more detailed updates to the Harvest Policy file as well as the Search API Policy Files (see conf/search/defaults/pds/pds4/*.xml).
More information on the contents of the policy files can be found in the Harvest Tool Operation guide. The learning curve for defining these configurations can be steep. Future plans include simplifying them; in the meantime, please feel free to reach out to the PDS Engineering Node for assistance.
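One low-risk way to start is to copy the example policy that ships with the tool and adjust the paths. A minimal sketch, assuming the tool layout shown in the example below (the my-bundle-policy.xml name is hypothetical):

# From the tool's bin/ directory: start from the packaged example policy
$ cp ../conf/harvest/examples/harvest-policy-master.xml ../conf/harvest/my-bundle-policy.xml
# Point the bundle/collection paths at your staged data, then save
$ vi ../conf/harvest/my-bundle-policy.xml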
Harvest the Data
Once the policy files are ready, you can run the tool. See the Harvest Operations Guide for usage instructions and advanced capabilities.
Example
Here is a brief example of running the tool with a dockerized Registry installed on a Linux OS:
- Test Data: Download DPH_Examples_V11300.zip, unpack it, and place the resulting directory under your $HOME directory (see the sketch after this list)
- Harvest Policy File: master harvest policy file that comes packaged with the tool (conf/harvest/examples/harvest-policy-master.xml)
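A minimal sketch of staging the test data, assuming the zip was downloaded to ~/Downloads (that location is an assumption):

# Unpack the example data under $HOME (download location is an assumption)
$ cd $HOME
$ unzip ~/Downloads/DPH_Examples_V11300.zip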
1. First, let’s ingest some data:
# See where I am
$ pwd
/usr/local

# See what is in here
$ ls -l
registry
registry-2.2.0
harvest-2.5.0

# Let's go to where the tools live
$ cd harvest-2.5.0/bin

# As long as you downloaded and unpacked the example data into your $HOME directory,
# and $REGISTRY_HOME is set per your installation,
# we can run Harvest with the example config right out of the box
$ ./harvest -c ../conf/harvest/harvest-policy-master.xml -o $REGISTRY_HOME/../registry-data/ -l ../harvest-$(date +%Y%m%d).log

# After a few seconds the command prompt returns. We can then take a look at the entire log file
$ less ../harvest-$(date +%Y%m%d).log

# Or if you just want to see the summary:
$ tail -50 ../harvest-$(date +%Y%m%d).log
...
Summary:

15 of 15 file(s) processed, 3 other file(s) skipped
0 error(s), 0 warning(s)

Product Labels:
15 Successfully registered
0 Failed to register

Search Service Solr Documents:
15 Successfully created
0 Failed to get created

XPath Solr Documents (Quick Look Only, Ignore Failed Ingestions):
15 Successfully registered
0 Failed to register

Product Types Handled:
1 Product_Observational
5 Product_Collection
2 Product_Document
1 Product_Browse
3 Product_File_Text
2 Product_Context
1 Product_Bundle

Registry Package Id: e297e819-1fb9-4e96-96d3-ce0c6bce3565

End of Log
... SUCCESS!
See Harvest Operations Guide for more details on the expected output.
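To quickly scan a Harvest log for problems from the command line, a simple sketch (on a clean run, each matching count line should report zero):

# Pull the error/failure lines out of the Harvest log
$ grep -E "error|Failed" ../harvest-$(date +%Y%m%d).log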
2. Next, let’s go and see the data Registered in the registry collection: http://localhost:8983/solr/registry/select?q=*%3A* (Update host/port as needed for your installation).
TBD screenshot
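If you prefer the command line, the same query can be issued with curl using standard Solr select parameters (adjust host/port as needed for your installation):

# Fetch the first 5 registered products as JSON from the registry collection
$ curl "http://localhost:8983/solr/registry/select?q=*%3A*&rows=5&wt=json"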
OR you can go to the Solr admin interface and select the registry collection from the dropdown in the left sidebar to get to: http://localhost:8983/solr/#/registry/collection-overview
TBD screenshot
Then select the Query tab and click Execute Query to run a search:
TBD screenshot
Success! You have now completed your first ingestion of data into the registry! Continue on for posting to Search and eventual cleanup.
Post Data To Search Index
The next step is to post the data to the Search index, which improves the accessibility and discoverability of the data.
See the Registry Operations Guide for how to Add Data to the Registry and Search.
Example
Continuing from the example above, we are now ready to ingest this data into the search index:
$ ./registry-mgr $REGISTRY_HOME/../registry-data/solr-docs
/usr/local/openjdk-11/bin/java -classpath /opt/solr/dist/solr-core-7.7.2.jar -Dauto=yes -Dparams=tr=add-hierarchy.xsl -Dc=pds -Ddata=files -Drecursive=yes org.apache.solr.util.SimplePostTool /data/solr-docs
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/pds/update?tr=add-hierarchy.xsl...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
Entering recursive mode, max depth=999, delay=0s
Indexing directory /data/solr-docs (1 files, depth=0)
POSTing file solr_doc_0.xml (application/xml) to [base]
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/pds/update?tr=add-hierarchy.xsl...
Time spent: 0:00:00.688
We can then go and see the JSON output of data in the search index by querying the data collection: http://localhost:8983/solr/data/select?q=*%3A*
TBD screenshot
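As with the registry collection, you can also issue this query with curl (standard Solr parameters; adjust host/port as needed):

# Fetch the first 5 search-index documents as JSON from the data collection
$ curl "http://localhost:8983/solr/data/select?q=*%3A*&rows=5&wt=json"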
OR again, you can go to the data collection from the Solr Admin Interface and execute the query: http://localhost:8983/solr/#/data/query
Success! You have now completed your first ingestion of data into the registry and search index!
For help on how to cleanup the Registry and Search indexes for operational use, see the Registry Documentation.
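As one illustration of the kind of cleanup described there, a Harvest run can be backed out of the registry collection with a standard Solr delete-by-query, assuming the registry schema stores the Harvest package id in a package_id field (that field name is an assumption; verify against the Registry Documentation before use). The package id below is the one reported at the end of the example Harvest log above:

# Delete everything ingested under a given Harvest package id
# (package_id field name is an assumption about the registry schema; verify before use)
$ curl 'http://localhost:8983/solr/registry/update?commit=true' \
    -H 'Content-Type: text/xml' \
    --data-binary '<delete><query>package_id:"e297e819-1fb9-4e96-96d3-ce0c6bce3565"</query></delete>'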
Generate and Deliver Deep Archive Products
TBD: software is planned for Build 10b to generate the Archival Information Package (AIP) for delivery to EN and the Submission Information Package (SIP) for delivery to NSSDC.
Notify EN
The final step in the release process is to notify the EN operations staff that a new delivery is available for release to the public. An email should be sent to pds-operator@jpl.nasa.gov with the following information (TBD: update with the new PDS Release Notification Policy):
- Release Information
- Title of the release
- Brief description of the release
- LIDVID of the Bundle
- URL of the Bundle’s online location
- Dates
- Delivery Date(s) – date(s) the data was delivered by the instrument team
- Online Date – date the data was released at your node
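A sketch of such a notification email, using only the fields listed above (all values are placeholders):

To: pds-operator@jpl.nasa.gov
Subject: [Your Node] Data Release: <title of the release>

Release Information
- Title of the release: <title>
- Brief description of the release: <description>
- LIDVID of the Bundle: urn:nasa:pds:<bundle_id>::<M.m>
- URL of the Bundle's online location: https://<your-node-host>/<path-to-bundle>

Dates
- Delivery Date(s): <YYYY-MM-DD>
- Online Date: <YYYY-MM-DD>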