SCR 3-0002 Vote Comments


RINGS

Mark Showalter

Rings Node votes YES. I have a few minor comments, appended below, that can be addressed later as updates to the Standards Ref.

While I understand the objections raised by Mike and Dick, I think it’s important to remind one and all that our user community has severely criticized PDS in the past for taking too long to adopt new standards. The SPREADSHEET (perhaps poorly named) has been requested by our data providers, discussed for years by our Tech people, and fills an identified need. PDS loses an enormous amount of public credibility if we delay the adoption any further.

–Mark

Residual comments about the SPREADSHEET:

I would like the Standards Reference to state explicitly that use of the SPREADSHEET is discouraged in favor of the TABLE. A SPREADSHEET should only be used in situations where the savings in data volume of a SPREADSHEET relative to the equivalent TABLE are substantial.

I believe that the MISSING_CONSTANT should be a required keyword in the FIELD object. If it is present, then it becomes possible to perform a direct translation between a SPREADSHEET object and the corresponding TABLE object. If it is not present, then the aforementioned conversion becomes impossible, because the conversion algorithm would not know how to handle a field of the SPREADSHEET that happens to be empty.

This may sound trivial but I don’t think it is. PDS has and supports tools (Object Access Library, NASAView) that handle TABLE objects just fine. If we ever want these tools to support SPREADSHEETs in an equivalent way, then the MISSING_CONSTANT is needed.

The Standards Reference could alternatively offer vaguer wording that the MISSING_CONSTANT is needed for any field that might be empty, but not for those fields that are always populated.
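
The conversion argument above can be illustrated with a minimal sketch in Python. The field names, widths, and the -999 missing constant are hypothetical, chosen only for illustration; this is not code from any PDS tool. Each delimited field is padded to a fixed width, and an empty field is replaced by that field's missing constant. Where no missing constant is defined, an empty field cannot be converted.

```python
import csv
import io

# Hypothetical field definitions: (name, fixed width, missing constant or None).
FIELDS = [
    ("TIME", 8, None),            # assumed always populated, so no constant
    ("TEMPERATURE", 6, "-999"),   # may be empty, so a constant is supplied
]


def spreadsheet_row_to_table_line(row, fields=FIELDS):
    """Convert one delimited row into one fixed-width line."""
    if len(row) != len(fields):
        raise ValueError("row has %d fields, expected %d" % (len(row), len(fields)))
    columns = []
    for value, (name, width, missing) in zip(row, fields):
        if value == "":
            if missing is None:
                # Without a missing constant there is nothing to write here.
                raise ValueError("field %s is empty and has no missing constant" % name)
            value = missing
        if len(value) > width:
            raise ValueError("field %s does not fit in %d characters" % (name, width))
        columns.append(value.rjust(width))
    return " ".join(columns)


def spreadsheet_to_table(text, fields=FIELDS):
    """Convert comma-separated text into a list of fixed-width lines."""
    reader = csv.reader(io.StringIO(text))
    return [spreadsheet_row_to_table_line(row, fields) for row in reader]
```

In this sketch the TIME field has no missing constant, so conversion fails only if a TIME value is actually empty. That corresponds to the vaguer wording: the constant is needed only for fields that might be empty.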


RS

Dick Simpson

I realize that we will almost certainly be in the minority and possibly a minority of one, but I think it is important to address a significant philosophical issue. This is an issue on which PPI and SBN long ago agreed to disagree while still working happily together, but it is an issue which much of the MC would prefer to ignore. The basic point in my mind is that we should absolutely minimize the number of different formats that end-users of the archive are facing. New formats should be introduced not because data providers want them and not because we as archivists want them, but only because we think the end users will need or really want this new format twenty years from now.

Let me say at the outset that I have no particular problem with any of the details of the object …

In the interests of promoting discussion on this philosophical point, I am going to withhold a ‘YES’ vote for the time being. If the only way to ensure discussion is with a ‘NO’, then the vote counters should record it so.

I’m not so worried about end users. Philosophically, I believe the archive itself should be kept as stupidly simple as possible. Translators, if PDS chooses to implement them, can convert the simple formats to user-friendlier ones as the climate changes.

Like SBN, I have no particular problem with this object; in fact, I have invested a fair amount of time in refining the proposal currently on the table. But I’m not convinced that defining new objects is good practice when existing objects will do the job. So a broader discussion would be helpful.

Also, since I’m not convinced the ‘standards process’ is bug-free, this provides an opportunity to review its theoretical function and its operation under modest stress. For example, we were told last week ‘If there are several “no” votes, and/or questions, the SCR will return to the working group for rework.’ What is ‘several’, and is referral to the working group appropriate when there appear to be no TECHNICAL issues? But SPREADSHEET isn’t even on the telecon agenda for tomorrow; so perhaps technical issues have been raised(?).


SBN

Mike A’Hearn

Spreadsheet object: SBN votes NO.

Mike – reasons explained below.

I realize that we will almost certainly be in the minority and possibly a minority of one, but I think it is important to address a significant philosophical issue. This is an issue on which PPI and SBN long ago agreed to disagree while still working happily together, but it is an issue which much of the MC would prefer to ignore. The basic point in my mind is that we should absolutely minimize the number of different formats that end-users of the archive are facing. New formats should be introduced not because data providers want them and not because we as archivists want them, but only because we think the end users will need or really want this new format twenty years from now.

Let me say at the outset that I have no particular problem with any of the details of the object. I frequently use spreadsheets, both for doing calculations and for keeping simple lists. Spreadsheets are a very handy way of keeping data logs as one goes along. I frequently import flat ascii tables into spreadsheets and then carry out calculations based on what is imported. For simple calculations, this is even easier than reading the table into IDL, for example. Since I have IDL on my Mac, I am now at last in a position where I don’t need a separate computer to use both spreadsheets and a real data analysis environment. I am, however, concerned about the whole spectrum of users and when I characterize what “I” would do below, I am trying to be the generic user, not just me personally.

Let me discuss the advantages of having such an object, then why those advantages are not significant, and then the disadvantages of having the object. I will follow with some comments on circumventing the need for the object.

Reasons for having a spreadsheet object as proposed:

1. It is really easy for the data producer to make by simple export from most spreadsheet programs.
2. It is more efficient in the use of space (on CD or hard-drive or whatever) than the corresponding ascii table.
3. It is really easy to re-import the data into a spreadsheet for visualization and/or for simple subsequent calculations.

Why these reasons don’t really matter:

1. Simplicity for the data provider, while nice, should be a very secondary consideration compared with value to the end user. All other things being equal, this would be a good way of choosing between alternatives, but other things are rarely equal.
2. For sparsely populated tables, the savings in space can be a large percentage. However, if these are tables that are small enough to fit into most spreadsheet programs, then the total wasted space is small compared to the gigabytes and terabytes that we are normally concerned with. Unless we are considering files that are much larger than I envision, this is not a practical issue with current technology. Some specific aspects will be addressed below.
3. While it is easy to import things back into a spreadsheet, why would I want to do this? The spreadsheet will do the formatting for me, to put it back into the format of a fixed width ascii table (duh!). The spreadsheet will allow me to do some further calculations – but would I want to do them here rather than in the environment in which I am analyzing images and spectra, for example? – probably not unless it is a peripheral calculation. I will have lost any information that was used in the original spreadsheet to do calculations, since only values are exported, not formulae.
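
The space argument above can be made concrete with a rough sketch in Python. The table dimensions and values are invented purely for illustration, and the two-byte line terminator is an assumption; nothing here is measured from real archive data.

```python
def table_sizes(rows, widths):
    """Return (delimited_bytes, fixed_width_bytes) for the same rows.

    rows   -- list of rows, each a list of strings ("" means an empty field)
    widths -- column widths used by the fixed-width layout
    """
    line_end = 2  # assumption: each line ends with carriage return + line feed
    delimited = sum(len(",".join(row)) + line_end for row in rows)
    # Fixed-width line: every column padded, one separator between columns.
    fixed_line = sum(widths) + (len(widths) - 1) + line_end
    fixed = fixed_line * len(rows)
    return delimited, fixed


# Hypothetical sparse table: 1000 rows, 10 columns of width 10,
# with only the first column populated in each row.
rows = [["1.2345"] + [""] * 9 for _ in range(1000)]
delimited, fixed = table_sizes(rows, [10] * 10)
saving = fixed - delimited
print(delimited, fixed, saving, round(100.0 * saving / fixed, 1))
```

With these made-up numbers the delimited form is roughly 85 percent smaller, yet the absolute saving is under 100 kilobytes, which is the distinction between percentage and volume drawn above.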

Reasons for not having a new object:

1. Our outgoing program exec objects to the proliferation of formats. Who knows what the incoming program exec will want. This could be the most important reason of all!
2. The proliferation of potential formats really does make it hard for subsequent users. There are still people who use Unix boxes (Unices?) for analysis. Unless they use StarOffice (which is not really very good according to the people I know that have tried it), they generally don’t have a spreadsheet program available without switching to a different computer. Most users rapidly get annoyed at formats designed for specific software that they are not actually using. Personally, if I wanted to do real analysis that required the data from the spreadsheet object, I would rather have it in the same environment as the rest of my data (IDL for me, IRAF for some other users, I suppose VICAR for some diehards, and so on) so that the output of any calculation would be directly usable in the rest of my analysis. Even having a spreadsheet program on the same computer platform, I would only use it if I just wanted a separate window on my desktop for viewing the spreadsheet as a text file. Now this can be done
3. We have in the past tried to severely limit the variety of formats that data providers to SBN can use (mainly because of item 2 immediately above), with modest but far from complete success. Having a multitude of data formats out there, all advertised in the standards documents, encourages data providers to argue that they should be allowed to use such a format since it will make it easier for them and is allowed by the standards.
4. The comparison with FITS, that Barry has been fond of making, is an important one. Even though FITS establishes very rigid format standards, nobody in the astronomical community complains about it, neither data providers nor data users, because it was developed from the bottom up by users concerned with making it the simplest format that was completely cross-platform. The FITS community has consistently avoided proliferation of formats. Although the FITS community has a problem with lack of required content in labels, most providers are good about providing the content, albeit using a variety of keyword names. When one talks to PDS users, one does not get the same universal agreement that PDS formats are the only way to go.

How to help meet the goals of the spreadsheet object:

1. There are two different ways in which tables can be very sparse, yet still involve numbers that are used in calculations. One case is that of a table of numerical data with, e.g., the rightmost column being a comments field. The comment field is often blank but it sometimes contains a paragraph or more of text. Putting that comment field as the last column of a fixed-width ascii table does make the whole table unwieldy. However, it is obviously the case that the comment field is never meant to be operated on by the computer – it is there for visual inspection only. The “obvious” solution is to do what one does when publishing a table in a journal – make the last column a two-character footnote column, identifying relevant footnotes that are then contained in a text object such as the table description. I note that a fixed width table can be read into a spreadsheet just as easily as a csv table and is far more readable on its own.
2. Another type of sparse table is one in which, e.g., only a few columns are measured at each line of a large table, i.e. there are mostly “missing data” flags such as “NK” or “NA”. In most of these cases, I think the actual volume of the wasted space is small (in Gigabytes but not in percentage) and the table is no more readable in the spreadsheet than as a flat ascii table. In some cases, it may make more sense to represent the data as a 3-d object rather than a 2-d object (which a table is inherently).
3. My bottom line is that all of the examples that I think are likely to occur can be handled as ascii or binary tables or as other existing objects, without any loss of capability to the end user of the archive.
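
The footnote-column suggestion above might look like the following minimal sketch in Python. The column positions, footnote code, and footnote text are all hypothetical; the point is only that the table stays narrow and fixed-width while the long comment text lives elsewhere.

```python
# Hypothetical fixed-width layout: (name, start, end) character positions.
COLUMNS = [("TIME", 0, 8), ("VALUE", 9, 15), ("NOTE", 16, 18)]

# Hypothetical footnote text, which would be kept in a text object such as
# the table description rather than in the table itself.
FOOTNOTES = {"A1": "Detector saturated; value is a lower limit."}


def parse_line(line, columns=COLUMNS):
    """Slice one fixed-width line into a dict of stripped field values."""
    return {name: line[start:end].strip() for name, start, end in columns}


def comment_for(record, footnotes=FOOTNOTES):
    """Return the footnote text for a record, or None if it has no note."""
    code = record["NOTE"]
    return footnotes.get(code) if code else None


table = [
    "00:00:01  271.5 A1",
    "00:00:02  270.9   ",
]
records = [parse_line(line) for line in table]
```

Here only the first record carries a footnote code, so `comment_for` returns the footnote text for it and `None` for the second.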