Wednesday, December 14, 2011

Seeking open review of provocative data publication essay

Peter Fox and I have submitted an essay to the Data Science Journal entitled "Is data publication the right metaphor?". If accepted, the essay would be part of a special issue resulting from the inaugural conference of the new ICSU World Data System.

In this essay we discuss the importance of metaphor in framing how we think, and we critically examine the data publication metaphor along with several others. Our intention is to be somewhat provocative and to spark a conversation in the community about how we conceive of, describe, and evolve the overall data management enterprise.

While the paper has been submitted for formal peer review, the DSJ editors have also allowed us to make the essay available for open community review. The essay is available at http://dl.dropbox.com/u/546900/parsons_fox_metaphor_dsj_open.docx. We encourage everyone to read it and to provide constructive critique in the comments of this blog.

Apologies for cross-posting, but we really want to get the word out.

Thank you,

Mark Parsons and Peter Fox

34 comments:

  1. This is a valuable piece of work.

    I might suggest that a seventh critical attribute of an effective approach might be something like: “the work of making trusted data discoverable, usable, and understandable, and of preserving those data, is recognized and rewarded.” Much of the drive for data publication is the current under-recognition of the work of handling data within the workflow of the scientific method, particularly in the emerging 4th-paradigm science effort.

    Data are central to the historically core practice of the scientific method: reproducibility. Without robust data stewardship, science becomes anecdotal. Of course, data are also key to science innovation and discovery. Stamping a “publication” onto the final output of the labor and the expertise of science data wranglers may be neither appropriate nor sufficient to recognize their various contributions. Government agency data providers are also not supportive of adding personal recognition to individuals for publicly available data services.

    One advantage of the data ecosystem metaphor might be the ability to populate it with various “badges” of recognition that could offer many of the rewards of publication (including professional advancement) while avoiding subsuming the various important roles in the data value chain into a single role as “author.”

    final note: any data article that mentions both Lakoff and Goffman is brilliant...

  2. I think stimulating this debate is valuable, and I also echo Bruce's call for the 7th critical attribute w.r.t. recognition and reward. The sociological aspect is important, whether we want it to be a factor or not. Many careers rest (fairly or not) on citation metrics.

    I also am concerned about the publication metaphor sucking data into a morass of copyright and licensing. However, I note that there is some evolution in publications as well, toward "open access" as an alternative to traditional restrictions on e-pubs.

    Also, I think the publication metaphor has a lot of history behind it that gives it extra power. There was a time, not all that long ago, when datasets were small enough that they were often published as part of a journal article, in some cases even as the main part.

    I think the publication metaphor can perhaps even be pushed further in some aspects, such as the idea of "preprint" data, which would be the data during its evolutionary phase before "final" publication.

    Finally, I had some trouble with the "Big Iron" metaphor, as our data center is part of that, yet has no big machines. To me, it is more about the management of data by officially designated institutions that specialize in data management. This admittedly does not make for a snappy metaphor.

  3. I enjoyed reading this article. I noticed that the "alternate" metaphors actually describe the ecosystem metaphor quite closely. A common research activity in ecosystem analysis is to trace the disposition of a single material such as carbon (or energy itself) throughout the system - identifying its various forms, transformations, sources, sinks, and effects. This is a perfect metaphor for tracking data through a marketplace. The material or energy may be stored (archived) in a tree or other component. And ecosystems feature "specialists" that thrive on material or energy in a particular form (analogous to data users who need data in a particular form). An ecologically rich area such as a wetland might be a good metaphor for the bazaar, where these activities are maximized due to the great availability of carbon (or data). The presence of catalyst chemicals, such as nitrogen fertilizers, helps maximize the throughput, much as specialists help maximize data use. Data releases are modeled by the introduction of new species into the ecosystem. Finally, a characteristic of ecosystems is their interactions with the environment. Often, this role is more than a passive one, in that the ecosystem changes the environment - similar to how data affects the underlying science. So the "alternate" metaphors are not fundamentally different from the ecosystem metaphor.

  4. I'm just going to read the paper and comment, but meanwhile, you probably want to read this: http://goo.gl/Bg9aA (or the longer pre-publication draft at http://is.gd/gilUt)

  5. Thanks for the feedback. More soon. Meanwhile, here's a PDF for those who struggle with Word: http://dl.dropbox.com/u/546900/parsons_fox_metaphor_dsj_open.pdf

  6. Thanks, Bryan. We'll make sure your paper is cited in the revised version (oversight on my part).

  7. This comment has been removed by the author.

  8. Interesting stuff!

    One thing I'd like to point out is that you don't need a DOI to cite a dataset; you can cite a dataset even if it's not complete/frozen. What the DOI gives the user is the guarantee that what was cited last week is the same thing they'll be able to download today. If the user doesn't really care about that, then they can cite the data as it's being produced or changed (by simply using a URL in the citation string instead of a DOI).
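    To make that concrete, here's a minimal sketch of the two cases (the function and fields are purely illustrative, not any real citation standard): a frozen dataset is cited by DOI, while a still-evolving one is cited by URL plus the access date, so readers at least know what was retrieved and when.

    ```python
    from datetime import date

    def cite_dataset(creator, year, title, doi=None, url=None):
        """Build a simple citation string; prefer the DOI when one exists."""
        if doi:
            locator = "doi:" + doi  # frozen: guaranteed to resolve to the cited thing
        else:
            # evolving data: cite the URL and pin the access date instead
            locator = url + " (accessed " + date.today().isoformat() + ")"
        return "{} ({}). {}. {}".format(creator, year, title, locator)

    # A frozen dataset with a DOI, and a still-changing one cited by URL:
    print(cite_dataset("Smith, J.", 2011, "Example Sea Ice Extent, v2",
                       doi="10.0000/example"))
    print(cite_dataset("Smith, J.", 2011, "Example Sea Ice Extent (working copy)",
                       url="http://example.org/data/seaice"))
    ```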

    Like publications, I can picture a situation where certain data citations will rank higher than others. Just like a publication in Nature outweighs a publication in "The Internet Journal of Very Important Things", a dataset citation with DOI from a trusted repository will be trusted more than a dataset citation with no DOI from a blog post somewhere.

    I'd quibble with your comment "[data managers] do not seem to recognize the often-broad use and evolution of a data set long before it may be formally published." We're spending a lot of time thinking about this, and how best to serve our users. Yes, we want to get the data in, and out to those who can make use of it. But credit for the data producers is a major driver for DOI assignment. We want a completed, DOI-ed dataset to be something to be proud of and to aspire to. That doesn't mean that the steps along the way are ignored - far from it! And, a lot of the time, our users don't want their data to be published before they've managed to extract all the research value they can out of it, meaning that the dataset is kept private until all of that evolution of the data is complete.

    I think the data publication metaphor is so prevalent because publication has been the route of dissemination of scientific knowledge since the very beginning of the scientific method, and because scientists (myself included) can be a conservative bunch when it comes to getting recognition for their work. It's not the best metaphor, but it's one that appears to be gaining ground. Yes, we need to think of a new, improved method of getting data out there, and getting credit for the data creators, but that could take a while. In the meantime I'll be aiming for the less-than-perfect solution, because it's better than what we've got now!

  9. Thank you all for your thoughtful responses. This is clearly going to be a changed paper. Keep 'em coming.

    Some specific responses:

    @Bruce Glad you liked Lakoff and Goffman. I only dabble, but I think cognitive science, social science, and philosophy need to play a bigger part in data science practice.

    @Rob I too am digging the ecosystem metaphor. You describe it well and with rich metaphor. You may be interested in a recent paper where I delve into this a bit: http://jis.sagepub.com/cgi/content/long/0165551511412705v2

    @Bryan I have only scanned your paper, but it looks good. We'll definitely cite it in the final.

    @Chris You're right. There probably is no "right metaphor". They're like models. All metaphors are wrong, but some are useful. And indeed, "more metaphors please".

    @Sarah Point well taken on the effort and concerns data centers take throughout the data life cycle. I'm not a big fan of keeping data private until you're done with it, though. Also to counter-quibble, the scientific method (Aristotle?) long precedes scientific journals (1600s).

  10. Regarding the additional critical attribute of credit. First, I should qualify. The list of attributes came largely from a Fox/Parsons brainstorming session. It is not an absolute list, and our assessment of the different paradigms against the list is also informal (a rigorous study may be warranted). That said, is fair credit a critical attribute of a data system? Maybe. I guess I'd say credit's twin--accountability--is critical. Regardless, we do need to consider credit and accountability.

    I'm a big advocate of data citation, but I'm wondering now if that is the answer. It is not especially welcomed by scientists. Some like the idea; some see it as diluting citations to their papers. Bruce also pointed out that some funding agencies question the idea of recognizing individuals as data authors. I don't agree with the detractors, but maybe we need to disaggregate or decouple the citation concept, as well as our metaphors. The conflation of credit, reference, and peer review in a simple little citation complicates matters. We need to separate concerns. I think this is related to some of Chris R's concerns about DOIs. Also, Bruce's badging idea warrants further exploration.

  11. As one might expect, I have some opinions about this paper ... but they're too long to fit here. You can see them at http://goo.gl/Ym9sU

  12. @Mark - We're not big fans of keeping data private either, but until there's a way of getting credit for the data producers if they make it open, we're a bit stuck.

    Fair point on the scientific method being older than journals, but I'm not too sure that Aristotle was following the scientific method when he was claiming that human males had more teeth than human females... (http://en.wikipedia.org/wiki/Aristotle)

  13. Repost:

    From: Chris Rusbridge
    Date: 15 December 2011 10:50:41 GMT
    To: Research Data Management discussion list
    Subject: Re: "IS DATA PUBLICATION THE RIGHT METAPHOR?"

    Sometimes I have to remind myself that twitter is not the only, let alone the best, medium of information transfer these days! John Wilbanks alerted us by twitter to an article submitted for publication in the "Data Science Journal". It's being reviewed by the traditional means, but the authors have also been permitted to put the article on their web site and invite open review through the comment stream.

    See http://mp-datamatters.blogspot.com/2011/12/seeking-open-review-of-provocative-data.html

    I thought readers of this list might like the article. Some might be incensed; others might agree.

    I did try to comment, but for some weird reason it won't accept my comment. So, FWIW I'll put my comment here...

    This is an excellent and thought-provoking article, which I very much enjoyed. I found myself agreeing and disagreeing in almost equal measure.

    I rather doubt there is any "right metaphor". Looking back over a long career I can remember many meetings where we adopted metaphors that at first helped and then hindered us. The parallels that started out interesting and thought-provoking ended up seeming narrow and flawed. The metaphor only works so far.

    Perhaps the title should be "Is the data publication metaphor limiting?". The authors do suggest a couple of times that the data publication metaphor is a pretty good fit, just not good enough.

    Partly this is about world view. One classic case of clashing world view is the librarian versus archivist problem. Librarians often don't understand how much their world view is shaped by the multiplicity of the items they curate. Archivists are very aware of the uniqueness of their items, but sometimes forget the massive demands of volume, and over-state the importance of maintaining context. Neither world view is really compatible with handling research data, whether hot from the experiment or sliced and diced for publication.

    Although it may be as good as we've got, I certainly worry about DOIs for datasets. Apart from the obvious point that DOIs are not (automatically) persistent identifiers, I do think that the DOI model was always designed to identify a stable, published thing that might be moved around. The "thing" tends to be small (by data standards), and usually fairly atomic (as far as the DOI is concerned).

    It seems to me likely that shoe-horning a very different use-case into the DOI system could break it. So here we have potentially vaster size, much greater structure and hence many more possibilities for needing to cite parts sliced different ways, and much greater potential for change (sometimes very rapid) to be a factor.

    Perhaps we should respect our metaphors for what they are: useful aids up to a point. We are blind men feeling the elephant, and we have to share our perceptions to understand it better. We don't yet understand data properly. More metaphors, please!

    --
    Chris Rusbridge

  14. Kudos for a stimulating piece and for getting this conversation going out in the open. You make some excellent points. My main criticism of the current draft would be that I think the conclusion is both too strong (in emphasizing the flaws with 'data publication') and too weak (in that we don't just need a bunch of different metaphors; we need a set of complementary metaphors that clarify different aspects of the problem, each perhaps specific to its context).

    While there may be some limits to the publishing metaphor, it's not being given entirely fair due here, in part because 'publishing' is being circumscribed too narrowly. Publishing is not a monolithic thing, but a variety of processes that mediate between the producer and consumer of a scholarly work. The social contract responsible for the success of scientific publishing is that one makes one's work 'public' so that others may evaluate and build upon it, and in return one receives professional credit for the work. Anything that contributes to the execution of that contract may be considered an aspect of scientific publishing, including potentially the dissemination of research data.

    More pragmatically, there are all sorts of issues in traditional article publishing that have near direct analogs in the data dissemination realm, and we have yet to explore as a community the many, many lessons that one has for the other. Just a few random ones that would be productive to consider: the division between curation/editorial process and peer review, business models for ensuring accessibility and preservation, the balance between narrative and structured metadata, how to capture post-publication review, how to align incentives for scholarly communication with community need and professional reward. Many of these are the subject of radical experimentation and debate in the article publishing realm, so I think it is premature to say that the solutions that have been found for articles do not inform what solutions we should employ for data.

    Furthermore, the 'data publication' metaphor resonates uniquely well with researchers because of their experience with it. Strategically, I think we are better off more fully exploiting the 'data publication' metaphor to build awareness of issues in data dissemination among researchers. Other metaphors, such as 'ecosystem' or 'infrastructure', may be useful in conveying the idea that there are myriad components that dynamically interact, or that there is an engineered substructure on which data publication depends, but they don't help researchers visualize their role in nearly so tangible a way as does 'data publication'.

  15. I admit, I only skimmed through the paper, but I'd mostly end up parroting Bryan. (And now I know where the 'big P vs. little p' reference comes from that was mentioned a few times at the DataCite meeting in August.)

    One other thing to consider is that scholarly Publication is really limited by page count ... in theory, peer review is supposed to make sure that everything necessary for reproduction of the research is included, but we assume tacit knowledge for the discipline. Much of the data released has similar issues ... so is it considered 'Published' or simply released?

    We also have cases where there may be a Publication without any actual data ... someone registers the data in some system, documenting the existence of the data, but it's encumbered in some way that requires people to go through a process to gain access to the data. Should we consider the description 'Published', even if the data isn't?

    The P-vs-p issue also comes up for active archives -- there's no way for me to "Publish" the data ... but I could create something describing the state of the data at a given time / release, and Publish that document which would persist even if the underlying data were moved / replaced / removed / etc. (but if we then updated the document to make note of that fact, per the recommendation of the technical breakout at the BRDI meeting in August, would it still be considered 'Published'?)
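    As a rough sketch of what that state-describing document might contain (everything here is hypothetical; it's not an existing standard or tool), even a simple manifest of per-file checksums plus a timestamp would capture the state of the data at a given moment:

    ```python
    import hashlib
    import json
    import os
    import time

    def describe_state(data_dir):
        """Snapshot the state of an active archive: one checksum per file,
        plus the time the snapshot was taken."""
        state = {"snapshot_time": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
                 "files": {}}
        for name in sorted(os.listdir(data_dir)):
            path = os.path.join(data_dir, name)
            if os.path.isfile(path):
                with open(path, "rb") as f:
                    state["files"][name] = hashlib.sha256(f.read()).hexdigest()
        return state

    # The resulting document could itself be "Published" and persist,
    # even if the underlying data are later moved, replaced, or removed.
    print(json.dumps(describe_state("."), indent=2))
    ```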

    (I think I sent the writeup of the BRDI technical breakout session to the AGU ESSI mailing list ... but it looks like archiving isn't enabled, so if you didn't get it, bug me and I'll send it to you. It also gives a way to credit all of the people involved with the data production (see Sarah's comment), rather than trying to stuff the equivalent of movie credits into a citation.)

    ps. if you like Lakoff's work, I highly recommend 'Women, Fire, and Dangerous Things', but hopefully you won't get obsessed with categorizing stuff like I have.

  16. I find myself in agreement with most of the other comments - mostly appreciative of the dialog that has been sparked and the need for deeper examination of the many assumptions we make in the use of language to deal with a complex problem space.

    My only significant issue is in regard to the continued separation of data publication as somehow distinct from the synthesis and commentary about those data (i.e., the typical verbiage of a published paper). I feel that we have arrived at this rather odd publishing paradigm as a result of prior constraints on publishing and disseminating data in any useful form. That time has now passed. I am not convinced that it does us any good to think about datasets as separate and special entities from whatever else might be related to them.

    Instead, it seems to me we should be moving to a publishing approach that demands better adherence to certain key principles in advancing knowledge, such as verifiability, replicability, discoverability, remixability, etc. Research results (as data and accompanying objects and commentary) can be shared as early as practicable, and then a "publication" can result from improvement on these different metrics through additional effort and revision. If the data are robust but not otherwise terribly compelling, then we need not require any significant commentary beyond a quick summary of the data and perhaps a reflection or two. On the other extreme, we can still support publications that may have somewhat limited and perhaps even compromised data that have lengthy commentary because they happen to also tell a particularly interesting story. I imagine a day when any given journal will comfortably accommodate the entire spectrum of data/analysis combinations and we no longer need to refer to "data publication" at all, any more than we would expect to refer to "data-reliant publication but absent the actual data".

  17. All - our open review period continues as the deadline for the Data Science Journal special issue is now Jan. 31. Please push the discussion as far as you choose, in your own thinking and to colleagues, random strangers, innocent family members, etc.

  18. Thank you for sharing this thought-provoking paper, which has now drawn out so many thoughtful comments. I write primarily to support the ecosystem metaphor. I suggest that you can take the metaphor further. In the natural world, we like to see “healthy” ecosystems. While that is a squishy, value-laden term, it might suggest some additional ideas. For example, ecosystems include species of organisms and populations of those species. In general, we want to preserve them in order to maintain healthy ecosystems. Are there useful analogies between datasets and iterations of datasets, and species and populations? Also, we tend to want to avoid the introduction of non-native invasive species into ecosystems because they can out-compete and extirpate natives. Again, are there any analogies with data? Could there be a sort of Gresham’s Law of data where bad data drive out the good? While this may seem unlikely today, could it happen in some future dystopia and how could that be prevented? Finally, speaking of Gresham’s Law, might there be an economics metaphor worth elucidating? You are almost touching that in mentioning marketplaces; how about data markets and what makes them work efficiently? Are data the currency of science?

  19. The "data ecosystem" metaphor brings to mind the related notion of the 'information ecology.' On this front, Davenport is coming more from an intra-organizational perspective, though I like his breakdown of the ecology components. This, despite the fact that it seems to leave out data management entirely!

    "This new approach, which I call information ecology, emphasizes an organization's entire information environment. It addresses all of a firm's values and beliefs about information (culture); how people actually use information and what they do with it (behavior and work processes); the pitfalls that can interfere with information sharing (politics); and what information systems are already in place (yes, finally, technology)."

    Thomas Davenport, 1997. Information and Its Discontents: An Introduction. Chapter 1 in Information Ecology, Oxford University Press. (Not online that I can quickly find, though you can get a few pages at a time through Google Books.)

    Baker and Bowker adopt and expand this metaphor:
    K.S. Baker and G.C. Bowker, 2007. Information Ecology: Open System Environment for Data, Memories and Knowing. Journal of Intelligent Information Systems 29(1):127-144. Available here: http://interoperability.ucsd.edu/docs/05BakerBowker_JIIS_final.pdf

    Seems to me that Baker and Bowker identify, or at least imply, a data management paradigm not yet included in your big 4: that of 'Community infrastructuring'. Alternate names might be 'Data integration' or 'Science support'? The niche for such a paradigm (not yet at a metaphor level, I suppose) might be best illustrated by adding 'Standardization Focus' as an additional comparative element to Table 1. I would fill in the row with: Map-making to gridded locations, Data publication to citations, Big Iron to versions and formats, Linked Data to ontologies. It seems to me that there is a gap here, where these hopefully all tie back to the standardization of scientific practices. This is where I might categorize efforts like QARTOD, the LTER Information Management community, the Salmon Data Access Working Group, and others. To fill out the rest of Table 1 for this group, I would add: analog: artisanal production; data: small and diverse; data models: geospatial, hierarchical, relational, or none(?); focus: synthesis and reproducibility; metaphorical terminology: (the parallels might break down here?) scientist, staff, data manager, project. Obviously more work is needed to clarify and articulate!

    My hope is to continue expanding the data management landscape to include work at multiple levels. Along these lines, I appreciate the addition of 'Map Making' in the list as it is work at a different scale/scope than the other three, though arguably LOD would be as well. Seems to me that ecologies and ecosystems are both inherently multi-scale.

    Thanks for the opportunity to comment, the article was a pleasure to read.

  20. I like the metaphors intro and the Lakoff framing. Other groups that might be worth looking at in the "metaphors" detail section include ICPSR (social science data, at scale but heavily curated). The discussion of map-making makes me think that other structural sensemaking approaches could be mentioned -- like taxonomy/classification work in biology, or structural chemistry curation.

    Table 1 is not particularly useful for me, that may just be a personal problem. :)

    "For example, a data “producer” may not carry the same intellectual weight as a data “author.” Perhaps most critical is the actual term data “publication.” A publication is typically viewed as complete and only subject to minor updates, whereas data can be very dynamic, even long after they are made available." -- This is perhaps the most interesting part. Have you looked back to manuscript models of "publication"? Clearly the Liquid Publications movement is something you're thinking of -- please cite that (see http://project.liquidpub.org/ for a start).

    Your discussion of data citation seems related to the post-publication peer-review approach -- may want to mention that. Also cite Micah Altman's work on data citation.

    Could you say more about PANGAEA? I'm not familiar with it.

  21. All,

    We are humbled by the many thoughtful reviews we have received on our little essay. In addition to the comments here, there has been some discussion on other blogs, and we have received some very insightful and thorough reviews directly.

    As Peter said, the deadline for the special issue is now 31 Jan, so we're not sure when we will be receiving formal blind reviews and when we will be doing the presumed rewrite. Peter and I will actually be together later this week, so we will plan some sort of initial response then.

    We have a lot to digest! Keep it coming.

    Meanwhile, some specific comments.

    @Bryan and tjv. Yes, there are many more nuances to "publishing" than our crude drawing. There is also value in considering different types of publication (big P, little p). Nonetheless, the very conversation here shows that those nuances and differences are not universally or well understood, or even agreed upon.

    @Joe. If you're a classifying nut, you need to read Sorting Things Out: Classification and Its Consequences by Bowker and Star.

    @Ahrash "I am not convinced that it does us any good to think about datasets as separate and special entities from whatever else might be related to them." Brilliant.

    @Invasives Man. I too have been thinking we need to explore the marketplace metaphor more deeply (while avoiding any direct payment-for-access analogies). I also want to explore the ecosystem more, but its "squishiness" presents as much of a problem as the different nuances of publication.

    @jodischneider. Thank you. I'll look up your references. To find out more about PANGAEA, look for publications by M. Diepenbroek, J. Brase, J. Klump, and/or H. Pfeiffenberger.

  22. Thank you for your excellent essay relating to the metaphors influencing our understanding of data management and data exposure. To supplement the excellent discussion it has already prompted, I would like to add some belated comments relating to DOIs for dataset versions, peer review of datasets, metadata as first class scientific objects, and linking the data publication and linked data metaphors, which are to be found at http://opencitations.wordpress.com/2012/01/03/comments-on-is-data-publication-the-right-metaphor/.

  23. @lynn Nice points. Sorry for the delay in your post. It was flagged as spam for some reason.

  24. First some nitpicking which you have probably already addressed.

    a) Footnotes 10 and 11 are the same.

    b) Table 1 mentions examples in science for each metaphor, but many of these are not cited in the text. Our own research group, myGrid, is listed as an example of Linked Data, but without a reference I am not quite sure what you refer to.

    c) Each section is fairly well written, but the sections don't connect well to one another to form a consistent story across the paper. Perhaps some more 'glue' text is needed at the beginning/end of each section.


    Finally, I think this paper puts an important emphasis on the limitations of the metaphors that seem to drive our current methods of publishing data. You mention briefly that one alternative metaphor would be that of software releases.

    I am a strong believer in this approach (but, as you say, not exclusively). I have personally witnessed researchers wasting days over curious results caused by an undocumented version mismatch between two datasets. When using and citing a dataset, including the version (or at least the access date) should be a minimal requirement, DOI or not.
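    In software terms the remedy is the same one we use for library dependencies: pin the exact version of every input and fail fast on a mismatch. A minimal sketch, with purely illustrative file names and placeholder digests:

    ```python
    import hashlib

    def fingerprint(path):
        """SHA-256 of a data file -- a de facto version identifier when the
        provider publishes no explicit release number."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 16), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # Hypothetical pins recorded when the analysis was first run
    # (placeholders -- in practice you'd paste in the real digests):
    PINNED = {
        "stations_v3.csv": "<sha256 recorded at first run>",
        "observations_v3.csv": "<sha256 recorded at first run>",
    }

    def check_inputs(pins=PINNED):
        """Refuse to run the analysis if any input differs from its pinned version."""
        for path, expected in pins.items():
            if fingerprint(path) != expected:
                raise RuntimeError("version mismatch for " + path)
    ```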

    In fact, with the growing size and complexity of raw data, it is often difficult to use the data for new research without also having access to the custom software and workflows that the authors used for analysing it. Publishing and preserving all of these digital artefacts as a whole, not to mention trying to reproduce the outcome of using them, is the challenge that we are trying to address using Research Objects - http://users.ox.ac.uk/~oerc0033/preprints/wfpreservev10.pdf

  25. Dave Carlson, Editor of Earth System Science Data and former director of the International Polar Year Programme Office, writes:

    Commendations to Mark and Peter, and particularly to several thoughtful commentators, for helping us do more than “think and speak in the frames [we] know and understand”. We can accelerate progress (!), on data sharing and on data discourse, through these exchanges.

    I do not advocate data publication. I advocate long-term free and open data access, which necessarily implies curation and preservation. I see data publication as one tool, but one very useful tool, for two primary reasons, both already mentioned: like or hate it, the scientific community already uses and (for the most part) trusts reviewed publications, and a reviewed publication offers professional credit for professional work. In particular, we have no other element of the data ecosystem that offers that essential element of return on effort. As in any other ecosystem, the system will not evolve without substantial (in energy or fitness terms) return on real effort.

    I share concerns about several aspects of data publication. I feel quite disappointed that, despite the focus on “interdisciplinary Earth system science”, the paper has entirely missed what we have tried to do with the data publication journal ‘Earth System Science Data’ – fair to say, I think, the only currently practicing data publication and review journal in the earth system sciences? Most importantly, we tried to break through the endless cycle of theory and white papers to gain some practical experience!

    I don’t claim (yet) success for ESSD, but it has already disproved several of the ‘aspersions’ assigned to data publication in this article.

    1) I react to this section, loaded with imprecise codewords: “Its focus is often on research collections where data are extremely diverse in form and content, but tend to be small in size. Data Publication seeks to define discrete, well-described data sets, ideally with a certain level of quality assurance or peer-review. These data sets often provide the basis for figures, tables, etc. in research articles and other publications.” Do we use the word ‘research’ here as a pejorative, indicating something casual off of a researcher’s desk as opposed to ‘real’ data sets from an ‘official’ WDC? Small in size? By satellite standards? By Excel spreadsheet standards? Global ocean carbon data sets, covering all basins and taking a decade or more in the collection and preparation, although probably less than a GB in volume – do those constitute ‘small’ data sets? “Well-described data sets, with a level of quality assurance” – don’t we all aspire to those standards? What do the modifiers ‘discrete’ and ‘certain level’ add here, if not a mild disparagement? “Provide the basis for figures, tables, etc.” – why not? Shouldn’t all data support these analytical views? And what about colour visualizations and animations, 3D fly-throughs, digital earth overlays, and ingestion into and validation of global climate and ecosystem models? Most of the data publications in ESSD focus enthusiastically on these modern tools of data use, on access for a wider user community, and on application to models of all scales and complexity. Why else put such a large effort into producing and sharing a data set if not to allow and encourage widespread and diverse subsequent use and re-use?

    [to be continued]

  26. [continued from Dave C.'s post above]

    2) “More seriously, publication implies some level of imprimatur (Callaghan et al., 2009). Especially when associated with literary publication, a “published” data set may be assumed to have undergone some sort of peer-review, yet there are no standards or even agreement on what peer-review of a data set might mean (Parsons et al., 2010).” Of course we want imprimatur – that represents the return for effort. No progress without it. And how will we achieve those standards? By further working groups and white papers? In ESSD, we learn about those standards every day, with every paper. We, actually we the review community, wrestle with quality, calibration, coverage, access, stability, inclusivity, utility, identifiers, units, interpolations, and intercomparisons, trying to find and ensure the best combination among often competing factors. Neither do science journals have fixed standards or universal agreement on quality factors! They evolve working standards according to the journal, the discipline, and to the evolving needs of the science community as developed and applied through the review process. Without multiple working practicing community-based data journals such as ESSD, data access (not solely through data publication) stays in the rarefied world of the data theorists.

    3) “tend to combine the review of the data set with review of a more conventional literary publication” In ESSD we hope to combine a data publication with an exceptional ground-breaking science publication! We have not yet managed a simultaneous science and data publication, despite interesting discussions with partner journals. Again, use of the word ‘conventional’ seems pejorative to my ear. In any case, discussions between science publication journals and data publication journals tend to focus on practical factors such as joint or separate reviews, divergent schedules, and dependent or independent editorial and publication decisions and practices. Earth sciences has open access science journals and one open access data journal, but we have not yet produced a truly open access combination of simultaneous science paper and data set. Why that has not happened makes for a very interesting discussion.

    [to be continued]

  27. [final segment from Dave C.]

    4) “This means that data need to be citable, ideally with a persistent identifier like a Digital Object Identifier (DOI), as soon as they are available for use by anyone other than the original creator. Despite this, we have heard time and again from different data managers that they do not want to assign a DOI to a data set until it has reached some level of stability and quality control. They really do see publication and the assignment of DOI as a sort imprimatur. They do not seem to recognize the often-broad use and evolution of a data set long before it may be formally published.” As written, this seems like a data management problem, not a weakness of the data publication process? ESSD requires a DOI for all data sets before they enter review (or at least assignment to and acceptance by a data centre where we know they will get a DOI). We observe substantial confusion on both sides, from the data publishers (us) and from the data centres, on these issues. I object more to a persistent mis-impression in this paper about data as something fixed, immutable, absolute, above the fray, while science flows, evolves, explores, and achieves eventual if unpredictable progress. Perhaps this represents our largest semantic disconnect? Data also changes, moves, fails, improves and evolves. Even our hoped-for ‘absolute’ numbers change! What we thought were single-celled organisms we now recognize as colonial organisms – five ‘individuals’ now become one, or vice versa. What we thought were absolute earth-referenced velocities we now recognize as relative at best or erroneous at worst. What we thought were routine albeit careful measurements of inorganic carbonate and alkalinity now turn out to have immense importance as pH values to determine ocean acidification. My best, final, absolutely perfect data set from yesterday will turn out to have an error, tomorrow. Our reliance on 30-year satellite time series and on the persistence factors therein prevents us from seeing that the system itself and the fundamental assumptions of the measurements have changed. Data occasionally, less often than this paper implies, has clear starting points, but it always has improvements, corrections, disruptions, new interpretations or even, dare we say, obsolescence? Unfortunately, data as a messy, evolving and ill-defined process – who knows this better than we? – does not fit the crystal-clear, firmly-identified immutable data set paradigm. For that reason as much as any other, we need data publication processes to keep up with change in the data, just as science uses the publication process to (mostly) keep up with change in information.

    5) Finally, “it is the close association with copyright, restricted-access, literary publications, etc. that troubles us most about the data publication metaphor.” I agree. But ESSD only publishes open access data sets, and the journal itself operates under a CC0 open access attribution practice. We get no notice or credit for that, in this paper? Moreover, as a private citizen interested in access to data and to science (what this paper defames as ‘literary’) publications, I can report from current hands-on experience that copyright and restricted access apply almost as strongly to data sets as they do to science publications. Try to get any of the IPY satellite data NOT at NSIDC. Or, try to find and download a full, quality-controlled and consistent ocean carbon data set outside of the ESSD CARINA publication. I detect an attitude here of ‘look what we do with our accessible data compared to what science does with its copyrighted publications’, but in fact we have a very long way to go in open access to data, such a long way in fact that I think we should not throw adjectives like ‘restricted-access’ at other parties until we have confronted those issues within the data community.

  28. Great article that illuminates some of the problems I've struggled with for 30+ years.


    The classic Publication metaphor is changing, as open journals, wide distribution of preprints (cough), blogs (cough-cough), and other methods grow in popularity. This paper's criticism of that metaphor could be obsolete in a few years, when Nature shuts down because their business model doesn't work any more.

    Widespread use of the "data ecosystem" metaphor has one serious limit: many scientists will have trouble quickly grasping a metaphor well outside their primary field. How well will a particle physicist, geophysicist, or astronomer understand the nuances of that metaphor? What happens when they misunderstand "data ecosystem", based on their Fox News understanding of "ecosystem"? I'm being deliberately provocative to make a point here.

    Perhaps you could add another metaphor: "Data management". Unfortunately, most of the implied characteristics are negative: by including the word "management", research scientists are immediately repelled ("that's not science"), and the phrase implies drudge work: required by others, avoided when possible, detracting from "real science", outsourced to others lower on the totem pole if absolutely necessary. Look at some of the sometimes humorous reactions to the new NSF data management plan requirement. Using this metaphor mostly results in no data management at all.

    Finally, rather than trying to find One New Metaphor To Rule Them All, we should use several metaphors as appropriate, increase awareness of their limitations, and fix their worst problems. We could increase the use of phrases like "A well-known problem with the Publication metaphor for data is ...".

    We need a metaphor ecosystem :-)


    (I'm commenting semi-anonymously due to conflicting organizational policies over my use of "social media".)

  29. @Bruce Caron wants "the work ... recognized and rewarded."

    We have struggled with this concept for decades, and have not found a good solution. In part, recognition and rewards outside of the primary activity (research and publication) don't motivate creative people like research scientists. And we always need the cooperation of the research scientists. Recognition for good data management would require major culture changes for most research communities.

    I see a need for psychological-sociological research on what works, to motivate research scientists, groups, and organizations to perform good data management. In my experience, sticks don't work, and the carrots we've tried so far don't work. I'm still looking for the right carrots.

    =====

    @Chris Rusbridge

    Your comment on archivists vs librarians reminds me of this quote:

    "Scalability: the problem that tens of thousands of engineers yearn for, but only six actually have". Ted Dziuba, The Register

    =====

    Note that "geodata.gov" now redirects to "geo.data.gov", and it's future is in doubt due to severe budget cuts -- just after we spent several years incorporating geodata.gov and Geospatial One Stop into our workflow!

  30. I didn't strictly review this paper, but I found it a highly interesting read which certainly got me thinking... I've reflected on this paper looking at it from the perspective of someone trying to mobilize data in the biodiversity science world. The full text is available at the BioFresh blog, but I'll try to summarize the main points here.

    Basically, I agree with the majority of commenters that the paper got me both agreeing and disagreeing with certain points. I certainly agree that the Data Publishing metaphor could create some false expectations in terms of peer review and citations. But, although the term Data Submission could be a better reflection of the process of making data available in a number of cases, I don't see an immediate alternative to the Data Publication metaphor for convincing scientists yet. E.g., I don't see the notion of becoming part of a data ecosystem convincing a lot of scientists in the near future (although I hope I am mistaken).

  31. You missed an important metaphor: that of the software product. Data gets "reviewed" when it is used--you already note that it is hard to just look at data and find mistakes, so review in the same sense as scientific publications doesn't happen for data as well as it does for text and ideas.

    How does software get "review" or "credit"? If software is not a publication, do academics not make software? The fixation here on assuring credit is a sickness of the academic community, not something to be held in high esteem.

    My most important contributions to the scientific community since the Ph.D. have been non-published software. I know some stuffy academic types would simply brand me a failure for this, but I argue that these have made a greater impact on scientific work than those who with lofty language publish long texts.

    So I think your paper might be improved by considering data products as a kind of software product, and looking at how that metaphor works for the issues you're concerned about.

  32. One thing that you did not mention, which seems very important and is generally missing from discussions about our data-sharing paradigms, is creating an underlying mechanism for attributing credit to the originator of the data itself. This is crucial, since scientists' currency is heavily driven by 'cycles of credit' based on authorship, acknowledgement, and citation (Latour & Woolgar 1979). The metaphor you use here is terrific and I'd like to congratulate you on a piece of work that really clarifies a couple of 'big picture' issues for me.

  33. We could enumerate and describe the many ways that different paradigms don't work for data. Then we could focus on each paradigm's shortcomings for correction, or ensure that new paradigms don't make the same mistakes.

    For example, the review process for a traditional publication assumes that the reviewers and editors will read every word of the publication, and check the references for relevance and completeness.

    For a web-based data release of multiple terabytes, no one could possibly review every byte, or check the hundreds of thousands of links ("references") to other databases.

    A publication is "done" once published, except for rare corrections or retractions.

    In contrast, many data sets are under constant revision, and many are updated in real-time.

  34. Revision finally submitted: http://mp-datamatters.blogspot.com/2012/09/revised-metaphor-paper-submitted.html
