Welcome! This is a space to keep running notes on the DASPOS [Workshop on Container Strategies for Data & Software Preservation that Promote Open Science](https://daspos.crc.nd.edu/index.php/workshops/container-strategies-for-data-software-preservation-that-promote-open-science). Please add your thoughts and notes for each session!

<hr/>

# Day 1: May 19

## Keynote: What is it we want in containers anyway? Vincent Batts

+ [Slides](https://bit.ly/252kBnL)
+ Red Hat 4 years, 2-3 years in container technology
+ He can answer questions about C, C++, Go, Ruby, and other languages
+ Q: how many people have an opinion of Red Hat?
+ A: not many...
+ Red Hat -- not a product company, a culture company
  - a lot of people who work in the open source space get paid by Red Hat
  - products are a support model (companies need support when tech breaks)
  - Atomic -- a container scheduler
+ So, what does "container" mean?
  - to some, a tar archive
  - to some, not just running traditional applications, but how systems are interacting: is it a long-running process, a job? how are these scheduled environments interacting?
+ use case: reproducibility
  - when I build something, I want other people to be able to run it ("it runs on my laptop!!")
+ use case: ephemeral environments
  - you can try something at least once and not have to worry about the host
  - environments are a lot less precious
+ Short demo
  - `docker run -it docker.io/fedora bash` (wow! so cool! look at all this stuff I can install! I don't have to do it on my machine! It can all go away!)
+ use case: freedom from host restrictions!!
  - Q: how many people have been stuck with a specific OS or software config? some particular glibc version?
  - A: everyone.
  - "you are running on something so awful I have to rewrite everything"... a big use case for containers
  - package up your stuff -> Red Hat example: Rails 5 and Rails 4 containers on top of a host, so they can have a modern kernel with some archaic software
+ use case: integrate into existing processes
  - because you can iterate on software, it's no longer a deployment plan; you no longer have to set aside how an upgrade is going to happen
+ once you push out a container, it slowly loses context (source, where was it added from, who made it)
  - add labels to Docker images! metadata!! (see the sketch at the end of this section)
+ [OpenShift](https://github.com/OpenShift): Red Hat's application hosting platform makes it easy to run container-based web applications in the cloud for free.
  - [Source to Image](https://github.com/openshift/source-to-image): a tool for building reproducible Docker images, from the OpenShift project
+ Sharing containers:
  - registry (docker-registry, dockyard, homegrown/self-hosted)
+ Tools:
  - [LXC/LXD](https://linuxcontainers.org/)
  - [systemd-nspawn](https://www.freedesktop.org/software/systemd/man/systemd-nspawn.html)
  - [lmctfy (let me containerize that for you)](https://github.com/google/lmctfy)
  - [Docker](https://github.com/docker/docker)
  - [runC](https://github.com/opencontainers/runc)
  - [bubblewrap](https://github.com/projectatomic/bubblewrap)
+ Standards!! We need to standardize:
  - packaging
  - runtime
  - networking
  - cloud
  - OpenContainers on GitHub -- have a runtime spec, others coming down the line
+ Call to action!
  - define your use cases first
  - [get involved in the conversation](https://groups.google.com/a/opencontainers.org/forum/#!forum/dev)
  - ensure container integration touchpoints stay generic to avoid lock-in (PoC tooling for your integration)
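A minimal sketch of the "labels for context" point above: build-time labels travel with the image and can be read back later with `docker inspect`. (The image name and label keys here are illustrative, not from the talk.)

```bash
# Attach provenance metadata as labels at build time...
docker build -t myexperiment:1.0 \
  --label org.example.source="https://github.com/example/myexperiment" \
  --label org.example.author="Jane Doe" .

# ...and recover it later, long after the build context is gone.
docker inspect --format '{{json .Config.Labels}}' myexperiment:1.0
```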
<hr/>

## Lightning Talk 1: Tanu Malik

+ Domain science perspective!
+ Science Dataspaces -- how they are useful for data management and reproducibility
+ The scientific method is self-correcting: a cycle with respect to authoring, reviewing, and publishing science
  - but there are breaks in publishing: bias, no transparency in workflow, not enough time to review papers, going on researcher reputation rather than the paper
+ how can we make scientific publishing without breaks? policy aspects mainly
+ Computational Science focus
+ Encourage Open-X (X = access, code, data, design, standard)
  - use the internet! be social!
+ Focus on 2 examples: Dropbox and GitHub
  - why aren't they sufficient for data management? scientists are using them, but that doesn't save them from dependency hell, missing provenance, or even missing context (what was I thinking when I produced this??)
+ SciDataspace: personalized, shareable dataspace for data management and reproducibility
  - pilot testing in geoscience
  - Git- and Dropbox-like Python client for Linux and Mac OS X
  - allows people to annotate (semantic annotation), package the code, data, and environment into a Docker container, and track the provenance of the scientific program
  - runs on the command line
+ SADE, CDE, PTU for provenance and packaging

<hr/>

## Lightning Talk 2: Daniel Nüst

+ [Opening Reproducible Research](http://o2r.info/)
+ Reproducibility has 2 sides:
  - "not that hard"
  - "overwhelmingly complex"
+ Wants to add UI bindings, i.e. the author says "this is a parameter the user should be able to play with" and describes what's going on in the compendium of their research
  - run it once, the author verifies it, the paper looks the same: then you have an executable research compendium
+ domain scientists partner with the library at Muenster to leverage digital preservation knowledge
+ solution: put a Docker container into a BagIt bag! (see the sketch below)
  - working directory (as is) + Docker container
+ what do we want to be able to do with these bags?
  - download parts
  - swap out parts without downloading the whole thing
  - look inside!!!!
+ what metadata and tools are needed for interactive reproducible papers with spatio-temporal data?
+ what architecture can make this work for R-based geoscientists and digital preservation experts?
+ what new possibilities are opened by an organized combination of data + code + paper?
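Since BagIt keeps coming up (here, and in Elliot Metsger's talk below), a quick sketch of how lightweight the format is, using the Library of Congress bagit-python tool. The directory name is illustrative.

```bash
# A "bag" is just a directory restructured into a checksummed payload plus tag files.
pip install bagit
bagit.py compendium/        # converts compendium/ into a bag in place
cat compendium/bagit.txt    # the declaration file the tool writes
```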
<hr/>

## Lightning Talk 3: Euan Cochrane

+ Preserving containers
+ how long do we need to keep the results of science?
  - probably forever....
+ Linux-dependent containers can only be guaranteed useful as long as the OS is
  - so what about the OSes themselves?
+ We need to preserve the containers!! We can't just keep them
  - this means we need to preserve the OS
  - preserve emulators for the OS over time
+ one instance of an OS can support preservation of limitless containers
+ use existing tech and methods to preserve OSes
+ [bwFLA](http://bw-fla.uni-freiburg.de/) -- Emulation as a Service
  - an emulation simplification tool (generic API for a bunch of emulators)
  - gives you access via a web browser
  - enables citation of complex digital objects
+ click a button, get a URL into the emulator; you can make a change and it saves as a derivative file with a URL to the changed environment
+ researchers could use this environment to test their containers, could install published packages on a new emulator derivative hosted by a digital archive, and receive a unique DOI
+ challenges
  - need archives of preserved OSes, and they need to be maintained
  - preserve emulators over time
  - big data makes this more complicated
  - the scientific community needs to buy in

<hr/>

## Lightning Talk 4: Da Huo

+ Computing is adapting to cloud infrastructure
  - you have to change to adapt to all the clusters
  - a VM is not enough
+ now, cloud computing is adopting containerization for infrastructure deployment
  - each container is running as a process on the bare metal of the server
  - this is still not enough!
+ scientific data requires context -- provenance, contextual information, etc.
+ the semantic web -- the creation of smart data rather than smart applications
  - smart data makes future research more extensible and reproducible
+ linking outside resources within the data
+ You need: data + metadata + provenance in the same world
  - this enhances the data
+ Docker will keep the execution, so we add machine-readable labels to each step using RDF in JSON-LD format (see the sketch below)
  - then link things together in a graph
+ break down silos and find relationships between data
+ ask big, cross-disciplinary questions
+ machines can do the grunt work for you
+ How to do this?
  - a Python wrapper is added to a standard container
  - provenance and metadata are written directly to the label (machine readable, enables discovery in large repositories)
  - the container is then provisioned as a smart container
+ there is a command line tool that replaces the Docker command line tool
  - provenance is captured automatically
  - customize metadata if you want, or go by default
+ can search for containers on the server
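A rough illustration of the "RDF in a label" idea using plain Docker commands (the smart-containers tooling automates this; the label key and the tiny JSON-LD body here are made up for the example).

```bash
# Write a small JSON-LD provenance record into an image label...
docker build -t smart-demo \
  --label 'org.example.prov={"@context":{"prov":"http://www.w3.org/ns/prov#"},"@type":"prov:Entity"}' .

# ...and pull it back out as machine-readable metadata.
docker inspect --format '{{index .Config.Labels "org.example.prov"}}' smart-demo \
  | python -m json.tool
```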
<hr/>

## Lightning Talk 5: Dave Wilkinson

+ [OCCAM](http://www.occamportal.org/): Open Curation for Computer Architecture Modeling
+ live, interactive archival containers/VMs on demand
+ fine-grained (tar archive) or coarse-grained (static images -- Docker/VM)
+ repeatability != reproducibility
+ Describing environments: what are the common traits of these artifacts?
+ creating virtual machines FROM metadata
  - describe it with minimal metadata so we can create VMs on the fly
+ Strengths:
  - modularity
  - flexibility -- if hardware changes, we can make a new VM at the end of the day
  - change settings (accessibility, etc.)

<hr/>

## Lightning Talk 6: Elliot Metsger

+ Data Conservancy effort at Johns Hopkins
+ semantic packages
+ packaging: store, transport, describe files
+ BagIt is a lightweight, extensible packaging specification
+ They have extended BagIt to add semantics to package content
  - packages contain objects with properties
  - relationships can be asserted between objects
  - RDF-based
  - agnostic w.r.t. data model
+ Data Conservancy Packaging Tool
  - GUI for creating and manipulating package content
  - very user friendly (don't have to know RDF)
  - extracts preservation information from what the user adds to a package (file format, link to PRONOM or other format registries)
+ hasn't been introduced to the wider world; adoption at Johns Hopkins
+ uses Camel + Fedora 4 (an implementation of the LDP spec)
+ exposed as Linked Data on the web!
+ entry point to the package is the ORE-ReM resource manifest

<hr/>

## Demo & Talk: ReproZip

Vicky Steeves & Remi Rampin

See: [Slides from the talk](https://docs.google.com/presentation/d/1kmR9RriaZvpl5uYzZjkaZgnojx9fjq5OKNL1iyyuV0Q/edit), during which Vicky gives a shout out to Fernando Chirigati, who could not be with Remi & Vicky today.

Introduces Newton's quote: "If I have seen further, it is by standing on the shoulders of giants," and emphasizes the incremental nature of science and how reproducibility is a core component of the scientific process. Describes the problem of reproducibility in software for science: "even if runnable, results may differ." Mentions a study: Effects of FreeSurfer Version, Workstation Type, and Macintosh Operating System Version on Anatomical Volume and Cortical Thickness Measurements ([http://dx.doi.org/10.1371/journal.pone.0038234](http://dx.doi.org/10.1371/journal.pone.0038234)). Then talks about scientific communication and linking up papers, code & data, and how environments are hard to preserve and share between researchers who want to repeat one another's efforts.

Vicky reminds us that you can't just include all the scripts and the data in a preservation object and expect people to be able to run it, because libraries get updated, operating systems change, and software/hardware versions & configurations can disrupt reproducibility, aka "Dependency Hell." Describes the difficulty of preserving software as it ages, and then tells the audience how ReproZip helps alleviate these problems by allowing the user to easily capture all the necessary components in a single, distributable package. Also previews how the demo of the tool will show how ReproZip makes it easier to reproduce an experiment by providing different unpacking methods and interfaces that don't require the user to install all the required dependencies (or know how to install them), which in turn makes it possible to more easily run aging software or re-run previous experiments under different inputs.

Vicky shows us the timeline of developing ReproZip, who has contributed to the project over time, and how ReproZip has matured in features and functionality. Then gives a graphical overview of how ReproZip lets the user pack in very few steps, capturing data/files (input files, output files, parameters), workflows (executable programs and steps), as well as environment variables, dependencies, and software packages. And how users can unpack the resultant .rpz files (compressed packed containers -- 60% smaller than a VM).

Then, Remi gives a demonstration of interacting with a ReproZip-packed project, which you can try yourself with the "Stacked Up: Do Philly Kids Have the Books They Need?" example of a packed website w/Python and PostgreSQL that allows users to still interact w/data presented in an aging webapp, so the data is not lost to the public. Download the example .rpz file here: https://goo.gl/mNo6I7

After the demo, Vicky picks back up and explicates some current ReproZip use cases. One that is immediately practical is that ReproZip packages (.rpz) can be included with each publication and cited as data, no different than other datasets. Describes how ReproZip is recommended by ACM SIGMOD. Describes how [several ReproZip use cases are shared openly](https://github.com/ViDA-NYU/reprozip-examples) on GitHub. Workshop participants will get an opportunity to learn more about ReproZip in the breakout rotations, one this afternoon and two tomorrow.
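The pack/unpack cycle described above really is only a handful of commands; a sketch (the traced experiment command and filenames are illustrative):

```bash
# Packing: trace the run, then bundle everything it touched into one .rpz file.
reprozip trace ./run_experiment.sh
reprozip pack experiment.rpz

# Unpacking on another machine, e.g. into a Docker container:
reprounzip docker setup experiment.rpz ./experiment-docker
reprounzip docker run ./experiment-docker
```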
<hr/>

## Demo & Talk: Umbrella

Doug Thain & Alex Vyushkov

+ how can we bring principles of reproducibility into large-scale distributed computing?
+ Are we doing science today if it's not reproducible?
  - this [Repeatability Project from Collberg claims not](http://dl.acm.org/citation.cfm?doid=2897191.2812803); Collberg's project is now described in ACM Communications too
  - can I rerun code from 5 years ago from a colleague?
  - are we producing results that people can use 5 years from now?
  - multiple reasons why not: rapid tech change, no archival of artifacts, many implicit dependencies, lack of backwards compatibility, lack of social incentives
  - many different R's
  - typical computing experiment: the source code is carefully curated, but the dependencies and other components are not
+ 2 ways to work on this
  - preserve the mess: package everything
  - encourage cleanliness: 2 examples, Umbrella and Prune
+ Umbrella
  - takes a specification
  - goes to the archive and pulls everything out
  - constructs the environment
+ whenever we want to execute something in a specific environment, we need to describe all the components we need -- manually
+ Umbrella specifies a reproducible environment while avoiding duplication and enabling precise adjustments
+ specification more important than mechanism
  - Docker (create container, mount volume)
  - Parrot (download tarballs, mount at runtime)
  - Amazon (allocate VM, copy, and unpack tarballs)
  - Condor (request compatible machine)
+ several Umbrella examples:
  - [Povray ray-tracing application -- umbrella metadata](http://ccl.cse.nd.edu/software/umbrella/database/povray/povray.umbrella)
+ What about complex scientific workflows?
  - Not every task is a container
  - Each task must be placed in a container
+ Prune: Preservation Run Environment
  - a shell interface does not accurately describe the environment
  - replace string execution with function invocation
  - builds on GridDB... etc.
+ Prune is kind of similar to Git
  - each run gets a unique ID
  - put each of these together with the provenance of the pieces
  - Problems: naming, the intersection of version control and provenance, usability, repositories, compatibility, composition...
  - important to distinguish between environment and technique; the environment is implicit, but be explicit about your needs; best practice: only include what is explicitly imported; portability and preservation are 2 sides of the same coin
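Roughly what running against a specification looks like, based on the povray example linked above. The flags follow the CCTools Umbrella documentation of that era; paths are illustrative and the exact options may have changed since, so treat this as a sketch.

```bash
# Fetch the spec for the povray example, then let Umbrella build the
# environment and execute the command it names, here via the Docker engine.
wget http://ccl.cse.nd.edu/software/umbrella/database/povray/povray.umbrella

umbrella --spec povray.umbrella \
         --localdir /tmp/umbrella_test \
         --sandbox_mode docker \
         --log umbrella.log \
         --output "/tmp/frame000.png=/tmp/umbrella_test/frame000.png" \
         run
```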
### Alex's Demo of Umbrella

+ Modeling Malaria
  - Introduction to the disease
  - OpenMalaria models: open source, C++, developed since 2006
  - Umbrella specification for OpenMalaria
  - use it in different web applications
  - demo to follow later today

<hr/>

## Demo & Talk: Smart Containers

James

+ Anatomy of a Smart Container
  - semantic technologies provide context
+ Focus on smart data and not smart applications
+ Tools for context?
+ Linked Data -- Tim Berners-Lee
  - use URIs as names
  - use HTTP URIs so people can look up those names
  - plus 2 more principles
+ Standard used: PROV-O
+ Use ORCID as the preferred method for identifying human agents
+ 2015 workshop -- VoCamp computational activity ontology
+ Computational activities can be combined together
+ CodeMeta: an open standard, a way of describing software
+ Provenance information is stored in the image label itself
+ Infects or builds images
+ Transparency
  - build the image
  - smart containers can be an alias (not infecting but building)
+ Hydra standard -- specify in JSON-LD
+ Example of a Linked Data client -- SPARQL query against DBpedia (see the sketch below)
+ Linked Data Platform standard -- containers and resources
+ Public source code in the CRCresearch GitHub space
+ (James's slides will be up on the workshop site shortly.)
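The DBpedia client example, approximately, as a one-liner: the public SPARQL endpoint accepts `query` and `format` parameters (the query itself is illustrative, not the one from the demo).

```bash
# Ask DBpedia for labels of the Docker resource, results as JSON.
curl -s -G 'https://dbpedia.org/sparql' \
  --data-urlencode 'query=SELECT ?label WHERE { <http://dbpedia.org/resource/Docker_(software)> rdfs:label ?label } LIMIT 5' \
  --data-urlencode 'format=application/sparql-results+json'
```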
<hr/>

## Demo & Talk: NDS Dashboard

Ian Taylor

+ This is a status update on what we have been working on -- making research experiments reproducible
+ National Data Service
  - data-enabled transformation of science....
  - How do I publish, discover, etc.?
  - Big Data, but what about the long tail?
  - How do people best identify, link, share?
  - emergent vision
  - a consortium of people to provide basic services
  - infrastructure
  - NDS Labs and NDS Share
  - helps find data
  - helps use data
+ NDS Epiphyte pilot -- to link projects together
  - backend Docker API, v1
  - research methods
  - research data
  - standardized web APIs
  - containerizable
  - shows the architecture
  - container execution: data can be attached to a publication with a button that allows you to access it (demo video); script command-line output
+ OSF Dashboard integration
  - Django REST interface is a plus
  - background on the OSF: transparency and openness of data to counteract the issues that come about as a result of journal culture. Web application; supports collaboration, sharing of materials, etc.
  - lots of add-ons -- especially data services through the REST API
+ Boatload: new backend mechanism to create containers and monitor them
  - uses Django REST
  - uses Fleet and CoreOS
  - written in Python
  - can get script status, view metrics, etc.
  - 2 main components: satellites and an API server
  - short overview of the interface -- web GUI
+ OSF Dashboard integration
  - created an add-on: Flask views and a Maker template
  - breakdown of the architecture
  - have to enable each one by one
  - uses CAS for authentication
  - makes calls to the OSF to get files
  - makes calls to Boatload
  - demo video
+ Future work
  - moving towards lightweight tools
  - EmberJS GUI to interface with the OSF API
  - WholeTale project extending these models to other toolkits such as Globus
+ Q: How would you cite this in a paper, etc.?
+ A: unique identifier system -- you can do different things to projects in the OSF to freeze, cite, share, and then extend
+ Q: no longer iRODS?
+ A: the idea now is to just allow people to deploy things in a common environment
+ Q: Is the NDS an archival service or another type of tool?
+ A: Still TBD -- trying to find their role in the community. There is a white paper going out.

<hr/>

## Docker Presentation

Evan Hazlett

(His presentation is being projected via Zoom - directed to Zoom meeting)

Talking all open source today. He is here until 5 today. Who has written a Dockerfile? Used the Registry?

+ Why consistent deployments?
  - ability to repeat builds, faster releases, easier testing, avoiding problems with interoperability
+ Why Docker?
  - isolated, lightweight, repeatable, simple workflow (you don't want to have to remember everything)
+ Who has used Compose?
+ Images:
  - the basis for a container
  - contain an OS
  - can be built from scratch
  - shippable
  - this is the build component
  - content trust -- signed
+ Registry:
  - stores images
  - enables collaboration -- and reproducibility (pull the image and run)
  - questions arise: authorship? updates and vulnerabilities?
  - Docker Hub: groups and permissions, scanning for vulnerabilities
  - example: scanning through the layers of an image for vulnerabilities; for an image with issues, scan the binary for the signature
+ Dockerfile
  - create a recipe
  - the file contains instructions to reproduce these images
  - using a base image: nginx, copy files into the image (see the sketch after this section)
+ Compose
  - ability to define how (multiple) containers run within the system
  - volumes and networking with parameters
  - Demo: using a FROM base image, copying files into the container
  - Compose is written in YAML
  - content trust: can use it to do signing -- then when you push, it takes metadata and asks for more information; no altered data allowed
  - demo with Jupyter: doesn't yet have trust data
  - Demo: voting app with multiple services (redis, worker, db, result, voting app)

Talk to Evan after if you want to hear about other services.

+ Q: 2 layers sharing files?
+ A: a potential enhancement, but not right now
+ Q: How do you share secrets between containers?
+ A: Integration with a commercial product -- nothing really out there now, but something is coming; there is a presentation from the most recent DockerCon that would give you an example (keep your eye out)
+ Q: What's the recommended way to use it? The socket?
+ A: Don't run Docker inside of Docker -- need to address this issue
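A minimal sketch of the nginx recipe and Compose file described above (file contents are illustrative, in 2016-era Compose syntax):

```bash
# Dockerfile: start from a base image and copy site content into it.
cat > Dockerfile <<'EOF'
FROM nginx:latest
COPY site/ /usr/share/nginx/html/
EOF

# docker-compose.yml: define how the container runs (ports, volumes, etc.).
cat > docker-compose.yml <<'EOF'
version: "2"
services:
  web:
    build: .
    ports:
      - "8080:80"
EOF

docker-compose up -d   # build the image and start the service
```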
<hr/>

## Breakout Group: ReproZip Notes

Vicky introduces the session -- demo plus unconference-type discussion and work.

+ Daniel Brake is a postdoc who works on Bertini
  - uses ReproZip with Bertini reel
  - still having problems with Matlab (has an .rpz file)
+ Introductions -- what people are working on
+ What would you all like to see? Practical examples
+ Q: Matlab -- what would you like to do with that, given it is proprietary software?
+ A: recently Matlab has become a problem -- Occam
+ Vicky: Matlab may pose technical as well as proprietary issues
+ In VisTrails -- checked whatever they did within a spreadsheet, a record of what was done
+ The issue is actually rerunning it -- stopped at the unpacking step
+ ReproZip preserves things at the software level -- but could you unpack ReproZip in an emulator that would have Matlab inside of it?
+ potentially -- dependencies might not be preserved
+ Q: Is there any way to add one specific thing external to the directory?
+ A: You get a chance to edit before you zip.
+ Can unpack it right into a directory or a Docker container
+ Can use Vagrant
+ GitHub repository for examples (up on the documentation)
  - reproductions of different published results, from different fields
  - request to see one of those in greater depth -- Bechdel
  - try unpacking together from GitHub
  - id -- change to edit the config file
  - if you have the inputs, you should be able to recreate it
  - Q: Does it record if you take something out?
  - A: Everything is stored in the original -- you can specify patterns -- SQLite db -- things will still be there
  - on Git -- provide everything that you need -- you can try it out and add your own
  - when you run reprounzip, it shows compatible packages
  - Q: Downloading -- can you extract a directory?
  - A: Not yet (download fetches the output files)
  - IPython notebook: can see both text and code -- could load it up in a server in notebook format
  - Q: frequency of ReproZip use?
  - A: building out beyond science labs -- working with someone on a grant to package up ProPublica work; adoption from journals
  - potential interest from people such as ICPSR
  - low barrier to packaging -- not making a change to the workflow, you just have to type two commands
  - if you point to a directory, could you programmatically get dependencies, or exclude certain components?
+ Vicky is trying to get the library at NYU to use this for some of their work
+ OS problems: the worst on Windows; on Mac there are ways to work with things
+ Remi shows a graph -- high-level view of the project
+ Q: Do you have some vis stuff available?
+ A: Remi shows a big graph (really hard to see -- generated from dot; see the sketch below)
+ Visualizing the provenance graphs -- lots of documentation and recipes for paring down the graph: run ID, directory... can see just the runs (see every command run by the bash script)
+ Trying to write a GUI for reprounzip
+ Daniel: it would be nice to have a view into a reproduced run
+ Q: Have you tried integrating with other workflow-sharing tools like Taverna(?)
+ A: Worked with VisTrails -- (ones like myExperiment may be good to try...)
+ Comment that this seems more accessible than others such as PTU...
+ Q: How do you even deal with proprietary software?
+ A: Problems with PIs wanting specific stuff -- "Culture Hacking"
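The dot-generated graph Remi showed comes from the `reprounzip graph` subcommand; a sketch of the recipe (the package filename is illustrative):

```bash
# Extract the provenance graph from a pack, then render it with Graphviz.
reprounzip graph bechdel.dot bechdel.rpz
dot -Tpng bechdel.dot -o bechdel.png
```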
<hr/>

## Reconvene - Wrap-Up Day 1

Natalie: now is the time to ask some questions if you have lingering thoughts. The reception will be 2 buildings down at the Mendoza business school. Collecting interests for lunch tomorrow.

Jarek: Q: Domain scientists -- could you, would you use any of these tools?

- A: Haven't hit this wall yet, but we will eventually need this (Library)
- A: Disease-modeling software user -- sees that this will fill a need

Jarek: Missing pieces -- we need to be able to identify things that we are missing in the documentation capability.

Vicky: the term "Culture Hacking." Create incentives -- especially at the institutional level.

Topic of interest: understanding how we document and communicate about these different aspects of what we use and need -- standardizing our ways of defining things; abilities to direct and share across the fields; linking to things we are working on, so that we can merge efforts and collaborate.

Jarek: Q: Does Docker have a community that addresses some of this?

A: Evan -- We don't really; starting to have conversations here. There are Docker meetups worldwide, but they are mostly software-dev focused, not on reproducibility aspects.

Natalie: idea to have a Docker group meeting on campus -- but have a strategy to bring the science community to DockerCon.

Evan: Do you have anything going on here? Research computing conferences.

Doug: Disciplinary conferences are innumerable.

Natalie: It will be nice to share these with NSF.

Doug: What are we not doing yet?

- info sci connections
- not doing emulation with proprietary software well yet (Docker is working on proprietary issues right now)
- cultural changes

<hr/>

# Day 2: May 20, 2016

## Kenton McHenry: Towards a National Data Service

(more here -- see transcripts)

## Efforts underway now at the NDS - the shift to containers

+ Efforts: Jupyter, Dataverse, PEcAn, HUBzero (nanoHUB)
+ NDS Labs and Labs Workbench: meant for developers -- app store, several containers with dependencies, launch button
+ Labs has the underlying resources behind it; many resources donated and combined
+ Embedding groups in projects distributed among entities and institutions (pilot efforts)
  - addressing interoperability
  - support development
  - ex: NIST
+ Materials Data Facility
  - Globus -- transfer service, universal authentication
+ 4CeeD -- MNTL and MRL materials science (solving issues with lost metadata and failed experiments)
  - removes barriers for students to containerization, discovery, and access
+ Clowder (SEAD) supports image formats, publishing capability, DOIs
  - Architecture: built on an API; extractors via RabbitMQ
  - load balancer to API, to queue and DB storage, to extractor and back
+ Brown Dog -- DIBBs, uses Clowder
  - supports not only diverse formats but the diverse functions needed for analyses
  - data transformations
  - extensibility (key): easy to add new converters and extractors
  - API -- supports lots of clients
+ Polyglot: software servers
  - uses wrapper scripts to interface with the services; decomposes software down to the functions it performs (shortest conversion path)
+ Demos of the services discussed -- a whole cloud of options
+ The emphasis of Brown Dog is as a service
+ Video demo of the Brown Dog GUI/command-line interface
+ The point is for anybody to be able to use these things
+ Q: Is this freely available -- just open source?
+ A: So far it is open. Proactively approaching industry; ideas for how to add
+ Q: Extension tools?
+ A: Extractors, JHOVE (?)
+ Q: Is there a place where interactions are happening? DB?
+ A: will send out a link to the space -- need something more formal like EarthCube
+ Moving towards a US national data service... open to engagement

<hr/>

## Kenton McHenry Demo: NDS Labs and Share

TERRA -- modeling plant growth

+ Sensors as data sources around the country
+ Computational and storage resources: ROGER, Nebula, Blue Waters
+ Reference data products: sensor, traits, genomics
+ Shows the interface -- a deployed Clowder instance
  - create a dataset -- Clowder -- showing images

Odum Institute Archive (social science example)

+ Built off of iRODS, BitCurator, Dataverse
+ A different deployment using the iRODS iCAT
+ The main aspect here is development -- easier to address interoperability and expose new communities
+ Q: How do you see this relating to "midscale science," where people have significant amounts of data?
+ A: Globus -- transferring; SciServer...
  - many different tools address this to some extent -- it's still something that needs to be addressed

<hr/>

## Lightning Talk 1: Jian Tao

Simulocean

+ Created an OpenMalaria container on Docker Hub
+ Retrieves all the information from GitHub to build
+ From the Simulocean application -- shows project creation
+ Input contains: scenario, densities, scenario current
+ the container has a simple index webserver running
+ then you can export it -- shows standard output
+ works for applications with a very simple workflow -- even when complicated, you can still define pre/post processing
+ Another example: Delft3D

<hr/>

## Lightning Talk 2: Rafael Ferreira da Silva

Reproducibility of execution environments in scientific workflows using semantics and containers

+ former equipment
+ semantic annotations -- WICUS ontology network
+ Experiment management tool: Precip
+ Workflow management: Pegasus WMS
+ Reproducibility process
  - DAX annotator, to TC annotator, to 3 files, to the Information Specific Algorithm PRECIP script -- uses Kickstart
+ Dependency management
+ Use-case illustrations
+ tries to cover all the components of the workflow

<hr/>

## Lightning Talk 3: Tanu Malik

Reproducibility as a service: LDV -- lightweight DB virtualization + application virtualization

+ If you have a DB application -- take the entire server and the database file and put them into the package
+ an example use of a DB is text mining
+ Why doesn't it work? DBs are dynamic -- the state will be different
+ LDV = capture the exact insert and query statements: "the slice of the DB that your application touched"
+ the application is provenance-enabled
+ create a virtualized package that can then be re-enabled
+ ldv-audit -- flexibility to audit your application to what you need
+ ldv-exec -- redirecting file access
+ Shows an example using Google Earth with a landmarks DB

<hr/>

## Lightning Talk 4: Maciej Malawski

Deployment of scientific workflows into containers with PaaSage

+ HyperFlow is a "lightweight" scientific workflow programming and execution environment
+ PaaSage is model-based
  - CAMEL: cloud application modeling and execution language
  - multi-cloud
  - autoscaling
+ integrate HyperFlow and PaaSage
  - HyperFlow reports metrics
  - scalability rules
  - still in progress
+ Deployment and containers:
  - walks through the steps to deploy
  - still has some questions in regards to scaling and how to automate everything
  - how to take advantage of other infrastructures

<hr/>

## Lightning Talk 5: Lukas Heinrich

Parametrized and containerized workflows

+ LIGO published data and code without the README
+ built a Docker container with the dependencies; reran the analysis in an IPython notebook
+ LHC and HEP are extreme use cases -- lots of code, tons of data
  - wide range of software that they use
  - 2 ingredients: individual processing and workflow capture
  - parameterization -- analogous pipeline
  - develop a generic vocabulary so that everything is pluggable
  - workflow model -- aligned with W3C PROV terminology (nodes, inputs, and outputs)
  - package activities, and then you can capture the workflow
  - execute in a Docker container; JSON API
  - shows a workflow template with nodes and edges -- can also have subworkflows (infinitely nested)
  - RECAST: don't have to run everything
  - demo cluster -- see the demo in the slides

<hr/>

## Lightning Talk 6: Anita de Waard

Talking about the full research data lifecycle

+ Electronic lab notebook -- Hivebench
  - start to think about publishing when you begin your research
+ Data Rescue and Data at Risk: electronic spaces that cannot be accessed (CODATA) -- these might be good uses
+ Olive archive -- software that no longer works and has nothing to run on (software rescue)
+ Mendeley Data -- a place where you can store data (can be under embargo), linked to GitHub or not
  - looking into leveraging this, potentially with long-tail data at ND
+ Research Elements -- article data types that are designed to communicate specific elements of your work
+ SoftwareX -- shows a workflow to share and publish
+ DataSearch -- understand and work with how people search for / discover data
+ Reproducibility papers -- if someone reproduces a paper, publish a paper about how it was reproduced
+ Reuse and cite -- with the RDA (slides are on SlideShare) -- this is a linked data hub putting work and researchers together (DLI)
+ Co-founder of FORCE11 -- may be of interest
+ also NDS and RDA
+ SciDataCon meeting
+ Very interested in helping these services -- open
<hr/>

## Lightning Talk 7: Rick Johnson

SHARE: Indexing and Linking Research

+ defining SHARE -- an index of where things are
+ 115 and more data providers
+ Community -- IMLS, COS, Sloan
+ Reproducibility in the stages of the lifecycle
  - it's all over the place in the lifecycle -- need to capture all of these points
+ Many mandates internationally to make things open
+ SHARE normalizes data to make it shareable more generally -- better to have a global index
+ How does the National Data Service fit in this as well?
+ Once data is in SHARE you can get notifications, and this can come back full circle to the different interfaces
+ asking what the opportunities are to leverage this content
+ capture updates
+ OpenAIRE services and LA Referencia -- European and Latin American services -- how can SHARE be interoperable with these two?
+ SHARE-research.org is the place to go

<hr/>

## Breakout Group: Umbrella tutorial - Bertini Software Demo

+ Executed, generated an output
+ When you already have the specification, it is very easy to run
+ Umbrella consists of headers
  - tracking
  - creating
  - preserving
+ Shows an example of a specification -- JSON (see the illustrative fragment below)
+ supports a variety of services: Git, OSF...
+ Execution engine: separate
+ Parrot -- intercepts system calls and redirects them
+ Q: What does it run?
+ A: you can enter a command that you want to execute; the cmd key gives the command that should be executed
+ Q: OS "centos 6.6" with a download URL -- what is it?
+ A: a tarball of the old, stripped-down system -- the base for the tarball and Amazon (base image)
+ Some of this is just metadata (kernel version)
+ Q: How is this file created?
+ A: You have to know the dependencies, but there are a couple of tools
  - Umbrella Portal laid out in a user interface -- component library -- need to know how to run it and where to put it -- upload the file -- produces the JSON specification
  - difficult to automate this process because it is so particular

Natalie: seems similar to OCCAM's approach
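An illustrative fragment of what such a specification might look like, pieced together from the Q&A above: an `os` section with a download URL, plus a `cmd` key. The field layout loosely follows the povray example linked on Day 1, and every value here is a placeholder.

```bash
cat > openmalaria.umbrella <<'EOF'
{
  "comment": "OpenMalaria on a stripped-down CentOS 6.6 base",
  "os": {
    "name": "centos",
    "version": "6.6",
    "url": "http://example.org/centos-6.6-base.tar.gz"
  },
  "cmd": "openMalaria --scenario scenario.xml"
}
EOF
```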
<hr/>

## Breakout Group: Dashboard

+ Step 1: getting signed up with the OSF -- create a new project
+ Step 2: make it public by clicking the button
+ Step 3: adding data from Google Drive to put in OSF storage (drag and drop)
+ changing settings to connect to the NDS dashboard
+ using EmberJS with the backend API
+ API: https://demo-api.osf.io/v2/nodes/ (see the sketch below)
  - JSON API REST structure -- nodes
  - spaces, like Bitbucket
  - navigate to specific providers
+ on the dashboard, select the file that was uploaded
+ open up the container
+ looking at the slice.py code
+ the zip gets virtually unloaded into /mount
+ GUIDs to figure out the owner of the containers; container ID; Python script; Hamlet text
+ Locally -- Hamlet.txt -- the script goes through the scenes and processes them
+ Using the OSF, parses the play in a computer-understandable fashion
+ creates a directory
+ generates JSON files
+ hoping to have the ability that every time it runs, that information is fed back into the system
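The OSF v2 API shown in the breakout is plain JSON-API over HTTPS, so you can poke at it from the command line; a sketch:

```bash
# List public nodes on the demo OSF instance used in the breakout.
curl -s https://demo-api.osf.io/v2/nodes/ | python -m json.tool | head -n 20
```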
<hr/>

## Panel Talk

### 1. What are the most important (next) steps to enable scientific software preservation?

+ trying to find out who the adopters are and what they should be using, tool-wise
+ giving software the same status as the research article (Zenodo)
+ putting together the pieces that each of us have solved
+ peer-reviewed journals -- getting your software published
  - extremely low barrier of use
  - it is a large social change
  - institutions -- we need incentives
+ increase citations, altmetrics; communicate the importance to those individuals
+ journal perspectives: need to convince the journal editors, but we need to have "clear" stories
  - need to get those stories from all the domains
+ domains should get more familiar with citations for software
+ importance of professional roles -- a project management/data management issue
+ "scientist selfishness quotient" -- collaboration level: that's where more innovation can be done -- an issue of policy
+ Reputation is still caught up in the publication of journal articles
+ also the competitiveness -- fear of someone taking an idea
+ people can also make money from open source / open access
+ "My science is not really mine."

### 2. What are we missing technology-wise?

+ ability to handle pre-emption
+ ability to handle other operating systems
+ databases
+ potentially use standardization processes -- an ontology for the data environment
+ versions of the containers -- need to have explicit information and discussion about potential needs
+ Containers can be problematic -- too much in one bucket -- use one language -- best practices
+ need to log dynamic changes with ontologies
+ a reproducibility initiative with publishing
+ trust issues -- why should I believe what you say when you need a year to clean up your data?
+ coming to terms with a matrix of reproducibility -- what does a measurement look like? there are ways of describing this
  - different domains will have different levels of matrices
+ Ex: preserving simulation data -- one of the problems is funding and infrastructure to be able to reproduce
  - but don't let that be an excuse not to share
+ Q: Is it worth scaling up?
+ A: yes -- we definitely need scalability before large adoption
+ Reproducibility is very rarely taught in higher ed -- deadlines
+ Also, it has to be natural to the scientist's workflow -- the closer it can come to the native workflow, the better
+ Q: Who should pay?
  - so many open access -- free space
+ A: The federal government?
  - every astronomer's information needs several thousand; only a third goes to information needs...
+ Anita is co-chair on a committee -- repositories are threatened -- looking at deposit fees in some cases
+ Rick -- IRs looking to be the front end for the university; everything has its limits -- we may hit a limit in the future, but not yet -- encourage researchers to request additional funding, etc.
+ Trusted Digital Repository -- conforms to specific standards -- what is the funding model for your repository? labor, storage, administration; researcher need vs. repository manager -- is the institution telling the truth?
+ science is a public good, so the public should be funding it in one way or another: a cost model with an up-front preservation fee for the archive
+ there should be a requirement in the grant proposal
+ continue to look towards the day when that project no longer exists -- sustainability
+ How are we building cooperative solutions for projects?
+ Things happen in theory
+ Agencies are not really designed to be in charge -- disciplines need to take the lead, ultimately
+ we can do demonstrably better science if we share -- need to reiterate that
+ policing is needed where we have "cat and mouse games"
+ the other problem is journals don't want to publish what is not a positive result

### 3. How can those of us interested in the use of containers do outreach to other communities?

+ teaching and disseminating the value -- work with librarians
+ lots of legwork -- going to the user community
+ In HEP -- weeks' worth of preservation workshops in the program -- less incentive -- expense
+ Call it something other than preservation -- dissemination
+ Libraries have had a change in how they talk about themselves -- maybe this is the case here too -- reuse is more compelling
+ perhaps an area of study would be to capture the pros and cons of these different tools
+ people don't really define reproducibility in the same way
  - perhaps we need more conversations around reproducibility
+ Domains -- reproducibility guidelines
+ You could teach it as a form of ethics

### 4. What have you learned from this workshop?

+ make our stuff talk to each other
+ bring user communities into the development
+ design some guidelines and also some boundaries
+ common mailing list and lines of communication
+ come up with some names for what we do -- recognizable
+ We'll set up a Google group -- some taxonomy, papers, a report

Kallie will send out notifications; small amounts of editing on the transcripts. Thank you to all...