Important note: The second assignment in this course covers the topic of Graph Analysis in the Cloud, in which you will use Elastic MapReduce and the Pig language to perform graph analysis over a moderately large dataset, about 600GB. In order to complete this assignment, you will need to make use of Amazon Web Services (AWS). Amazon has generously offered to provide up to $50 in free AWS credit to each learner in this course to allow you to complete the assignment. Further details regarding the process of receiving this credit are available in the welcome message for the course, as well as in the assignment itself. Please note that Amazon, University of Washington, and Coursera cannot reimburse you for any charges if you exhaust your credit. While we believe that this assignment contributes an excellent learning experience in this course, we understand that some learners may be unable or unwilling to use AWS. We are unable to issue Course Certificates for learners who do not complete the assignment that requires use of AWS. As such, you should not pay for a Course Certificate in Communicating Data Results if you are unable or unwilling to use AWS, as you will not be able to successfully complete the course without doing so. Making predictions is not enough! Effective data scientists know how to explain and interpret their results, and communicate findings accurately to stakeholders to inform business decisions. Visualization is the field of research in computer science that studies effective communication of quantitative results by linking perception, cognition, and algorithms to exploit the enormous bandwidth of the human visual cortex. In this course you will learn to recognize, design, and use effective visualizations. Just because you can make a prediction and convince others to act on it doesn’t mean you should. In this course you will explore the ethical considerations around big data and how these considerations are beginning to influence policy and practice. You will learn the foundational limitations of using technology to protect privacy and the codes of conduct emerging to guide the behavior of data scientists. You will also learn the importance of reproducibility in data science and how the commercial cloud can help support reproducible research even for experiments involving massive datasets, complex computational infrastructures, or both. Learning Goals: After completing this course, you will be able to: 1. Design and critique visualizations 2. Explain the state-of-the-art in privacy, ethics, governance around big data and data science 3. Use cloud computing to analyze large datasets in a reproducible way.
“If I have seen further, it is by standing on the shoulders of giants.”
—Sir Isaac Newton
Standing on the shoulders of giants is a metaphor we often use to describe how research advances. More than an aphorism, it is a mindset that we ingrain in students when they start graduate school: take the time to understand the current state of the art before attempting to advance it further. Having to justify why you have reinvented the wheel during your PhD defense is not a comfortable situation to be in. Moreover, the value of truly reproducible research is reinforced every time a paper is retracted because its results cannot be reproduced, or every time that promising academic research—such as pursuit of important new drugs—fails to meet the test of reproducibility.
Of course, to truly learn from work that has preceded yours, you need access to it. How can you build on the latest research if you don’t know its details? Thankfully, open access (OA) is making it easier to find research papers, and Microsoft Research is committed to OA. Though it’s a good start, OA articles only contain words and pictures. What about the data, software, input parameters, and everything else needed to reproduce the research?
While research software provides the potential for better reproducibility, most people agree that we are some way from achieving this. It’s not just a matter of throwing your source code online. Even though tools such as GitHub provide excellent sharing and versioning, it is up to the researcher or developer to make sure the code cannot only be re-run but also understood by others. There are still technical issues to overcome, but the social ones are even harder to tackle. The development of scientific software and researchers’ selection of which software to use and reuse are all intertwined. We at Microsoft Research are concerned with this—see “Troubling Trends in Scientific Software” in the May 17, 2013, issue of Science magazine.
Kenji Takeda talks about reproducible research and the cloud at CW14.
Photo: Tim Parkinson,CC-BY
This year’s Collaboration Workshop (CW14), run by the Software Sustainability Institute (SSI), brought together likeminded innovators from a broad spectrum of the research world—researchers, software developers, managers, funders, and more—to explore the role of software in reproducible research. This theme couldn’t have been timelier, and I was excited to take part in this dynamic event again with a talk on reproducible research and the cloud. The “unconference” format—where the agenda is driven by attendees’ participation—was perfect for exploring the many issues around reproducible research and software. So, too, was the eclectic make-up of the attendees, so unlike that at more conventional conferences.
Hack Day winners receive Windows 8.1 tablets for Open Source Health Check. Left to right: Arfon Smith (GitHub), Kenji Takeda (Microsoft Research), James Spencer (Imperial College), Clyde Fare (Imperial College), Ling Ge (Imperial College), Mark Basham (DIAMOND), Robin Wilson (University of Southampton), Neil Chue-Hong (Director, SSI), Shoaib Sufi (SSI)
Instead of leaving after two days, many participants stayed on for Hack Day—a hackathon that challenged them to create real solutions to problems surfaced at the workshop. Eight team leaders had to pitch their ideas to the crowd, as the researchers and software developers literally voted with their feet to join their favorite team. The diversity of ideas was impressive, such as scraping the web to catalogue scientific software citations, extending GitHub to natively visualize scientific data, and assessing research code quality online. We made sure that teams were able to use Microsoft Azure to quickly set up websites, Linux virtual machines, and processing back-ends to build their solutions.
Arfon Smith from GitHub and I served as judges, and we had a tough time choosing a winning project. After much back-and-forth, we awarded the honor to the Open Source Health Check team, which created an elegant and genuinely usable service that combines some of the best practices discussed during the workshop. Their prototype runs a checklist on any GitHub repository to make sure that it incorporates the critical components for reproducibility, including documentation, an explicit license, and a citation file. The team worked furiously to implement this, including deploying it on Microsoft Azure and integrating it with the GitHub API, to demonstrate a complete online working system.
Recomputation.org aims to make computational experiments easily reproducible decades into the future.
In addition to our role at CW14, Microsoft Research is delighted to be supporting teams working on new approaches to scientific reproducibility as part of our Microsoft Azure for Research program:
- Recomputation.org is focused on taking advantage of virtual machines to preserve software experiments. By packaging up a researcher’s entire experimental setup in a VM, it becomes trivial to replicate their work. Once these VMs have been uploaded to VMDepot, it takes just a few mouse clicks to call up the complete experiment in Microsoft Azure, log in, and rerun the research. From there, it is possible to drill down and dissect the work, extend it, and then share a new version online. It’s a great collaboration with the Software Sustainability Institute, and the cloud provides an ideal environment for this platform.
- Patrick Henaff, of IAE de Paris, is working with Zeliade Systems on enhanced IPython notebooks shared via the cloud at zanadu.io. Their vision for social coding around reproducible software tackles some of the cultural issues by allowing researchers to easily share, discover, and reuse their work in an executable way.
- Titus Brown, of Michigan State University, is conducting pioneering work on open biological science, open protocols, and provenance-preserving analyses in the cloud. His pilot project uses publicly available data from the Marine Eukaryotic Transcriptome Sequencing Project to move processing workflows into Microsoft Azure in a reproducible way, allowing researchers to tweak and remix their analyses.
While we still have not achieved truly reproducible research, CW14 proved that the community is dedicated to improving the situation, and cloud computing has an increasingly important role to play in enabling reproducible research.
—Kenji Takeda, Solutions Architect and Technical Manager, Microsoft Research Connections