
Software Carpentry Workshop report

During the weekend of Feb 21 & 22, Joe Hunter, an undergraduate research student, and I attended a Software Carpentry workshop at the University of Arizona hosted by the iPlant Collaborative. The workshop was intended to teach scientists to analyze and manage data (beyond just using Excel and storing data on a laptop). In two days it covered quite a few important topics: data analysis and scripting with R, automating tasks in the shell, version control with Git, visualization with ggplot in R, and Atmosphere cloud computing. The logic behind this set of topics was straightforward but compelling. You start with some data, then analyze and visualize it, in this case using R. Then you learn version control with Git, which gets you out of the nightmare of tracking different versions of files.

It was obvious that R provides much more flexibility than Excel for data manipulation and visualization. The only part we explored in depth was plotting graphs with the ggplot package. This powerful package can generate beautiful plots much more easily than Excel. One example we went over was a scatter plot of diamond values against diamond clarity, with the dots colored or shaped by diamond color. The style, size and color of the points on the scatter plot can be customized.

R plot of diamond value against clarity, with points colored by diamond color

The next topic was using the shell. The shell is a text-based, command-line interface to the operating system’s services, usually shown as a plain terminal window. If you remember the early days of DOS, the shell feels similar, although the commands themselves differ.

After covering the shell, we learned version control with Git. To understand why version control matters, you should read the funny PhD Comics strip about “final.doc”. Git allows one to track all changes made to documents or files. It is especially useful for text documents, where you can compare different versions of the same document and see the changes. This is a neat feature for data files when you constantly need to create new versions as you analyze your data in different ways. You can also use GitHub to host your version-controlled files online and collaborate on the same project. We each created an online GitHub repository and practiced version control with it.
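To give a feel for what "comparing versions and seeing changes" looks like, here is a minimal sketch in Python using the standard-library difflib module. It is not what we ran at the workshop and not how Git works internally. It only produces the same kind of line-by-line diff that Git displays. The file names and data values are made up for illustration.

```python
import difflib

# Two hypothetical versions of the same small data file (made-up contents).
old_version = ["sample,value", "A,1.0", "B,2.0"]
new_version = ["sample,value", "A,1.0", "B,2.5", "C,3.0"]


def line_diff(old, new):
    """Return unified-diff lines describing how `old` became `new`."""
    return list(
        difflib.unified_diff(
            old,
            new,
            fromfile="data_v1.csv",  # hypothetical file names
            tofile="data_v2.csv",
            lineterm="",
        )
    )


diff_lines = line_diff(old_version, new_version)
for line in diff_lines:
    print(line)
```

Lines starting with "-" were removed from the old version and lines starting with "+" were added in the new one, which is the same convention Git uses when it shows changes between versions.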

The last toolkit covered was the iPlant Collaborative’s “Atmosphere” cloud computing service. This service is similar to Amazon cloud computing, but it is free. You can run R or other programs on Atmosphere. Unfortunately, some technical glitches prevented us from fully exploring its functions.

Finally, to practice the skills we learned, we went through an exercise that required us to use the tools to solve a data analysis problem. We used R to analyze country GDP data and performed version control using Git both locally and online.

I found the workshop very helpful in introducing basic data analysis skills. It was very well organized and very effective. There were two main instructors, six helpers and one local host from the iPlant Collaborative. I felt it was a weekend well spent.

There will probably be another workshop at ASU on June 8 and 9. Stay tuned and check Software Carpentry’s website for updates.
