Sunday, July 13, 2014

Using LiveCode to Stitch Together "Big Data": Yet Another Example of the Value of Craft Programming

I've been teaching a massive open online course (MOOC) for the past year - Statistics in Education for Mere Mortals - and I've been collecting research data all along. Time has come to work on writing up this research to share with my professional community. I tend to work best best when there is a deadline. "Lucky" for me, the deadline for proposals to the 2015 annual conference of the American Educational Research Association (AERA) in Chicago is just around the corner - July 22, 2014.

However, I've been slow in analyzing the data because I've been a little confounded as to how, exactly, to go about it. You see, the data have been collected by the learning management system (LMS), but not in a way that is conducive for research. In order to effectively analyze quantitative data in a research study, you need to have all of the raw data for each participant neatly organized into a single line of information. By raw data, I mean every single detail, such as individual answers to specific questions, not just totals. Put into spreadsheet terms, each person needs to be a row and all of the data for that person needs to be the columns. Then, I can import this into a statistical software tool, such as SPSS.

Collecting Data in a Learning Management System

Unfortunately, most LMSs do not organize course data in this way. It's true the LMS has a grade book that automatically keeps track of all course participants, but this only keeps track of aggregate data, such as subscores on individual quizzes, surveys, and the like. Again, I need the raw data. The LMS does keep track of the raw data, but only within individual activities. A good example for my needs is a series of surveys I ask participants to complete at the beginning, middle, and end of the course with questions such as the following:

Which of the following best describes your learning goal at this moment?

  • I intend to fully participate in the course and complete it.
  • I intend to fully participate in most of the course.
  • I intend mainly to browse and participate in a few activities.
  • I will probably just browse around with little actual participation.
  • I really don't have a specific learning goal at this point.

I don't care about their overall "score" on these surveys, instead I need to focus on answers to specific questions. For example, I'm interested in comparing among those who actually did finish the course with those who declared at the start of the course their intent to do so versus those who said they were there just to browse with no intention to actually complete the course.

So, yes, I have access to raw data for each of these surveys, but these data are "trapped" within individual survey reports. No problem, you say, just copy and paste the data from each survey for each person into a master spreadsheet file. That would work fine if I only had 20 or 30 participants. But, I have had an average of over 800 participants in each of the four courses -- about 3400 overall! And, to complicate things further, participants vary tremendously in terms of what surveys and survey questions they choose to complete. So, I find overwhelming the magnitude and complexity of manually copying and pasting this amount of data together.

I had held out some hope that the LMS administrators could help me out and produce a master spreadsheet file for me, so I made a special request for this. Unfortunately, they reported being unable to comply as apparently the data are also located in separate files at the system level (which is a little surprising).

LiveCode to the Rescue

So, what to do? Fortunately, LiveCode's powerful list processing capabilities once again provided a solution. Most of the data reports contain a few common categories, such as the person's name and a course ID. I was able to use these common data points to "stitch" together the various data files.

The code is surprisingly simple, and is based on a algorithm I've used in many programs I've written about already. Before I get into that, let's see how I've structured the LiveCode stack. The stack consists of just one card. Here's a screen shot:

As you can see, there are three fields named (from left to right) "data1," "data2," and "finaldata." The idea is to copy and paste comma-delimited text of the data from an individual survey into the first two fields, with the first course survey going into data1. (The LMS is able to generate Excel files for each survey, so it is then easy to generate a .csv file from that.) Each line is an individual participant with each data point for that person separated with commas. My script starts with the first line in data1, then methodically scans each line in data2 to look for that same person. It if finds a match on the person's name, it then combines (i.e. "stitches") the two lines together and puts it into the next line of field "finaldata." If it doesn't find a match, it then just puts the line from data1 into finaldata.

A Very Short Script That Does Some Heavy-Duty Work!

The script for button "Cross-check with item 1" does this work:

on mouseUp
   put the number of lines in field "data1" into line 1 of field "data1lines"
   put the number of lines in field "data2" into line 1 of field "data2lines"
   put empty into field "finaldata"
   repeat with i=1 to the number of lines in field "data1"
      put false into varMatch
      put i into line 1 of field "progress"
      repeat with j=1 to the number of lines in field "data2"
         if item 1 of line i of field "data1" = item 1 of line j of field "data2" then
            put true into varMatch
            put line i of field "data1"&comma&line j of field "data2"&return after field "finaldata"
            exit repeat
         end if
      end repeat
      if varMatch = false then put line i of field "data1"&return after field "finaldata"
   end repeat
   beep 2
end mouseUp

Item 1 of field data1 and data2 contains the person's name. In the parlance of databases, this acts as the "primary key" for each of these two separate data files. The local variable "varMatch" keeps track of whether a match has been found or not for each line in data1. When a match is found, it becomes "true" and the lines from the two fields are combined. If, after scanning all lines in data2, no match is found, this remains false and the line from data1 is added to finaldata without anything added to it.

It's important to note that after I have all the data combined, I need to delete the individual student names. That is important in order to comply with all human subjects policies and procedures. 

Cascading Data, or Creating a Data Quilt

After I run this program once, I am far from done. I have several more surveys and other data files to stitch together in this master file. So, I built the option to first empty fields data1 and data2, then take the data from the finaldata field and move it into data1. Then, I can copy and paste the comma-delimited text file from the next survey into data2 and begin to stitch those together. This reminds me of a cascading waterfall of data. Another way of putting this is to say that I'm stitching together a large "quilt"from each patch of data from the individual surveys!

The Continued Value of Craft Programming

I consider this tool another example of craft programming. The program focuses on solving a unique and narrow problem. Without modifications, it is of benefit to no one else but me. But, it solved a very difficult problem for me very well and was well worth the time to create. It only took me about two hours to figure out how to build this stack. It then took about 30 minutes to run the program for the data for one of my MOOC course sections, stopping as I went to generate a new .csv file for each successive survey followed by copying and pasting the data into the data2 field. I hesitate to estimate what it would have taken to manually cut and paste the data to do the same job. Maybe 25 very tedious hours per MOOC? Maybe a lot more. Really hard to say. And, remember, I have a total of four course sections to analyze, with more sections to be offered next year. I'm also sure I would have made many errors doing this manually, whereas I can trust LiveCode to follow my algorithm perfectly.

There was no one else I could turn to for help on this. No "real" programmer within my university would have been inclined to take on this job as a service to a researcher like me and I had no funds to pay someone to do it. Creating this craft programming project with LiveCode has been a lifesaver.

Now on to the real work of analyzing the data.

Wednesday, July 2, 2014

Lloyd's Video Analysis Tool, Part 7: Generating Reports

I added the feature of generating some simple reports to the project. I am deliberating not including a video in this posting because I will reporting in my next post on my first public demonstration of the project prototype to my college held earlier today. That posting will include a short video introducing the current prototype from start to finish. I'll include a good overview of the the reporting features.

Two Report Options

As I've written elsewhere in this blog, I'm a big fan of using Excel whenever possible to help organize and analyze data from projects. So, I created an "Excel Ready" report option that creates and saves a comma-separated value (.csv) file to the user's hard drive. Here is an example of this "table style" report already opened in Excel:

From here, one can use Excel to sort, organize, or search the data.

I also created a second report option that produces a simple text document using a "label style" format. Here's an example already opened in the Mac's TextEdit application:

This could obviously be opened instead with Microsoft Word and then edited anyway the user wishes.

Other Report Options to Consider

There is obviously much more that can be done here. For example, I plan on providing the option to produce reports that only focus on certain tags or combinations of tags. Likewise, I plan on giving users the option to include or exclude categories of information. For example, the user may just want the comments, but not the tags or the time code.

Tag Summary Report

At the bottom of both reports, you can see I added a short "Tag Summary Report." This seemed like a very obvious and useful summary of some key information. The list processing strengths of LiveCode make these sorts of summaries easy to produce. I again use the strategy of having two "shadow fields" (fields the user can't see) in combination with two scripts:

The first script scans the entire video analysis data and builds a running log of all tags in the field "tags found." After this field is generated, a second script scans it to produce the field "unique tags" on the right. The contents of this second field is then copied to the bottom of the report using yet another field not shown here titled "report bin."

Here's the code for producing the Excel-ready version (the text only version only differs in the way it outputs the data):

//Create list of unique tags
   repeat with i = 1 to the number of lines in field "tags found"
      put false into varTagFound
      repeat with i2 = 1 to the number of lines in field "unique tags"
         if line i of field "tags found" = line i2 of field "unique tags" then 
            put true into varTagFound
            exit repeat
         end if
      end repeat
      if varTagFound = false then put line i of field "tags found" into line (the number of lines in field "unique tags"+1) of field "unique tags"
   end repeat
   //Count the use of each unique tag
   repeat with i = 1 to the number of lines in field "unique tags"
      put 0 into varCountTags
      repeat with j = 1 to the number of lines in field "tags found"
         if line i of field "unique tags" = line j of field "tags found" then add 1 to varCountTags
      end repeat
      put varCountTags into item 2 of line i of field "unique tags"
   end repeat
   put the number of lines in field "report bin" into l
   put "Tags used in this analysis:"&comma&"Frequency"&return into line l+2 of field "report bin"
   put field "unique tags" after field "report bin"
Yes, there is a lot of nested repeating going on here! But LiveCode somehow makes it easy to conceptualize these repeat structures in ways that other programming languages don't, at least for me.

Final Thoughts

These simple reports were the last major piece of the puzzle to creating a tool that would actually do something useful. Even if I stopped working on the project right now, I have a tool (albeit crude) that does some useful work. This is a very important milestone. Heck, I might almost be ready to declare that the project has moved from beta version 2 to beta version 3!