Sunday, July 13, 2014

Using LiveCode to Stitch Together "Big Data": Yet Another Example of the Value of Craft Programming

I've been teaching a massive open online course (MOOC) for the past year - Statistics in Education for Mere Mortals - and I've been collecting research data all along. The time has come to write up this research to share with my professional community. I tend to work best when there is a deadline. "Lucky" for me, the deadline for proposals to the 2015 annual conference of the American Educational Research Association (AERA) in Chicago is just around the corner - July 22, 2014.

However, I've been slow in analyzing the data because I've been a little confounded as to how, exactly, to go about it. You see, the data have been collected by the learning management system (LMS), but not in a way that is conducive for research. In order to effectively analyze quantitative data in a research study, you need to have all of the raw data for each participant neatly organized into a single line of information. By raw data, I mean every single detail, such as individual answers to specific questions, not just totals. Put into spreadsheet terms, each person needs to be a row and all of the data for that person needs to be the columns. Then, I can import this into a statistical software tool, such as SPSS.
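To make that target shape concrete, here's a small Python sketch (the participant names and question labels are made up for illustration) that pivots per-activity records into one wide row per participant, with a blank cell wherever a question was skipped:

```python
import csv, io

# Raw data as an LMS typically reports it: one record per (person, question).
raw = [
    ("Ann", "survey1_q1", "complete"),
    ("Ann", "survey2_q1", "agree"),
    ("Bob", "survey1_q1", "browse"),
]

# Pivot into one wide row per participant.
rows = {}
for name, question, answer in raw:
    rows.setdefault(name, {"name": name})[question] = answer

out = io.StringIO()
writer = csv.DictWriter(out,
                        fieldnames=["name", "survey1_q1", "survey2_q1"],
                        restval="")  # blank cell for unanswered questions
writer.writeheader()
for row in rows.values():
    writer.writerow(row)
print(out.getvalue())
```

The resulting CSV has one line per person with every answer in its own column, which is exactly the shape a statistics package such as SPSS expects.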

Collecting Data in a Learning Management System


Unfortunately, most LMSs do not organize course data in this way. It's true the LMS has a grade book that automatically keeps track of all course participants, but this only keeps track of aggregate data, such as subscores on individual quizzes, surveys, and the like. Again, I need the raw data. The LMS does keep track of the raw data, but only within individual activities. A good example for my needs is a series of surveys I ask participants to complete at the beginning, middle, and end of the course with questions such as the following:

Which of the following best describes your learning goal at this moment?

  • I intend to fully participate in the course and complete it.
  • I intend to fully participate in most of the course.
  • I intend mainly to browse and participate in a few activities.
  • I will probably just browse around with little actual participation.
  • I really don't have a specific learning goal at this point.

I don't care about their overall "score" on these surveys; instead, I need to focus on answers to specific questions. For example, I'm interested in comparing course completion between those who declared at the start their intent to finish the course and those who said they were there just to browse, with no intention of actually completing it.

So, yes, I have access to raw data for each of these surveys, but these data are "trapped" within individual survey reports. No problem, you say, just copy and paste the data from each survey for each person into a master spreadsheet file. That would work fine if I only had 20 or 30 participants. But, I have had an average of over 800 participants in each of the four courses -- about 3400 overall! And, to complicate things further, participants vary tremendously in which surveys and survey questions they choose to complete. So, the magnitude and complexity of manually copying and pasting this much data together is overwhelming.

I had held out some hope that the LMS administrators could help me out and produce a master spreadsheet file for me, so I made a special request for this. Unfortunately, they reported being unable to comply, as apparently the data are also stored in separate files at the system level (which is a little surprising).

LiveCode to the Rescue


So, what to do? Fortunately, LiveCode's powerful list processing capabilities once again provided a solution. Most of the data reports contain a few common categories, such as the person's name and a course ID. I was able to use these common data points to "stitch" together the various data files.

The code is surprisingly simple, and is based on an algorithm I've used in many programs I've written about already. Before I get into that, let's see how I've structured the LiveCode stack. The stack consists of just one card. Here's a screen shot:

[Screenshot: the stack's single card with three fields side by side]
As you can see, there are three fields named (from left to right) "data1," "data2," and "finaldata." The idea is to copy and paste comma-delimited text of the data from an individual survey into the first two fields, with the first course survey going into data1. (The LMS is able to generate Excel files for each survey, so it is then easy to generate a .csv file from that.) Each line is an individual participant, with each data point for that person separated by commas. My script starts with the first line in data1, then methodically scans each line in data2 to look for that same person. If it finds a match on the person's name, it then combines (i.e., "stitches") the two lines together and puts the result into the next line of field "finaldata." If it doesn't find a match, it just puts the line from data1 into finaldata.

A Very Short Script That Does Some Heavy-Duty Work!


The script for button "Cross-check with item 1" does this work:

on mouseUp
   put the number of lines in field "data1" into line 1 of field "data1lines"
   put the number of lines in field "data2" into line 1 of field "data2lines"
   
   put empty into field "finaldata"
   
   repeat with i=1 to the number of lines in field "data1"
      put false into varMatch
      put i into line 1 of field "progress"
      repeat with j=1 to the number of lines in field "data2"
         if item 1 of line i of field "data1" = item 1 of line j of field "data2" then
            put true into varMatch
            put line i of field "data1"&comma&line j of field "data2"&return after field "finaldata"
            exit repeat
         end if
      end repeat
      if varMatch = false then put line i of field "data1"&return after field "finaldata"
   end repeat
   
   beep 2
end mouseUp


Item 1 of each line in fields data1 and data2 contains the person's name. In the parlance of databases, this acts as the "primary key" for each of these two separate data files. The local variable "varMatch" keeps track of whether a match has been found for each line in data1. When a match is found, it becomes "true" and the lines from the two fields are combined. If, after scanning all lines in data2, no match is found, varMatch remains false and the line from data1 is added to finaldata without anything appended to it.
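For readers more comfortable in other languages, here's the same nested-loop left-join sketched in Python (the sample names and values are invented for illustration):

```python
# Left-join data2 onto data1 by the first comma-separated item (the name).
# Mirrors the button script: combine lines on a match, else keep the
# data1 line unchanged.
def stitch(data1, data2):
    final = []
    for line1 in data1.splitlines():
        key = line1.split(",")[0]
        for line2 in data2.splitlines():
            if line2.split(",")[0] == key:
                final.append(line1 + "," + line2)  # match: combine the lines
                break
        else:
            final.append(line1)                    # no match: keep data1 line
    return "\n".join(final)

data1 = "Ann,5,3\nBob,2,4\nCarol,1,1"
data2 = "Ann,yes\nCarol,no"
print(stitch(data1, data2))
# prints:
# Ann,5,3,Ann,yes
# Bob,2,4
# Carol,1,1,Carol,no
```

Note that, like the LiveCode script, this keeps the duplicate name column in combined lines; those can be cleaned up afterward.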

It's important to note that after I have all the data combined, I need to delete the individual student names. That is important in order to comply with all human subjects policies and procedures. 

Cascading Data, or Creating a Data Quilt


After I run this program once, I am far from done. I have several more surveys and other data files to stitch together in this master file. So, I built the option to first empty fields data1 and data2, then take the data from the finaldata field and move it into data1. Then, I can copy and paste the comma-delimited text file from the next survey into data2 and begin to stitch those together. This reminds me of a cascading waterfall of data. Another way of putting this is to say that I'm stitching together a large "quilt" from each patch of data from the individual surveys!
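The cascade can be sketched in Python as a fold over the survey files, where each step plays the role of moving finaldata back into data1 and pasting the next survey into data2 (the survey data below is invented; `stitch` mirrors the button script's left-join-by-name logic):

```python
from functools import reduce

def stitch(master, survey):
    """Left-join survey lines onto master lines by the first item (name)."""
    out = []
    for line in master.splitlines():
        key = line.split(",")[0]
        match = next((s for s in survey.splitlines()
                      if s.split(",")[0] == key), None)
        out.append(line + "," + match if match else line)
    return "\n".join(out)

# Hypothetical exports from three successive surveys.
surveys = ["Ann,goal_complete\nBob,goal_browse",
           "Ann,midpoint_yes",
           "Bob,end_no"]

# Fold each survey into the running master, one cascade step at a time.
master = reduce(stitch, surveys)
print(master)
```

Participants who skipped a survey (as many MOOC participants do) simply keep their shorter row, just as in the LiveCode version.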

The Continued Value of Craft Programming


I consider this tool another example of craft programming. The program focuses on solving a unique and narrow problem. Without modifications, it is of benefit to no one else but me. But, it solved a very difficult problem for me very well and was well worth the time to create. It only took me about two hours to figure out how to build this stack. It then took about 30 minutes to run the program for the data for one of my MOOC course sections, stopping as I went to generate a new .csv file for each successive survey followed by copying and pasting the data into the data2 field. I hesitate to estimate what it would have taken to manually cut and paste the data to do the same job. Maybe 25 very tedious hours per MOOC? Maybe a lot more. Really hard to say. And, remember, I have a total of four course sections to analyze, with more sections to be offered next year. I'm also sure I would have made many errors doing this manually, whereas I can trust LiveCode to follow my algorithm perfectly.

There was no one else I could turn to for help on this. No "real" programmer within my university would have been inclined to take on this job as a service to a researcher like me and I had no funds to pay someone to do it. Creating this craft programming project with LiveCode has been a lifesaver.

Now on to the real work of analyzing the data.

3 comments:

  1. FWIW, by using the "repeat for each" loop method combined with moving data out of the complex field structures into variables for use within the loop, this should bring your processing time down by at least an order of magnitude:

    on mouseUp
       put the number of lines in field "data1" into line 1 of field "data1lines"
       put the number of lines in field "data2" into line 1 of field "data2lines"
       put empty into field "finaldata"
       # repeat with i=1 to the number of lines in field "data1"
       put fld "data1" into tData1
       put fld "data2" into tData2
       put 0 into i
       repeat for each line tLine1 in tData1
          add 1 to i
          put false into varMatch
          put i into line 1 of field "progress"
          # repeat with j=1 to the number of lines in field "data2"
          repeat for each line tLine2 in tData2
             if item 1 of tLine1 = item 1 of tLine2 then
                put true into varMatch
                # put line i of field "data1"&comma&line j of field "data2"&return after field "finaldata"
                put tLine1 & comma & tLine2 & return after tFinalData
                exit repeat
             end if
          end repeat
          # if varMatch = false then put line i of field "data1"&return after field "finaldata"
          if varMatch = false then put tLine1 & return after tFinalData
       end repeat
       put tFinalData into fld "finaldata"
       beep 2
    end mouseUp

  2. Why are you solving a problem like this using LiveCode? Why? Why? Why? This is a database problem. You can easily put this data (using common utilities) in a database, with a table for each LMS file, associate the tables with each other (key fields), and drop on a reporting tool - easy to do. This is maybe an hour's work at most, and you can easily slice, dice, and julienne your data. Databases are made for exactly this kind of issue. Access could even handle this easily, including the reporting. It would also produce super nice output, easy to graph if you wanted, and distributions and other statistical data become easily available. LiveCode is a great system, but use it for what it's intended - interfaces and rules. Leave the data to the database.

  3. Re "This is a database problem". Perhaps, and yet we see an ever-growing role for scripting languages in data analysis. Given the broad and ever-changing needs of analytics, it's useful to have a wide range of tools in one's kit.

    Python is a very popular choice for such work, even where some of the data may be stored in a database.

    LiveCode's performance is roughly on par with Python in many respects, and it offers language features that make it arguably even better suited for a wide range of analytics tasks.

    At the recent LiveCode conference I had the pleasure of meeting one of the world's leading SAS consultants who's making increasing use of LiveCode in his work.

    LiveCode may not be equal to the sum of Python, R, sed, awk, and the other tools commonly used in analytics. But it includes enough of the best of many of them that it can be a very good choice, sometimes complementing other tools, sometimes as a complete solution in itself. And always, very fun to work with. :)
