Wednesday, September 30, 2015

Update on Using LiveCode to Build a Word Cloud: A Cloud is Forming!

I spent about two more hours on my little word cloud project. I knew that the next step of building the actual word cloud from the list of unique words and frequencies was definitely within reach. Here's an example of one of the first word clouds I built using the text of Abraham Lincoln's second inaugural address:


There is a "mild" hack at play here. The words simply go to a random spot within this square area. Each separate field containing a word has the script "grab me" on mousedown so that I can easily move the words to a more aesthetically pleasing location. I decided it wasn't worth trying to figure out ways to keep all of the words from overlapping, etc. However, for an excellent example of how to accomplish this, check out the blog posting by Ali Lloyd on his efforts to build a word cloud. Ali is one of the excellent professionals who work at RunRev (the parent company of LiveCode). Ali's solution is exactly what you think of when you think of a word cloud. It has words of different sizes and colors with different orientations filling every nook and cranny. It's really marvelous. So, many thanks to Ali for sharing this link with me in his comment to my previous blog posting on this topic. I'll be studying his code for some time to come. (And any script that uses sines and cosines makes me want to purr.)

How I Did It


OK, back to my humble attempt. One of the challenging parts to the project was figuring out the step-wise progression of font sizes. The above word cloud looks OK, but I was able to improve the word cloud algorithm in several fundamental ways. All of the code to build the word cloud is in the green button "Build Word Cloud" shown at the bottom of this post, but here are a few key highlights.

Font Size Step-Wise Progression


I perfected the step-wise progression of the font size so that the word with the highest frequency of use has a font size of 96 pixels. In the example above, the font size is directly proportional to the frequency. That was inadequate for several reasons, the most obvious being that the font sizes can be radically different for just the first few most frequently used words. So, I revised the script so that the next most frequently used word has a font size 12 pixels smaller, no matter how many fewer times it was used, and so on. Let me explain a little further. If one word was mentioned 100 times, but the next most frequently mentioned word was used only 20 times, then my revised script gives the second word a font size only 12 pixels smaller (84 pixels instead of 96). Think of 12 pixels as the height of the "step."

From a data visualization standpoint, that skews the proportion in an inappropriate way, but it makes for a more aesthetically pleasing outcome. So, I think it all depends on what the purpose of the word cloud is. For scientific purposes, it is inadequate because it skews the output, but for a quick, pleasing visual that gives the gist of what's going on in a passage of text, it's fine. I also made 12 pixels the smallest font size that would be used. (In my original script, it was possible to have one word be 96 pixels and all remaining words 12 pixels if the most frequently used word was mentioned an inordinate number of times compared to all other words.)
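To make the "step" idea concrete, here is a minimal sketch of the size calculation pulled out into two small functions. The names stepTextSize and textSizesFor are hypothetical (they do not appear in the button script), and the sketch assumes the input is a list of "word,frequency" lines already sorted from most to least frequent.

```livecode
// Hypothetical helper: text size for a word that sits pStepsDown
// frequency steps below the most frequent word.
// 96 is the largest size and 12 the smallest, as described above.
function stepTextSize pStepsDown, pStepHeight
   return max(12, 96 - (pStepsDown * pStepHeight))
end stepTextSize

// Hypothetical helper: pList holds one "word,frequency" pair per line,
// sorted by descending frequency. Returns one "word,textSize" pair per line.
function textSizesFor pList, pStepHeight
   local tPrevious, tSteps, tLine, tResult
   put 0 into tSteps
   put empty into tPrevious
   repeat for each line tLine in pList
      // Move down one step only when the frequency actually drops,
      // so words with equal frequency share the same size
      if (tPrevious is not empty) and (item 2 of tLine < tPrevious) then
         add 1 to tSteps
      end if
      put item 2 of tLine into tPrevious
      put item 1 of tLine & comma & stepTextSize(tSteps, pStepHeight) & return after tResult
   end repeat
   delete last char of tResult
   return tResult
end textSizesFor
```

With a step height of 12, the list "the,100", "and,20", "of,20", "war,5" should come back as sizes 96, 84, 84 and 72: the drop from 100 to 20 costs only one step, and the two words tied at 20 get the same size.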

Adding Color


I also added the option to pick a color at random using the following code:

    put random (255) into rColor  
    put random (255) into gColor  
    put random (255) into bColor  

    if varColor is true then   
      set the foregroundcolor of it to rColor,gColor,bColor  
    else  
      set the foregroundcolor of it to black       
    end if  

You'll need to scan the code below for these lines in the button script. The first three lines just pick three numbers at random from 1 to 255 (LiveCode's random function starts at 1, not 0). These are used to produce a random RGB color if the "Color" option is checked.
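One detail about the random function: random(255) returns a whole number from 1 to 255, so it never produces 0. That hardly matters for a word cloud, but if the full 0 to 255 range is wanted for each color channel, a small variation would do it. This is only a sketch, and randomRGB is a hypothetical function name that is not part of the button script.

```livecode
// Hypothetical helper: returns a random "r,g,b" color string.
// random(256) gives 1..256, so subtracting 1 gives 0..255.
function randomRGB
   return (random(256) - 1) & comma & (random(256) - 1) & comma & (random(256) - 1)
end randomRGB
```

It could then be used in place of the three separate variables, for example: set the foregroundColor of it to randomRGB()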

These changes produced the following word cloud:


I know my graphic design skills don't qualify me to give any "expert" opinion, but it definitely seems like an improvement to me, at least aesthetically. Again, though, whether you get a more accurate representation of the data is an important question to ask.

Another improvement was the use of the "font step" slider at the top of the screen. When I ran the algorithm with different text passages, I found that 12 was not always the optimal number for the font step. I decided it was better to let the user experiment with this. A final minor improvement, I think, is making sure that the two words with the highest frequency are always shown in black for added emphasis. Here's a screen shot of the card that builds the word cloud.



I Found Two Golden Nuggets: formattedWidth and formattedHeight


One of the wonderful outcomes of building this project is discovering the "formattedWidth" and "formattedHeight" properties. These properties do all of the hard work of figuring out exactly how wide and how tall a text field needs to be for its contents to fit perfectly within it. I didn't know about these properties when I was first building my Q Sort project, so I came up with my own (very imperfect) function to try to do the same thing. I've since updated my Q Sort app to use them. Here are the two key lines of code that accomplish this feat:

    set the width of field "word object" to the formattedWidth of field "word object"  
    set the height of field "word object" to the formattedHeight of field "word object"  

Next Steps


As I looked at Ali Lloyd's code, it reminded me of Richard Gaskin's advice to me almost a year ago to use the "repeat for each" form of going through a list of data rather than the "repeat with" approach that I have become so fond of. My code that computes the frequency of the words in a passage of text is painfully slow, so I'm now very motivated to get with it and try out the "repeat for each" approach. So, look for at least one more update on this little project.
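For anyone curious what the "repeat for each" form might look like for counting words, here is a minimal sketch. It is not the project's actual code: the function name wordFrequencies and its variables are hypothetical, and it assumes punctuation (commas and quotation marks in particular) has already been stripped from the text. The speed difference comes from the fact that "repeat with i" combined with "word i of" has to count forward from the start of the text on every pass, while "repeat for each" walks through the text once.

```livecode
// Hypothetical function: returns one "word,count" pair per line,
// sorted from most to least frequent.
// Assumes pText has already had its punctuation removed.
function wordFrequencies pText
   local tCounts, tWord, tResult
   // Fold case so "The" and "the" are counted together
   put toLower(pText) into pText
   // Walk the text once, tallying each word in an array
   repeat for each word tWord in pText
      add 1 to tCounts[tWord]
   end repeat
   // Turn the array into a "word,count" list
   repeat for each key tWord in tCounts
      put tWord & comma & tCounts[tWord] & return after tResult
   end repeat
   delete last char of tResult
   sort lines of tResult descending numeric by item 2 of each
   return tResult
end wordFrequencies
```

The result could go straight into the list the cloud is built from, for example: put wordFrequencies(field "passage") into field "word frequencies" (the field name "passage" is an assumption here).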

As always, the bottom line for me is that I continue to learn new things every time I build a LiveCode project, however small. But, is there really any other way?

Script on the Button "Build Word Cloud":


 on mouseUp  
   //Erase any existing word cloud first  
   put the number of fields into L  
   repeat with i = 5 to L-1  
    put i-4 into j  
    put item 1 of line j of field "word frequencies" into varFieldName  
    put varFieldName into message  
    delete field varFieldName  
   end repeat  
   put false into varColor  
   if the hilite of button "color" is true then put true into varColor  
   put the thumbposition of scrollbar "fontdifferencebar" into varFontChangeAmount  
   //Build the word cloud  
   set the movespeed to 0  
   //Determine the largest frequency - this will get the largest font size in the word cloud  
   put item 2 of line 1 of field "word frequencies" into varMaxFrequency  
   put 0 into varFontDifference  
   put field "minimum frequency" into varMinFrequency  
   //This is the repeat loop that will create each word, resize its text size, then move it  
   repeat with i = 1 to the number of lines in field "word frequencies"  
    put random (255) into rColor  
    put random (255) into gColor  
    put random (255) into bColor  
    if item 2 of line i of field "word frequencies" < varMaxFrequency then  
      add 1 to varFontDifference  
      put item 2 of line i of field "word frequencies" into varMaxFrequency  
    end if  
    if item 2 of line i of field "word frequencies" < varMinFrequency then exit repeat  
    //The next two lines determine the area of the screen where word cloud will be built  
    put random(300)+100 into x  
    put random(300)+100 into y  
    //Create the next word for the word cloud  
    copy field "word object" on card "library" to this card  
    hide it  
    if varColor is true then   
      set the foregroundcolor of it to rColor,gColor,bColor  
    else  
      set the foregroundcolor of it to black       
    end if  
    if varFontDifference<2 then set the foregroundcolor of it to black  
    put item 1 of line i of field "word frequencies" into field "word object"  
    //Determine font height for the word  
    //put varMaxFrequency - item 2 of line i of field "word frequencies" into varFontDifference  
    put 96-(varFontDifference*varFontChangeAmount) into varTextSize  
    if varTextSize < 12 then put 12 into varTextSize  
    set the textSize of field "word object" to varTextSize  
    set the width of field "word object" to the formattedWidth of field "word object"  
    set the height of field "word object" to the formattedHeight of field "word object"  
    //Rename the newly copied field as the word it contains  
    set name of field "word object" to item 1 of line i of field "word frequencies"  
    //Move the word to a random spot within the word cloud screen area  
    move it to x,y in 1 millisecond  
    show it  
   end repeat  
 end mouseUp  

Postscript: About My Formatted Code


Ali Lloyd's post reminded me that the way I've been showing code in my blog posts has been really terrible, so I did the obvious thing and googled "showing code in a blog post using blogger" and quickly found a great tool:

http://codeformatter.blogspot.com/

The much improved formatting above is the result.


2 comments:

  1. Hi Lloyd, I enjoyed this post -- it's very informative, and it's great to help follow your thought processes.

    With regard to formatting your code, you may find it easier to read on the blog if you include some more vertical whitespace. For example, if you look at this script in the LiveCode source code, you can see that blank lines and formatted comments are used to help divide the code up into logical sections.

    Also, I think Ali has some JavaScript that can be used to help format LiveCode script on a web page with syntax highlighting. I'll see if I can find out from him!

  2. Thanks for your comment, Peter. I was glad to find the codeformatter tool, but I'd definitely be interested in better approaches that improve the formatting of the LiveCode script in my blog.
