Saturday, April 13, 2013

Creating Unique Random Codes

Did you ever look at a YouTube video link and wonder where that odd set of characters at the end comes from? Here's an example (the video is one that yours truly made):

https://www.youtube.com/watch?v=xSnnrfNO-o0

I've always assumed labels like "xSnnrfNO-o0" are just a string of 11 random letters, numbers, and characters. Lots of sites use the technique of generating random strings like this for links, passwords, and special codes. (Tinyurl.com is another good example.) Now, it is one thing to create a random string, but it is quite another to make sure that every random string is completely unique from all others. Obviously, no YouTube video can have the same exact string, otherwise there would be chaos, havoc, and no way for grown people to watch their favorite puppy or kitten video when the mood strikes.

So, I set out to create a LiveCode project that created a list of unique, random strings of letters, numbers, and characters. For example, consider the number of unique two character codes that one could generate from the following: a, b, and c.

We can easily figure out that there must be 9 such codes:

aa
ab
ac
ba
bb
bc
ca
cb
cc

But, what if I wanted to create unique three character codes (e.g. abc, aab) or four character codes (e.g. aaab, aaba). How many would there potentially be? So, let's consider the math. This is an example of a permutation. There is a simple formula to answer this question:

nr
(This is read as "n to the power of r.")

Where n is the number of things to choose from, and you choose r of them (also note that repetition is allowed and order matters).

If you want to know more than that about the math, here's a good Web site that will explain both combinations and permutations:

http://www.mathsisfun.com/combinatorics/combinations-permutations.html

So, in the example above, n is 3 (abc) and r is 2 (codes with two characters, or "slots"). 32 is 3 times 3, or 9.

If we wanted to generate 3 character codes, it would be 33, or 3 times 3 times 3, or 27.

In YouTube's case, r would be 11 because there are 11 characters -- let's call them "slots" from here on out -- in the code. But from what group of characters is YouTube using? Well, from the one example above, it seems to consist of lower case letters, upper case letters, numbers, and special characters (e.g. hyphens and the like). If we just stick to letters and numbers, the list would contain 62 different characters (26 lower case letters, 26 upper case letters, and the 10 digits). How many different codes with 11 slots could we construct from this list?

It would be 6211, or -- in scientific notation -- 5.2036561e+19, or...

52,036,561,000,000,000,000 unique codes. In short, a lot.

Now, allow me to repeat myself and say that it is one thing to generate a single code, but it quite another to make sure that the next code is guaranteed to be different from all others generated so far.

So, how does YouTube or any other site do this?

I can quickly see three ways of doing this:

1. Given a list of letters, numbers, or characters, select one of these at random for each slot. As each code is created, add it to an ever growing list, then every time you create a new code, first compare it to all of the codes in this list. If it matches any of them, throw it out and start over. If no matches are found, give out the code and then add the code to the list.

2. First generate the list of all possible codes (i.e. permutations), then begin assigning the codes one at a time. There will be no need to check against the list because you made sure that all codes on the list are unique before you start handing out the codes.

3. Only pretend that the codes are randomly generated. After all, who's going to check? Use some secret coding system where 1 is an a, 2 is a b, # is a z, and so on. Then, you can just start a counter at 10000000001 and generate codes sequentially. They may look random, but anyone who knows the "secret system" could work backwards to figure out what code will come next.

If you see other ways, share them with me.

I'm going to immediately dismiss #3, mainly because there are plenty of people in this world with loads of time on their hands who will likely figure out what's going on. Of choices #1 and #2, #2 seems the best strategy because of the simple fact that it would take the least amount of time to implement. Let some computer take the weekend to generate the first couple billion of the codes and put them in a list, then start pulling the codes from that list one by one. The only downside is that you have to store a very big list of codes somewhere and keep that list safe. And, it's hard to tell just how big to make that list. If you create the next YouTube, you better have a long list. But, if you create Web sites as popular as mine, 10 or 20 codes will likely do the trick.

Option #1 looks attractive at first, but stop and think about it for a second with a slight air of confidence. After a little while, if the site you've built is modestly successful, the list of codes could easily grow to thousands or millions. (How many YouTube videos have been uploaded in the time it takes you to read this?) Every time a new code is generated, the computer has to compare that code first to *every single previous code* in the list before giving it out. That will take some serious processing.

So, ponder that for awhile while we take the first step that applies to both options #1 and #2, namely that we have to learn how to generate a list of unique codes. So, let's start with that. First, here is a screen shot of my LiveCode program:

Here are the directions of how to use the program:

1. The "code elements" are the things that make up the code. Scroll through the list and you'll see that I'm using all lower case letters, all upper case letters, and all 10 digits, for a total of 62 characters. Notice that each element needs to be on its own line. This field is editable, so feel free to change it as you wish, but just make sure that only one element is on each line.

2. Enter the number of code slots. That is, how many characters will make a code. The code "aa" contains 2 slots, whereas "ad4" has 3 slots, and "gK85ty9" has 7 slots. Enter any number you wish (but I suggest keeping it relatively small -- under 10).

3. Press the button "Create Random Codes" and watch the magic.

In the above example, I'm generating codes with only 2 slots, so 62 raised to the power of 2 equals 3844 total possible unique codes. As you can see, after generating 501 unique codes, a total of 36 duplicates were found along the way. But don't worry, these were deleted as soon as they were created. The field labeled "Codes Created" stores all of the unique codes. So, if you were to scroll through the list above, you would find 501 unique codes.

4. Click "Stop!" at any time to tell LiveCode to stop at that point. (Depending on the number of slots and the total number of elements, it could take a very long time to generate all the unique codes.)

I find it fascinating to see how fast or slow the unique codes and the duplicates are generated. Not surprisingly, early on there are few duplicates, but this dynamic certainly changes as the calculation engine cranks out a bunch of codes. (It would be cool to graph this dynamic.)

How did I program this in LiveCode? Let's just focus on the basics. (Most of the scripting, as is typical, is actually for the interface design. So, I'll deliberately ignore that stuff for now knowing that you can explore the raw code on your own. But, I'm happy to explain anything further, so just ask.)

First, we need a text field to store the group of characters from which we will build the codes. I'll name this field "code elements." Each line consists of a single element (e.g. "a" on line 1, "b" on line 2, etc.)

We will create a variable -- varTotalCodeElements -- that will contain the total number of lines with an element in it:

put number of lines in field "code elements" into varTotalCodeElements

We can then use this script to select one of the elements at random and put it into the variable "varA":

put line random(varTotalCodeElements) of field "code elements" into varA.

To build the code -- let's use the variable "varCode" to store it -- we need to create a repeating loop that repeats as many times as slots in the code. So, if we want a code with five slots (e.g. abcde), the code would look like this:

   repeat 5 times
      put line random(varTotalCodeElements) of field "code elements" into varA
      put varCode&varA into varCode
   end repeat


Recall that the ampersand (&) concatenates, or joins, string elements. So, varCode builds by one character each time the loop repeats.

Of course, we should build this so that our program is flexible enough to build codes that contains any number of slots. So, let's use the variable "varTotalCodeSlots" to store that number:

   repeat varTotalCodeSlots times
      put line random(varTotalCodeElements) of field "code elements" into varA
      put varCode&varA into varCode
   end repeat 


It's important to note at this point that we also have all the information we need to determine the total number of possible permutations, that is, the total number of unique codes possible. Here's the script:

   put varTotalCodeElements^varTotalCodeSlots into varTotalCodesPossible

The symbol ^ is used to represent "to the power of." So, the value of the variable "varTotalCodesPossible" will contain the value of varTotalCodeElements raised to the power of varTotalCodeSlots. (If you download and run the final program, you will see where I use it.)

We will be adding all unique codes generated to an ever growing list of codes. So, let's get that part ready by creating another field to hold these codes. We'll simply call that field "codes." So, to add a code to this list, we first have to determine how many codes are already in this list, then we add the most current code generated to the very next line. These two lines of script handle this:

    put number of lines in field "codes" into x
    put varCode into line x+1 of field "codes"

(Side note: the variable x is a local variable, so I usually don't bother with long naming conventions given that local variables are only temporary, unlike global variables.)

Of course, before we add a code to this list, we have to first determine if the code is already contained somewhere on the list. How is that done? Answer: by brute force. That is, we will generate a random code, then compare it to each and every code already in the field "codes." A tedious task that the computer loves to do. Here's the script:

   put number of lines in field "codes" into x
   repeat with i = 1 to x
     put line i of field "codes" into y
      if varCode = y then
         //Sorry, it's already been taken, so stop, and try again.
         exit repeat

        //go and generate another random code and start the repeat loop again
      end if
   end repeat


    //Yippee! This one hasn't been used yet! Add it to the list.
    put varCode into line x+1 of field "codes"
 (Note: y is another local variable.)

The construction of the repeat statement in line 2 basically says "begin a repeat loop that puts the value 1 into a local variable i; keep repeating the loop, adding 1 to i each time; stop repeating when you reach the value of x."

The local variable y contains the code stored the current line of field "codes" being examined (i.e. when i is 1, line 1 is being examined; when i is 9, line 9 is being examined). The computer carefully compares each varCode to y and determines if it is a match. If it is, then the repeat loop is aborted with the "exit repeat" command. But, if the computer goes through each line of field "codes" and does not find a match, it concludes it is a unique code and it is added to the list.

Those are the basics of how options #1 and #2 work. The difference is that option #1 does this comparison "on the fly," whereas option #2 grinds out the entire list ahead of time. In a sense, the LiveCode example I created does both. Again, I added lots of little extras to make the program more functional and understandable. (A reminder that the default set of characters I use in the program consists of 26 lower case letters, 26 upper case letters, and the 10 digits -- a total of 62 characters in all).

Try it out and see what you get.

A final note...

I thought the program was working fine, but then I did a test of all 62 characters with a slot number of 1. In other words, I was asking the computer to find just the 62 characters. Every time I ran the program, it quickly found 36 characters, but then just seemed to spin its wheels. It kept chugging along, trying to find a unique code, but would never go beyond 36. I began to worry that my logic had some fatal flaw in it that was only now revealing itself with this simple test. However, when I used only the numbers 0 to 9, there was no problem. Likewise, there was no problem when I went back and redid my initial tests using just the letters a, b, and c. And then it hit me: 26 letters in the alphabet plus the 10 digits also equals 36. This indicated to me that the computer was not distinguishing between upper and lower case letters. A quick search in the LiveCode dictionary on the keyword "case" revealed the property "caseSensitive," which "specifies whether comparisons treat uppercase and lowercase letters as different." It also stated that "by default, the caseSensitive property is set to false."

So, I added the line "set the caseSensitive to true" as the very first command in the procedure and all seems to work just fine.

No comments:

Post a Comment