Saturday, July 12, 2008

Using State Machines: Web Access From Adobe InDesign CS3 ExtendScript - click to listen to podcast

Sample files for this podcast can be downloaded from:

http://www.rorohiko.com/podcast/geturl.zip

This podcast will explain how you can query web services from within InDesign CS3 ExtendScript. No need for plug-ins or external libraries - just Adobe InDesign ExtendScript, pure and simple.


>>> Edit#3:
Adjusted the script to handle redirections by interpreting the 'HTTP/1.0 301 Moved Permanently' return status.
<<<


>>> Edit#2:
Adjusted the script to give much faster downloads in case the web server does not send a Content-Length header. Also changed the protocol to HTTP/1.0 instead of HTTP/1.1 to sidestep the issue of 'chunked' downloads - support for 'chunked' HTTP is left as an exercise.
<<<


>>> Edit#1:
With regards to InDesign CS5: the downloadable code won't work as-is with CS5 due to some oddity with the String.replace function.

To work around the issue, you need to change the ParseURL function to read:

...
function ParseURL(url)
{
url=url.replace(/([a-z]*):\/\/([-\._a-z0-9A-Z]*)(:[0-9]*)?\/?(.*)/,"$1/$2/$3/$4");
url=url.split("/");

// ADD THE LINE BELOW FOR INDESIGN CS5
if (url[2] == "undefined") url[2] = "80";

var parsedURL = 
...

That fixes it up so it works with CS5 too...

<<<

I'll also present some useful routines I wrote, called GetURL() and ParseURL(). GetURL() is a fairly large routine that demonstrates how you can use a programming pattern called a 'state machine' to process data on-the-fly as it is received from a network connection.

To demonstrate how to use the GetURL() function, I've also included a sample script. The script searches your active InDesign document for any page items that have a URL as their script label (entered via Window - Automation - Script Label). It then fetches the data 'behind' the URL and places that data into the page item.
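
In outline, the script does something along these lines (a simplified sketch with a made-up function name - the real GetURLs.jsx does more error checking and handles the actual placing of the downloaded data):

function PlaceURLs()
{
    var doc = app.activeDocument;
    for (var idx = 0; idx < doc.pageItems.length; idx++)
    {
        var pageItem = doc.pageItems.item(idx);
        // a page item 'qualifies' when its script label looks like a URL
        if (pageItem.label.indexOf("http://") == 0)
        {
            // GetURL() is defined in GetURLs.jsx - it fetches the data behind the URL
            var data = GetURL(pageItem.label);
            // ...save the data to a temporary file and place it into the page item...
        }
    }
}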

Install the script GetURLs.jsx in the InDesign scripts folder. The easiest way is to bring up the Scripts palette from InDesign (select Window - Automation - Scripts), right-click the User folder, and select Reveal in Finder or Reveal in Explorer. Then copy the script into the Scripts Panel folder that should now appear.

Switch back to InDesign, and open up the sample document GetURLSample.indd. Run the script GetURLs.jsx by double-clicking it on the palette. The empty frames should fill up with images or text (at least, if you are connected to the Internet).

So, how does it all work?

At the heart of it all is the standard ExtendScript object called Socket. More info about the Socket can be found in the JavaScript Tools Guide for CS3:

http://www.adobe.com/devnet/bridge/pdfs/javascript_tools_guide_cs3.pdf

The socket object gives us the ability to perform low-level network communications - we can set up TCP/IP connections with other computers on the network.

The problem is that the Socket object has no higher-level functionality - it has no support for any protocols, like HTTP for example.

To fix that, you could try to use the protocol support that is available via Bridge with the HttpConnection object. You could also forcibly give InDesign access to the webaccesslib that is used by Bridge, through some 'fiddling around'.

However, personally I am not too keen on either of these approaches - they're either a bit too big or too brittle for my liking; I wanted to have an 'InDesign all by itself' solution.

The alternative approach I used was to provide HTTP support to InDesign in pure ExtendScript.

Now, before diving into this: be warned, this is NOT a fully fledged, fully compliant HTTP client. I've only implemented a subset of the protocol, just enough to let me do what I needed to do.

For example, the code only supports UTF-8 text encoding. If your target web server does not offer that you'll have to add some additional code to the scripts to cope with that. Also, I've only implemented HTTP 'GET' requests, not 'POST'. Adding that functionality would be fairly easy to do - it's left as an exercise.

So, the Socket object allows us to send out requests and receive replies via TCP/IP.

The HTTP protocol is quite extensive, but the basis of it is simple - it is mainly a plain text-based, ping-pong protocol. You send out a request, and you get a reply.

The (incoming) reply is composed of three parts: a start (or status) line, zero or more header lines, and then an optional body (which can be binary data or text).

The (outgoing) request is similarly composed of a request line, zero or more header lines, and an optional body (for POST requests, which I am not implementing here).

Header lines are separated from the body that follows by an empty line - so a request or reply always has the same 'rhythm' to it: start line, header lines, empty line, body.
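
For example, a minimal reply might look like this (the exact status line and headers vary from server to server):

HTTP/1.0 200 OK
Content-Type: text/html; charset=UTF-8
Content-Length: 26

<html>Hello, world!</html>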

The GetURL() code in GetURLs.jsx implements this in a rudimentary fashion: the request is a simple multi-line text string, which is fired off to the web server via a Socket object.
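
As a rough sketch, firing off such a request could look like this - the host is hypothetical, and GetURL() assembles its request and reads the reply more carefully than this naive one-shot read:

var host = "www.example.com"; // hypothetical host

// the request: request line, header lines, empty line - no body for a GET
var request = "GET /index.html HTTP/1.0\r\n" +
              "Host: " + host + "\r\n" +
              "\r\n";

var connection = new Socket();
// 'BINARY' avoids implicit text conversion - GetURL() decodes UTF-8 itself
if (connection.open(host + ":80", "BINARY"))
{
    connection.write(request);
    var rawReply = connection.read(9999999); // naive one-shot read, for illustration only
    connection.close();
}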

Then the bulk of the code is for interpreting the reply from the web server - there are three levels of decoding that need to happen.

First of all, we need to decode the reply itself - separate the start/status line from the headers and from the body.

The headers also contain an important piece of information: the length of the body that follows. We need to interpret that header line to know exactly how many bytes to read from the socket.
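
That check boils down to something like this sketch - headerLine is a stand-in name for one complete, already-decoded header line:

// e.g. headerLine is "Content-Length: 3495"
var match = headerLine.match(/^Content-Length:\s*([0-9]+)/i);
if (match != null)
{
    // the number of body bytes to read once the empty line has been seen
    var bodyLength = parseInt(match[1], 10);
}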

At a lower level, we need to 'chop' the reply up into individual lines until we reach the body - the status line and the header lines are each terminated by a CRLF character pair (CR = ASCII character 13, LF = ASCII character 10).

And finally, at the lowest level, while reading a text-based body, we need to interpret the UTF-8 encoding and convert the byte sequences into plain Unicode. This means reading through sequences that are 1, 2, 3 or 4 bytes long, each of which encodes a single Unicode character.
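
As an illustration of that lowest level, here is a compact sketch of such a decoder (the function name is made up, and error handling for invalid sequences is omitted). Note how the 'state' is simply the number of continuation bytes still expected:

// bytes is a string read from a 'BINARY' socket - each character holds one byte
function DecodeUTF8(bytes)
{
    var output = "";
    var pendingBytes = 0; // the 'state': continuation bytes still expected
    var codePoint = 0;
    for (var idx = 0; idx < bytes.length; idx++)
    {
        var b = bytes.charCodeAt(idx) & 0xFF;
        if (pendingBytes == 0)
        {
            if (b < 0x80)
            {
                output += String.fromCharCode(b);       // 1-byte sequence (ASCII)
            }
            else if ((b & 0xE0) == 0xC0)
            {
                codePoint = b & 0x1F; pendingBytes = 1; // start of 2-byte sequence
            }
            else if ((b & 0xF0) == 0xE0)
            {
                codePoint = b & 0x0F; pendingBytes = 2; // start of 3-byte sequence
            }
            else if ((b & 0xF8) == 0xF0)
            {
                codePoint = b & 0x07; pendingBytes = 3; // start of 4-byte sequence
            }
        }
        else
        {
            codePoint = (codePoint << 6) | (b & 0x3F);  // fold in 6 more bits
            pendingBytes--;
            if (pendingBytes == 0)
            {
                if (codePoint > 0xFFFF)
                {
                    // outside the BMP: emit a surrogate pair
                    var cp = codePoint - 0x10000;
                    output += String.fromCharCode(0xD800 + (cp >> 10),
                                                  0xDC00 + (cp & 0x3FF));
                }
                else
                {
                    output += String.fromCharCode(codePoint);
                }
            }
        }
    }
    return output;
}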

In the GetURL() routine, these three levels of encoding are decoded concurrently, through three nested 'state machines'. Using state machines makes the code fairly fast - much faster than repeatedly searching through and rewriting a growing string buffer.

I won't go into the intricate details of how GetURL() works - the script is fairly well documented and you should be able to figure out how it works by careful reading and stepping through it with the ExtendScript debugger.

Instead, I want to explain a little bit more about state machines - they are a very powerful technique for fast pattern matching and parsing, and once you 'get' them, they are easy to use. They are used in mechanisms like GREP, in compilers and interpreters, in all kinds of text parsers, and more.

All too often I see code that uses straight string functions to achieve some matching goals.

A simple example: you get some data thrown at you that has line endings that might be either CR (ASCII 13), LF (ASCII 10), or a CR followed by an LF (CRLF).

Many people will handle that by reading in all the data into a buffer, then do some pattern search-and-replace. For example,

first replace all CRLF with CR
then replace all LF with CR

After that, the line ending has become a CR throughout the text.
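
In ExtendScript, that buffered approach is a two-liner:

data = data.replace(/\r\n/g, "\r"); // first replace all CRLF with CR
data = data.replace(/\n/g, "\r");   // then replace all remaining LF with CR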

This approach is not necessarily the best. Especially if you receive the data character by character, you can do the clean-up on-the-fly, as you receive the data. There is no need to buffer the data and no global search-and-replace passes - this can greatly reduce the amount of memory you need, and run a lot faster to boot.

With a state machine, you would go about it as follows. First of all, you create a variable (say, myState), and you create some symbolic numerical constants (for example, kNormalState could be a symbolic name for 0, and kSeenCR could be a symbolic name for 1).

For more complex state machines there might be hundreds, even thousands of different states - but in this case, two states will do.

All we'll now do is play with a simple integer variable, and we'll keep track of where we're at by manipulating the state. The idea is that we don't assemble strings or 'memorize' any other input data - we encode the relevant info about 'what has been' into the state variable.

Data flows through our state machine - we read input data, and immediately get rid of the data - we write or store or process it - and we keep as little data as possible inside our state machine logic.

So, our little state machine is happily reading and writing character after character.

After each character we read and process we also check whether it was a CR (ASCII 13) or not, and we change our state to either kNormalState or kSeenCR.

Now, suppose we read a line feed character (ASCII 10). Before doing anything with a new character, the state machine will always check its current state first.

If the state is kSeenCR we know that this is a line feed after a preceding CR, so we simply don't write the LF out.

If we read an LF and the state is kNormalState, we know that this is a 'stand-alone' LF without a preceding CR - so we output a CR character to replace it.

The state machine is simple enough to express in words:

Initialize myState to kNormalState
read character (loop until end of file)

    if character is LF then
        if myState is NOT kSeenCR then
            output CR
        end if
    else
        output character
    end if

    if character is CR then
        myState becomes kSeenCR
    else
        myState becomes kNormalState
    end if

end loop
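
And here is the same machine as a minimal ExtendScript sketch (the function name is mine; for simplicity the input is a string rather than characters arriving from a socket):

var kNormalState = 0; // last character was not a CR
var kSeenCR = 1;      // last character was a CR

// e.g. NormalizeLineEndings("a\r\nb\nc") returns "a\rb\rc"
function NormalizeLineEndings(input)
{
    var output = "";
    var myState = kNormalState;
    for (var idx = 0; idx < input.length; idx++)
    {
        var character = input.charAt(idx);
        if (character == "\n")
        {
            // only emit a CR for a stand-alone LF; swallow the LF of a CRLF pair
            if (myState != kSeenCR)
            {
                output += "\r";
            }
        }
        else
        {
            output += character;
        }
        // remember whether the character we just processed was a CR
        myState = (character == "\r") ? kSeenCR : kNormalState;
    }
    return output;
}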

This might seem overkill, but the advantages of state machines become apparent when you try more complex things - for example, interpreting a quoted JavaScript string. Such a string might contain escape sequences (a backslash followed by a letter, or by 1-3 octal digits). Properly interpreting such an 'escaped' string is hard work without a state machine. With a state machine it's a breeze, with hardly any overhead.
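
To give a taste of it, here is a sketch of such an escape decoder (a made-up function that handles only a few common escapes - not a full JavaScript lexer):

var kNormal = 0;        // regular characters inside the string
var kSeenBackslash = 1; // just read a backslash
var kInOctal = 2;       // collecting 1-3 octal digits

// decode the escapes in the body of a quoted string (quotes already stripped)
function DecodeEscapes(s)
{
    var output = "";
    var state = kNormal;
    var octal = "";
    for (var idx = 0; idx < s.length; idx++)
    {
        var c = s.charAt(idx);
        switch (state)
        {
        case kNormal:
            if (c == "\\") state = kSeenBackslash;
            else output += c;
            break;
        case kSeenBackslash:
            if (c >= "0" && c <= "7")
            {
                octal = c; state = kInOctal;
            }
            else
            {
                if (c == "n") output += "\n";
                else if (c == "r") output += "\r";
                else if (c == "t") output += "\t";
                else output += c; // \\, \", \' and anything else: literal
                state = kNormal;
            }
            break;
        case kInOctal:
            if (c >= "0" && c <= "7" && octal.length < 3)
            {
                octal += c;
            }
            else
            {
                output += String.fromCharCode(parseInt(octal, 8));
                idx--; // the current character starts fresh: reprocess it
                state = kNormal;
            }
            break;
        }
    }
    // flush a pending octal escape at the end of the input
    if (state == kInOctal) output += String.fromCharCode(parseInt(octal, 8));
    return output;
}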

So, that was a quick introduction to state machines - I hope it was enough to pique your interest, and entice you to do a bit of research; once you've added them to your arsenal of techniques, you'll find that some difficult tasks have become a lot easier.

You can download the sample files from the following URL (which is mentioned in the podcast transcript):

http://www.rorohiko.com/podcast/geturl.zip

9 comments:

Anonymous said...

Hey, Kris,

Just wanted to thank you very kindly for the free plugin sharing, and especially for the one I have just found and put into use, Text Exporter for InDesign CS3.

I guess it's winter there - and kind of endless oceanic summer where I am for the time being, north of San Diego. Always wanted to go to NZ, and have a cousin who quite happily emigrated there, early 70s.

Take care, and thanks,
Clive

Rich said...

Wow, great stuff. Thanks for posting the source and all the comments.

I have a question that maybe you can answer - so far no one at Adobe has.

I've been working on building a mini-browser into After Effects (dedicated solely to our in-house database, no web surfing here), and it works, but it's REALLY slow getting pages. Use Firefox and the page refreshes immediately, even after clearing caches. Use Adobe's socket scripting and it chugs.

I'd hoped your state-machine would help, but it goes no faster than my brute force read(), search() and replace() method. Any other website? Nice and fast.

Do you have any clue how I can speed things up, what might be slowing it down, or where to find a good wall to bang my head against (this one is worn out)?

TIA,
-Rich Helvey

Kris Coppieters said...

Hi Rich,

My first step would be to try and diagnose things - use a tool like Wireshark to see what's really going down the wire. That might help zoom in on where the lag comes from (e.g. server-side vs. client-side). I'd also compare it with a 'normal' browsing session to see if there is any notable difference.

HTH!

Cheers,

Kris

Kris Coppieters said...

With regards to InDesign CS5: the downloadable code won't work as-is with CS5 due to some oddity with the String.replace function.

To work around the issue, you need to change the ParseURL function to read:

...
function ParseURL(url)
{
url=url.replace(/([a-z]*):\/\/([-\._a-z0-9A-Z]*)(:[0-9]*)?\/?(.*)/,"$1/$2/$3/$4");
url=url.split("/");

// ADD THE LINE BELOW FOR INDESIGN CS5
if (url[2] == "undefined") url[2] = "80";

var parsedURL =
...

That fixes it up so it works with CS5 too...

Cheers,

Kris

Samnang said...

Your GetURL function works properly until I request a big file (~12 MB); then it seems to run forever. I waited 1 or 2 hours, but it doesn't look like it will ever end. Is the problem that JavaScript in InDesign can't handle a large string or Socket?

Kris Coppieters said...

Not sure - it's no speed demon, but it should not take _that_ long. I suggest you work through a debug session with the ExtendScript Toolkit to try and find what the issue is. I once had issues that resulted from the web server issuing an incorrect 'Content-Length' header - maybe something similar is at play here.

Brandon Boswell said...

I recently started using this and really love it - I pull data from a database and save it as an XML file, but for some reason the last child element in every entry has an extra return. Is this from the parser?

Kris Coppieters said...

Not sure - before I could make any sensible comments, I'd need to poke the live code in a debugger. I suggest you do just that - i.e. study the code in a debugger and try to figure out where the extra return comes from. Sorry I cannot be more helpful!

Stephan said...

Hey Kris,

thank you so much for sharing your GetURL-function, which is really fantastic and very very helpful!

Thanks and cheers
Stephan