How to specify a MapR distro when launching Elastic MapReduce clusters with the Ruby CLI

Amazon’s Elastic MapReduce Ruby client allows you to specify which of the supported Hadoop distributions to use, currently either Amazon’s Apache 1.0.3-based distribution or MapR’s M3 and M5 editions.

I found the CLI’s option in Amazon’s documentation:

To launch an Amazon EMR job flow with MapR using the CLI

Set the --with-supported-products parameter to either mapr-m3 or mapr-m5 to run your job flow on the corresponding version of the MapR Hadoop distribution.

The following example launches a job flow running with the M3 Edition of MapR.

elastic-mapreduce --create --alive \
--instance-type m1.xlarge --num-instances 5 \
--with-supported-products mapr-m3

For additional information about launching job flows using the CLI, see the instructions for each job flow type in Create a Job Flow.


Use geom_rect() to add recession bars to your time series plots #rstats #ggplot

Zach Mayer’s work reproducing John Hussman’s Recession Warning Composite prompted me to dig this trick out of my (Evernote) notebook.

First, let’s grab some data to plot using the very handy getSymbols() function from Jeffrey Ryan’s quantmod package. We’ll load the U.S. unemployment rate (UNRATE) from the St. Louis Fed’s Federal Reserve Economic Data (src="FRED") and load the time series into a data.frame:

library(quantmod)
unrate = getSymbols('UNRATE', src='FRED', auto.assign=FALSE)
unrate.df = data.frame(date=time(unrate), coredata(unrate))

Now FRED provides a USREC time series which we could use to draw the recessions. It’s a bit awkward, though, as it contains a boolean to flag recession months since January 1921. All we really want are the start and end dates of each recession. Fortunately, the St. Louis Fed publishes just such a table on their web site. (See the answer to “What dates are used for the US recession bars in FRED graphs?”) Sometimes it’s still easier to cut-and-paste (and the static table covers another 64 years, go figure):

recessions.df = read.table(textConnection(
"Peak, Trough
1857-06-01, 1858-12-01
1860-10-01, 1861-06-01
1865-04-01, 1867-12-01
1869-06-01, 1870-12-01
1873-10-01, 1879-03-01
1882-03-01, 1885-05-01
1887-03-01, 1888-04-01
1890-07-01, 1891-05-01
1893-01-01, 1894-06-01
1895-12-01, 1897-06-01
1899-06-01, 1900-12-01
1902-09-01, 1904-08-01
1907-05-01, 1908-06-01
1910-01-01, 1912-01-01
1913-01-01, 1914-12-01
1918-08-01, 1919-03-01
1920-01-01, 1921-07-01
1923-05-01, 1924-07-01
1926-10-01, 1927-11-01
1929-08-01, 1933-03-01
1937-05-01, 1938-06-01
1945-02-01, 1945-10-01
1948-11-01, 1949-10-01
1953-07-01, 1954-05-01
1957-08-01, 1958-04-01
1960-04-01, 1961-02-01
1969-12-01, 1970-11-01
1973-11-01, 1975-03-01
1980-01-01, 1980-07-01
1981-07-01, 1982-11-01
1990-07-01, 1991-03-01
2001-03-01, 2001-11-01
2007-12-01, 2009-06-01"), sep=',',
colClasses=c('Date', 'Date'), header=TRUE)
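
If you would rather derive the dates from a monthly 0/1 indicator than paste a table, run-length encoding is one way to do it. The sketch below uses a made-up twelve-month vector rather than the real USREC series, and it simply reports the first and last flagged month of each run; that is not guaranteed to match the peak/trough convention used in the table above.

```r
# Toy monthly indicator: 1 = recession month, 0 = expansion (hypothetical values)
dates <- seq(as.Date("2000-01-01"), by = "month", length.out = 12)
flag  <- c(0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0)

# rle() collapses the vector into runs of identical values
runs   <- rle(flag)
ends   <- cumsum(runs$lengths)        # index of the last month in each run
starts <- ends - runs$lengths + 1     # index of the first month in each run
is_rec <- runs$values == 1            # keep only the runs of 1s

spans <- data.frame(Start = dates[starts[is_rec]],
                    End   = dates[ends[is_rec]])
print(spans)
```

The resulting data.frame has one row per recession, which is the shape geom_rect() wants for xmin and xmax.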

Now the only “gotcha” is that our recession data start long before our unemployment data, so let’s trim it to match:

recessions.trim = subset(recessions.df, Peak >= min(unrate.df$date) )
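
One caveat with filtering on Peak: a recession that is already under way when the data series begins gets dropped entirely. A possible alternative, sketched here with two rows of the table and a made-up series start date, is to filter on Trough and then clamp Peak to the start of the data.

```r
rec <- data.frame(
  Peak   = as.Date(c("1945-02-01", "1948-11-01")),
  Trough = as.Date(c("1945-10-01", "1949-10-01"))
)
series_start <- as.Date("1945-06-01")  # hypothetical first date of the plotted series

# Keep any recession that ends on or after the first data point...
rec_trim <- subset(rec, Trough >= series_start)
# ...and pull an earlier Peak forward so the bar starts where the data start
rec_trim$Peak <- pmax(rec_trim$Peak, series_start)

print(rec_trim)
```

For the unemployment series used here the two approaches should give the same rows, since no recession in the table straddles the first UNRATE observation.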

Finally, we use ggplot2’s geom_line() layer to draw the unemployment data and transparent (alpha=0.2) pink rectangles to overlay the recessions:

library(ggplot2)
g = ggplot(unrate.df) + geom_line(aes(x=date, y=UNRATE)) + theme_bw()
g = g + geom_rect(data=recessions.trim, aes(xmin=Peak, xmax=Trough, ymin=-Inf, ymax=+Inf), fill='pink', alpha=0.2)
print(g)

Use Dropbox’s public folder for web publishing via Notepad (or emacs or…)

Remember The Good Old Days when all you needed to host a web site was a file system and Notepad (or emacs or TeachText)?

Well, I do, and I can’t say that I miss them… until last week when I tried to insert the JavaScript for some motion charts into a post. It’s impossible. Literally. Don’t waste your time. Seriously.

Self-hosted WordPress blogs can use some custom field hackery, but there’s no such option for us easy-way-out users.

Dropbox to the rescue

Just save your HTML page to your “Public” directory in Dropbox and it will get its own public URL, which you can find in Dropbox’s context menu.

It’s not the embedding I was hoping for (the hosted WordPress editor even strips out iframes), but it’s quick and easy and does the job.

googleVis-0.2.4 requires older version of RJSONIO (0.5-0) #rstats

[Update: the new release of googleVis accounts for changes in RJSONIO's handling of backslashes, so you probably won't need the older version.]

Something has apparently changed in the way RJSONIO’s toJSON() function works which is causing all sorts of extra escape characters (backslashes) to appear in the googleVis-generated JavaScript, at least when trying to set a visualization’s initial state. This bogus code causes the browser’s JavaScript engine to choke just before it can call chart.draw(), so you don’t see the Flash visualization at all–just a blank space with the pretty footer.
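
To see why one extra backslash per quote is fatal, here is a small sketch that fakes the two kinds of output with gsub() on a tiny, made-up state string. It does not call RJSONIO or googleVis, and the variable names are only for illustration.

```r
state <- '{"time":"2010"}'   # made-up, much shorter than a real state string

# Correct embedding in a JavaScript string literal: each quote becomes \"
good <- gsub('"', '\\"', state, fixed = TRUE)

# Doubling the backslashes afterwards turns each \" into \\"
bad <- gsub("\\", "\\\\", good, fixed = TRUE)

cat('options["state"] = "', good, '";\n', sep = "")
cat('options["state"] = "', bad,  '";\n', sep = "")
```

In the second line a JavaScript parser reads \\ as one literal backslash, so the quote that follows closes the string early and the rest of the line is a syntax error.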

This is at least the case on Mac OS 10.6.7 and Markus Gesmann gets all the credit for tracking it down.

Here’s an example state string which selects a couple of bubbles to be labeled (“Oranges” and “Apples”) and sets the time to start about half-way through:


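library(googleVis)

# create the motion chart; state.json holds the chart state as a JSON string
# (the working output further down shows its contents)
M=gvisMotionChart(Fruits, "Fruit", "Year", options=list(state=state.json))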

Here’s the output in question using the current RJSONIO 0.7:

> cat(M$html$chart['jsDrawChart'])

// jsDrawChart
function drawChartMotionChartID6db280db() {
  var data = gvisDataMotionChartID6db280db()
  var chart = new google.visualization.MotionChart(
  var options ={};
options["width"] = [    600 ];
options["height"] = [    500 ];
options["state"] = [ "{\\"xAxisOption\\":\\"3\\",\\"xZoomedDataMin\\":81,\\"playDuration\\":15000,\\"sizeOption\\":\\"_UNISIZE\\",\\"xZoomedDataMax\\":111,\\"xLambda\\":1,\\"dimensions\\":{\\"iconDimensions\\":[\\"dim0\\"]},\\"yZoomedDataMax\\":91,\\"duration\\":{\\"multiplier\\":1,\\"timeUnit\\":\\"Y\\"},\\"orderedByX\\":false,\\"xZoomedIn\\":false,\\"yZoomedDataMin\\":71,\\"showTrails\\":false,\\"orderedByY\\":false,\\"iconType\\":\\"BUBBLE\\",\\"uniColorForNonSelected\\":false,\\"yZoomedIn\\":false,\\"nonSelectedAlpha\\":0.4,\\"yLambda\\":1,\\"time\\":\\"2010\\",\\"yAxisOption\\":\\"4\\",\\"iconKeySettings\\":[{\\"LabelY\\":27,\\"key\\":{\\"dim0\\":\\"Apples\\"},\\"LabelX\\":42}],\\"colorOption\\":\\"6\\"}" ];

And here’s working code from RJSONIO 0.5:

> cat(M$html$chart['jsDrawChart'])

// jsDrawChart
function drawChartMotionChartID47a55df7() {
  var data = gvisDataMotionChartID47a55df7()
  var chart = new google.visualization.MotionChart(
  var options ={};
options["width"] =    600;
options["height"] =    500;
options["state"] = "{\"sizeOption\":\"5\",\"nonSelectedAlpha\":0.4,\"xLambda\":1,\"iconType\":\"BUBBLE\",\"yZoomedDataMax\":91,\"iconKeySettings\":[{\"LabelY\":-124,\"LabelX\":-160,\"key\":{\"dim0\":\"Oranges\"}},{\"LabelY\":53,\"LabelX\":37,\"key\":{\"dim0\":\"Apples\"}}],\"xZoomedIn\":false,\"orderedByX\":false,\"showTrails\":false,\"yZoomedIn\":false,\"yZoomedDataMin\":71,\"xZoomedDataMin\":81,\"orderedByY\":false,\"xAxisOption\":\"3\",\"yAxisOption\":\"4\",\"uniColorForNonSelected\":false,\"duration\":{\"timeUnit\":\"Y\",\"multiplier\":1},\"time\":\"2009\",\"yLambda\":1,\"xZoomedDataMax\":111,\"dimensions\":{\"iconDimensions\":[\"dim0\"]},\"colorOption\":\"2\",\"playDuration\":15000}";

Maybe this post can help others avoid the blank look I had on my face as I kept staring at a blank page in my browser.

quantmod makes it easy to watch silver prices crash in R #rstats

As if there hasn’t been enough going on this week, silver prices have fallen nearly $10 per ounce. That’s a reduction of over 20%. Jeffrey Ryan’s quantmod package makes it easy to download the latest prices from OANDA’s web site and plot the excitement.

The getSymbols() function is at the heart of quantmod’s data retrieval prowess, currently handling Yahoo! Finance, Google Finance, the St. Louis Fed’s FRED, and OANDA sites, in addition to MySQL databases and RData and CSV files.

First a word of warning: if you have a computer science background, you may cringe at the way getSymbols() returns data. Rather than returning the fetched data as the result of a function call, it populates your R session’s .GlobalEnv environment (or another one of your choosing via the env parameter) with xts and zoo objects containing your data. For example, if you ask for IBM’s stock prices via getSymbols("IBM"), you will find the data in a new “IBM” object in your .GlobalEnv. This behavior can be changed by setting auto.assign=F, but then you can only request one symbol at a time. But this is a minor nit about an incredibly useful package.
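
The two calling styles are easy to mimic in base R. The following is only a toy: fetch_toy() is a hypothetical function that invents its data, and it is not how quantmod is implemented. It does show what assigning into an environment looks like from the caller’s side, and why the return-a-value style is limited to one symbol.

```r
# fetch_toy() is a hypothetical stand-in for a data loader; it is not part of quantmod
fetch_toy <- function(symbols, env = globalenv(), auto.assign = TRUE) {
  make_data <- function(sym) data.frame(symbol = sym, price = c(1, 2, 3))
  if (!auto.assign) {
    # A single return value can only carry one symbol's data
    stopifnot(length(symbols) == 1)
    return(make_data(symbols))
  }
  # Side-effect style: create one object per symbol in the target environment
  for (sym in symbols) assign(sym, make_data(sym), envir = env)
  invisible(symbols)
}

sandbox <- new.env()                        # keeps the global workspace clean
fetch_toy(c("AAA", "BBB"), env = sandbox)
print(ls(sandbox))

one <- fetch_toy("CCC", auto.assign = FALSE)  # ordinary function-call style
print(nrow(one))
```

Passing your own environment gets you most of the way to tidy code while still allowing several symbols per call.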

There’s even a wrapper function to help retrieve precious metal prices, and we will use this getMetals() function to retrieve the last year’s worth of prices for gold (XAU) and silver (XAG):

getMetals(c('XAU', 'XAG'), from=Sys.Date()-365)

Yup — that’s it. getMetals() lets us know it has created two new objects:


There were also a few warning messages complaining about the last line in the downloaded file. I haven’t bothered to dig into it as the data seem fine, including today’s price:

> ls()

> head(XAGUSD)
2010-05-07 17.6600
2010-05-08 18.4600
2010-05-09 18.4320
2010-05-10 18.4336
2010-05-11 18.5400
2010-05-12 19.3300

> tail(XAGUSD)
2011-05-02 47.9850
2011-05-03 45.2373
2011-05-04 44.0238
2011-05-05 40.9171
2011-05-06 37.9939
2011-05-07 35.0598

And here’s how easy it is to use the package’s built-in graphing facilities:

chartSeries(XAUUSD, theme="white")

chartSeries(XAGUSD, theme="white")

Yup — that’s quite a shellacking for silver.

Now I tend to be a ggplot2 guy myself, and I have never actually worked with xts or zoo objects before, but it’s pretty easy to get them into a suitable data.frame:

silver = data.frame(XAGUSD)
silver$date = as.Date(rownames(silver))
colnames(silver)[1] = 'price'

ggplot(data=silver, aes(x=date, y=price)) + geom_line() + theme_bw()

Slides: “Accessing Databases from R” #rstats

For the past few meetings of the Greater Boston useR Group, we have opened with an introductory “useR Vignette” talk on a topic which may be helpful for new R users. This week, I presented an overview of accessing databases from R. Several people have tweeted and blogged nice things about my talk and have asked for the slides, so here they are, via Slideshare:

The final slide includes the code which I used to create and populate the ‘testdb’ database I used for my examples. I have duplicated it here as it’s a nice, quick example of using DBI to store an R data.frame in a database:

First, create new database & user in MySQL:

mysql> create database testdb;
mysql> grant all privileges on testdb.* to 'testuser'@'localhost' identified by 'testpass';
mysql> flush privileges;

In R, load the “mtcars” data.frame, clean it up, and write it to a new “motortrend” table:



library(stringr)  # str_split_fixed()
library(RMySQL)   # MySQL driver; also loads DBI

# car name is data.frame's rownames. Let's split into manufacturer and model columns:
mtcars$mfg = str_split_fixed(rownames(mtcars), ' ', 2)[,1]
mtcars$mfg[mtcars$mfg=='Merc'] = 'Mercedes'
mtcars$model = str_split_fixed(rownames(mtcars), ' ', 2)[,2]

# connect to local MySQL database (host='localhost' by default)
con = dbConnect(MySQL(), dbname="testdb", username="testuser", password="testpass")

dbWriteTable(con, 'motortrend', mtcars)
dbDisconnect(con)
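
If you want to try the same write-then-query round trip without a MySQL server, an in-memory SQLite database is a convenient substitute. This sketch assumes the RSQLite package is installed, which the talk’s examples did not use, and it does the manufacturer/model split in base R so stringr is not needed.

```r
library(DBI)

# Work on a copy so the built-in mtcars is left alone
cars <- mtcars
cars$mfg   <- sub(" .*$", "", rownames(cars))       # text before the first space
cars$mfg[cars$mfg == "Merc"] <- "Mercedes"
cars$model <- sub("^[^ ]+ ?", "", rownames(cars))   # text after the first space

# Assumption: RSQLite is available; ":memory:" gives a throwaway database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "motortrend", cars, row.names = FALSE)

# Read the data back with plain SQL
by_mfg <- dbGetQuery(con,
  "SELECT mfg, COUNT(*) AS n, AVG(mpg) AS avg_mpg
     FROM motortrend
    GROUP BY mfg
    ORDER BY n DESC, mfg")
print(head(by_mfg, 3))

dbDisconnect(con)
```

Because both drivers sit behind DBI, only the dbConnect() line should need to change to point the same code at the testdb database created above.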


4 lines of R to get you started using the Rook web server interface

Now that Jeffrey Horner has settled on a name for his new package… the Rook web server interface is now available on CRAN.

Rook provides an interface for R programmers to build web applications which can run in R 2.13’s built-in web server or (soon) rApache.

Jeffrey’s provided some great documentation and sample code on his blog, in the README file, and in the package documentation itself, but somehow I completely missed the importance of the Rhttpd class and couldn’t figure out how to load or launch any of the examples.

Hopefully I can save someone some similar head-scratching. The key is the Rhttpd class, which controls the web server and manages applications. By default it will install the “RookTest” example, so here are 4 lines you need to see it work:

> library(Rook)
> s <- Rhttpd$new()
> s$start(quiet=TRUE)
> s$print()

Server started on
[1] RookTest

Call browse() with an index number or name to run an application.

[EDIT: Thanks to Jim Porzak for pointing out that browse() is a method on the Rhttpd object rather than an old school package-scoped function. Times are a-changin'... for the better!]

A bare browse() call didn’t work for me, but s$browse(1) will load the URL into your browser, or you can just copy and paste it to access the running application.
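
Once RookTest is running, the next step is usually your own application. As I understand Rook’s documentation, an app can be a plain function that takes the request environment and returns a list with status, headers, and body, and it is registered with the server’s add() method. The sketch below follows that reading; hello_app and the name "hello" are made up, and the server part only runs in an interactive session with Rook installed.

```r
# A Rook-style application: one argument (the request environment),
# returning a list with status, headers, and body
hello_app <- function(env) {
  list(
    status  = 200L,
    headers = list("Content-Type" = "text/html"),
    body    = "<h1>Hello from Rook</h1>"
  )
}

# The app is an ordinary function, so it can be called directly as a quick check
res <- hello_app(new.env())
print(res$status)

# Only touch the web server in an interactive session with Rook installed
if (interactive() && requireNamespace("Rook", quietly = TRUE)) {
  library(Rook)
  s <- Rhttpd$new()
  s$add(app = hello_app, name = "hello")  # "hello" is an arbitrary name
  s$start(quiet = TRUE)
  s$browse("hello")
}
```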

