February 5, 2011

Using R and Other Languages (like Fortran)

R has taken the world by storm. Not too long ago the world of statistical analysis was a calm, serene place where expensive commercial statistical packages kept the peace. SAS, SPSS, and Stata were, between them, the beginning and end of statistical analysis software in businesses and academic settings everywhere. And they charged extortionate prices for their capabilities.

R changed all that. Well, truthfully, as far as I know SAS, SPSS, and Stata are still out there charging an arm and a leg to those who learned the “old ways” and don’t want to change, but R is the new standard. It’s not perfect, but it’s rapidly getting there. It can do everything those other platforms can do and more—and it can do it all as well or even better.

The really attractive thing about R is that it is not just a platform for statistical analysis; it is also a programming platform of sorts. It has an internal scripting language which can be utilized to create a wide variety of new functions and packages of functions. These packages can then be easily shared with and utilized by the entire community. And when you are creating functions and packages, you have—at your disposal—every other function and package created by the rest of the community. These aspects have helped make R immensely attractive to researchers, and it has developed what can only be called a ginormous community.

R has some smaller, less obtrusive “cons,” but only one major con: it is slow for computationally challenging tasks.

This becomes an issue for our purposes in agent-based modeling and simulation.

We need to create virtual environments of 500×500 cells (at a minimum) and let a large number of agents wander over them according to their internal rule sets. This is no small task once we consider what we’re really saying. A grid measuring 500×500 has a total of 250,000 cells in it. If we have, say, 500 agents running around the virtual landscape, then that’s 500 agents’ worth of rules and individual decisions, learning, living, eating, harvesting, fighting, and dying.
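Just to make that concrete, here’s a minimal sketch in R of the raw data structures involved (the field names and parameter values are my own placeholders, not anything standard):

    # a 500x500 sugar landscape: one capacity value per cell
    width  <- 500
    height <- 500
    sugar  <- matrix(runif(width * height, 0, 4), nrow = height, ncol = width)

    # 500 agents, each with a position and some internal state
    n.agents <- 500
    agents <- data.frame(
      x          = sample(width,  n.agents, replace = TRUE),
      y          = sample(height, n.agents, replace = TRUE),
      wealth     = runif(n.agents, 5, 25),
      metabolism = sample(1:4, n.agents, replace = TRUE)
    )

Every tick of the simulation has to visit each of those 500 agents, and potentially large swaths of those 250,000 cells.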

The kicker is that the simulation will only run as fast as the code allows it to. It doesn’t necessarily matter how fast your computer is if the language you’re using is simply slow in execution. Another source of slow execution is poor program design and coding choices. Nothing eats up system resources and/or slows down programs like bad programming.
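Here’s a quick (and admittedly unscientific) way to see the kind of slowness I mean: summing one value per cell of our 500×500 grid with an explicit R loop versus R’s built-in vectorized sum, timed with system.time():

    n <- 500 * 500                 # one value per cell of the grid
    x <- runif(n)

    # explicit interpreted loop
    system.time({
      total <- 0
      for (i in 1:n) total <- total + x[i]
    })

    # the same reduction, done in compiled code under the hood
    system.time(sum(x))

The vectorized version should be dramatically faster. The trouble for us is that agent rule sets are exactly the kind of irregular, per-agent logic that resists vectorization, so an agent-based simulation in pure R ends up being loops all the way down.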

And here’s where Fortran comes in.

Fortran is not good at a lot of the things that R is good at. Producing graphics of any sort with it, for example, is an absolute beast of a task. However, the Fortran language is very concise and the compiled code is extremely FAST.

Naturally, we’d like to know if we can use R and Fortran together. There are two ways this might work: either R calls Fortran and then displays the results it gets back, or Fortran calls R for all the graphics stuff that it’s bad at.

ASIDE: There are some who argue that C or C++ are so advanced and capable these days that Fortran is no longer necessary. Most of the arguments I’ve read against Fortran essentially claim that the performance differences are negligible. This might be true. On the other hand, I’ve read other opinions which say that Fortran is much more readable than either C or C++ and much easier to use for mathematically intensive programs. This is certainly true—at least so far in my experience. That said, the primary argument for adopting C/C++, in my mind, is that their adoption base is far broader than Fortran’s (in every arena other than High Performance Computing—supercomputers and the like). There are more libraries which support programming in C/C++ and bigger user communities to provide help. Fortran is actually a remarkable language in many ways, but getting it to interface with other platforms (like R) is a pain. It all just feels like a hack job. More on that later. The point is that I’m inclined to agree that the performance differences between C/C++ and Fortran are minimal, but Fortran is the more readable and concise of the two.

The good news is that both approaches are apparently possible. The bad news is that neither of them is ideal. Both involve working around the reality that getting R and Fortran to work together is a square-peg, round-hole kind of situation.

R CALLING FORTRAN

The approach I’m beginning with is the R CALLS FORTRAN solution. Running some code my mentor provided me with demonstrates that this approach works. R sets up the virtual environment, initializes the agents, and passes everything to Fortran to run with. Then, as the Fortran code runs, it returns updates to R, which R plots. Finally, when the Fortran is done, R runs some statistical analyses of the data the Fortran program created.

This works, which is good. It even works relatively well. The only problem is that only Fortran subroutines can be called—not functions. This may limit a programmer’s ability to take full advantage of all that Fortran has to offer in terms of speed and power. Additionally, the package which provides this feature doesn’t support all the “newer” Fortran features such as those found in Fortran 2003 and 2008. It doesn’t even officially support the Fortran 95 standard. Thankfully, word has it that Fortran 90/95 features should work fine.

[note to self: add more here. specifically, how to get it to work with an illuminating example]
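In the meantime, here’s a minimal sketch of the basic pattern as I currently understand it. The subroutine and file names are my own inventions, and the subroutine does nothing but add 1.0 to every element of a vector:

    # The Fortran side lives in a file called, say, addone.f90:
    #
    #   subroutine addone(x, n)
    #     integer, intent(in) :: n
    #     double precision, intent(inout) :: x(n)
    #     x = x + 1.0d0
    #   end subroutine addone
    #
    # Compile it into a shared library from the command line:
    #   R CMD SHLIB addone.f90
    #
    # Then, from R:
    dyn.load("addone.so")   # "addone.dll" on Windows
    out <- .Fortran("addone",
                    x = as.double(1:5),
                    n = as.integer(5))
    out$x                   # should print 2 3 4 5 6

Note that everything is passed by reference and must be coerced to exactly the types the subroutine expects (as.double, as.integer, and so on). This is precisely the kind of nuance that trips up beginners.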

FORTRAN CALLING R

Apparently this functionality is possible by way of a package called RFortran. Here the problem is that RFortran requires the use of the Intel Fortran compiler (which I plan to purchase, but which might not be an option for others).

[add more here]

BOTTOM LINE: FORTRAN and R

After all my research, I’m a little unnerved by the poor state of support for interoperability between R and Fortran. It works, but it certainly isn’t for the faint of heart. It’s a pain just to figure out what you need to install, and then how to configure and use it. There are innumerable nuances and gotchas that make using R and Fortran together miserable for beginners.

And remember, I’m a fan of both R and Fortran. Individually I think they’re great—but getting them to work together is a royal pain in the butt.

Alternatives

If Fortran and R interoperability is poor, what are our other options?

Remember, the point of my undergraduate research project is to determine what options exist for agent-based modelers who’ve outgrown Netlogo’s capabilities. We’re also trying to determine whether there are any options for more computationally intensive agent-based modeling other than Repast or MASON.

So we’re still in search of a language, platform, or combination of languages which offer a reasonable learning curve, lots of speed, and lots of flexibility.

Other common languages are Python, C, C++, D, Java, C#, and F#. Do any of these fit our needs?

We’ll have to explore our options further. One potential combination I’m looking into is R and C++ (via Rcpp). I hear that it’s relatively easy to use and offers a ton of flexibility plus all the power that C++ has to offer. My first attempt at making it work was not successful, but I’ll keep digging. One of the nice things about Rcpp (and the related packages inline and RInside) is that it is written by a pair of developers who keep it well maintained.
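For the record, here’s the sort of thing I’m trying to get working: a minimal sketch (the function and variable names are mine) that uses the inline package to compile a tiny C++ function from inside an R session. I can’t yet vouch that this is the canonical way to do it:

    library(inline)

    # C++ body: sum a numeric vector the long way
    src <- '
      Rcpp::NumericVector xx(x);
      double total = 0;
      for (int i = 0; i < xx.size(); i++) total += xx[i];
      return Rcpp::wrap(total);
    '

    sumC <- cxxfunction(signature(x = "numeric"), body = src, plugin = "Rcpp")
    sumC(as.numeric(1:10))   # should print 55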

Other languages or combinations I’m going to explore at some point are Python + Fortran, Python + C++, Java, C#, and F#.

For Now…

I am somewhat bound to utilize R and Fortran to their fullest because that is what my undergraduate research project proposal states—and what my undergraduate research grant was awarded for. I’m looking forward to learning Fortran, although I’m a little sad that it’s so poorly supported with graphics libraries and interoperability with R. C++, being far more popular, is also far better supported by a broad community of people. There are many quality open-source libraries and free IDEs and debuggers.

What I’m discovering is that open source projects often suffer from a dearth of reliable, quality documentation. The R project is a good example of this. Some packages and features are well documented while others offer little to no documentation. Running help(some.function) sometimes reveals a wealth of information about the function’s parameters, how it works, and examples (from beginner to expert level). But all too often the documentation provided with a package is totally inadequate. In sheer desperation, I’ll find myself Googling like crazy, hoping that someone has generously put up a tutorial on how to use some awesome new package I’ve discovered. Too often I find neither a decent website from the package author nor a tutorial written by some helpful web citizen. It’s immensely frustrating, and I suspect it breeds mediocrity among users and the software itself.
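(For anyone even newer to R than I am, the documentation entry points I’m talking about look like this:)

    help(optim)      # the full manual page for a function, when one exists
    ?optim           # shorthand for the same thing
    example(optim)   # run the examples from the help page, when they exist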

It never fails to amaze me how some apparently incredible piece of programming can be so poorly documented that no one but an expert can guess how to use it. Too often the really good programmers see creating quality documentation for their work as beneath them, and it really pisses me off. The Rcpp package is a good example: it apparently ships with numerous examples, but offers no directions on how to locate or load them. Instead, it is assumed that someone looking into Rcpp already knows where R stores packages. And that is the crux of the matter: there is simply far too much assumption going on in the computer world. It seems that almost all software, packages, and platforms assume that their users are all at the same level of skill or knowledge. What an IMMENSELY DUMB assumption.
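For what it’s worth, R’s system.file() function will at least tell you where a package actually lives on disk; whether a given version of Rcpp ships an examples directory there is something you have to check for yourself:

    system.file(package = "Rcpp")                          # where R installed the package
    list.files(system.file(package = "Rcpp"))              # what the package ships with
    list.files(system.file("examples", package = "Rcpp"))  # the examples, if present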

So, open source = poorly documented (more often than not this is the defining feature of open source).

Also, I’m finding that trying to get two languages to play well together is probably a violation of our stated goal of ease of use. For most of those who find Netlogo an acceptable level of difficulty, the two-language situation is simply not tolerable. It might be if there were some exceptionally well-documented and well-maintained “glue” projects to link languages together, but there aren’t.

It is becoming more and more clear to me that single-language solutions are probably what we’re going to end up with. At this point it seems likely to be far more tolerable to work entirely in C++ than to deal with the nuances and irritations of R plus Fortran.

Still, we forge onward, a bit depressed to have our R + Fortran bubble burst so soon.

January 29, 2011

Now, where were we?

It’s been quite some time since I updated this blog.

It’s surprising the effect that incentives have on one’s behavior. See, in my case, my last post occurred while I was writing a proposal for a six-credit Undergraduate Research class which had not yet been approved. I’d cultivated a relationship with an outstanding professor whose interests aligned with mine, and together we’d come up with an idea of mutual interest. That idea, as you know, was to recreate the classic Sugarscape model and see if we could “fix” the out-of-equilibrium market behavior observed by Epstein and Axtell (the creators of the original Sugarscape model).

Our “fix” is to introduce to our agents the capacity to change their job title and become traders (or middlemen or entrepreneurs—all words for the same idea). We believe that the inclusion of middlemen who make their wealth by resolving price/quantity disequilibrium will result in equilibrium being achieved in our Sugarscape markets. If this turns out to be the case, the finding would be important because it would suggest that the reason we see equilibrium achieved in the real world might be the presence of middlemen (stores, distributors, etc.).


There are two components to my research project:

1) The actual model, which I call Traders in the Sugarscape.

2) Determining how useful R, Fortran, and F# are for agent-based modeling and simulation (ABMS).


The first component can be broken into roughly three tasks:

TASK ONE is to re-implement the classic Sugarscape model.

TASK TWO occurs after verifying that our re-implementation exhibits the same out-of-equilibrium behavior; it consists of modifying the model to allow agents to change their occupation to trader.

TASK THREE involves running our model many times, aggregating the results, creating plots/graphics, analyzing the results, and writing up our findings.


The second component consists of the nitty-gritty of obtaining, installing, and utilizing R, Fortran, and F#, and keeping track of how difficult the process is. This blog will assist with that last aspect of the project. But there’s an additional element which makes things more difficult for me, and potentially useful for the community: I’m a complete newbie at Fortran and F#. And my skills with R are relatively undeveloped as well. Most of my R experience amounts to using it for basic statistical analysis and visual data exploration and plotting.

This lack of relevant experience is intentional. It is my belief that there are many social scientists who are interested in adding the “third way” of doing science to their repertoire but don’t know where to start. It is also my belief that there are many social scientists who are using Repast, MASON, and Netlogo but are unhappy with their complexity, their steep learning curves, or their relative lack of power (Netlogo). And it’s certain that there are many individuals like me out “there” who want to get into agent-based modeling but are neither experienced programmers nor interested in any of the aforementioned ABM platforms.

My personal reason for looking beyond Netlogo/Repast/MASON is that Netlogo is too underpowered while Repast and MASON are too complex. Their learning curves are almost as steep as that of Java as a whole (which is pretty steep), and as for the efficiency gains they provide, neither platform has convinced me that it has all that much to offer. I don’t like Java (I’d rather they were written in almost any other language, actually), and I think treading the path I’m on is about as difficult as learning one of those frameworks.

In short, I just don’t believe that the future of agent-based modeling lies with any of the present platforms. Honestly, I think that eventually there will be an excellent, dominant platform, but as far as I know no such platform exists today.


So, now that my project is approved, I am proceeding to educate myself in the ways of Fortran (via the book Fortran 95/2003 for Scientists and Engineers), R (via the book Introduction to Scientific Programming and Simulation Using R), and F# (using the online resources provided by Microsoft).

In future posts I will detail my experience with obtaining, installing, and beginning to utilize all three languages. I’ll also elaborate on the complexities and nuances of each language, and the challenges—both expected and surprising—which I have encountered.

My next post will probably be on Fortran, C, and C++ and the general topic of compiler availability, cost, support, and scientific computing.

See you then!

p.s. My goal is to update this blog twice a week for the rest of the semester—but once a week is more likely.

November 8, 2010

Diving into the deep end of ABMS

I don’t recall when I first discovered the field of Computational Social Science, but I do recall being stunned by the fact that I hadn’t considered applying computer science, and programming, to the modeling of social phenomena. I was stunned because ever since I discovered Economics my freshman year of college, I’ve had two primary interests in life: Economics and Computer Science. Having discovered Computational Social Science, and the sub-field of Computational Economics, I realized that applying computer horsepower to economic problems made remarkable sense.

While my interest in the field was immediately piqued, I had no idea where to start. Well, that, and I still had to keep up with classes and work and balance both with being a good husband to my wife.

Numerous web searches yielded little direction: what to do, how to do it, where to learn it, methodology, best practices, how to ensure rigor, how to reduce bias, and so on. Even now, finding people who are doing agent-based modeling (ABM or ABMS), and tapping into what they know about what to do and what not to do, is remarkably difficult.

However, with each new day there are new blogs, new books, new tools. . . this field is exploding.

Can I share with you one of my first, and most persistent, observations? Very few people really know what they’re doing in ABM yet, by which I mean that new ground is being broken all over the place. Models are being crafted in Netlogo, Repast, MASON, and Swarm, and hacked together in R, C#, C++, C, Fortran, and virtually every other letter in the alphabet. At the same time, people are still trying to figure out *how* to do ABMS. What should the research/modeling process look like? How are results reported? Verified? How do you convey the assumptions made, and the reasoning behind them, in a way that doesn’t sound like you’re smoking something on the job?

So methodology is a major issue for this newborn field at this point.

From my perspective, the other major issue is that the vast majority of the tools available are poorly suited to doing ABMS well. Every existing ABMS platform or toolkit has serious shortcomings: C, Java, and C++ platforms present a learning curve *much* too steep for most social scientists to climb, while Netlogo, R, and Python—being much easier to learn and utilize—aren’t fast enough to carry out anything beyond the simplest of simulations.

What are we to do? The only answer I’ve got is, “I don’t actually know.”

See, I’m in the same boat as everyone else. I want to do ABMS, but I have zero interest in learning C++ or C or even Java (especially Java, <shudder>). . . though under duress I might do so. I am interested in learning R and Python, but I’m afraid that their ease-of-use comes at the cost of speed. And speed is incredibly important, almost more so than ease-of-implementation. Speed is what gives us, as social simulators, the ability to lay the groundwork of a virtual world which yields emergent phenomena (both expected and unexpected) as we watch on our LCD screens. Speed gives us more agents, with more rules, interacting with each other in a more elaborate space. In other words, speed gives us MORE.

But there’s a tradeoff. Speed typically comes at the expense of ease-of-implementation. If speed were the only goal, we’d write all of our simulation programs in assembly (which very few people enjoy, and no one has the time for). Ease-of-implementation is especially important given that many social simulators are not, by trade, computer scientists. We are not, as a group, well versed in algorithm design and “best practices” for software development.

It’s probably overkill to suggest that all of us need a bachelor’s or master’s in computer science to ensure we bring a degree of rigor and expertise to the writing of ABMs. But short of that, how do we ensure a degree of transparency sufficient to encourage rigorous, professional standards in creating ABMs, while being as efficient as possible at the same time?

The truth is that the average social scientist’s opportunity cost of time is simply too high to justify learning the more difficult languages like C++, C, and Java. This is the case now, and it will only become more obvious in the future. The reason Netlogo is so popular—as slow and inflexible as it is—is that it is the easiest of all possible ways to do ABMS. By that I mean that its programming language is extremely simple, and getting meaningful visualizations of results out of Netlogo is easy. In short, Netlogo presents the lowest-cost alternative for PhDs looking to get into modeling.

The problem is that Netlogo is a toy, pure and simple. It simply does *not* have the power to run a model of any real complexity or scale. I don’t say this to denigrate what Netlogo offers. It’s a great tool for getting one’s feet wet, but that’s about it.

And after one has gotten one’s feet wet with Netlogo, there aren’t a lot of appealing options for creating more complex, larger models without expending significantly more effort (primarily on the education side of things, although it is true that implementation in languages other than Netlogo is much more challenging).

Thus, the community has got to realize that Repast and other Java-based offerings are not a sufficient “next step” after Netlogo. Why not? Because Java (and C and C++, for that matter) is still too difficult to learn relative to its speed, scalability, and ease-of-implementation. Even when you know what you’re doing in Java, programming in it is a relatively tedious process (as is programming in many languages).

So we’re left with R, Python, Haskell, F#, and similar languages. What these languages share is that they’re “higher level” than Java, C, and C++. They’re not as fast, in general, as C and C++, but they’re *much* easier to learn and use.

It’s in these languages that I see the future of modeling. That, or someone will develop a really cool high-performance framework in C++ which will be drag-and-drop and incredibly easy to use.

On another note, I’m getting set up to do an undergraduate research project this spring utilizing R and F# to reconstruct the Sugarscape model. I hope to create two separate implementations of the model—one in R and one in F#—and compare the difficulty of implementation and the performance of the simulation between them. More on that later. . .
