Tuesday, November 1, 2011

GSoC: Final Report

Introduction

My project, officially named just "Porting to Python 3", was actually divided into two separate (albeit mutually complementary) parts. My first goal was to set up a testing framework to ensure continuous testing of SymPy across different versions of Python; SymPy used to have a server that ran buildbot, but it stopped working some time ago. The framework was to provide a solid base for my main project, making the code Python 3 compatible. As this was to be done with a single code-base, I estimated (correctly) that the porting could lead to subtle incompatibilities between various Python versions. Time permitting, I also intended to leverage this framework and my knowledge to get SymPy compatible with PyPy, too. The original application can be seen at the SymPy Wiki, here. More details about my progress can also be found in my blog.

Testing framework

As mentioned above, SymPy used to use buildbot, so this was my first choice. I also read about Tox, a tool written for the sole purpose of testing Python programs under various conditions (different interpreters, presence or lack of certain dependencies), which also provides good integration with Jenkins, a well-known server for continuous integration. My initial thoughts on this are in one of my first blog posts, where I had decided to use Tox and then later try to integrate it with Jenkins, to form a fully functioning CI server. While Tox was immediately useful (here's a post on setting up and using it), the integration with Jenkins proved to be more arduous than my initial tests showed. In retrospect, perhaps I should have given buildbot a more thorough look later, rather than eliminating it so early. Furthermore, while I saw Tox as a great tool, the uptake among other developers has been less than stellar (other than Aaron, I'm not aware of anyone using it regularly).

Fortunately, in parallel with my setting up Tox/Jenkins, work was progressing on sympy-bot. The main need for continuous integration came from a desire to review all pull requests and test them for errors - while bigger companies and projects might need real CI, all of SymPy's code gets in through the GitHub pull request system, so theoretically it should be enough to just thoroughly test every pull request; sympy-bot was developed with this purpose in mind. Although designed to be run manually, it has the basic functionality which I couldn't manage to replicate in Jenkins: run the test suite and post the results back. Work on it has also quickened somewhat in the last couple of months, and I now consider further development of sympy-bot a better idea than working more with Jenkins.

Python 3 porting

Even with the relative failure of setting up a robust testing framework, my main project was also progressing. Due to the nature of the issue, progress was sporadic rather than steady. This was particularly apparent at the start - I was simply stumped by some of the errors I was getting and couldn't get around them; once I made a key breakthrough, I was quickly able to get SymPy importable under Python 3, though this only happened by week five. The rest of my summer was spent hunting down the remaining errors, which was interesting at first but got very tiresome by the end. In fact, at the end Mateusz had to step in and fix the remaining few failures as I simply couldn't bring myself to look at them yet again. Thanks Mateusz! [Mateusz also did a lot of work on improving PyPy support, something for which I simply didn't find the time, so double thanks to Mateusz!]

One issue that arose early during the porting process was the (un)bundling of libraries with SymPy. SymPy bundled Pyglet and mpmath. Bundling the first was probably a bad idea from the start, and it was finally removed by Stefan Krastanov sometime early in the summer to unanimous approval. Unbundling mpmath was a more contentious issue; it sparked a very lively discussion on the issue tracker. I won't repeat it here, but in the end it was decided not to unbundle it. This meant that I had to write a custom tool to handle calling 2to3: we needed to avoid calling it on the mpmath/ directories, because mpmath is already py3k compatible (and running 2to3 on such code produces bad code).

It was ultimately decided that this tool would live in bin/use2to3 and work by creating a Python 3-compatible version of the source code in a py3k-sympy/ subdirectory (originally sympy-py3k/ but that interfered too much with tab-completion!), from which SymPy could then be run normally under Python 3. While I initially had misgivings about the script, I now think it's quite powerful. It's not an ideal solution, but it does work and was the last missing link in seamless Python 3 support (e.g. it also corrects shebangs and fixes some whitespace issues caused by 2to3).
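To give a rough idea of the approach, here is a minimal sketch of the two steps described above: copying the tree while skipping mpmath/ when picking files for conversion, and correcting shebangs. This is not the actual bin/use2to3 script; the names SKIP_DIRS, copy_tree_and_list_targets and fix_shebang are made up for illustration, and the 2to3 call itself is left out.

```python
import os
import re
import shutil

# Assumption for this sketch: any directory named "mpmath" is copied as-is
# and never handed to 2to3, since mpmath is already py3k compatible.
SKIP_DIRS = {"mpmath"}


def copy_tree_and_list_targets(src, dst):
    """Copy src to dst (dst must not exist yet) and return the .py files
    under dst that would still need to be passed to 2to3."""
    shutil.copytree(src, dst)
    targets = []
    for root, dirs, files in os.walk(dst):
        # Pruning dirs in place stops os.walk from descending into mpmath/.
        dirs[:] = sorted(d for d in dirs if d not in SKIP_DIRS)
        for name in sorted(files):
            if name.endswith(".py"):
                targets.append(os.path.join(root, name))
    return targets


def fix_shebang(text):
    """Point a leading '#!...python' line at python3.

    Shebangs that already carry a version (python3, python2.7) and files
    without a shebang are returned unchanged.
    """
    return re.sub(r"\A(#!.*\bpython)(?![\d.])", r"\g<1>3", text, count=1)
```

The returned list is what a real tool would hand to 2to3, while everything under mpmath/ is still present in the copy but untouched. Pruning the directory list inside os.walk keeps the exclusion in one place instead of filtering paths afterwards.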

Conclusion

Officially, my project was a success, but I really couldn't have done it without the help of other developers working on SymPy, in particular Ronan, Aaron and Mateusz. Beyond the GSoC period, I've got every intention to continue working with SymPy, as I think I've already shown with the few pull requests I've submitted since; I have also decided to take a more active role in helping with the Google Code-In project (assuming SymPy is accepted). As for my project, I intend to focus more on the infrastructure needed to support SymPy, rather than the math issues. Still, as my knowledge of math and SymPy internals increases, I'm sure I'll find other places to contribute as well.

To future GSoC students, I suggest maintaining good communication links and trying to be involved with the project as much as possible. Good communication with the core developers and general awareness of the current state of SymPy helped me a lot. While this was arguably more important for my project than others, at least Sean Vig has also expressed regret at not being more involved. The second most important bit of advice is to try to split your work into multiple pull requests and get them merged as fast as possible. SymPy has a very rapid pace of development, and as such it is always better to integrate sooner rather than later. This ties in to making good, atomic commits, but means more than that: your work should be clearly separated into small, logical chunks (<= 20 commits is my suggestion). A lot of the work done this summer has yet to be integrated, or ran into considerable trouble before finally getting in (e.g. the physics.mechanics module). Finally, try to budget a lot of extra time in your project application - most of us are not experienced developers and cannot correctly estimate the amount of work needed for something. Plus, when some additional problems arise (and they will), it's always better to have time set aside to deal with them.