Packaging has always been a bit of a sore spot for Python modules. Maybe wheels are going in the right direction. Armin Ronacher has written a nice overview of how to put wheels into actual useful practice:
Wheels currently seem to have more traction than eggs. The development is more active, PyPI started to add support for them and because all the tools start to work for them it seems to be the better solution. Eggs currently only work if you use easy_install instead of pip which seems to be something very few people still do.
So there you have it. Python on wheels. It’s there, it kinda works, and it’s probably worth your time.
Brandon Rhodes penned a nice, light, practical introduction to Pandas while using “small” data:
I will admit it: I only thought to pull out Pandas when my Python script was nearly complete, because running print on a Pandas data frame would save me the trouble of formatting 12 rows of data by hand.
This post is a brief tour of the final script, written up as an IPython notebook and organized around five basic lessons that I learned about Pandas by applying it to this problem.
After some initial trepidation, I’m starting to enjoy working with Apache Avro. The schema language and options (avdl, avsc, avpr) are a bit obtuse, but the cross-language interop seems to work as advertised. Which is a good thing.
This looks like it will be bad timing for me, but as an AMPCamp 2013 and Spark Summit 2013 attendee, I can vouch for the event quality:
We are proud to announce that the 2014 Spark Summit will be held in San Francisco on June 30 – July 2 at the Westin St. Francis. Tickets are on sale now and can be purchased here.
For 2014, the Spark Summit has grown to a 3-day event. We’ll have two days of keynotes and presentations followed by one day of hands-on training. Attendees of the summit can choose between a 2-day conference-only pass or a 3-day conference and training pass.
If you can’t/didn’t get to Strata West 2014 this will be your next, best opportunity to get a deep dive into the Spark ecosystem.
I don’t know if it’s the best or the biggest, but DC has one damn well organized community of data enthusiasts:
Data Community DC (DC2) is an organization formed in mid-2012 to connect and promote the work of data professionals in the National Capital Region. We foster education, opportunity, and professional development through high-quality, community-driven events, content, resources, products and services. Our goal is to create a truly open and welcoming community of people who produce, consume, analyze, and work with data — data scientists, analysts, economists, programmers, researchers, and statisticians, regardless of industry, sector, or technology. As of January 2014, we are currently over 5,000 members strong from diverse industries and from a large variety of backgrounds.
But that’s what we do here in the DMV, build bureaucratic organizational structures. Ha, ha! Only serious.
Glad to see Trifacta ship their first product. I had a bit of an insider seat on the Lockheed Martin collaboration. They’ve iterated like crazy since I saw a very primitive version in June. Good luck to Dr. Hellerstein and the team, and of course Go Bears!
We are happy to announce the availability of Spark 0.9.0! Spark 0.9.0 is a major release and Spark’s largest release ever, with contributions from 83 developers. This release expands Spark’s standard libraries, introducing a new graph computation package (GraphX) and adding several new features to the machine learning and stream-processing packages. It also makes major improvements to the core engine, including external aggregations, a simplified H/A mode for long lived applications, and hardened YARN support.
Spark is an open source project on the move. Previously, in-memory distributed computation was the big selling point. Now it’s unification of disparate computational models cleanly embedded within the Hadoop ecosystem.
Running on top of Hadoop MapReduce and Apache Spark, the Apache Crunch™ library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into a relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users, there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for creating MapReduce pipelines.
AWS is one of the most popular cloud computing platforms. It provides everything from object storage (S3), elastically provisioned servers (EC2), databases as a service (RDS), payment processing (DevPay), virtualized networking (VPC and AWS Direct Connect), content delivery networks (CDN), monitoring (CloudWatch), queueing (SQS), and a whole lot more.
In this post I’ll be going over some tips, tricks, and general advice for getting started with Amazon Web Services (AWS). The majority of these are lessons we’ve learned in deploying and running our cloud SaaS product, JackDB, which runs entirely on AWS.
Greg Linden’s got a new batch of interesting links. That’s worth coming out of posting hibernation (n.b. not retirement).
- Transitioned my feedreading experience post-GReader
- Upgraded my WordPress installation and a couple of plug-ins
- Gone from Ubuntu 11.10 (pictured above, 591 days uptime wow!) to Ubuntu 12.10
Everything so far has been pretty painless, other than one lingering bug in feedbin that only seems to affect the feed for Tim Bray’s Ongoing. Unfortunately, this is one of my favorite feeds. Seems a bit suspect that the issue lingers as Bray and his feed have been around for like ever, and a good feed library should process his, of all people’s, correctly. But I’ll chalk it up to feedbin’s growing pains.
And ReadKit is passable, but I wouldn’t exactly call it … zippy … on Ye Olde MacBook.
Give or take a few due to potential timezone adjustments, in 6 hours Google Reader will go dark. Once again, shout out to all GReader staff past and present for delivering a ton of value for nothing out of my pocket. No heapings of scorn from this quarter. Execs made a business decision and I wasn’t exactly a paying customer. It was a good run while it lasted. Special kudos to Mihai Parparita for whipping together the eminently useful readerisdead toolkit on short notice. Somehow it successfully slurped down multiple gigabytes of Reader data for me across multiple accounts.
Moving onward! I’ve decided to go with feedbin.me since it’s approved for use with Mr. Reader. I realized that despite my affection for NetNewsWire, I now do the vast majority of my feedreading on my iPad, either on the couch or interstitially. So tilting towards my favorite reader there means the least dislocation. Meanwhile, Marco Arment somewhat put a stake in the prospects of NetNewsWire. To compensate on the desktop, I’m adding ReadKit to the mix.
However, like dangerousmeta, I waited until the last minute to make up my mind. I’m reserving the right to radically change my mind as I see fit.
The courage culture paints a tempting picture of how people end up with remarkable lives. It tells a story where you’re the main character, fighting evil forces, and ultimately triumphing after a brief but intense battle.
The reality is decidedly less exciting. Remarkable careers require that you become remarkably good. This takes time. But not necessarily a string of defiant rejections of some mysterious status quo.
Since today is my birthday, I try and reflect on things I can readily change up to stay out of unhealthy ruts or just to keep myself fresh. 540 days in a row is more than enough to prove that I can keep a posting streak alive. The conjunction of birthday, nice round number, and national holiday seems more than auspicious timing to give up that streak. Plus, I’ve been at this blogging thing off and on for well over 10 years. (Remember when it was all about “social software”?)
Even though I disagree a bit with the whole post, Greg Linden recently captured a bit of where I’m at:
I find my blogging here to be too useful to me to stop doing it. I have also embraced microblogging in its many forms. Yet I am left wondering if there is something we are all missing, something shorter than blogging and longer than tweets and different than both, that would encourage thoughtful, useful, relevant mass communication.
We are still far from ideal. A few years ago, it used to be that millions of blog and press articles flew past, some of which might pile up in an RSS reader, a few of which might get read. Now, millions of tweets, thousands of Facebook posts, and millions of articles fly past, some of which might be seen in an app, a few of which might get read. Attention is random; being seen is luck of the draw. We are far from ideal.
I don’t think blogging is dead. I’m not sure blogging was always about journalism. And I personally haven’t embraced microblogging, although Twitter makes for a great link stream. But blogging is too useful, and fun!, for me to stop cold. I will, however, be slowing down a bit. Might be a couple of posts a week but probably no less than once per. I will, however, feel no obligation to any given frequency. So if you’ve been using this blog as your daily hit of excitement, I thank you for your attention, but encourage you to add another source or two as a replacement.
And even though the posting streak wasn’t particularly onerous in terms of time, I’ll be trying to turn the same habits of mind to side projects involving coding and data analysis. This also means my content should trend to more technical topics but we’ll see. With this year’s #3 pick in the NBA draft, the Washington Wizards’ luck is looking up, so they might be even more interesting to talk about in 2013-14. I also have this half-assed idea to do a series of 10 REM posts, reminiscing on 10 years of blogging, by trawling through the archives of Mass Programming Resistance, New Media Hack, and out into the wider web.
Be seeing you!
P.S. Feels like Greg has another start-up thread within him even though there’s already a clear direction for Geeky Ventures!
Check this out
Enthought Python Distribution -- www.enthought.com
Version: 7.3-2 (32-bit)

Python 2.7.3 |EPD 7.3-2 (32-bit)| (default, Apr 12 2012, 11:28:34)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "credits", "demo" or "enthought" for more information.
>>> import datetime
>>> datetime.date(2013,05,26) - datetime.date(2011,12,3)
datetime.timedelta(540)
>>>
That’s my way of saying I’ve posted for 540 days straight. Also on the order of 599 out of 600. Yay me!
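For anyone replaying that session today: it was run on Python 2.7, where the leading-zero literal 05 is legal. Python 3 rejects it as a syntax error, so a rough modern equivalent would look like the sketch below. The variable names are mine, purely for illustration.

```python
from datetime import date

# Dates taken from the interpreter session above.
first_post = date(2011, 12, 3)
latest_post = date(2013, 5, 26)  # plain 5; Python 3 does not accept 05

# Subtracting two dates yields a timedelta; .days is the whole-day count.
streak = latest_post - first_post
print(streak.days)  # 540
```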
In fall of 2011, just on a lark and as an experiment in behavior modification, I set a goal of posting for 365 days straight. I slipped up, got back on the horse and never looked back. Mission accomplished.
Got a few other things to say, so more after the break:
Yowsa! I actually got a link-out from dangerousmeta! I’m showing my blogging age here, but I’ve noted in the past my admiration for the site. Meanwhile, I don’t do any audience tracking or visit analytics at all for MPR. Pretty much have no idea who’s actually reading this stuff, if anyone. So it’s one of those old time, early ’oughts (yup, I go back that far) thrills to see the site title pop up in another feed.
Evan Miller’s statistical material for programmers might come in handy:
As my modest contribution to developer-kind, I’ve collected together the statistical formulas that I find to be most useful; this page presents them all in one place, a sort of statistical cheat-sheet for the practicing programmer.
Most of these formulas can be found in Wikipedia, but others are buried in journal articles or in professors’ web pages. They are all classical (not Bayesian), and to motivate them I have added concise commentary. I’ve also added links and references, so that even if you’re unfamiliar with the underlying concepts, you can go out and learn more. Wearing a red cape is optional.
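To give a flavor of the kind of classical formula such a cheat-sheet covers, here is a minimal sketch of a mean with a 95% confidence interval. It is not copied from Miller's page: the function name, the made-up latency numbers, and the use of the normal approximation (z = 1.96) are all my assumptions.

```python
import math
import statistics

def mean_with_ci(samples, z=1.96):
    """Return (mean, low, high) using a normal-approximation interval."""
    n = len(samples)
    mean = statistics.mean(samples)
    # Standard error of the mean: sample standard deviation over sqrt(n).
    std_err = statistics.stdev(samples) / math.sqrt(n)
    return mean, mean - z * std_err, mean + z * std_err

# Made-up measurements, purely for illustration.
latencies_ms = [120, 135, 128, 142, 131, 125, 138, 129]
mean, low, high = mean_with_ci(latencies_ms)
print(f"mean={mean:.1f} ms, 95% CI=({low:.1f}, {high:.1f})")
```

With only eight samples a t-distribution multiplier would be the more careful choice; the fixed 1.96 keeps the sketch inside the standard library.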
Network partitions that is, and their implications for some common, popular, open source datastores. Kyle Kingsbury has cooked up “Call Me Maybe”
This article is part of Jepsen, a series on network partitions. We’re going to learn about distributed consensus, discuss the CAP theorem’s implications, and demonstrate how different databases behave under partition.
In-depth technical content on the Web. Who knew! You have been warned.
I was all set to put CommaFeed on the list of potential GReader replacements after seeing a mention coming across the MetaFilter feed. Then I started reading the MeFi comments and this one from Rhaomi really hit home:
It’s not just the interface and UI, which is pretty easy to clone. It’s the staggering infrastructure that powers it — the sophisticated search crawlers scouring the web and delivering near-real-time updates, the industrial-scale server farms that store untold petabytes of searchable text and images relevant to you (much of it from long-vanished sources), the ubiquitous Google name that makes the service a popular platform for innumerable third-party apps, scripts, and extensions.
It’s possible to code up something that looks and feels a lot like Reader in three months, with the same view types and shortcuts. But to replicate its core functionality — fast updates, archive search, stability, universal access, wide interoperability — takes Google-scale engineering I doubt anybody short of Microsoft/Yahoo can emulate. It was very nearly a public service, and it’s going to be frustrating trying to downsize expectations for such a core web service to what a startup — even a subscription-backed one — can accomplish.
Not to mention the current CommaFeed landing page annoyingly doesn’t have any type of “About” page, just a force funnel to registration. Hey, I like to at least be sweet talked a little before wasting a password!
The upside of NewsBlur has been that it works really well at its core purpose, fetching feeds and displaying them for the user. It also has solid native mobile clients, enabling you to keep read status in sync across devices.
That’s a good enough endorsement for me. With the clock ticking on the GReader shutdown, I’ll give NewsBlur the first crack at filling the void for me.
C’mon Bears, cut it out. It’s getting embarrassing how much Spark related output there has been recently. In a good way!
From social networks to targeted advertising, big graphs capture the structure in data and are central to recent advances in machine learning and data mining. Unfortunately, directly applying existing data-parallel tools to graph computation tasks can be cumbersome and inefficient. The need for intuitive, scalable tools for graph computation has led to the development of new graph-parallel systems (e.g. Pregel, PowerGraph) which are designed to efficiently execute graph algorithms. Unfortunately, these new graph-parallel systems do not address the challenges of graph construction and transformation which are often just as problematic as the subsequent computation. Furthermore, existing graph-parallel systems provide limited fault-tolerance and support for interactive data mining.
We introduce GraphX, which combines the advantages of both data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark data-parallel framework. We leverage new ideas in distributed graph representation to efficiently distribute graphs as tabular data-structures. Similarly, we leverage advances in data-flow systems to exploit in-memory computation and fault-tolerance. We provide powerful new operations to simplify graph construction and transformation. Using these primitives we implement the PowerGraph and Pregel abstractions in less than 20 lines of code. Finally, by exploiting the Scala foundation of Spark, we enable users to interactively load, transform, and compute on massive graphs.
Need to drill in to see how GraphX stacks up to the current spate of “big data” graph toolkits, especially GraphLab. Ben Lorica reports that GraphX is more oriented towards programmer productivity as opposed to raw performance:
GraphX is a new, fault-tolerant, framework that runs within Spark. Its core data structure is an immutable graph (Resilient Distributed Graph – or RDG), and GraphX programs are a sequence of transformations on RDG’s (with each transformation yielding a new RDG). Transformations on RDG’s can affect nodes, edges, or both (depending on the state of neighboring edges and nodes). GraphX greatly enhances productivity by simplifying a range of tasks (graph loading, construction, transformation, and computations). But it does so at the expense of performance: early prototype algorithms written in GraphX were slower than those written in GraphLab/PowerGraph.
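For readers who haven't met the Pregel abstraction mentioned above, here is a toy, single-process sketch of the idea: vertices exchange messages in synchronized supersteps until nothing changes. This is plain Python, not GraphX, Spark, or GraphLab code, and the function name and the minimum-label (connected components) vertex program are my own choices for illustration.

```python
def pregel_min_label(edges, max_supersteps=20):
    """Label each vertex with the smallest vertex id in its component."""
    # Build an undirected adjacency map from the edge list.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    label = {v: v for v in neighbors}
    # Superstep 0: every vertex sends its own label to its neighbors.
    inbox = {v: [label[u] for u in neighbors[v]] for v in neighbors}

    for _ in range(max_supersteps):
        outbox = {v: [] for v in neighbors}
        changed = False
        for v, messages in inbox.items():
            best = min(messages, default=label[v])
            if best < label[v]:
                label[v] = best
                changed = True
                # Messages sent now are only read in the next superstep.
                for u in neighbors[v]:
                    outbox[u].append(best)
        if not changed:
            break  # no vertex updated, so the computation has converged
        inbox = outbox
    return label

components = pregel_min_label([(1, 2), (2, 3), (4, 5)])
print(components)
```

The real systems distribute the vertices and messages across machines and add fault tolerance; the sketch only shows the shape of the programming model.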
Wow! Basho’s Ricon East conference was a little more diverse and wide ranging than I anticipated. This was evidenced by Anders Pearson’s summary of the talks he attended. For example, this lede on ZooKeeper for the Skeptical Architect by Camille Fournier, VP of Technical Architecture, Rent the Runway:
Camille presented ZooKeeper from the perspective of an architect who is a ZooKeeper committer, has done large deployments of it at her previous employer (Goldman Sachs), left to start her own company, and that company doesn’t use ZooKeeper. In other words, taking a very balanced engineering view of what ZooKeeper is appropriate for and where you might not want to use it.
Of the talks Pearson summarized, only two were by Basho employees while the rest were by some pretty serious distributed systems folks such as Margo Seltzer and Theo Schlossnagle. Plus there was a healthy dose of industry war story experience at scale.
Good on Basho!
snakebite 1.0.0: Pure Python HDFS client bit.ly/16qtoVT— Python Package Index (@pypi) May 17, 2013
Another annoyance we had with Hadoop (and in particular HDFS) is that interacting with it is quite slow. For example, when you run hadoop fs -ls /, a Java virtual machine is started, a lot of Hadoop JARs are loaded and the communication with the NameNode is done, before displaying the result. This takes at least a couple of seconds and can become slightly annoying. This gets even worse when you do a lot of existence checks on HDFS; something we do a lot with luigi, to see if the output of a job exists.
So, to circumvent slow interaction with HDFS and having a native solution for Python, we’ve created Snakebite, a pure Python HDFS client that only uses Protocol Buffers to communicate with HDFS. And since this might be interesting for others, we decided to Open Source it at http://github.com/spotify/snakebite.
Roger that on the annoyingly slow response of hadoop fs. Thanks Spotify.
TIL about Jepp:
Jepp embeds CPython in Java. It is safe to use in a heavily threaded environment, it is quite fast and its stability is a main feature and goal.
Could be handy for cutting down performance overhead at some points in the Hadoop stack where Python and Java come together. I’m looking at you Hadoop Streaming. Also for helping Python out with the myriad of serialization formats that Java does oh so well.
A stark but recurring reality in the business world is this: when it comes to working with data, statistics and mathematics are rarely the rate-limiting elements in moving the needle of value. Most firms’ unwashed masses of data sit far lower on Maslow’s hierarchy at the level of basic nurture and shelter. What is needed for this data isn’t philosophy, religion, or science — what’s needed is basic, scalable infrastructure.
The more data analysis I do, the more plain ’ole wrestling with the data becomes critical. And figuring out the plumbing and tools to make that happen becomes more interesting.
Amazon EC2 is a great service but sometimes it’s hard to keep track of all the virtual machine types that are provided. Jeff Barr put together a handy comprehensive backgrounder to Amazon EC2 instance families and types:
Over the past six or seven years I have had the opportunity to see customers of all sizes use Amazon EC2 to power their applications, including high traffic web sites, Genome analysis platforms, and SAP applications. I have learned that the developers of the most successful applications and services use a rigorous performance testing and optimization process to choose the right instance type(s) for their application.
In order to help you to do this for your own applications, I’d like to review some important EC2 concepts and then take a look at each of the instance types that make up the EC2 instance family.
Even better he covers the intended use cases for each family and their designed performance tradeoffs. Keep it in your back pocket if you’re an EC2 hacker.
Introducing … GraphLab the company: congrats to the founders of GraphLab Inc. on their $6.75 million Series A goo.gl/yG9YG— Ben Lorica (@bigdata) May 14, 2013
I’ve mentioned GraphLab and have been toying with it since before its 1.0 release. Now the stakes have been raised with a de-cloaking and a heap of venture capital. Good luck to Professor Guestrin and crew.
The Discogs.com data is in some humongous XML files, which is a little unruly for many data hacking tasks. Python has some great XML processing modules, but it’s always good to have a little guidance. Enter this oldie but goodie from Eli Bendersky on Processing XML in Python with ElementTree:
As I mentioned in the beginning of this article, XML documents tend to get huge and libraries that read them wholly into memory may have a problem when parsing such documents is required. This is one of the reasons to use the SAX API as an alternative to DOM.
We’ve just learned how to use ET to easily read XML into an in-memory tree and manipulate it. But doesn’t it suffer from the same memory hogging problem as DOM when parsing huge documents? Yes, it does. This is why the package provides a special tool for SAX-like, on the fly parsing of XML. This tool is iterparse.
I will now use a complete example to demonstrate both how iterparse may be used, and also measure how it fares against standard tree parsing.
If I was going to update Bendersky’s post, I wouldn’t change much, other than to mention lxml and lxml.etree which provide high-performance streaming XML processing.
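As a reminder to myself of what the iterparse approach looks like, here is a minimal sketch using only the standard library. The XML snippet and its element names (releases, release, title) are a hypothetical stand-in, not the actual Discogs schema, and iter_titles is a name I made up.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a large dump; real element names may differ.
SAMPLE = b"""<releases>
  <release id="1"><title>First</title></release>
  <release id="2"><title>Second</title></release>
  <release id="3"><title>Third</title></release>
</releases>"""

def iter_titles(source):
    """Yield (id, title) pairs as each release element finishes parsing."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "release":
            yield elem.get("id"), elem.findtext("title")
            # Free the element's children, text, and attributes once used.
            # The emptied element itself still hangs off the root.
            elem.clear()

titles = list(iter_titles(io.BytesIO(SAMPLE)))
print(titles)
```

For a real multi-gigabyte dump the source would be a file path or an open file rather than a BytesIO, and lxml.etree offers a similar iterparse if the standard library version turns out to be too slow.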
Haven’t finished working through them, but these git intros feel pretty useful. Slideshare alert if you’re allergic.
Lemi Orhan Ergin’s Git branching Model might be overly stylish, but looks like it goes into detail on merging in addition to branching.
Via Rajiv Pant.
Link parkin’: SourceTree, Atlassian’s desktop GUI DVCS client:
Say goodbye to the command line – use the full capability of Git and Mercurial in the SourceTree desktop app. Manage all your repositories, hosted or local, through SourceTree’s simple interface.
Still checking for consistency, but it looks like I’ve completed my mission of grabbing all the currently available Discogs.com data dumps. Have one more to grab and verify the checksum. Then I should be good to go. 45+ GB (compressed) to romp through.
Oddly, it looks like we’re only getting releases updated for the month of May. Curious.
Really handy tip from Emacs Redux:
Auto-backup is triggered when you save a file – it will keep the old version of the file around, adding a ~ to its name. So if you saved the file foo, you’d get foo~ as well.
auto-save-mode auto-saves a file every few seconds or every few characters …
Even though I’ve never actually had any use of those backups, I still think it’s a bad idea to disable them (most backups are eventually useful). I find it much more prudent to simply get them out of sight by storing them in the OS’s tmp directory instead.
I find the biggest pain with autosave files is getting git to ignore their existence. Yeah, I can fiddle around with .gitignore files, but that never quite seems to be universally applied correctly for me. Not even having emacs temp files in project directories makes the whole issue go away.
Playing off of continuous partial attention, a particularly bad patch of TV convinced me it’s just a medium for “continuous partial insanity”. Between The News, “reality shows”, the fictional programming, and the advertising, the only intent is to keep you in a state of intense emotional elation or despair. Mostly despair since fear drives sales.
Criminy! Sports is a relative island of rationality, structure, and order.
Interestingly, a Google search for “continuous partial insanity” currently only brings up a long abandoned blog, parked on it as a tagline. Seems like an opportunity.
Luke Wroblewski takes interface design and user experience in a serious fashion. So his Google Glass experience was the first commentary I took seriously:
Almost a week ago I picked up my Glass explorer edition on Google’s campus in Mountain View. Since then I’ve put it into real-world use in a variety of places. I wore the device in three different airports, busy city streets, several restaurants, a secure federal building, and even a casino floor in Las Vegas. My goal was to try out Glass in as many different situations as possible to see how I would or could use the device.
During that time, Scott Jenson’s concise mandate of user experience came to mind a lot. As Scott puts it “value must be greater than pain.” That is, in order for someone to use a product, it must be more valuable to them than the effort required to use it. Create enough value and pain can be high. But if you don’t create a lot of value, the pain of using something has to be really low. It’s through this lens, that I can best describe Google Glass in its current state.
Definitely worth a full read, especially for the punch line.
Like I said, I enjoy a good curmudgeonly rant. Stephen Few has not been having a good couple of months with publishers.
When I fell in love with words as a young man, I developed a respect for publishers that was born mostly of fantasy. I imagined venerable institutions filled with people of great intellect, integrity, and respect for ideas. I’m sure many people who fit this description still work for publishers, but my personal experience has mostly involved those who couldn’t think their way out of a wet paper bag and apparently have no desire to try.
Said most recent experience involves a bait and switch by Taylor & Francis (the publisher) on rights to some material Few was providing to an academic journal. Guy goes out of his way to put something together, I’m sure of high quality, and they want to reserve the right to modify his work. After they agreed in principle to his terms.
Something similar happened to Danah Boyd and I notice a pattern. Well-intentioned journal editor from academia agrees to reasonable terms from fellow academic. Publisher waits until last minute to pull the okee-doke: “Well, we can’t really do that. If you don’t agree to our onerous terms we’ll have to pull your article.” If these guys didn’t have their hooks so tightly intertwined with the tenure process, this behavior would be so over.
norman 0.7.1: Norman is a framework for advanced data structures in python using an database-like approach.. bit.ly/V2ovxv— Python Package Index (@pypi) February 12, 2013
Speaking of finding interesting things on @PyPi, here’s Norman
Norman is a framework for advanced data structures in python using a database-like approach. The range of potential applications is wide, for example in-memory databases, multi-keyed dictionaries or node graphs.
For the longest time I’ve been thinking one could transliterate prefuse into Python to enable interactive visualization programming at a high level. The critical hurdle was prefuse’s table oriented datastructures and queries. In-memory sqlite could probably do the trick, but then you’ve got to deal with serialization and deserialization of Python objects.
Norman looks like it might fit the bill better for a prefuse knockoff.
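For comparison, here is roughly what the in-memory sqlite route would look like, including the serialization round trip I was grumbling about. This is a minimal sketch with the standard library only: the nodes table, its columns, and the JSON payloads are hypothetical, and none of this is Norman's or prefuse's API.

```python
import json
import sqlite3

# Hypothetical visual items; the attribute names are made up.
nodes = [
    {"name": "a", "x": 0.0, "style": {"color": "red"}},
    {"name": "b", "x": 2.5, "style": {"color": "blue"}},
    {"name": "c", "x": 5.0, "style": {"color": "green"}},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (name TEXT PRIMARY KEY, x REAL, payload TEXT)")

# Queryable fields become columns; everything else has to be serialized.
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?)",
    [(n["name"], n["x"], json.dumps(n)) for n in nodes],
)

# The query itself is easy; the cost is deserializing on the way out.
rows = conn.execute(
    "SELECT payload FROM nodes WHERE x > ? ORDER BY name", (1.0,)
)
selected = [json.loads(payload) for (payload,) in rows]
print([n["name"] for n in selected])
```

Every read hands back a fresh copy rather than the original Python object, which is exactly the friction a table-like structure living directly over Python objects would avoid.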