Greg Linden’s got a new batch of interesting links. That’s worth coming out of posting hibernation (n.b. not retirement).
- Transitioned my feedreading experience post-GReader
- Upgraded my WordPress installation and a couple of plug-ins
- Gone from Ubuntu 11.10 (pictured above, 591 days uptime wow!) to Ubuntu 12.10
Everything so far has been pretty painless, other than one lingering bug in feedbin that only seems to affect the feed for Tim Bray’s Ongoing. Unfortunately, this is one of my favorite feeds. Seems a bit suspect that the issue lingers as Bray and his feed have been around for like ever, and a good feed library should process his, of all people’s, correctly. But I’ll chalk it up to feedbin’s growing pains.
And ReadKit is passable, but I wouldn’t exactly call it … zippy … on Ye Olde MacBook.
Give or take a few due to potential timezone adjustments, in 6 hours Google Reader will go dark. Once again, shout out to all GReader staff past and present for delivering a ton of value for nothing out of my pocket. No heapings of scorn from this quarter. Execs made a business decision and I wasn’t exactly a paying customer. It was a good run while it lasted. Special kudos to Mihai Parparita for whipping together the eminently useful readerisdead toolkit on short notice. Somehow it successfully slurped down multiple gigabytes of Reader data for me across multiple accounts.
Moving onward! I’ve decided to go with feedbin.me since it’s approved for use with Mr. Reader. I realized that despite my affection for NetNewsWire, I now do the vast majority of my feedreading on my iPad, either on the couch or interstitially. So tilting towards my favorite reader there means the least dislocation. Meanwhile, Marco Arment somewhat put a stake in the prospects of NetNewsWire. To compensate on the desktop, I’m adding ReadKit to the mix.
However, like dangerousmeta, I waited until the last minute to make up my mind. I’m reserving the right to radically change my mind as I see fit.
The courage culture paints a tempting picture of how people end up with remarkable lives. It tells a story where you’re the main character, fighting evil forces, and ultimately triumphing after a brief but intense battle.
The reality is decidedly less exciting. Remarkable careers require that you become remarkably good. This takes time. But not necessarily a string of defiant rejections of some mysterious status quo.
Since today is my birthday, I try and reflect on things I can readily change up to stay out of unhealthy ruts or just to keep myself fresh. 540 days in a row is more than enough to prove that I can keep a posting streak alive. The conjunction of birthday, nice round number, and national holiday seems more than auspicious timing to give up that streak. Plus, I’ve been at this blogging thing off and on for well over 10 years. (Remember when it was all about “social software”?)
Even though I disagree a bit with the whole post, Greg Linden recently captured a bit of where I’m at:
I find my blogging here to be too useful to me to stop doing it. I have also embraced microblogging in its many forms. Yet I am left wondering if there is something we are all missing, something shorter than blogging and longer than tweets and different than both, that would encourage thoughtful, useful, relevant mass communication.
We are still far from ideal. A few years ago, it used to be that millions of blog and press articles flew past, some of which might pile up in an RSS reader, a few of which might get read. Now, millions of tweets, thousands of Facebook posts, and millions of articles fly past, some of which might be seen in an app, a few of which might get read. Attention is random; being seen is luck of the draw. We are far from ideal.
I don’t think blogging is dead. I’m not sure blogging was always about journalism. And I personally haven’t embraced microblogging, although Twitter makes for a great link stream. But blogging is too useful, and fun!, for me to stop cold. I will, however, be slowing down a bit. Might be a couple of posts a week but probably no less than once per. I will also feel no obligation to any given frequency. So if you’ve been using this blog as your daily hit of excitement, I thank you for your attention, but encourage you to add another source or two as a replacement.
And even though the posting streak wasn’t particularly onerous in terms of time, I’ll be trying to turn the same habits of mind to side projects involving coding and data analysis. This also means my content should trend to more technical topics but we’ll see. With this year’s #3 pick in the NBA draft, the Washington Wizards’ luck is looking up, so they might be even more interesting to talk about in 2013-14. I also have this half-assed idea to do a series of 10 REM posts, reminiscing on 10 years of blogging, by trawling through the archives of Mass Programming Resistance, New Media Hack, and out into the wider web.
Be seeing you!
P.S. Feels like Greg has another start-up thread within him even though there’s already a clear direction for Geeky Ventures!
Check this out
Enthought Python Distribution -- www.enthought.com
Version: 7.3-2 (32-bit)
Python 2.7.3 |EPD 7.3-2 (32-bit)| (default, Apr 12 2012, 11:28:34)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "credits", "demo" or "enthought" for more information.
>>> import datetime
>>> datetime.date(2013,05,26) - datetime.date(2011,12,3)
datetime.timedelta(540)
>>>
That’s my way of saying I’ve posted for 540 days straight. Also on the order of 599 out of 600. Yeah me!
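For anyone replaying that session on Python 3, where a leading-zero literal like 05 is a syntax error, the same streak arithmetic might look like the sketch below. The variable names are mine, purely for illustration.

```python
import datetime

# Dates copied from the interpreter session above, minus the leading zero on the month.
streak_start = datetime.date(2011, 12, 3)
streak_end = datetime.date(2013, 5, 26)

# Subtracting two dates gives a timedelta; .days is the whole-day difference.
streak = streak_end - streak_start
print(streak.days)  # 540
```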
In fall of 2011, just on a lark and as an experiment in behavior modification, I set a goal of posting for 365 days straight. I slipped up, got back on the horse and never looked back. Mission accomplished.
Got a few other things to say, so more after the break:
Yowsa! I actually got a link-out from dangerousmeta! I’m showing my blogging age here, but I’ve noted in the past my admiration for the site. Meanwhile, I don’t do any audience tracking or visit analytics at all for MPR. Pretty much have no idea who’s actually reading this stuff, if anyone. So it’s one of those old time, early ’oughts (yup, I go back that far) thrills to see the site title pop up in another feed.
Evan Miller’s statistical material for programmers might come in handy:
As my modest contribution to developer-kind, I’ve collected together the statistical formulas that I find to be most useful; this page presents them all in one place, a sort of statistical cheat-sheet for the practicing programmer.
Most of these formulas can be found in Wikipedia, but others are buried in journal articles or in professors’ web pages. They are all classical (not Bayesian), and to motivate them I have added concise commentary. I’ve also added links and references, so that even if you’re unfamiliar with the underlying concepts, you can go out and learn more. Wearing a red cape is optional.
Network partitions that is, and their implications for some common, popular, open source datastores. Kyle Kingsbury has cooked up “Call Me Maybe”
This article is part of Jepsen, a series on network partitions. We’re going to learn about distributed consensus, discuss the CAP theorem’s implications, and demonstrate how different databases behave under partition.
In-depth technical content on the Web. Who knew! You have been warned.
I was all set to put CommaFeed on the list of potential GReader replacements after seeing a mention coming across the MetaFilter feed. Then I started reading the MeFi comments and this one from Rhaomi really hit home:
It’s not just the interface and UI, which is pretty easy to clone. It’s the staggering infrastructure that powers it — the sophisticated search crawlers scouring the web and delivering near-real-time updates, the industrial-scale server farms that store untold petabytes of searchable text and images relevant to you (much of it from long-vanished sources), the ubiquitous Google name that makes the service a popular platform for innumerable third-party apps, scripts, and extensions.
It’s possible to code up something that looks and feels a lot like Reader in three months, with the same view types and shortcuts. But to replicate its core functionality — fast updates, archive search, stability, universal access, wide interoperability — takes Google-scale engineering I doubt anybody short of Microsoft/Yahoo can emulate. It was very nearly a public service, and it’s going to be frustrating trying to downsize expectations for such a core web service to what a startup — even a subscription-backed one — can accomplish.
Not to mention the current CommaFeed landing page annoyingly doesn’t have any type of “About” page, just a forced funnel to registration. Hey, I like to at least be sweet talked a little before wasting a password!
The upside of NewsBlur has been that it works really well at its core purpose, fetching feeds and displaying them for the user. It also has solid native mobile clients, enabling you to keep read status in sync across devices.
That’s a good enough endorsement for me. With the clock ticking on the GReader shutdown, I’ll give NewsBlur the first crack at filling the void for me.
C’mon Bears, cut it out. It’s getting embarrassing how much Spark related output there has been recently. In a good way!
From social networks to targeted advertising, big graphs capture the structure in data and are central to recent advances in machine learning and data mining. Unfortunately, directly applying existing data-parallel tools to graph computation tasks can be cumbersome and inefficient. The need for intuitive, scalable tools for graph computation has lead to the development of new graph-parallel systems (e.g. Pregel, PowerGraph) which are designed to efficiently execute graph algorithms. Unfortunately, these new graph-parallel systems do not address the challenges of graph construction and transformation which are often just as problematic as the subsequent computation. Furthermore, existing graph-parallel systems provide limited fault-tolerance and support for interactive data mining.
We introduce GraphX, which combines the advantages of both data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark data-parallel framework. We leverage new ideas in distributed graph representation to efficiently distribute graphs as tabular data-structures. Similarly, we leverage advances in data-flow systems to exploit in-memory computation and fault-tolerance. We provide powerful new operations to simplify graph construction and transformation. Using these primitives we implement the PowerGraph and Pregel abstractions in less than 20 lines of code. Finally, by exploiting the Scala foundation of Spark, we enable users to interactively load, transform, and compute on massive graphs.
Need to drill in to see how GraphX stacks up to the current spate of “big data” graph toolkits, especially GraphLab. Ben Lorica reports that GraphX is oriented more towards programmer productivity than raw performance:
GraphX is a new, fault-tolerant, framework that runs within Spark. Its core data structure is an immutable graph (Resilient Distributed Graph – or RDG), and GraphX programs are a sequence of transformations on RDG’s (with each transformation yielding a new RDG). Transformations on RDG’s can affect nodes, edges, or both (depending on the state of neighboring edges and nodes). GraphX greatly enhances productivity by simplifying a range of tasks (graph loading, construction, transformation, and computations). But it does so at the expense of performance: early prototype algorithms written in GraphX were slower than those written in GraphLab/PowerGraph.
Wow! Basho’s Ricon East conference was a little more diverse and wide ranging than I anticipated. This was evidenced by Anders Pearson’s summary of the talks he attended. For example, this lede on ZooKeeper for the Skeptical Architect by Camille Fournier, VP of Technical Architecture, Rent the Runway:
Camille presented ZooKeeper from the perspective of an architect who is a ZooKeeper committer, has done large deployments of it at her previous employer (Goldman Sachs), left to start her own company, and that company doesn’t use ZooKeeper. In other words, taking a very balanced engineering view of what ZooKeeper is appropriate for and where you might not want to use it.
Of the talks Pearson summarized, only two were by Basho employees while the rest were by some pretty serious distributed systems folks such as Margo Seltzer and Theo Schlossnagle. Plus there was a healthy dose of industry war story experience at scale.
Good on Basho!
snakebite 1.0.0: Pure Python HDFS client bit.ly/16qtoVT— Python Package Index (@pypi) May 17, 2013
Another annoyance we had with Hadoop (and in particular HDFS) is that interacting with it is quite slow. For example, when you run hadoop fs -ls /, a Java virtual machine is started, a lot of Hadoop JARs are loaded and the communication with the NameNode is done, before displaying the result. This takes at least a couple of seconds and can become slightly annoying. This gets even worse when you do a lot of existence checks on HDFS; something we do a lot with luigi, to see if output of a jobs exist.
So, to circumvent slow interaction with HDFS and having a native solution for Python, we’ve created Snakebite, a pure Python HDFS client that only uses Protocol Buffers to communicate with HDFS. And since this might be interesting for others, we decided to Open Source it at http://github.com/spotify/snakebite.
Roger that on the annoyingly slow response of hadoop fs. Thanks Spotify.
TIL about Jepp:
Jepp embeds CPython in Java. It is safe to use in a heavily threaded environment, it is quite fast and its stability is a main feature and goal.
Could be handy for cutting down performance overhead at some points in the Hadoop stack where Python and Java come together. I’m looking at you Hadoop Streaming. Also for helping Python out with the myriad of serialization formats that Java does oh so well.
A stark but recurring reality in the business world is this: when it comes to working with data, statistics and mathematics are rarely the rate-limiting elements in moving the needle of value. Most firms’ unwashed masses of data sit far lower on Maslow’s hierarchy at the level of basic nurture and shelter. What is needed for this data isn’t philosophy, religion, or science — what’s needed is basic, scalable infrastructure.
The more data analysis I do, the more plain ’ole wrestling with the data becomes critical. And figuring out the plumbing and tools to make that happen becomes more interesting.
Amazon EC2 is a great service but sometimes it’s hard to keep track of all the virtual machine types that are provided. Jeff Barr put together a handy comprehensive backgrounder to Amazon EC2 instance families and types:
Over the past six or seven years I have had the opportunity to see customers of all sizes use Amazon EC2 to power their applications, including high traffic web sites, Genome analysis platforms, and SAP applications. I have learned that the developers of the most successful applications and services use a rigorous performance testing and optimization process to choose the right instance type(s) for their application.
In order to help you to do this for your own applications, I’d like to review some important EC2 concepts and then take a look at each of the instance types that make up the EC2 instance family.
Even better he covers the intended use cases for each family and their designed performance tradeoffs. Keep it in your back pocket if you’re an EC2 hacker.
Introducing … GraphLab the company: congrats to the founders of GraphLab Inc. on their $6.75 million Series A goo.gl/yG9YG— Ben Lorica (@bigdata) May 14, 2013
I’ve mentioned GraphLab and have been toying with it since before its 1.0 release. Now the stakes have been raised with a de-cloaking and a heap of venture capital. Good luck to Professor Guestrin and crew.
The Discogs.com data is in some humongous XML files, which is a little unruly for many data hacking tasks. Python has some great XML processing modules, but it’s always good to have a little guidance. Enter this oldie but goodie from Eli Bendersky on Processing XML in Python with ElementTree:
As I mentioned in the beginning of this article, XML documents tend to get huge and libraries that read them wholly into memory may have a problem when parsing such documents is required. This is one of the reasons to use the SAX API as an alternative to DOM.
We’ve just learned how to use ET to easily read XML into a in-memory tree and manipulate it. But doesn’t it suffer from the same memory hogging problem as DOM when parsing huge documents? Yes, it does. This is why the package provides a special tool for SAX-like, on the fly parsing of XML. This tool is iterparse.
I will now use a complete example to demonstrate both how iterparse may be used, and also measure how it fares against standard tree parsing.
If I was going to update Bendersky’s post, I wouldn’t change much, other than to mention lxml and lxml.etree which provide high-performance streaming XML processing.
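As a reminder of what the iterparse approach looks like in practice, here is a minimal sketch using only the standard library. The sample document is a tiny stand-in I made up; the `release`, `id`, and `title` names are assumptions for illustration, not the actual Discogs schema.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a dump file; element and attribute names are hypothetical.
SAMPLE = b"""<releases>
  <release id="1"><title>First</title></release>
  <release id="2"><title>Second</title></release>
  <release id="3"><title>Third</title></release>
</releases>"""


def iter_release_titles(source):
    """Yield (id, title) pairs one release at a time instead of loading the whole tree."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "release":
            yield elem.get("id"), elem.findtext("title")
            # Discard the element's children and text once we are done with it;
            # the root still holds the emptied shell.
            elem.clear()


titles = list(iter_release_titles(io.BytesIO(SAMPLE)))
print(titles)
```

For a real dump, `source` could presumably be a filename or a file object from `gzip.open`, since `iterparse` accepts either a path or a file-like object.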
Haven’t finished working through them, but these git intros feel pretty useful. Slideshare alert if you’re allergic.
Lemi Orhan Ergin’s Git branching Model might be overly stylish, but looks like it goes into detail on merging in addition to branching.
Via Rajiv Pant.
Link parkin’: SourceTree, Atlassian’s desktop GUI DVCS client:
Say goodbye to the command line – use the full capability of Git and Mercurial in the SourceTree desktop app. Manage all your repositories, hosted or local, through SourceTree’s simple interface.
Still checking for consistency, but it looks like I’ve completed my mission of grabbing all the currently available Discogs.com data dumps. Have one more to grab and verify the checksum. Then I should be good to go. 45+ GB (compressed) to romp through.
Oddly, it looks like we’re only getting releases updated for the month of May. Curious.
Really handy tip from Emacs Redux:
Auto-backup is triggered when you save a file – it will keep the old version of the file around, adding a ~ to its name. So if you saved the file foo, you’d get foo~ as well.
auto-save-mode auto-saves a file every few seconds or every few characters …
Even though I’ve never actually had any use of those backups, I still think it’s a bad idea to disable them (most backups are eventually useful). I find it much more prudent to simply get them out of sight by storing them in the OS’s tmp directory instead.
I find the biggest pain with autosave files is getting git to ignore their existence. Yeah, I can fiddle around with .gitignore files, but that never quite seems to be universally applied correctly for me. Not even having emacs temp files in project directories makes the whole issue go away.
Playing off of continuous partial attention, a particularly bad patch of TV convinced me it’s just a medium for “continuous partial insanity”. Between The News, “reality shows”, the fictional programming, and the advertising the only intent is to keep you in a state of intense emotional elation or despair. Mostly despair since fear drives sales.
Criminy! Sports is a relative island of rationality, structure, and order.
Interestingly, a Google search for “continuous partial insanity” currently only brings up a long abandoned blog, parked on it as a tagline. Seems like an opportunity.
Luke Wroblewski takes interface design and user experience in a serious fashion. So his Google Glass experience was the first commentary I took seriously:
Almost a week ago I picked up my Glass explorer edition on Google’s campus in Mountain View. Since then I’ve put it into real-world use in a variety of places. I wore the device in three different airports, busy city streets, several restaurants, a secure federal building, and even a casino floor in Las Vegas. My goal was to try out Glass in as many different situations as possible to see how I would or could use the device.
During that time, Scott Jenson’s concise mandate of user experience came to mind a lot. As Scott puts it “value must be greater than pain.” That is, in order for someone to use a product, it must be more valuable to them than the effort required to use it. Create enough value and pain can be high. But if you don’t create a lot of value, the pain of using something has to be really low. It’s through this lens, that I can best describe Google Glass in its current state.
Definitely worth a full read, especially for the punch line.
Like I said, I enjoy a good curmudgeonly rant. Stephen Few has not been having a good couple of months with publishers.
When I fell in love with words as a young man, I developed a respect for publishers that was born mostly of fantasy. I imagined venerable institutions filled with people of great intellect, integrity, and respect for ideas. I’m sure many people who fit this description still work for publishers, but my personal experience has mostly involved those who couldn’t think their way out of a wet paper bag and apparently have no desire to try.
Said most recent experience involves a bait and switch by Taylor & Francis (the publisher) on rights to some material Few was providing to an academic journal. Guy goes out of his way to put something together, I’m sure of high quality, and they want to reserve the right to modify his work. After they agreed in principle to his terms.
Something similar happened to Danah Boyd and I notice a pattern. Well-intentioned journal editor from academia agrees to reasonable terms from fellow academic. Publisher waits until last minute to pull the okee-doke: “Well, we can’t really do that. If you don’t agree to our onerous terms we’ll have to pull your article.” If these guys didn’t have their hooks so tightly intertwined with the tenure process, this behavior would be so over.
norman 0.7.1: Norman is a framework for advanced data structures in python using an database-like approach.. bit.ly/V2ovxv— Python Package Index (@pypi) February 12, 2013
Speaking of finding interesting things on @PyPi, here’s Norman
Norman is a framework for advanced data structures in python using an database-like approach. The range of potential applications is wide, for example in-memory databases, multi-keyed dictionaries or node graphs.
For the longest time I’ve been thinking one could transliterate prefuse into Python to enable interactive visualization programming at a high level. The critical hurdle was prefuse’s table oriented datastructures and queries. In-memory sqlite could probably do the trick, but then you’ve got to deal with serialization and deserialization of Python objects.
Norman looks like it might fit the bill better for a prefuse knockoff.
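To make the serialization worry concrete, here is a minimal sketch of the in-memory sqlite route using only the standard library. The table layout and the node dictionaries are hypothetical; nothing here comes from prefuse or Norman.

```python
import pickle
import sqlite3

# Hypothetical visual items: plain Python objects with a nested tuple for position.
nodes = [
    {"label": "a", "weight": 0.5, "pos": (0, 1)},
    {"label": "b", "weight": 2.0, "pos": (3, 4)},
    {"label": "c", "weight": 1.25, "pos": (5, 6)},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, weight REAL, obj BLOB)")

for i, node in enumerate(nodes):
    # Anything we want to query on is duplicated into a column;
    # the object itself has to be pickled into a blob.
    conn.execute(
        "INSERT INTO nodes (id, weight, obj) VALUES (?, ?, ?)",
        (i, node["weight"], pickle.dumps(node)),
    )

# The filter runs in SQL, but every hit must be unpickled before it is usable again.
rows = conn.execute(
    "SELECT obj FROM nodes WHERE weight > ? ORDER BY weight", (1.0,)
).fetchall()
heavy = [pickle.loads(blob) for (blob,) in rows]
print([n["label"] for n in heavy])
```

The round trip through `pickle.dumps` and `pickle.loads` on every write and read is the overhead I would like to avoid.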
I follow @PyPi on Twitter, which just streams Python package announcements. It’s a cheap way to get exposure to new and interesting modules. But every day it seems like there are a couple of newly minted 0.1 packages for “printing nested lists”. Curious but not worth investigating.
They’re generated by people following along an example in the book Head First Python.
The book’s author has amended the lesson (through errata and next edition I guess) to point learners at testpypi.python.org (which didn’t exist at the time the book was written).
I run a cleanup script that deletes them every now and then. I haven’t run it for a while… I’ll put it on my looong TODO list…
Will definitely have to shell out for Cyrille Rossant’s Learning IPython for Interactive Computing and Data Visualization
This book is a beginner-level introduction to IPython for interactive Python programming, high-performance numerical computing, and data visualization. It assumes nothing more than familiarity with Python. It targets developers, students, teachers, hobbyists who know Python a bit, and who want to learn IPython for the extended console, the Notebook, and for more advanced scientific applications.
Too much good e-book tech material at a good price these days.
From GoldenHill Software
CloudPull seamlessly backs up your Google account to your Mac. It supports Gmail, Google Contacts, Google Calendar, Google Drive (formerly Docs), and Google Reader. By default, the app backs up your accounts every hour and maintains old point-in-time snapshots of your accounts for 90 days.
Emphasis mine. Gonna’ try this out over the weekend.
Although I’ve fallen off the film viewing wagon, I’m always intrigued by movies with “all-star” casts. For example, Pulp Fiction has Travolta, Jackson, Thurman, Willis, Roth, Plummer, Rhames, Walken, Buscemi, Keitel, and of course Tarantino as actors. I’ve never seriously sat down and tried to quantify what this meant, but 10 “big time” stars seems like a reasonable threshold.
Then of course, the question is what’s “big time”? And there is the sticking point.
Today I had the brilliant idea that you could, relatively easily, define “top billing” based upon IMDB movie data. If an actor is listed as say one of the top 5 for their gender in the credits (for a few years?) call them an All-Star. Still a little squishy but firmer. Then you can quantitatively evaluate each film, rank, and decide.
Interesting challenge, and I wonder how it could apply to major league sports teams?
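A rough sketch of that All-Star definition, under loud assumptions: the credits records below are invented placeholders (not IMDB data or its format), billing position is taken as already ranked within gender, and the thresholds of top 5 and two distinct years are arbitrary picks for the "few years?" part.

```python
from collections import defaultdict

# Hypothetical credits: (film, year, actor, gender, billing position within that gender).
CREDITS = [
    ("Film A", 1990, "Actor 1", "F", 1),
    ("Film A", 1990, "Actor 2", "M", 1),
    ("Film A", 1990, "Actor 3", "M", 7),
    ("Film B", 1992, "Actor 1", "F", 2),
    ("Film B", 1992, "Actor 3", "M", 1),
    ("Film C", 1994, "Actor 1", "F", 1),
    ("Film C", 1994, "Actor 2", "M", 3),
    ("Film C", 1994, "Actor 3", "M", 6),
]


def all_stars(credits, top_n=5, min_years=2):
    """Actors with top_n billing (within gender) in at least min_years distinct years."""
    top_years = defaultdict(set)
    for _film, year, actor, _gender, position in credits:
        if position <= top_n:
            top_years[actor].add(year)
    return {actor for actor, years in top_years.items() if len(years) >= min_years}


def star_count(film, credits, stars):
    """Number of All-Stars credited on a given film."""
    return sum(1 for f, _y, actor, _g, _p in credits if f == film and actor in stars)


stars = all_stars(CREDITS)
films = sorted({film for film, *_ in CREDITS})
# Rank films by how many All-Stars they carry, most first.
ranking = sorted(films, key=lambda f: star_count(f, CREDITS, stars), reverse=True)
print(stars, ranking)
```

The same counting would work for sports rosters if "top billing" were swapped for something like All-Star game selections, though that is just speculation on my part.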
Slowly making headway downloading the Discogs data dumps. Got 19 complete months in hand. Now into the era of no masters files and release files less than 1GB. Current total storage is roughly 29GB.
Looking forward to some serious data hacking.
A release slated for the summer will include features that enable data sharing (users will be able to do memory-speed writes to Tachyon). With Tachyon, Spark users will have for the first time, a high throughput way of reliably sharing files with other users. Moreover, despite being an external storage system Tachyon is comparable to Spark’s internal cache. Throughput tests on a cluster showed that Tachyon can read 200x and write 300x faster than HDFS. (Tachyon can read and write 30x faster than FDS’ reported throughput.)
Similar to the resilient distributed datasets (RDD) fundamental within Spark, fault-tolerance in Tachyon also relies on the concept of lineage – logging the transformations used to build a dataset, and using those logs to rebuild datasets when needed. Additionally as an external storage system Tachyon also keeps track of binary programs used to generate datasets, and the input datasets required by those programs.
Terabyte scale analytics at interactive speeds. Coming soon to a laptop near you.
Steve Holden, who knows a thing or two about Python, gives his explanation of the existence of the tuple datatype in the programming language:
And that, best beloved, is what tuples are for: they are ordered collections of objects, and each of the objects has, according to its position, a specific meaning (sometimes referred to as its semantics). If no behaviors are required then a tuple is “about the simplest thing that could work.”
Has some good insights, but I think tuple immutability and hashability are vastly undersold.
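What I mean by that, as a quick sketch with made-up names: a tuple of hashable items can key a dict or sit in a set, while the equivalent list cannot, and a tuple cannot be modified in place.

```python
# A tuple of hashable items is itself hashable, so it works as a dict key.
point_names = {(0, 0): "origin", (1, 0): "east"}
print(point_names[(0, 0)])

# A list with the same contents is mutable and therefore unhashable.
try:
    {[0, 0]: "origin"}
    list_key_ok = True
except TypeError:
    list_key_ok = False

# Immutability: item assignment on a tuple raises TypeError.
pair = (1, 2)
try:
    pair[0] = 99
    mutated = True
except TypeError:
    mutated = False

print(list_key_ok, mutated)
```

Those two properties are what make tuples safe as composite keys, which plain "ordered collection with positional meaning" does not capture.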
I’m back in Philadelphia. Hotel Wi-Fi, I scoff at you again with your $12 (!!) a night charge. With Verizon and AT&T’s LTE on my side, I surf without fear. Unlike last time, didn’t even have to leave the room.
Just a quick scan of Jimmy Lin’s paper (PDF Warning) hints that there are some useful insights regarding logging at scale, which is currently an interest of mine:
A little about our backgrounds: The first author is an Associate Professor at the University of Maryland who spent an extended sabbatical from 2010 to 2012 at Twitter, primarily working on relevance algorithms and analytics infrastructure. The second author joined Twitter in early 2010 and was first a tech lead, then the engineering manager of the analytics infrastructure team. Together, we hope to provide a blend of the academic and industrial perspectives—a bit of ivory tower musings mixed with “in the trenches” practical advice. Although this paper describes the path we have taken at Twitter and is only one case study, we believe our recommendations align with industry consensus on how to approach a particular set of big data challenges.
TIL there’s an iPad App for O’Reilly’s Safari Online library of books:
Now available for iOS and Android devices. Safari To Go is available for free and delivers full access to thousands of technology, digital media, business and personal development books and training videos from more than 100 of the world’s most trusted publishers. Search, navigate and organize your content on any WiFi or 3G/4G connection. Plus, cache up to three books to your offline bookbag to read when you can’t connect!
Works great for me since my employer provides Safari accounts!
Two great tastes, that taste great together:
The Pandas Time Series/Date tools and Vega visualizations are a great match; Pandas does the heavy lifting of manipulating the data, and the Vega backend creates nicely formatted axes and plots. Vincent is the glue that makes the two play nice, and provides a number of conveniences for making plot building simple.
Useful examples ensue.