Frank McSherry published a useful reminder that one must carefully calibrate the need to deploy “big data” solutions:
Lots of people struggle with the complexities of getting big data systems up and running, when they possibly shouldn’t be using the systems in the first place. The data sets above are certainly not small (billions of edges), but still run just fine on a laptop. Much faster than the distributed systems, at least.
Here are two helpful guidelines (for largely disjoint populations):
- If you are going to use a big data system for yourself, see if it is faster than your laptop.
- If you are going to build a big data system for others, see that it is faster than my laptop.
This brings back memories of the CMU work on GraphChi, where they processed graphs with billions of edges on a Mac Mini.
I’ll have to dig up Frank’s paper once it gets published.
I’ve been seeing some weird host naming issues on my Mac OS X machine for work. Thought it was an honest to gosh conflict with another machine but it turns out there’s glitchiness in Apple’s latest DNS servers:
Duplicate machine names. We use an old Mac named “nirrti” as a file- and iTunes server. In the pre-10.10 days, once in a blue moon nirrti would rename herself to “nirrti (2)”, presumably because it looked like another machine was already using the name “nirrti”. Under 10.10, this now happens a lot, sometimes getting all the way to nirrti (7). Changing back the computer name in the Sharing pane of the System Preferences usually doesn’t take. Apart from looking bad, this also makes opening network connections and playing iTunes content harder, as you need to connect to the right version of the name or nothing happens.
Good to know, but I wouldn’t go so far as to attempt the modifications described in the article. Seems like a recipe for later pain on further application and operating system upgrades.
I am all over that. Will definitely be checking it out during tomorrow’s commute.
Fiending for a Mushroom Jazz 8 release though. It’s been over 3 years since Mushroom Jazz 7 hit the street.
You can also turn a Kafka topic into a Spark RDD:
Spark-kafka is a library that facilitates batch loading data from Kafka into Spark, and from Spark into Kafka.
This library does not provide a Kafka Input DStream for Spark Streaming. For that please take a look at the spark-streaming-kafka library that is part of Spark itself.
In the day job, I was casting about for ways to integrate Apache Spark with the open source search engine Elasticsearch. Basically, I had some megawads of JSON data which Elasticsearch happily inhales, but I needed a compute platform to work with the data. Spark is my weapon of choice.
Turns out there’s a really nice Elasticsearch Hadoop toolkit that includes making Spark RDDs out of Elasticsearch searches. I have to thank Sloan Ahrens for tipping me off with a nice clear explanation of putting the connector in action:
In this post we’re going to continue setting up some basic tools for doing data science. The ultimate goal is to be able to run machine learning classification algorithms against large data sets using Apache Spark™ and Elasticsearch clusters in the cloud.
… we will continue where we left off, by installing Spark on our previously-prepared VM, then doing some simple operations that illustrate reading data from an Elasticsearch index, doing some transformations on it, and writing the results to another Elasticsearch index.
I’m way late to the Bitcoin party, but think the notion of applications built from blockchain concepts will be a Big Deal (™). Andreas Antonopoulos’ new book Mastering Bitcoin is getting me up to speed. Here’s a taste:
One way to think about the blockchain is like layers in a geological formation, or a glacier core sample. The surface layers may change with the seasons, or even be blown away before they have time to settle. But once you go a few inches deep, geological layers become more and more stable. By the time you look a few hundred feet down, you are looking at a snapshot of the past that has remained undisturbed for millennia or millions of years. In the blockchain, the most recent few blocks may be revised if there is a chain recalculation due to a fork. The top six blocks are like a few inches of topsoil. But once you go deeper into the blockchain, beyond six blocks, blocks are less and less likely to change. After 100 blocks back, there is so much stability that the “coinbase” transaction, the transaction containing newly mined bitcoins, can be spent. A few thousand blocks back (a month) and the blockchain is settled history. It will never change.
From what I’ve read so far, the book is a nice blend of high level overview and technical details, with code samples no less.
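The depth thresholds in that passage can be sketched in a few lines of Python. This is only an illustration of the quoted rule of thumb, not code from the book or from any Bitcoin client: the function name `settledness` and the labels are made up for this example, the 6 and 100 block cutoffs come straight from the quote, and the 4320 cutoff is an assumption that reads "a month" as 30 days of roughly 10-minute blocks.

```python
# Hypothetical helper: names, labels, and the month cutoff are illustrative only.
BLOCKS_PER_MONTH = 6 * 24 * 30  # assumes ~10-minute blocks, i.e. 4320 per 30 days


def settledness(tip_height, block_height):
    """Classify a block by how deep it sits below the chain tip."""
    if block_height < 0 or block_height > tip_height:
        raise ValueError("block_height must be between 0 and tip_height")
    depth = tip_height - block_height  # the tip itself has depth 0
    if depth < 6:
        return "topsoil"             # may still be revised after a fork
    if depth < 100:
        return "stable"              # less and less likely to change
    if depth < BLOCKS_PER_MONTH:
        return "coinbase-spendable"  # 100+ blocks back, per the quote
    return "settled history"


print(settledness(tip_height=340000, block_height=339998))
```

The only point of the sketch is that "how final is this block?" is a function of its distance from the tip, not of its absolute height.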
Yes this blog is still fully operational. As sole owner, proprietor, publisher, and author, I’m committing to more content in 2015. I guarantee it’s going to be a more interesting year in these here parts.
A couple of good overviews from the fine folks at Cloudera
First, Gwen Shapira & Jeff Holoman on “Apache Kafka for Beginners”
Apache Kafka is creating a lot of buzz these days. While LinkedIn, where Kafka was founded, is the most well known user, there are many companies successfully using this technology.
So now that the word is out, it seems the world wants to know: What does it do? Why does everyone want to use it? How is it better than existing solutions? Do the benefits justify replacing existing systems and infrastructure?
In this post, we’ll try to answer those questions. We’ll begin by briefly introducing Kafka, and then demonstrate some of Kafka’s unique features by walking through an example scenario. We’ll also cover some additional use cases and also compare Kafka to existing solutions.
And Uri Laserson on “How-to: Use IPython Notebook with Apache Spark”
Here I will describe how to set up IPython Notebook to work smoothly with PySpark, allowing a data scientist to document the history of her exploration while taking advantage of the scalability of Spark and Apache Hadoop.
I generally dislike the technothriller genre (cf. Daemon), but I enjoyed “Nexus” by Ramez Naam. The technical and philosophical aspects of bio-hacking were well done. I wasn’t particularly fond of the technothriller clash of nation states, American exceptionalism, military/intelligence complex sycophancy tropes, but I knew what I was getting into. At least there was some interesting cultural diversity and introspection in the mix.
I may actually pick up the sequel, “Crux”.
“tl;dr Blaze abstracts tabular computation, providing uniform access to a variety of database technologies”
Haven’t gotten a chance to dig in yet, but Continuum Analytics’ new Blaze Expressions library is worthy of further inspection:
Occasionally we run across a dataset that is too big to fit in our computer’s memory. In this case NumPy and Pandas don’t fit our needs and we look to other tools to manage and analyze our data. Popular choices include databases like Postgres and MongoDB, out-of-disk storage systems like PyTables and BColz and the menagerie of tools on top of the Hadoop File System (Hadoop, Spark, Impala and derivatives.) Each of these systems has their own strengths and weaknesses and an experienced data analyst will choose the right tool for the problem at hand. Unfortunately learning how each system works and pushing data into the proper form often takes most of the data scientist’s time.
The startup costs of learning to munge and migrate data between new technologies often dominate biggish-data analytics.
Blaze strives to reduce this friction. Blaze provides a uniform interface to a variety of database technologies and abstractions for migrating data.
I especially like the notion of exploiting multiple different frameworks such as in-memory (Pandas), SQL, NoSQL (MongoDB), and Big Data (Apache Spark) for tabular backend engines.
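To make the "one interface, several backend engines" idea concrete, here is a toy sketch in plain Python. It is not Blaze's API: `sum_by`, `make_sqlite`, the `accounts` table, and the sample rows are all hypothetical names invented for this example, and the two "backends" are just a list of dicts and an in-memory SQLite database from the standard library.

```python
import sqlite3
from collections import defaultdict

# Made-up sample data for the illustration.
ROWS = [
    {"name": "alice", "amount": 100},
    {"name": "bob", "amount": 200},
    {"name": "alice", "amount": 50},
]


def sum_by_in_memory(rows, key, value):
    """Group-by-sum over a list of dicts (stand-in for an in-memory engine)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def sum_by_sql(conn, table, key, value):
    """The same query pushed down to SQL (stand-in for a database engine)."""
    # Identifiers are interpolated only because this is a toy; the caller
    # below hard-codes them, they are never user input.
    query = "SELECT {k}, SUM({v}) FROM {t} GROUP BY {k}".format(
        k=key, v=value, t=table
    )
    return dict(conn.execute(query).fetchall())


def sum_by(source, key, value, table="accounts"):
    """One entry point, two engines: dispatch on the kind of data source."""
    if isinstance(source, sqlite3.Connection):
        return sum_by_sql(source, table, key, value)
    return sum_by_in_memory(source, key, value)


def make_sqlite(rows):
    """Load the sample rows into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (:name, :amount)", rows)
    return conn


print(sum_by(ROWS, "name", "amount"))
print(sum_by(make_sqlite(ROWS), "name", "amount"))
```

The caller writes the query once and the dispatch decides where it runs. A real library presumably does far more (expression trees, type inference, data migration), none of which this sketch attempts.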
I’ve been a fan of Apache Spark (Go Bears!) for a while despite not having a real good opportunity to put the toolkit to practical use. Last year I got to AMPCamp 3 and the first Spark Summit. At the latter event, the AMPLab started singing a new tune about the benefits of a unified model for big data processing, moving on from selling in-memory computing.
Cloudera’s Gwen Shapira posted a good case study of the upside:
But the biggest advantage Spark gave us in this case was Spark Streaming, which allowed us to re-use the same aggregates we wrote for our batch application on a real-time data stream. We didn’t need to re-implement the business logic, nor test and maintain a second code base. As a result, we could rapidly deploy a real-time component in the limited time left — and impress not just the users but also the developers and their management.
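The reuse described in that quote boils down to writing the aggregate once and feeding it either the whole data set or a series of small batches. Here is a minimal sketch of that shape in plain Python. There is no Spark in it, and `count_by_user`, `run_batch`, `run_streaming`, and the event records are all hypothetical names for illustration; the micro-batch loop is only meant to echo how Spark Streaming treats a stream as a sequence of small batches.

```python
from collections import Counter


def count_by_user(events):
    """The business logic, written once: number of events per user."""
    return Counter(event["user"] for event in events)


def run_batch(events):
    """Batch mode: aggregate the whole data set in one pass."""
    return count_by_user(events)


def run_streaming(events, batch_size):
    """Streaming mode: apply the same aggregate to each micro-batch and merge."""
    running = Counter()
    for start in range(0, len(events), batch_size):
        micro_batch = events[start:start + batch_size]
        running.update(count_by_user(micro_batch))  # Counter.update adds counts
        yield dict(running)  # snapshot of the totals so far


events = [{"user": u} for u in ["a", "b", "a", "c", "a"]]
for snapshot in run_streaming(events, batch_size=2):
    print(snapshot)
print(dict(run_batch(events)))
```

Only `count_by_user` carries business logic, so there is one thing to test and maintain, and the last streaming snapshot agrees with the batch result.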
A bit dated, but hopefully not completely useless:
6 years of happy MacBook ownership in the books. Ye Old MacBook is still serving me well, although I think one of those new fangled Airs may be in the offing. Then again, that’s what I said about this time two years ago.
Pub date for my new novel, The Peripheral, moves up to Oct 28th.— William Gibson (@GreatDismal) July 11, 2014
Maybe I’ll have finished re-reading the Blue Ant trilogy by then.
The breadth of our coverage will be much clearer at this new version of FiveThirtyEight, which is launching Monday under the auspices of ESPN. We’ve expanded our staff from two full-time journalists to 20 and counting. Few of them will focus on politics exclusively; instead, our coverage will span five major subject areas — politics, economics, science, life and sports.
What I like about this particular post (go read it all, seriously), is the level of humility Silver expresses. A lot of people can, and do, do the math and follow the predictive approaches he espouses. But putting it to the principled service of informing The Public, within the current dynamic of Internet social media, is innovative. Computer Assisted Reporting was just a precursor. As a recovering new media hack I can appreciate all the roots of this iteration of his work.
Plus, I love this attitude:
It’s time for us to start making the news a little nerdier.
We launched Amazon S3 on March 14, 2006 with a press release and a simple blog post. We knew that the developer community was interested in and hungry for powerful, scalable, and useful web services and we were eager to see how they would respond.
Of course, I was dead wrong in my analysis. “S3 is not a gamechanger.” What was I thinking? Too much focus on the storage economics and not enough on the business model inflection point.
Packaging has always been a bit of a sore spot for Python modules. Maybe wheels are going in the right direction. Armin Ronacher has written a nice overview of how to put wheels into actual useful practice:
Wheels currently seem to have more traction than eggs. The development is more active, PyPI started to add support for them and because all the tools start to work for them it seems to be the better solution. Eggs currently only work if you use easy_install instead of pip which seems to be something very few people still do.
So there you have it. Python on wheels. It’s there, it kinda works, and it’s probably worth your time.
Brandon Rhodes penned a nice, light, practical introduction to Pandas while using “small” data:
I will admit it: I only thought to pull out Pandas when my Python script was nearly complete, because running print on a Pandas data frame would save me the trouble of formatting 12 rows of data by hand.
This post is a brief tour of the final script, written up as an IPython notebook and organized around five basic lessons that I learned about Pandas by applying it to this problem.
After some initial trepidation, I’m starting to enjoy working with Apache Avro. The schema language and options (avdl, avsc, avpr) are a bit obtuse, but the cross-language interop seems to work as advertised. Which is a good thing.
This looks like it will be bad timing for me, but as an AMPCamp 2013 and Spark Summit 2013 attendee, I can vouch for the event quality:
We are proud to announce that the 2014 Spark Summit will be held in San Francisco on June 30 – July 2 at the Westin St. Francis. Tickets are on sale now and can be purchased here.
For 2014, the Spark Summit has grown to a 3-day event. We’ll have two days of keynotes and presentations followed by one day of hands-on training. Attendees of the summit can choose between a 2-day conference-only pass or a 3-day conference and training pass.
If you can’t/didn’t get to Strata West 2014 this will be your next, best opportunity to get a deep dive into the Spark ecosystem.
I don’t know if it’s the best or the biggest, but DC has one damn well organized community of data enthusiasts:
Data Community DC (DC2) is an organization formed in mid-2012 to connect and promote the work of data professionals in the National Capital Region. We foster education, opportunity, and professional development through high-quality, community-driven events, content, resources, products and services. Our goal is to create a truly open and welcoming community of people who produce, consume, analyze, and work with data: data scientists, analysts, economists, programmers, researchers, and statisticians, regardless of industry, sector, or technology. As of January 2014, we are currently over 5,000 members strong from diverse industries and from a large variety of backgrounds.
But that’s what we do here in the DMV, build bureaucratic organizational structures. Ha, ha! Only serious.
Glad to see Trifacta ship their first product. I had a bit of an insider seat on the Lockheed Martin collaboration. They’ve iterated like crazy since I saw a very primitive version in June. Good luck to Dr. Hellerstein and the team, and of course Go Bears!
We are happy to announce the availability of Spark 0.9.0! Spark 0.9.0 is a major release and Spark’s largest release ever, with contributions from 83 developers. This release expands Spark’s standard libraries, introducing a new graph computation package (GraphX) and adding several new features to the machine learning and stream-processing packages. It also makes major improvements to the core engine, including external aggregations, a simplified H/A mode for long lived applications, and hardened YARN support.
Spark is an open source project on the move. Previously, in-memory distributed computation was the big selling point. Now it’s unification of disparate computational models cleanly embedded within the Hadoop ecosystem.
Running on top of Hadoop MapReduce and Apache Spark, the Apache Crunch™ library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users, there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for creating MapReduce pipelines.
AWS is one of the most popular cloud computing platforms. It provides everything from object storage (S3), elastically provisioned servers (EC2), databases as a service (RDS), payment processing (DevPay), virtualized networking (VPC and AWS Direct Connect), content delivery networks (CDN), monitoring (CloudWatch), queueing (SQS), and a whole lot more.
In this post I’ll be going over some tips, tricks, and general advice for getting started with Amazon Web Services (AWS). The majority of these are lessons we’ve learned in deploying and running our cloud SaaS product, JackDB, which runs entirely on AWS.
Greg Linden’s got a new batch of interesting links. That’s worth coming out of posting hibernation (n.b. not retirement).
- Transitioned my feedreading experience post-GReader
- Upgraded my WordPress installation and a couple of plug-ins
- Gone from Ubuntu 11.10 (pictured above, 591 days uptime wow!) to Ubuntu 12.10
Everything so far has been pretty painless, other than one lingering bug in feedbin that only seems to affect the feed for Tim Bray’s Ongoing. Unfortunately, this is one of my favorite feeds. Seems a bit suspect that the issue lingers as Bray and his feed have been around for like ever, and a good feed library should process his, of all people’s, correctly. But I’ll chalk it up to feedbin’s growing pains.
And ReadKit is passable, but I wouldn’t exactly call it … zippy … on Ye Olde MacBook.
Give or take a few due to potential timezone adjustments, in 6 hours Google Reader will go dark. Once again, shout out to all GReader staff past and present for delivering a ton of value for nothing out of my pocket. No heapings of scorn from this quarter. Execs made a business decision and I wasn’t exactly a paying customer. It was a good run while it lasted. Special kudos to Mihai Parparita for whipping together the eminently useful readerisdead toolkit on short notice. Somehow it successfully slurped down multiple gigabytes of Reader data for me across multiple accounts.
Moving onward! I’ve decided to go with feedbin.me since it’s approved for use with Mr. Reader. I realized that despite my affection for NetNewsWire, I now do the vast majority of my feedreading on my iPad, either on the couch or interstitially. So tilting towards my favorite reader there means the least dislocation. Meanwhile, Marco Arment somewhat put a stake in the prospects of NetNewsWire. To compensate on the desktop, I’m adding ReadKit to the mix.
However, like dangerousmeta, I waited until the last minute to make up my mind. I’m reserving the right to radically change my mind as I see fit.
The courage culture paints a tempting picture of how people end up with remarkable lives. It tells a story where you’re the main character, fighting evil forces, and ultimately triumphing after a brief but intense battle.
The reality is decidedly less exciting. Remarkable careers require that you become remarkably good. This takes time. But not necessarily a string of defiant rejections of some mysterious status quo.
Since today is my birthday, I try and reflect on things I can readily change up to stay out of unhealthy ruts or just to keep myself fresh. 540 days in a row is more than enough to prove that I can keep a posting streak alive. The conjunction of birthday, nice round number, and national holiday seems more than auspicious timing to give up that streak. Plus, I’ve been at this blogging thing off and on for well over 10 years. (Remember when it was all about “social software”?)
Even though I disagree a bit with the whole post, Greg Linden recently captured a bit of where I’m at:
I find my blogging here to be too useful to me to stop doing it. I have also embraced microblogging in its many forms. Yet I am left wondering if there is something we are all missing, something shorter than blogging and longer than tweets and different than both, that would encourage thoughtful, useful, relevant mass communication.
We are still far from ideal. A few years ago, it used to be that millions of blog and press articles flew past, some of which might pile up in an RSS reader, a few of which might get read. Now, millions of tweets, thousands of Facebook posts, and millions of articles fly past, some of which might be seen in an app, a few of which might get read. Attention is random; being seen is luck of the draw. We are far from ideal.
I don’t think blogging is dead. I’m not sure blogging was always about journalism. And I personally haven’t embraced microblogging, although Twitter makes for a great link stream. But blogging is too useful, and fun!, for me to stop cold. I will however, be slowing down a bit. Might be a couple of posts a week but probably no less than once per. I will, however, feel no obligation to any given frequency. So if you’ve been using this blog as your daily hit of excitement, I thank you for your attention, but encourage you to add another source or two as a replacement.
And even though the posting streak wasn’t particularly onerous in terms of time, I’ll be trying to turn the same habits of mind to side projects involving coding and data analysis. This also means my content should trend to more technical topics but we’ll see. With this year’s #3 pick in the NBA draft, the Washington Wizards’ luck is looking up, so they might be even more interesting to talk about in 2013-14. I also have this half-assed idea to do a series of 10 REM posts, reminiscing on 10 years of blogging, by trawling through the archives of Mass Programming Resistance, New Media Hack, and out into the wider web.
Be seeing you!
P.S. Feels like Greg has another start-up thread within him even though there’s already a clear direction for Geeky Ventures!
Check this out
    Enthought Python Distribution -- www.enthought.com
    Version: 7.3-2 (32-bit)

    Python 2.7.3 |EPD 7.3-2 (32-bit)| (default, Apr 12 2012, 11:28:34)
    [GCC 4.0.1 (Apple Inc. build 5493)] on darwin
    Type "credits", "demo" or "enthought" for more information.
    >>> import datetime
    >>> datetime.date(2013,05,26) - datetime.date(2011,12,3)
    datetime.timedelta(540)
    >>>
That’s my way of saying I’ve posted for 540 days straight. Also on the order of 599 out of 600. Yeah me!
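For anyone replaying that session on Python 3, here is a small sketch of the same arithmetic. The leading-zero literal `05` is accepted by the Python 2.7 interpreter shown above but is a syntax error in Python 3, so the month is written as `5`. The helper name `days_between` is made up for this example.

```python
from datetime import date


def days_between(start, end):
    """Return the number of whole days from start to end (negative if reversed)."""
    return (end - start).days


streak = days_between(date(2011, 12, 3), date(2013, 5, 26))
print(streak)  # 540, matching the timedelta in the session above
```

Subtracting two `date` objects yields a `timedelta`, and its `.days` attribute is the integer the streak claim rests on.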
In fall of 2011, just on a lark and as an experiment in behavior modification, I set a goal of posting for 365 days straight. I slipped up, got back on the horse and never looked back. Mission accomplished.
Got a few other things to say, so more after the break:
Yowsa! I actually got a link-out from dangerousmeta! I’m showing my blogging age here, but I’ve noted in the past my admiration for the site. Meanwhile, I don’t do any audience tracking or visit analytics at all for MPR. Pretty much have no idea who’s actually reading this stuff, if anyone. So it’s one of those old time, early ’oughts (yup, I go back that far) thrills to see the site title pop up in another feed.
Evan Miller’s statistical material for programmers might come in handy:
As my modest contribution to developer-kind, I’ve collected together the statistical formulas that I find to be most useful; this page presents them all in one place, a sort of statistical cheat-sheet for the practicing programmer.
Most of these formulas can be found in Wikipedia, but others are buried in journal articles or in professors’ web pages. They are all classical (not Bayesian), and to motivate them I have added concise commentary. I’ve also added links and references, so that even if you’re unfamiliar with the underlying concepts, you can go out and learn more. Wearing a red cape is optional.
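As a taste of the genre, here is one classical formula of the kind such a cheat sheet collects: the Wilson score interval for a proportion. This is a sketch written from the textbook definition, not copied from Miller's page, and the function name `wilson_interval` and its defaults are choices made for this example.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score confidence interval for a binomial proportion.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z
        * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials))
        / denom
    )
    return center - half_width, center + half_width


low, high = wilson_interval(50, 100)
print(round(low, 3), round(high, 3))
```

Unlike the naive "p plus or minus z standard errors" interval, the Wilson bounds stay inside the range 0 to 1 and behave sensibly for small samples, which is why the formula turns up so often in rating and conversion-rate code.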
Network partitions that is, and their implications for some common, popular, open source datastores. Kyle Kingsbury has cooked up “Call Me Maybe”
This article is part of Jepsen, a series on network partitions. We’re going to learn about distributed consensus, discuss the CAP theorem’s implications, and demonstrate how different databases behave under partition.
In-depth technical content on the Web. Who knew! You have been warned.
I was all set to put CommaFeed on the list of potential GReader replacements after seeing a mention coming across the MetaFilter feed. Then I started reading the MeFi comments and this one from Rhaomi really hit home:
It’s not just the interface and UI, which is pretty easy to clone. It’s the staggering infrastructure that powers it — the sophisticated search crawlers scouring the web and delivering near-real-time updates, the industrial-scale server farms that store untold petabytes of searchable text and images relevant to you (much of it from long-vanished sources), the ubiquitous Google name that makes the service a popular platform for innumerable third-party apps, scripts, and extensions.
It’s possible to code up something that looks and feels a lot like Reader in three months, with the same view types and shortcuts. But to replicate its core functionality (fast updates, archive search, stability, universal access, wide interoperability) takes Google-scale engineering I doubt anybody short of Microsoft/Yahoo can emulate. It was very nearly a public service, and it’s going to be frustrating trying to downsize expectations for such a core web service to what a startup, even a subscription-backed one, can accomplish.
Not to mention the current CommaFeed landing page annoyingly doesn’t have any type of “About” page, just a forced funnel to registration. Hey, I like to at least be sweet talked a little before wasting a password!
The upside of NewsBlur has been that it works really well at its core purpose, fetching feeds and displaying them for the user. It also has solid native mobile clients, enabling you to keep read status in sync across devices.
That’s a good enough endorsement for me. With the clock ticking on the GReader shutdown, I’ll give NewsBlur the first crack at filling the void for me.
C’mon Bears, cut it out. It’s getting embarrassing how much Spark related output there has been recently. In a good way!
From social networks to targeted advertising, big graphs capture the structure in data and are central to recent advances in machine learning and data mining. Unfortunately, directly applying existing data-parallel tools to graph computation tasks can be cumbersome and inefficient. The need for intuitive, scalable tools for graph computation has led to the development of new graph-parallel systems (e.g. Pregel, PowerGraph) which are designed to efficiently execute graph algorithms. Unfortunately, these new graph-parallel systems do not address the challenges of graph construction and transformation which are often just as problematic as the subsequent computation. Furthermore, existing graph-parallel systems provide limited fault-tolerance and support for interactive data mining.
We introduce GraphX, which combines the advantages of both data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark data-parallel framework. We leverage new ideas in distributed graph representation to efficiently distribute graphs as tabular data-structures. Similarly, we leverage advances in data-flow systems to exploit in-memory computation and fault-tolerance. We provide powerful new operations to simplify graph construction and transformation. Using these primitives we implement the PowerGraph and Pregel abstractions in less than 20 lines of code. Finally, by exploiting the Scala foundation of Spark, we enable users to interactively load, transform, and compute on massive graphs.
Need to drill in to see how GraphX stacks up to the current spate of “big data” graph toolkits, especially GraphLab. Ben Lorica reports that GraphX is more oriented towards programmer productivity as opposed to raw performance:
GraphX is a new, fault-tolerant, framework that runs within Spark. Its core data structure is an immutable graph (Resilient Distributed Graph – or RDG), and GraphX programs are a sequence of transformations on RDG’s (with each transformation yielding a new RDG). Transformations on RDG’s can affect nodes, edges, or both (depending on the state of neighboring edges and nodes). GraphX greatly enhances productivity by simplifying a range of tasks (graph loading, construction, transformation, and computations). But it does so at the expense of performance: early prototype algorithms written in GraphX were slower than those written in GraphLab/PowerGraph.
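For readers who have not met the Pregel abstraction mentioned in the abstract, here is a toy, single-machine imitation of its superstep loop in plain Python. It is not GraphX's API (GraphX is a Scala library inside Spark) and says nothing about performance. The function name `pregel_min_label` and the choice of connected components as the example algorithm are assumptions made purely for illustration.

```python
def pregel_min_label(edges, max_supersteps=50):
    """Label each vertex with the smallest vertex id in its connected component.

    Each superstep: active vertices send their label to their neighbors; a
    vertex that receives a smaller label adopts it and stays active, and
    every other vertex halts.
    """
    # Build an undirected adjacency list from the edge list.
    neighbors = {}
    for src, dst in edges:
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)

    labels = {v: v for v in neighbors}  # per-vertex state
    active = set(neighbors)             # every vertex starts active

    for _ in range(max_supersteps):
        if not active:
            break  # all vertices have halted
        # Message passing: keep the smallest label sent to each vertex.
        inbox = {}
        for v in active:
            for n in neighbors[v]:
                inbox[n] = min(inbox.get(n, labels[v]), labels[v])
        # Vertex program: adopt a smaller label and stay active, else halt.
        active = set()
        for v, message in inbox.items():
            if message < labels[v]:
                labels[v] = message
                active.add(v)
    return labels


print(pregel_min_label([(1, 2), (2, 3), (4, 5)]))
```

Messages for a superstep are all computed before any label is updated, which is what makes the loop bulk-synchronous. A distributed system spreads the same two phases across machines.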
Wow! Basho’s Ricon East conference was a little more diverse and wide ranging than I anticipated. This was evidenced by Anders Pearson’s summary of the talks he attended. For example, this lede on ZooKeeper for the Skeptical Architect by Camille Fournier, VP of Technical Architecture, Rent the Runway:
Camille presented ZooKeeper from the perspective of an architect who is a ZooKeeper committer, has done large deployments of it at her previous employer (Goldman Sachs), left to start her own company, and that company doesn’t use ZooKeeper. In other words, taking a very balanced engineering view of what ZooKeeper is appropriate for and where you might not want to use it.
Of the talks Pearson summarized, only two were by Basho employees while the rest were by some pretty serious distributed folks such as Margo Seltzer and Theo Schlossnagle. Plus there was a healthy dose of industry war story experience at scale.
Good on Basho!
snakebite 1.0.0: Pure Python HDFS client bit.ly/16qtoVT— Python Package Index (@pypi) May 17, 2013
Another annoyance we had with Hadoop (and in particular HDFS) is that interacting with it is quite slow. For example, when you run hadoop fs -ls /, a Java virtual machine is started, a lot of Hadoop JARs are loaded and the communication with the NameNode is done, before displaying the result. This takes at least a couple of seconds and can become slightly annoying. This gets even worse when you do a lot of existence checks on HDFS; something we do a lot with luigi, to see if the output of a job exists.
So, to circumvent slow interaction with HDFS and having a native solution for Python, we’ve created Snakebite, a pure Python HDFS client that only uses Protocol Buffers to communicate with HDFS. And since this might be interesting for others, we decided to Open Source it at http://github.com/spotify/snakebite.
Roger that on the annoyingly slow response of hadoop fs. Thanks Spotify.
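One rough way to put a number on that startup cost is to time the command-line invocation itself. The sketch below uses only the standard library; `time_command` is a hypothetical helper written for this example, and the baseline it times is an empty Python interpreter, since a Hadoop install cannot be assumed on the machine running it.

```python
import subprocess
import sys
import time


def time_command(argv, runs=3):
    """Return the best wall-clock time, in seconds, over `runs` invocations.

    Raises FileNotFoundError if the executable in argv[0] is not installed.
    """
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        best = min(best, time.perf_counter() - start)
    return best


# Baseline: start a Python interpreter that does nothing and exits.
baseline = time_command([sys.executable, "-c", "pass"])
print("python -c pass: {:.3f}s".format(baseline))

# On a machine with the Hadoop CLI on the PATH, the comparison would be:
#   time_command(["hadoop", "fs", "-ls", "/"])
```

Taking the best of a few runs filters out one-off disk cache misses. Whatever remains for the Hadoop command is mostly the fixed cost of starting the JVM and loading JARs on every call, which is the overhead the quoted post says a pure Python client avoids.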