I’ve mentioned before the fine work that Mark Litwintschik does putting data management systems through their paces using a dataset of 1.1 billion taxi rides. He’s back with another post on BrytlytDB.
BrytlytDB is an in-GPU-memory database built on top of PostgreSQL. It’s operated using many of PostgreSQL’s command line utilities, it’s wire protocol compatible so third-party PostgreSQL clients can connect to BrytlytDB and queries are even parsed, planned and optimised by PostgreSQL’s regular codebase before the execution plan is passed off to GPU-optimised portions of code BrytlytDB offer.
There have been quite a few posts by Litwintschik since I noted his efforts. What caught my eye this time is the mention of the, new to me, BrytlytDB. BrytlytDB apparently leverages a lot of the core capabilities of the PostgreSQL code base and presents a lot of API compatibility. To quote from the homepage, “Brytlyt combines the power of GPUs with patent pending IP and integrates with PostgreSQL.”
I probably have a bit of myopia, but it feels like PostgreSQL essentially defines the baseline for commercial DBMS functionality these days.
And once again, I have to commend Litwintschik on the thoroughness of his reporting on these posts. One of the few technical bloggers who provides enough detail to actually approach “reproducibility.”
In this episode of the ARCHITECHT Show, Ion Stoica talks about the promise of real-time data and machine learning he’s pursuing with the new RISELab project he directs at UC-Berkeley, along with some other big names in big data. Stoica previously was director of the university’s AMPLab, which created and helped to mature technologies such as Apache Spark, Apache Mesos and Alluxio. Stoica is also co-founder and executive chairman of Apache Spark startup Databricks, and he shares some insights into that company’s business and the evolution of the big data ecosystem.
eBPF/bcc enables us to write a new range of tools to deeply troubleshoot, trace and track issues in places previously unreachable without patching the kernel. Tracepoints are also quite handy as they give a good hint on interesting places, removing the need to tediously read the kernel code and can be placed in portions of the code that would otherwise be unreachable from kprobes, like inline or static functions.
Also, I learned about the
Link parkin’. A free(ish) e-book comparing and contrasting the current leading frameworks for messaging. Free as in “give us contact info first” free. Haven’t read yet, YMMV.
Author and consultant Jakub Korab describes use cases and design choices that lead developers to very different approaches for developing message-based systems. You’ll come away with a high-level understanding of both ActiveMQ and Kafka, including how they should and should not be used, how they handle concerns such as throughput and high-availability, and what to look out for when considering other messaging technologies in future.
I’ll probably grab it out of message nerd curiosity. Also wondering if the book touches on somewhat divergent frameworks like NATS.
I promised to revisit the topic of Kafka’s new “exactly once processing.” A while ago, Tyler Treat generated a relatively popular post entitled “You Cannot Have Exactly Once Delivery”. Treat came back and recontextualized the original argument in the face of Confluent’s recent work.
First, let me say what Confluent has accomplished with Kafka is an impressive achievement and one worth celebrating. They made a monumental effort to implement these semantics, and it paid off. The intention of this post is not to minimize any of that work but to try to clarify a few key points and hopefully cut down on some of the misinformation and noise.
The gist is that the Kafka Streams approach is a fairly closed framework that works with the messaging system to ensure a particular semantics correctly with reasonable performance. That’s a good thing. Definitely worth a read if you’re a messaging junkie.
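For fellow messaging junkies, here is a minimal sketch of the general idea behind getting "effectively once" results on top of at-least-once delivery: the consumer remembers which message IDs it has already applied and ignores redeliveries. This is not how Kafka or Kafka Streams implements its semantics; the class, field names, and message IDs are all hypothetical, and the state lives in memory only.

```python
from dataclasses import dataclass, field


@dataclass
class IdempotentCounter:
    """Applies each message at most once, keyed by a unique message ID."""

    total: int = 0
    seen_ids: set = field(default_factory=set)

    def handle(self, message_id: str, amount: int) -> bool:
        # A redelivered message carries an ID we have already recorded: skip it.
        if message_id in self.seen_ids:
            return False
        # In a real system the state update and the ID record would have to be
        # committed atomically; here both simply live in memory.
        self.seen_ids.add(message_id)
        self.total += amount
        return True


# Simulated at-least-once delivery: message "m2" arrives twice.
deliveries = [("m1", 10), ("m2", 5), ("m2", 5), ("m3", 1)]
counter = IdempotentCounter()
applied = [counter.handle(mid, amt) for mid, amt in deliveries]
```

The duplicate still gets delivered; it just has no second effect. That only holds because the consumer and its state cooperate, which is roughly the "closed framework" point above.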
It’s been over 7 years since MarsEdit 3 was released. Typically I would like to maintain a schedule of releasing major upgrades every two to three years. This time, a variety of unexpected challenges led to a longer and longer delay.
The good news? MarsEdit 4 is finally shaping up. I plan to release the update later this year.
Over seven years ago, I hypothesized about ESPN falling from, what looked like at the time, an unassailable perch. All my speculation turned out to be off base, but ESPN has been taking it in the shorts recently. Witness The Athletic preparing to swoop on newly available talent, according to Bloomberg.
2017 is shaping up to be a rough year for sports journalism. ESPN, Fox Sports, Sports Illustrated, Bleacher Report, and Yahoo Sports have all cut staff positions in the last several months, showing the deep cracks in the predominant business model of online sports news. The founders of the Athletic, an 18-month-old online sports publication, see opportunity in the struggles of the biggest companies. As the news of the cuts kept coming, co-founders Alex Mather and Adam Hansmann, who have no previous journalism experience, hastily pulled together $5.8 million in new capital from investors, in a round they closed last week. The plan is to scoop up laid-off writers, and put them to work building a new kind of sports news operation as the traditional industry leaders are in retreat.
From what I can tell, a combo of mobile / digital (cannibalizing cable subscriptions on volume and price) and demographics (younger folks interested in “edgier” not SportsCenter) are the trends causing fits in Bristol, Conn.
These four companies underscore the unbroken link between on-demand computing, big data, and machine learning. While the ‘90s and “oughties” were about building up the front-end user interface—and in the process, making powerful technology simple enough to find billions of users—more recent years have been about laying the groundwork for adaptive, always-aware organizations.
I wasn’t there to view the startup sales pitches or do any voting. However, Alistair Croll’s assertion that these are AI startups seems a bit off. The short descriptions make three seem to be more data wrangling / harnessing companies. The fourth is a media analytics platform for the e-sports era. There might be some AI hidden in there, and maybe that’s the point, but they sure don’t feel like Bradford Cross’ vertical AI startups.
I’ve actually been enjoying Podcast.__init__ for a bit now. (The Podcast About Python and the People Who Make It Great.) Recently, Tobias Macey, the host, had an interview with Tim Abbott. Abbott’s the lead developer of the open source Zulip project, which is a “modern group chat” application.
As good as Podcast.__init__ has been, this was a really interesting interview. First, Abbott had successfully exited two startups and spent some time deeply embedded in the Dropbox engineering team. So there was some interesting technical organization discussion. Second, Abbott had some very cogent thoughts on how to create a vibrant open source project. A couple of key things that stood out to me were making onboarding of new contributors as frictionless as possible and systematically externalizing his knowledge into visible documentation, as opposed to invisible e-mails.
I’m also sort of curious if Zephyr, which inspired Zulip, is still used at M.I.T. The community of Zephyr users must be vanishingly small, so I was surprised to hear of Abbott’s fondness for it. He strikes me as a true Engineer.
At my last gig, we routinely had breakouts of bikeshedding arguments regarding the mandatory, organizational discussion, group chat application. A year or two ago, as a gag, I had half a mind to propose Zulip to the company, but wisely thought better of it.
Just wanted to mark the fact that I’m a fan of Overcast for subscribing to and playing podcasts. As a technonerd, I might do with a few more bells and whistles, but the overall simplicity makes it a compelling app on the iPhone.
Overcast is good enough that I paid for an annual subscription. YMMV.
Diggin’ in the starred items crates and fell into this post from Camille Fournier about some field-earned wisdom on microservices:
This article is going to have two examples. The first is the rough way “microservices” was deployed in my last gig, and why I made the decisions I made in the architecture. The second is an example of an architecture that is much closer to the “beautiful dream” microservices as I have heard it preached, for architectures that are stream-focused.
Not too deep into the weeds but enough technical insight to be useful. Key takeaway is to not get hyper-aggressive about decentralizing data management.
If you want to say “my database is better than your database” then you really also need to specify “for what?”. And if you want to evaluate whether graph databases really do earn their keep as compared to relational databases, you really want to do the comparison on the home turf of the graph databases – the use cases they claim to be good at.
The final outcome is that traditional RDBMS engines, using straight SQL instead of a specialized graph query language, have much better performance. Gremlin takes it on the chin a bit.
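To make "straight SQL" a bit more concrete, here is a small sketch of a graph traversal expressed as a recursive common table expression, run against an in-memory SQLite database. It is not one of the queries from the comparison, and the `edges` table, its columns, and the sample data are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?)",
    [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e"), ("x", "y")],
)


def reachable_from(start):
    """Return every node reachable from `start` by following directed edges."""
    rows = conn.execute(
        """
        WITH RECURSIVE reach(node) AS (
            SELECT ?
            UNION  -- UNION (not UNION ALL) drops repeats, so cycles terminate
            SELECT edges.dst FROM edges JOIN reach ON edges.src = reach.node
        )
        SELECT node FROM reach WHERE node <> ? ORDER BY node
        """,
        (start, start),
    ).fetchall()
    return [row[0] for row in rows]


result = reachable_from("a")
```

A graph query language would say the same thing more tersely; the point is only that a relational engine can express the traversal at all.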
I’ve been meaning to link park the Kotlin programming language. In general, I’m just a programming language nerd, and when Google promoted Kotlin for official Android programming, the language hit my radar. Via some random Web surfing (people still do that, right?) I came across this brief RedMonk overview of why Kotlin is gaining in popularity:
The short version is that Kotlin is a JVM-based language originally released in 2011 by the JetBrains (makers of IntelliJ) team from St Petersburg, Russia. Like Scala, an inspiration for the language, Kotlin is intended to improve on the Java foundations both syntactically and otherwise while trading on that platform’s ubiquity.
I enjoyed Derrick Harris’ interview with the founders of StackRox:
In this episode of the ARCHITECHT Show, StackRox co-founders Sameer Bhalotra and Ali Golshan break down the state of container security and the new technology they have built to solve it. Bhalotra and Golshan have deep histories doing cybersecurity everywhere from startups to Google to the White House, which they draw on to discuss the security threats and opportunities that microservices present, as well as best practices for cybersecurity in general. This week, StackRox emerged from stealth mode after building the product and company for nearly 3 years.
Sameer and Ali had interestingly different backgrounds coming from government and enterprise consulting. From a total nerd perspective, they came across as a skoosh slick in their answers and choreographed handoffs, but I’ll chalk that up to being well-polished founders who’ve been on the fundraising and customer development trail for a while. That’s how you gotta sound to get C-suite types to fork over the cash.
But on the surface there are some neat ideas in the StackRox product. In the same way that networking technology has become disaggregated, microservices architectures have disaggregated applications and allowed for deeper introspection, monitoring, and remediation.
Have to say, I’ve been impressed by the guests that Harris has been able to get for his interviews.
If it happens, I could get into a graphic novel version of Takeshi Kovacs.
Author Richard K. Morgan will bring Altered Carbon, the Philip K. Dick Award-winning novel published by Gollancz in the UK and soon to be adapted as a Netflix television series, to Dynamite Entertainment with all-new, in-continuity stories, exclusively available in the comic book and graphic novel formats.
Heck, this might be enough motivation to sign up for Netflix.
As I’ve said before, there’s been a bit of gardening going on here behind the scenes. This has made me revisit a number of older posts on this here blog.
Circa 2010, I was seriously investigating ways to get mobile data access for a reasonable price. The number of posts regarding the HTC Evo as a potential phone + hotspot combo is impressive. That’s a cute little time capsule of technology.
Not to mention there used to be some company called Palm back then.
Eventually I wound up just getting an iPhone, which at the time only provided 2GB of 3G connectivity per month. Eight years later, with rollover, I usually have 8GB of LTE for two devices for around the same price. Unlimited text messages to boot. The 8GB isn’t particularly impressive, but the rest of the kit versus the price is of note.
I’m still on the iPhone (6S Plus), but becoming really intrigued by a top of the line Google Pixel on Google Fi. A friend of mine speaks highly of the Android experience and iOS isn’t providing any level of excitement to me these days.
Times may have changed but technolust never goes away forever!
Google Cloud Platform hosts a number of public datasets:
Public Datasets on Google Cloud Platform makes it easy for users to access and analyze data in the cloud. These datasets are freely hosted and accessible using a variety of data warehouse and analytics software, from open source Apache Spark to cutting edge Google technologies like Google BigQuery and Google Cloud Dataflow. From structured genomic or encyclopedic data to unstructured climate data, Public Datasets provide a playground for those new to big data and data analysis and a powerful repository for skilled researchers. You can also integrate with your application to add valuable insights for your users. Whatever your use case, these datasets are freely available on GCP.
The thing I find surprising is that the Common Crawl web archives aren’t on GCP, especially given Google’s web heritage. Apropos the late, lamented Fairness Doctrine, Common Crawl is hosted on AWS. There was a good, recent GCP Podcast episode with the Public Datasets team that had an e-mail contact. Maybe I’ll fire off a question.
Here be dragons. I know from personal experience but Hynek Schlawack explains why way better than me.
Proper cleanup when terminating your application isn’t less important when it’s running inside of a Docker container. Although it only comes down to making sure signals reach your application and handling them, there’s a bunch of things that can go wrong.
Really, as Hynek says, “Avoid being PID 1.”
A few years ago, I had the pleasure of meeting and chit-chatting with Paco Nathan. Back then he was with Databricks, but now he’s at O’Reilly Media doing interesting things with Jupyter and learning. I enjoyed a couple of his recent presentations. The first on AI inside O’Reilly Media.
And one on a TextRank rewrite in Python.
Yowsa! That slideshare shortcode actually worked. We’ll see how it comes out in the RSS feed.
What is Iris?
Iris is designed to help non-expert programmers who understand what kinds of analyses they need to run (for example, creating a logistic regression model, or computing a Mann-Whitney U test) but not how to write the code to accomplish these goals. Iris also allows expert programmers to accomplish data science tasks more quickly.
Iris supports a broad set of functionality available in popular Python scientific libraries such as scipy and scikit-learn, and we intend to open source the system upon release.
And from a deeper explainer:
Iris supports interactive command combination through a conversational model inspired by linguistic theory and programming language interpreters. Our approach allows us to leverage a simple language model to enable complex workflows: for example, allowing you to converse with the system to build a classifier based on a bag-of-words embedding model, or compare the inauguration speeches of Obama and Trump through lexical analyses.
Iris is an academic research project led by Ethan Fast of the Stanford CS department. I’ll be interested to see how far this gets. Conversational agents that are domain specific, vertically integrated with an environment, and targeted at complex activities seem a bit more promising than the low bar tasks industry currently seems to be focusing on (cough, meeting scheduling, cough). Also feels like a “right moment” with Siri, Cortana, Alexa, Slackbots, Twitterbots, Xiaoice, Tay, and friends establishing a beachhead but bigger wins coming down the road.
Better late than never.
Hip Hop, can we get 30,000 RTs for our 30th Anniversary? pic.twitter.com/MVsrl4qbZi— Eric B and Rakim™ (@EricBandRakim) July 8, 2017
“You thought I was doughnut. You tried to glaze me.”
The funny thing about the iconic Paid In Full album is that I always found the album version of Eric B. is President ultra irritating. I was lucky enough to purchase the 12″ single well before the album came out. The single cut didn’t have that annoying grinding sound all over it. It was just the simple beat, Eric B. scratching, and Rakim’s dynamically unique rap style. That’s the real track to me.
30 years!! Damn time flies!
First of all, let me start by saying that literally everybody is doing (or claiming to do) AI in the bay area. AI has inflamed the spirits of pretty much every single software engineer, data scientist, business developer, talent scout, and VC in the greater San Francisco area.
All tools and services presented at the conference embed some form of machine intelligence, and scientists are the new cool kids on the block. Software engineering has probably reached an all-time low in terms of coolness in the bay area, and regarded almost as the “necessary evil” in order to unleash the next AI interface. This is somewhat counter-intuitive, as actually Machine Learning and AI are more like the raisins in raisin bread, as Peter Norvig and Marcos Sponton say.
I like the raisin bread analogy, which means the data platform engineering aspect of building AI products might be seen as a lucrative “dirty job”.
Seriously. How did I not know about this?
Since December 16, 2006 MixesDB is the database for DJ mixes, radio shows and podcasts.
Together with their dates, tracklists, file details and flyers a useful collection of artists, events, clubs, and podcasts is built:
The mixes are added by music lovers from all over the world. Our slogan: We care about correctness because most do not.
We don’t offer any downloads or secret ways to get download links.
Also Why No Padlock? helped me figure out why Chrome wasn’t giving me the prized lock. Which then led to installing the SSL Insecure Content Fixer plugin for WordPress. Now my image URLs are cleaned up automagically.
No thanks to systemd under Ubuntu Linux 16.04, which got itself twisted up and held me back from upgrading to Ubuntu 17.04. Boiled down to moving some arcane config file out of the way to allow a couple hundred odd packages to upgrade. That’s actually where the majority of my time was spent in this exercise.
Now I just have to figure out what all the certificate mumbo jumbo actually means.
Traveling in the Kubernetes orbit, I couldn’t help but hear about some new Istio thing. Unfortunately, I didn’t really have time to dig in. Google Cloud Platform Podcast during the commute for the win:
Due to popular demand, this week Francesc and Mark are joined by Product Manager Varun Talwar and Senior Staff Software Engineer Sven Mawson to discuss all things Istio, an open platform to connect, manage, and secure microservices.
This document introduces Istio: an open platform to connect, manage, and secure microservices. Istio provides an easy way to create a network of deployed services with load balancing, service-to-service authentication, monitoring, and more, without requiring any changes in service code. You add Istio support to services by deploying a special sidecar proxy throughout your environment that intercepts all network communication between microservices, configured and managed using Istio’s control plane functionality.
Istio currently only supports service deployment on Kubernetes, though other environments will be supported in future versions.
Serendipitously, the latest episode of The ArchiTECHt Show podcast featured an interview with the CEO of Buoyant, William Morgan, about Linkerd, which seems to be an alternative product for service meshes. From the Linkerd site:
Linkerd is an open source, scalable service mesh for cloud-native applications.
Linkerd was built to solve the problems we found operating large production systems at companies like Twitter, Yahoo, Google and Microsoft. In our experience, the source of the most complex, surprising, and emergent behavior was usually not the services themselves, but the communication between services. Linkerd addresses these problems not just by controlling the mechanics of this communication but by providing a layer of abstraction on top of it.
Both platforms essentially put a proxy layer between the microservices and the underlying network transport, a point the GCP Podcast made crystal clear. A bunch of functionality related to distributed services (e.g., load balancing, retries, circuit breaking) can then be factored out of the apps and into the service mesh. Istio is k8s only at the moment, while Linkerd is friendly with other orchestration tools like Marathon on Mesos.
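As a reminder of what kind of logic gets factored out, here is a rough sketch of a circuit breaker, the sort of thing each app otherwise ends up hand-rolling. It has nothing to do with how Istio or Linkerd actually implement theirs; the class, parameter names, and thresholds are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so it can be faked
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: allow one trial call (the "half-open" state).
            # One more failure will reopen the circuit immediately.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

Usage is just `breaker.call(fetch_user, user_id)` around each remote call, where `fetch_user` stands in for whatever client function you have. With a service mesh, the sidecar proxy does the equivalent and the service code stays oblivious.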
Once upon a time, I worked on a project that could have really used this technology.
From the GitHub repo:
Winton Kafka Streams is a Python implementation of Apache Kafka’s Streams API. It builds on Confluent’s librdkafka (a high performance C library implementing the Kafka protocol) and the Confluent Python Kafka library to achieve this.
The power and simplicity of both Python and Kafka’s Streams API combined opens the streaming model to many more people and applications.
Wasn’t really into using Java to tinker with Kafka Streams, but now I’m intrigued. Wonder if the Python library is at feature parity with the Java one?
According to DoesMySiteNeedHTTPS.com, yes.
“But my site doesn’t have forms or collect information from users.”
Doesn’t matter. HTTPS protects more than just form data! HTTPS keeps the URLs, headers, and contents of all transferred pages confidential.
Looks like I have some work to do.
Adrian Colyer is taking a well-deserved, short break from The Morning Paper. He presented a few back pointers to top material from this last “term” of reading. In addition, there are some ways of translating the scale of differences that occasionally pop up in computing into human recognizable scales:
And here’s something a little different which didn’t quite fit in any particular paper review as a fun thought to leave you with for now: developing an intuition for orders of magnitude and some of the numbers you see in CS papers.
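One common way to build that intuition is to stretch time so that one nanosecond becomes one second and see what everything else turns into. A quick sketch follows; the latency figures in the comments are widely quoted ballpark numbers that I am assuming for illustration, not values taken from Colyer's post, and the helper names are my own.

```python
def humanize(seconds):
    """Render a duration in the largest unit that keeps the value >= 1."""
    units = [
        ("years", 365 * 24 * 3600),
        ("days", 24 * 3600),
        ("hours", 3600),
        ("minutes", 60),
        ("seconds", 1),
    ]
    for name, size in units:
        if seconds >= size:
            return f"{seconds / size:.1f} {name}"
    return f"{seconds:.1f} seconds"


def scaled(latency_ns):
    """Human-scale equivalent when 1 ns is stretched to 1 s."""
    return humanize(float(latency_ns))


# Assumed ballpark latencies in nanoseconds (illustrative only).
main_memory = scaled(100)                  # main memory reference
disk_seek = scaled(10_000_000)             # spinning-disk seek, about 10 ms
cross_continent = scaled(150_000_000)      # long-haul round trip, about 150 ms
```

On that scale a memory reference is a couple of minutes, a disk seek is months, and a cross-continent round trip is measured in years.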
Speaking of Python, I never knew Intel had their own custom, performance supercharged version:
The Intel® Distribution for Python* is an easy-to-access, integrated package that delivers faster Python* application performance on modern Intel® platforms. Available for Windows*, Linux* and macOS*.
Those stars are for a link to the varied trademarks. Gotta love those big corporate lawyers. That’s probably also why you have to go through an annoying registration form to download the bundle.
While Jupyter is quite hot these days, it did take a while to emerge. Some folks conflate IPython and Jupyter in casual conversation, but the projects have had distinctly different paths. Karlijn Willems did a deep dive into the differences and even got some feedback and input from the creators:
Today’s blog post intends to illustrate some of the core differences between the two more explicitly, not only starting from the origins of both to explain how the two relate, but also covering some specific features that are either part of one or the other, so that it will be easier for you to make the distinction between the two!
Consider also reading DataCamp’s Definitive Guide to Jupyter Notebook for tips and tricks, best practices, examples, and much more.
There are definitely some interesting twists and turns.
Julia Evans put together an extended blog post on “Linux tracing systems & how they fit together”.
The thing I learned last week that helped me really understand was – you can split linux tracing systems into data sources (where the tracing data comes from), mechanisms for collecting data for those sources (like “ftrace”) and tracing frontends (the tool you actually interact with to collect/analyse data). The overall picture is still kind of fragmented and confusing, but it’s at least a more approachable fragmented/confusing system.
Even better, she made a nice illustrated ’zine to go with.
Wrote a really quick zine out of the linux tracing tools post from yesterday. It’s not super fancy but here it is. It’s 12 pages, there’s a print version & a version to read on your computer as usual.
Been listening to a few episodes of the Datanauts podcast, and as I anticipated, the material is right up my alley. Actually, the hosts Ethan C. Banks and Chris Wahl really impressed me in a discussion of Apache Geode, an in-memory data grid (IMDG). While Banks and Wahl clearly aren’t distributed systems researchers / hackers, they asked exceptionally good, fundamental questions about how Geode fares under various conditions (e.g., network partitions).
So far, so good. I can recommend a subscription in your podcatcher to the Datanauts.
Something interesting is happening over at the University of Chicago.
As part of a plan to greatly increase the scale, scope and impact of computer science research and education across the University community, the University of Chicago has appointed prominent data science scholar Michael Franklin to chair its Department of Computer Science and to serve as senior advisor to the provost on computation and data science.
Ben Y. Zhao is a UC Berkeley CS Division alum, well-regarded systems researcher, and formerly at UC Santa Barbara. He just moved over to the University of Chicago this month.
I am Neubauer Professor of Computer Science at University of Chicago. Prior to joining UChicago, I was a Professor of Computer Science at UC Santa Barbara. My research covers a range of topics from large-distributed networks and systems, HCI, security and privacy, and wireless / mobile systems, mostly from a data-driven perspective. My current projects are focused on three areas: data-driven models of user behavior/interactions, security of online and mobile communities, and wireless systems and protocols. My work targets a range of top conferences, including WWW/IMC, Mobicom/SIGCOMM/NSDI, UsenixSecurity/NDSS/S&P, CHI/CSCW.
Luis Bettencourt, of the Santa Fe Institute, applies techniques from the complex systems community to the study of urban dynamics. He just joined up with the U of C, although he’ll maintain an appointment as external faculty to SFI.
Luis M. Bettencourt, a leading researcher in urban science and complex systems, has been appointed the inaugural Pritzker Director of the Mansueto Institute for Urban Innovation at the University of Chicago.
… In his research, Bettencourt uses the growing availability of data worldwide on topics ranging from transportation to housing to understand cities in quantitative and predictive ways. He is dedicated to creating new urban theory to explain how cities thrive and the challenges they face, based on the integration of ideas from urban disciplines such as geography, economics and sociology with methodologies from the natural and computational sciences. He also focuses on understanding the role of innovation and technological change as a driver of economic growth and human development in cities, across the world and throughout history. One of his most influential research projects has helped explain the systematic association between the size of urban areas and higher rates of economic productivity and innovation, as well as higher costs of living and violent crime.
I don’t know if these are coordinated events and I haven’t dug into any other recent appointments. Even if not, this is a kernel of talent that a world class university can build around for incredible outcomes. Also, with Northwestern University targeting a big expansion of the Computer Science program, a nice, metropolitan, bi-polar axis of computing research could emerge.
Yesterday I finished reading William Gibson’s Distrust That Particular Flavor. I’m on record as being a Gibson fanboy and a completist for all of his fiction, to the best of my knowledge. Yet Distrust That Particular Flavor had been sitting on my virtual ToRead pile for quite a long time.
The book is a collection of articles, book introductions, and speeches by Gibson, across a variety of venues: Wired, The New York Times, Time, Book Expo America, etc. As a former Wired subscriber, I was familiar with his style of journalism and had already read many of the articles, admittedly quite a while ago. Gibson self-acknowledges that he’s not really a journalist and is in fact not quite comfortable writing non-fiction. Thus, these snippets are truly of a distinctive flavor.
Overall, these are mostly interesting just as a time capsule of technological and cultural shifts, the most recent dated from the year 2006. Anybody remember AltaVista? There are a couple of standouts like The Road to Oceania but nothing earth shattering.
The collection also provides some insights into Gibson’s thinking as particular novels developed. William Gibson’s Filmless Festival is a useful precursor to Pattern Recognition.
Ultimately, Distrust That Particular Flavor is worthwhile if one is deep into @GreatDismal, Gibson’s handle for his prolific Twitter output. If not, no worries if you skip it.
Greg Linden and I go way back, although very superficially. He linked to a post on my old blog. I was a user of Findory. We may have exchanged a few emails. I still subscribe to his blog’s feed.
Even so, I get a kick out of “knowing” someone who’s had a big impact on the computing industry, as evidenced by Greg, along with collaborators Brent Smith and Jeremy York, receiving a Test of Time award for their research article on Amazon’s early item-based recommendation system.
On its 20th anniversary, the editorial board created its first ever “The Test of Time” award. I’m honored to say they gave it to our 2003 article, “Amazon.com Recommendations: Item-to-Item Collaborative Filtering”, which continues to be accessed, cited, and used in industry and research many years after its original publication.
Their follow-up article is also quite enjoyable. It provides practical insights into actually deploying such a recommendation algorithm, especially as experience has been gained over time. Congratulations!
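For anyone who hasn't seen the idea, here is a toy sketch of item-to-item similarity: count how often two items show up in the same customer's purchases and normalize by how popular each item is (cosine similarity over binary purchase vectors). This is a simplified illustration, not Amazon's production algorithm; the basket data and function names are made up.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarities(baskets):
    """Cosine similarity between items, treating each basket as one customer."""
    item_count = defaultdict(int)
    pair_count = defaultdict(int)
    for basket in baskets:
        items = sorted(set(basket))
        for item in items:
            item_count[item] += 1
        for a, b in combinations(items, 2):
            pair_count[(a, b)] += 1
    sims = defaultdict(dict)
    for (a, b), together in pair_count.items():
        score = together / sqrt(item_count[a] * item_count[b])
        sims[a][b] = score
        sims[b][a] = score
    return sims


def recommend(sims, item, top_n=3):
    """Items most similar to `item`, best first; ties broken by name."""
    ranked = sorted(sims.get(item, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]


baskets = [
    ["book", "lamp"],
    ["book", "lamp", "pen"],
    ["book", "pen"],
    ["lamp", "rug"],
]
sims = item_similarities(baskets)
```

The appeal is that the similarity table can be built offline, so serving a recommendation is just a lookup keyed by the item the customer is looking at.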