home ¦ Archives ¦ Atom ¦ RSS

pyDAWG

Okay, tons of background knowledge is great, until you find yourself having to search 20Gb / 270+ million lines to find anything in that mountain of data. Full text indexing seems like the right choice, but again, who wants to deal with Solr/Lucene, ElastiSearch, or Sphinx just to get started?

Enter DAWG

This package provides DAWG-based dictionary-like read-only objects for Python (2.x and 3.x).

String data in a DAWG (Directed Acyclic Word Graph) may take 200x less memory than in a standard Python dict or list and the raw lookup speed is comparable. DAWG may be even faster than built-in dict for some operations. It also provides fast advanced methods like prefix search.

Based on dawgdic C++ library.

I’m going to give it a shot at work and see what happens. The big upside is fast prefix search (hopefully) over and above a key/value store’s fast key lookup. Only obvious downside I can see is constantly hearing Xzibit in my head when I read the docs for this module.

© 2008-2024 C. Ross Jam. Built using Pelican. Theme based upon Giulio Fidente’s original svbhack, and slightly modified by crossjam.