Daily Shaarli
01/14/20
Speaking as a maintainer of Mercurial and an avid user of Python, I feel like the experience of making Mercurial work with Python 3 is worth sharing because there are a number of lessons to be learned.
CleverCSV provides a drop-in replacement for the Python csv package with improved dialect detection for messy CSV files. It also provides a handy command line tool that can standardize a messy file or generate Python code to import it.
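For a sense of the drop-in API, here is a minimal sketch (the file name is illustrative; the calls mirror the standard library's csv module, which CleverCSV is designed to replace):

```python
# Minimal sketch: detect the dialect of a messy CSV file, then read it.
# "messy.csv" is an illustrative file name.
import clevercsv

with open("messy.csv", "r", newline="") as fp:
    # Sniffer with improved dialect detection (drop-in for csv.Sniffer)
    dialect = clevercsv.Sniffer().sniff(fp.read())
    fp.seek(0)
    reader = clevercsv.reader(fp, dialect)
    rows = list(reader)

print(rows[:3])
```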
Python haters always say that one of the reasons they don't want to use it is that it's slow. Well, whether a specific program, regardless of the programming language used, is fast or slow depends very much on the developer who wrote it and on their skill and ability to write optimized, fast programs.
So let's prove some people wrong and see how we can improve the performance of our Python programs and make them really fast!
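As a starting point for that kind of work (not necessarily what the linked article does), here is a minimal sketch of measuring a function with the standard library before trying to optimize it:

```python
# Minimal sketch: measure before optimizing, using only the standard library.
# The function and numbers here are illustrative, not taken from the article.
import cProfile
import timeit


def slow_sum(n):
    """Deliberately naive: builds a whole list just to sum it."""
    return sum([i * i for i in range(n)])


# Wall-clock timing of repeated calls
print(timeit.timeit("slow_sum(10_000)", globals=globals(), number=1_000))

# Function-level profile showing where the time goes
cProfile.run("slow_sum(1_000_000)")
```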
Parsr is a minimal-footprint document (image, PDF) cleaning, parsing, and extraction toolchain which generates readily available, organized, and usable data for data scientists and developers.
It provides users with a clean, structured, and label-enriched information set for ready-to-use applications ranging from data entry and document analysis automation to archival and many others.
Currently, Parsr can perform the following (a minimal usage sketch follows the list):
- Document Hierarchy Regeneration - Words, Lines and Paragraphs
- Headings Detection
- Table Detection and Reconstruction
- Lists Detection
- Text Order Detection
- Named Entity Recognition (Dates, Percentages, etc.)
- Key-Value Pair Detection (for the extraction of specific form-based entries)
- Page Number Detection
- Header-Footer Detection
- Link Detection
- Whitespace Removal
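For illustration, here is a minimal sketch of talking to a running Parsr server over HTTP. The host, file name, endpoint paths, and "done" status code are assumptions to verify against the Parsr documentation:

```python
# Minimal sketch: send a PDF to a Parsr server and fetch the structured JSON.
# Assumes a server on localhost:3001; endpoint paths and the completion status
# code below are assumptions -- check them against the Parsr API docs.
import time
import requests

BASE = "http://localhost:3001/api/v1"

with open("report.pdf", "rb") as f:  # illustrative file name
    job_id = requests.post(f"{BASE}/document", files={"file": f}).text.strip()

# Poll the queue until the pipeline reports completion.
while True:
    status = requests.get(f"{BASE}/queue/{job_id}")
    if status.status_code == 201:  # assumed "finished" signal
        break
    time.sleep(2)

document = requests.get(f"{BASE}/json/{job_id}").json()
print(type(document))
```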
Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.
Main features (a short usage sketch follows this list):
- Train new vocabularies and tokenize, using today's most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
- Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
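Here is a minimal sketch using the library's Python bindings and its ByteLevelBPETokenizer helper; the corpus file and parameters are illustrative:

```python
# Minimal sketch: train a byte-level BPE tokenizer on a text corpus, then
# encode a sentence. File name, vocab size, and lengths are illustrative.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

# Train a new vocabulary from raw text files (fast thanks to the Rust core).
tokenizer.train(files=["corpus.txt"], vocab_size=30_000, min_frequency=2)

# Truncation is part of the built-in pre-processing.
tokenizer.enable_truncation(max_length=512)

encoding = tokenizer.encode("Tokenizers are fast and versatile.")
print(encoding.tokens)   # sub-word tokens
print(encoding.ids)      # their vocabulary ids
print(encoding.offsets)  # (start, end) character spans in the original sentence
```

The offsets printed at the end are what the alignment tracking above refers to: each token maps back to a character span in the original sentence.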