View colored, incremental diff in a workspace or from stdin, with side-by-side and auto-pager support (was "cdiff").
GNU recutils is a set of tools and libraries for accessing human-editable, text-based databases called recfiles. The data is stored as a sequence of records, each record containing an arbitrary number of named fields. Advanced capabilities usually found in other data storage systems are supported: data types, data integrity (keys, mandatory fields, etc.), as well as the ability of records to refer to other records (a sort of foreign key). Despite their simplicity, recfiles can be used to store medium-sized databases.
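For a concrete picture of the format, here is a small Python sketch (illustrative only, not part of recutils) that parses a recfile-style snippet where blank lines separate records and each field is a `Name: value` line.

```python
# Toy parser for a recfile-style snippet, to illustrate the record/field layout.
# Real recfiles also support %rec descriptors, multi-line values and comments,
# which this sketch deliberately ignores.
RECFILE = """\
Name: GNU recutils
Kind: toolset

Name: Parsr
Kind: document parser
"""

def parse_recfile(text):
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip():          # a blank line ends the current record
            if current:
                records.append(current)
                current = {}
            continue
        field, _, value = line.partition(":")
        current[field.strip()] = value.strip()
    if current:
        records.append(current)
    return records

print(parse_recfile(RECFILE))
# [{'Name': 'GNU recutils', 'Kind': 'toolset'}, {'Name': 'Parsr', 'Kind': 'document parser'}]
```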
Parsr is a minimal-footprint document (image, PDF) cleaning, parsing and extraction toolchain which generates readily available, organized and usable data for data scientists and developers.
It provides users with a clean, structured and label-enriched information set for ready-to-use applications ranging from data entry and document analysis automation to archival and many others.
Currently, Parsr can perform:
- Document Hierarchy Regeneration - Words, Lines and Paragraphs
- Headings Detection
- Table Detection and Reconstruction
- Lists Detection
- Text Order Detection
- Named Entity Recognition (Dates, Percentages, etc.)
- Key-Value Pair Detection (for the extraction of specific form-based entries)
- Page Number Detection
- Header-Footer Detection
- Link Detection
- Whitespace Removal
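Parsr is typically run as a local server and queried over HTTP. The Python sketch below shows roughly how such a workflow could look; the port, endpoint paths and response handling are assumptions for illustration, so check the Parsr documentation for the actual API.

```python
# Hedged sketch: send a PDF to a locally running Parsr server and poll for the JSON output.
# The base URL, endpoint paths and file names below are illustrative assumptions.
import time
import requests

BASE = "http://localhost:3001/api/v1"   # assumed local Parsr server address

with open("invoice.pdf", "rb") as doc, open("config.json", "rb") as cfg:
    resp = requests.post(f"{BASE}/document", files={"file": doc, "config": cfg})
resp.raise_for_status()
job_id = resp.text.strip()               # assumed: the server answers with a job id

# Poll until the parsed result (document hierarchy, tables, entities, ...) is ready.
while True:
    result = requests.get(f"{BASE}/json/{job_id}")
    if result.status_code == 200:
        parsed = result.json()
        break
    time.sleep(2)

print(len(parsed.get("pages", [])), "pages parsed")
```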
Hugging Face Tokenizers provides an implementation of today's most used tokenizers, with a focus on performance and versatility.
Main features:
- Train new vocabularies and tokenize, using today's most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignments tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
- Does all the pre-processing: Truncate, Pad, add the special tokens your model needs.
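As a quick illustration of the train-then-encode workflow, here is a short Python sketch along the lines of the library's quick tour, assuming the `tokenizers` package is installed; `corpus.txt` is a placeholder for your own training file.

```python
# Train a small BPE tokenizer and encode a sentence with the `tokenizers` package.
# "corpus.txt" is a placeholder path; swap in your own plain-text training files.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)

encoding = tokenizer.encode("Hello, y'all! How are you?")
print(encoding.tokens)    # subword tokens
print(encoding.offsets)   # alignment back to spans of the original sentence
```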
In information theory, linguistics and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
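The definition maps directly onto a small dynamic-programming recurrence; here is a self-contained Python sketch of that standard algorithm (not tied to any particular library listed here).

```python
# Standard dynamic-programming Levenshtein distance:
# prev[j] holds the minimum number of edits to turn a prefix of `a` into b[:j].
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))           # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                           # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + cost,          # substitution (or match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))      # 3
```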
Homer, a text analyser in Python, can help make your text clearer, simpler and more useful for your readers.
peco is a CLI utility that filters text interactively. The tool is written in the Go programming language. It's free and open source software.
In order to work successfully with the Linux sed editor and the awk command in your shell scripts, you have to understand regular expressions, or regex for short. Since there are many regex engines, we will use the shell's regex and see the power of bash when working with regex.
First, we need to understand what regex is; then we will see how to use it.
The table of contents includes:
What is regex, Types of regex, Define BRE Patterns, Special Characters, Anchor Characters, The dot Character, Character Classes, Negating Character Classes, Using Ranges, Special Character Classes, The Asterisk, Extended Regular Expressions, The question mark, The Plus Sign, Curly Braces, Pipe Symbol, Grouping Expressions, Practical examples, Counting Directory Files, Validating E-mail Address.
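As a taste of the pattern concepts listed above (anchors, character classes, quantifiers, grouping, alternation), here is a small sketch using Python's re module; treat it as a language-neutral illustration of the same ideas, not as the article's bash/sed/awk examples.

```python
# Common regex building blocks illustrated with Python's re module;
# the article covers the same concepts for bash, sed and awk.
import re

text = "Contact: alice@example.com or bob_99@test.org"

# ^ anchors a pattern to the start of the string.
print(bool(re.match(r"^Contact", text)))                      # True

# [A-Za-z0-9._%+-]+ is a character class with a + quantifier (one or more).
emails = re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", text)
print(emails)                                                  # both addresses

# Grouping with (...) and alternation with | : match .com or .org domains.
print(re.findall(r"@([A-Za-z0-9.-]+\.(?:com|org))", text))     # ['example.com', 'test.org']
```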
The author tried editing data in spreadsheet programs.
This post illustrates how to use Vim to edit tabular data, along with a few things that make it more pleasant. It assumes the files being edited are in tab-separated values (TSV) format.
"But what about CSV files?" Just. Don't.
Do: convert your CSV to TSV for editing, and convert it back afterwards.
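That round trip can be done with Python's standard csv module; in the sketch below the file names are placeholders.

```python
# Convert CSV -> TSV for editing in Vim, then TSV -> CSV afterwards.
# "data.csv" / "data.tsv" are placeholder file names.
import csv

def convert(src, dst, in_delim, out_delim):
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        writer = csv.writer(fout, delimiter=out_delim)
        for row in csv.reader(fin, delimiter=in_delim):
            writer.writerow(row)

convert("data.csv", "data.tsv", ",", "\t")   # edit data.tsv in Vim ...
convert("data.tsv", "data.csv", "\t", ",")   # ... then convert back
```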
Texar (asyml/texar): a toolkit for text generation and beyond.
TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
Features
- Noun phrase extraction
- Part-of-speech tagging
- Sentiment analysis
- Classification (Naive Bayes, Decision Tree)
- Language translation and detection powered by Google Translate
- Tokenization (splitting text into words and sentences)
- Word and phrase frequencies
- Parsing
- n-grams
- Word inflection (pluralization and singularization) and lemmatization
- Spelling correction
- Add new models or languages through extensions
- WordNet integration
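Most of the features above are exposed as properties on a TextBlob object; a minimal usage sketch:

```python
# Minimal TextBlob usage covering a few of the features listed above.
# A one-time `python -m textblob.download_corpora` may be needed for the NLTK data.
from textblob import TextBlob

blob = TextBlob("TextBlob makes simple NLP tasks pleasant. The API is beautiful.")

print(blob.tags)           # part-of-speech tags, e.g. [('TextBlob', 'NNP'), ...]
print(blob.noun_phrases)   # noun phrase extraction
print(blob.sentiment)      # Sentiment(polarity=..., subjectivity=...)
print(blob.words)          # tokenization into words
print(blob.sentences)      # tokenization into sentences
```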
glitchcat is a cat-like program with glitch animation.