Tullio Facchinetti
Tag cloud
Picture wall
Daily
RSS Feed
  • RSS Feed
  • ATOM Feed
  • Daily Feed
Filters

Links per page

  • 20 links
  • 50 links
  • 100 links

Display

Filter untagged links
tokenizers - Fast State-of-the-Art Tokenizers https://github.com/huggingface/tokenizers
Tue 14 Jan 2020 06:48:48 AM CET
QRCode

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Main features:

  • Train new vocabularies and tokenize, using today's most used tokenizers.
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignments tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: Truncate, Pad, add the special tokens your model needs.
coding_lang:rust library opensource software text_manipulation text_mining
3660 links
Shaarli - The personal, minimalist, super-fast, database free, bookmarking service by the Shaarli community - Theme by kalvn