Daily Shaarli

All links of one day in a single page.

08/04/19

Cloud hosting - High performance and reliability!

Cloud hosting service for professional applications.

[1908.00200] KiloGrams: Very Large N-Grams for Malware Classification

N-grams have been a common tool for information retrieval and machine learning applications for decades. In nearly all previous works, only a few values of $n$ are tested, with $n > 6$ being exceedingly rare. Larger values of $n$ are not tested due to computational burden or the fear of overfitting.

In this work, we present a method to find the top-$k$ most frequent $n$-grams that is 60$\times$ faster for small $n$, and can tackle large $n\geq1024$. Despite the unprecedented size of $n$ considered, we show how these features still have predictive ability for malware classification tasks. More important, large $n$-grams provide benefits in producing features that are interpretable by malware analysts, and can be used to create general purpose signatures compatible with industry standard tools like Yara. Furthermore, the counts of common $n$-grams in a file may be added as features to publicly available human-engineered features, which then rival the efficacy of professionally-developed features when used to train gradient-boosted decision tree models on the EMBER dataset.
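To make the idea of counting frequent byte $n$-grams concrete, here is a minimal sketch that hashes every $n$-byte window of a buffer into a fixed-size table of counters and reports the fullest bucket. It is a generic illustration only and not the paper's algorithm, which this excerpt does not describe; the names `NUM_BUCKETS`, `hash_window`, `count_ngrams`, and `most_frequent_bucket` are made up for this example, and the table size is an arbitrary assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_BUCKETS 4096u /* hypothetical table size, chosen arbitrarily */

/* FNV-1a hash of one n-byte window. */
static uint32_t hash_window(const unsigned char *window, size_t n)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) {
        h ^= window[i];
        h *= 16777619u;
    }
    return h;
}

/* Count every n-gram of data[0..len) into counts[0..NUM_BUCKETS).
   Distinct n-grams that collide share a bucket, so counts are upper bounds. */
void count_ngrams(const unsigned char *data, size_t len, size_t n,
                  uint32_t *counts)
{
    assert(n > 0);
    memset(counts, 0, NUM_BUCKETS * sizeof counts[0]);
    for (size_t i = 0; i + n <= len; i++) {
        counts[hash_window(data + i, n) % NUM_BUCKETS]++;
    }
}

/* Index of the bucket with the highest count (first one on ties). */
size_t most_frequent_bucket(const uint32_t *counts)
{
    size_t best = 0;
    for (size_t i = 1; i < NUM_BUCKETS; i++) {
        if (counts[i] > counts[best]) {
            best = i;
        }
    }
    return best;
}
```

Rehashing each window from scratch costs time proportional to $n$ per position, so for very large $n$ a rolling hash would be the more practical choice; this sketch keeps the simple version for readability.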

A nice, little known C feature: Static array indices in parameter declarations

The people who created C sure loved keeping the number of keywords low, and today I’m going to show you yet another place you can use the static keyword in C99.

You might have seen function parameter declarations for array parameters that include the size:

void foo(int myArray[10]);

The function will still receive a naked int *, but the [10] part can serve as documentation for the people reading the code, saying that the function expects an array of 10 ints.

But, you can actually also use the keyword static between the brackets:

void bar(int myArray[static 10]);

This tells the compiler that it should assume that the array passed to bar has at least 10 elements. (Note that this rules out a NULL pointer!)
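As a small sketch of how this looks in practice (the function names `sum_first_ten` and `sum_of_one_to_ten` are invented for illustration), the callee below declares its parameter with `static 10` and the caller checks that its array is large enough before passing it:

```c
#include <assert.h>
#include <stddef.h>

/* `static 10` promises that `values` points to at least 10 ints
   and is therefore never NULL. */
int sum_first_ten(const int values[static 10])
{
    int total = 0;
    for (size_t i = 0; i < 10; i++) {
        total += values[i];
    }
    return total;
}

/* Example caller: the array has exactly 10 elements, so the promise holds. */
int sum_of_one_to_ten(void)
{
    int numbers[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    assert(sizeof numbers / sizeof numbers[0] >= 10);
    return sum_first_ten(numbers);
}
```

Passing a shorter array or NULL is undefined behavior rather than a guaranteed compile error; some compilers, Clang among them, can warn when the violation is visible at the call site.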

TOM: A library for topic modeling and browsing

TOM (TOpic Modeling) is a Python 3 library for topic modeling and browsing, licensed under the MIT license.

Its objective is to allow for an efficient analysis of a text corpus from start to finish, via the discovery of latent topics. To this end, TOM features functions for preparing and vectorizing a text corpus. It also offers a common interface for two topic models (namely LDA, using either variational inference or Gibbs sampling, and NMF, using alternating least squares with a projected gradient method), and implements three state-of-the-art methods for estimating the optimal number of topics to model a corpus. In addition, TOM constructs an interactive Web-based browser that makes it easy to explore a topic model and the related corpus.

components: An easier way to build and share serverless applications w/ the Serverless Framework

An easier way to build and share serverless applications with the Serverless Framework.