R/sentocorpus.R
as.sento_corpus.Rd
Converts the most common quanteda and tm corpus objects into a sento_corpus object. Available metadata is integrated as features where appropriate: for a quanteda corpus, it is taken from docvars(x); for a tm corpus, only meta(x, type = "indexed") metadata is considered.
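To see which metadata the conversion would have available, it can help to inspect both sources before calling as.sento_corpus. The sketch below is a minimal illustration and not part of the package: the data frame df and its column economy are hypothetical, and the column layout (doc_id and text first) simply follows the tm DataframeSource convention. NLP::meta is called explicitly because quanteda also exports a function named meta.

```r
# Hypothetical toy data: doc_id and text come first, the remaining
# columns act as document-level metadata.
df <- data.frame(
  doc_id  = c("d1", "d2"),
  text    = c("Stocks rallied today.", "Output fell last quarter."),
  date    = c("2020-01-01", "2020-01-02"),
  economy = c(1, 0),
  stringsAsFactors = FALSE
)

# quanteda: the document variables returned by docvars(x)
qcorp <- quanteda::corpus(df, text_field = "text", docid_field = "doc_id")
qmeta <- quanteda::docvars(qcorp)

# tm: the indexed (document-level) metadata returned by meta(x, type = "indexed")
tcorp <- tm::VCorpus(tm::DataframeSource(df))
tmeta <- NLP::meta(tcorp, type = "indexed")

names(qmeta)
names(tmeta)
```

Both calls should list the date and economy columns, which are the candidates for the date index and the features of the resulting sento_corpus object.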
as.sento_corpus(x, dates = NULL, do.clean = FALSE)
x | a quanteda |
---|---|
dates | an optional sequence of dates as |
do.clean | see |
A sento_corpus object, as returned by the sento_corpus function.
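A quick way to confirm the return value is to check its class after conversion. This is only a sketch under stated assumptions: the toy data frame df is hypothetical, and it assumes that a date document variable in "yyyy-mm-dd" form is picked up when dates is left at NULL, as in the quanteda example on this page.

```r
# Hypothetical toy data with a date column and one numeric feature.
df <- data.frame(
  doc_id  = c("d1", "d2", "d3"),
  text    = c("Stocks rallied today.",
              "Output fell last quarter.",
              "Rates were left unchanged."),
  date    = c("2020-01-01", "2020-01-02", "2020-01-03"),
  economy = c(1, 0, 1),
  stringsAsFactors = FALSE
)

qcorp <- quanteda::corpus(df, text_field = "text", docid_field = "doc_id")

# Convert and verify that the result carries the sento_corpus class.
corp <- sentometrics::as.sento_corpus(qcorp)
inherits(corp, "sento_corpus")
```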
Samuel Borms
data("usnews", package = "sentometrics")
txt <- system.file("texts", "txt", package = "tm")
reuters <- system.file("texts", "crude", package = "tm")

# reshuffle usnews data.frame for use in quanteda and tm
dates <- usnews$date
usnews$wrong <- "notNumeric"
colnames(usnews)[c(1, 3)] <- c("doc_id", "text")

# conversion from a quanteda corpus
qcorp <- quanteda::corpus(usnews, text_field = "text", docid_field = "doc_id")
corp1 <- as.sento_corpus(qcorp)
#> Warning: Following feature columns are dropped as they are not numeric: wrong.
#> Warning: Following feature columns are dropped as they are not numeric: wrong.

# conversion from a tm SimpleCorpus corpus (DataframeSource)
tmSCdf <- tm::SimpleCorpus(tm::DataframeSource(usnews))
corp3 <- as.sento_corpus(tmSCdf)
#> Warning: Following feature columns are dropped as they are not numeric: wrong.

# conversion from a tm SimpleCorpus corpus (DirSource)
tmSCdir <- tm::SimpleCorpus(tm::DirSource(txt))
corp4 <- as.sento_corpus(tmSCdir, dates[1:length(tmSCdir)])

# conversion from a tm VCorpus corpus (DataframeSource)
tmVCdf <- tm::VCorpus(tm::DataframeSource(usnews))
corp5 <- as.sento_corpus(tmVCdf)
#> Warning: Following feature columns are dropped as they are not numeric: wrong.

# conversion from a tm VCorpus corpus (DirSource)
tmVCdir <- tm::VCorpus(tm::DirSource(reuters),
                       list(reader = tm::readReut21578XMLasPlain))
corp6 <- as.sento_corpus(tmVCdir, dates[1:length(tmVCdir)])
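The warnings in the examples above report that non-numeric feature columns are dropped during conversion. If the warning is unwanted, one option is to remove such columns before building the corpus. The following base R sketch shows the idea; the data frame df and the reserved vector are hypothetical names chosen for illustration, mirroring the reshuffled usnews layout.

```r
# Hypothetical input with one numeric and one non-numeric metadata column.
df <- data.frame(
  doc_id  = c("d1", "d2"),
  text    = c("Stocks rallied today.", "Output fell last quarter."),
  date    = c("2020-01-01", "2020-01-02"),
  economy = c(1, 0),
  wrong   = "notNumeric",
  stringsAsFactors = FALSE
)

# Columns that identify the documents and are not treated as features here.
reserved <- c("doc_id", "text", "date")

# Keep the reserved columns plus every numeric column.
is_reserved <- names(df) %in% reserved
is_numeric  <- vapply(df, is.numeric, logical(1))
clean <- df[, is_reserved | is_numeric, drop = FALSE]

names(clean)
```

The cleaned data frame can then be passed to quanteda::corpus or tm::DataframeSource exactly as in the examples, with only the numeric economy column left as a candidate feature.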