This is the appendix to the article Analyse thématique comparative des discours politiques et de leur diffusion dans le Wikipédia francophone, accepted at JADT 2024.
We propose to contextualize a lexicon of terms over a set of press releases from the RN and FI parties. More precisely, given a corpus of texts C and a small lexicon L characteristic of a theme or a type of discourse, we search for the most significant associations between the terms of this lexicon, based both on their direct co-occurrences in the corpus C and on various vector representations of these terms.
The lexicon is defined in the lexicon table of the database. Its content is as follows:
## [1] "agriculture" "antisémitisme" "artisan"
## [4] "artisanat" "autonomie" "autoritaire"
## [7] "autoritarisme" "autorité" "banlieue"
## [10] "catholicisme" "catholique" "chauvin"
## [13] "chauvinisme" "chrétien" "chrétienne"
## [16] "chrétienté" "christianisme" "civilisation"
## [19] "civique" "communautaire" "communautarisme"
## [22] "communautariste" "communauté" "conspiration"
## [25] "conspirationnisme" "conspirationniste" "défense"
## [28] "délinquance" "délinquant" "démagogie"
## [31] "démocratie" "démocratique" "drapeau"
## [34] "droite" "écologie" "économie"
## [37] "élite" "élites" "énergie"
## [40] "ensauvagement" "etat" "etat-nation"
## [43] "ethnique" "étranger" "européenne"
## [46] "extérieur" "extrême" "extrême-droite"
## [49] "extrême-gauche" "extrêmes" "extrémisme"
## [52] "extrémiste" "fascisme" "féminisme"
## [55] "français" "française" "françaises"
## [58] "france" "francophone" "frontière"
## [61] "frontières" "gauche" "gay"
## [64] "genre" "héritage" "hijab"
## [67] "histoire" "identitaire" "identité"
## [70] "idéologie" "immigration" "impérialisme"
## [73] "indépendance" "individualisme" "individualiste"
## [76] "insécurité" "intérieur" "international"
## [79] "islam" "islamisme" "islamiste"
## [82] "islamophobie" "isolationnisme" "laïcité"
## [85] "liberté" "local" "locale"
## [88] "locales" "localisme" "locaux"
## [91] "migratoire" "minorités" "nation"
## [94] "nativisme" "néonationalisme" "nucléaire"
## [97] "patrie" "patriote" "paysan"
## [100] "pénalisation" "peuple" "polarisation"
## [103] "populisme" "populiste" "protection"
## [106] "puissance" "québec" "québécois"
## [109] "québécoise" "québécoises" "racisme"
## [112] "raciste" "radical" "radicalisation"
## [115] "régionalisme" "repli" "rural"
## [118] "ruralité" "sécularisme" "sécuritaire"
## [121] "sécurité" "séparatisme" "souveraineté"
## [124] "souverainisme" "souverainiste" "supranationalisme"
## [127] "tradition" "union" "unité"
## [130] "valeur" "xénophobe" "xénophobie"
## [133] "zone"
In a combinatorial approach where articles are treated as bags of words, and where the goal is not to characterize style but the more or less systematic associations between terms, stop words tend to generate uninformative cliques of associations that are costly in computing time. We use the French stop-word list from the stopwords-iso project, which can be supplemented with words specific to the domain under study.
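Concretely, the stop-word list can be built as follows; the variable name swl matches its later use in the preprocessing functions, and the extra domain-specific words added here are purely illustrative:

```r
# Build the French stop-word list from the stopwords-iso source,
# then add domain-specific noise words (illustrative examples only).
library(stopwords)
swl <- stopwords("fr", source = "stopwords-iso")
swl <- union(swl, c("http", "https", "communiqué"))
```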
The database is queried with the function myD, which takes as arguments a full-text query bq ("ALL" to disable filtering), two bounding dates dd1 and dd2, a maximum number of documents l, and a table name tb:
myD<-function(bq="ALL",dd1="2023-01-01",dd2="2007-01-01",l=300,tb="abstracts"){
  cnx<-mycnx()
  qwp=paste("select id as doc_id, title, date_last as date, content as text from ",tb,
            " where date_last <= '",dd1,"' and date_last >= '",dd2,"' limit ",l,sep='')
  if(bq!="ALL")
    qwp=paste("select id as doc_id, title, date_last as date, content as text from ",tb,
              " where date_last <= '",dd1,"' and date_last >= '",dd2,
              "' and to_tsvector('french', title || ' ' || content) @@ to_tsquery('french','",bq,
              "') limit ",l,sep='')
  D=dbGetQuery(cnx,qwp)
  dbDisconnect(cnx)
  D$text=tolower(D$text)
  D$text=gsub("[,.:;\\(\\)\\[\\]\\'\\|]"," ",D$text,perl = TRUE)
  D$text=gsub(" +"," ",D$text,perl = TRUE)
  D
}
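The connection helper mycnx called above is not reproduced in this appendix. Since the queries rely on PostgreSQL full-text search (to_tsvector / to_tsquery), a minimal sketch under that assumption could look as follows; all connection parameters are placeholders:

```r
# Hypothetical sketch of the undefined helper mycnx, assuming a PostgreSQL
# backend (required by the to_tsvector/to_tsquery calls in myD).
mycnx <- function(){
  library(DBI)
  library(RPostgres)
  dbConnect(RPostgres::Postgres(),
            dbname   = "corpus",     # placeholder database name
            host     = "localhost",  # placeholder host
            user     = "user",       # placeholder credentials
            password = "password")
}
```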
The two full-text queries used to build the corpora below are:
"europe|nation|peuple|économie|sécurité"
"europe|sécurité"
We can search this corpus for the co-occurrence context of the lexicon terms. A Kendall test is used to assess the significance of these co-occurrences.
To do so we use the function mycontext:
mycontext<-function(D,tx="",ty="",tz=""){
  testk=""
  if(tx!="" && ty!=""){
    t=cortest(D,tx,ty)
    testk=paste("\nKendall test: tau=",round(t$estimate,digits = 2),"p=",
                round(t$p.value,digits = 4),sep =" ")
  }
  if(tx!="")D=D[grep(tx,D$text),]
  if(ty!="")D=D[grep(ty,D$text),]
  if(tz!="")D=D[grep(tz,D$text),]
  hist(D$date,breaks="months",freq=TRUE, main=paste("Termes :",tx,ty,tz,testk, sep=" "),
       xlab="Dates des pages")
  D
}
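The helper cortest called by mycontext is not reproduced in this appendix. A minimal sketch consistent with its use here, assuming a Kendall correlation between the per-document presence indicators of the two terms, would be:

```r
# Hypothetical sketch of the undefined helper cortest: Kendall correlation
# between per-document presence indicators of two terms.
cortest <- function(D, tx, ty){
  vx <- as.numeric(grepl(tx, D$text))  # 1 if tx occurs in the document
  vy <- as.numeric(grepl(ty, D$text))  # 1 if ty occurs in the document
  cor.test(vx, vy, method = "kendall")
}
```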
We compute a new vector representation on the extracted context alone. This representation aims at predicting word collocations over the whole text.
The embeddings are computed with the function wempol:
wempol<-function(D,k=20,i=100,slw=c("http")){
  library(word2vec)
  set.seed(123456789)
  x=as.character(D$text)
  wem=word2vec(x = x, dim = k, iter = i, type = 'skip-gram', window = 10, lr=0.01,
               hs=TRUE, sample=0.001, min_count = 5, threads=10, stopwords=slw)
  wem
}
# Extraction of the term list. In case of duplicates, the corresponding vectors are summed
my_grep=function(beta,w){
  n=colnames(beta)
  return(beta[,n==w])
}
my_subgrep=function(beta,w){
  v=grep(w,colnames(beta),fixed = TRUE)
  if(length(v)==1) return(beta[,v])
  else return(apply(beta[,v],c(1),sum))
}
# Computing the correlation
ldacortest<-function(beta,w1,w2){
  v1=my_subgrep(beta,w1)
  v2=my_subgrep(beta,w2)
  cor.test(v1,v2,method = "p")
}
# Searching for pairs of correlated terms within a list
ldabicor=function(lm,terms,p=0.1){
  xt=list()
  terms=terms[terms%in%colnames(lm)]
  n=length(terms)
  for(i in 1:(n-1)){
    for(j in (i+1):n){
      x=ldacortest(lm,terms[i],terms[j])
      if(!(is.na(x$estimate))&&(x$p.value<p)) xt[paste(terms[i],terms[j])]=x$estimate
    }
  }
  vt=as.numeric(xt)
  names(vt)=names(xt)
  sort(vt,decreasing = TRUE)
}
and for visualization:
termviz=function(beta,s){
  library(FactoMineR)
  C_red=beta[,colnames(beta)%in%s]
  PCA(C_red)
}
We can thus list the words most likely to be found in the vicinity of a word x, and see how the lexicon terms position themselves.
# Searching for nearby terms (word embeddings)
wesim=function(wem,s){
  predict(wem, s, type = "nearest", top_n = 30)
}
We seek to generate vector representations of words that allow studying their indirect co-occurrences (the use of these words in similar contexts). To do so, we take their frequency of appearance in the texts into account. Two approaches exist, factorial and probabilistic; here we follow the probabilistic one.
We now address the study of sets of terms frequently associated in this corpus. The ordering of words within a text is ignored, so as to study their associations at the scale of documents.
This requires preprocessing the text. We remove the words appearing in the swl list above and eliminate very low-frequency words. One could also lemmatize, but this slows the process down considerably (all the more so as contextual lemmatization requires a syntactic analysis that takes context into account).
# text preprocessing
mypre<-function(D,swl,m=10,lemmatize=0){
  library(tm)
  library(stopwords)
  library(stringr)
  library(textstem)
  library(qdap)
  vtext=D$text
  dict_lemmas=c()
  if(lemmatize==1){
    library(textstem)
    library(hunspell)
    dict_lemmas=make_lemma_dictionary(D$text,engine = "hunspell",lang = "fr_FR")
  }
  ids=D$doc_id
  Cid=c()
  Ctxt=c()
  n=length(vtext)
  j=0
  for(i in 1:n){
    text = tolower(vtext[i])
    text = gsub("[^[:alnum:][:blank:]+?&/\\-]", " ", text)
    if(lemmatize==1){
      text = lemmatize_strings(text,dictionary = dict_lemmas)
    }
    text = trimws(gsub(pattern = " +\\- +",replacement = "-",x = text))
    text = rm_stopwords(text,swl,separate = FALSE,strip=TRUE)
    text = paste(str_extract_all(text, '\\w{3,}')[[1]], collapse=' ')
    text = gsub(" . "," ",text)
    if(nchar(text) > m){
      j=j+1
      Cid[j]=ids[i]
      Ctxt[j]=text
    }
  }
  D=data.frame(doc_id=Cid,text=Ctxt)
  D
}
The function mydtm generates the matrix representing documents as sets of weighted words. Its call requires:
a variable name to store the generated matrix
a vector of preprocessed texts (after stop-word removal and optional lemmatization)
a vector containing the identifiers of the texts
optionally, a minimum frequency threshold m for words (5 by default)
mydtm<-function(vtext,ids,m=5){
  library(tm)
  D=data.frame(doc_id=ids,text=vtext)
  corpus <- Corpus(DataframeSource(D))
  minimumFrequency <- m
  DTM <- DocumentTermMatrix(corpus, control = list(bounds = list(global = c(minimumFrequency, Inf))))
  sel_idx <- slam::row_sums(DTM) > 0
  D <- D[sel_idx, ]
  DTM[sel_idx, ]
}
The probabilistic approach is generative. It consists in searching for k word distributions that explain the co-occurrence phenomena in the texts. This process assumes that writing a text involves a prior choice of subjects or themes, and that these themes induce different appearance probabilities for words.
The model is computed with the function ldapol; its parameters are the document-term matrix DTM, the number of topics k, the random seeds s, and the burn-in length b (the number of Gibbs iterations is set to 2*b).
Determining the number of dimensions of the model is the most delicate point. As with k-means (moving-centers) methods, the number of "themes" contained in the texts must be set a priori. This choice can be based on various quality measures of the resulting model; that is what the following function does. Beware: running it can take a very long time, since it fits a large number of models in order to compare them.
Here we choose as number of dimensions the intersection point of two quality measures.
findk<-function(DTM,k=10,m=40,i=10,s=0:4,b=50){
  library(ldatuning)
  library(topicmodels)
  control_list_gibbs <- list(
    burnin = b,
    iter = 2*b,
    seed = s,
    nstart = 5,
    best = TRUE
  )
  FTN=FindTopicsNumber(DTM,topics = seq(k,m,by = i),metrics = c("Griffiths2004", "Deveaud2014"),
                       method = "Gibbs",control = control_list_gibbs)
  FindTopicsNumber_plot(FTN)
}
Once the number of dimensions has been chosen, the model is computed with a high number of draws and resamplings.
ldapol<-function(DTM,k=30,s=0:4,b=2500){
  library(topicmodels)
  control_list_gibbs <- list(
    burnin = b,
    iter = 2*b,
    seed = s,
    nstart = 5,
    best = TRUE
  )
  topicModel <- LDA(DTM, k, method = "Gibbs", control = control_list_gibbs)
  tmResult <- posterior(topicModel)
  as.matrix(tmResult$terms)
}
On this probabilistic model we can also compute the terms that contribute most to each dimension, in order to detect, if need be, stop words to add to the stop-word list.
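The helper my_top_words, used later to list top terms per topic, is not defined in this appendix. A plausible sketch, returning for each row of the topic-term matrix produced by ldapol its n most probable terms, would be:

```r
# Hypothetical sketch of the undefined helper my_top_words: for each topic
# (row of the posterior topic-term matrix), keep the n most probable terms.
my_top_words <- function(lm, n = 20){
  lapply(seq_len(nrow(lm)),
         function(i) names(sort(lm[i, ], decreasing = TRUE))[1:n])
}
```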
We show here the reference-list terms and the LDA-model terms with the highest correlations, using the function ldabicor shown above.
We use this probabilistic model to search for correlated terms outside the study lexicon. This approach is comparable to collocation search based on word embeddings, except that it operates at the scale of documents.
# Searching for all terms correlated with a given term
ldanncor=function(lm,w,p=0.1){
  xt=list()
  for(t in colnames(lm)){
    st=ldacortest(lm,w,t)
    if(st$p.value<p) xt[t]=st$estimate
  }
  vt=as.numeric(xt)
  names(vt)=names(xt)
  sort(vt,decreasing = TRUE)
}
## Requête : europe|nation|peuple|économie|sécurité
## ----
## doc_id title date
## Min. : 17.0 Length:234 Min. :2016-06-29 00:00:00.00
## 1st Qu.: 178.5 Class :character 1st Qu.:2019-05-09 06:00:00.00
## Median : 7569.5 Mode :character Median :2020-04-05 00:00:00.00
## Mean : 4838.7 Mean :2020-05-15 02:15:53.84
## 3rd Qu.: 7705.5 3rd Qu.:2021-05-12 00:00:00.00
## Max. :12179.0 Max. :2022-12-15 00:00:00.00
## text
## Length:234
## Class :character
## Mode :character
##
##
##
## Longueur des textes :
## ----
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 188.0 1130.8 1631.5 2201.8 2219.8 43310.0
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 20 individuals, described by 21 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
## Nombre de termes distincts: 1388
options(digits=8)
k=18
lm_FI=ldapol(DTM_FI,k=k,b=300)
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 18 individuals, described by 35 variables
We use this probabilistic model to search for correlated terms outside the study lexicon. This approach is comparable to collocation search based on word embeddings, except that it operates at the scale of documents.
## Requête : europe|nation|peuple|économie|sécurité
## ----
## doc_id title date
## Min. : 600.0 Length:300 Min. :2016-01-12 00:00:00
## 1st Qu.: 736.5 Class :character 1st Qu.:2016-08-01 12:00:00
## Median : 1167.5 Mode :character Median :2017-08-30 12:00:00
## Mean : 1205.9 Mean :2017-07-13 08:46:48
## 3rd Qu.: 1593.0 3rd Qu.:2018-04-21 06:00:00
## Max. :12216.0 Max. :2022-09-23 00:00:00
## text
## Length:300
## Class :character
## Mode :character
##
##
##
## Longueur des textes :
## ----
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 635.0 1260.5 1534.5 1631.3 1933.2 4102.0
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 20 individuals, described by 27 variables
## Nombre de termes distincts: 1536
options(digits=8)
k=18
lm_RN=ldapol(DTM_RN,k=k,b=300)
#load("Rdata_RN/lm_RN.o")
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 18 individuals, described by 36 variables
## Requête : europe|nation|peuple|économie|sécurité
## ----
## doc_id title date
## Min. : 244 Length:300 Min. :2003-12-08 11:48:47.00
## 1st Qu.: 208500 Class :character 1st Qu.:2011-06-16 01:10:20.50
## Median : 489480 Mode :character Median :2015-03-18 08:00:39.50
## Mean : 498590 Mean :2015-02-16 14:24:53.55
## 3rd Qu.: 735290 3rd Qu.:2019-03-15 01:37:45.25
## Max. :1195043 Max. :2022-04-24 10:11:09.00
## text
## Length:300
## Class :character
## Mode :character
##
##
##
## Longueur des textes :
## ----
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 510.00 694.75 829.00 893.93 1036.25 1717.00
termes_ref_we_cor_WP=ldabicor(we_WP,termes_ref,p=0.01)
as.data.frame(termes_ref_we_cor_WP)
library(stringr)
cor_WP_we_idx=termes_ref %in% unique(str_split(paste(names(termes_ref_we_cor_WP[1:25]),collapse = " "),pattern = " "))[[1]]
termviz(we_WP,termes_ref[cor_WP_we_idx])
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 20 individuals, described by 28 variables
x="europe"
wnn_WP_x=wesim(wem_WP,x)
as.data.frame(wnn_WP_x)
y="sécurité"
wnn_WP_y=wesim(wem_WP,y)
as.data.frame(wnn_WP_y)
DTM_WP<-mydtm(pDWP$text,pDWP$doc_id,m = m_ref)
cat("Nombre de termes distincts:",DTM_WP$ncol,"\n")
## Nombre de termes distincts: 659
options(digits=8)
k=18
lm_WP=ldapol(DTM_WP,k=k,b=300)
#load("Rdata_RN/lm_RN.o")
termes_ref_lm_cor_WP=ldabicor(lm_WP,termes_ref,p=0.01)
as.data.frame(termes_ref_lm_cor_WP)
cor_WP_lm_idx=termes_ref %in% unique(str_split(paste(names(termes_ref_lm_cor_WP[1:50]),collapse = " "),pattern = " "))[[1]]
termviz(lm_WP,termes_ref[cor_WP_lm_idx])
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 18 individuals, described by 38 variables
x="europe"
knn_lm_WP_x=ldanncor(lm_WP,x)
as.data.frame(knn_lm_WP_x)
y="sécurité"
knn_lm_WP_y=ldanncor(lm_WP,y)
as.data.frame(knn_lm_WP_y)
z="économie"
knn_lm_WP_z=ldanncor(lm_WP,z)
as.data.frame(knn_lm_WP_z)
## Requête : europe|sécurité
## ----
## doc_id title date
## Min. : 32.00 Length:120 Min. :2016-06-29 00:00:00
## 1st Qu.: 202.25 Class :character 1st Qu.:2019-05-07 00:00:00
## Median : 7581.00 Mode :character Median :2020-04-21 12:00:00
## Mean : 5045.08 Mean :2020-06-12 06:11:30
## 3rd Qu.: 7872.50 3rd Qu.:2021-08-26 06:15:00
## Max. :12175.00 Max. :2022-12-09 00:00:00
## text
## Length:120
## Class :character
## Mode :character
##
##
##
## Longueur des textes :
## ----
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 498.0 1243.8 1701.5 2638.0 2445.8 43310.0
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 20 individuals, described by 15 variables
DTM_FIc<-mydtm(pDFIc$text,pDFIc$doc_id,m = m_ref)
cat("Nombre de termes distincts:",DTM_FIc$ncol,"\n")
## Nombre de termes distincts: 748
options(digits=8)
k=18
lm_FIc=ldapol(DTM_FIc,k=k,b=300)
termes_ref_lm_cor_FIc=ldabicor(lm_FIc,termes_ref,p=0.01)
as.data.frame(termes_ref_lm_cor_FIc)
cor_FIc_lm_idx=termes_ref %in% unique(str_split(paste(names(termes_ref_lm_cor_FIc[1:50]),collapse = " "),pattern = " "))[[1]]
termviz(lm_FIc,termes_ref[cor_FIc_lm_idx])
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 18 individuals, described by 24 variables
## Requête : europe|sécurité
## ----
## doc_id title date
## Min. : 609.0 Length:300 Min. :2016-01-05 00:00:00
## 1st Qu.: 1131.5 Class :character 1st Qu.:2016-06-27 18:00:00
## Median : 1695.5 Mode :character Median :2017-09-04 12:00:00
## Mean : 1743.2 Mean :2017-07-29 14:47:36
## 3rd Qu.: 2245.5 3rd Qu.:2018-06-15 06:00:00
## Max. :12216.0 Max. :2022-09-23 00:00:00
## text
## Length:300
## Class :character
## Mode :character
##
##
##
## Longueur des textes :
## ----
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 347.0 1276.8 1545.5 1685.3 1965.0 7170.0
termes_ref_we_cor_RNc=ldabicor(we_RNc,termes_ref,p=0.01)
as.data.frame(termes_ref_we_cor_RNc)
library(stringr)
cor_RNc_we_idx=termes_ref %in% unique(str_split(paste(names(termes_ref_we_cor_RNc[1:25]),collapse = " "),pattern = " "))[[1]]
termviz(we_RNc,termes_ref[cor_RNc_we_idx])
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 20 individuals, described by 24 variables
DTM_RNc<-mydtm(pDRNc$text,pDRNc$doc_id,m = m_ref)
cat("Nombre de termes distincts:",DTM_RNc$ncol,"\n")
## Nombre de termes distincts: 1605
df_top=my_top_words(lm_RNc)
nm_df_top=c()
T=list()
i=0
for(t in df_top){
  i=i+1
  T[[i]] = t[t %in% termes_ref]
  nm_df_top[i]=paste0(i,":",length(T[[i]]),":",paste0(T[[i]],collapse = " "))
}
names(df_top)=nm_df_top
df_top
termes_ref_lm_cor_RNc=ldabicor(lm_RNc,termes_ref,p=0.01)
as.data.frame(termes_ref_lm_cor_RNc)
cor_RNc_lm_idx=termes_ref %in% unique(str_split(paste(names(termes_ref_lm_cor_RNc[1:50]),collapse = " "),pattern = " "))[[1]]
termviz(lm_RNc,termes_ref[cor_RNc_lm_idx])
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 18 individuals, described by 22 variables
## Requête : europe|sécurité
## ----
## doc_id title date
## Min. : 2683 Length:185 Min. :2016-01-11 17:45:28.00
## 1st Qu.: 338797 Class :character 1st Qu.:2017-12-06 18:23:14.00
## Median : 705934 Mode :character Median :2019-07-08 14:24:51.00
## Mean : 670880 Mean :2019-05-30 09:55:52.39
## 3rd Qu.:1015340 3rd Qu.:2021-01-05 00:56:50.00
## Max. :1193013 Max. :2022-05-15 20:27:39.00
## text
## Length:185
## Class :character
## Mode :character
##
##
##
## Longueur des textes :
## ----
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 510.00 701.00 807.00 875.37 984.00 1729.00
termes_ref_we_cor_WPc=ldabicor(we_WPc,termes_ref,p=0.01)
as.data.frame(termes_ref_we_cor_WPc)
library(stringr)
cor_WPc_we_idx=termes_ref %in% unique(str_split(paste(names(termes_ref_we_cor_WPc[1:25]),collapse = " "),pattern = " "))[[1]]
termviz(we_WPc,termes_ref[cor_WPc_we_idx])
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 20 individuals, described by 34 variables
x="europe"
wnn_WPc_x=wesim(wem_WPc,x)
as.data.frame(wnn_WPc_x)
y="sécurité"
wnn_WPc_y=wesim(wem_WPc,y)
as.data.frame(wnn_WPc_y)
DTM_WPc<-mydtm(pDWPc$text,pDWPc$doc_id,m = m_ref)
cat("Nombre de termes distincts:",DTM_WPc$ncol,"\n")
## Nombre de termes distincts: 356
options(digits=8)
k=18
lm_WPc=ldapol(DTM_WPc,k=k,b=300)
#load("Rdata_RN/lm_RN.o")
termes_ref_lm_cor_WPc=ldabicor(lm_WPc,termes_ref,p=0.01)
as.data.frame(termes_ref_lm_cor_WPc)
cor_WPc_lm_idx=termes_ref %in% unique(str_split(paste(names(termes_ref_lm_cor_WPc[1:50]),collapse = " "),pattern = " "))[[1]]
termviz(lm_WPc,termes_ref[cor_WPc_lm_idx])
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 18 individuals, described by 35 variables
x="europe"
knn_lm_WPc_x=ldanncor(lm_WPc,x)
as.data.frame(knn_lm_WPc_x)
y="sécurité"
knn_lm_WPc_y=ldanncor(lm_WPc,y)
as.data.frame(knn_lm_WPc_y)
z="économie"
knn_lm_WPc_z=ldanncor(lm_WPc,z)
as.data.frame(knn_lm_WPc_z)