Text analysis: fundamentals and sentiment analysis

Lecture 26

Dr. Benjamin Soltoff

Cornell University
INFO 2951 - Spring 2025

April 29, 2025

Announcements

  • Homework 09
  • Project presentations on Friday

Core text data workflows

  • Obtain your text sources
  • Extract documents and move into a corpus
  • Transformation
  • Extract features
  • Perform analysis

Get text data

Obtain your text sources

  • Web sites/APIs
  • Databases
  • PDF documents
  • Digital scans of printed materials

Extract documents and move into a corpus

  • Text corpus
  • Typically stores each text as a raw character string, with metadata and details stored alongside it

Transformation

  • Clean and prepare the text for analysis
  • Improve data quality and reduce extraneous noise
  • Tag words by part of speech (noun, verb, adjective, etc.) or named entity (person, place, company, etc.)
  • Standard text preprocessing
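The steps above can be sketched as a single tidy pipeline (a minimal sketch assuming {tidyverse}, {tidytext}, and {SnowballC}; note that unnest_tokens() lowercases and strips punctuation by default, so those two steps come for free):

```r
library(tidyverse)
library(tidytext)   # unnest_tokens(), stop_words
library(SnowballC)  # wordStem()

sentence <- "Bet!!! I just LOWKEY vibed w/ this 10/10 fire idea"

tibble(text = sentence) |>
  mutate(text = str_remove_all(text, "[0-9]")) |>  # remove numbers
  unnest_tokens(word, text) |>                     # lowercase, strip punctuation, tokenize
  anti_join(stop_words, by = "word") |>            # remove stopwords
  mutate(word = wordStem(word))                    # stem words
```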

Example sentence

Bet!!! I just LOWKEY vibed w/ this 10/10 fire idea, and no cap, it’s honestly giving major slay energy — so I’m finna drop it, like fr, and let y’all totally stan. 🤙

Standardize character case

Before

Bet!!! I just LOWKEY vibed w/ this 10/10 fire idea, and no cap, it’s honestly giving major slay energy — so I’m finna drop it, like fr, and let y’all totally stan. 🤙

After

bet!!! i just lowkey vibed w/ this 10/10 fire idea, and no cap, it’s honestly giving major slay energy — so i’m finna drop it, like fr, and let y’all totally stan. 🤙

Remove numbers

Before

bet!!! i just lowkey vibed w/ this 10/10 fire idea, and no cap, it’s honestly giving major slay energy — so i’m finna drop it, like fr, and let y’all totally stan. 🤙

After

bet!!! i just lowkey vibed w/ this / fire idea, and no cap, it’s honestly giving major slay energy — so i’m finna drop it, like fr, and let y’all totally stan. 🤙

Remove punctuation

Before

bet!!! i just lowkey vibed w/ this / fire idea, and no cap, it’s honestly giving major slay energy — so i’m finna drop it, like fr, and let y’all totally stan. 🤙

After

bet i just lowkey vibed w this fire idea and no cap its honestly giving major slay energy so im finna drop it like fr and let yall totally stan 🤙

Remove stopwords

Before

bet i just lowkey vibed w this fire idea and no cap its honestly giving major slay energy so im finna drop it like fr and let yall totally stan 🤙

After

bet lowkey vibed fire idea cap honestly giving major slay energy im finna drop fr yall totally stan 🤙

Stem words

Before

bet lowkey vibed fire idea cap honestly giving major slay energy im finna drop fr yall totally stan 🤙

After

bet lowkei vibe fire idea cap honestli give major slai energi im finna drop fr yall total stan 🤙

Extract features

Convert the text string into quantifiable measures

Bag-of-words representation

This sentence is giving

Bet!!! I just LOWKEY vibed w/ this 10/10 fire idea, and no cap, it’s honestly giving major slay energy — so I’m finna drop it, like fr, and let y’all totally stan. 🤙

It, this — bet!!! And let vibed I’m so fr, 🤙 stan. 10/10 totally w/ it’s honestly I slay just fire cap, and finna y’all lowkey no idea, major like giving energy drop

Stan. this fire and it’s totally 🤙 like w/ bet!!! I’m major and finna idea, slay giving vibed lowkey no y’all — energy drop cap, I it, just honestly so fr, 10/10 let

Honestly bet!!! It’s finna so giving totally and lowkey major w/ no this I’m vibed energy and — y’all just 🤙 slay like let stan. drop it, fr, fire I cap, 10/10 idea,

This bet!!! Lowkey y’all cap, slay vibed it’s I idea, honestly fire and giving it, fr, I’m 10/10 finna — let stan. 🤙 like energy no just totally drop w/ so and major

Order is meaningless.

Term frequency

Document                  are  ate  cat  cheese  delicious  dog  mice  mouse  silly  the  was
Mice are silly              1    0    0       0          0    0     1      0      1    0    0
The cat ate the mouse       0    1    1       0          0    0     0      1      0    2    0
The cheese was delicious    0    0    0       1          1    0     0      0      0    1    1
The dog ate the cat         0    1    1       0          0    1     0      0      0    2    0
  • Term frequency vector
  • Term-document matrix
  • Sparse data structure
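One way to build this term-document matrix in tidy form (a sketch assuming {tidyverse} and {tidytext}; stopwords are deliberately kept so the counts match the table):

```r
library(tidyverse)
library(tidytext)

docs <- tibble(
  doc = c("Mice are silly", "The cat ate the mouse",
          "The cheese was delicious", "The dog ate the cat")
)

docs |>
  unnest_tokens(word, doc, drop = FALSE) |>  # one row per (document, token)
  count(doc, word) |>                        # term frequency vector per document
  pivot_wider(names_from = word, values_from = n, values_fill = 0)
```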

Term frequency-inverse document frequency

Term frequency: how often a term appears in a document (a raw count, or that count normalized by document length)

Inverse document frequency:

\[idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}\]

tf-idf = term frequency \(\times\) inverse document frequency

Frequency of a term adjusted for how rarely it is used

Term frequency-inverse document frequency

Document                    are    ate    cat  cheese  delicious    dog   mice  mouse  silly    the    was
Mice are silly            0.462  0.000  0.000   0.000      0.000  0.000  0.462  0.000  0.462  0.000  0.000
The cat ate the mouse     0.000  0.139  0.139   0.000      0.000  0.000  0.000  0.277  0.000  0.115  0.000
The cheese was delicious  0.000  0.000  0.000   0.347      0.347  0.000  0.000  0.000  0.000  0.072  0.347
The dog ate the cat       0.000  0.139  0.139   0.000      0.000  0.277  0.000  0.000  0.000  0.115  0.000
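The 0.277 for "mouse" can be checked by hand: tf = 1/5 (one occurrence among five tokens) times idf = ln(4/1) ≈ 1.386 gives ≈ 0.277, so the table normalizes term frequency by document length, which is what {tidytext}'s bind_tf_idf() does. A sketch:

```r
library(tidyverse)
library(tidytext)

docs <- tibble(
  doc = c("Mice are silly", "The cat ate the mouse",
          "The cheese was delicious", "The dog ate the cat")
)

docs |>
  unnest_tokens(word, doc, drop = FALSE) |>
  count(doc, word) |>
  bind_tf_idf(word, doc, n) |>   # adds tf, idf, and tf_idf columns
  filter(word == "mouse")
# tf = 0.2, idf = log(4) ≈ 1.386, tf_idf ≈ 0.277
```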

Word embeddings

Word embedding: a mathematical representation of a word in a continuous vector space

  • Dense data structure
  • Captures context
  • Semantic similarity
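Semantic similarity between embeddings is usually measured with cosine similarity, the cosine of the angle between two vectors. A base-R sketch with hypothetical 3-dimensional vectors (not real GloVe values):

```r
# cosine similarity: dot product divided by the product of the vector norms
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

cat_vec  <- c(0.2, 0.3, 0.6)    # hypothetical embeddings
dog_vec  <- c(0.3, 0.3, 0.5)
sofa_vec <- c(-0.5, 0.1, -0.2)

cosine_sim(cat_vec, dog_vec)    # close to 1: similar meaning
cosine_sim(cat_vec, sofa_vec)   # negative: dissimilar
```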

Word embeddings

word             d1        d2        d3        d4        d5   …   d100
are        -0.51533   0.83186   0.22457  -0.73865   0.18718  …
ate        -0.08029   0.65924   0.35281   0.03491  -0.94404  …
cat         0.23088   0.28283   0.63180  -0.59411  -0.58599  …
cheese     -0.63712   0.60515  -0.19317   0.11606  -0.41051  …
delicious  -0.65534   0.34034   0.30284  -0.14854   0.17683  …
dog         0.30817   0.30938   0.52803  -0.92543  -0.73671  …
mice        0.00064   0.27594   0.11937  -0.58717  -0.73207  …
mouse      -0.09321   0.04969   0.25748  -0.52501  -0.18009  …
silly      -0.08141   0.05955   0.77880  -0.64680  -0.61585  …
the        -0.03819  -0.24487   0.72812  -0.39961   0.08317  …
was         0.13717  -0.54287   0.19419  -0.29953   0.17545  …

(100-dimensional word vectors; only the first five dimensions are shown)

Word embeddings

Document                       d1        d2        d3        d4        d5   …   d100
The dog ate the cat       0.38237   0.76171   2.96888  -2.28385  -2.10040  …
The cat ate the mouse    -0.01901   0.50202   2.69833  -1.88343  -1.54378  …
Mice are silly           -0.59610   1.16735   1.12274  -1.97262  -1.16074  …
The cheese was delicious -1.19348   0.15775   1.03198  -0.73162   0.02494  …

(100-dimensional document vectors; only the first five dimensions are shown)
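Each document vector here appears to be the element-wise sum of its word vectors (summing is one simple pooling choice; averaging is another). Checking the first dimension of "The dog ate the cat" against the word-vector table:

```r
# first-dimension (d1) values taken from the word embedding table
the <- -0.038194
dog <-  0.30817
ate <- -0.080292
cat <-  0.23088

# document embedding (dimension d1) as the sum of its word embeddings
d1 <- the + dog + ate + the + cat
d1  # 0.38237, matching d1 for "The dog ate the cat"
```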

Consumer complaints to the CFPB

[1] "transworld systems inc. \nis trying to collect a debt that is not mine, not owed and is inaccurate."                                                                                                                                                                                                                                                                                                                                   
[2] "I would like to request the suppression of the following items from my credit report, which are the result of my falling victim to identity theft. This information does not relate to [ transactions that I have made/accounts that I have opened ], as the attached supporting documentation can attest. As such, it should be blocked from appearing on my credit report pursuant to section 605B of the Fair Credit Reporting Act."
[3] "Over the past 2 weeks, I have been receiving excessive amounts of telephone calls from the company listed in this complaint. The calls occur between XXXX XXXX and XXXX XXXX to my cell and at my job. The company does not have the right to harass me at work and I want this to stop. It is extremely distracting to be told 5 times a day that I have a call from this collection agency while at work."                           
[4] "I was sold access to an event digitally, of which I have all the screenshots to detail the transactions, transferred the money and was provided with only a fake of a ticket. I have reported this to paypal and it was for the amount of {$21.00} including a {$1.00} fee from paypal. \n\nThis occured on XX/XX/2019, by paypal user who gave two accounts : 1 ) XXXX 2 ) XXXX XXXX"                                                 

Sparse matrix structure

Document-feature matrix of: 117,214 documents, 46,099 features (99.88% sparse) and 0 docvars.
         features
docs        account auto bank call charg chase dai date dollar
  3113204 1       1    2    2    1     1     1   3    1      1
  3113208 0       1    0    6    3     5     0   0    1      1
  3113804 0       0    0    0    0     0     0   2    2      0
  3113805 0       1    0    0    0     0     0   0    0      0
  3113807 0       2    0    0    0     1     0   0    0      0
  3113808 0       0    0    0    0     0     0   0    0      0
[ reached max_ndoc ... 117,208 more documents, reached max_nfeat ... 46,089 more features ]

Sparsity of text corpora

Generating word embeddings

  • Dimension reduction
    • Principal components analysis (PCA)
    • Singular value decomposition (SVD)
  • Probabilistic models
  • Neural networks
    • Word2Vec
    • GloVe
    • BERT
    • ELMo
  • Custom-generated or pre-trained

GloVe

  • Pre-trained word vector representations
  • Measured using co-occurrence statistics (how frequently words occur in proximity to each other)
  • Four versions
    • Wikipedia (2014) - 6 billion tokens, 400 thousand words
    • Twitter - 27 billion tokens, 2 billion tweets, 1.2 million words
    • Common Crawl - 42 billion tokens, 1.9 million words
    • Common Crawl - 840 billion tokens, 2.2 million words
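One way to pull the pre-trained vectors into R is via the {textdata} package (a sketch; the function downloads and caches the vectors on first use, prompting for confirmation):

```r
library(textdata)

# GloVe vectors trained on 6 billion tokens, 100 dimensions per word
glove6b <- embedding_glove6b(dimensions = 100)
glove6b
```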

GloVe 6b (100 dimensions)

# A tibble: 400,000 × 101
   token      d1      d2      d3      d4      d5      d6      d7      d8      d9
   <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 "the" -0.0382 -0.245   0.728  -0.400   0.0832  0.0440 -0.391   0.334  -0.575 
 2 ","   -0.108   0.111   0.598  -0.544   0.674   0.107   0.0389  0.355   0.0635
 3 "."   -0.340   0.209   0.463  -0.648  -0.384   0.0380  0.171   0.160   0.466 
 4 "of"  -0.153  -0.243   0.898   0.170   0.535   0.488  -0.588  -0.180  -1.36  
 5 "to"  -0.190   0.0500  0.191  -0.0492 -0.0897  0.210  -0.550   0.0984 -0.201 
 6 "and" -0.0720  0.231   0.0237 -0.506   0.339   0.196  -0.329   0.184  -0.181 
 7 "in"   0.0857 -0.222   0.166   0.134   0.382   0.354   0.0129  0.225  -0.438 
 8 "a"   -0.271   0.0440 -0.0203 -0.174   0.644   0.712   0.355   0.471  -0.296 
 9 "\""  -0.305  -0.236   0.176  -0.729  -0.283  -0.256   0.266   0.0253 -0.0748
10 "'s"   0.589  -0.202   0.735  -0.683  -0.197  -0.180  -0.392   0.342  -0.606 
# ℹ 399,990 more rows
# ℹ 91 more variables: d10 <dbl>, d11 <dbl>, d12 <dbl>, d13 <dbl>, d14 <dbl>,
#   d15 <dbl>, d16 <dbl>, d17 <dbl>, d18 <dbl>, d19 <dbl>, d20 <dbl>,
#   d21 <dbl>, d22 <dbl>, d23 <dbl>, d24 <dbl>, d25 <dbl>, d26 <dbl>,
#   d27 <dbl>, d28 <dbl>, d29 <dbl>, d30 <dbl>, d31 <dbl>, d32 <dbl>,
#   d33 <dbl>, d34 <dbl>, d35 <dbl>, d36 <dbl>, d37 <dbl>, d38 <dbl>,
#   d39 <dbl>, d40 <dbl>, d41 <dbl>, d42 <dbl>, d43 <dbl>, d44 <dbl>, …

Fairness in word embeddings

Word embeddings learn semantics and meaning from human speech. If the text is biased, then the embeddings will also contain bias.

Perform analysis

  • Basic
    • Word frequency
    • Collocation
    • Dictionary tagging
  • Advanced
    • Document classification
    • Corpora comparison
    • Topic modeling

{tidytext}

  • Using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use
  • Learn more at tidytextmining.com
library(tidyverse)
library(tidytext)

What is tidy text?

text <- c(
  "Yeah, with a boy like that it's serious",
  "There's a boy who is so wonderful",
  "That girls who see him cannot find back home",
  "And the gigolos run like spiders when he comes",
  "'Cause he is Eros and he's Apollo",
  "Girls, with a boy like that it's serious",
  "Senoritas, don't follow him",
  "Soon, he will eat your hearts like cereals",
  "Sweet Lolitas, don't go",
  "You're still young",
  "But every night they fall like dominoes",
  "How he does it, only heaven knows",
  "All the other men turn gay wherever he goes (wow!)"
)
text
 [1] "Yeah, with a boy like that it's serious"           
 [2] "There's a boy who is so wonderful"                 
 [3] "That girls who see him cannot find back home"      
 [4] "And the gigolos run like spiders when he comes"    
 [5] "'Cause he is Eros and he's Apollo"                 
 [6] "Girls, with a boy like that it's serious"          
 [7] "Senoritas, don't follow him"                       
 [8] "Soon, he will eat your hearts like cereals"        
 [9] "Sweet Lolitas, don't go"                           
[10] "You're still young"                                
[11] "But every night they fall like dominoes"           
[12] "How he does it, only heaven knows"                 
[13] "All the other men turn gay wherever he goes (wow!)"

What is tidy text?

text_df <- tibble(line = 1:length(text), text = text)
text_df
# A tibble: 13 × 2
    line text                                              
   <int> <chr>                                             
 1     1 Yeah, with a boy like that it's serious           
 2     2 There's a boy who is so wonderful                 
 3     3 That girls who see him cannot find back home      
 4     4 And the gigolos run like spiders when he comes    
 5     5 'Cause he is Eros and he's Apollo                 
 6     6 Girls, with a boy like that it's serious          
 7     7 Senoritas, don't follow him                       
 8     8 Soon, he will eat your hearts like cereals        
 9     9 Sweet Lolitas, don't go                           
10    10 You're still young                                
11    11 But every night they fall like dominoes           
12    12 How he does it, only heaven knows                 
13    13 All the other men turn gay wherever he goes (wow!)

What is tidy text?

text_df |>
  unnest_tokens(output = word, input = text)
# A tibble: 91 × 2
    line word   
   <int> <chr>  
 1     1 yeah   
 2     1 with   
 3     1 a      
 4     1 boy    
 5     1 like   
 6     1 that   
 7     1 it's   
 8     1 serious
 9     2 there's
10     2 a      
# ℹ 81 more rows

Counting words

text_df |>
  unnest_tokens(word, text) |>
  count(word, sort = TRUE)
# A tibble: 67 × 2
   word      n
   <chr> <int>
 1 he        5
 2 like      5
 3 a         3
 4 boy       3
 5 that      3
 6 and       2
 7 don't     2
 8 girls     2
 9 him       2
10 is        2
# ℹ 57 more rows
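Most of the top counts here are stop words ("he", "like", "a"). They can be removed with an anti_join() against {tidytext}'s stop_words data frame (continuing from the text_df defined earlier):

```r
text_df |>
  unnest_tokens(word, text) |>
  anti_join(stop_words, by = "word") |>  # drop rows whose word is a stop word
  count(word, sort = TRUE)
```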

Application exercise

ae-24

Instructions

  • Go to the course GitHub org and find your ae-24 (repo name will be suffixed with your GitHub name).
  • Clone the repo in RStudio, run renv::restore() to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline (end of the day)

Recap

  • {tidytext} allows you to structure text data in a format conducive to exploratory analysis and wrangling/visualization with {tidyverse}
  • Tokenizing is the process of converting raw character strings into recognizable features (tokens)
  • Remove non-informative stop words to reduce noise in the text data
  • Dictionary-based sentiment analysis provides a rough classification of text into positive/negative sentiments
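As a sketch of that last point, dictionary-based sentiment analysis joins tokens against a sentiment lexicon, here the Bing lexicon via get_sentiments() (using the text_df from earlier; {textdata} may prompt to download the lexicon on first use):

```r
library(tidyverse)
library(tidytext)

text_df |>
  unnest_tokens(word, text) |>
  inner_join(get_sentiments("bing"), by = "word") |>  # keep only lexicon words
  count(sentiment)                                    # tally positive vs. negative
```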