From b7e6c7f93040ba2252e824c2af7428f7fac85872 Mon Sep 17 00:00:00 2001
From: Takuya Kitazawa
Date: Fri, 25 Nov 2022 09:43:29 -0800
Subject: [PATCH] Complete the first draft of JuliaCon proceeding paper

---
 paper/images/tradeoff.pdf    | Bin 0 -> 16575 bytes
 paper/paper.tex              |   4 +
 paper/ref.bib                | 439 +++++++++++++++++------------------
 paper/section/algorithm.tex  |  72 +++---
 paper/section/conclusion.tex |   5 +
 paper/section/data.tex       |   4 +-
 paper/section/evaluation.tex |  98 ++++----
 paper/section/experiment.tex |  49 ++++
 8 files changed, 365 insertions(+), 306 deletions(-)
 create mode 100644 paper/images/tradeoff.pdf
 create mode 100644 paper/section/conclusion.tex
 create mode 100644 paper/section/experiment.tex

diff --git a/paper/images/tradeoff.pdf b/paper/images/tradeoff.pdf
new file mode 100644
index 0000000000000000000000000000000000000000..3eeab281ed0baea75f6ca4c3fc4597151c94a2c0
GIT binary patch
literal 16575
diff --git a/paper/section/algorithm.tex b/paper/section/algorithm.tex

 [item# => popularity] : [4 => 4.0, 6 => 4.0]
 \end{lstlisting}

-As of writing, the other non-personalized options implemented in the package recommend items: that are most frequently co-occurred with a specific reference item (\texttt{CoOccurrence}), based on a percentage of observed \texttt{Event} values that are greater than a certain threshold (\texttt{ThresholdPercentage}), or based on a global mean of observed \texttt{Event} values (\texttt{UserMean}, \texttt{ItemMean}).
+As of writing, the other non-personalized options implemented in the package recommend items that are most frequently co-occurring with a specific reference item (\texttt{CoOccurrence}), based on the percentage of observed \texttt{Event} values that are greater than a certain threshold (\texttt{ThresholdPercentage}), or based on a global mean of observed \texttt{Event} values (\texttt{UserMean}, \texttt{ItemMean}).
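To give a feel for these options, the scoring rule behind \texttt{ItemMean} can be sketched in a few lines of plain Julia. The snippet below is a simplified standalone example on a hypothetical rating matrix (\texttt{R} and \texttt{item\_mean} are illustrative names), not the package's implementation:

\begin{lstlisting}[language = Julia]
# hypothetical 3x6 rating matrix; zeros denote missing values
R = [1 0 0 2 5 0;
     0 4 0 3 4 0;
     2 0 0 5 4 3]

# mean of the observed (nonzero) ratings in one item column
item_mean(r) = sum(r) / max(count(!iszero, r), 1)

# score every item and rank the highest-mean items first
scores = [item_mean(R[:, i]) for i in 1:size(R, 2)]
ranking = sortperm(scores, rev=true)
\end{lstlisting}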
\subsection{Collaborative Filtering}
\label{sec:cf}

-Collaborative filtering (CF) is one of the earliest recommendation techniques that was initially introduced in 1992 \cite{Goldberg1992}. The goal of CF algorithm is to suggest new items for a particular user based on a similarity metric. From a users' perspective, CF assumes that users who behaved similarly on a service share common tastes for items. On the other hand, items which resemble each other are likely to be preferred by the same users.
+Collaborative filtering (CF) is one of the earliest recommendation techniques, initially introduced in 1992 \cite{Goldberg1992}. The goal of the CF algorithm is to suggest new items for a particular user based on a similarity metric. From a user's perspective, CF assumes that users who behaved similarly on a service share common tastes for items. On the other hand, items which resemble each other are likely to be preferred by the same users.

\subsubsection{$k$-Nearest Neighbor}

-A $k$-nearest neighbor ($k$-NN) approach, one of the simplest CF algorithms, runs in two-fold. First, missing values in $R$ is predicted based on the past observations. Here, a $(u, i)$ element between a target user $u$ and item $i$ is estimated by computing the similarities of users (items). Second, a recommender chooses top-$N$ items from the results of the prediction step.
+A $k$-nearest neighbor ($k$-NN) approach, one of the simplest CF algorithms, proceeds in two steps. First, missing values in $R$ are predicted based on past observations. Here, a $(u, i)$ element between a target user $u$ and item $i$ is estimated by computing the similarities of users (items). Second, a recommender chooses the top-$N$ items from the results of the prediction step.

-Importantly, $k$-NN can be classified into a \textit{user-based} and \textit{item-based} algorithm. In a user-based algorithm, user-user similarities are computed for every pairs of rows in $R$. By contrast, item-based CF stands on column-wise similarities between items. \fig{cf} illustrates how CF works on a user-item matrix $R$. The elements are ratings in a $[1, 5]$ range for each user-item pair, so $1$ and $2$ mean relatively negative feedback and vice versa. In the figure, user $a$ and $c$ seem to have similar tastes because both of them gave nearly identical feedback to item $1$, $4$ and $6$. From an item-item perspective, item $4$ and $6$ are similarly rated by user $a$, $b$ and $c$.
+Importantly, $k$-NN can be classified into \textit{user-based} and \textit{item-based} algorithms. In a user-based algorithm, user-user similarities are computed for every pair of rows in $R$. By contrast, item-based CF stands on column-wise similarities between items. \fig{cf} illustrates how CF works on a user-item matrix $R$. The elements are ratings in a $[1, 5]$ range for each user-item pair, so $1$ and $2$ mean relatively negative feedback and vice versa. In the figure, users $a$ and $c$ seem to have similar tastes because both of them gave nearly identical feedback to items $1$, $4$, and $6$. From an item-item perspective, items $4$ and $6$ are similarly rated by users $a$, $b$, and $c$.

 \begin{figure}[htbp]
 \centering
 \includegraphics[width=1.0\linewidth]{images/cf.pdf}
-    \caption{A schematic diagram of the $k$-NN-based recommender systems on a five-level rating matrix. This figure used Figure~1 in \cite{Sarwar2001} as a reference. For an active user $u$, his/her missing elements $r_{u,i}$ are estimated based on either user-user or item-item similarities, and a recommendation list includes highest-scored items.}
+    \caption{A schematic diagram of $k$-NN-based recommender systems on a five-level rating matrix.
This figure is based on Figure~1 in \cite{Sarwar2001}. For an active user $u$, his/her missing elements $r_{u,i}$ are estimated based on either user-user or item-item similarities, and a recommendation list contains the highest-scored items.}
 \label{fig:cf}
 \end{figure}

-In order to measure the similarities between rows (columns), the Pearson correlation and cosine similarity are widely used. For $d$-dimensional vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$, the Pearson correlation $\mathrm{corr}(\mathbf{x}, \mathbf{y})$ and cosine similarity $\mathrm{cos}(\mathbf{x}, \mathbf{y})$ are respectively defined as:
+To measure the similarities between rows (columns), the Pearson correlation and cosine similarity are widely used. For $d$-dimensional vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$, the Pearson correlation $\mathrm{corr}(\mathbf{x}, \mathbf{y})$ and cosine similarity $\mathrm{cos}(\mathbf{x}, \mathbf{y})$ are respectively defined as:

$$
\mathrm{corr}(\mathbf{x}, \mathbf{y}) = \frac{\sum_i (x_{i} - \overline{x})(y_{i} - \overline{y})}{\sqrt{\sum_i (x_{i} - \overline{x})^2} \sqrt{\sum_i (y_{i} - \overline{y})^2}},
$$
$$
\mathrm{cos}(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\| \mathbf{x} \| \| \mathbf{y} \|} = \frac{\sum_i x_{i} y_{i}}{\sqrt{\sum_i x_{i}^2} \sqrt{\sum_i y_{i}^2}},
$$

-where $\overline{x} = \frac{1}{d} \sum^d_{i=1} x_i$ and $\overline{y} = \frac{1}{d} \sum^d_{i=1} y_i$ denote mean values of the elements in a vector. Additionally, in a context of data mining, elements in $\mathbf{x}$ and $\mathbf{y}$ can be distributed on a different scale, so mean-centering of the vectors usually leads better results \cite{Sarwar2001}. Note that cosine similarity between the mean-centered vectors, $\hat{\mathbf{x}} = (x_1 - \overline{x}, x_2 - \overline{x}, \dots, x_n - \overline{x})$ and $\hat{\mathbf{y}} = (y_1 - \overline{y}, y_2 - \overline{y}, \dots, y_n - \overline{y})$, is mathematically equivalent to the Pearson correlation $\mathrm{corr}(\mathbf{x}, \mathbf{y})$, meaning $\mathrm{cos}(\hat{\mathbf{x}}, \hat{\mathbf{y}}) = \mathrm{corr}(\mathbf{x}, \mathbf{y})$, and the following code snippet demonstrates its implementation in the Julia ecosystem.
+where $\overline{x} = \frac{1}{d} \sum^d_{i=1} x_i$ and $\overline{y} = \frac{1}{d} \sum^d_{i=1} y_i$ denote mean values of the elements in a vector. Additionally, in the context of data mining, elements in $\mathbf{x}$ and $\mathbf{y}$ can be distributed on different scales, so mean-centering the vectors usually leads to better results \cite{Sarwar2001}. Note that the cosine similarity between the mean-centered vectors, $\hat{\mathbf{x}} = (x_1 - \overline{x}, x_2 - \overline{x}, \dots, x_d - \overline{x})$ and $\hat{\mathbf{y}} = (y_1 - \overline{y}, y_2 - \overline{y}, \dots, y_d - \overline{y})$, is mathematically equivalent to the Pearson correlation $\mathrm{corr}(\mathbf{x}, \mathbf{y})$, meaning $\mathrm{cos}(\hat{\mathbf{x}}, \hat{\mathbf{y}}) = \mathrm{corr}(\mathbf{x}, \mathbf{y})$, and the following code snippet demonstrates its implementation in the Julia ecosystem.
\begin{lstlisting}[language = Julia]
 import LinearAlgebra: dot, norm
 import Statistics: mean

@@ -74,11 +74,12 @@ \subsubsection{$k$-Nearest Neighbor}
 function similarity(x::AbstractVector, y::AbstractVector)
     x_hat, y_hat = x .- mean(x), y .- mean(y)
-    dot(x_hat, y_hat) / (norm(x_hat) * norm(y_hat))
+    dot(x_hat, y_hat) / (
+        norm(x_hat) * norm(y_hat))
 end
 \end{lstlisting}

-Based on the similarity definition, user-based CF using the Pearson correlation \cite{Herlocker1999} sees $\mathbf{x}$ and $\mathbf{y}$ as two different rows in $R$, respectively, and gives a weight to a user-user pair by the similarity. In the \texttt{fit!()} phase, the weights allow a recommender to (1) select the top-$k$ highest-weighted users (i.e., nearest neighbors) of a target user $u$, and (2) predict missing elements based on a mean value of neighbors' feedback. Ultimately, sorting items by the predicted values enables \texttt{recommend()} to generate a ranked list of recommended items for a user $u$. Simply put, a constructor of user-based CF in \texttt{Recommendation.jl} is as follows.
+Based on the similarity definition, user-based CF using the Pearson correlation \cite{Herlocker1999} sees $\mathbf{x}$ and $\mathbf{y}$ as two different rows in $R$, respectively, and weights a user-user pair by their similarity. In the \texttt{fit!()} phase, the weights allow a recommender to (1) select the top-$k$ highest-weighted users (i.e., nearest neighbors) of a target user $u$, and (2) predict missing elements based on a mean value of the neighbors' feedback. Ultimately, sorting items by the predicted values enables \texttt{recommend()} to generate a ranked list of recommended items for a user $u$. Simply put, a constructor of user-based CF in \texttt{Recommendation.jl} is as follows.

\begin{lstlisting}[language = Julia]
UserKNN(data::DataAccessor, n_neighbors::Integer)
@@ -91,8 +92,9 @@ \subsubsection{$k$-Nearest Neighbor}
 \end{lstlisting}
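For intuition, the two-step prediction scheme described above can be illustrated by the following standalone sketch, which reuses \texttt{similarity()} and the imports from the earlier listing. The function name \texttt{predict} and the dense-matrix setting are assumptions for illustration, not the package's actual implementation:

\begin{lstlisting}[language = Julia]
# predict user u's missing feedback on item i from the
# k most similar users in a dense rating matrix R
function predict(R::AbstractMatrix, u::Integer,
                 i::Integer, k::Integer)
    others = [v for v in 1:size(R, 1) if v != u]
    sims = [similarity(R[u, :], R[v, :]) for v in others]
    # nearest neighbors = the k highest-weighted users
    neighbors = others[sortperm(sims, rev=true)[1:k]]
    # plain mean of the neighbors' feedback on item i
    mean(R[v, i] for v in neighbors)
end
\end{lstlisting}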
\subsubsection{Singular Value Decomposition}
+\label{sec:svd}

-Along with the development of the CF techniques, researchers noticed that handling the original huge user-item matrices is computationally expensive. Moreover, CF-based recommendation leads overfitting to individual taste due to the sparsity of $R$. Thus, dimensionality reduction techniques were applied to recommendation in order to capture more abstract preferences \cite{Sarwar2000}.
+Along with the development of the CF techniques, researchers noticed that handling the original huge user-item matrices is computationally expensive. Moreover, CF-based recommendation leads to overfitting to individual tastes due to the sparsity of $R$. Thus, dimensionality reduction techniques were applied to recommendation to capture more abstract preferences \cite{Sarwar2000}.

Singular value decomposition (SVD) is one of the most popular dimensionality reduction techniques that decomposes an $m$-by-$n$ matrix $A$ to $U \in \mathbb{R}^{m \times m}$, $\Sigma \in \mathbb{R}^{m \times n}$ and $V \in \mathbb{R}^{n \times n}$:
\begin{align*}
@@ -100,9 +102,9 @@ \subsubsection{Singular Value Decomposition}
 = & \ \left[\mathbf{u}_1, \mathbf{u}_2, \cdots, \mathbf{u}_m\right] \cdot \mathrm{diag}\left(\sigma_1, \sigma_2, \dots, \sigma_{\min(m, n)}\right) \cdot \\
 & \ \left[\mathbf{v}_1, \mathbf{v}_2, \cdots, \mathbf{v}_n\right]^{\mathrm{T}},
 \end{align*}
-by letting $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min(m, n)} \geq 0$. An orthogonal matrix $U$ ($V$) is called left (right) singular vectors which represents characteristics of columns (rows) in $R$, and a diagonal matrix $\Sigma$ holds singular values on the diagonal elements as weights of each singular vector.
+by letting $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min(m, n)} \geq 0$. An orthogonal matrix $U$ ($V$) holds the left (right) singular vectors, which represent characteristics of columns (rows) in $R$, and a diagonal matrix $\Sigma$ holds singular values on the diagonal elements as weights of each singular vector.

-In practice, the most lower singular values of real-world matrices are very close to zero, and hence using only top-$k$ singular values $\Sigma_k \in \mathbb{R}^{k \times k}$ and corresponding singular vectors $U_k \in \mathbb{R}^{m \times k}$, $V_k \in \mathbb{R}^{n \times k}$ is sufficient to make reasonable rank-$k$ approximation of a matrix $A$ as: $\mathrm{SVD}_k(A) = U_k \Sigma_k V_k^{\mathrm{T}}$. It is mathematically proven that $\mathrm{SVD}_k(A)$ is the best rank-$k$ approximation of the matrix $A$ in both the spectral and Frobenius norm, where the spectral norm of a matrix equals to its largest singular value.
+In practice, most of the lower singular values of real-world matrices are very close to zero, and hence using only the top-$k$ singular values $\Sigma_k \in \mathbb{R}^{k \times k}$ and corresponding singular vectors $U_k \in \mathbb{R}^{m \times k}$, $V_k \in \mathbb{R}^{n \times k}$ is sufficient to make a reasonable rank-$k$ approximation of a matrix $A$ as $\mathrm{SVD}_k(A) = U_k \Sigma_k V_k^{\mathrm{T}}$. It is mathematically proven that $\mathrm{SVD}_k(A)$ is the best rank-$k$ approximation of the matrix $A$ in both the spectral and Frobenius norm, where the spectral norm of a matrix equals its largest singular value.

 \begin{figure}[htbp]
 \centering
@@ -111,21 +113,21 @@ \subsubsection{Singular Value Decomposition}
 \label{fig:svd}
 \end{figure}

-Sarwar et~al. \cite{Sarwar2000} studied the use of SVD on user-item matrix $R \in \mathbb{R}^{|\mathcal{U}| \times |\mathcal{I}|}$. In a context of recommendation, $U_k \in \mathbb{R}^{|\mathcal{U}| \times k}$, $V \in \mathbb{R}^{|\mathcal{I}| \times k}$ and $\Sigma \in \mathbb{R}^{k \times k}$ are respectively seen as $k$ user/item feature vectors and corresponding weights. The idea of low-rank approximation that discards lower singular values intuitively works as \textit{compression} or \textit{denoising} of the original matrix; that is, each element in a rank-$k$ matrix $A_k$ holds the best \textit{compressed} (or \textit{denoised}) value of the original element in $A$. Thus, $R_k = \mathrm{SVD}_k(R)$, the best rank-$k$ approximation of $R$, captures as much as possible of underlying users' preferences. Once $R$ is decomposed into $U, \Sigma$ and $V$, a $(u, i)$ element of $R_k$ calculated by $\sum^k_{j=1} \sigma_j u_{u, j} v_{i, j}$ could be a prediction for the user-item pair. In the Julia ecosystem, the process can be implemented in a few lines of code with the standard \texttt{LinearAlgebra} library:
+Sarwar et~al. \cite{Sarwar2000} studied the use of SVD on a user-item matrix $R \in \mathbb{R}^{|\mathcal{U}| \times |\mathcal{I}|}$. In the context of recommendation, $U_k \in \mathbb{R}^{|\mathcal{U}| \times k}$, $V_k \in \mathbb{R}^{|\mathcal{I}| \times k}$ and $\Sigma_k \in \mathbb{R}^{k \times k}$ are respectively seen as $k$ user/item feature vectors and corresponding weights.
The idea of low-rank approximation that discards lower singular values intuitively works as \textit{compression} or \textit{denoising} of the original matrix; that is, each element in a rank-$k$ matrix $A_k$ holds the best \textit{compressed} (or \textit{denoised}) value of the original element in $A$. Thus, $R_k = \mathrm{SVD}_k(R)$, the best rank-$k$ approximation of $R$, captures as much of the users' underlying preferences as possible. Once $R$ is decomposed into $U, \Sigma$ and $V$, a $(u, i)$ element of $R_k$ calculated by $\sum^k_{j=1} \sigma_j u_{u, j} v_{i, j}$ could be a prediction for the user-item pair. In the Julia ecosystem, the process can be implemented in a few lines of code with the standard \texttt{LinearAlgebra} library:

\begin{lstlisting}[language = Julia]
import LinearAlgebra: dot, svd

F = svd(data.R)
U, S, Vt = F.U[:, 1:k], F.S[1:k], F.Vt[1:k, :]
-# predict a missing value between user and item
+# predict a value for an arbitrary user-item pair
r_k = dot(U[user, :] .* S, Vt[:, item])
\end{lstlisting}

\subsubsection{Matrix Factorization}

-Even though dimensionality reduction is a promising approach to make effective recommendation, the feasibility of SVD is still questionable due to the computational cost of decomposition and need for uncertain preliminary work such as missing value imputation and searching an optimal $k$. As a result, a new technique generally called matrix factorization (MF) was introduced \cite{Koren2009} as an alternative.
+Even though dimensionality reduction is a promising approach to making effective recommendations, the feasibility of SVD is still questionable due to the computational cost of decomposition and the need for uncertain preliminary work such as missing value imputation and searching for an optimal $k$. As a result, a new technique generally called matrix factorization (MF) was introduced \cite{Koren2009} as an alternative.

-The initial MF technique was invented by Funk \cite{Funk2006} during the Netflix Prize \cite{Bennett07thenetflix}, and the method is also known as \textit{regularized SVD} because it can be seen as an extension of the conventional SVD-based recommendation that gives efficient approximation of the original SVD. The basic idea of MF is to factorize a user-item matrix $R$ to a user factored matrix $P \in \mathbb{R}^{|\mathcal{U}| \times k}$ and item factored matrix $Q \in \mathbb{R}^{|\mathcal{I}| \times k}$, by solving the following minimization problem for a set of observed user-item interactions $\mathcal{S} = \{(u, i) \in \mathcal{U} \times \mathcal{I}\}$:
+The initial MF technique was invented by Funk \cite{Funk2006} during the Netflix Prize \cite{Bennett07thenetflix}, and the method is also known as \textit{regularized SVD} because it can be seen as an extension of the conventional SVD-based recommendation that gives an efficient approximation of the original SVD.
The basic idea of MF is to factorize a user-item matrix $R$ to a user-factored matrix $P \in \mathbb{R}^{|\mathcal{U}| \times k}$ and an item-factored matrix $Q \in \mathbb{R}^{|\mathcal{I}| \times k}$, by solving the following minimization problem for a set of observed user-item interactions $\mathcal{S} = \{(u, i) \in \mathcal{U} \times \mathcal{I}\}$:
$$
\min_{P, Q} \sum_{(u, i) \in \mathcal{S}} \left( r_{u,i} - \mathbf{p}_u^{\mathrm{T}} \mathbf{q}_i \right)^2 + \lambda \ (\|\mathbf{p}_u\|^2 + \|\mathbf{q}_i\|^2),
$$
@@ -140,29 +142,29 @@ \subsubsection{Matrix Factorization}
 end
 \end{lstlisting}
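Concretely, a single SGD update for the minimization problem above can be sketched as follows. This is a simplified standalone version with a hypothetical helper name \texttt{sgd\_step!}, not the package's implementation:

\begin{lstlisting}[language = Julia]
import LinearAlgebra: dot

# one SGD step on an observed rating r = R[u, i],
# with learning rate eta and regularization lambda
function sgd_step!(P::AbstractMatrix, Q::AbstractMatrix,
                   u::Integer, i::Integer, r::Real,
                   eta::Real, lambda::Real)
    p, q = P[u, :], Q[i, :]  # copies of the current factors
    err = r - dot(p, q)
    P[u, :] = p + eta * (err * q - lambda * p)
    Q[i, :] = q + eta * (err * p - lambda * q)
end
\end{lstlisting}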
-Eventually, $R$ is approximated by $PQ^{\mathrm{T}}$ as shown in \fig{mf}, and a recommender can rank items by the prediction. Notice that mathematically tractable properties of SVD such as orthogonality of factored matrices will be lost over the course of approximation.
+Eventually, $R$ is approximated by $PQ^{\mathrm{T}}$ as shown in \fig{mf}, and a recommender can rank items by the prediction. Notice that mathematically tractable properties of SVD, such as the orthogonality of the factored matrices, will be lost over the course of the approximation.

 \begin{figure}[htbp]
 \centering
 \includegraphics[width=0.8\linewidth]{images/mf.pdf}
-    \caption{MF for an $m$-by-$n$ rating matrix $R$. Unlike SVD, singular values in $\Sigma$ are considered to be embedded to the factored matrices.}
+    \caption{MF for an $m$-by-$n$ rating matrix $R$. Unlike SVD, singular values in $\Sigma$ are considered to be embedded in the factored matrices.}
 \label{fig:mf}
 \end{figure}

-MF is attractive in terms of not only efficiency but extensibility. Since prediction for each user-item pair can be written by a simple vector product as $r_{u,i} = \mathbf{p}_u^{\mathrm{T}} \mathbf{q}_i$, incorporating different features (e.g., biases and temporal factors) into the model as linear combinations is straightforward. For example, let $\mu$ be a global mean of all elements in $R$, and $b_u, b_i$ be respectively a user and item bias term. Here, we assume that each observation can be represented as $r_{u,i} = \mu + b_u + b_i + \mathbf{p}_u^{\mathrm{T}} \mathbf{q}_i$. This formulation is known as biased MF \cite{Koren2009}, and it is possible to capture more information than the original MF even on the same set of events $\mathcal{S}$. It should be noted that advanced methods such as tensor factorization \cite{Karatzoglou2010} would require higher dimensionality and more costly optimization scheme to enrich MF.
+MF is attractive in terms of not only efficiency but also extensibility. Since the prediction for each user-item pair can be written as a simple vector product, $r_{u,i} = \mathbf{p}_u^{\mathrm{T}} \mathbf{q}_i$, incorporating different features (e.g., biases and temporal factors) into the model as linear combinations is straightforward. For example, let $\mu$ be a global mean of all elements in $R$, and $b_u, b_i$ be respectively a user and an item bias term. Here, we assume that each observation can be represented as $r_{u,i} = \mu + b_u + b_i + \mathbf{p}_u^{\mathrm{T}} \mathbf{q}_i$. This formulation is known as biased MF \cite{Koren2009}, and it is possible to capture more information than the original MF even on the same set of events $\mathcal{S}$. There are also other advanced methods such as tensor factorization \cite{Karatzoglou2010} that require higher dimensionality and a more costly optimization scheme to enrich MF.

-Meanwhile, there are different options for loss functions to optimize MF. To give an example, Chen et~al. \cite{Chen2011} showed various types of features and loss functions which can be incorporated into a MF scheme. An appropriate choice of their combinations is likely to lead surprisingly better accuracy compared to the classical MF, and \texttt{Recommendation.jl} currently supports Bayesian personalized ranking (BPR) loss \cite{10.5555/1795114.1795167} as an alternative option via \texttt{BPRMatrixFactorization <: Recommender}.
+Meanwhile, there are different options for loss functions to optimize MF. To give an example, Chen et~al. \cite{Chen2011} showed various types of features and loss functions which can be incorporated into an MF scheme. An appropriate choice of their combinations is likely to lead to surprisingly better accuracy compared to the classical MF, and \texttt{Recommendation.jl} currently supports the Bayesian personalized ranking (BPR) loss \cite{10.5555/1795114.1795167} as an alternative option via \texttt{BPRMatrixFactorization <: Recommender}.

\subsection{Factorization Machines}

-Beyond numerous discussions about MF, factorization machines (FMs) have been recently developed as its generalized model. In contrast to MF, FMs are formulated by a equation that is similar to the polynomial regression, and the model can be applied all of regression, classification and ranking problems depending on a choice of loss function with or without SGD-based optimization.
+Beyond numerous discussions about MF, factorization machines (FMs) have recently been developed as a generalization of MF. In contrast to MF, FMs are formulated by an equation similar to polynomial regression, and the model can be applied to regression, classification, and ranking problems alike, depending on the choice of loss function, with or without SGD-based optimization.

-First of all, for an input vector $\mathbf{x} \in \mathbb{R}^d$, let us imagine the following second-order polynomial model parameterized by $w_0 \in \mathbb{R}$, $\mathbf{w} \in \mathbb{R}^d$ as: $\hat{y}(\mathbf{x}) := w_0 + \mathbf{w}^{\mathrm{T}} \mathbf{x} + \sum_{i=1}^d \sum_{j=i}^d w_{i,j} x_i x_j,$ where $w_{i,j}$ is an element in a symmetric matrix $W \in \mathbb{R}^{d \times d}$, and it indicates a weight of $x_i x_j$, an interaction between the $i$-th and $j$-th element in $\mathbf{x}$. Here, FMs assume that $W$ can be approximated by a low-rank matrix $V \in \mathbb{R}^{d \times k}$ for $k < d$, and the weights are replaced with inner products of $k$ dimensional vectors as $w_{i, j} \approx \mathbf{v}_i^{\mathrm{T}} \mathbf{v}_j$ for $\mathbf{v}_1, \cdots, \mathbf{v}_d \in \mathbb{R}^k$. As a result, the formulation of FM model is:
+First of all, for an input vector $\mathbf{x} \in \mathbb{R}^d$, let us imagine the following second-order polynomial model parameterized by $w_0 \in \mathbb{R}$, $\mathbf{w} \in \mathbb{R}^d$ as: $\hat{y}(\mathbf{x}) := w_0 + \mathbf{w}^{\mathrm{T}} \mathbf{x} + \sum_{i=1}^d \sum_{j=i}^d w_{i,j} x_i x_j,$ where $w_{i,j}$ is an element in a symmetric matrix $W \in \mathbb{R}^{d \times d}$, and it indicates a weight of $x_i x_j$, an interaction between the $i$-th and $j$-th element in $\mathbf{x}$. Here, FMs assume that $W$ can be approximated by a low-rank matrix $V \in \mathbb{R}^{d \times k}$ for $k < d$, and the weights are replaced with inner products of $k$-dimensional vectors as $w_{i, j} \approx \mathbf{v}_i^{\mathrm{T}} \mathbf{v}_j$ for $\mathbf{v}_1, \cdots, \mathbf{v}_d \in \mathbb{R}^k$. As a result, the formulation of the FM model is:
\begin{equation}
\hat{y}^{\mathrm{FM}}(\mathbf{x}) := \underbrace{w_0}_{\textbf{global bias}} + \underbrace{\mathbf{w}^{\mathrm{T}} \mathbf{x}_{ }}_{\textbf{linear}} + \sum_{i=1}^d \sum_{j=i}^d \underbrace{\mathbf{v}_i^{\mathrm{T}} \mathbf{v}_j}_{\textbf{interaction}} x_i x_j.
\label{eq:FMs}
\end{equation}
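For intuition, this prediction can be evaluated naively as a double loop over feature pairs. The following is an illustrative standalone sketch with a hypothetical function name \texttt{predict\_fm}, not the package's code:

\begin{lstlisting}[language = Julia]
import LinearAlgebra: dot

# naive evaluation of the second-order FM prediction
function predict_fm(w0::Real, w::AbstractVector,
                    V::AbstractMatrix, x::AbstractVector)
    y = w0 + dot(w, x)       # global bias + linear terms
    d = length(x)
    for i in 1:d, j in i:d   # pairwise interactions
        y += dot(V[i, :], V[j, :]) * x[i] * x[j]
    end
    y
end
\end{lstlisting}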
-Several studies \cite{Geuens2015,Rendle2012-1,Rendle2012-3} prove that the flexibility of feature representations $\mathbf{x}$ is one of the most important characteristics that makes FMs versatile. The code snippet below demonstrates how an input vector is created with \texttt{Recommendation.jl}'s utility function \texttt{onehot()}.
+Several studies \cite{Geuens2015,Rendle2012-1,Rendle2012-3} prove that the flexibility of feature representations $\mathbf{x}$ is one of the most important characteristics that makes FMs versatile. The code snippet below demonstrates how a concatenated input vector is created with \texttt{Recommendation.jl}'s utility function \texttt{onehot()}.

\begin{lstlisting}[language = Julia]
x = vcat(
@@ -170,8 +172,8 @@ \subsection{Factorization Machines}
     onehot(3, collect(1:n_items)), # item ID
     2.5, # rating
     # ...
-    onehot("Male", # gender
-        ["Male", "Female", "Others", missing]),
+    onehot("Weekly", # email preference
+        ["Daily", "Weekly", "Monthly", missing]),
     onehot(2, collect(1:7)) # day of week
 )
 \end{lstlisting}
@@ -180,7 +182,7 @@ \subsection{Factorization Machines}
 \begin{align*}
 \hat{y}^{\mathrm{FM}^{(p)}}(\mathbf{x}) &:= w_0 + \mathbf{w}^{\mathrm{T}} \mathbf{x} \\
 &+ \sum^p_{\ell=2} \sum^d_{j_1 = 1} \cdots \sum^d_{j_p = j_{p-1} + 1} \left( \prod^{\ell}_{i=1} x_{j_i} \right) \sum^{k_{\ell}}_{f=1} \prod^{\ell}_{i=1} v_{j_i,f},
 \end{align*}
-with the model parameters $w_0 \in \mathbb{R}, \ \mathbf{w} \in \mathbb{R}^d, \ V_{\ell} \in \mathbb{R}^{d \times k_{\ell}},$ where $\ell \in \{2, \cdots, p\}$. Although the higher-order FMs are attractive to capture more complex underlying concepts from dynamic data, the computational cost should become more expensive accordingly. In favor of balancing the algorithmic sophistication and its efficiency, \texttt{Recommendation.jl} only considers the second-order model trained by SGD for the time being.
+with the model parameters $w_0 \in \mathbb{R}, \ \mathbf{w} \in \mathbb{R}^d, \ V_{\ell} \in \mathbb{R}^{d \times k_{\ell}},$ where $\ell \in \{2, \cdots, p\}$. Although the higher-order FMs are attractive for capturing more complex underlying concepts from dynamic data, the computational cost grows accordingly. To balance algorithmic sophistication and efficiency, \texttt{Recommendation.jl} only considers the second-order model trained by SGD for the time being.

\begin{lstlisting}[language = Julia]
struct FactorizationMachines <: Recommender
@@ -195,11 +197,11 @@ \subsection{Factorization Machines}

\subsection{Content-Based Filtering}

-All techniques introduced so far rely on users' historical behavior on a service, but these kinds of recommenders easily face a challenge so-called \textit{cold-start} when it comes to recommending new items (for new users) that do not have sufficient amount of historical data to capture meaningful information. In order to work around the difficulty, content-based recommender systems \cite{Lops2011} are likely to be preferred in reality.
+All techniques introduced so far rely on users' historical behavior on a service, but these kinds of recommenders easily face the so-called \textit{cold-start} challenge when it comes to recommending new items (or serving new users) that do not have a sufficient amount of historical data to capture meaningful information. To work around the difficulty, content-based recommender systems \cite{Lops2011} are likely to be preferred in reality.

-Most importantly, content-based recommenders make recommendation without using the other users' feedbacks. In particular, a content-based approach gives scores to items based on two kinds of information: item model and (static) user preference. In order to model the items, an item-attribute matrix is defined as: $I \in \mathbb{R}^{|\mathcal{I}| \times |\mathcal{A}|}$, where $\mathcal{A}$ is a set of item attributes. Meanwhile, user attributes can be captured through \texttt{DataAccessor}'s \texttt{user\_attributes} property, which is independent from what kind of \texttt{Event}s a system has observed.
+Most importantly, content-based recommenders make recommendations without using other users' feedback. In particular, a content-based approach gives scores to items based on two kinds of information: an item model and a (static) user preference. To model the items, an item-attribute matrix is defined as $I \in \mathbb{R}^{|\mathcal{I}| \times |\mathcal{A}|}$, where $\mathcal{A}$ is a set of item attributes. Meanwhile, user attributes can be captured through \texttt{DataAccessor}'s \texttt{user\_attributes} property, which is independent of what kind of \texttt{Event}s a system has observed.

-From a practical perspective, choosing a set of attributes $\mathcal{A}$ is an essential problem to launch a content-based recommender successfully. In fact, there tend to be numerous candidates on a real-world dataset such as item category and brand, but using too much attributes may increase sparsity and complexity of the vectors, which ends up with poor recommendation performance. With that in mind, one of the most well-studied types of attribute \texttt{Recommendation.jl} also supports is ``term''. More concretely, each item is represented by a set of words, and the items are modeled by TF-IDF weighting \cite{Manning2008}. For instance, if we like to recommend web pages to users, we first need to parse sentences on a page and then construct a vector based on the frequency of each term as:
+From a practical perspective, choosing a set of attributes $\mathcal{A}$ is essential to launching a content-based recommender successfully. In fact, there tend to be numerous candidates in a real-world dataset, such as item category and brand, but using too many attributes may increase the sparsity and complexity of the vectors, which ends up with poor recommendation performance. With that in mind, one of the most well-studied types of attribute \texttt{Recommendation.jl} also supports is ``term''. More concretely, each item is represented by a set of words, and the items are modeled by TF-IDF weighting \cite{Manning2008}.
For instance, if we would like to recommend web pages to users, we first need to parse the sentences on a page and then construct a vector based on the frequency of each term as:

\begin{equation*}
I=
\begin{blockarray}
\end{blockarray}
\end{equation*}

-In case of our item-word matrices, for a given item $i$, term frequency (TF) for a term $t$ is defined as: $\mathrm{tf}(t, i) = \frac{n_{t,i}}{N_i},$ where $n_{t,i}$ denotes an $(i, t)$ element in $I$, and $N_i$ is the total number of words that an item $i$ contains. Meanwhile, inverse document frequency (IDF) is computed over $M$ items as: $\mathrm{idf}(t) = \log \frac{M}{\mathrm{df}(t)} + 1,$ where $\mathrm{df}(t)$ counts the number of items which associate with a term $t$. Finally, each item-term pair is weighted by: $\mathrm{tf}(t, i) \cdot \mathrm{idf}(t)$ in the TF-IDF scheme.
+In the case of our item-word matrices, for a given item $i$, the term frequency (TF) for a term $t$ is defined as $\mathrm{tf}(t, i) = \frac{n_{t,i}}{N_i},$ where $n_{t,i}$ denotes an $(i, t)$ element in $I$, and $N_i$ is the total number of words that an item $i$ contains. Meanwhile, the inverse document frequency (IDF) is computed over $M$ items as $\mathrm{idf}(t) = \log \frac{M}{\mathrm{df}(t)} + 1,$ where $\mathrm{df}(t)$ counts the number of items associated with a term $t$. Finally, each item-term pair is weighted by $\mathrm{tf}(t, i) \cdot \mathrm{idf}(t)$ in the TF-IDF scheme.

-Since there are several variations of how to calculate $\mathrm{tf}(t, i)$ and $\mathrm{idf}(t)$, \texttt{Recommendation.jl} requires users to pre-compute these numbers in order to maximize the feasibility of the recommender:
+Since there are several variations of how to calculate $\mathrm{tf}(t, i)$ and $\mathrm{idf}(t)$, \texttt{Recommendation.jl} requires users to pre-compute these numbers to maximize the feasibility of the recommender:

\begin{lstlisting}[language = Julia]
struct TFIDF <: Recommender
@@ -228,4 +230,4 @@ \subsection{Content-Based Filtering}
 end
 \end{lstlisting}
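As a reference point, these inputs can be derived from an item-term count matrix by directly following the definitions above. The snippet below is a standalone sketch with a hypothetical count matrix \texttt{counts}, not the package's internals:

\begin{lstlisting}[language = Julia]
# hypothetical item-term counts (rows: items, cols: terms)
counts = [2 0 1;
          0 3 1;
          1 1 0]
M = size(counts, 1)                # number of items
tf = counts ./ sum(counts, dims=2) # tf(t, i) = n_{t,i} / N_i
df = vec(sum(counts .> 0, dims=1)) # df(t): items with term t
idf = log.(M ./ df) .+ 1           # idf(t) = log(M/df(t)) + 1
tfidf = tf .* idf'                 # weight per item-term pair
\end{lstlisting}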
-% If the features were chosen appropriately, content-based recommenders could work well even on challenging settings which cannot be handled by the conventional recommenders. To give an example, when a new item is added to a system, making reasonable prediction for the item is impossible by using the classical approaches such as CF. By contrast, since content-based recommenders only require the attributes of items, new items can show up in a recommendation list with equal chance to the old items. Furthermore, explaining the results of content-based recommendation is possible because the attributes are manually selected by humans.
+% If the features were chosen appropriately, content-based recommenders could work well even in challenging settings that cannot be handled by conventional recommenders. To give an example, when a new item is added to a system, making a reasonable prediction for the item is impossible using classical approaches such as CF. By contrast, since content-based recommenders only require the attributes of items, new items can show up in a recommendation list with an equal chance to the old items. Furthermore, explaining the results of content-based recommendations is possible because the attributes are manually selected by humans.
diff --git a/paper/section/conclusion.tex b/paper/section/conclusion.tex
new file mode 100644
index 0000000..44e0750
--- /dev/null
+++ b/paper/section/conclusion.tex
@@ -0,0 +1,5 @@
+This paper introduced \texttt{Recommendation.jl}, an open-source package for building recommender systems in the Julia programming language. First, by reviewing each of the core features of practical recommender pipelines, namely the data model (\sect{data}), the recommender interface and algorithms (\sect{algorithm}), and the evaluation methods (\sect{evaluation}), we observed how diverse the requirements of recommender systems can be; the applications must be able to address both explicit and implicit representations of user feedback, hybridize rule-based and machine learning-based algorithms, and assess the outcomes from wide-ranging perspectives in terms of not only accuracy but also diversity, coverage, novelty, and serendipity. Thus, Julia's extensible and mathematical operation-friendly APIs come in handy for working with the unique characteristics that we demonstrated through formulations and corresponding code snippets throughout the paper.
+
+Moreover, we conducted a benchmark with multiple recommender-metric pairs provided by \texttt{Recommendation.jl} and confirmed that there are no one-size-fits-all approaches to making ``good'' recommendations. On the one hand, we can maximize prediction accuracy by training a sophisticated model-based recommender with an optimal set of hyperparameters. At the same time, however, the best prediction accuracy does not always yield the most diverse recommendations, which might eventually hinder recommenders from addressing fairness implications. These observations tell us that one of the most important requirements for recommender frameworks is to make a wide variety of options available for developers while leaving enough space for customization, which \texttt{Recommendation.jl} has tried to incorporate by design.
+
+Finally, there are numerous possible directions to improve the package, as we learned from the other open-source solutions in \sect{introduction}. For instance, the availability of state-of-the-art recommendation algorithms makes a framework more promising in a competitive industry environment, where Python-based machine learning packages play a dominant role. Meanwhile, since computational efficiency is a key criterion that directly leads to a developer's productivity, the use of acceleration techniques such as distributed multiprocessing and GPU programming would be a necessary next step. Last but not least, making it easier to run an end-to-end recommendation pipeline iteratively is a foundational challenge for bridging the gap between offline and online setups. In particular, evaluation phases pose a crucial challenge in reproducibility, as mentioned in \sect{evaluation}.
diff --git a/paper/section/data.tex b/paper/section/data.tex
index 9caf4e3..467dd17 100644
--- a/paper/section/data.tex
+++ b/paper/section/data.tex
@@ -1,6 +1,6 @@
 As depicted in \fig{recommender}, a common first step of building a recommender is to capture user-item events and translate them into matrix representation. Here, \texttt{Recommendation.jl} eases the step by providing a unified wrapper called \texttt{DataAccessor}. Since data for recommender systems is easily standardizable as a collection of user, item, and auxiliary attributes, the common interface helps developers follow the separation-of-concerns principle and ensures the ease and reliability of data manipulation.
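As a minimal sketch of that workflow, a small rating matrix can be wrapped as follows. This assumes the matrix-accepting constructor form, where zero elements denote missing observations; the exact constructor signature may differ:

\begin{lstlisting}[language = Julia]
using Recommendation

# hypothetical 3x6 rating matrix; zeros denote missing values
R = [1 0 0 2 5 0;
     0 4 0 3 4 0;
     2 0 0 5 4 3]
data = DataAccessor(R)
\end{lstlisting}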
-To be more precise, raw data is always converted into a \texttt{DataAccessor} instance at the data preprocessing phase with proper validation (e.g., data type check, missing value handling), and hence the subsequent steps can simply take the instance and access the data (or metadata) without worrying about unexpected input. \fig{accessor} illustrates the procedure.
+To be more precise, raw data is always converted into a \texttt{DataAccessor} instance at the data preprocessing phase with proper validation (e.g., data type check, missing value handling), and hence the subsequent steps can simply take the instance and access the data (or metadata) without worrying about unexpected inputs. \fig{accessor} illustrates the procedure.

 \begin{figure}[htbp]
 \centering
@@ -57,4 +57,4 @@
attribute::AbstractVector)
 \end{lstlisting}

-Additionally, the package provides data loaders that import publicly available datasets such as MovieLens \cite{harper2015movielens}, Amazon Reviews \cite{ni2019justifying}, HetRec 2011 Last.FM\footnote{\url{https://www.last.fm/}} dataset \cite{Cantador:RecSys2011}, as well as a synthetic implicit feedback generator using a simple rule-based method demonstrated in \cite{Aharon2013}. These modules return a ready-to-use \texttt{DataAccessor} instance for easing experiments.
+Additionally, the package provides data loaders that import publicly available datasets such as MovieLens \cite{harper2015movielens}, Amazon Reviews \cite{ni2019justifying}, and the HetRec 2011 Last.FM\footnote{\url{https://www.last.fm/}} dataset \cite{Cantador:RecSys2011}, as well as a synthetic implicit feedback generator using a simple rule-based method demonstrated in \cite{Aharon2013}. These modules return a ready-to-use \texttt{DataAccessor} instance to ease experimentation.
diff --git a/paper/section/evaluation.tex b/paper/section/evaluation.tex
index c5fd3bf..8b63064 100644
--- a/paper/section/evaluation.tex
+++ b/paper/section/evaluation.tex
@@ -1,25 +1,29 @@
-One of the notable characteristics of \texttt{Recommendation.jl} is a diverse set of evaluation metrics, including not only the standard accuracy metrics but fairness metrics such as diversity and serendipity. Even though the idea of diverse or serendipitous recommendation is not new in the literature, the topic has rapidly gained traction in these days as the society realizes the importance of fairness in intelligent systems. This section highlights the high-level concept of these metrics and their implementation in Julia based on a common abstract type, \texttt{Matric}.
+One of the notable characteristics of \texttt{Recommendation.jl} is a diverse set of evaluation metrics, including not only the standard accuracy metrics but also fairness metrics such as diversity and serendipity. Even though the idea of diverse or serendipitous recommendations is not new in the literature, the topic has rapidly gained traction these days as society realizes the importance of ethical implications in intelligent systems \cite{milano2020recommender}. This section highlights the high-level concept of these metrics and their implementation in Julia based on a common abstract type, \texttt{Metric}.

\begin{lstlisting}[language = Julia]
abstract type Metric end
\end{lstlisting}

-For accuracy metrics, users can use the standard evaluation scheme, \texttt{cross\_validation} and \texttt{leave\_one\_out}, provided by the package. For instance, the following module runs \texttt{n\_folds} cross validation for a specific combination of recommender and ranking metric. Notice that a recommender is initialized with \texttt{recommender\_args} and runs top-k recommendation.
+For accuracy metrics, users can use the standard evaluation schemes, \texttt{cross\_validation} and \texttt{leave\_one\_out}, provided by the package. For instance, the following function runs \texttt{n\_folds} cross-validation for a specific combination of recommender and ranking metric. Notice that a recommender is initialized with \texttt{recommender\_args} for making a top-$k$ recommendation.

\begin{lstlisting}[language = Julia]
cross_validation(
-    n_folds::Integer,
-    metric::Type{<:RankingMetric},
-    topk::Integer,
-    recommender_type::Type{<:Recommender},
-    data::DataAccessor,
-    recommender_args...
+    n_folds::Integer,
+    metric::Metric,
+    topk::Integer,
+    recommender_type::Type{<:Recommender},
+    data::DataAccessor,
+    recommender_args...;
+    # control whether recommending the same item to
+    # the same user multiple times is allowed
+    allow_repeat=false
)
\end{lstlisting}
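For illustration, a hypothetical invocation would look like the following, assuming a ranking metric type such as \texttt{Recall} exported by the package and the \texttt{UserKNN} recommender introduced in \sect{algorithm}; the argument values are arbitrary:

\begin{lstlisting}[language = Julia]
# 5-fold cross-validation of Recall for top-10
# recommendation by user-based k-NN (30 neighbors)
cross_validation(
    5, Recall(), 10, UserKNN, data, 30)
\end{lstlisting}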
-It should be noted that evaluating recommender systems is not always the same as measuring the accuracy of machine learning-based prediction, and there is a separate research domain discussing about what an appropriate evaluation method is. In the open-source community, the Python-based \texttt{RecPack} package \cite{michiels2022recpack} takes this point into consideration and provides a dedicated layer called \texttt{Scenario}, which can be a future direction \texttt{Recommendation.jl} possibly aims for.
+It should be noted that evaluating recommender systems is not always the same as measuring the accuracy of machine learning-based prediction, and there is a separate research domain discussing what an appropriate evaluation method is. In the open-source community, the Python-based \texttt{RecPack} package \cite{michiels2022recpack} considers this point and provides a dedicated layer called \texttt{Scenario}, which suggests a future direction that \texttt{Recommendation.jl} could aim for.

\subsection{Rating Metrics}
+\label{sec:rating-metrics}

First and foremost, even though the community focuses more on implicit feedback-based ranking problems lately, rating prediction is still an important foundation in the field of recommender systems as the previous sections mentioned.

@@ -48,16 +52,16 @@ \subsection{Ranking Metrics}
 end
 \end{lstlisting}

-Although the interface is the same across the metrics, each of them has a different objective as part of its formulation. To review the differences with some intuition, let a target user $u \in \mathcal{U}$, set of all items $\mathcal{I}$, ordered set of top-$N$ recommended items $I_N(u) \subset \mathcal{I}$, and set of truth items $\mathcal{I}^+_u$.
+Although the interface is the same across the metrics, each of them has a different objective as part of its formulation. To review the differences with some intuition, let $u \in \mathcal{U}$ be a target user, $\mathcal{I}$ the set of all items, $I_k(u) \subset \mathcal{I}$ an ordered set of top-$k$ recommended items, and $\mathcal{I}^+_u$ the set of truth items.

-\subsubsection{Recall-at-$N$}
+\subsubsection{Recall-at-$k$}

-Recall-at-$N$ (Recall@$N$) indicates coverage of truth samples as a result of top-$N$ recommendation. The value is computed by the following equation:
+Recall-at-$k$ (Recall@$k$) indicates the coverage of truth samples as a result of top-$k$ recommendation. The value is computed by the following equation:
$$
-\mathrm{Recall@}N = \frac{|\mathcal{I}^+_u \cap I_N(u)|}{|\mathcal{I}^+_u|}.
+\mathrm{Recall@}k = \frac{|\mathcal{I}^+_u \cap I_k(u)|}{|\mathcal{I}^+_u|}.
$$

-Here, $|\mathcal{I}^+_u \cap I_N(u)|$ is the number of \textit{true positives} which can be simply computed by the following piece of code:
+Here, $|\mathcal{I}^+_u \cap I_k(u)|$ is the number of \textit{true positives}, which can be simply computed by the following piece of code:

\begin{lstlisting}[language = Julia]
function count_intersect(
@@ -67,27 +71,27 @@ \subsubsection{Recall-at-$N$}
 end
 \end{lstlisting}

-\subsubsection{Precision-at-$N$}
+\subsubsection{Precision-at-$k$}

-Unlike Recall@$N$, Precision-at-$N$ (Precision@$N$) evaluates correctness of a top-$N$ recommendation list $I_N(u)$ according to the portion of true positives in the list as:
+Unlike Recall@$k$, Precision-at-$k$ (Precision@$k$) evaluates the correctness of a top-$k$ recommendation list $I_k(u)$ according to the portion of true positives in the list as:
$$
-\mathrm{Precision@}N = \frac{|\mathcal{I}^+_u \cap \mathcal{I}_N(u)|}{|\mathcal{I}_N(u)|}.
+\mathrm{Precision@}k = \frac{|\mathcal{I}^+_u \cap I_k(u)|}{|I_k(u)|}.
$$
-In other words, Precision@$N$ means how much the recommendation list covers true pairs.
+In other words, Precision@$k$ measures what portion of the recommendation list consists of true positives.

\subsubsection{Mean Average Precision (MAP)}

-While the original Precision@$N$ provides a score for a fixed-length recommendation list $I_N(u)$, mean average precision (MAP) computes an average of the scores over all recommendation sizes from 1 to $|\mathcal{I}|$. MAP is formulated with an indicator function for $i_n$, the $n$-th item of $I(u)$, as:
+While the original Precision@$k$ provides a score for a fixed-length recommendation list $I_k(u)$, mean average precision (MAP) computes an average of the scores over all possible recommendation sizes from 1 to $|\mathcal{I}|$. MAP is formulated with an indicator function for $i_n$, the $n$-th item of $I(u)$, as:
\begin{equation*}
\mathrm{MAP} = \frac{1}{|\mathcal{I}^+_u|} \sum_{n = 1}^{|\mathcal{I}|} \mathrm{Precision@}n \cdot \mathds{1}_{\mathcal{I}^+_u}(i_n).
\end{equation*}
-It should be noticed that, MAP is not a simple mean of sum of Precision@$1$, Precision@$2$, $\dots$, Precision@$|\mathcal{I}|$, and higher-ranked true positives lead better MAP.
+It should be noticed that MAP is not a simple mean of Precision@$1$, Precision@$2$, $\dots$, Precision@$|\mathcal{I}|$, and higher-ranked true positives lead to a better MAP.

\subsubsection{Area under the ROC Curve (AUC)}

-ROC curve and area under the ROC curve (AUC) are generally used in evaluation of the classification problems, but these concepts can also be interpreted in a context of ranking problem. Basically, the AUC metric for ranking considers all possible pairs of truth and other items which are respectively denoted by $i^+ \in \mathcal{I}^+_u$ and $i^- \in \mathcal{I}^-_u$, and it expects that the ``best'' recommender completely ranks $i^+$ higher than $i^-$.
+The ROC curve and the area under the ROC curve (AUC) are generally used in the evaluation of classification problems, but these concepts can also be interpreted in the context of the ranking problem. The AUC metric for ranking considers all possible pairs of truth and other items, which are respectively denoted by $i^+ \in \mathcal{I}^+_u$ and $i^- \in \mathcal{I}^-_u$, and it expects that the ``best'' recommender completely ranks $i^+$ higher than $i^-$.

-AUC calculation keeps track the number of true positives at different rank in $\mathcal{I}$.
\subsubsection{Area under the ROC Curve (AUC)} -ROC curve and area under the ROC curve (AUC) are generally used in evaluation of the classification problems, but these concepts can also be interpreted in a context of ranking problem. Basically, the AUC metric for ranking considers all possible pairs of truth and other items which are respectively denoted by $i^+ \in \mathcal{I}^+_u$ and $i^- \in \mathcal{I}^-_u$, and it expects that the ``best'' recommender completely ranks $i^+$ higher than $i^-$. +ROC curve and area under the ROC curve (AUC) are generally used in the evaluation of classification problems, but these concepts can also be interpreted in the context of the ranking problem. The AUC metric for ranking considers all possible pairs of truth and other items which are respectively denoted by $i^+ \in \mathcal{I}^+_u$ and $i^- \in \mathcal{I}^-_u$, and it expects that the ``best'' recommender completely ranks $i^+$ higher than $i^-$. -AUC calculation keeps track the number of true positives at different rank in $\mathcal{I}$. In the implementation of \texttt{measure()}, the code adds the number of true positives which were ranked higher than the current non-truth sample to the accumulated count of correct pairs. Ultimately, an AUC score is computed as portion of the correct ordered $(i^+, i^-)$ pairs in the all possible combinations determined by $|\mathcal{I}^+_u| \times |\mathcal{I}^-_u|$ in set notation. +AUC calculation keeps track of the number of true positives at different ranks in $\mathcal{I}$. In the implementation of \texttt{measure()}, the code adds the number of true positives which were ranked higher than the current non-truth sample to the accumulated count of correct pairs. Ultimately, an AUC score is computed as the proportion of correctly ordered $(i^+, i^-)$ pairs among all $|\mathcal{I}^+_u| \times |\mathcal{I}^-_u|$ possible combinations.
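+This pair-counting procedure can be sketched as follows, again with illustrative names and under the assumption that \texttt{ranked} orders the entire item set:
+\begin{lstlisting}[language = Julia]
+function auc(truth::Set{Int}, ranked::Vector{Int})
+    tp, correct = 0, 0
+    for item in ranked
+        if item in truth
+            tp += 1
+        else
+            # every truth sample seen so far is
+            # ranked higher than this non-truth item
+            correct += tp
+        end
+    end
+    correct / (tp * (length(ranked) - tp))
+end
+\end{lstlisting}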
\subsubsection{Reciprocal Rank (RR)} @@ -98,19 +102,20 @@ \subsubsection{Reciprocal Rank (RR)} RR can be zero if and only if $\mathcal{I}^+_u$ is empty. \subsubsection{Mean Percentile Rank (MPR)} -Mean percentile rank (MPR) is a ranking metric based on $r_{i} \in [0, 100]$, the percentile-ranking of an item $i$ within the sorted list of all items for a user $u$. It can be formulated as: +Mean percentile rank (MPR) is a ranking metric based on $r_{i} \in [0, 100]$, the percentile ranking of an item $i$ within the sorted list of all items for a user $u$. It can be formulated as: \begin{equation*} \mathrm{MPR} = \frac{1}{|\mathcal{I}^+_u|} \sum_{i \in \mathcal{I}^+_u} r_{i}. \end{equation*} -$r_{i} = 0\%$ is the best value that means the truth item $i$ is ranked at the highest position in a recommendation list. On the other hand, $r_{i} = 100\%$ is the worst case that the item $i$ is at the lowest rank. +$r_{i} = 0\%$ is the best value, meaning that the truth item $i$ is ranked at the highest position in a recommendation list. On the other hand, $r_{i} = 100\%$ is the worst case, in which the item $i$ is placed at the lowest rank. -MPR internally considers not only top-$N$ recommended items also all of the non-recommended items, and it accumulates the percentile ranks for all true positives unlike MRR. So, the measure is suitable to estimate users' overall satisfaction for a recommender. Intuitively, $\mathrm{MPR} > 50\%$ should be worse than random ranking from a users' point of view. +MPR internally considers not only top-$k$ recommended items but also all of the non-recommended items, and it accumulates the percentile ranks for all true positives, unlike MRR. So, the measure is suitable for estimating users' overall satisfaction with a recommender. Intuitively, $\mathrm{MPR} > 50\%$ should be worse than random ranking from a user's point of view. \subsubsection{Normalized Discounted Cumulative Gain (NDCG)} -Like MPR, normalized discounted cumulative gain (NDCG) computes a score for $I(u)$ which places emphasis on higher-ranked true positives. +Like MPR, normalized discounted cumulative gain (NDCG) computes a score for $I(u)$ which emphasizes higher-ranked true positives. In addition to being a more rigorously formulated measure, the difference between NDCG and MPR is that NDCG allows us to specify an expected ranking within $\mathcal{I}^+_u$; that is, the metric can incorporate $\mathrm{rel}_n$, a relevance score which suggests how likely the $n$-th sample is to be ranked at the top of a recommendation list, and it directly corresponds to an expected ranking of the truth samples. \subsection{Aggregated Metrics} +\label{sec:aggregated-metrics} Aggregated metrics return a single score for an array of multiple top-$k$ recommendation lists as the following function signature illustrates. @@ -118,45 +123,46 @@ \subsection{Aggregated Metrics} abstract type AggregatedMetric <: Metric end function measure( metric::AggregatedMetric, - recommendations::AbstractVector{ - <:AbstractVector{<:Integer}}; - kwargs...) + recommendations:: + AbstractVector{<:AbstractVector{<:Integer}}; + topk::Union{Integer, Nothing}) end \end{lstlisting} -A comprehensive summary of these metrics are available in \cite{shani2011evaluating}, and Eq.~(20) and (21) on its page 26 provide the formulation of two metrics that are available in \texttt{Recommendation.jl} supports, Gini index and Shannon Entropy. Unlike calculating errors for every truth-prediction pair as we have seen in the previous sections, aggregating multiple recommendation lists gives a bird's eye view of how good a recommender system is as a whole. Thus, the metrics are useful to measure the global diversity of recommender's outputs. +A comprehensive summary of these metrics is available in \cite{shani2011evaluating}, and Equations~(20) and (21) on page 26 provide the formulation of two metrics that are available in \texttt{Recommendation.jl}, the Gini index and Shannon Entropy. Unlike calculating errors for every truth-prediction pair as we have seen in the previous sections, aggregating multiple recommendation lists gives a bird's eye view of how good a recommender system is as a whole. Thus, the metrics are useful for measuring the global diversity of the recommender's outputs. \subsubsection{Aggregated Diversity} -\texttt{AggregatedDiversity} calculates the number of distinct items recommended across all suers. A larger value indicates more diverse recommendation result overall. +\texttt{AggregatedDiversity} calculates the number of distinct items recommended across all users. A larger value indicates a more diverse recommendation result overall. -Let $\mathcal{U}$ and $\mathcal{I}$ be a set of users and items, respectively, and $L_N(u)$ a list of top-$N$ recommended items for a user $u$. Here, an aggregated diversity can be calculated as: +Let $\mathcal{U}$ and $\mathcal{I}$ be a set of users and items, respectively, and $L_k(u)$ a list of top-$k$ recommended items for a user $u$. Here, an aggregated diversity can be calculated as: \begin{equation*} -\left| \bigcup\limits_{u \in \mathcal{U}} L_N(u) \right|. +\left| \bigcup\limits_{u \in \mathcal{U}} L_k(u) \right|. \end{equation*} Notably, the equation translates into a simple set operation in Julia.
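+For instance, given hypothetical per-user lists, \texttt{union} directly yields the aggregated diversity:
+\begin{lstlisting}[language = Julia]
+# one top-k list per user (illustrative data)
+recommendations = [[1, 2, 3], [2, 3, 4], [5, 1, 2]]
+# number of distinct items recommended overall
+length(union(recommendations...))  # 5
+\end{lstlisting}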
\subsubsection{Shannon Entropy} -If we focus more on individual items and how many users are recommended a particular item, the diversity of top-$N$ recommender can be defined by Shannon Entropy (\texttt{ShannonEntropy}): +If we focus more on individual items and how many users are recommended a particular item, the diversity of a top-$k$ recommender can be defined by Shannon Entropy (\texttt{ShannonEntropy}): \begin{align*} --\sum_{j = 1}^{|\mathcal{I}|} \Bigg( & \frac{\left|\{u \mid u \in \mathcal{U} \wedge i_j \in L_N(u) \}\right|}{N |\mathcal{U}|} \cdot \\ -& \ln \left( \frac{\left|\{u \mid u \in \mathcal{U} \wedge i_j \in L_N(u) \}\right|}{N |\mathcal{U}|} \right) \Bigg), +-\sum_{j = 1}^{|\mathcal{I}|} \Bigg( & \frac{\left|\{u \mid u \in \mathcal{U} \wedge i_j \in L_k(u) \}\right|}{k |\mathcal{U}|} \cdot \\ +& \ln \left( \frac{\left|\{u \mid u \in \mathcal{U} \wedge i_j \in L_k(u) \}\right|}{k |\mathcal{U}|} \right) \Bigg), \end{align*} -where $i_j$ denotes $j$-th item in the available item set $\mathcal{I}$. +where $i_j$ denotes the $j$-th item in the available item set $\mathcal{I}$. The ``worst'' entropy is zero when a single item is always recommended. \subsubsection{Gini Index} -Gini Index, which is normally used to measure a degree of inequality in a distribution of income, can be applied to assess diversity in the context of top-$N$ recommendation: +The Gini Index, which is normally used to measure a degree of inequality in the distribution of income, can also be applied to assess diversity in the context of top-$k$ recommendation: \begin{equation*} -\frac{1}{|\mathcal{I}| - 1} \sum_{j = 1}^{|\mathcal{I}|} \left( (2j - |\mathcal{I}| - 1) \cdot \frac{\left|\{u \mid u \in \mathcal{U} \wedge i_j \in L_N(u) \}\right|}{N |\mathcal{U}|} \right). +\frac{1}{|\mathcal{I}| - 1} \sum_{j = 1}^{|\mathcal{I}|} \left( (2j - |\mathcal{I}| - 1) \cdot \frac{\left|\{u \mid u \in \mathcal{U} \wedge i_j \in L_k(u) \}\right|}{k |\mathcal{U}|} \right). \end{equation*} -\texttt{measure(metric::GiniIndex, recommendations, topk)} is 0 when all items are equally chosen in terms of the number of recommended users. +\texttt{measure(metric::GiniIndex, recommendations, topk)} returns 0 when all items are equally chosen (``best''), and 1 when a single item is always chosen.
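+Both scores can be derived from the per-item recommendation ratios, as the following sketch with made-up inputs illustrates; following the standard Gini computation, items are sorted in ascending order of their ratios before the weighted sum is applied:
+\begin{lstlisting}[language = Julia]
+recommendations = [[1, 2], [1, 3], [1, 2]]
+n_items = 4  # |I|: size of the item catalog
+k, n_users = 2, length(recommendations)
+
+# how often each item appears in a top-k list
+counts = zeros(Int, n_items)
+for list in recommendations, i in list
+    counts[i] += 1
+end
+p = counts ./ (k * n_users)
+
+entropy = -sum(x * log(x) for x in p if x > 0)
+
+q = sort(p)  # ascending ratios
+gini = sum((2j - n_items - 1) * q[j]
+           for j in 1:n_items) / (n_items - 1)
+\end{lstlisting}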
\subsection{Intra-List Metrics} +\label{sec:intra-list-metrics} -Given a list of recommended items (for a single user), intra-list metrics quantifies the quality of the recommendation list from a non-accuracy perspective. Kotkov et~al. \cite{kotkov2016survey} highlighted the foundation of these metrics, and \texttt{Recommendation.jl} implements four of them: \texttt{Coverage}, \texttt{Novelty}, \texttt{IntraListSimilarity}, and \texttt{Serendipity} under the following schema. +Given a list of recommended items (for a single user), intra-list metrics quantify the quality of the recommendation list from a non-accuracy perspective. Kotkov et~al. \cite{kotkov2016survey} highlighted the foundation of these metrics, and \texttt{Recommendation.jl} implements four of them: \texttt{Coverage}, \texttt{Novelty}, \texttt{IntraListSimilarity}, and \texttt{Serendipity} under the following schema. \begin{lstlisting}[language = Julia] abstract type IntraListMetric <: Metric end @@ -168,7 +174,7 @@ \subsection{Intra-List Metrics} end \end{lstlisting} -Notice that standardizing an interface for the quality measures is not straightforward because the definition of ``quality'' is ambiguous. Hence, a list of \texttt{recommendations} can be given either as a set or array (vector) depending on whether the uniqueness of items in the list matters, for example. Meanwhile, \texttt{kwargs...} differ a lot depending on a choice of metric. +Notice that standardizing an interface for the quality measures is not straightforward because the definition of ``quality'' is ambiguous. Hence, a list of \texttt{recommendations} can be given either as a set or array (vector) depending on whether the uniqueness of items in the list matters, for example. Meanwhile, \texttt{kwargs...} differ depending on a choice of metric. \subsubsection{Coverage} @@ -182,7 +188,7 @@ \subsubsection{Coverage} ) \end{lstlisting} -The set operation could leverage \texttt{count\_intersect()} \sect{ranking-metrics} highlighted. +A larger coverage can indicate that a recommender is less likely to be biased toward a limited set of items. The set operation could leverage \texttt{count\_intersect()}, which \sect{ranking-metrics} highlighted. \subsubsection{Novelty} @@ -196,9 +202,11 @@ \subsubsection{Novelty} ) \end{lstlisting} +The metric quantifies the recommender's capability to surface unseen items, which helps users discover unexpected items. + \subsubsection{Intra-List Similarity} -Ziegler et~al. \cite{ziegler2005improving} demonstrated a metric that computes a sum of similarities between every pairs of recommended items. A larger value represents less diversity. +Ziegler et~al. \cite{ziegler2005improving} demonstrated a metric that computes a sum of similarities between every pair of recommended items. A larger value represents less diversity. \begin{lstlisting}[language = Julia] struct IntraListSimilarity <: IntraListMetric end @@ -223,4 +231,4 @@ \subsubsection{Serendipity} ) \end{lstlisting} -It should be noticed that quantifying relevance and unexpectedness is another task we must undergo before calculating the metric, and the results must be largely affected by how these factors are calculated. +It should be noticed that we must first quantify \texttt{relevance} and \texttt{unexpectedness} before calculating the metric, and the results can be largely affected by how these factors are calculated.
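+As a usage illustration, the following hypothetical calls evaluate a single recommendation list with the \texttt{catalog} and \texttt{observed} keyword arguments; all values are made up for demonstration:
+\begin{lstlisting}[language = Julia]
+rec = Set([1, 2, 3])  # one user's top-3 items
+# portion of a 100-item catalog covered by rec
+measure(Coverage(), rec, catalog=Set(1:100))
+# novelty with respect to already-seen items
+measure(Novelty(), rec, observed=Set([2, 9]))
+\end{lstlisting}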
diff --git a/paper/section/experiment.tex b/paper/section/experiment.tex new file mode 100644 index 0000000..7f6db46 --- /dev/null +++ b/paper/section/experiment.tex @@ -0,0 +1,49 @@ +So far, this paper has introduced various recommendation techniques and metrics implemented in \texttt{Recommendation.jl}. This section finally evaluates the recommenders on different metrics. Since the purpose of the following experiment is to demonstrate the capability of \texttt{Recommendation.jl} and to discuss trade-offs among different metrics, we test only on the minimal MovieLens 100k dataset \cite{harper2015movielens} and use the \texttt{SVD} recommender (\sect{svd}) as a model-based advanced option, which requires the simplest set of hyperparameters, along with multiple baselines. However, developers can easily evaluate larger datasets with more complex models in the same way as we describe below. + +We conducted a 5-fold cross-validation of top-10 recommendations on the 100,000 user-item-rating pairs by randomly splitting the data into five distinct sets. For each trial, we call \texttt{fit!()} on four-fifths of them (80\% of the samples) and then run top-10 \texttt{recommend()} for every user. Ultimately, the resulting recommendations, as well as predicted ratings, are compared with the ones observed in the remaining 20\% of the samples for validation.\footnote{A complete Julia script used for the experiment can be found at \url{https://github.com/takuti/Recommendation.jl/blob/v1.0.0/examples/benchmark.jl}.} + +\begin{lstlisting}[language = Julia] +n_folds = 5 +topk = 10 +data = load_movielens_100k() +cross_validation( + n_folds, metric, topk, + recommender, data, params...) +\end{lstlisting} + +\tab{results} summarizes the results obtained from each recommender-metric pair. On the one hand, model-based SVD recommenders showed higher accuracy than the baselines in terms of both rating and ranking metrics. In particular, as the accuracy changes by $k$ for $\mathrm{SVD}_k$, we see $k = 16$ can be an optimal hyperparameter for the recommender. On the other hand, aggregated and intra-list metrics do not yield the same conclusion; since larger $k$ gives a closer approximation to real-world diverse user-item behaviors, $\mathrm{SVD}_{32}$ shows the highest aggregated diversity and Shannon entropy. These observations demonstrate the trade-off between accuracy and non-accuracy metrics as \fig{tradeoff} depicts. + +\begin{figure}[htbp] + \centering + \includegraphics[width=1.0\linewidth]{images/tradeoff.pdf} + \caption{$F_1$ score (accuracy metric calculated by $2 \frac{\mathrm{recall} \cdot \mathrm{precision}}{\mathrm{recall} + \mathrm{precision}}$) and aggregated diversity (non-accuracy metric) for $\mathrm{SVD}_k$ recommenders, based on the numbers in \tab{results}. The accuracy graph shows that an optimal $k$ is $16$, where the $F_1$ score is maximized, whereas diversity monotonically increases as $k$ gets larger. Best baseline metrics are illustrated as dashed lines for reference.} + \label{fig:tradeoff} +\end{figure}
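+For instance, plugging the $\mathrm{SVD}_{16}$ recall and precision from \tab{results} into this $F_1$ formula gives $2 \cdot \frac{0.228 \cdot 0.353}{0.228 + 0.353} \approx 0.277$.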
+\begin{table*}[] + \centering + \tbl{Results from 5-fold cross-validation of top-10 recommendation conducted on MovieLens 100k user-item-rating pairs. Numbers are rounded to 3 decimal places, and those in the bold font indicate the ``best'' values for each metric. Rating metrics for \texttt{MostPopular} are not calculated because the recommender does not explicitly predict ratings.}{ + \begin{tabular}{|cl||r|r|r|r|r|r|r|} + \hline + \multicolumn{2}{|c||}{} & \texttt{ItemMean} & \texttt{UserMean} & \texttt{MostPopular} & \texttt{SVD(4)} & \texttt{SVD(8)} & \texttt{SVD(16)} & \texttt{SVD(32)} \\ \hline \hline + \multicolumn{1}{|c|}{\multirow{2}{*}{\begin{tabular}[c]{@{}c@{}}Rating\\ (\sect{rating-metrics})\end{tabular}}} & \texttt{RMSE } & 0.642 & 0.681 & - & 0.545 & \textbf{0.524} & \textbf{0.524} & 0.550 \\ \cline{2-9} + \multicolumn{1}{|c|}{} & \texttt{MAE } & 0.603 & 0.642 & - & 0.493 & 0.471 & \textbf{0.470} & 0.496 \\ \hline \hline + \multicolumn{1}{|c|}{\multirow{6}{*}{\begin{tabular}[c]{@{}c@{}}Ranking\\ (\sect{ranking-metrics})\end{tabular}}} & \texttt{Recall } & 0.108 & 0.002 & 0.114 & 0.182 & 0.212 & \textbf{0.228} & 0.218 \\ \cline{2-9} + \multicolumn{1}{|c|}{} & \texttt{Precision } & 0.185 & 0.004 & 0.189 & 0.297 & 0.335 & \textbf{0.353} & 0.328 \\ \cline{2-9} + \multicolumn{1}{|c|}{} & \texttt{AUC } & 0.417 & 0.018 & 0.429 & 0.531 & 0.558 & \textbf{0.579} & 0.571 \\ \cline{2-9} + \multicolumn{1}{|c|}{} & \texttt{ReciprocalRank } & 0.415 & 0.011 & 0.409 & 0.583 & 0.642 & \textbf{0.670} & 0.645 \\ \cline{2-9} + \multicolumn{1}{|c|}{} & \texttt{MPR } & 84.671 & 89.784 & 84.021 & 80.192 & 78.431 & \textbf{77.417} & 78.023 \\ \cline{2-9} + \multicolumn{1}{|c|}{} & \texttt{NDCG } & 0.201 & 0.004 & 0.203 & 0.327 & 0.371 & \textbf{0.392} & 0.365 \\ \hline \hline + \multicolumn{1}{|c|}{\multirow{3}{*}{\begin{tabular}[c]{@{}c@{}}Aggregated\\ (\sect{aggregated-metrics})\end{tabular}}} & \texttt{AggregatedDiversity} & 52.2 & 145.0 & 52.4 & 163.8 & 253.0 & 328.4 & \textbf{403.4} \\ \cline{2-9} + \multicolumn{1}{|c|}{} & \texttt{ShannonEntropy } & 3.149 & 4.170 & 3.160 & 4.486 & 4.847 & 5.138 & \textbf{5.386} \\ \cline{2-9} + \multicolumn{1}{|c|}{} & \texttt{GiniIndex } & 0.662 & 0.669 & 0.658 & \textbf{0.597} & 0.629 & 0.616 & 0.599 \\ \hline \hline + \multicolumn{1}{|c|}{\multirow{2}{*}{\begin{tabular}[c]{@{}c@{}}Intra-list\\ (\sect{intra-list-metrics})\end{tabular}}} & \texttt{Coverage } & 0.006 & 0.006 & 0.006 & 0.006 & 0.006 & 0.006 & 0.006 \\ \cline{2-9} + \multicolumn{1}{|c|}{} & \texttt{Novelty } & 8.998 & \textbf{9.944} & 8.970 & 8.763 & 8.751 & 8.991 & 9.424 \\ \hline + \end{tabular} + } + \label{tab:results} +\end{table*} + +Meanwhile, the rule-based \texttt{UserMean} recommender, which simply scores items by a mean rating per user, was the best in terms of novelty, demonstrating a higher ability to surface unseen items at the top. In combination with the trade-off discussion above, the results tell us that focusing only on a single metric can easily confuse developers and mislead the users of recommender systems. Therefore, it is crucial to holistically assess the systems from multiple perspectives, and the design principle of \texttt{Recommendation.jl} follows this point, as we explained in \sect{introduction}. + +It should be noticed that, as \texttt{kwargs...} in \sect{intra-list-metrics} indicate, evaluation with intra-list metrics is not straightforward due to the need for specifying additional arguments to set up a scenario.
For the sake of simplicity, this section assumes \texttt{catalog} for \texttt{Coverage} is a set of all items available in the dataset, and \texttt{observed} for \texttt{Novelty} is a set of items in the target user's training samples, allowing the recommenders to recommend the same items in a training set to the same user. Thus, \texttt{Coverage} in \tab{results} is the same across the recommenders because we always recommend 10 items per user from the fixed set of all items. Moreover, we did not evaluate \texttt{IntraListSimilarity} and \texttt{Serendipity} because there is no obvious way to define item-item similarities, relevance, and unexpectedness; the choices depend largely on the developer's hypotheses and objectives, which this paper does not discuss in detail.