
[Word2Vec] Distributed Representations of Words and Phrases and their Compositionality

2022. 8. 11. 00:10


Word2Vec์˜ Skip-gram ๋ชจ๋ธ

 


 

 

 

๐Ÿ’ก Key Concepts Before Diving In

  • Distributed Representation
    • Based on the distributional hypothesis that "words appearing in similar positions have similar meanings," a word's vector representation is determined by the distribution of its surrounding words, which is why it is called a distributed representation
    • cf) One-hot Encoding
      • Turns categorical variables into vectors
      • ex) [1 0 0 0], [0 1 0 0], [0 0 1 0], [0 0 0 1]
      • Drawbacks: the cosine similarity between any two one-hot vectors is 0 => relationships between words cannot be captured, and the dimensionality becomes very large
    • Embedding
      • Resolves the drawbacks of one-hot encoding -> relationships between words can be captured
      • Represents each word as a vector of a fixed dimensionality => turns something that is not a vector into a fixed-length vector
      • Each element of the vector takes a continuous value -> ex) [0.04227, -0.0033, 0.1607, -0.0236, ...] (see the sketch below)
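To make the contrast concrete, here is a minimal sketch (NumPy assumed; the toy vocabulary and values are made up) of a one-hot vector versus a dense embedding lookup:

```python
import numpy as np

vocab = ["king", "queen", "man", "woman"]            # toy vocabulary
word_to_idx = {w: i for i, w in enumerate(vocab)}

# One-hot encoding: dimension = vocabulary size, every pair of distinct words is orthogonal
one_hot = np.eye(len(vocab))
print(one_hot[word_to_idx["king"]])                  # [1. 0. 0. 0.]

# Embedding: fixed, small dimension with continuous values (random stand-ins here)
emb_dim = 3
embedding_matrix = np.random.randn(len(vocab), emb_dim)
print(embedding_matrix[word_to_idx["king"]])         # e.g. [ 0.042 -0.003  0.161]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# One-hot vectors of different words always have cosine similarity 0,
# while embeddings can express graded similarity between words.
print(cosine(one_hot[0], one_hot[1]))                # 0.0
print(cosine(embedding_matrix[0], embedding_matrix[1]))
```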

 

  • Word2Vec
    • A method for representing words as vectors
    • The most widely used embedding method
    • Uses the relationship between a word and the two words on either side of it (window size = 2)
    • -> Reflects the distributional hypothesis well
    • The representation of the target word being vectorized is determined by the words surrounding it
    • CBoW (Continuous Bag-of-Words)
      • Predicts the center word from the surrounding words
    • Skip-gram
      • Predicts the surrounding words from the center word (pair generation is sketched below)
      • Performs better than CBoW -> each word receives more training signal during backpropagation => the weights end up with more meaningful values
      • Drawback: more computation -> more resources (= higher cost)

 

 

 

Abstract

  • ๋…ผ๋ฌธ์—์„œ๋Š” Skip-gram ๋ชจ๋ธ์— ๋Œ€ํ•ด ๋ฒกํ„ฐ์˜ ํ’ˆ์งˆ๊ณผ ํ•™์Šต ์†๋„๋ฅผ ์ฆ์ง„์‹œํ‚จ ๋ช‡ ๊ฐ€์ง€ extension๋“ค์„ ์ œ์‹œ
    • subsampling
    • negative sampling
  • ์ž์ฃผ ๋“ฑ์žฅํ•˜๋Š” ๋‹จ์–ด๋ฅผ subsamplingํ•จ์œผ๋กœ์จ ๋ˆˆ์— ๋„๊ฒŒ ๋นจ๋ผ์ง„ ํ•™์Šต ์†๋„์™€ regular word representations๋ฅผ ๋” ๋งŽ์ด ํ•™์Šตํ•  ์ˆ˜ ์žˆ์—ˆ์Œ
  • hierarchical softmax(๊ณ„์ธต์  ์†Œํ”„ํŠธ๋งฅ์Šค)์˜ ๊ฐ„๋‹จํ•œ ๋Œ€์•ˆ์ธ negative sampling ์ œ์‹œ
  • word representation์˜ ํ•œ๊ณ„๋Š” ๋‹จ์–ด ์ˆœ์„œ์— ๋Œ€ํ•œ ๋ฌด๊ด€์‹ฌ๊ณผ ๊ด€์šฉ๊ตฌ(idiomatic phrase)๋ฅผ ํ‘œํ˜„ํ•  ์ˆ˜ ์—†๋‹ค๋Š” ๊ฒƒ
    • ex) "Canada"์™€ "Air"์˜ ์˜๋ฏธ๋ฅผ ๊ฒฐํ•ฉํ•ด์„œ "Air Canada"๋ผ๋Š” ๋‹จ์–ด๋ฅผ ์‰ฝ๊ฒŒ ์–ป์ง€ ๋ชปํ•จ
  • ์ด ์˜ˆ์‹œ์—์„œ ์ฐฉ์•ˆํ•ด ๋…ผ๋ฌธ์—์„œ๋Š” ํ…์ŠคํŠธ์—์„œ ๊ตฌ(phrase)๋ฅผ ์ฐพ๋Š” ๊ฐ„๋‹จํ•œ ๋ฐฉ๋ฒ•์„ ์ œ์‹œํ•˜๊ณ , ์ˆ˜๋ฐฑ๋งŒ ๊ฐœ์˜ ๊ตฌ์— ๋Œ€ํ•œ ์ข‹์€ ๋ฒกํ„ฐ ํ‘œํ˜„(vector representation)์„ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ด ๊ฐ€๋Šฅํ•˜๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์คŒ

 

 

 

Introduction

  • Skip-gram ๋ชจ๋ธ: ๋Œ€๋Ÿ‰์˜ ๋น„์ •ํ˜• ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ์—์„œ ๋‹จ์–ด์˜ ๊ณ ํ’ˆ์งˆ vector representations๋ฅผ ํ•™์Šตํ•˜๋Š” ํšจ์œจ์ ์ธ ๋ฐฉ๋ฒ•

Figure 1) Skip-gram model architecture

  • Skip-gram ๋ชจ๋ธ์˜ ํ›ˆ๋ จ ๋ชฉํ‘œ๋Š” ์ฃผ๋ณ€ ๋‹จ์–ด๋ฅผ ์ž˜ ์˜ˆ์ธกํ•˜๋Š” ๋‹จ์–ด ๋ฒกํ„ฐ ํ‘œํ˜„(word vector representations)์„ ํ•™์Šตํ•˜๋Š” ๊ฒƒ

 

  • Unlike most neural network architectures previously used for learning word vectors, training the Skip-gram model does not involve dense matrix multiplications
  • → This makes training extremely efficient (an optimized single-machine implementation can train on more than 100 billion words in a day)

 

  • ๋…ผ๋ฌธ์—์„œ๋Š” ์˜ค๋ฆฌ์ง€๋„ Skip-gram ๋ชจ๋ธ์— ๋Œ€ํ•œ ๋ช‡๊ฐ€์ง€ extension๋“ค์„ ์ œ์‹œ
  • ํ›ˆ๋ จ ์ค‘ ์ž์ฃผ ๋“ฑ์žฅํ•˜๋Š” ๋‹จ์–ด๋“ค์— ๋Œ€ํ•œ subsampling์„ ์‚ฌ์šฉํ•˜๋ฉด ์†๋„๊ฐ€ ํฌ๊ฒŒ ํ–ฅ์ƒ๋˜์—ˆ๊ณ  (์•ฝ 2๋ฐฐ์—์„œ 10๋ฐฐ๊นŒ์ง€) ๋œ ์ž์ฃผ ๋“ฑ์žฅํ•˜๋Š” ๋‹จ์–ด์— ๋Œ€ํ•œ representations์˜ ์ •ํ™•๋„๊ฐ€ ํ–ฅ์ƒ๋œ ๊ฒƒ์„ ํ™•์ธ
  • Skip-gram ๋ชจ๋ธ ํ›ˆ๋ จ์„ ์œ„ํ•œ Noise Contrastive Estimation (NCE)์˜ ๋‹จ์ˆœํ™”๋œ ๋ณ€ํ˜•์„ ์ œ์‹œ
  • -> ์ด์ „ ์ž‘์—…์— ์‚ฌ์šฉ๋œ ๋” ๋ณต์žกํ•œ hierarchical softmax์— ๋น„ํ•ด์„œ ์ž์ฃผ ๋“ฑ์žฅํ•˜๋Š” ๋‹จ์–ด๋“ค์— ๋Œ€ํ•ด ๋” ๋น ๋ฅธ ํ›ˆ๋ จ๊ณผ ๋” ๋‚˜์€ vector representation์ด ๊ฐ€๋Šฅํ•ด์ง
    • Noise Contrastive Estimation (NCE)
      • CBOW์™€ Skip-Gram ๋ชจ๋ธ์—์„œ ์‚ฌ์šฉํ•˜๋Š” ๋น„์šฉ ๊ณ„์‚ฐ ์•Œ๊ณ ๋ฆฌ์ฆ˜
      • ์ „์ฒด ๋ฐ์ดํ„ฐ์…‹์— ๋Œ€ํ•ด softmax ํ•จ์ˆ˜๋ฅผ ์ ์šฉํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ์ƒ˜ํ”Œ๋ง์œผ๋กœ ์ถ”์ถœํ•œ ์ผ๋ถ€์— ๋Œ€ํ•ด์„œ๋งŒ ์ ์šฉ
      • ๊ธฐ๋ณธ ์•Œ๊ณ ๋ฆฌ์ฆ˜: k๊ฐœ์˜ ๋Œ€๋น„๋˜๋Š” ๋‹จ์–ด๋“ค์„ noise distribution์—์„œ ๊ตฌํ•ด์„œ (๋ชฌํ…Œ์นด๋ฅผ๋กœ) ํ‰๊ท ์„ ๊ตฌํ•จ
      • Hierarchical softmax, Negative Sampling ๋“ฑ ์—ฌ๋Ÿฌ ๊ฐ€์ง€ ๋ฐฉ๋ฒ• ์กด์žฌ
      • ์ผ๋ฐ˜์ ์œผ๋กœ ๋‹จ์–ด์˜ ๊ฐœ์ˆ˜๊ฐ€ ๋งŽ์„ ๋•Œ ์‚ฌ์šฉ
      • NCE๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ๋ฌธ์ œ๋ฅผ <์‹ค์ œ ๋ถ„ํฌ์—์„œ ์–ป์€ ์ƒ˜ํ”Œ>๊ณผ <์ธ๊ณต์ ์œผ๋กœ ๋งŒ๋“  ์žก์Œ ๋ถ„ํฌ(noise distribution)์—์„œ ์–ป์€ ์ƒ˜ํ”Œ>์„ ๊ตฌ๋ณ„ํ•˜๋Š” ์ด์ง„ ๋ถ„๋ฅ˜ ๋ฌธ์ œ๋กœ ๋ฐ”๊ฟ€ ์ˆ˜ ์žˆ๊ฒŒ ๋จ
      • Negative Sampling์—์„œ ์‚ฌ์šฉํ•˜๋Š” ๋ชฉ์  ํ•จ์ˆ˜๋Š” ๊ฒฐ๊ณผ๊ฐ’์ด ์ตœ๋Œ€ํ™”๋  ์ˆ˜ ์žˆ๋Š” ํ˜•ํƒœ๋กœ ๊ตฌ์„ฑ
        • ํ˜„์žฌ(= ๋ชฉํ‘œ, target, positive) ๋‹จ์–ด์—๋Š” ๋†’์€ ํ™•๋ฅ ์„ ๋ถ€์—ฌ, ๋‚˜๋จธ์ง€ ๋‹จ์–ด(= negative, noise)์—๋Š” ๋‚ฎ์€ ํ™•๋ฅ ์„ ๋ถ€์—ฌํ•ด์„œ ๊ฐ€์žฅ ํฐ ๊ฐ’์„ ๋งŒ๋“ค ์ˆ˜ ์žˆ๋Š” ๊ณต์‹ ์‚ฌ์šฉ
        • ๊ณ„์‚ฐ ๋น„์šฉ์—์„œ ์ „์ฒด ๋‹จ์–ด V๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ์„ ํƒํ•œ k๊ฐœ์˜ noise ๋‹จ์–ด๋“ค๋งŒ ๊ณ„์‚ฐํ•˜๋ฉด ๋˜๊ธฐ ๋•Œ๋ฌธ์— ํšจ์œจ์ 
      • ํ…์„œํ”Œ๋กœ์šฐ -> tf.nn.nce_loss()์— ๊ตฌํ˜„
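Since tf.nn.nce_loss() is mentioned above, here is a minimal sketch of how it is typically wired up for a Skip-gram batch (TensorFlow 2 assumed; the vocabulary size, dimensions, and token ids are made-up values):

```python
import tensorflow as tf

vocab_size, emb_dim, num_sampled = 10000, 128, 64    # hypothetical sizes

embeddings  = tf.Variable(tf.random.uniform([vocab_size, emb_dim], -1.0, 1.0))
nce_weights = tf.Variable(tf.random.truncated_normal([vocab_size, emb_dim], stddev=0.05))
nce_biases  = tf.Variable(tf.zeros([vocab_size]))

center_ids  = tf.constant([12, 873])                           # toy batch of center-word ids
context_ids = tf.constant([[45], [991]], dtype=tf.int64)       # true context ids, shape [batch, 1]

inputs = tf.nn.embedding_lookup(embeddings, center_ids)
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights, biases=nce_biases,
                   labels=context_ids, inputs=inputs,
                   num_sampled=num_sampled, num_classes=vocab_size))
print(loss)   # minimize this inside a tf.GradientTape training step
```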

 

  • Word representations are limited in that they cannot represent idiomatic phrases that are not simple compositions of their individual words
    • ex) "Boston Globe" (a U.S. daily newspaper) refers to a newspaper; it is not a combination of the meanings of "Boston" and "Globe"
  • Therefore, using vectors to represent whole phrases makes the Skip-gram model considerably more expressive

 

  • Extending from a word-based model to a phrase-based model is relatively simple
    • 1) Identify a large number of phrases using a data-driven approach
    • 2) Treat each phrase as an individual token during training
  • To evaluate the quality of the phrase vectors, the authors developed a test set of analogical reasoning tasks that contains both words and phrases
    • Analogical Reasoning Task
      • Given a word pair such as "(Athens, Greece)" and another word such as "Oslo", the task is to produce the word that completes the same relationship
  • A typical analogy pair from the test set
    • "Montreal":"Montreal Canadiens"::"Toronto":"Toronto Maple Leafs"
  • An answer is counted as correct if the representation closest to vec("Montreal Canadiens") - vec("Montreal") + vec("Toronto") is vec("Toronto Maple Leafs") (see the sketch below)
  • The authors also find that simple vector addition can often produce meaningful results
    • ex) vec("Russia") + vec("river") is close to vec("Volga River")
    • vec("Germany") + vec("capital") is close to vec("Berlin")
  • This compositionality suggests that a non-obvious degree of language understanding can be obtained by applying basic mathematical operations to the word vector representations
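As a sketch of how such an analogy is scored with plain vector arithmetic (NumPy; random toy vectors stand in for trained embeddings, so only the mechanics are meaningful here):

```python
import numpy as np

def most_similar(query_vec, vectors, exclude=()):
    """Return the word whose vector has the highest cosine similarity to query_vec."""
    best, best_sim = None, -1.0
    for word, vec in vectors.items():
        if word in exclude:
            continue
        sim = vec @ query_vec / (np.linalg.norm(vec) * np.linalg.norm(query_vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

words = ["Montreal", "Montreal Canadiens", "Toronto", "Toronto Maple Leafs", "Boston", "hockey"]
vectors = {w: np.random.randn(50) for w in words}    # stand-ins for learned vectors

query = vectors["Montreal Canadiens"] - vectors["Montreal"] + vectors["Toronto"]
answer = most_similar(query, vectors, exclude={"Montreal Canadiens", "Montreal", "Toronto"})
print(answer)   # with actual Skip-gram vectors this would be "Toronto Maple Leafs"
```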

 

 

 

2.1 Hierarchical Softmax

  • full softmax์˜ ๊ณ„์‚ฐ์ ์œผ๋กœ ํšจ์œจ์ ์ธ ๊ทผ์‚ฌ์น˜๊ฐ€ hierarchical softmax
  • ์ด ๋ฐฉ๋ฒ•์˜ ์ฃผ๋œ ์ด์ ์€ ํ™•๋ฅ ๋ถ„ํฌ๋ฅผ ์–ป๊ธฐ ์œ„ํ•ด ์‹ ๊ฒฝ๋ง์—์„œ $W$(vocabulary ๋‚ด word์˜ ์ˆ˜)๊ฐœ์˜ output node๋ฅผ ํ‰๊ฐ€ํ•˜๋Š” ๋Œ€์‹  $log_2(W)$ nodes์— ๋Œ€ํ•ด์„œ๋งŒ ํ‰๊ฐ€ํ•œ๋‹ค๋Š” ๊ฒƒ
  • hierarchical softmax๋Š” ์ด์ง„ ํŠธ๋ฆฌ๋ฅผ ์ด์šฉํ•ด์„œ $W$์˜ output layer๋ฅผ ํ‘œํ˜„ํ•จ
  • ์ด๋•Œ ํŠธ๋ฆฌ์˜ ๊ฐ ๋…ธ๋“œ์˜ leaf๋Š” child node์˜ ํ™•๋ฅ ๊ณผ ๊ด€๋ จ๋จ
  • ์ด๋Š” ๋‹จ์–ด์˜ ์ž„์˜์˜ ํ™•๋ฅ ์„ ์ •์˜ํ•˜๊ฒŒ ํ•ด์คŒ
  • hierarchical softmax์—์„œ ์‚ฌ์šฉํ•˜๋Š” ํŠธ๋ฆฌ์˜ ๊ตฌ์กฐ๋Š” ์„ฑ๋Šฅ์— ์ƒ๋‹นํ•œ ์˜ํ–ฅ์„ ๋ฏธ์นจ
  • ๋…ผ๋ฌธ์—์„œ๋Š” binary Huffman tree๋ฅผ ์‚ฌ์šฉํ–ˆ๋Š”๋ฐ ์ž์ฃผ ๋“ฑ์žฅํ•˜๋Š” ๋‹จ์–ด์— short codes๋ฅผ ํ• ๋‹นํ•˜๊ธฐ ๋•Œ๋ฌธ์— ํ›ˆ๋ จ์„ ๋น ๋ฅด๊ฒŒ ์ง„ํ–‰ํ•  ์ˆ˜ ์žˆ์—ˆ์Œ

 

  • Hierarchical Softmax
    • A method that approximates the softmax while drastically reducing its computational cost
    • Used when training Word2Vec with the skip-gram method, as an alternative to negative sampling
    • Huffman tree
      • Word2Vec builds a Huffman tree whose leaves are all the words in the vocabulary
      • A Huffman tree is a binary tree in which the depth at which an item is placed depends on how often it appears in the data
      • In Word2Vec, frequent words are placed at shallow depths and rare words at deep depths (see the sketch below the figure)
      • The probabilities over all the leaves sum to 1, so they form a probability distribution that can be used just like an ordinary softmax

Huffman tree (source: https://uponthesky.tistory.com/15)
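A minimal sketch of the hierarchical-softmax computation itself (NumPy; the tree path and vectors are made-up stand-ins): the probability of an output word is a product of sigmoids at the inner nodes on its path, so only about $\log_2(W)$ node vectors are touched per prediction:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

emb_dim = 8
v_input = np.random.randn(emb_dim)              # vector of the input (center) word

# Root-to-leaf path of the output word in a (hypothetical) Huffman tree:
# each step holds an inner-node vector and a direction (+1 left child, -1 right child).
path = [(np.random.randn(emb_dim), +1),
        (np.random.randn(emb_dim), -1),
        (np.random.randn(emb_dim), +1)]         # frequent words get shorter paths

p = 1.0
for node_vec, direction in path:
    p *= sigmoid(direction * (node_vec @ v_input))   # sigma(+/- v_node . v_input) at each node

print(p)   # p(w_O | w_I); summing this over all leaves of the tree gives 1
```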

 

 

 

2.2 Negative Sampling

  • hierarchical softmax์˜ ๋Œ€์•ˆ์ด Noise Contrastive Estimation(NCE)
  • NCE๋Š” logistic regression(๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€)์˜ ํ‰๊ท ์„ ํ†ตํ•ด ๋…ธ์ด์ฆˆ์™€ ๋ฐ์ดํ„ฐ๋ฅผ ๊ตฌ๋ณ„ํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒŒ ์ข‹์€ ๋ชจ๋ธ์ด๋ผ๊ณ  ๊ฐ€์ •
  • NCE๋Š” softmax์˜ ๋กœ๊ทธํ™•๋ฅ ์„ ๊ทผ์‚ฌํ•˜๊ฒŒ(approximately) ์ตœ๋Œ€ํ™”ํ•˜ํ•˜๋ ค ํ•˜์ง€๋งŒ Skip-gram ๋ชจ๋ธ์€ ์˜ค์ง ๊ณ ํ’ˆ์งˆ vector representations๋ฅผ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœ ํ•จ
  • ๋”ฐ๋ผ์„œ ๋…ผ๋ฌธ์—์„œ๋Š” vector representations์˜ ํ’ˆ์งˆ์ด ์œ ์ง€๋˜๋Š” ํ•œ NCE๋ฅผ ๋‹จ์ˆœํ™”ํ•  ์ˆ˜ ์žˆ์—ˆ์Œ
  • Negative sampling๊ณผ NCE์˜ ์ฃผ์š” ์ฐจ์ด์ ์€ NCE๋Š” ์ƒ˜ํ”Œ๊ณผ ๋…ธ์ด์ฆˆ ๋ถ„ํฌ์˜ ์ˆ˜์น˜์  ํ™•๋ฅ ์ด ๋ชจ๋‘ ํ•„์š”ํ•˜์ง€๋งŒ negative sampling์€ ์ƒ˜ํ”Œ๋งŒ ์‚ฌ์šฉํ•œ๋‹ค๋Š” ๊ฒƒ
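For reference, the Negative Sampling (NEG) objective that the paper uses in place of every $\log p(w_O \mid w_I)$ term of the Skip-gram objective is

$\log \sigma({v'_{w_O}}^{\top} v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-{v'_{w_i}}^{\top} v_{w_I}) \right]$

where $v_{w_I}$ and $v'_{w_O}$ are the input and output vector representations, $k$ is the number of negative samples, and $P_n(w)$ is the noise distribution (the paper reports that the unigram distribution raised to the 3/4 power worked best).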

Figure 2) Two-dimensional PCA projection of the 1000-dimensional Skip-gram vectors of countries and their capital cities

 

  • ์œ„ ์ด๋ฏธ์ง€๋Š” ํ›ˆ๋ จ ์ค‘์— ์ˆ˜๋„๊ฐ€ ์˜๋ฏธํ•˜๋Š” ๋ฐ”์— ๋Œ€ํ•œ supervised information๋ฅผ ์ œ๊ณตํ•˜์ง€ ์•Š์•˜์Œ์—๋„ ๊ฐœ๋…์„ ์ž๋™์œผ๋กœ ๊ตฌ์„ฑํ•˜๊ณ  ๊ฐœ๋… ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ์•”๋ฌต์ ์œผ๋กœ ํ•™์Šตํ•œ ๋ชจ๋ธ์˜ ๋Šฅ๋ ฅ์„ ๋ณด์—ฌ์คŒ

 

 

  • Negative Sampling
    • In the final step of the Word2Vec model, the softmax function in the output layer turns a vocabulary-sized vector into values between 0 and 1 that sum to 1
    • The error is computed against this output, and backpropagation adjusts the embeddings of all words
    • This happens even for words that are completely unrelated to the center word or the context words
    • → When the vocabulary contains millions of words, this becomes an extremely heavy operation
    • To make this more efficient, negative sampling adjusts only a subset of the words in the vocabulary, rather than the entire vocabulary, when updating the embeddings
    • This subset consists of positive samples (words that appear around the center word) and negative samples (words that do not appear around the center word)
      • ⇒ All parameters related to the center word are updated, while only a handful of the unrelated parameters are sampled and updated (see the sketch below)
    • How many negative samples to draw varies with the setup; the paper suggests 5-20 for small datasets and 2-5 for large ones
    • The sampling is also designed so that words that are frequent in the corpus are more likely to be drawn
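A minimal sketch of one negative-sampling update (NumPy; the sizes, learning rate, and noise distribution are toy assumptions, not the paper's exact setup), showing that only the center word's input row and the k+1 sampled output rows are touched:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim, k, lr = 1000, 50, 5, 0.025
W_in  = rng.normal(0, 0.01, (vocab_size, emb_dim))   # input (center-word) embeddings
W_out = rng.normal(0, 0.01, (vocab_size, emb_dim))   # output (context-word) embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_step(center, positive, noise_probs):
    negatives = rng.choice(vocab_size, size=k, p=noise_probs)   # k noise words
    v = W_in[center]
    grad_v = np.zeros_like(v)
    for word, label in [(positive, 1.0)] + [(n, 0.0) for n in negatives]:
        score = sigmoid(W_out[word] @ v)
        g = lr * (label - score)          # gradient of log sigma(+/- u.v)
        grad_v += g * W_out[word]
        W_out[word] += g * v              # update only these k+1 output rows
    W_in[center] += grad_v                # update only the center word's input row

noise = rng.random(vocab_size) ** 0.75    # toy stand-in for the unigram^(3/4) noise distribution
noise /= noise.sum()
neg_sampling_step(center=12, positive=45, noise_probs=noise)
```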

 

 

 

2.3 Subsampling of Frequent Words

  • ํฐ ๊ทœ๋ชจ์˜ ๋ง๋ญ‰์น˜์—์„œ๋Š” ๊ฐ€์žฅ ์ž์ฃผ ๋“ฑ์žฅํ•˜๋Š” ๋‹จ์–ด๋“ค(ex. "in", "the", "a")์ด ์ˆ˜์–ต ๋ฒˆ ๋‚˜ํƒ€๋‚  ์ˆ˜ ์žˆ์ง€๋งŒ ์ผ๋ฐ˜์ ์œผ๋กœ ์ด๋Ÿฌํ•œ frequent words๋Š” rare words๋ณด๋‹ค ์ •๋ณด์˜ ๊ฐ€์น˜๊ฐ€ ์ ์Œ
  • ์˜ˆ๋ฅผ ๋“ค์–ด์„œ Skip-gram ๋ชจ๋ธ์€ “France”์™€ “the”์˜ ๋™์‹œ ๋ฐœ์ƒ์„ ๊ด€์ฐฐํ•˜๋Š” ๊ฒƒ๋ณด๋‹ค "France"์™€ "Paris"์˜ ๋™์‹œ ๋ฐœ์ƒ์„ ๊ด€์ฐฐํ•จ์œผ๋กœ์จ ๋” ๋งŽ์€ ์ด์ ์„ ์–ป์Œ -> ์™œ๋ƒ๋ฉด ๊ฑฐ์˜ ๋ชจ๋“  ๋‹จ์–ด๊ฐ€ ๋ฌธ์žฅ ๋‚ด์—์„œ “the”๋ž‘ ํ•จ๊ป˜ ๋‚˜ํƒ€๋‚˜๊ธฐ ๋•Œ๋ฌธ 
  • ์ž์ฃผ ๋“ฑ์žฅํ•˜๋Š” ๋‹จ์–ด์˜ vector representations๋Š” ์ˆ˜๋ฐฑ๋งŒ ๊ฐœ์˜ ์˜ˆ์ œ๋ฅผ ํ•™์Šตํ•œ ํ›„์—๋„ ํฌ๊ฒŒ ๋ณ€ํ•˜์ง€ ์•Š์Œ

 

  • ๋…ผ๋ฌธ์—์„œ๋Š” rare words์™€ frequent words ์‚ฌ์ด์˜ ๋ถˆ๊ท ํ˜•์— ๋Œ€์‘ํ•˜๊ธฐ ์œ„ํ•ด ๊ฐ„๋‹จํ•œ subsampling ์ ‘๊ทผ ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•จ
  • train์…‹์˜ ๊ฐ ๋‹จ์–ด $w_i$๋Š” ์•„๋ž˜์˜ ๊ณต์‹์— ์˜ํ•ด ๊ณ„์‚ฐ๋œ ํ™•๋ฅ ๋กœ ๋ฒ„๋ ค์ง

  • $P(w_i)$: the probability of discarding the word; it is higher for words with higher frequency $f(w_i)$
  • $f(w_i)$: the frequency of word $w_i$
  • $t$: a chosen threshold, typically around $10^{-5}$
  • This subsampling formula aggressively subsamples words whose frequency is greater than $t$ while preserving the ranking of the frequencies
  • It accelerates learning and significantly improves the accuracy of the learned vectors of rare words

 

 

  • Subsampling
    • A method that excludes frequently occurring words from training
    • Useful for removing stop words

                                             -> ์ˆ˜์‹์— ๋Œ€ํ•œ ์„ค๋ช… ์ฐธ๊ณ 

 

 

 

Empirical Results

  • The authors evaluate Hierarchical Softmax (HS), Noise Contrastive Estimation, Negative Sampling, and subsampling of the training words
  • The analogical reasoning task is used
  • The task consists of analogies such as "Germany" : "Berlin" :: "France" : ?
  • The analogies are solved by finding the vector x closest to vec("Berlin") - vec("Germany") + vec("France") according to cosine distance → the answer is correct if x is "Paris"

 

  • The task is divided into two categories
    • syntactic analogies → "quick" : "quickly" :: "slow" : "slowly"
    • semantic analogies → e.g., the country - capital city relationship

 

  • Skip-gram ๋ชจ๋ธ์„ ํ›ˆ๋ จํ•˜๊ธฐ ์œ„ํ•ด์„œ ๋‹ค์–‘ํ•œ ๋‰ด์Šค ๊ธฐ์‚ฌ๋กœ ๊ตฌ์„ฑ๋œ ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹(10์–ต ๊ฐœ์˜ ๋‹จ์–ด๊ฐ€ ํฌํ•จ๋œ Google ๋ฐ์ดํ„ฐ์…‹)์„ ์‚ฌ์šฉํ•จ
  • train ๋ฐ์ดํ„ฐ์—์„œ 5ํšŒ ๋ฏธ๋งŒ์œผ๋กœ ๋ฐœ์ƒํ•œ ๋ชจ๋“  ๋‹จ์–ด๋ฅผ vocabulary์—์„œ ์‚ญ์ œ → vocabulary ํฌ๊ธฐ: 69๋งŒ 2์ฒœ

 

Accuracy of various 300-dimensional Skip-gram models on the analogical reasoning task

  • NEG-$k$: Negative Sampling with $k$ negative samples for each positive sample
  • NCE: Noise Contrastive Estimation
  • HS-Huffman: Hierarchical Softmax + frequency-based Huffman codes
  • analogical reasoning task์—์„œ Negative Sampling์ด Hierarchical Softmax๋ณด๋‹ค ์„ฑ๋Šฅ์ด ๋›ฐ์–ด๋‚ฌ๊ณ , ์‹ฌ์ง€์–ด NCE๋ณด๋‹ค ์„ฑ๋Šฅ์ด ์‚ด์ง ๋” ๋†’์•˜์Œ
  • frequent words์— ๋Œ€ํ•œ subsampling์€ ํ•™์Šต ์†๋„๋ฅผ ๋ช‡ ๋ฐฐ ํ–ฅ์ƒ์‹œ์ผฐ๊ณ , word representations์„ ํ›จ์”ฌ ๋” ์ •ํ™•ํ•˜๊ฒŒ ๋งŒ๋“ค์—ˆ์Œ

 

  • skip-gram ๋ชจ๋ธ์˜ ์„ ํ˜•์„ฑ(linearity)์€ ๋ฒกํ„ฐ๋ฅผ linear analogical reasoning์— ๋” ์ ํ•ฉํ•˜๊ฒŒ ๋งŒ๋“ ๋‹ค๊ณ  ์ฃผ์žฅํ•  ์ˆ˜ ์žˆ์Œ
  • ํ•˜์ง€๋งŒ ์œ„ ๊ฒฐ๊ณผ๋Š” ๋งค์šฐ non-linearํ•œ standard sigmoidal RNN์— ์˜ํ•ด ํ•™์Šต๋œ ๋ฒกํ„ฐ๋“ค์ด ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์˜ ์–‘์ด ์ฆ๊ฐ€ํ•จ์— ๋”ฐ๋ผ ์ด task์— ๋Œ€ํ•œ ์„ฑ๋Šฅ์ด ํฌ๊ฒŒ ๊ฐœ์„ ๋˜์—ˆ์Œ์„ ๋ณด์—ฌ์คŒ
  • ์ด๋Š” non-linear ๋ชจ๋ธ๋„ word representations์˜ ์„ ํ˜• ๊ตฌ์กฐ(linear structure)๋ฅผ ์„ ํ˜ธํ•œ๋‹ค๋Š” ๊ฒƒ์„ ์‹œ์‚ฌํ•œ๋‹ค๊ณ  ํ•  ์ˆ˜ ์žˆ์Œ

 

 

 

Learning Phrases

  • ๊ตฌ(phrase)์˜ ์˜๋ฏธ๋Š” ๋‹จ์ˆœํžˆ ๊ฐœ๋ณ„ ๋‹จ์–ด๋“ค ์˜๋ฏธ์˜ ์กฐํ•ฉ์œผ๋กœ๋งŒ ์ด๋ฃจ์–ด์ง€์ง€ ์•Š์Œ
  • ๋…ผ๋ฌธ์—์„œ๋Š” ๊ตฌ์— ๋Œ€ํ•œ vector representation์„ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•ด ๋จผ์ € ํ•จ๊ป˜ ์ž์ฃผ ๋“ฑ์žฅํ•˜๊ณ  ๋‹ค๋ฅธ ๋งฅ๋ฝ์—์„œ๋Š” ๋“œ๋ฌผ๊ฒŒ ๋‚˜ํƒ€๋‚˜๋Š” ๋‹จ์–ด๋“ค์„ ์ฐพ์Œ
  • ex) "New York Times", "Toronto Maple Leafs"๋Š” train ๋ฐ์ดํ„ฐ์—์„œ ๊ณ ์œ ํ•œ ํ† ํฐ์œผ๋กœ ๋Œ€์ฒด๋˜์ง€๋งŒ bigram์ธ "this is"๋Š” ๋ณ€๊ฒฝ๋˜์ง€ ์•Š์€ ์ƒํƒœ๋กœ ์œ ์ง€
  • → ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด vocabulary ์‚ฌ์ด์ฆˆ๋ฅผ ํฌ๊ฒŒ ๋Š˜๋ฆฌ์ง€ ์•Š๊ณ ๋„ ํ•ฉ๋ฆฌ์ ์ธ ๊ตฌ(phrase)๋ฅผ ๋งŽ์ด ํ˜•์„ฑํ•  ์ˆ˜ ์žˆ์Œ

 

  • A simple data-driven approach is used in which phrases are formed based on unigram and bigram counts (see the scoring sketch below)
    • N-gram
      • A contiguous sequence of n words
      • The corpus is split into chunks of n words, and each chunk is treated as a single token
      • unigram: n = 1
      • bigram: n = 2
      • ex) "An adorable little boy is spreading smiles."
      • -> unigrams: an, adorable, little, boy, is, spreading, smiles
      • -> bigrams: an adorable, adorable little, little boy, boy is, is spreading, spreading smiles
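A minimal sketch of this data-driven phrase detection (the scoring formula score(wi, wj) = (count(wi wj) - delta) / (count(wi) * count(wj)) is the one from the paper; the tiny corpus, delta, and threshold are toy assumptions):

```python
from collections import Counter

corpus = [["new", "york", "times", "reported", "this", "is", "new"],
          ["the", "new", "york", "times", "is", "a", "newspaper"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams  = Counter(tuple(sent[i:i + 2]) for sent in corpus for i in range(len(sent) - 1))

delta, threshold = 1, 0.05        # delta discounts very rare pairs; both values are toy

def score(w1, w2):
    return (bigrams[(w1, w2)] - delta) / (unigrams[w1] * unigrams[w2])

# Bigrams scoring above the threshold would be merged into single tokens like "new_york"
phrases = {bg for bg in bigrams if score(*bg) > threshold}
print(phrases)   # {('new', 'york'), ('york', 'times')} while ('this', 'is') stays split
```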

 

  • ๋…ผ๋ฌธ์—์„œ๋Š” ์ผ๋ฐ˜์ ์œผ๋กœ ์ž„๊ณ„๊ฐ’(threshold value)์„ ๋‚ฎ์ถ”๋ฉด์„œ train ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด 2-4๋ฒˆ์˜ ํŒจ์Šค๋ฅผ ์‹คํ–‰ํ•ด ์—ฌ๋Ÿฌ ๋‹จ์–ด๋กœ ๊ตฌ์„ฑ๋œ ๋” ๊ธด ๊ตฌ๋ฌธ(phrase)์ด ํ˜•์„ฑ๋˜๋„๋ก ํ•จ
  • ๊ตฌ๋ฅผ ํฌํ•จํ•˜๋Š” ์ƒˆ๋กœ์šด analogical reasoning task๋ฅผ ์‚ฌ์šฉํ•ด์„œ phrase representations์˜ ํ’ˆ์งˆ์„ ํ‰๊ฐ€ํ•จ
  • ์•„๋ž˜์˜ ํ‘œ๋Š” ์ด task์— ์‚ฌ์šฉ๋œ analogies์˜ 5๊ฐ€์ง€ ์นดํ…Œ๊ณ ๋ฆฌ ์˜ˆ์‹œ๋ฅผ ๋ณด์—ฌ์คŒ

Examples of the analogical reasoning task for phrases

  • Goal: compute the fourth phrase from the first three
  • The best-performing model reaches 72% accuracy on this dataset

 

 

 

4.1 Phrase Skip-Gram Results

  • First, a phrase-based training corpus is constructed, and then several Skip-gram models are trained with different hyperparameters
  • Negative Sampling and Hierarchical Softmax are compared, with and without subsampling of frequent tokens

Accuracy of the Skip-gram models on the phrase analogy dataset

  • Negative Sampling์ด k = 5์—์„œ๋„ ์ƒ๋‹นํ•œ ์ •ํ™•๋„๋ฅผ ๋‹ฌ์„ฑํ–ˆ์ง€๋งŒ k = 15๋ฅผ ์‚ฌ์šฉํ–ˆ์„ ๋•Œ ํ›จ์”ฌ ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•จ
  • ๋†€๋ž๊ฒŒ๋„ Hierarchical Softmax๋Š” subsampling ์—†์ด ํ›ˆ๋ จ๋˜๋ฉด ๋‚ฎ์€ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ–ˆ์ง€๋งŒ, ์ž์ฃผ ๋“ฑ์žฅํ•œ ๋‹จ์–ด๋ฅผ ๋‹ค์šด์ƒ˜ํ”Œ๋งํ–ˆ์„ ๋•Œ ๊ฐ€์žฅ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋‚ด๋Š” ๋ฐฉ๋ฒ•์ด์—ˆ์Œ
  • → subsampling์ด ํ›ˆ๋ จ ์†๋„๋ฅผ ๋” ๋น ๋ฅด๊ฒŒ ํ•˜๊ณ , ์ •ํ™•๋„๋„ ํ–ฅ์ƒ์‹œํ‚จ๋‹ค๊ณ  ํ•  ์ˆ˜ ์žˆ์Œ

 

  • phrase analogy task์˜ ์ •ํ™•๋„๋ฅผ ๊ทน๋Œ€ํ™”ํ•˜๊ธฐ ์œ„ํ•ด์„œ ์•ฝ 330์–ต ๊ฐœ์˜ ๋‹จ์–ด ๋ฐ์ดํ„ฐ์…‹์„ ์‚ฌ์šฉํ•ด train ๋ฐ์ดํ„ฐ์˜ ์–‘์„ ๋Š˜๋ฆผ
  • hierarchical softmax, 1000 ์ฐจ์›, context์— ๋Œ€ํ•œ ์ „์ฒด ๋ฌธ์žฅ์„ ์‚ฌ์šฉํ•จ
  • → ์ด ๋ฐฉ๋ฒ•์œผ๋กœ ๋ชจ๋ธ์˜ ์ •ํ™•๋„ 72% ๋‹ฌ์„ฑ
  • train ๋ฐ์ดํ„ฐ์…‹์˜ ์‚ฌ์ด์ฆˆ๋ฅผ ์ค„์ด๋ฉด ์ •ํ™•๋„๊ฐ€ 66%๋กœ ๋–จ์–ด์กŒ์Œ
  • → ๋งŽ์€ ์–‘์˜ train ๋ฐ์ดํ„ฐ๊ฐ€ ์ค‘์š”ํ•˜๋‹ค

Comparison of two models on the entities closest to given short phrases

  • As in the earlier results, the model that combines hierarchical softmax with subsampling performs well

 

 

 

 

Additive Compositionality

  • The paper finds that Skip-gram representations exhibit another kind of linear structure that makes it possible to meaningfully combine words by element-wise addition of their vector representations
 


 

Vector compositionality using element-wise addition

  • ๊ฐ€์žฅ ์„ฑ๋Šฅ์ด ์ข‹์€ Skip-gram ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•ด ๋‘ ๋ฒกํ„ฐ์˜ ํ•ฉ์— ๊ฐ€์žฅ ๊ฐ€๊นŒ์šด 4๊ฐœ์˜ ํ† ํฐ์„ ํ‘œ์‹œํ•œ ๊ฒƒ

 

  • Because the word vectors are trained to predict the surrounding words in a sentence, a vector can be seen as representing the distribution of contexts in which the word appears
  • These values are logarithmically related to the probabilities computed by the output layer, so the sum of two word vectors is related to the product of the two context distributions
  • Here the product acts as an AND function
  • → Words that are assigned high probability by both word vectors receive high probability, and the other words receive low probability
  • ex) If "Volga River" frequently appears in the same sentences as the words "Russian" and "river", the sum of the word vectors of "Russian" and "river" yields a feature vector close to the vector of "Volga River"

 

 

 

Comparison to Published Word Representations

infrequent words์˜ ๊ฐ€์žฅ ๊ฐ€๊นŒ์šด ํ† ํฐ์— ๋Œ€ํ•œ ์ž˜ ์•Œ๋ ค์ง„ ๋ชจ๋ธ๋“ค๊ณผ 300์–ต ๊ฐœ ์ด์ƒ์˜ train ๋‹จ์–ด๋ฅผ ์‚ฌ์šฉํ•ด phrases์— ๋Œ€ํ•ด ํ•™์Šต๋œ Skip-gram ๋ชจ๋ธ ๋น„๊ต

  • An empty cell means the word is not in the model's vocabulary

 

  • ํ•™์Šต๋œ representations์˜ ํ’ˆ์งˆ ๋ฉด์—์„œ ๊ทœ๋ชจ๊ฐ€ ํฐ ๋ง๋ญ‰์น˜(corpus)์— ๋Œ€ํ•ด ํ•™์Šต๋œ ํฐ Skip-gram ๋ชจ๋ธ์ด ๋‹ค๋ฅธ ๋ชจ๋ธ๋“ค๋ณด๋‹ค ๋ˆˆ์— ๋„๊ฒŒ ์šฐ์ˆ˜ํ•จ
  • ๋˜ํ•œ, Skip-gram ๋ชจ๋ธ์ด train ๋ฐ์ดํ„ฐ์…‹์˜ ์–‘์ด ํ›จ์”ฌ ๋” ๋งŽ์ง€๋งŒ ํ•™์Šต ์‹œ๊ฐ„์€ ์ด์ „ ๋ชจ๋ธ๋“ค๋ณด๋‹ค ํ›จ์”ฌ ๋น ๋ฆ„

 

 

 

Conclusion

  • ๋…ผ๋ฌธ์—์„œ๋Š” Skip-gram ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•ด ๋‹จ์–ด์™€ ๊ตฌ(phrase)์˜ distributed representations์„ ํ›ˆ๋ จํ•˜๋Š” ๋ฐฉ๋ฒ•๊ณผ ์ด๋Ÿฌํ•œ representations๊ฐ€ ์ •ํ™•ํ•œ analogical reasoning(์œ ์ถ” ์ถ”๋ก )์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ์„ ํ˜• ๊ตฌ์กฐ๋ฅผ ๋ณด์ธ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์คŒ
  • ์ด ๊ธฐ์ˆ ์€ CBoW(continuous bag-of-words) ๋ชจ๋ธ์„ ํ›ˆ๋ จํ•˜๋Š” ๋ฐ์—๋„ ์‚ฌ์šฉ ๊ฐ€๋Šฅ
    • CBoW vs Skip-gram

Source: https://arxiv.org/pdf/1309.4168v1.pdf

 

  • ๊ณ„์‚ฐ์ ์œผ๋กœ ํšจ์œจ์ ์ธ ๋ชจ๋ธ ์•„ํ‚คํ…์ฒ˜ ๋•๋ถ„์— ์ด์ „ ๋ชจ๋ธ๋“ค๋ณด๋‹ค ๋ช‡ ๋ฐฐ ๋” ๋งŽ์€ ๋ฐ์ดํ„ฐ๋กœ ๋ชจ๋ธ์„ ์„ฑ๊ณต์ ์œผ๋กœ ํ›ˆ๋ จ์‹œํ‚ด
  • ๊ทธ ๊ฒฐ๊ณผ, ํŠนํžˆ rare entities์— ๋Œ€ํ•ด ํ•™์Šต๋œ word representations์™€ phrase representations์˜ ํ’ˆ์งˆ์ด ํฌ๊ฒŒ ํ–ฅ์ƒ๋จ

 

  • The authors also found that subsampling frequent words both speeds up training and substantially improves the representations of uncommon words
  • Another contribution of the paper is the Negative sampling algorithm, an extremely simple training method that learns accurate representations, especially for frequent words

 

  • ํ›ˆ๋ จ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์„ ํƒ๊ณผ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ์˜ ์„ ํƒ์€ task์— ๋”ฐ๋ผ ๊ฒฐ์ •๋˜๋Š” ๊ฒƒ
  • → ๋ฌธ์ œ๋งˆ๋‹ค ์ตœ์ ์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ๊ตฌ์„ฑ์ด ๋‹ค๋ฅด๊ธฐ ๋•Œ๋ฌธ
  • ๋…ผ๋ฌธ์—์„œ๋Š” ๋ชจ๋ธ ์•„ํ‚คํ…์ฒ˜์˜ ์„ ํƒ, ๋ฒกํ„ฐ์˜ ์‚ฌ์ด์ฆˆ, subsampling rate, training window์˜ ์‚ฌ์ด์ฆˆ๊ฐ€ ์„ฑ๋Šฅ์— ์˜ํ–ฅ์„ ๋ฏธ์นœ ๊ฐ€์žฅ ์ค‘์š”ํ•œ ์š”์†Œ๋“ค์ด์—ˆ์Œ

 

  • ๋…ผ๋ฌธ ๊ฒฐ๊ณผ์—์„œ ๋‹จ์–ด ๋ฒกํ„ฐ(word vectors)๊ฐ€ simple vector addition์„ ์‚ฌ์šฉํ•ด ๋‹ค์†Œ ์˜๋ฏธ ์žˆ๊ฒŒ ๊ฒฐํ•ฉ๋  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์ด ์•„์ฃผ ํฅ๋ฏธ๋กญ๋‹ค๊ณ  ํ•  ์ˆ˜ ์žˆ์Œ
  • ๋…ผ๋ฌธ์—์„œ ์ œ์‹œํ•˜๋Š” phrase representations๋ฅผ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•œ ๋˜ ๋‹ค๋ฅธ ์ ‘๊ทผ ๋ฐฉ์‹์€ ๋‹จ์ˆœํžˆ ํ•˜๋‚˜์˜ ํ† ํฐ์œผ๋กœ ๊ตฌ(phrase)๋ฅผ ํ‘œํ˜„ํ•˜๋Š” ๊ฒƒ
  • ์ด ๋‘ ๊ฐ€์ง€ ์ ‘๊ทผ ๋ฐฉ์‹์˜ ์กฐํ•ฉ์€ ๊ณ„์‚ฐ ๋ณต์žก์„ฑ์„ ์ตœ์†Œํ™”ํ•˜๋ฉด์„œ ํ…์ŠคํŠธ์˜ ๋” ๊ธด pieces๋ฅผ ํ‘œํ˜„ํ•˜๋Š” ๊ฐ•๋ ฅํ•˜๋ฉด์„œ๋„ ๊ฐ„๋‹จํ•œ ๋ฐฉ๋ฒ•์„ ์ œ๊ณตํ•จ
  • ๋”ฐ๋ผ์„œ ๋…ผ๋ฌธ์˜ ๋ชจ๋ธ์€ recursive matrix-vector operations๋ฅผ ์‚ฌ์šฉํ•ด ๊ตฌ(phrase)๋ฅผ ํ‘œํ˜„ํ•˜๋ ค๋Š” ๊ธฐ์กด ์ ‘๊ทผ ๋ฐฉ์‹์„ ๋ณด์™„ํ•˜๋Š” ๊ฒƒ์œผ๋กœ ๋ณผ ์ˆ˜ ์žˆ์Œ

 

 

 

์ฐธ๊ณ  ์ž๋ฃŒ 

 

https://pythonkim.tistory.com/92 - Word2Vec ๋ชจ๋ธ ๊ธฐ์ดˆ (1) - ๊ฐœ๋… ์ •๋ฆฌ

https://shuuki4.wordpress.com/2016/01/27/word2vec-%EA%B4%80%EB%A0%A8-%EC%9D%B4%EB%A1%A0-%EC%A0%95%EB%A6%AC/ - word2vec ๊ด€๋ จ ์ด๋ก  ์ •๋ฆฌ

https://uponthesky.tistory.com/15 - ๊ณ„์ธต์  ์†Œํ”„ํŠธ๋งฅ์Šค(Hierarchical Softmax, HS) in word2vec

https://wooono.tistory.com/244 - [DL] Word2Vec, CBOW, Skip-Gram, Negative Sampling

https://yngie-c.github.io/nlp/2020/05/28/nlp_word2vec/ - Word2Vec · Data Science

https://wikidocs.net/21692 - 3) N-gram ์–ธ์–ด ๋ชจ๋ธ(N-gram Language Model)

https://www.tutorialexample.com/element-wise-addition-explained-a-beginner-guide-machine-learning-tutorial/ - Element-wise Addition Explained - A Beginner Guide - Machine Learning Tutorial

 
