Learning What’s in a Name with Graphical Models

Abstract

I. Probabilistic Graphical Models

Factorizing Joint Distributions

p(a, b) = p(a) \cdot p(b|a)

The shorthand p(a) means p(A = a), that is, the probability of variable A taking value a.

p(a, b, c, d) = p(a) \cdot p(b|a) \cdot p(c) \cdot p(d|a, b, c)
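As a concrete illustration of this factorization, the sketch below multiplies the four factors together for binary variables. The probability tables are made-up toy values, not anything estimated from data.

```python
# Minimal sketch: computing a joint probability from its factors.
# The conditional probability tables below are made-up toy values.

p_a = {0: 0.7, 1: 0.3}                          # p(a)
p_b_given_a = {0: {0: 0.9, 1: 0.1},             # p(b|a), keyed by a then b
               1: {0: 0.4, 1: 0.6}}
p_c = {0: 0.5, 1: 0.5}                          # p(c)
p_d_given_abc = {(a, b, c): {0: 0.8, 1: 0.2}    # p(d|a,b,c), keyed by (a, b, c) then d
                 for a in (0, 1) for b in (0, 1) for c in (0, 1)}

def joint(a, b, c, d):
    """p(a, b, c, d) = p(a) * p(b|a) * p(c) * p(d|a, b, c)."""
    return p_a[a] * p_b_given_a[a][b] * p_c[c] * p_d_given_abc[(a, b, c)][d]

# Because each factor is a proper (conditional) distribution, the joint sums to 1.
total = sum(joint(a, b, c, d)
            for a in (0, 1) for b in (0, 1) for c in (0, 1) for d in (0, 1))
print(joint(1, 0, 1, 0))   # 0.3 * 0.4 * 0.5 * 0.8 = 0.048
print(round(total, 10))    # 1.0
```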

Directed Acyclic Graphs

II. Hidden Markov Models

The Hidden Layer

p(s_i | s_1, s_2,…, s_{i-1}) = p(s_i | s_{i-1}) \\ \footnotesize\textrm{for all $i\,{\scriptscriptstyle \in}\,\{2,…, N\}$}
p(s_i | s_{i-1}) = p(s_{i+1} | s_i) \\ \footnotesize\textrm{for all $i\,{\scriptscriptstyle \in}\,\{2,…, N-1\}$}

The Observed Layer

p(o_i | s_1, s_2,…, s_N) = p(o_i | s_i) \\ \footnotesize\textrm{for all $i\,{\scriptscriptstyle \in}\,\{1, 2,…, N\}$}
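Taken together, the transition assumption and the emission assumption mean the joint probability of a tag sequence and a sentence is just a product of one transition term and one emission term per word. The sketch below scores such a pair; the tags, words, and probabilities are toy values, not the model trained later in the article.

```python
# Minimal HMM sketch: joint probability of states (NER tags) and observations (words)
# under the Markov and emission-independence assumptions.
# All probability values here are made up for illustration.

start = {"O": 0.8, "PER": 0.2}                         # p(s_1)
trans = {"O": {"O": 0.9, "PER": 0.1},                  # p(s_i | s_{i-1})
         "PER": {"O": 0.6, "PER": 0.4}}
emit = {"O": {"Great": 0.01, "said": 0.02},            # p(o_i | s_i)
        "PER": {"Great": 0.001, "Buck": 0.003}}

def joint_prob(states, words):
    """p(s_1..s_N, o_1..o_N) = p(s_1) p(o_1|s_1) * prod_i p(s_i|s_{i-1}) p(o_i|s_i)."""
    # Unseen (state, word) pairs get a tiny floor probability; a simplification.
    prob = start[states[0]] * emit[states[0]].get(words[0], 1e-6)
    for prev, cur, word in zip(states, states[1:], words[1:]):
        prob *= trans[prev][cur] * emit[cur].get(word, 1e-6)
    return prob

print(joint_prob(["PER", "O"], ["Buck", "said"]))
```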

Representing Named Entities

Rather than labeling each node using the name of the variable it represents (X₁, Y₁) as we have until this point, we'll instead display the value of that variable (“O”, “Great”). This helps make the graphs easier to read.

Training

Inference
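For HMMs, the standard way to recover the most likely tag sequence given a sentence is Viterbi decoding (see the Viterbi references below). A minimal, self-contained sketch with toy probabilities, not the article's trained model:

```python
# Minimal Viterbi decoding sketch for an HMM: find the most probable state
# sequence for a given word sequence. Toy probabilities, not a trained model.

start = {"O": 0.8, "PER": 0.2}
trans = {"O": {"O": 0.9, "PER": 0.1}, "PER": {"O": 0.6, "PER": 0.4}}
emit = {"O": {"said": 0.02, "Buck": 0.0001}, "PER": {"said": 0.0001, "Buck": 0.003}}

def viterbi(words):
    states = list(start)
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (start[s] * emit[s].get(words[0], 1e-6), [s]) for s in states}
    for word in words[1:]:
        new_best = {}
        for s in states:
            prob, path = max(
                (best[p][0] * trans[p][s] * emit[s].get(word, 1e-6), best[p][1])
                for p in states
            )
            new_best[s] = (prob, path + [s])
        best = new_best
    return max(best.values())[1]

print(viterbi(["Buck", "said"]))  # -> ['PER', 'O'] with these toy numbers
```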

Results

Accuracy: 90.1%
Precision: 64.2%
Recall: 55.8%
F₁ Score: 59.7%
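As a sanity check, the headline numbers are consistent with F₁ being the harmonic mean of precision and recall:

F_1 = \frac{2 \cdot P \cdot R}{P + R} = \frac{2 \cdot 0.642 \cdot 0.558}{0.642 + 0.558} \approx 0.597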

Precision

Tag  | No OOV | 1+ OOV
ORG  | 0.8    | 0.21
PER  | 0.85   | 0.62
LOC  | 0.87   | 0.06
MISC | 0.78   | 0.12
ALL  | 0.84   | 0.39

Recall

Tag  | No OOV | 1+ OOV
ORG  | 0.64   | 0.33
PER  | 0.58   | 0.59
LOC  | 0.71   | 0.05
MISC | 0.54   | 0.06
ALL  | 0.64   | 0.41

Limitations

Tag  | Entity Length 1 | Entity Length 2 | Entity Length 3 | Entity Length 4 | Entity Length 5
ORG  | 0.61            | 0.68            | 0.3             | 0.12            | 0.28
PER  | 0.96            | 0.67            | 0.7             |                 |
LOC  | 0.88            | 0.36            |                 |                 |
MISC | 0.9             | 0.46            | 0.24            | 0.12            |
ALL  | 0.77            | 0.61            | 0.31            | 0.12            | 0.29

An empty cell means there were too few (fewer than five) relevant entities to report a reliable score. For example, in the precision tab, fewer than five entities that were five words long were predicted to be location names.

Entity Length | OOV 0 | OOV 0.2 | OOV 0.25 | OOV 0.33 | OOV 0.4 | OOV 0.5 | OOV 0.6 | OOV 0.66 | OOV 0.75 | OOV 0.8 | OOV 1
1             | 0.86  |         |          |          |         |         |         |          |          |         | 0.3
2             | 0.8   |         |          |          |         | 0.5     |         |          |          |         | 0.52
3             | 0.78  |         |          | 0.15     |         |         |         | 0.06     |          |         | 0
4             | 0.42  |         | 0.22     |          |         | 0       |         |          | 0        |         | 0
5             | 0.67  | 0.22    |          |          | 0       |         | 0.14    |          |          |         |

An empty cell means either it is impossible (e.g. a two-word entity can’t have an OOV rate of 0.25) or there were not enough (fewer than five) relevant entities to form a reliable score.

III. Maximum Entropy Markov Models

Discriminative Structure

Word Features

b_{\text{shape=Xxxx}}(o_t) = \begin{cases} \textrm{1 if $o_t$ has shape “Xxxx”} \\ \textrm{0 otherwise} \end{cases}

o_t      | b_shape=Xxxx(o_t)
Stately  | 0
,        | 0
plump    | 0
Buck     | 1
Mulligan | 0
came     | 0

f_a(o_t, s_t) = f_{\langle b, s \rangle}(o_t, s_t) = \begin{cases} \textrm{1 if $b(o_t) = 1$ and $s_t = s$} \\ \textrm{0 otherwise} \end{cases}
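A minimal sketch of how such features could be written in code. The shape test matches the table above (an uppercase letter followed by exactly three lowercase letters), and the paired feature f⟨b, s⟩ fires only when the word feature fires and the current tag matches. The "PER" pairing is an illustrative choice, not the article's exact feature set.

```python
import re

# Minimal sketch (illustrative, not the article's exact feature set).
# Word feature: does o_t have the literal shape "Xxxx", i.e. one uppercase
# letter followed by exactly three lowercase letters? (Buck -> 1, Stately -> 0)
def b_shape_Xxxx(o_t: str) -> int:
    return int(re.fullmatch(r"[A-Z][a-z]{3}", o_t) is not None)

# Paired feature f_<b, s>: fires only when the word feature fires AND the
# current state s_t equals a particular tag s.
def make_f(b, s):
    def f(o_t: str, s_t: str) -> int:
        return int(b(o_t) == 1 and s_t == s)
    return f

f_shape_PER = make_f(b_shape_Xxxx, "PER")   # pairing with "PER" is an illustrative choice
print([b_shape_Xxxx(w) for w in ["Stately", ",", "plump", "Buck", "Mulligan", "came"]])
# -> [0, 0, 0, 1, 0, 0], matching the table above
print(f_shape_PER("Buck", "PER"), f_shape_PER("Buck", "O"))  # -> 1 0
```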

State Transitions

p_{s\prime}(s|o) = \frac{1}{Z(o, s\prime)} \exp\left(\sum_{a}{\lambda_a \, f_a(o, s)}\right)
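A minimal sketch of this per-previous-state distribution: for a fixed previous state s′, the probability of each next state is a softmax over weighted sums of feature values. The feature functions and weights below are toy values, not parameters learned on the article's data.

```python
import math

# Minimal sketch of the MEMM transition distribution: for each previous state s',
# p_{s'}(s | o) is a softmax over weighted feature functions of (o, s).
# Feature functions and weights are made-up toy values.

STATES = ["O", "PER"]

def features(o: str, s: str):
    """Active feature values f_a(o, s), keyed by feature name."""
    return {
        "capitalized_and_PER": int(o[:1].isupper() and s == "PER"),
        "lowercase_and_O": int(o.islower() and s == "O"),
    }

# One weight vector (the lambdas) per previous state s'.
weights = {
    "O":   {"capitalized_and_PER": 1.5, "lowercase_and_O": 2.0},
    "PER": {"capitalized_and_PER": 0.5, "lowercase_and_O": 1.0},
}

def p_next_state(s_prev: str, o: str):
    """p_{s'}(s | o) = exp(sum_a lambda_a f_a(o, s)) / Z(o, s')."""
    lam = weights[s_prev]
    scores = {s: math.exp(sum(lam[a] * v for a, v in features(o, s).items()))
              for s in STATES}
    z = sum(scores.values())                      # Z(o, s')
    return {s: scores[s] / z for s in STATES}

print(p_next_state("O", "Buck"))   # capitalized observation pulls mass toward PER
print(p_next_state("O", "said"))   # lowercase observation pulls mass toward O
```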

Training & Inference

Results

Accuracy: 93.1%
Precision: 72.9%
Recall: 63.5%
F₁ Score: 67.9%

Precision

Tag  | No OOV | 1+ OOV
ORG  | 0.81   | 0.36
PER  | 0.82   | 0.8
LOC  | 0.82   | 0.17
MISC | 0.74   | 0.14
ALL  | 0.8    | 0.54

Recall

Tag  | No OOV | 1+ OOV
ORG  | 0.68   | 0.12
PER  | 0.72   | 0.57
LOC  | 0.89   | 0.29
MISC | 0.78   | 0.02
ALL  | 0.79   | 0.37

Advantage Over HMMs

Entity Length | OOV 0 | OOV 0.2 | OOV 0.25 | OOV 0.33 | OOV 0.4 | OOV 0.5 | OOV 0.6 | OOV 0.66 | OOV 0.75 | OOV 0.8 | OOV 1
1             | 0.83  |         |          |          |         |         |         |          |          |         | 0.34
2             | 0.76  |         |          |          |         | 0.72    |         |          |          |         | 0.55
3             | 0.6   |         |          | 0.56     |         |         |         | 0.59     |          |         | 0.5
4             | 0.16  |         | 0.2      |          |         | 0       |         |          |          |         |
5             | 0.6   | 0.5     |          |          |         |         |         |          |          |         |

An empty cell means either it is impossible (e.g. a two-word entity can’t have an OOV rate of 0.25) or there were not enough (fewer than five) relevant entities to form a reliable score.

Tag  | Entity Length 1 | Entity Length 2 | Entity Length 3 | Entity Length 4 | Entity Length 5
ORG  | 0.76            | 0.69            | 0.84            | 0.36            | 0.8
PER  | 0.59            | 0.91            | 0.66            | 0.25            |
LOC  | 0.8             | 0.33            | 0.35            | 0               |
MISC | 0.82            | 0.57            | 0.29            | 0.18            |
ALL  | 0.77            | 0.7             | 0.58            | 0.16            | 0.43

An empty cell means there were too few (fewer than five) relevant entities to report a reliable score. For example, in the precision tab, fewer than five entities that were five words long were predicted to be location names.

Label Bias Problem

\sum_s{p_{s\prime}(s|o)} = 1
p_{s\prime}(s|o) = \frac{1}{Z(o, s\prime)} \exp\left(\sum_{a}{\lambda_a \, f_a(o, s)}\right)
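A small numeric illustration of why this local normalization causes trouble (toy scores, not learned values): a previous state with several successors can shift probability mass in response to the observation, but a previous state with a single successor hands that successor probability 1 no matter what is observed.

```python
# Toy illustration of label bias. Per-state normalization forces each
# distribution p_{s'}(. | o) to sum to 1 over next states, so a previous
# state with a single successor passes along all its mass regardless of o.

def normalize(scores):
    z = sum(scores.values())
    return {s: v / z for s, v in scores.items()}

# Unnormalized scores exp(sum_a lambda_a f_a(o, s)) for two observations (toy numbers).
raw_from_A = {"likely_word": {"X": 5.0, "Y": 1.0},   # state A has two successors
              "rare_word":   {"X": 0.1, "Y": 1.0}}
raw_from_B = {"likely_word": {"X": 5.0},             # state B has only one successor
              "rare_word":   {"X": 0.0001}}

for o in ("likely_word", "rare_word"):
    print(o, normalize(raw_from_A[o]), normalize(raw_from_B[o]))
# From A the observation shifts mass between X and Y; from B, X always gets 1.0.
```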

IV. Conditional Random Fields

Markov Random Fields

\phi(X, Y) = \begin{cases} \textrm{3 if $X = 1$ and $Y = 1$} \\ \textrm{2 if $X = 0$ and $Y = 0$} \\ \textrm{1 otherwise} \end{cases}
p(A,B,C,D) = \frac{1}{Z}\,\phi(A,B)\,\phi(B,C)\,\phi(C,A) \\ \footnotesize\textrm{where Z is a normalization factor}
p(x_1, x_2,…) = \frac{1}{Z}\prod_{c\,{\scriptscriptstyle \in}\,C}{\phi_c(x_c)} \\ \footnotesize\textrm{where Z is a normalization term} \\ \footnotesize\textrm{and C is the set of cliques in $\mathcal{G}$}
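A minimal sketch of this kind of factorization, using the pairwise potential above on a small triangle of binary variables and computing the normalization term by brute-force enumeration (an illustration of the idea, not code from the article):

```python
from itertools import product

# Pairwise potential from above: prefers neighbors that agree, especially on 1.
def phi(x, y):
    if x == 1 and y == 1:
        return 3
    if x == 0 and y == 0:
        return 2
    return 1

# Unnormalized score of a full assignment for a triangle graph A - B - C - A.
def score(a, b, c):
    return phi(a, b) * phi(b, c) * phi(c, a)

# Z sums the scores of all assignments so the factorization defines a distribution.
Z = sum(score(a, b, c) for a, b, c in product([0, 1], repeat=3))

def p(a, b, c):
    return score(a, b, c) / Z

print(Z)                                  # 50 for these potentials
print(p(1, 1, 1), p(0, 0, 0), p(0, 1, 0)) # agreeing assignments get more mass
```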

Conditional Form

p(y|x) = \frac{1}{Z(x)}\prod_{c\,{\scriptscriptstyle \in}\,C}{\phi_c(y_c, x_c)} \\ \footnotesize\textrm{where Z is a normalization term} \\ \footnotesize\textrm{and C is the set of cliques in the} \\ \footnotesize\textrm{graph $\mathcal{G}$ representing the labels $y$}
Linear-chain CRF where the hidden layer depends on the current, previous, and future observations.

Exponential Factors

\phi_c(y_c, x_c) = \exp\left(\sum_{a}{\lambda_a \, f_a(y_c, x_c)}\right) \\ \footnotesize\textrm{where $f_a$ is a feature function defined for clique $c$} \\ \footnotesize\textrm{and $\lambda_a$ is the weight parameter for $f_a$}
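A minimal sketch of a linear-chain CRF built from these exponential factors, computing p(y|x) by brute-force enumeration over all label sequences (feasible only because the example is tiny). The feature functions, weights, and tag set are toy choices, not the article's trained model:

```python
import math
from itertools import product

TAGS = ["O", "PER"]

# Toy feature functions f_a(y_prev, y_cur, x, t) for a linear-chain CRF.
def feats(y_prev, y_cur, x, t):
    return {
        "cap_and_PER": int(x[t][:1].isupper() and y_cur == "PER"),
        "lower_and_O": int(x[t].islower() and y_cur == "O"),
        "PER_then_O": int(y_prev == "PER" and y_cur == "O"),
    }

weights = {"cap_and_PER": 1.2, "lower_and_O": 1.0, "PER_then_O": 0.5}  # toy lambdas

def score(y, x):
    """Unnormalized score: product over positions of exp(sum_a lambda_a f_a)."""
    total = 0.0
    for t in range(len(x)):
        y_prev = y[t - 1] if t > 0 else "START"
        total += sum(weights[a] * v for a, v in feats(y_prev, y[t], x, t).items())
    return math.exp(total)

def p_y_given_x(y, x):
    """p(y|x) = score(y, x) / Z(x), with Z(x) summed over all label sequences."""
    z = sum(score(list(cand), x) for cand in product(TAGS, repeat=len(x)))
    return score(y, x) / z

x = ["Buck", "said"]
print(p_y_given_x(["PER", "O"], x))   # globally normalized over all tag sequences
```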

Training & Inference

Addressing the Label Bias Problem

p(x_1, x_2,…) = \frac{1}{Z}\prod_{c\,{\scriptscriptstyle \in}\,C}{\phi_c(x_c)} \\ \footnotesize\textrm{where Z is a normalization term} \\ \footnotesize\textrm{and C is the set of cliques in $\mathcal{G}$}

Results

Accuracy: 94.8%
Precision: 77.8%
Recall: 74.9%
F₁ Score: 76.3%

Precision

Tag  | No OOV | 1+ OOV
ORG  | 0.87   | 0.46
PER  | 0.87   | 0.72
LOC  | 0.9    | 0.29
MISC | 0.81   | 0.23
ALL  | 0.87   | 0.6

Recall

Tag  | No OOV | 1+ OOV
ORG  | 0.79   | 0.4
PER  | 0.82   | 0.8
LOC  | 0.92   | 0.2
MISC | 0.84   | 0.09
ALL  | 0.85   | 0.57

V. Conclusion

References

  1. Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning
    Daphne Koller and Nir Friedman. 2009. The MIT Press.
  2. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
    Judea Pearl. 1988. Morgan Kaufmann Publishers Inc, San Francisco, CA, USA. https://doi.org/10.1016/C2009-0-27609-4
  3. Genes, Themes and Microarrays: Using Information Retrieval for Large-Scale Gene Analysis
    Hagit Shatkay, Stephen Edwards, W John Wilbur, and Mark Boguski. 2000. In Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 317–328.
  4. Information Extraction Using Hidden Markov Models
    Timothy Robert Leek. 1997. Master’s Thesis, UC San Diego.
  5. Information Extraction with HMMs and Shrinkage
    Dayne Freitag and Andrew McCallum. 1999. In Papers from the AAAI-99 Workshop on Machine Learning for Information Extraction (AAAI Technical Report WS-99-11), 31–36.
  6. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition
    Lawrence R Rabiner. 1989. Proceedings of the IEEE 77, 2: 257–286. https://doi.org/10.1109/5.18626
  7. An Algorithm that Learns What’s in a Name
    Daniel M. Bikel, Richard Schwartz, and Ralph M. Weischedel. 1999. Machine Learning 34, 1: 211–231. https://doi.org/10.1023/A:1007558221122
  8. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition
    Erik F. Tjong Kim Sang and Fien De Meulder. 2003. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 142–147. https://doi.org/10.48550/arXiv.cs/0306050
  9. Appendix A.5 — HMM Training: The Forward-Backward Algorithm
    Daniel Jurafsky and James H. Martin. 2021. In Speech and Language Processing. 8–10.
  10. Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm
    A. Viterbi. 1967. IEEE Transactions on Information Theory 13, 2: 260–269. https://doi.org/10.1109/TIT.1967.1054010
  11. Appendix A.4 — Decoding: The Viterbi Algorithm
    Daniel Jurafsky and James H. Martin. 2021. In Speech and Language Processing. 8–10.
  12. Maximum Entropy Markov Models for Information Extraction and Segmentation
    Andrew McCallum, Dayne Freitag, and Fernando C. N. Pereira. 2000. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML ’00), 591–598.
  13. Maximum Entropy Models for Antibody Diversity
    Thierry Mora, Aleksandra M. Walczak, William Bialek, and Curtis G. Callan. 2010. Proceedings of the National Academy of Sciences 107, 12: 5405–5410. https://doi.org/10.1073/pnas.1001705107
  14. Human Behavior Modeling with Maximum Entropy Inverse Optimal Control
    Brian Ziebart, Andrew Maas, J. Bagnell, and Anind Dey. 2009. In Papers from the 2009 AAAI Spring Symposium, Technical Report SS-09-04, Stanford, California, USA, 92–97.
  15. On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes
    Andrew Ng and Michael Jordan. 2001. In Advances in Neural Information Processing Systems.
  16. Inducing Features of Random Fields
    S. Della Pietra, V. Della Pietra, and J. Lafferty. 1997. IEEE Transactions on Pattern Analysis and Machine Intelligence 19, 4: 380–393. https://doi.org/10.1109/34.588021
  17. Une Approche théorique de l’Apprentissage Connexionniste: Applications à la Reconnaissance de la Parole
    Léon Bottou. 1991. Université de Paris X.
  18. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
    John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML ’01), 282–289.
  19. The Label Bias Problem
    Awni Hannun. 2019. Awni Hannun — Writing About Machine Learning.
  20. Discriminative Probabilistic Models for Relational Data
    Ben Taskar, Pieter Abbeel, and Daphne Koller. 2013. https://doi.org/10.48550/ARXIV.1301.0604
  21. Accurate Information Extraction from Research Papers using Conditional Random Fields
    Fuchun Peng and Andrew McCallum. 2004. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, 329–336. https://doi.org/10.1016/j.ipm.2005.09.002
  22. Discriminative Fields for Modeling Spatial Dependencies in Natural Images
    Sanjiv Kumar and Martial Hebert. 2003. In Advances in Neural Information Processing Systems.
  23. Multiscale Conditional Random Fields for Image Labeling
    Xuming He, R.S. Zemel, and M.A. Carreira-Perpinan. 2004. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004, CVPR 2004, II–II. https://doi.org/10.1109/CVPR.2004.1315232
  24. Conditional Random Fields as Recurrent Neural Networks
    Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H. S. Torr. 2015. In 2015 IEEE International Conference on Computer Vision (ICCV). https://doi.org/10.1109/iccv.2015.179
  25. Convolutional CRFs for Semantic Segmentation
    Marvin T. T. Teichmann and Roberto Cipolla. 2018. https://doi.org/10.48550/arxiv.1805.04777
  26. RNA Secondary Structural Alignment with Conditional Random Fields
    Kengo Sato and Yasubumi Sakakibara. 2005. Bioinformatics 21: ii237–ii242. https://doi.org/10.1093/bioinformatics/bti1139
  27. Protein Fold Recognition Using Segmentation Conditional Random Fields (SCRFs)
    Yan Liu, Jaime Carbonell, Peter Weigele, and Vanathi Gopalakrishnan. 2006. J. Comput. Biol. 13, 2: 394–406. https://doi.org/10.1089/cmb.2006.13.394
  28. Introduction to Markov Random Fields
    Andrew Blake and Pushmeet Kohli. 2011. In Markov Random Fields for Vision and Image Processing. The MIT Press.
  29. An Introduction to Conditional Random Fields
    Charles Sutton and Andrew McCallum. 2010. arXiv. https://doi.org/10.48550/ARXIV.1011.4088