Approach

The proposed method for extracting references from PDF documents operates in two correlated phases: reference line classification, and reference segmentation and identification.

Features

A set of features is extracted from each line to classify lines, and from each token to segment reference strings.

Line-based (ln)

get_cc(ln):

Description: The ratio of uppercase characters to the total number of characters (whitespace characters are not counted)
Output: Numeric, in the range [0,1]

get_cc(ln):

Description: The ratio of lowercase characters to the total number of characters (whitespace characters are not counted)
Output: Numeric, in the range [0,1]
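The two character-case ratios above can be sketched as follows. This is a minimal illustration, not the project's implementation: the names `upper_ratio` and `lower_ratio` are hypothetical, and returning 0.0 for a line without any non-whitespace characters is an assumption.

```python
def _strip_whitespace(ln: str) -> str:
    """Return the line with every whitespace character removed."""
    return "".join(ln.split())


def upper_ratio(ln: str) -> float:
    """Share of uppercase characters among non-whitespace characters."""
    chars = _strip_whitespace(ln)
    if not chars:  # assumption: an empty line scores 0.0
        return 0.0
    return sum(ch.isupper() for ch in chars) / len(chars)


def lower_ratio(ln: str) -> float:
    """Share of lowercase characters among non-whitespace characters."""
    chars = _strip_whitespace(ln)
    if not chars:  # assumption: an empty line scores 0.0
        return 0.0
    return sum(ch.islower() for ch in chars) / len(chars)
```

Digits and punctuation count towards the denominator but towards neither ratio, so the two values need not sum to 1.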

get_cw(ln):

Description: The ratio of tokens that begin with an uppercase character to the total number of tokens
Output: Numeric, in the range [0,1]

get_sw(ln):

Description: The ratio of tokens that begin with a lowercase character to the total number of tokens
Output: Numeric, in the range [0,1]
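A possible sketch of the two token-level case ratios is shown below. It assumes that tokens are obtained by splitting the line on whitespace, which the description does not state; the function names are hypothetical.

```python
def capitalized_token_ratio(ln: str) -> float:
    """Share of tokens whose first character is uppercase."""
    tokens = ln.split()  # assumption: whitespace tokenization
    if not tokens:
        return 0.0
    return sum(tok[0].isupper() for tok in tokens) / len(tokens)


def lowercase_token_ratio(ln: str) -> float:
    """Share of tokens whose first character is lowercase."""
    tokens = ln.split()  # assumption: whitespace tokenization
    if not tokens:
        return 0.0
    return sum(tok[0].islower() for tok in tokens) / len(tokens)
```

For the line `Smith and Jones 1999`, two of four tokens start with an uppercase character and one starts with a lowercase character; the year token counts for neither.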

get_yr(ln):

Description: Whether a year format (from 1800 to 2029) appears in the line
Output: Boolean value (0 or 1)
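The year check can be approximated with a regular expression. The sketch below is one way to do it, under the assumption that a year is any four-digit number from 1800 to 2029 that is not part of a longer digit run; the name `has_year` is hypothetical.

```python
import re

# 1800-1999 or 2000-2029, not embedded in a longer run of digits.
# A trailing letter is allowed, so citations such as "2005a" still count.
YEAR_RE = re.compile(r"(?<!\d)(?:1[89]\d{2}|20[0-2]\d)(?!\d)")


def has_year(ln: str) -> int:
    """Return 1 if the line contains a year in the range 1800-2029, else 0."""
    return int(YEAR_RE.search(ln) is not None)
```

Note that a page number such as 1850 is indistinguishable from a year under this rule.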

get_qm(ln):

Description: The ratio of quotation marks (' | " | ` | « | ») to the total number of characters (whitespace characters are not counted)
Output: Numeric, in the range [0,1]

get_cl(ln):

Description: The ratio of colons (:) to the total number of characters (whitespace characters are not counted)
Output: Numeric, in the range [0,1]

get_sl(ln):

Description: The ratio of slashes and backslashes (/ | \) to the total number of characters (whitespace characters are not counted)
Output: Numeric, in the range [0,1]

get_bs(ln):

Description: The ratio of brackets, including parentheses, braces and square brackets (( | ) | { | } | [ | ]), to the total number of characters (whitespace characters are not counted)
Output: Numeric, in the range [0,1]

get_dt(ln):

Description: The ratio of full stops (.) to the total number of characters (whitespace characters are not counted)
Output: Numeric, in the range [0,1]

get_cm(ln):

Description: The ratio of commas (,) to the total number of characters (whitespace characters are not counted)
Output: Numeric, in the range [0,1]
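The punctuation ratios (quotation marks, colons, slashes, brackets, full stops and commas) all follow the same pattern, so a single helper parameterized by the symbol set could cover them. The sketch below is illustrative only; `symbol_ratio` and the constant names are hypothetical and not taken from the project.

```python
def symbol_ratio(ln: str, symbols: str) -> float:
    """Share of non-whitespace characters that belong to `symbols`."""
    chars = "".join(ln.split())  # drop all whitespace before counting
    if not chars:  # assumption: an empty line scores 0.0
        return 0.0
    return sum(ch in symbols for ch in chars) / len(chars)


# Symbol sets as listed in the feature descriptions.
QUOTES = "'\"`«»"
COLONS = ":"
SLASHES = "/\\"
BRACKETS = "(){}[]"
FULL_STOPS = "."
COMMAS = ","
```

For example, `symbol_ratio(ln, COMMAS)` would give the comma ratio of a line and `symbol_ratio(ln, BRACKETS)` its bracket ratio.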

get_cd(ln):

Description: The ratio of initials (e.g. A. | C.D.) to the total number of tokens
Output: Numeric, in the range [0,1]
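One way to detect initials is to match each token against a pattern of capital letters that are each followed by a full stop. The sketch below makes several assumptions that the description does not confirm: whitespace tokenization, trailing `,`, `;` and `:` stripped before matching, and ASCII capitals only. The name `initials_ratio` is hypothetical.

```python
import re

# One or more groups of "capital letter + full stop", e.g. "A." or "C.D."
INITIALS_RE = re.compile(r"(?:[A-Z]\.)+")


def initials_ratio(ln: str) -> float:
    """Share of tokens that consist solely of initials."""
    tokens = ln.split()  # assumption: whitespace tokenization
    if not tokens:
        return 0.0
    hits = sum(
        1 for tok in tokens
        if INITIALS_RE.fullmatch(tok.rstrip(",;:"))  # ignore trailing separators
    )
    return hits / len(tokens)
```

In `Smith, A. and Jones, C.D. 1999` two of the six tokens are initials.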

get_lh(ln):

get_ch(ln):

get_pg(ln):

Description: Whether the line contains a page-range format (e.g. 58 -- 78)
Output: Numeric, in the range [0,1]

get_hc(ln):

get_pb(ln):

get_wc(ln):

get_ll(ln):

get_llw(ln):

get_lv(ln):

Description: The vertical position of the line relative to the entire file
Output: Numeric, in the range [0,1]
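The description does not say how the vertical position is measured. A simple reading, assumed here purely for illustration, is the zero-based line index divided by the index of the last line, so that the first line maps to 0 and the last to 1. The function name and both parameters are hypothetical.

```python
def vertical_position(line_index: int, total_lines: int) -> float:
    """Relative position of a line in the file, from 0.0 (top) to 1.0 (bottom).

    Assumption: position is derived from the zero-based line index rather
    than from page coordinates in the PDF.
    """
    if total_lines <= 1:  # a single-line file has no meaningful spread
        return 0.0
    return line_index / (total_lines - 1)
```

A feature of this kind can help because reference lists tend to sit near the end of a document.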

get_lex1(ln):

Description: Whether the line contains: In: | In | in:
Output: Boolean value (0 or 1)

get_lex2(ln):

Description: Whether the line contains: hg. | hrsg. | ed. | eds. (case-insensitive)
Output: Boolean value (0 or 1)

get_lex3(ln):

get_lex4(ln):

Description: Whether the line contains an ampersand (&)
Output: Boolean value (0 or 1)

get_lex5(ln):

Description: Whether the line contains: Journal
Output: Boolean value (0 or 1)

get_lex6(ln):

Description: Whether the line contains: bd. | band (case-insensitive)
Output: Boolean value (0 or 1)

get_lex7(ln):

Description: Whether the line contains: S. | PP. | pp. | ss. | SS. | pages.
Output: Boolean value (0 or 1)

get_lnk(ln):

Description: Whether the line contains a link format (e.g. http://xyz.com)
Output: Boolean value (0 or 1)

get_vol(ln):

Description: Whether the line contains: vol. | jg. (case-insensitive)
Output: Boolean value (0 or 1)

get_und(ln):

Description: Whether the line contains: u. | and | und
Output: Boolean value (0 or 1)

get_amo(ln):

get_num(ln):

Description: Whether the line contains: No(.,:) | Nr(.,:)
Output: Boolean value (0 or 1)
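The lexical cue features described above are all boolean pattern checks, so they could be driven by one table of regular expressions. The sketch below is an approximation under stated assumptions: the dictionary keys reuse the feature suffixes for readability, the word-boundary handling is a guess, cues without a case note are matched case-sensitively, and `https` links are accepted in addition to `http`. `LEXICAL_PATTERNS` and `lexical_flags` are hypothetical names.

```python
import re

# (?<!\w) and (?!\w) keep a cue from matching inside a longer word,
# e.g. "in:" inside "Berlin:" or "ed." inside "published.".
LEXICAL_PATTERNS = {
    "lex1": re.compile(r"(?<!\w)(?:In:?|in:)(?!\w)"),
    "lex2": re.compile(r"(?<!\w)(?:hg|hrsg|eds?)\.", re.IGNORECASE),
    "lex4": re.compile(r"&"),
    "lex5": re.compile(r"Journal"),
    "lex6": re.compile(r"(?<!\w)(?:bd\.|band(?!\w))", re.IGNORECASE),
    "lex7": re.compile(r"(?<!\w)(?:S|PP|pp|ss|SS|pages)\."),
    "lnk": re.compile(r"https?://\S+"),  # assumption: https also counts
    "vol": re.compile(r"(?<!\w)(?:vol|jg)\.", re.IGNORECASE),
    "und": re.compile(r"(?<!\w)(?:u\.|and(?!\w)|und(?!\w))"),
    "num": re.compile(r"(?<!\w)(?:No|Nr)[.,:]"),
}


def lexical_flags(ln: str) -> dict:
    """Map each cue name to 1 if its pattern occurs in the line, else 0."""
    return {
        name: int(pattern.search(ln) is not None)
        for name, pattern in LEXICAL_PATTERNS.items()
    }
```

Keeping the cues in a table makes it easy to add or adjust a pattern without touching the extraction logic.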

get_db(ln):

Token-based