CSMT2025-067 — Presentation Slides (backup PDF version)
[01] Hello everyone. I’m MJuly, a master’s student at BUPT. It’s a great honor to present our work today. For English-speaking audience members, the English script of my talk is available at the site shown in the bottom-left of the slide (https://mjuly.notion.site/).
[02] The full title looks a bit long at first glance, so let me paraphrase it: Optical Jianpu Recognition Based on Expert Systems.
[03] In one sentence, we built a complete OMR pipeline for printed Jianpu with lyrics, [04] and used it to construct a large-scale corpus of Chinese folk songs. [05] Both the code and the data are open-sourced on GitHub; you can find the links via my homepage shown in the bottom-left corner of the slide.
[06] This research falls under optical music recognition (OMR): turning score images into machine-readable formats such as MIDI or MusicXML. [07] With large quantities of digital scores, we can train models for downstream tasks like music retrieval and generation.
[08] We fixed a concrete target: the Anthology of Chinese Folk Songs, which contains over 30,000 pieces and is an ideal data source. [09] Two properties are crucial for recognition: the scores are printed Jianpu, and they include Chinese lyrics.
[10] As we all know, Western staff notation dominates international research, [11] whereas in many non-professional contexts in mainland China—including folk songs—Jianpu is far more common. The two notations differ radically, [12] so methods and datasets for staff notation transfer poorly. Jianpu OMR must largely start from scratch and remains a niche topic.
[13] For example, while staff-notation OMR has recently adopted Transformers, Jianpu lacks the scale of labeled data needed to train such models; CNN-based approaches are still the de facto SOTA.
[14] Last year at CSMT, we proposed a synthetic-data approach. [15] The bottleneck is the severe shortage of human-labeled data. [16] So we wrote a renderer that converts large numbers of MIDI scores—even randomly generated ones—into Jianpu images while producing labels at the same time, [17] yielding ample training data. [18] We then designed and trained a YOLO-like single-stage end-to-end network for Jianpu recognition.
[19] The idea was novel, and individual sub-tasks achieved decent accuracies. However, errors across stages tend to accumulate. [20] As reported in the paper, the overall note-wise F1 after joint detection and aggregation was 0.88—insufficient for reliably recognizing the Anthology.
[21] This prompted us to reconsider. [22] Do we truly need labeled data? [23] Within a given volume, font style, dot size, and line width are essentially fixed. Given these priors, can we write a rule-driven pipeline [24] that first extracts atomic elements and then infers relations among them to reconstruct the score? [25] That is the core idea of this year’s work. [26] The system comprises four main parts.
[27] First, preprocessing. Scans often exhibit uneven illumination, so we designed a normalization method. [28] We start from the grayscale histogram and apply a series of procedures, [29] summarized by the equations shown here, [30] to estimate typical background and foreground gray levels. [31] We then solve for a dual-gamma transform that maps these two levels to target grays. [32] Finally, uniform illumination is achieved by applying this transform to the original image.
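The equations for the dual-gamma transform live on the slide, so what follows is only a minimal sketch of the idea under stated assumptions: the foreground and background gray levels are estimated as the dominant dark and bright histogram peaks, and the two gamma curves (anchored at those levels, blended linearly in between) map them to illustrative target grays `fg_target` and `bg_target`. The function name and targets are mine, not the paper's.

```python
import numpy as np

def dual_gamma_normalize(img, fg_target=0.1, bg_target=0.9):
    """Sketch of dual-gamma illumination normalization (assumed form).

    Estimates typical foreground (dark) and background (bright) gray
    levels from the histogram, then applies two gamma curves that map
    those levels to fixed target grays, blending between the curves."""
    x = img.astype(np.float64) / 255.0
    hist, edges = np.histogram(x, bins=256, range=(0.0, 1.0))
    centers = (edges[:-1] + edges[1:]) / 2
    mid = 128
    # crude peak picking: darkest-half peak = foreground, brightest-half peak = background
    fg = centers[:mid][np.argmax(hist[:mid])]
    bg = centers[mid:][np.argmax(hist[mid:])]
    # solve each gamma so that level**gamma == target
    g_fg = np.log(fg_target) / np.log(fg)
    g_bg = np.log(bg_target) / np.log(bg)
    # blend the two gamma curves linearly between the two anchor levels
    w = np.clip((x - fg) / (bg - fg), 0.0, 1.0)
    y = (1 - w) * x**g_fg + w * x**g_bg
    return (np.clip(y, 0.0, 1.0) * 255).astype(np.uint8)
```

By construction, pixels at the estimated foreground level land exactly on `fg_target` and background pixels on `bg_target`, which is what makes pages with different exposure comparable after normalization.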
[33] We also perform deskewing. At the optimal rotation, the horizontal projection becomes maximally sharp, [34] so we correct rotation by minimizing the Shannon entropy of the horizontal projection. Empirically the entropy is unimodal in angle, so a ternary search or golden-section search quickly finds the optimum.
[35] That concludes preprocessing. [36] Next comes symbol detection. [37] Digit recognition is deliberately straightforward: we build convolution kernels. Concretely, we crop a sample digit from the score, apply LoG filtering, and manually enhance it to obtain a kernel. [38] Cross-correlating this kernel with the image yields high responses at the digit locations. [39] With kernels for digits 0–7 in the specified font and size, we can process an entire volume—or even the entire series—in one pass.
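The cross-correlation step can be sketched with normalized cross-correlation (NCC); note this sketch omits the LoG filtering and manual kernel enhancement described above, and the threshold value is illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def match_template(page, kernel, thresh=0.95):
    """Normalized cross-correlation of a digit kernel over the page.

    Returns an array of (row, col) top-left positions whose NCC score
    exceeds `thresh`; each is a candidate digit location."""
    k = kernel.astype(np.float64)
    k = k - k.mean()
    # view every kernel-sized window of the page without copying
    win = sliding_window_view(page.astype(np.float64), k.shape)
    w = win - win.mean(axis=(2, 3), keepdims=True)
    num = (w * k).sum(axis=(2, 3))
    den = np.sqrt((w * w).sum(axis=(2, 3)) * (k * k).sum())
    # flat (all-background) windows get score 0 instead of dividing by zero
    ncc = np.divide(num, den, out=np.zeros_like(num), where=den > 1e-9)
    return np.argwhere(ncc >= thresh)
```

Because the score is normalized, the same threshold works across pages with different ink density, which is what makes one kernel per digit sufficient for a whole volume.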
[40] After notes, we detect structural lines—underlines, barlines, dashes, and slurs—via Zhang–Suen thinning to obtain one-pixel-wide skeletons. For each connected component, we extract the longest acyclic chain as its representative; analyzing its curvature and orientation (vertical vs. horizontal) lets us classify the line type [41] and locate each line’s start and end points.
[42] Dot-like elements are found in a similar manner. In this way we obtain both locations and types for all primitives.
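The longest-acyclic-chain extraction above can be sketched with the classic two-pass BFS trick: on a tree, a BFS from any node reaches one endpoint of the longest path, and a second BFS from that endpoint reaches the other. The helper below assumes 8-connected skeleton pixels; it is exact on cycle-free skeletons and a heuristic otherwise.

```python
from collections import deque

def longest_chain(pixels):
    """pixels: set of (row, col) skeleton coordinates of one connected component.
    Returns the longest simple path as a list of coordinates (two-pass BFS)."""
    def neighbors(p):
        r, c = p
        return [(r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                if (dr or dc) and (r + dr, c + dc) in pixels]

    def bfs(start):
        # returns the last pixel reached (a farthest one) and the predecessor map
        prev = {start: None}
        q = deque([start])
        last = start
        while q:
            p = q.popleft()
            last = p
            for n in neighbors(p):
                if n not in prev:
                    prev[n] = p
                    q.append(n)
        return last, prev

    a, _ = bfs(next(iter(pixels)))   # farthest pixel from an arbitrary start
    b, prev = bfs(a)                 # farthest pixel from a: the other endpoint
    path = []
    while b is not None:
        path.append(b)
        b = prev[b]
    return path[::-1]
```

Given the chain, the endpoint displacement already separates horizontal strokes (underlines, dashes) from vertical ones (barlines), and the deviation of interior pixels from the endpoint-to-endpoint segment indicates curvature (slurs).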
[43] With elements in hand, [44] we then parse their relationships. [45] We employ an elliptical-distance–weighted search: from each note we check for dots above, below, or to the side; each underline searches for its start and end notes; and so forth. After establishing all relations and confirming note order, we derive each note’s pitch and duration from its digit, underline count, and dot positions, etc., and then export MIDI or MusicXML.
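The elliptical-distance weighting can be sketched as below. The axis scales `ax`/`ay` are illustrative, not the paper's values; the point is that stretching one axis makes the search favor candidates in the expected direction (e.g. a duration dot is expected beside a note, an octave dot above or below it).

```python
import math

def elliptical_distance(p, q, ax=2.0, ay=1.0):
    """Distance whose unit ball is an ellipse: the horizontal axis is
    stretched by ax and the vertical axis by ay (values illustrative).
    p and q are (x, y) coordinates."""
    dx = (p[0] - q[0]) / ax
    dy = (p[1] - q[1]) / ay
    return math.hypot(dx, dy)

def nearest_element(anchor, candidates, **kw):
    """Pick the candidate closest to `anchor` under the elliptical metric."""
    return min(candidates, key=lambda c: elliptical_distance(anchor, c, **kw))
```

With `ax > ay`, horizontal offsets are discounted, so a dot slightly to the side of a note still binds to it before a note directly above; swapping the weights gives the vertical-preference search used for octave dots.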