is there an approach to reading semi-structured documents? #1549

@rkiddy

Perhaps this project is not what I should be using. I do not want to write docx files. I want to, or rather I do not want to but I have to, read docx files.

See:

I am trying to extract the list of organizations supporting or opposing bills in the California state legislature from the analysis documents created by the various committees.

The 396517 bill is somewhat manageable. I go through the runs in each paragraph, look for the "Support" or "Opposition" markers, and extract the text from the runs that follow. It seems to mostly work, but it is ... hand-wavy. I am just finding what seems to be the usual structure and reading from that. There are many thousands of these documents and I have not checked them all, so there may be problems I do not know about.

As an example, with the 396517 document, I am able to extract the bit below, from which I can get a structured list.

idx: 56, runs: [<docx.text.run.Run object at 0x706238c7bb60>, <docx.text.run.Run object at 0x706238c7b410>, <docx.text.run.Run object at 0x706238c7bef0>, <docx.text.run.Run object at 0x706238c7bad0>, <docx.text.run.Run object at 0x706238c7b530>, <docx.text.run.Run object at 0x706238c78170>, <docx.text.run.Run object at 0x706238c78140>, <docx.text.run.Run object at 0x706238c7bd40>, <docx.text.run.Run object at 0x706238c7b620>, <docx.text.run.Run object at 0x706238c7b8f0>, <docx.text.run.Run object at 0x706238c7b320>, <docx.text.run.Run object at 0x706238c7bc80>, <docx.text.run.Run object at 0x706238c7b470>, <docx.text.run.Run object at 0x706238c7bc50>, <docx.text.run.Run object at 0x706238c7ab10>, <docx.text.run.Run object at 0x706238c7b260>, <docx.text.run.Run object at 0x706238c7bb30>, <docx.text.run.Run object at 0x706238c780e0>, <docx.text.run.Run object at 0x706238c7ad20>, <docx.text.run.Run object at 0x706238c7ba10>, <docx.text.run.Run object at 0x706238c78050>, <docx.text.run.Run object at 0x706238c780b0>, <docx.text.run.Run object at 0x706238c7b680>, <docx.text.run.Run object at 0x706238c7af30>, <docx.text.run.Run object at 0x706238c7b980>, <docx.text.run.Run object at 0x706238c7ac00>, <docx.text.run.Run object at 0x706238c7b200>]
idx: 56, runs: ['California Latinas for Reproductive Justice (Co-Sponsor)', '\nStudent Senate for California Community Colleges (Co-Sponsor)', '\nAccess Reproductive Justice', '\nAlianza', '\nAmerican Nurses Association/', 'C', 'alifornia', '\nAsian Americans Advancing Justice-southern California', '\nBlack Women for Wellness Action Project', '\nBuen Vecino', '\nCalifornia L', 'GBTQ ', 'Health and Human Services Network', '\nCalifornia Nurse Midwives Association ', '\nCalifornia Teachers Association', "\nCalifornia Women's Law Center", '\nFaculty Association of California Community Colleges', '\nIndivisible CA Statestrong', '\nMaternal and Child Health Access', '\nNational Health Law Program', '\nPlanned Parenthood Affiliates of California', '\nReproductive Freedom for All California', "\nThe Women's Foundation California", '\nUniversity of California Student Association', '\nUrge', '\nUrgeCA', "\nWomen's Foundation California"]

idx: 57, runs: [<docx.text.run.Run object at 0x706238c7ae10>]
idx: 57, runs: ['Opposition']

idx: 58, runs: [<docx.text.run.Run object at 0x706238c7b740>, <docx.text.run.Run object at 0x706238c7b920>, <docx.text.run.Run object at 0x706238c781a0>, <docx.text.run.Run object at 0x706238c7bbf0>]
idx: 58, runs: ['California Catholic Conference', '\nCalifornia Family Council', '\nConcerned Women for America', '\nHealth Services Association of California Community Colleges']
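
To show how run texts like the ones above turn into a structured list, here is a minimal sketch. It assumes the organization names are separated only by the "\n" characters that appear in the run text, and that a name may be split across several runs (as with 'C', 'alifornia' above). The helper name orgs_from_run_texts is made up for illustration and is not part of python-docx.

```python
def orgs_from_run_texts(run_texts):
    """Join the run texts of one paragraph and split them into names.

    Assumption: each organization sits on its own line, with lines
    separated by "\n" inside the run text.
    """
    joined = "".join(run_texts)  # re-assemble names split across runs
    return [line.strip() for line in joined.split("\n") if line.strip()]


sample = [
    "\nAmerican Nurses Association/", "C", "alifornia",
    "\nCalifornia L", "GBTQ ", "Health and Human Services Network",
]
print(orgs_from_run_texts(sample))
# ['American Nurses Association/California',
#  'California LGBTQ Health and Human Services Network']
```

With python-docx objects, the input would be something like [r.text for r in paragraph.runs]. Joining before splitting is what repairs names broken across run boundaries.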

But the 396079 document uses tables. I cannot see how to relate the support or opposition labels with the tables. There seems to be nothing which suggests a table may be near or not near a paragraph.

Or am I missing something?

And what does doc.iter_inner_content() give you? It would be great if it gave you something that lets you iterate paragraphs and tables in order, but BlockItemContainer does not seem to be documented.
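
A rough sketch of that idea, assuming a python-docx release that provides Document.iter_inner_content() and that it yields paragraphs and tables in document order. The walk remembers the most recent "Support" or "Opposition" paragraph and files whatever follows, paragraph or table, under it. The label spellings, the helper names (collect_positions, positions_from_docx), and the rule that a section runs until the next label are all assumptions, not something taken from the actual analysis documents.

```python
LABELS = {"support", "opposition"}  # assumed marker text, compared case-insensitively


def _lines(text):
    """Split a block of text into non-empty, stripped lines."""
    return [ln.strip() for ln in text.split("\n") if ln.strip()]


def collect_positions(blocks):
    """Group text under the most recent label, in document order.

    `blocks` is any iterable of paragraph-like objects (with .text) and
    table-like objects (with .rows -> .cells -> .text).
    """
    found = {label: [] for label in sorted(LABELS)}
    current = None
    for block in blocks:
        if hasattr(block, "rows"):  # table-like block
            if current is not None:
                for row in block.rows:
                    for cell in row.cells:
                        found[current].extend(_lines(cell.text))
            continue
        key = block.text.strip().rstrip(":").lower()
        if key in LABELS:
            current = key  # a new section starts here
        elif current is not None:
            found[current].extend(_lines(block.text))
    return found


def positions_from_docx(path):
    """Run collect_positions over a .docx file (requires python-docx)."""
    from docx import Document  # imported lazily so the helper above stands alone

    return collect_positions(Document(path).iter_inner_content())
```

Because the walk is in document order, a table that follows a "Support" paragraph lands under "support" even though the table itself carries no label. Two caveats: anything after the last label is also collected until the document ends, so a real version would need a stop condition, and merged table cells may show up more than once.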

I have gotten the source and will look at that. I may end up with something. But I will, at least, have some suggestions for documentation.

Or is there another module somewhere that is better at reading docx files? If the purpose of this module is only to help people write docx files and not to read them, then perhaps I should not be rowing against the current. No?
