[Feature Request] Parse content & Extract Pages based on content

  • [Feature Request] Parse content & Extract Pages based on content

    Two features that I think many would benefit from are the options to:


    1- Parse content from a PDF file

    Example of how the feature might work:

    A- Go to: Convert --> Parse
    B- Set a region on any PDF page (the same region will be selected on every page)
    C- Set a name for that region. E.g. {Name}
    D- Set a different region on any PDF page (the same region will be selected on every page) and give it a name. E.g. {Phone number}
    E- Set the destination Excel/CSV file

    This should detect the text boxes in the selected regions and output their content to an Excel/CSV file with the columns: {Page, Name, Phone number}
    If there is more than one text box in a selected region, then each text box will have a column. E.g. {Page, Name, Phone number_1, Phone number_2}

    Possible extra functionalities:
    * OCR support
    * RegEx support (only the text in the selected regions that matches the RegEx is parsed; the rest is omitted)
    * Action Wizard support (to do the parsing on multiple PDF files)
    * Support for skipping pages (odd, even, specified pages)
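
To make the parsing idea above concrete, here is a minimal Python sketch. It assumes the text boxes have already been pulled out of the PDF with their page number and position; `TextBox`, `in_region`, `parse_regions` and `to_csv` are hypothetical names for illustration, not part of Foxit PDF Editor. It also assumes a text box belongs to a region when its top-left corner falls inside it.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class TextBox:
    """One text box on a page (hypothetical structure, not a Foxit API)."""
    page: int   # 1-based page number
    x: float    # left edge of the box
    y: float    # top edge of the box
    text: str


def in_region(box, region):
    """True if the box's top-left corner lies inside (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = region
    return x0 <= box.x <= x1 and y0 <= box.y <= y1


def parse_regions(boxes, regions):
    """Apply the same named regions to every page; return (columns, rows)."""
    pages = sorted({b.page for b in boxes})
    ordered = sorted(boxes, key=lambda b: (b.y, b.x))  # top-to-bottom order
    hits = {
        page: {
            name: [b.text for b in ordered
                   if b.page == page and in_region(b, region)]
            for name, region in regions.items()
        }
        for page in pages
    }
    # A region gets numbered columns only if some page has several boxes in it.
    col_names = {}
    for name in regions:
        widest = max((len(hits[p][name]) for p in pages), default=0)
        if widest <= 1:
            col_names[name] = [name]
        else:
            col_names[name] = [f"{name}_{i}" for i in range(1, widest + 1)]
    columns = ["Page"] + [c for name in regions for c in col_names[name]]
    rows = []
    for page in pages:
        row = {"Page": page}
        for name in regions:
            for col, text in zip(col_names[name], hits[page][name]):
                row[col] = text
        rows.append(row)
    return columns, rows


def to_csv(columns, rows):
    """Render the rows as CSV text; missing cells are left empty."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

With a {Name} region and a {Phone number} region where one page holds two phone text boxes, this would give the columns {Page, Name, Phone number_1, Phone number_2}, and pages with only one phone number would leave the second column empty.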



    2- Extract PDF pages based on their content

    Example of how the feature might work:

    A- Go to: Organize --> Extract --> Extract by content
    B- Set the strings to search for separated by comma ( , ) and pipe ( | ). E.g. {Apple|Orange, Mobile|PC}
    C- Set the file name strings for the extracted pages separated by comma ( , ). E.g. {Fruit, Electronics}
    D- Set whether to extract the search matches separated by a comma ( , ) to separate files or the same file
    E- Set the destination folder

    This should extract the pages containing the word "Apple" or the word "Orange" to the destination folder, either into the file "Fruit.pdf" or into the files "Fruit_1.pdf", "Fruit_2.pdf", ... depending on the setting at step D. In the same way, pages containing the word "Mobile" or the word "PC" are extracted to "Electronics.pdf" or to the files "Electronics_1.pdf", "Electronics_2.pdf", ...

    Possible extra functionalities:
    * Select the region/regions in the pages to do the search (only restricted to text in the regions)
    * OCR support
    * RegEx support (for the search strings)
    * Action Wizard support (to do the extraction on multiple PDF files)
    * Import CSV files for the search and file name strings
    * Support for skipping pages (odd, even, specified pages)
    * Support for deleting pages after extraction
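
The matching logic in steps B to D could look roughly like the Python sketch below. It works on plain page text rather than a real PDF, and `parse_spec` and `plan_extraction` are hypothetical names, not Foxit functions. Two assumptions: matching is a case-sensitive substring test, and "separate files" means every matching page gets its own numbered file.

```python
def parse_spec(search, names):
    """Pair 'Apple|Orange, Mobile|PC' with 'Fruit, Electronics'."""
    groups = [[term.strip() for term in group.split("|")]
              for group in search.split(",")]
    labels = [name.strip() for name in names.split(",")]
    if len(groups) != len(labels):
        raise ValueError("need one file name per comma-separated search group")
    return dict(zip(labels, groups))


def plan_extraction(page_texts, spec, separate_files=False):
    """Map each output file name to the 1-based pages it would receive."""
    plan = {}
    for label, terms in spec.items():
        # A page matches the group if it contains any of the group's terms.
        matches = [number
                   for number, text in enumerate(page_texts, start=1)
                   if any(term in text for term in terms)]
        if separate_files:
            # Assumption: one numbered file per matching page.
            for i, page in enumerate(matches, start=1):
                plan[f"{label}_{i}.pdf"] = [page]
        elif matches:
            plan[f"{label}.pdf"] = matches
    return plan
```

The result is only a plan (file name to page numbers); the actual page extraction would still be done by the PDF editor.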
    Last edited by Acrash; 08-22-2022, 04:25 PM.

  • #2
    Acrash
    The first feature, Parse, makes me think of the Advanced Search tool next to the search bar. With this tool, users can select the source or location to search for the information they need, and after searching there is also an option to save the results as CSV. Have you tried it before?
    I have submitted the second feature request to our Product team as report id#PHANTOM-17326 and will forward any update the moment I have it.


    • #3
      Roy_Chen
      Thank you for the reply and submitting the request.

      With Advanced Search I have to know the content I want to extract ahead of time. The case I was talking about is when the content is structured and unknown information needs to be extracted.

      An example would be the file "Suppliers.pdf" (in the attachments). All pages have the same structure. I want to know every "Supplier" & "Products" & "Contact" in the PDF file. A "parse" tool would allow the user to create named selection regions (like in the attached image 1) then the program would loop through every page applying the same selected regions in the same exact page positions and extract the information in the selected regions to an Excel/CSV file (like in the first section in the attached image 2).

      *Adding RegEx support would allow the user to only extract the text that matches the RegEx in the specific selected region (like in the second section in the attached image 2).

      *If you use the attached "Suppliers.pdf" and go to: Edit ---> Edit Object
      then hover over the "Contacts" section, you will notice that these two lines:
      {person name} {tel number}
      {person name} {mobile number}
      are considered two different text box objects. Adding a "Text Box Object detection" option could give separate columns for each text box object (like in the third section in the attached image 2).

      *Adding an option to do OCR before "parsing" could allow the program to extract text from images in the selected regions.

      *Adding Action Wizard support will allow the user to parse multiple PDF files at once.

      *Adding support for skipping pages (odd, even, specified pages) will allow the user to NOT parse pages such as the introduction, index, etc.
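
Two of the options above, the RegEx filter and the page skipping, can be sketched in a few lines of Python. This is only an illustration under assumptions: `apply_patterns` and `pages_to_parse` are hypothetical names, a parsed row is assumed to be a plain dictionary of column name to text, and several RegEx matches in one region are simply joined with a space.

```python
import re


def apply_patterns(row, patterns):
    """Keep only the RegEx matches for columns that have a pattern."""
    filtered = {}
    for column, text in row.items():
        pattern = patterns.get(column)
        if pattern is None:
            filtered[column] = text  # no pattern: keep the text as parsed
        else:
            found = [m.group(0) for m in re.finditer(pattern, text)]
            filtered[column] = " ".join(found)
    return filtered


def pages_to_parse(page_count, skip_odd=False, skip_even=False, skip_pages=()):
    """Return the 1-based page numbers left after applying the skip options."""
    skipped = set(skip_pages)
    return [number for number in range(1, page_count + 1)
            if not (skip_odd and number % 2 == 1)
            and not (skip_even and number % 2 == 0)
            and number not in skipped]
```

For example, a pattern such as `\d{3}-\d{4}` on a "Contact" column would keep only the number from a value like "John Smith 555-0100", and `skip_pages=[1, 2]` would leave out an introduction and index at the start of the file.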


      Side Note: I have found some (very expensive) plug-ins related to my requests HERE, but I don't know if there is a way to install them in Foxit PDF Editor. Incorporating some of the functionalities of these plug-ins directly into Foxit PDF Editor could be very useful.
      Last edited by Acrash; 08-22-2022, 07:25 PM.



      • #4
        Acrash
        Thanks for your time! If I understand correctly, this Parse tool works on files whose pages all have the same structure, but it can't find information in a complex document where the page structure differs. Is it generic for regular search? The first step in Foxit in this case would be to create a region by dragging your cursor to select the area, then export the content and save it as you need. That should be a feasible plan for the Product team, do you agree? Looking forward to hearing from you.



        • #5
          Roy_Chen

          You are exactly right.

          I made a mock-up of what the options might look like in the attachments.



          • #6
            Acrash
            I have submitted it to our Product team as report id#PHANTOM-17356 and hope it will be added soon.
