Quick Scripts & Tricks: Streamlining Data Workflows + URL Extractions
Every company and person has a different workflow.
I find that, especially in venture, data comes in so many different formats, and streamlining one’s own processes is an ongoing process; and it’s all about staying on top of your processes :)
Here is my problem: I receive a lot of information through emails or documents, and I am constantly reformatting data into spreadsheets or databases for organizational purposes. But when I ask ChatGPT to reformat the text/unstructured data, the URL links get lost.
To solve this, this week, I created a quick and dirty workflow. Yes, it can be improved. No, it is nothing fancy or large-scale. But it works.
So I am creating this post to share my learnings — maybe you’ve been thinking about this same problem and can use this solution. Or maybe you’ve been thinking about improving a small workflow, and this gets your gears turning. Or maybe, you have a better solution and in that case, let’s chat.
Google Docs to Google Sheets w/ Url Extraction
To begin — I have a friend who once told me that his catchphrase for yelling at friends, clients, and colleagues on the phone was, “YOU’RE JUST SAYING WORDS. WORDS. WITHOUT MEANING.”
While he is never someone I’d want to work with or work for: I think about this often.
The opener to this blog post was a bit vague and just felt like “words” — so let me show you what I started with and what I am looking for.
Input
Output
Why this is difficult and important
The problem you run into is if you want to copy and paste the text “This is an example from Jamesin Seidel, who works at Chapter One!!!” into ChatGPT or input to OpenAI’s API with a prompt — when you paste the text, you lose the links. And the links are important.
Export this data into a CSV file with the format
<Name><LinkedIn><Company><Company Link>
This is an example from Jamesin Seidel, who works at Chapter One!!!
Instructions
Below is the workflow that I used to reformat text into a database format while maintaining the links.
Step 1: Download the document into an HTML format
Step 2: Parse the HTML format with a quick python script
from bs4 import BeautifulSoup
from utils.utils import Utils
def import_html(file_path):
"""Import and return the content of an HTML file.
Args:
file_path (str): The path to the HTML file.
Returns:
str: The content of the HTML file as a string.
"""
with open(file_path, 'r', encoding='utf-8') as file:
html_content = file.read()
return html_content
# Full HTML content provided by the user, re-pasted for parsing
file_path = '/user_download_path/BlogPostExample.html'
html_content = import_html(file_path=file_path)
# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
# Extract all URLs and handle Google redirect URLs
urls = []
for a in soup.find_all('a', href=True):
href = a['href']
if href.startswith('https://www.google.com/url?q='):
# Parse the URL to extract the actual destination
import urllib.parse
parsed_url = urllib.parse.urlparse(href)
query_params = urllib.parse.parse_qs(parsed_url.query)
actual_url = query_params.get('q', [None])[0]
if actual_url:
urls.append(actual_url)
else:
urls.append(href)
# Print the extracted URLs
for u in urls:
print(u)
Output
https://www.linkedin.com/in/jamesin-seidel-5325b147/
https://chapterone.com/
Step 3: Prompt ChatGPT
Prompt ChatGPT
Can you please export this data into a CSV file! I want the file format to be:
<Name><Name URL><Company><Company URL>
**Urls**
https://www.linkedin.com/in/jamesin-seidel-5325b147/
https://chapterone.com/
**Data**
Hello!
This is an example from Jamesin Seidel, who works at Chapter One!!!
ChatGPT’s output:
Final result
Outro
There ya go. Nothing that is rocket science but a quick tip and trick for your Friday.
Note that instead of using Google and ChatGPT’s web interfaces, you can automate everything with APIs. Google API to extract doc, OpenAI API to prompt, Google API to export into Google Sheets.